
Feature Selection in Machine Learning Predictive Models


For this example, I will be using data from Kaggle about miles per gallon and various engine metrics such as displacement, number of cylinders, horsepower, and acceleration.

The data is in the form of a CSV file, with missing values designated by "?". While one solution is to delete every row containing a missing value, that approach is rarely ideal for real-world data analysis: values go missing frequently, and any successful machine learning workflow needs to account for that.
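To see the scale of the problem, once pandas is installed (the setup is below), you can count how many "?" placeholders each column contains. This sketch assumes the dataset has been downloaded as auto_mpg.csv, the file name used later in this tutorial:

import pandas as pd

# Count "?" placeholders per column; numeric columns simply report 0.
raw = pd.read_csv("auto_mpg.csv", index_col=False)
print((raw == "?").sum())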

For this tutorial, I will use Jupyter Lab, pandas, scikit-learn, numpy, seaborn, and matplotlib.

First, to create the kernel to run these modules, type this command into the terminal prompt:

pipenv install ipykernel pandas numpy scikit-learn scipy seaborn

This command creates a pipenv virtual environment with ipykernel, pandas, numpy, scikit-learn, scipy, and seaborn installed. Next, activate the environment:

pipenv shell

Next, register the environment as a Jupyter kernel with the command,

python -m ipykernel install --user --display-name machine_learning_tutorial --name machine_learning_tutorial

Finally, type,

jupyter lab

This command will start Jupyter Lab in your default browser window. Choose New > Notebook from the File menu, then select machine_learning_tutorial as the kernel.

The full tutorial needs the following imports. I like to be explicit about what is being imported from sklearn (the import name for scikit-learn).

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import (train_test_split, cross_validate,
                                     cross_val_score, cross_val_predict,
                                     ShuffleSplit)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             recall_score, precision_score, roc_curve, auc)
from sklearn.feature_selection import chi2, SelectKBest, f_classif
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Matplotlib is not listed in the pipenv install command above because it is pulled in automatically as a dependency of seaborn. Save the notebook file to a directory called machine_learning.

The command,

%matplotlib inline

allows Jupyter Lab to show matplotlib plots inline in the notebook.

In the next cell, type the following code to get the CSV data into a pandas dataframe:

df = pd.read_csv("auto_mpg.csv", index_col=False)

For simplicity, I put the auto_mpg.csv file into the same directory as the notebook file.

The standard replace method for dataframes and series only replaces strings with strings. Adding .astype(float), however, makes the following statement replace every '?' with '999999' and then convert the result to a float, which the later calculations require.

df['horsepower'] = df['horsepower'].replace('?', '999999').astype(float)

Obviously, changing the value to 999999 shifts the average; in a regression, the value behaves as an outlier. To ensure that the replacement worked, print the first 35 rows of the dataframe with the statement,

print(df.head(35))

The output should show one 999999.0.
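Sentinel replacement is the approach this tutorial uses. For comparison, a common alternative (shown only as a sketch, not part of the workflow here) coerces "?" to NaN and imputes the column mean, so the filled-in rows do not behave as outliers:

# Alternative, for comparison only; assumes df was freshly reloaded,
# so 'horsepower' still contains the raw "?" strings rather than 999999.0.
hp = pd.to_numeric(df["horsepower"], errors="coerce")  # "?" becomes NaN
hp_imputed = hp.fillna(hp.mean())                      # mean imputation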

Now, go to a new cell in Jupyter Lab. The important numeric features are in columns 1 through 5. The following statement defines X as those columns; the 6 in the slice is an exclusive end point, so column 6 is not included.

X = df.iloc[:, 1:6]

The command selects all of the rows of columns 1 through 5.

The values you are trying to predict, mpg (the y values), are in the first column (index 0).

y = df.iloc[:, 0]

Again you want all of the rows, so you have the colon first.
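Before moving on, a quick optional sanity check confirms that the slices line up:

# X should have one row per car and five feature columns; y is the mpg column.
print(X.shape, y.shape)
print(X.columns.tolist())
print(y.name)  # 'mpg'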

Now you want to find the features of X that are most closely correlated with y. So type the following:

best_features = SelectKBest(score_func=f_classif, k='all')
fit = best_features.fit(X, y)
print(X.columns)
scores = fit.scores_
print(scores)

Note that the trailing underscore in fit.scores_ is intentional; scikit-learn uses a trailing underscore for attributes that are computed during fitting.

There are only five features that I want to choose from, so in this case I choose k='all' to keep them all and simply report a score for each. (Strictly speaking, f_classif is an ANOVA F-test meant for classification targets; for a continuous target such as mpg, sklearn.feature_selection also provides f_regression, but f_classif runs here all the same.)
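The raw scores_ array is easier to read when each value is paired with its column name. An optional sketch:

# Pair each feature with its score and sort from strongest to weakest.
score_table = pd.Series(fit.scores_, index=X.columns)
print(score_table.sort_values(ascending=False))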

Cylinders, displacement, and weight receive the highest scores, indicating the strongest relationship with mpg. Just to confirm, you can create a correlation heatmap with matplotlib and seaborn. Notice the abbreviation for seaborn is sns, and the one for matplotlib.pyplot is plt.

df2 = df[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
plt.figure(figsize=(12,10))
cor = df2.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

As expected, cylinders, displacement, and weight are most closely correlated with mpg in the first column of the correlation heatmap.

Another way to check this is to use the following commands, which keep only the features whose absolute correlation with mpg is above 0.5:

corr_target = abs(cor["mpg"])
key_features = corr_target[corr_target > 0.5]
key_features

Any regression model built to predict mpg from this data should therefore use these features. Note, however, that they are not independent of one another: the heatmap shows they are strongly correlated with each other.
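One optional way to probe that redundancy, sketched here with the Lasso class imported earlier (and not part of the original selection workflow), is an L1-penalized fit: coefficients shrunk to zero suggest features the model can do without. The alpha value below is an arbitrary illustrative choice, not a tuned hyperparameter.

from sklearn.preprocessing import StandardScaler

# Standardize so the L1 penalty treats all features on the same scale,
# then inspect which coefficients Lasso shrinks toward zero.
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
print(pd.Series(lasso.coef_, index=X.columns))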

In the next article, I will continue this tutorial and delve into sklearn's train_test_split and cross-validation.

References

Auto-mpg dataset: Mileage per gallon performances of various cars. Kaggle. https://www.kaggle.com/uciml/autompg-dataset. Retrieved June 7, 2020.

Brownlee, J. (2016, May 20). Feature Selection for Machine Learning in Python. Machine Learning Mastery. https://machinelearningmastery.com/feature-selection-machine-learning-python/. Retrieved June 8, 2020.

Shaikh, R. (2018, Oct 28). Feature Selection Techniques in Machine Learning with Python. Towards Data Science. https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e

Shetye, A. (2019, Feb 11). Feature Selection with sklearn and Pandas: Introduction to Feature Selection Methods and Their Implementation in Python. Towards Data Science. https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
