For this example, I will be using data from Kaggle about miles per gallon (mpg) and various engine metrics such as displacement, number of cylinders, horsepower, and acceleration.
The data is in the form of a CSV file, with missing values designated by "?". One solution is to erase every row that contains a missing value, but that is rarely ideal for real-world data analysis: data values are frequently missing, and any successful machine learning workflow needs to take that into account.
For this tutorial, I will use Jupyter Lab, pandas, numpy, scikit-learn, scipy, seaborn, and matplotlib.
First, to create the kernel to run these modules, type this command into the terminal prompt:
pipenv install ipykernel pandas numpy scikit-learn scipy seaborn
This command creates a pipenv virtual environment with pandas, numpy, scikit-learn, scipy, and seaborn installed; ipykernel lets Jupyter run the environment as a kernel. Next, activate the environment by typing,
pipenv shell
Next, register the environment as a Jupyter kernel with the command,
python -m ipykernel install --user --display-name machine_learning_tutorial --name machine_learning_tutorial
Finally, type,
jupyter lab
This command will start Jupyter Lab in your default browser window. Choose a new notebook from the File menu, then select machine_learning_tutorial as the kernel.
A list of imports is needed for the full tutorial. I like to be explicit about what is imported from sklearn (the package is installed as scikit-learn but imported as sklearn).
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import (cross_val_score, cross_val_predict,
                                     cross_validate, ShuffleSplit,
                                     train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             recall_score, precision_score, roc_curve, auc)
from sklearn.feature_selection import chi2, SelectKBest, f_classif
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
Matplotlib was not listed in the pipenv install command because it is pulled in automatically as a dependency of seaborn, but it still needs to be imported, as above. Save the notebook file to a directory called machine_learning.
The command,
%matplotlib inline
tells Jupyter Lab to render matplotlib plots directly in the notebook output.
In the next cell, type the following code to get the CSV data into a pandas dataframe:
df = pd.read_csv("auto_mpg.csv", index_col=False)
For simplicity, I put the auto_mpg.csv file into the same directory as the notebook file.
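Before cleaning anything, it can help to confirm what was actually loaded. Here is a minimal sanity check of my own (assuming the column layout of the Kaggle file):
print(df.shape)            # number of rows and columns
print(df.dtypes)           # horsepower loads as object because of the '?' strings
print((df == '?').sum())   # count of '?' placeholders in each column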
The horsepower column contains the string '?' among numeric strings, so the standard replace function for dataframes and series swaps one string for another and leaves the column as text. Appending .astype(float) then converts the result to floats, which the later calculations require. The following statement replaces every '?' with '999999' and casts the column to float:
df['horsepower'] = df['horsepower'].replace('?', '999999').astype(float)
Obviously, substituting 999999 shifts the column's average; in a regression, the value behaves as an extreme outlier. To ensure that the replacement worked, print the first 35 rows of the dataframe with the statement,
print(df.head(35))
The output should show one 999999.0 in the horsepower column.
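To check programmatically rather than by eye, you can count the sentinel values. A short sketch, my own variation rather than the approach used in this tutorial:
print((df['horsepower'] == 999999.0).sum())
# Alternative: coerce '?' to NaN so pandas treats the entries as missing
# df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')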
Now, go to a new cell in Jupyter Lab. The important numeric features are in columns 1 through 5. The following statement defines X as those columns (iloc's end index of 6 is exclusive, so column 6 is not included).
X = df.iloc[:, 1:6]
The command selects all of the rows of columns 1 through 5.
What you are trying to predict, mpg (the y values), is in the first column (column 0):
y = df.iloc[:, 0]
Again, you want all of the rows, so the colon comes first.
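If you prefer explicit column names over positions, the same selection can be written as follows (equivalent, assuming the Kaggle column names used later in this tutorial):
X = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
y = df['mpg']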
Now you want to find the features of X that are most closely correlated with y. So type the following:
best_features = SelectKBest(score_func=f_classif, k='all')
fit = best_features.fit(X, y)
print(X.columns)
scores = fit.scores_
print(scores)
Note that the trailing underscore in fit.scores_ is intentional.
There are only five features that I want to choose from, so in this case I choose k='all'.
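One caveat: f_classif performs an ANOVA F-test, which assumes a categorical target, while mpg is continuous. scikit-learn also provides f_regression, which scores each feature by its linear relationship with a continuous target. A sketch of that variant:
from sklearn.feature_selection import SelectKBest, f_regression

# f_regression is the regression counterpart of f_classif
best_features_reg = SelectKBest(score_func=f_regression, k='all')
fit_reg = best_features_reg.fit(X, y)
print(dict(zip(X.columns, fit_reg.scores_)))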
Cylinders, displacement, and weight are shown to have the highest scores, meaning the strongest association with mpg. To confirm, you can create a correlation heatmap with matplotlib and seaborn. Notice the abbreviation for seaborn is sns, and the one for matplotlib.pyplot is plt.
df2 = df[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration']]
plt.figure(figsize=(12,10))
cor = df2.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
As expected, the first column of the correlation heatmap shows that cylinders, displacement, and weight are most strongly (negatively) correlated with mpg.
Another way to check this is to use the following commands:
corr_target = abs(cor[“mpg”])
key_features = corr_target[corr_target > 0.5]
key_features
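To rank all five features at once instead of filtering by a threshold, you can sort the same correlation series. A small sketch using the cor matrix computed above:
print(corr_target.drop('mpg').sort_values(ascending=False))  # strongest first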
Because cylinders, displacement, and weight clear the 0.5 threshold, any regression model built to predict mpg from this data should use these features. Notably, the features are not independent; they are strongly correlated with each other.
In the next article, I will continue this tutorial and delve into sklearn's train_test_split and cross-validation.
References
Auto-mpg dataset: Mileage per gallon performances of various cars. Kaggle. https://www.kaggle.com/uciml/autompg-dataset. Retrieved June 7, 2020.
Brownlee, J. (2016, May 20). Feature Selection for Machine Learning in Python. Machine Learning Mastery. https://machinelearningmastery.com/feature-selection-machine-learning-python/. Retrieved June 8, 2020.
Shaikh, R. (2018, Oct 28). Feature Selection Techniques in Machine Learning with Python. Towards Data Science. https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e
Shetye, A. (2019, Feb 11). Feature Selection with sklearn and Pandas: Introduction to Feature Selection methods and their implementation in Python. Towards Data Science. https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b