Machine Learning Tutorial part II
After you evaluate the scores for each column, select the columns with the highest scores. These columns should be ‘displacement’, ‘horsepower’, and ‘weight’. In this case, they will be columns 2, 3, and 4.
So, you will need to change X. Use an iloc statement to refer to the columns by number. Keep in mind that the second number in an iloc slice is excluded, so 2:5 selects columns 2, 3, and 4 but not column 5.
So, the following,
X=df.iloc[:,2:5]
refers to the proper columns.
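If you want to double-check the slice before moving on, you can print the column labels it selects. This is a small sanity check, assuming the dataframe from Part I is named df and has its column labels set:
print(df.columns[2:5])   # should list 'displacement', 'horsepower', and 'weight'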
Next, you will want to convert the dataframe into a NumPy array. This step is not strictly necessary, but vectorized operations on NumPy arrays are generally faster than the equivalent operations on dataframes.
So change the statement, X=df.iloc[:,2:5] to,
X=df.iloc[:,2:5].to_numpy()
Also, edit the command y=df.iloc[:,0] to,
y=df.iloc[:,0].to_numpy()
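Putting these pieces together, the data preparation so far might look like the following sketch. The read_csv call and the file name are placeholders for however you loaded the data in Part I, and the target in column 0 is the mpg value for this dataset:
import pandas as pd
df = pd.read_csv('auto-mpg.csv')   # placeholder for the loading step from Part I
X = df.iloc[:, 2:5].to_numpy()     # displacement, horsepower, weight
y = df.iloc[:, 0].to_numpy()       # the target column (mpg)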
Now, you need to do the critical step: train_test_split. This command shuffles the full training set and selects which samples go into the training set and which go into the testing set. The test_size parameter is the fraction of the data, between 0 and 1, to hold out for testing. In this case, I chose 0.1 for test_size, which is 10 percent of the full training data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
The random_state parameter acts like a random number seed that allows reproducibility. According to scikit-learn.org, random_state “Controls the shuffling applied to the data before applying the split.”
Any time you set random_state=0 in this command, you will get the same result.
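As a quick check of that reproducibility claim, you can run the split twice with the same random_state and compare the results. This is a small sketch, assuming X and y are the arrays defined above:
X_a, _, _, _ = train_test_split(X, y, test_size=0.1, random_state=0)
X_b, _, _, _ = train_test_split(X, y, test_size=0.1, random_state=0)
print((X_a == X_b).all())   # True: the same seed produces the same shuffle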
To ensure that the train_test_split worked, it is a good idea to print part of the results.
print(X_train[:5])
print(y_train[:5])
[[ 454. 220. 4354.]
[ 350. 165. 3693.]
[ 121. 80. 2670.]
[ 70. 100. 2420.]
[ 144. 96. 2665.]]
[14. 15. 27.4 23.7 32. ]
You will notice that although these two commands look the same, their output differs. X_train is a 2D NumPy array, so the first command prints rows 0 through 4, each containing the three feature values. y_train is a 1D NumPy array, so the second command prints its first five items as a flat list of target values rather than a 2D array.
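If you want to confirm the dimensions directly, the ndim attribute reports the number of dimensions of each array:
print(X_train.ndim)   # 2 (rows and columns)
print(y_train.ndim)   # 1 (a flat array of targets)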
I like to take one more step to make sure that the train_test_split worked as expected. Print out the shape of the training and testing arrays.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
(358, 3)
(358,)
(40, 3)
(40,)
While the full training set has 398 samples, the 0.1 test size cannot very well be 39.8 samples, so it rounds up to 40, leaving 358 samples for training.
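You can check that arithmetic by hand. The round-up behavior here is my reading of scikit-learn's default and matches the shapes printed above:
import math
n_samples = 398
n_test = math.ceil(n_samples * 0.1)   # 39.8 rounds up to 40
n_train = n_samples - n_test          # the remaining 358 samples
print(n_test, n_train)                # 40 358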
References
sklearn.model_selection.train_test_split. scikit-learn.org. Retrieved June 10, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html