
Cross Validation and Model Scoring with Sklearn


Machine Learning Tutorial Part III

One of the most important aspects of machine learning is choosing the best model. The easiest way to compare several candidates is to put an instance of each model into a list. Then, you can iterate through that list and score each model with sklearn's cross-validation.

A model that predicts a continuous variable should use some form of regression. Make a list of regression models including LinearRegression and Lasso, among others. I like to import the models explicitly.

Sklearn Imports

from sklearn.linear_model import LinearRegression, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
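
The later code assumes you already have training and testing splits named X_train, X_test, y_train, and y_test. As a minimal sketch of that setup, the California housing dataset below is just a stand-in for your own data:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Stand-in dataset; substitute your own features (X) and target (y).
X, y = fetch_california_housing(return_X_y=True)

# Hold out 20% of the rows as the testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)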

Each model in the list needs parentheses after its name so that the list holds a model instance rather than the class itself. The model names must be capitalized exactly as they are in the imports. Note that svm is a lowercase module name, while the SVR class after the dot is capitalized.

models = [LinearRegression(),
          Lasso(),
          KNeighborsRegressor(),
          DecisionTreeRegressor(),
          RandomForestRegressor(),
          svm.SVR()]

You can include other regression models in your list.
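
For example, you could append Ridge or GradientBoostingRegressor to the list; these two are just illustrative additions, not requirements of the tutorial:

from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

# Optional extras; any sklearn regressor with fit() and predict() will work.
models.append(Ridge())
models.append(GradientBoostingRegressor())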

You will need to iterate through the list and cross-validate each model. So, you will need a loop like the following:

Iterating through a List of Models

for model in models:
    model.fit(X_train, y_train)
    # cv=5 splits the training set into five folds and scores each held-out fold.
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(scores)
    scores_average = sum(scores) / len(scores)
    print(scores_average)

In this case, cv=5 tells cross-validation to score five different folds of the training set. Then, these commands take the average of the five scores for each model; the model with the highest average is the best candidate. For regressors, each score is the R² (coefficient of determination) on the held-out fold, so the average cross-validation score is essentially a measure of how well the model fits unseen data.

[0.72134382 0.64887794 0.64134845 0.73704719 0.69361211]
0.6884459020996234
[0.72080033 0.64867155 0.64122091 0.73734226 0.69418965]
0.6884449406141131
[0.6917036 0.68284428 0.6543305 0.69444274 0.62876118]
0.6704164584529726
[0.6299635 0.65569718 0.57855542 0.65335421 0.51849398]
0.6072128576624796
[0.78912277 0.73661326 0.77499172 0.73509623 0.74291208]
0.7557472125735771
[-0.02066103 -0.00848656 0.00480716 -0.06634662 0.00565757]
-0.017005895408616568

According to the training set’s cross-validation scores, RandomForestRegressor appears to be the best model.
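
As a side note, cross_val_score returns a NumPy array, so scores.mean() gives the same average as summing and dividing. The sketch below is an optional, more compact version of the loop that also labels each score with the model's class name (type(model).__name__ is just a convenience, not part of the original code):

for model in models:
    # scoring="r2" matches the default score for regressors.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())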

Now, you will need to test each model's predictions on the testing set. Finally, print the score and the mean squared error for each model.

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(model.score(X_test, y_test),
          mean_squared_error(y_test, y_pred))

You will most likely find that the same model makes the best predictions, has the highest score, and has the lowest mean squared error. However, because data splitting and some of the models involve randomness, there is a small chance of coming up with a different result.
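
If you want to pick the winner programmatically rather than by reading the printout, one possible sketch is to store each model's test-set score in a dictionary and take the maximum; the variable names here are placeholders:

results = {}
for model in models:
    model.fit(X_train, y_train)
    # Key each test-set R² score by the model's class name.
    results[type(model).__name__] = model.score(X_test, y_test)

best_name = max(results, key=results.get)
print("Best model on the testing set:", best_name, results[best_name])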
