Machine learning and predictive modeling require you to choose the right algorithm. Data scientists and data analysts have to train, test, and validate algorithms to find the best one for a predictive model. Even the most experienced data scientist will have to test a model’s predictive accuracy compared to other models.
One of the most common machine learning packages for Python, scikit-learn, has methods for training, testing, and validation built in. Scikit-learn has a different name for installation than for importing into a Python program. You install it from the command line with pip install scikit-learn. However, it is referred to as sklearn when importing within Python code.
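As a quick illustration of the two names, here is a minimal sketch. It assumes scikit-learn is already installed. The LogisticRegression import is only an example of a typical sklearn import, not something this post depends on.

```python
# Install from the command line first (note the package name):
#   pip install scikit-learn

# Inside Python, the same package is imported as "sklearn".
import sklearn
from sklearn.linear_model import LogisticRegression  # example submodule import

# Print the installed version to confirm the import worked.
print(sklearn.__version__)
```

If the import fails with ModuleNotFoundError, the package is most likely not installed in the Python environment you are running.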
Machine Learning Book Recommendation
In this realm, one book that I highly recommend for mastering the programming side of machine learning is Machine Learning Mastery with Python by Jason Brownlee, Ph.D. The book does not go into theory or how the various algorithms work. It strictly covers the Python programming required. In this way, it is great for beginners. From the book, I found a list of commonly used machine learning datasets to help teach machine learning.
Predictive Model Algorithm Choice
In the coming weeks, I will cover numerous algorithms and explain the test/train split, as well as several kinds of validation methods.
Predictive modeling is mostly trying algorithms, comparing them, and finding the best one for your purposes. In addition to these testing methods, choosing the right model should also take the platform into account. Do you have virtually unlimited computing power and time, do you have a small web server, or do you have to run the model from a mobile phone? Where you implement the model is another aspect of model selection that is frequently overlooked initially.
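The try-and-compare loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the iris dataset, the two candidate models, and 5-fold cross-validation are my choices for the example, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Candidate algorithms to compare (an arbitrary pair for this sketch).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation (accuracy by default
# for classifiers) and keep the mean score per model.
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")

# Pick the candidate with the highest mean score.
best_name = max(scores, key=scores.get)
print("best candidate:", best_name)
```

Accuracy alone does not capture the platform constraints mentioned above. A slightly less accurate model may still be the better choice if it is much cheaper to run.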
Another factor in algorithm and model selection is the dataset itself. How big is it? How much data cleaning is needed? Do you get data from a CSV file, or do you have to access it through an API that returns JSON? It might also come from a database that you need to query directly.
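To make the three data sources concrete, here is a rough sketch that loads the same tiny table from CSV text, from a JSON string, and from a database query. Everything in it is made up for illustration: the column names, the values, and the in-memory SQLite database standing in for a real one. The JSON string stands in for what an API might return. It also assumes pandas is installed.

```python
import io
import json
import sqlite3

import pandas as pd

# 1) CSV: a string stands in for a file on disk.
csv_text = "height,weight\n1.70,65\n1.82,80\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# 2) JSON: a string stands in for the body of an API response.
api_response = '[{"height": 1.70, "weight": 65}, {"height": 1.82, "weight": 80}]'
from_json = pd.DataFrame(json.loads(api_response))

# 3) Database: an in-memory SQLite table stands in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (height REAL, weight REAL)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1.70, 65), (1.82, 80)])
from_db = pd.read_sql_query("SELECT height, weight FROM people", conn)
conn.close()

# All three sources end up as DataFrames with the same shape.
print(from_csv.shape, from_json.shape, from_db.shape)
```

Once the data is in a DataFrame, the rest of the modeling code does not need to know where it came from.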
Furthermore, where and how it will run is an important consideration. Will you have to run it in a web app that will need to train and test the model on the fly? Do you want to save the model or some of the results for use later? Again, where and how you implement the model may affect your model choice.
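If you do want to save a model for later use, one common approach is to write the fitted estimator to disk with joblib, which is installed alongside scikit-learn. The sketch below is a minimal example under assumptions: the dataset, the model, and the file name model.joblib are placeholders I chose for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a small model (dataset and algorithm chosen only for illustration).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model to a temporary directory; the file name is a placeholder.
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Later, possibly in another process, load the model and reuse it.
restored = joblib.load(model_path)
same = (restored.predict(X) == model.predict(X)).all()
print("predictions match after reload:", bool(same))
```

Files saved this way are pickle-based, so only load model files from sources you trust. It is also safest to load them with the same scikit-learn version that saved them.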
A handy reference for scikit-learn algorithms comes in the form of a flow chart that can help guide which model to choose.
Brownlee, J. (2019). Machine Learning Mastery with Python: Understand Your Data, Create Accurate Models and Work Projects from End-to-End. Jason Brownlee.
Choosing the Right Estimator. scikit-learn.org. article and image from: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html#, viewed on May 28, 2020.