Pipeline Optimization With scikit-learn

This tutorial presents two essential concepts in data science and automated learning: the machine learning pipeline and its optimization. Both are central to building a successful system based on machine learning.

A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model. It can be used to automate a machine learning workflow. The pipeline can involve pre-processing, feature selection, classification/regression, and post-processing. More complex applications may need to fit in other necessary steps within this pipeline.
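To make the idea concrete, here is a minimal sketch of such a sequence of steps built with scikit-learn's Pipeline. The Iris dataset, the step names ("scaler", "selector", "classifier"), and the choice of StandardScaler, SelectKBest, and KNeighborsClassifier are assumptions made for illustration only; any compatible transformers and estimator could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption); any (X, y) pair would work here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, estimator) tuple; steps run in the order listed.
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),               # pre-processing
        ("selector", SelectKBest(f_classif, k=2)),  # feature selection
        ("classifier", KNeighborsClassifier()),     # classification
    ]
)

# fit() fits and applies each transformer in turn, then fits the classifier
# on the transformed training data.
pipe.fit(X_train, y_train)

# score() pushes the test data through the same fitted steps before scoring.
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Because the whole workflow lives in one object, the scaler and selector are fitted on the training data only and then reused unchanged on the test data, which helps avoid leaking test information into the pre-processing steps.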

By optimization, we mean tuning the model for the best performance. The success of a learning model depends heavily on choosing parameter values that give the best possible results. Optimization can be viewed as a search algorithm that walks through a space of candidate parameter values and picks the best-performing combination.
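The search view of optimization can be sketched as a plain loop over every combination in a small parameter grid. This is only an illustration under stated assumptions: the Iris dataset, the KNeighborsClassifier model, and the grid values are hypothetical choices, and cross-validated accuracy is used as the measure of "best".

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset (an assumption).
X, y = load_iris(return_X_y=True)

# Hypothetical search space: 4 x 2 = 8 candidate combinations.
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

best_score, best_params = -1.0, None

# ParameterGrid yields one dict per combination of parameter values.
for params in ParameterGrid(param_grid):
    model = KNeighborsClassifier(**params)
    # Mean accuracy over 5 cross-validation folds for this combination.
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

print("Best parameters:", best_params)
print(f"Best mean CV accuracy: {best_score:.3f}")
```

An exhaustive walk like this is the simplest possible search strategy. Its cost grows with the product of the number of values tried for each parameter, so the grid is usually kept small.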

After completing this tutorial, you should:

  • Appreciate the significance of a pipeline and its optimization.
  • Be able to set up a machine learning pipeline.
  • Be able to optimize the pipeline.
  • Know techniques to analyze the results of optimization.

The tutorial is simple and easy to follow. It should not take you too long to go through it. So enjoy!

Tutorial Overview

This tutorial will show you how to:

  1. Set up a pipeline using the Pipeline object from sklearn.pipeline.
  2. Perform a grid search for the best parameters using GridSearchCV() from sklearn.model_selection.
  3. Analyze the results from GridSearchCV() and visualize them.

Before we demonstrate all the above, let’s write the import section: