When working on a machine learning project, we frequently need to test multiple alternatives. Some Python features allow us to experiment with various options with little effort. In this tutorial, we’ll look at some tricks for speeding up our experiments.
After completing this tutorial, we will understand
- How to use the duck-typing feature to quickly swap functions and objects
- How making components interchangeable can help accelerate experiments
Synopsis
This tutorial is divided into three parts:
- A machine learning project’s workflow
- Functions as objects
- Caveats
A machine learning project's workflow
Consider the following very simple machine learning project:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = SVC()
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
This is an example of a typical machine learning project workflow: a data preprocessing stage, then model training, and finally evaluation of the results. At each stage, however, we may want to try something different. For example, we might wonder whether normalizing the data would improve accuracy. To find out, we can rewrite the code above as follows:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
So far, everything is going well. But what if we keep experimenting with different datasets, models, and score functions? Changing back and forth between using and not using a scaler would require a lot of code changes and would make it very easy to make mistakes.
Because Python supports duck typing, we can see that the following two classifier models implement the same interface:
```python
clf = SVC()
clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
```
As a result, we can simply choose between these two versions and keep everything the same. We can say that these two models are interchangeable.
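To make the interchangeability concrete, here is a minimal sketch: a hypothetical helper function `evaluate()` (not part of scikit-learn; introduced here for illustration) that only assumes its argument has `fit()` and `score()` methods, so either version of the model can be passed in unchanged. To keep the sketch self-contained it uses scikit-learn's built-in iris loader rather than the CSV download above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(model, X_train, X_val, y_train, y_val):
    # Duck typing: works for ANY object exposing fit() and score(),
    # whether it is a bare SVC or a whole Pipeline
    model.fit(X_train, y_train)
    return model.score(X_val, y_val)

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Both models pass through the same helper without any code change
for clf in [SVC(), Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])]:
    print(type(clf).__name__, evaluate(clf, X_train, X_val, y_train, y_val))
```

Since `evaluate()` never checks the type of `model`, any future variation that provides `fit()` and `score()` can be dropped in as well.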
We can use this property to create a toggle variable to control the design decision we make:
```python
USE_SCALER = True

if USE_SCALER:
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
else:
    clf = SVC()
```
We can choose whether or not to use a scaler by switching the variable USE_SCALER between True and False. A more complex example is choosing among several scaler and classifier models:
```python
SCALER = "standard"
CLASSIFIER = "svc"

if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler', MinMaxScaler()), ('classifier', model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError
```
Here is a complete example:
```python
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toggle between options
SCALER = "maxmin"    # "standard", "maxmin", or None
CLASSIFIER = "cart"  # "svc" or "cart"

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Create model
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler', StandardScaler()), ('classifier', model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler', MinMaxScaler()), ('classifier', model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

# Train
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)
```
Functions as objects
Functions are first-class citizens in Python: a function can be assigned to a variable like any other value. In fact, functions and classes are both objects in Python (the classes themselves, not just their instances). As a result, we can use the same technique as before to swap between similar functions.
```python
import numpy as np

DIST = "normal"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```
The preceding is equivalent to calling np.random.normal(size=(10,5)), but we keep the function in a variable for the convenience of swapping one function for another. Because we call the function with the same argument, we must make sure that all variations accept it. If a variation does not, we may need a few extra lines of code to create a wrapper. For example, to generate Student's t distribution, we need an additional parameter for the degrees of freedom:
```python
import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    def t_wrapper(size):
        # Student's t distribution with 3 degrees of freedom
        return np.random.standard_t(df=3, size=size)
    rangen = t_wrapper
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)
```
This works because, as defined above, np.random.normal, np.random.uniform, and t_wrapper are all drop-in replacements for one another.
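As a side note, instead of defining a wrapper function by hand, the standard library's functools.partial can pin the extra argument and produce an equivalent callable. This is a sketch of that alternative, not something the original example requires:

```python
import functools
import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    # partial() fixes df=3; the result still accepts size= like the others
    rangen = functools.partial(np.random.standard_t, df=3)
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data.shape)
```

Both approaches produce a callable with the same calling convention, so the rest of the code does not need to know which distribution was chosen.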
Caveats
Machine learning differs from other programming projects in that the workflow is more uncertain. When you build a web page or a game, you know in advance what you want to achieve. A machine learning project, by contrast, involves a good deal of exploratory work.
To manage your source code development history in other projects, you’ll most likely use a source code control system like git or Mercurial. However, in machine learning projects, we experiment with various combinations of many steps. Using git to manage the various variations may not be appropriate, not to mention overkill in some cases. Using a toggle variable to control the flow should allow us to try out different things more quickly. This is especially useful when working on projects in Jupyter notebooks.
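When the if/elif chains controlled by toggle variables grow long, one lighter-weight variant (an optional refactoring, not something the examples above require) is a dictionary that maps each option name to a class or factory. A failed lookup then raises KeyError instead of the explicit NotImplementedError used earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Dictionary-dispatch variant of the if/elif chains.
# Values are classes, not instances, so each lookup builds a fresh model.
CLASSIFIERS = {"svc": SVC, "cart": DecisionTreeClassifier}
SCALERS = {"standard": StandardScaler, "maxmin": MinMaxScaler, None: None}

CLASSIFIER = "svc"
SCALER = "standard"

model = CLASSIFIERS[CLASSIFIER]()   # KeyError if the option name is unknown
scaler_cls = SCALERS[SCALER]
if scaler_cls is None:
    clf = model
else:
    clf = Pipeline([('scaler', scaler_cls()), ('classifier', model)])
print(type(clf).__name__)
```

Adding a new option then means adding one dictionary entry rather than another elif branch.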
However, keeping multiple versions of the code in one program makes it clumsy and hard to read. It is preferable to clean up once we have decided which option to keep; this will help with maintenance in the future.
Conclusion
In this tutorial, we learned how Python's duck-typing property can be used to create drop-in replacements. Specifically, we learned:
- In a machine learning workflow, duck-typing can help us easily switch between alternatives.
- How a toggle variable can be used to experiment among alternatives.