Experimenting with Python

When working on a machine learning project, we frequently need to test multiple alternatives. Some Python features allow us to experiment with various options with little effort. In this tutorial, we’ll look at some tricks for speeding up our experiments.

After completing this tutorial, we will understand:

  1. How to use the duck-typing feature to quickly swap functions and objects
  2. How making components interchangeable can help accelerate experiments


Synopsis

This tutorial is divided into three sections, which are as follows:

  1. A machine learning project’s workflow
  2. Functions as objects
  3. Caveats

A machine learning project’s workflow

Consider the following very simple machine learning project:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = SVC()
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)

This is an example of a typical machine learning project workflow: a data preprocessing stage, then training a model, and finally evaluating the result. At each stage, however, we may want to try something different. For example, we might wonder whether normalizing the data would improve the accuracy. To find out, we can rewrite the code above as follows:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Train
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)

So far, everything is going well. But what if we keep experimenting with different datasets, models, and score functions? Changing back and forth between using and not using a scaler would require a lot of code changes and would make it very easy to make mistakes.

Because Python supports duck typing, we can see that the following two classifier models implement the same interface:

clf = SVC()
clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])

As a result, we can simply choose between these two versions and keep everything the same. We can say that these two models are interchangeable.

We can use this property to create a toggle variable to control the design decision we make:

USE_SCALER = True

if USE_SCALER:
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',SVC())])
else:
    clf = SVC()

We can choose whether or not to use a scaler by switching the variable USE_SCALER between True and False. A more complex example would be to choose between various scaler and classifier models, such as

SCALER = "standard"
CLASSIFIER = "svc"

if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

Here is a complete example:

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toggle between options
SCALER = "maxmin" # "standard", "maxmin", or None
CLASSIFIER = "cart" # "svc" or "cart"

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

# Create model
if CLASSIFIER == "svc":
    model = SVC()
elif CLASSIFIER == "cart":
    model = DecisionTreeClassifier()
else:
    raise NotImplementedError

if SCALER == "standard":
    clf = Pipeline([('scaler',StandardScaler()), ('classifier',model)])
elif SCALER == "maxmin":
    clf = Pipeline([('scaler',MinMaxScaler()), ('classifier',model)])
elif SCALER is None:
    clf = model
else:
    raise NotImplementedError

# Train
clf.fit(X_train, y_train)

# Test
score = clf.score(X_val, y_val)
print("Validation accuracy", score)

Functions as objects

Functions are first-class citizens in Python: a function can be assigned to a variable. In fact, functions and classes are both objects in Python (the classes themselves, not just their instances). As a result, we can use the same technique as before to experiment with similar functions:

import numpy as np

DIST = "normal"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)

The preceding is similar to calling np.random.normal(size=(10,5)), but keeping the function in a variable makes it convenient to swap functions. Because we call the functions with the same argument, we must ensure that all variations accept it. If one does not, we may need a few extra lines of code to create a wrapper. For example, to generate the Student’s t distribution, we need an additional parameter for the degrees of freedom:

import numpy as np

DIST = "t"

if DIST == "normal":
    rangen = np.random.normal
elif DIST == "uniform":
    rangen = np.random.uniform
elif DIST == "t":
    def t_wrapper(size):
        # Student's t distribution with 3 degrees of freedom
        return np.random.standard_t(df=3, size=size)
    rangen = t_wrapper
else:
    raise NotImplementedError

random_data = rangen(size=(10,5))
print(random_data)

This works because, as defined above, np.random.normal, np.random.uniform, and t_wrapper are all drop-in replacements for one another.
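
As an aside, instead of defining a named wrapper function, we could bind the extra argument with functools.partial from the standard library. A minimal sketch of the same idea:

import functools
import numpy as np

# bind df=3 so the resulting callable only needs the size argument
rangen = functools.partial(np.random.standard_t, df=3)

random_data = rangen(size=(10,5))
print(random_data)

Either way, the call site rangen(size=(10,5)) stays the same.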

Caveats

Machine learning differs from other programming projects in that the workflow is more uncertain. When you create a web page or a game, you have a clear picture of what you want to achieve. A machine learning project, by contrast, involves a good deal of exploratory work.

In other projects, you will most likely manage your development history with a source code control system such as git or Mercurial. In machine learning projects, however, we experiment with many combinations of steps, and using git to track every variation may be inappropriate, if not overkill. A toggle variable that controls the flow lets us try out different things more quickly, which is especially useful when working in Jupyter notebooks.
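
In a standalone script, the toggle variables could even be read from the command line, so that each experiment can be launched without editing the file. Here is a minimal sketch using argparse, assuming the SCALER and CLASSIFIER toggles from the earlier example:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--scaler", choices=["standard", "maxmin", "none"], default="standard")
parser.add_argument("--classifier", choices=["svc", "cart"], default="svc")
args = parser.parse_args()

# feed the command-line choices into the same toggle variables as before
SCALER = None if args.scaler == "none" else args.scaler
CLASSIFIER = args.classifier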

However, keeping multiple versions of the code in one place makes the program clumsy and difficult to read, so it is preferable to clean up once we have decided what to do. This will help with maintenance in the future.

Conclusion

In this tutorial, we learned how Python’s duck-typing feature can be used to create drop-in replacements. Specifically, we discovered:

  1. In a machine learning workflow, duck-typing can help us easily switch between alternatives.
  2. We can use a toggle variable to experiment among different alternatives.
