Machine learning algorithms have hyperparameters that allow the algorithms to be tailored to specific datasets.
Although the impact of hyperparameters may be understood generally, their specific effect on a dataset and their interactions during learning may not be known. Therefore, it is important to tune the values of algorithm hyperparameters as part of a machine learning project.
It is common to use naive optimization algorithms to tune hyperparameters, such as a grid search and a random search. An alternate approach is to use a stochastic optimization algorithm, like a stochastic hill climbing algorithm.
In this tutorial, you will discover how to manually optimize the hyperparameters of machine learning algorithms.
After completing this tutorial, you will know:
- Stochastic optimization algorithms can be used instead of grid and random search for hyperparameter optimization.
- How to use a stochastic hill climbing algorithm to tune the hyperparameters of the Perceptron algorithm.
- How to manually optimize the hyperparameters of the XGBoost gradient boosting algorithm.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Manual Hyperparameter Optimization
- Perceptron Hyperparameter Optimization
- XGBoost Hyperparameter Optimization
Manual Hyperparameter Optimization
Machine learning models have hyperparameters that you must set in order to customize the model to your dataset.
Often, the general effects of hyperparameters on a model are known, but how to best set a hyperparameter and combinations of interacting hyperparameters for a given dataset is challenging.
A better approach is to objectively search different values for model hyperparameters and choose a subset that results in a model that achieves the best performance on a given dataset. This is called hyperparameter optimization, or hyperparameter tuning.
A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.
- Random Search. Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.
- Grid Search. Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
Grid search is great for spot-checking combinations that are known to perform well generally. Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.
For more on grid and random search for hyperparameter tuning, see the tutorial:
- Hyperparameter Optimization With Random Search and Grid Search
Grid and random search are primitive optimization algorithms, and it is possible to use any optimization we like to tune the performance of a machine learning algorithm. For example, it is possible to use stochastic optimization algorithms. This might be desirable when good or great performance is required and there are sufficient resources available to tune the model.
Next, let’s look at how we might use a stochastic hill climbing algorithm to tune the performance of the Perceptron algorithm.
Perceptron Hyperparameter Optimization
The Perceptron algorithm is the simplest type of artificial neural network.
It is a model of a single neuron that can be used for two-class classification problems and provides the foundation for later developing much larger networks.
In this section, we will explore how to manually optimize the hyperparameters of the Perceptron model.
First, let’s define a synthetic binary classification problem that we can use as the focus of optimizing the model.
We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.
The example below creates the dataset and summarizes the shape of the data.
# define a binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)
Running the example prints the shape of the created dataset, confirming our expectations.
(1000, 5) (1000,)
The scikit-learn provides an implementation of the Perceptron model via the Perceptron class.
Before we tune the hyperparameters of the model, we can establish a baseline in performance using the default hyperparameters.
We will evaluate the model using good practices of repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class.
The complete example of evaluating the Perceptron model with default hyperparameters on our synthetic binary classification dataset is listed below.
# perceptron default hyperparameters for binary classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# define model
model = Perceptron()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Running the example reports evaluates the model and reports the mean and standard deviation of the classification accuracy.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
In this case, we can see that the model with default hyperparameters achieved a classification accuracy of about 78.5 percent.
We would hope that we can achieve better performance than this with optimized hyperparameters.
Mean Accuracy: 0.786 (0.069)
Next, we can optimize the hyperparameters of the Perceptron model using a stochastic hill climbing algorithm.
There are many hyperparameters that we could optimize, although we will focus on two that perhaps have the most impact on the learning behavior of the model; they are:
- Learning Rate (eta0).
- Regularization (alpha).
The learning rate controls the amount the model is updated based on prediction errors and controls the speed of learning. The default value of eta is 1.0. reasonable values are larger than zero (e.g. larger than 1e-8 or 1e-10) and probably less than 1.0
By default, the Perceptron does not use any regularization, but we will enable “elastic net” regularization which applies both L1 and L2 regularization during learning. This will encourage the model to seek small model weights and, in turn, often better performance.
We will tune the “alpha” hyperparameter that controls the weighting of the regularization, e.g. the amount it impacts the learning. If set to 0.0, it is as though no regularization is being used. Reasonable values are between 0.0 and 1.0.
First, we need to define the objective function for the optimization algorithm. We will evaluate a configuration using mean classification accuracy with repeated stratified k-fold cross-validation. We will seek to maximize accuracy in the configurations.
The objective() function below implements this, taking the dataset and a list of config values. The config values (learning rate and regularization weighting) are unpacked, used to configure the model, which is then evaluated, and the mean accuracy is returned.
# objective function
def objective(X, y, cfg):
# unpack config
eta, alpha = cfg
# define model
model = Perceptron(penalty='elasticnet', alpha=alpha, eta0=eta)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# calculate mean accuracy
result = mean(scores)
return result
Next, we need a function to take a step in the search space.
The search space is defined by two variables (eta and alpha). A step in the search space must have some relationship to the previous values and must be bound to sensible values (e.g. between 0 and 1).
We will use a “step size” hyperparameter that controls how far the algorithm is allowed to move from the existing configuration. A new configuration will be chosen probabilistically using a Gaussian distribution with the current value as the mean of the distribution and the step size as the standard deviation of the distribution.
We can use the randn() NumPy function to generate random numbers with a Gaussian distribution.
The step() function below implements this and will take a step in the search space and generate a new configuration using an existing configuration.
# take a step in the search space
def step(cfg, step_size):
# unpack the configuration
eta, alpha = cfg
# step eta
new_eta = eta + randn() * step_size
# check the bounds of eta
if new_eta <= 0.0:
new_eta = 1e-8
# step alpha
new_alpha = alpha + randn() * step_size
# check the bounds of alpha
if new_alpha < 0.0:
new_alpha = 0.0
# return the new configuration
return [new_eta, new_alpha]
Next, we need to implement the stochastic hill climbing algorithm that will call our objective() function to evaluate candidate solutions and our step() function to take a step in the search space.
The search first generates a random initial solution, in this case with eta and alpha values in the range 0 and 1. The initial solution is then evaluated and is taken as the current best working solution.
...
# starting point for the search
solution = [rand(), rand()]
# evaluate the initial point
solution_eval = objective(X, y, solution)
Next, the algorithm iterates for a fixed number of iterations provided as a hyperparameter to the search. Each iteration involves taking a step and evaluating the new candidate solution.
...
# take a step
candidate = step(solution, step_size)
# evaluate candidate point
candidte_eval = objective(X, y, candidate)
If the new solution is better than the current working solution, it is taken as the new current working solution.
...
# check if we should keep the new point
if candidte_eval >= solution_eval:
# store the new point
solution, solution_eval = candidate, candidte_eval
# report progress
print('>%d, cfg=%s %.5f' % (i, solution, solution_eval))
At the end of the search, the best solution and its performance are then returned.
Tying this together, the hillclimbing() function below implements the stochastic hill climbing algorithm for tuning the Perceptron algorithm, taking the dataset, objective function, number of iterations, and step size as arguments.
Source link