Machine learning involves using an algorithm to learn and generalize from historical data in order to make predictions on new data.
This problem can be described as approximating a function that maps examples of inputs to examples of outputs. The approximation problem can be solved by framing it as function optimization. This is where a machine learning algorithm defines a parameterized mapping function (e.g. a weighted sum of inputs) and an optimization algorithm is used to find the values of the parameters (e.g. model coefficients) that minimize the error of the function when used to map inputs to outputs.
This means that each time we fit a machine learning algorithm on a training dataset, we are solving an optimization problem.
In this tutorial, you will discover the central role of optimization in machine learning.
After completing this tutorial, you will know:
- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason why we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also performed during data preparation, hyperparameter tuning, and model selection in a predictive modeling project.
Let’s get started.
This tutorial is divided into three parts; they are:
- Machine Learning and Optimization
- Learning as Optimization
- Optimization in a Machine Learning Project
  - Data Preparation as Optimization
  - Hyperparameter Tuning as Optimization
  - Model Selection as Optimization
Machine Learning and Optimization
Function optimization is the problem of finding the set of inputs to a target objective function that result in the minimum or maximum of the function.
It can be a challenging problem as the function may have tens, hundreds, thousands, or even millions of inputs, and the structure of the function is unknown; it is also often non-differentiable and noisy.
- Function Optimization: Find the set of inputs that results in the minimum or maximum of an objective function.
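As a minimal sketch of function optimization, the snippet below searches for the inputs that minimize a small objective function. The objective, the starting point, and the choice of stochastic hill climbing are all assumptions made for illustration; the search only ever calls the objective and never needs its derivative.

```python
import random


def objective(x):
    # Hypothetical objective: a bowl whose minimum value of 0.0 is at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


def hill_climb(objective, start, step_size=0.1, n_iter=2000, seed=1):
    """Stochastic hill climbing: keep a random nearby point if it is no worse."""
    rng = random.Random(seed)
    best, best_score = list(start), objective(start)
    for _ in range(n_iter):
        # Propose a candidate by adding Gaussian noise to the current best inputs.
        candidate = [v + rng.gauss(0.0, step_size) for v in best]
        score = objective(candidate)
        if score <= best_score:
            best, best_score = candidate, score
    return best, best_score


best, best_score = hill_climb(objective, start=[-4.0, 4.0])
print(best, best_score)
```

The returned inputs should land close to (1, -2), the known minimum of this toy objective.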
Machine learning can be described as function approximation. That is, approximating the unknown underlying function that maps examples of inputs to outputs in order to make predictions on new data.
It can be challenging as there is often a limited number of examples from which we can approximate the function, and the structure of the function that is being approximated is often nonlinear, noisy, and may even contain contradictions.
- Function Approximation: Generalize from specific examples to a reusable mapping function for making predictions on new examples.
Function optimization is often simpler than function approximation.
Importantly, in machine learning, we often solve the problem of function approximation using function optimization.
At the core of nearly all machine learning algorithms is an optimization algorithm.
The process of working through a predictive modeling problem also involves optimization at multiple steps besides learning a model, including:
- Choosing the hyperparameters of a model.
- Choosing the transforms to apply to the data prior to modeling.
- Choosing the modeling pipeline to use as the final model.
Now that we know that optimization plays a central role in machine learning, let’s look at some examples of learning algorithms and how they use optimization.
Learning as Optimization
Predictive modeling problems involve making a prediction from an example of input.
A numeric quantity must be predicted in the case of a regression problem, whereas a class label must be predicted in the case of a classification problem.
The problem of predictive modeling is sufficiently challenging that we cannot write code to make predictions. Instead, we must use a learning algorithm applied to historical data to learn a “program” called a predictive model that we can use to make predictions on new data.
In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (f) given examples of input data (X) and associated output data (y).
- y = f(X)
Given new examples of input (Xhat), we must map each example onto the expected output value (yhat) using our learned function (fhat).
- yhat = fhat(Xhat)
The learned mapping will be imperfect. No model is perfect, and some prediction error is expected given the difficulty of the problem, noise in the observed data, and the choice of learning algorithm.
Mathematically, learning algorithms solve the problem of approximating the mapping function by solving a function optimization problem.
Specifically, given examples of inputs and outputs, find the set of inputs to the optimization problem, i.e. the parameters of the mapping function, that results in the minimum loss, minimum cost, or minimum prediction error.
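One way to write this down is shown below. The symbols are assumptions introduced here for illustration and do not appear in the original text: θ stands for the parameters of the mapping function, L for a per-example loss, and n for the number of training examples.

```latex
\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i; \theta)\big),
\qquad
\hat{f}(x) = f(x; \hat{\theta})
```

The learned function fhat is simply the mapping function evaluated at the parameters that the optimizer found.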
The more biased or constrained the choice of mapping function, the easier the optimization is to solve.
Let’s look at some examples to make this clear.
A linear regression (for regression problems) is a highly constrained model and can be solved analytically using linear algebra. The inputs to the mapping function are the coefficients of the model.
We can use an optimization algorithm, like a quasi-Newton local search algorithm, but it will almost always be less efficient than the analytical solution.
- Linear Regression: Function inputs are model coefficients; an optimization problem that can be solved analytically.
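The sketch below contrasts the two routes for a one-input linear regression. The data values are made up for illustration, and plain gradient descent is used as a simpler stand-in for the quasi-Newton method mentioned above, so treat it as an illustration of "analytical versus iterative" rather than a benchmark.

```python
def fit_analytic(xs, ys):
    """Closed-form least squares for the model y = b0 + b1 * x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def fit_iterative(xs, ys, lr=0.05, n_iter=5000):
    """Gradient descent on mean squared error, starting from zero coefficients."""
    n = len(xs)
    b0 = b1 = 0.0
    for _ in range(n_iter):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 * sum(errors) / n
        grad_b1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        b0, b1 = b0 - lr * grad_b0, b1 - lr * grad_b1
    return b0, b1


# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

print(fit_analytic(xs, ys))
print(fit_iterative(xs, ys))
```

Both routes arrive at essentially the same coefficients, but the analytical solution needs a single pass over the data while the iterative search needs thousands of updates.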
A logistic regression (for classification problems) is slightly less constrained and must be solved as an optimization problem, although something about the structure of the optimization function being solved is known given the constraints imposed by the model.
This means a local search algorithm like a quasi-Newton method can be used. We could use a simpler first-order method like stochastic gradient descent, but on small to medium datasets it will usually be less efficient.
- Logistic Regression: Function inputs are model coefficients; an optimization problem that requires an iterative local search algorithm.
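To make the iterative local search concrete, here is a minimal sketch that fits a one-input logistic regression by minimizing the mean log loss. It uses Newton's method with step halving, which relies on the exact second derivatives that quasi-Newton methods only approximate, so it is a stand-in rather than the solver a library would use. The dataset is hypothetical and deliberately not perfectly separable, so a finite optimum exists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, xs, ys):
    """Mean negative log-likelihood of p(y=1|x) = sigmoid(w0 + w1 * x)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w[0] + w[1] * x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def gradient(w, xs, ys):
    n = len(xs)
    residuals = [sigmoid(w[0] + w[1] * x) - y for x, y in zip(xs, ys)]
    return [sum(residuals) / n, sum(r * x for r, x in zip(residuals, xs)) / n]


def fit_newton(xs, ys, n_iter=25):
    """Newton's method with step halving so that the loss never increases."""
    n = len(xs)
    w = [0.0, 0.0]
    for _ in range(n_iter):
        g0, g1 = gradient(w, xs, ys)
        probs = [sigmoid(w[0] + w[1] * x) for x in xs]
        # Entries of the 2x2 Hessian of the mean log loss.
        h00 = sum(p * (1 - p) for p in probs) / n
        h01 = sum(p * (1 - p) * x for p, x in zip(probs, xs)) / n
        h11 = sum(p * (1 - p) * x * x for p, x in zip(probs, xs)) / n
        det = h00 * h11 - h01 * h01
        # Newton direction d solves H * d = g (explicit 2x2 inverse).
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        step, current = 1.0, log_loss(w, xs, ys)
        while step > 1e-8:
            trial = [w[0] - step * d0, w[1] - step * d1]
            if log_loss(trial, xs, ys) <= current:
                w = trial
                break
            step /= 2.0
    return w


# Hypothetical data: larger x values tend to belong to class 1, with overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

w = fit_newton(xs, ys)
print(w, log_loss(w, xs, ys))
```

Unlike the linear regression case, there is no formula that gives the coefficients directly; the search has to iterate until the gradient of the loss is close to zero.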
A neural network model is a very flexible learning algorithm that imposes few constraints. The inputs to the mapping function are the network weights. The search space is multimodal and highly nonlinear, so there is no analytical solution and a search that assumes a single optimum cannot be relied upon to find the best weights.
Stochastic gradient descent is commonly used. Strictly speaking, it is a gradient-based local search rather than a global optimization algorithm, but the noise in its updates helps it cope with the multimodal error surface, and the updates are made in a way that is aware of the structure of the model (backpropagation and the chain rule). We could use a global search algorithm that is oblivious of the structure of the model, like a genetic algorithm, but it will almost always be less efficient.
- Neural Network: Function inputs are model weights; a non-convex optimization problem that requires an iterative stochastic search algorithm.
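The following is a minimal sketch of stochastic gradient descent with backpropagation on a tiny network with one hidden layer. The network size, learning rate, and toy dataset (learning y = x squared on a few points) are assumptions chosen to keep the example short; it is not how a deep learning library would implement training.

```python
import math
import random


def forward(params, x):
    """One hidden tanh layer followed by a single linear output unit."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden)) + b2[0]
    return hidden, output


def mean_squared_error(params, data):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_sgd(data, n_hidden=4, lr=0.05, n_epochs=500, seed=0):
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = [0.0]
    params = (w1, b1, w2, b2)
    initial_loss = mean_squared_error(params, data)
    examples = list(data)
    for _ in range(n_epochs):
        # The "stochastic" part: update on one example at a time, in random order.
        rng.shuffle(examples)
        for x, y in examples:
            hidden, output = forward(params, x)
            d_output = 2.0 * (output - y)  # derivative of squared error w.r.t. output
            for j in range(n_hidden):
                # Chain rule back through the output weight and the tanh unit.
                d_pre = d_output * w2[j] * (1.0 - hidden[j] ** 2)
                w2[j] -= lr * d_output * hidden[j]
                w1[j] -= lr * d_pre * x
                b1[j] -= lr * d_pre
            b2[0] -= lr * d_output
    return params, initial_loss, mean_squared_error(params, data)


# Hypothetical dataset: 11 evenly spaced points of y = x^2 on [-1, 1].
data = [(i / 5 - 1, (i / 5 - 1) ** 2) for i in range(11)]
params, initial_loss, final_loss = train_sgd(data)
print(initial_loss, final_loss)
```

The weight updates use the structure of the model through the chain rule, which is what separates this from a structure-oblivious search such as a genetic algorithm.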
We can see that each algorithm makes different assumptions about the form of the mapping function, which influences the type of optimization problem to be solved.
We can also see that the default optimization algorithm used for each machine learning algorithm is not arbitrary; it represents the most efficient algorithm for solving the specific optimization problem framed by the algorithm, e.g. stochastic gradient descent for neural nets instead of a genetic algorithm. Deviating from these defaults requires a good reason.
Not all machine learning algorithms solve an optimization problem. A notable example is the k-nearest neighbors algorithm that stores the training dataset and does a lookup for the k best matches to each new example in order to make a prediction.
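For contrast, a k-nearest neighbors classifier can be sketched without any optimization step at all. The training points and labels below are made up for illustration; "fitting" is nothing more than keeping the list.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored examples."""
    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(point, query))

    neighbors = sorted(train, key=lambda item: squared_distance(item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored training set: (features, label) pairs.
train = [
    ((0.0, 0.0), "a"), ((1.0, 0.5), "a"), ((0.5, 1.0), "a"),
    ((5.0, 5.0), "b"), ((4.5, 5.5), "b"), ((5.5, 4.5), "b"),
]

print(knn_predict(train, (0.8, 0.8)))
print(knn_predict(train, (4.8, 5.1)))
```

No loss is minimized and no parameters are learned; all of the work happens at prediction time as a lookup.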
Now that we are familiar with learning in machine learning algorithms as optimization, let’s look at some related examples of optimization in a machine learning project.
Optimization in a Machine Learning Project
Optimization plays an important part in a machine learning project in addition to fitting the learning algorithm on the training dataset.
The step of preparing the data prior to fitting the model and the step of tuning a chosen model can also be framed as optimization problems. In fact, an entire predictive modeling project can be thought of as one large optimization problem.
Let’s take a closer look at each of these cases in turn.
Data Preparation as Optimization
Data preparation involves transforming raw data into a form that is most appropriate for the learning algorithms.
This might involve scaling values, handling missing values, and changing the probability distribution of variables.
Transforms can be made to change the representation of the historical data to meet the expectations or requirements of specific learning algorithms. Yet, sometimes better or even the best results can be achieved when the expectations are violated or when an unrelated transform to the data is performed.
We can think of choosing transforms to apply to the training data as a search or optimization problem of best exposing the unknown underlying structure of the data to the learning algorithm.
- Data Preparation: Function inputs are sequences of transforms; an optimization problem that requires an iterative global search algorithm.
This optimization problem is often performed manually with human-based trial and error. Nevertheless, it is possible to automate this task using a global optimization algorithm where the inputs to the function are the types and order of transforms applied to the training data.
The number and permutations of data transforms are typically quite limited and it may be possible to perform an exhaustive search or a grid search of commonly used sequences.
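An exhaustive search over short transform sequences might look like the sketch below. The three transforms, the straight-line model, the leave-one-out scoring, and the toy data are all assumptions made for illustration. As a simplification, each transform is applied to the whole input column before the data is split for evaluation.

```python
import itertools
import math


def standardize(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


# Hypothetical catalogue of candidate transforms for a single input column.
TRANSFORMS = {
    "log": lambda column: [math.log(v) for v in column],
    "sqrt": lambda column: [math.sqrt(v) for v in column],
    "standardize": standardize,
}


def fit_line(xs, ys):
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def loo_error(xs, ys):
    """Leave-one-out mean squared error of a straight-line fit."""
    total = 0.0
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (b0 + b1 * xs[i] - ys[i]) ** 2
    return total / len(xs)


def score_sequence(sequence, xs, ys):
    column = list(xs)
    try:
        for name in sequence:
            column = TRANSFORMS[name](column)
    except ValueError:
        return math.inf  # invalid sequence, e.g. log or sqrt of a negative value
    return loo_error(column, ys)


def search(xs, ys, max_length=2):
    """Score every ordered sequence of up to `max_length` distinct transforms."""
    candidates = [
        seq
        for length in range(max_length + 1)
        for seq in itertools.permutations(TRANSFORMS, length)
    ]
    return min(candidates, key=lambda seq: score_sequence(seq, xs, ys))


# Hypothetical data in which y is roughly linear in log(x).
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
ys = [math.log2(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]

best_sequence = search(xs, ys)
print(best_sequence, score_sequence(best_sequence, xs, ys))
```

The inputs to the function being optimized are the types and order of the transforms, and the output is the estimated model error, which matches the framing above.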
Hyperparameter Tuning as Optimization
Machine learning algorithms have hyperparameters that can be configured to tailor the algorithm to a specific dataset.
Although the dynamics of many hyperparameters are known, the specific effect they will have on the performance of the resulting model on a given dataset is not known. As such, it is a standard practice to test a suite of values for key algorithm hyperparameters for a chosen machine learning algorithm.
This is called hyperparameter tuning or hyperparameter optimization.
It is common to use a naive optimization algorithm for this purpose, such as a random search algorithm or a grid search algorithm.
- Hyperparameter Tuning: Function inputs are algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.
Nevertheless, it is becoming increasingly common to use an iterative global search algorithm for this optimization problem. A popular choice is a Bayesian optimization algorithm that is capable of simultaneously approximating the target function that is being optimized (using a surrogate function) while optimizing it.
This is desirable as evaluating a single combination of model hyperparameters is expensive, requiring fitting the model on the entire training dataset one or many times, depending on the choice of model evaluation procedure (e.g. repeated k-fold cross-validation).
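The naive grid search mentioned above can be sketched in a few lines. The example tunes k for a k-nearest neighbors classifier using leave-one-out accuracy; the dataset, the grid of values, and the evaluation procedure are assumptions chosen for brevity, not a recommendation.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k nearest training examples (one input feature)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each example from all of the others."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, x, k) == label
    return hits / len(data)


def grid_search(data, grid):
    """Evaluate every candidate hyperparameter value and keep the best one."""
    scores = {k: loo_accuracy(data, k) for k in grid}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical noisy one-dimensional dataset of (x, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
        (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]

best_k, scores = grid_search(data, grid=[1, 3, 5, 7])
print(best_k, scores)
```

Each entry in `scores` costs one full evaluation of the model, which is why methods that need fewer evaluations, such as Bayesian optimization, become attractive as models and datasets grow.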
Model Selection as Optimization
Model selection involves choosing one from among many candidate machine learning models for a predictive modeling problem.
Really, it involves choosing the machine learning algorithm or machine learning pipeline that produces a model. This is then used to train a final model that can then be used in the desired application to make predictions on new data.
This process of model selection is often a manual process performed by a machine learning practitioner involving tasks such as preparing data, evaluating candidate models, tuning well-performing models, and finally choosing the final model.
This can be framed as an optimization problem that subsumes part of or the entire predictive modeling project.
- Model Selection: Function inputs are the data transforms, machine learning algorithm, and algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.
This is increasingly the case with automated machine learning (AutoML) algorithms, which are used to choose an algorithm alone, an algorithm together with its hyperparameters, or the data preparation, algorithm, and hyperparameters together, with very little user intervention.
Like hyperparameter tuning, it is common to use a global search algorithm that also approximates the objective function, such as Bayesian optimization, given that each function evaluation is expensive.
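To illustrate the framing only, the sketch below treats the data transform, the algorithm, and its hyperparameter as the joint inputs of one search. It enumerates every combination exhaustively, which is an assumption made for brevity and is not the Bayesian optimization described above. The candidate transforms, the two toy algorithms, and the data are also hypothetical.

```python
import math


def predict_linear(train_x, train_y, x, _hyper):
    """Fit a straight line by least squares and predict at x."""
    n = len(train_x)
    x_mean, y_mean = sum(train_x) / n, sum(train_y) / n
    sxx = sum((v - x_mean) ** 2 for v in train_x)
    sxy = sum((v - x_mean) * (t - y_mean) for v, t in zip(train_x, train_y))
    slope = sxy / sxx
    return y_mean + slope * (x - x_mean)


def predict_knn(train_x, train_y, x, k):
    """Average the targets of the k nearest training examples."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x))[:k]
    return sum(t for _, t in nearest) / len(nearest)


# Hypothetical search space: transform x (algorithm, hyperparameter grid).
TRANSFORMS = {"identity": lambda v: v, "log": math.log}
ALGORITHMS = {"linear": (predict_linear, [None]), "knn": (predict_knn, [1, 3])}


def loo_mse(transform, predict, hyper, xs, ys):
    """Leave-one-out mean squared error of one complete pipeline."""
    txs = [transform(x) for x in xs]
    total = 0.0
    for i in range(len(txs)):
        train_x, train_y = txs[:i] + txs[i + 1:], ys[:i] + ys[i + 1:]
        total += (predict(train_x, train_y, txs[i], hyper) - ys[i]) ** 2
    return total / len(txs)


def select_pipeline(xs, ys):
    results = {}
    for t_name, transform in TRANSFORMS.items():
        for a_name, (predict, grid) in ALGORITHMS.items():
            for hyper in grid:
                key = (t_name, a_name, hyper)
                results[key] = loo_mse(transform, predict, hyper, xs, ys)
    return min(results, key=results.get), results


# Hypothetical data in which y is roughly linear in log(x).
xs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
ys = [math.log2(x) + 0.1 * (-1) ** i for i, x in enumerate(xs)]

best, results = select_pipeline(xs, ys)
print(best, results[best])
```

With only six pipelines an exhaustive search is cheap. Realistic search spaces are far larger, which is where a surrogate-based method earns its keep.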
This automated optimization approach to machine learning also underlies modern machine learning as a service (MLaaS) products provided by companies such as Google, Microsoft, and Amazon.
Although fast and efficient, such approaches are often still unable to outperform hand-crafted models prepared by highly skilled experts, such as those participating in machine learning competitions.
This article has been published from the source link without modifications to the text. Only the headline has been changed.