What is Gradient Boosting?


One type of machine learning model that has proven extremely useful in data science competitions is the gradient boosting model. Gradient boosting is, in essence, the process of converting weak learning models into strong learning models. Yet how exactly is this accomplished? Let’s take a closer look at gradient boosting algorithms and better understand how a gradient boosting model converts weak learners into strong learners.

Defining Gradient Boosting

This article aims to give you a good intuition for what gradient boosting is, without many breakdowns of the mathematics that underlie the algorithms. Once you have an appreciation for how gradient boosting operates at a high level, you are encouraged to go deeper and explore the math that makes it possible.

Let’s start by defining what it means to “boost” a learner. In boosting, weak learners are converted into strong learners by adjusting the properties of the learning model. So what learning algorithm is actually being boosted?

Boosting models work by augmenting another common machine learning model, a decision tree.

A decision tree model functions by splitting a dataset into smaller and smaller portions; once the subsets can’t be split any further, the result is a tree with nodes and leaves. Nodes in a decision tree are where decisions about data points are made using different filtering criteria, while the leaves are the data points that have been classified. Decision tree algorithms can handle both numerical and categorical data, and splits in the tree are based on specific variables/features.
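As a minimal sketch of this idea, the example below assumes scikit-learn and its bundled Iris dataset (neither of which is prescribed by the article); it fits a small tree and prints its splits and leaves:

```python
# A minimal sketch, assuming scikit-learn is available; the Iris dataset
# and the max_depth value are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Nodes apply filtering criteria (e.g. "petal length <= 2.45"),
# and the leaves hold the final class assignments.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))    # shows the splits (nodes) and the leaves
print(tree.predict(X[:5]))  # classify a few data points
```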

Illustration of the way boosting models are trained.

One type of boosting algorithm is the AdaBoost algorithm. AdaBoost algorithms start by training a decision tree model and assigning an equal weight to every observation. After the first tree has been evaluated for accuracy, the weights for the different observations are adjusted. Observations that were easy to classify have their weights lowered, while observations that were difficult to classify have their weights increased. A second tree is created using these adjusted weights, with the aim that the second tree’s predictions will be more accurate than the first tree’s predictions.

The model now consists of the predictions for the original tree and the new tree (or Tree 1 + Tree 2). The classification accuracy is assessed once more based on the new model. A third tree is created based on the calculated error for the model, and the weights are once more adjusted. This process continues for a given number of iterations, and the final model is an ensemble model that uses the weighted sum of the predictions made by all the previously constructed trees.
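As a rough sketch of this sequential, re-weighted training scheme, the example below assumes scikit-learn, whose AdaBoostClassifier uses depth-1 decision trees (“stumps”) as its default weak learner; the dataset and hyperparameters are purely illustrative:

```python
# Sketch assuming scikit-learn; dataset and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on re-weighted data: observations the
# previous trees misclassified receive larger weights.
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
```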

The process described above uses decision trees as the base predictors/models, yet boosting can be carried out with a wide range of models, such as the many standard classifier and regressor models. The key concepts to understand are that subsequent predictors learn from the mistakes made by previous ones and that the predictors are created sequentially.
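For instance, again assuming scikit-learn, the weak learner can be swapped for another model that supports per-sample weights; note that the keyword is estimator in recent versions and base_estimator in older ones:

```python
# Illustrative sketch: boosting decision stumps vs. logistic regression
# as the weak learner. The parameter may be named `base_estimator`
# on older scikit-learn versions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

stump_boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
logit_boost = AdaBoostClassifier(
    estimator=LogisticRegression(max_iter=1000), n_estimators=50, random_state=0)
```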

The primary advantage of boosting algorithms is that they can arrive at accurate predictions more quickly than many other machine learning models. Care needs to be taken when employing boosting algorithms, however, as they are prone to overfitting.

Gradient Boosting

We’ll now look at one of the most common boosting algorithms. Gradient Boosting Models (GBMs) are known for their high accuracy, and they augment the general principles used in AdaBoost.

The primary difference between a Gradient Boosting Model and AdaBoost is that GBMs use a different method of calculating which learners are misidentifying data points. AdaBoost calculates where a model is underperforming by examining data points that are heavily weighted. Meanwhile, GBMs use gradients to determine the accuracy of learners, applying a loss function to a model. Loss functions are a way to measure the accuracy of a model’s fit on the dataset, calculating an error and optimizing the model to reduce that error. GBMs let the user optimize a specified loss function based on their desired goal.
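As an illustrative sketch (scikit-learn’s GradientBoostingRegressor is an assumption here, since the article prescribes no library), the loss to be optimized is exposed as a parameter; the accepted names vary by version, with “squared_error”, “absolute_error”, “huber” and “quantile” being the current ones:

```python
# Sketch assuming scikit-learn; the dataset and settings are illustrative.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Optimize squared error (MSE)...
gbm_mse = GradientBoostingRegressor(loss="squared_error", random_state=0).fit(X, y)

# ...or a more outlier-robust loss, depending on the goal.
gbm_huber = GradientBoostingRegressor(loss="huber", random_state=0).fit(X, y)
```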

Taking the most common loss function, Mean Squared Error (MSE), as an example: for MSE, the negative gradient of the loss with respect to each prediction is (up to a constant factor) simply the residual, the difference between the actual value and the current prediction. Gradient descent is therefore carried out by fitting each new tree to these residuals and updating the predictions by a predefined learning rate, moving the model toward the values where the loss is minimal.

To make this clearer:

Residual (the target for the next tree) = actual value – current prediction
Updated prediction = current prediction + learning rate × new tree’s prediction

In a more statistical sense, GBMs aim to find relevant patterns in a model’s residuals, adjusting the model to fit those patterns and bring the residuals as close to zero as possible. If a regression model fit the data perfectly, its residuals would be scattered randomly around 0; GBMs therefore look for remaining patterns within the residuals and update the model to account for them.

In other words, the predictions are updated so that the sum of all residuals is as close to 0 as possible, meaning that the predicted values will be very close to the actual values.
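A bare-bones version of this loop, written from scratch with NumPy and scikit-learn’s decision trees (all names and settings below are illustrative, not a library API), makes the residual-fitting idea concrete:

```python
# Minimal gradient boosting for regression with an MSE loss.
# Each new tree is fit to the current residuals (the negative gradient of MSE),
# and the ensemble prediction is nudged toward the targets by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_simple_gbm(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean(), dtype=float)  # start from the mean
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # what the model still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                # learn the pattern in the residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_simple_gbm(base, trees, X, learning_rate=0.1):
    prediction = np.full(X.shape[0], base, dtype=float)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction
```

Library implementations add many refinements on top of this (regularization, subsampling, support for other losses), but this is the core loop.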

Note that a wide variety of other loss functions (such as logarithmic loss) can be used by a GBM. MSE was selected above for the purpose of simplicity.

Variations On Gradient Boosting Models

Gradient Boosting Models are greedy algorithms that are prone to overfitting on a dataset. This can be guarded against with several different methods that can improve the performance of a GBM.

GBMs can be regularized with four different methods: Shrinkage, Tree Constraints, Random Sampling (also called Stochastic Gradient Boosting), and Penalized Learning.

Shrinkage

As previously mentioned, in GBMs the trees’ predictions are summed together sequentially. With “shrinkage,” each tree’s contribution to that sum is scaled down by a weight that slows the algorithm’s learning rate. More trees then need to be added to the model, which typically improves the model’s robustness and performance. The trade-off is that the model takes longer to train.
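In scikit-learn terms (an illustrative assumption, since the article names no library), shrinkage is controlled by the learning_rate parameter and is usually traded off against the number of trees:

```python
# Sketch: a smaller learning rate (more shrinkage) is typically paired with
# more trees; training is slower but often generalizes better.
from sklearn.ensemble import GradientBoostingRegressor

fast_but_coarse = GradientBoostingRegressor(learning_rate=0.5, n_estimators=50)
slow_but_robust = GradientBoostingRegressor(learning_rate=0.05, n_estimators=500)
```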

Tree Constraints

Constraining the trees, for example by limiting their depth or capping the number of nodes or leaves, makes it harder for the model to overfit. Imposing a constraint on the minimum number of observations per split has a similar effect. Once more, the trade-off is that it will take the model longer to train.
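Assuming scikit-learn’s implementation again, these constraints map onto parameters such as max_depth, max_leaf_nodes and min_samples_split; the values below are illustrative:

```python
# Sketch: a deliberately constrained GBM.
from sklearn.ensemble import GradientBoostingRegressor

constrained_gbm = GradientBoostingRegressor(
    max_depth=3,           # cap the depth of each tree
    max_leaf_nodes=8,      # cap the number of leaves per tree
    min_samples_split=20,  # require enough observations before splitting
)
```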

Random Sampling

The individual learners can be created through a stochastic process, based on randomly selected subsamples of the training dataset. This has the effect of reducing correlations between trees, which guards against overfitting. The dataset can be subsampled before creating each tree or before considering each split in the tree.
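In scikit-learn’s GBM (again an illustrative choice), row subsampling is controlled by the subsample parameter and per-split feature subsampling by max_features:

```python
# Sketch of stochastic gradient boosting; values are illustrative.
from sklearn.ensemble import GradientBoostingRegressor

stochastic_gbm = GradientBoostingRegressor(
    subsample=0.7,        # each tree is fit on a random 70% of the rows
    max_features="sqrt",  # each split considers a random subset of the features
)
```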

Penalized Learning

Beyond constraining the model by limiting the structure of the trees, it’s possible to use regression trees, which have a numerical value attached to each leaf. These leaf values function as weights and can be adjusted with common regularization functions such as L1 and L2 regularization.
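scikit-learn’s GBM does not expose L1/L2 penalties on the leaf values, so the sketch below assumes the third-party xgboost library (not mentioned above), where they appear as the reg_alpha and reg_lambda parameters:

```python
# Sketch assuming the xgboost package is installed; values are illustrative.
from xgboost import XGBRegressor

penalized_gbm = XGBRegressor(
    reg_alpha=0.1,   # L1 penalty on leaf weights
    reg_lambda=1.0,  # L2 penalty on leaf weights
)
```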