DIY XGBoost Library

XGBoost is probably one of the most widely used libraries in data science, relied on by data scientists all around the world. It’s a very versatile algorithm that can be used to perform classification and regression, and even to estimate confidence intervals. But how many of its users really understand its underlying principles?

You might think that a Machine Learning algorithm that performs as well as XGBoost might be using very complex and advanced mathematics. You probably imagine that it’s a masterpiece of software engineering.

And you’re partly right. The XGBoost library is a pretty complex one, but if you consider only the mathematical formulation of gradient boosting applied to decision trees, it’s not that complicated.

You will see below in detail how to train decision trees for regression using the gradient boosting method with less than 200 lines of code.

Decision Tree

Before getting into the mathematical details, let’s refresh our memory regarding decision trees. The principle is fairly simple: associate a value with a given set of features by traversing a binary tree. Each node of the binary tree carries a condition; leaves contain values.

If the condition is true, we continue the traversal using the left node, if not, we use the right one. Once a leaf is reached, we have our prediction.

As often, a picture is worth a thousand words:

[Figure: a three-level decision tree]

The condition attached to a node can be regarded as a decision, hence the name Decision Tree.

This kind of structure is pretty old in the history of computer science and has been used for decades with success. A basic implementation is given by the following lines:

class DecisionNode:
    """
    Node decision class.
    This is a simple binary node, with potentially two children: left and right
    Left node is returned when condition is true
    Right node is returned when condition is false
    """
    def __init__(self, name, condition, value=None):
        self.name = name
        self.condition = condition
        self.value = value
        self.left = None
        self.right = None

    def add_left_node(self, left):
        self.left = left

    def add_right_node(self, right):
        self.right = right

    def is_leaf(self):
        """
        Node is a leaf if it has no child
        """
        return (not self.left) and (not self.right)

    def next(self, data):
        """
        Return next node depending on data and node condition
        """
        cond = self.condition(data)
        if cond:
            return self.left
        else:
            return self.right


class DecisionTree:
    """
    A DecisionTree is a model that provides predictions depending on input.
    The prediction is the value of the leaf reached by traversing the tree from the root.
    """
    def __init__(self, root):
        """
        A DecisionTree is defined by its root node.
        """
        self.root = root

    def predict(self, data):
        """
        Traverse the tree from the root and return the value of the leaf reached.
        """
        child = self.root
        while child and not child.is_leaf():
            child = child.next(data)
        return child.value


root = DecisionNode('root', lambda d : d['A'] > 2.0)
root_left = DecisionNode('root_left', lambda d : d['B'] > 10.0, None)
root_right = DecisionNode('root_right', None, 1)
left_left = DecisionNode('left_left', None, 2)
left_right = DecisionNode('left_right', None, 3)

root.add_left_node(root_left)
root.add_right_node(root_right)

root_left.add_left_node(left_left)
root_left.add_right_node(left_right)

tree = DecisionTree(root)
print(tree.predict({'A' : 1, 'B' : 1})) # 1
print(tree.predict({'A' : 1, 'B' : 10})) # 1
print(tree.predict({'A' : 3, 'B' : 11})) # 2
print(tree.predict({'A' : 3, 'B' : 9})) # 3

Stacking multiple trees

Even though decision trees have been used with some success in a number of applications, like expert systems (before the AI winter), they remain very basic models that cannot handle the complexity usually encountered in real-world data.

We usually refer to this kind of estimator as a weak model.

To overcome this limitation, the idea emerged in the nineties to combine multiple weak models to create a strong model: ensemble learning.

This method can be applied to any kind of model, but as Decision Trees are simple, fast, generic and easily interpretable models, they are commonly used.

Various strategies can be deployed to combine models. We can, for instance, use a weighted sum of each model’s predictions or, better still, learn how to combine them with a Bayesian approach.

XGBoost and all boosting methods use another approach: each new model tries to compensate for the errors of the previous one.
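
To make this idea concrete, here is a toy, hand-rolled sketch (not XGBoost’s actual API): a second constant-per-region model is fitted on the residual left by the first one, and the final prediction is simply the sum of the two.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])           # targets
pred_1 = np.full_like(y, y.mean())           # weak model 1: always predict the mean
residual = y - pred_1                        # errors left by weak model 1
pred_2 = np.where(np.arange(len(y)) < 2,     # weak model 2: a crude two-leaf split
                  residual[:2].mean(),       # that fits the residual on each side
                  residual[2:].mean())
boosted = pred_1 + pred_2                    # boosted prediction = sum of both models
print(boosted)                               # [1.5 1.5 3.5 3.5], closer to y than pred_1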

Optimizing decision trees

As we have seen above, making predictions using a decision tree is straightforward. The job is hardly more complicated when using ensemble learning: all we have to do is sum the contributions of each model.

What’s really complicated is building the tree itself! How can we find the best condition to apply at each node, given our training dataset? This is where math helps us. A full derivation can be found in the XGBoost documentation. Here, we will focus only on the formulas of interest for this article.

As always in machine learning, we want to set our model parameters so that our model’s predictions on the training set minimize a given objective:

\[ \mathrm{obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k) \]
Objective formula.

Please note that this objective is made of two terms:

  • One to measure the error made by the prediction. It’s the famous loss function l(y, y_hat).
  • The other, omega, to control model complexity.

As stated in the XGBoost documentation, complexity is a very important part of the objective that allows us to tune the bias/variance trade-off. Many different functions can be used to define this regularization term. XGBoost uses:

\[ \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \]
Regularization term.

Here T is the total number of leaves, and w_j is the weight attached to leaf j. This means that large weights and a large number of leaves are penalized.
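
As a quick worked example with made-up numbers, the penalty of a small tree can be computed directly from this formula; shrinking the weights or pruning leaves lowers it:

import numpy as np

# Toy numbers (not taken from the article): a tree with T = 3 leaves
leaf_weights = np.array([2.0, -1.0, 0.5])
gamma, lbda = 1.0, 1.0

# omega = gamma * T + 0.5 * lambda * sum(w_j^2)
omega = gamma * len(leaf_weights) + 0.5 * lbda * np.sum(leaf_weights ** 2)
print(omega)  # 3.0 + 0.5 * 5.25 = 5.625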

As the loss is often a complex, non-linear function, we approximate it locally with a second-order Taylor expansion:

\[ \mathrm{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i \, \Delta\hat{y}_i + \frac{1}{2} h_i \, \Delta\hat{y}_i^2 \right] + \Omega \]
Second-order expansion of the loss.

where:

\[ g_i = \partial_{\hat{y}_i^{(t-1)}} \, l\big(y_i, \hat{y}_i^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}} \, l\big(y_i, \hat{y}_i^{(t-1)}\big) \]
Gradient and hessian formulas.

The expansion is computed with respect to the prediction, as we want to estimate how the error changes when the prediction changes. This approximation is essential, as it makes the minimization of the error tractable.

What we want to achieve with gradient boosting is to find the optimal delta_y_i that minimizes the loss function, i.e. to find how to modify the existing tree so that the modification improves the prediction.

When dealing with tree models, there are two kinds of parameters to consider:

  • The ones defining the tree itself: the condition at each node and the depth of the tree
  • The values attached to each leaf of the tree. These values are the predictions themselves.

Exploring every tree configuration would be too complex; therefore, gradient tree boosting methods only consider splitting one node into two leaves at a time. This means that we have to optimize three parameters:

  • The split value: on which condition do we split the data
  • The value attached to the left leaf
  • The value attached to the right leaf

In the XGBoost documentation, the best value for a leaf j with respect to the objective is given by:

\[ w_j^{*} = -\frac{G_j}{H_j + \lambda} \]
Optimal leaf value with respect to the objective.

where G_j is the sum of the gradients of the training points attached to leaf j, and H_j is the sum of their hessians.

The reduction in the objective function obtained with this optimal weight is:

\[ \mathrm{obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \]
Objective improvement when using the optimal weights.

Choosing the right split value is done by brute force: we compute the improvement for each candidate split value and keep the best one.
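
Condensed into a standalone sketch (squared error, lambda = gamma = 0, and helper names like obj_term and best_split invented for this example), the brute-force search looks like this:

import numpy as np

def obj_term(y, y_pred):
    """-0.5 * G^2 / H for a node, using the squared-error gradients and hessians."""
    g = 2.0 * (y_pred - y)          # per-point gradients
    h = np.full_like(y, 2.0)        # per-point hessians
    return -0.5 * g.sum() ** 2 / h.sum()

def best_split(x, y, y_pred):
    """Try every split position and keep the one with the largest objective reduction."""
    order = np.argsort(x)
    x, y, y_pred = x[order], y[order], y_pred[order]
    parent = obj_term(y, y_pred)
    best_gain, best_value = 0.0, None
    for i in range(1, len(x)):
        gain = parent - (obj_term(y[:i], y_pred[:i]) + obj_term(y[i:], y_pred[i:]))
        if gain > best_gain:
            best_gain, best_value = gain, x[i]
    return best_value, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 5.0, 5.0])
value, gain = best_split(x, y, np.zeros_like(y))
print(value, gain)  # 3.0 16.0: points with x < 3.0 go to the left leaf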

We now have all the required mathematical information to boost an initial tree performance with respect to a given objective by adding new leaves.

Before doing it concretely, let’s take some time to understand what these formulas mean.

 

Grokking gradient boosting

Let’s try to get some insight into how the weight is computed, and what G_j and H_j are equal to. As they are built from the gradient and hessian of the loss function with respect to the prediction, we have to pick a loss function.

We will focus on the squared error, which is commonly used and is the default objective for XGBoost:

\[ l(y, \hat{y}) = (y - \hat{y})^2 \]
Squared error loss function.

It’s a pretty simple formula, whose gradient is:

\[ \frac{\partial l}{\partial \hat{y}} = 2 \, (\hat{y} - y) \]
Gradient of the loss function.

and hessian:

\[ \frac{\partial^2 l}{\partial \hat{y}^2} = 2 \]
Hessian of the loss function.
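
These two results can be double-checked with automatic differentiation, using the same jax library as the implementation below (a standalone sketch with toy values, not part of the article’s code):

from jax import grad

# The squared error for a single point, as a function of the prediction
loss = lambda y_pred, y_true: (y_true - y_pred) ** 2

g = grad(loss, argnums=0)   # first derivative with respect to the prediction
h = grad(g, argnums=0)      # second derivative with respect to the prediction

print(g(3.0, 1.0))  # 4.0 == 2 * (3.0 - 1.0)
print(h(3.0, 1.0))  # 2.0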

Hence, if we recall the formula for the optimal weight that maximizes error reduction:

\[ w_j^{*} = -\frac{G_j}{H_j + \lambda} \]
Optimal leaf value with respect to the objective.

We realize that the optimal weight, i.e. the value that we add to the previous prediction, is the opposite of the average error between the previous prediction and the real value (when regularization is disabled, i.e. lambda = 0). Using the squared loss to train decision trees with gradient boosting thus boils down to updating the prediction with the average error in each new leaf.
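
As a quick numeric sanity check of this claim (toy values, lambda = 0):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 4.0, 4.0])        # previous prediction for this node

G = np.sum(2.0 * (y_pred - y_true))       # sum of the squared-error gradients
H = np.sum(np.full_like(y_true, 2.0))     # sum of the hessians (2 per point)

print(-G / H)                             # 1.0, the optimal weight with lambda = 0
print(np.mean(y_true - y_pred))           # 1.0, i.e. minus the average error y_pred - y_true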

We also see that lambda has the expected effect of keeping the weights from growing too large: the larger lambda is, the smaller the optimal weight.

 

Training decision trees

Now comes the easy part. Let’s say that we have an existing decision tree whose predictions have a given error. We want to reduce that error and improve the attached objective by splitting one of its nodes.

The algorithm to do so is pretty straightforward:

  • Pick a feature of interest
  • Sort the data points attached to the current node by the values of the selected feature
  • Pick a candidate split value
  • Put the data points below this split value in the left node and the ones above in the right node
  • Compute the objective term for the parent node, the left node and the right node
  • If the objective reduction (the parent term minus the sum of the left and right terms) is the best seen so far, keep this split value as the best one
  • Iterate over every candidate split value
  • Use the best split value, if any, and add the two new leaves
  • If no split improved the objective, don’t add child nodes

The resulting code defines a DecisionTree class that is configured with an objective, a number of estimators (i.e. the number of trees) and a max depth.

As promised, the code takes fewer than 200 lines:

import random

import numpy as np
import pandas as pd
import jax.numpy as jnp
from jax import grad, jacfwd, jacrev, jit

def hessian(fun):
    """
    Build a jitted hessian function using forward-over-reverse automatic differentiation
    """
    return jit(jacfwd(jacrev(fun)))

class DecisionNode:
    """
    Node decision class.
    This is a simple binary node, with potentially two children: left and right.
    Left node is returned when the condition is true,
    right node is returned when the condition is false.
    """
    def __init__(self, name, condition, value=None):
        self.name = name
        self.condition = condition
        self.value = value
        self.left = None
        self.right = None

    def add_left_node(self, left):
        self.left = left

    def add_right_node(self, right):
        self.right = right

    def is_leaf(self):
        """
        Node is a leaf if it has no child
        """
        return (not self.left) and (not self.right)

    def next(self, data):
        """
        Return next node depending on data and node condition
        """
        cond = self.condition(data)
        if cond:
            return self.left
        else:
            return self.right

class DecisionTree:
    """
    A DecisionTree is a model that provides predictions depending on input.
    The prediction is the sum of the values attached to the leaves activated by the input, plus the base score
    """
    def __init__(self, objective, nb_estimators, max_depth):
        """
        A DecisionTree is defined by an objective, a number of estimators and a max depth.
        """
        self.roots = [DecisionNode(f'root_{esti}', None, 0.0) for esti in range(0, nb_estimators)]
        self.objective = objective
        self.lbda = 0.0   # L2 regularization on the leaf weights (disabled here)
        self.gamma = 0.0  # complexity penalty per leaf (disabled here)
        self.grad = grad(self.objective)
        self.hessian = hessian(self.objective)
        self.max_depth = max_depth
        self.base_score = None


    def _create_condition(self, col_name, split_value):
        """
        Create a closure that captures the split value
        """
        return lambda dta : dta[col_name] < split_value

    def _pick_columns(self, columns):
        return random.choice(columns)

    def _add_child_nodes(self, node, nodes,
                         node_x, node_y,
                         split_value, split_column,
                         nb_nodes,
                         left_w, right_w, prev_w):
        node.name = f'{split_column} < {split_value}'
        node.condition = self._create_condition(split_column, split_value) # we must create a closure to capture split_value copy
        node.add_left_node(DecisionNode(f'left_{nb_nodes} - {split_column} < {split_value}',
                                        None, left_w + prev_w))
        node.add_right_node(DecisionNode(f'right_{nb_nodes} - {split_column} >= {split_value}',
                                         None, right_w + prev_w))
        mask = node_x[split_column] < split_value
        # Reverse order to ensure bfs
        nodes.append((node.left,
                      node_x[mask].copy(),
                      node_y[mask].copy(),
                      left_w + prev_w))
        nodes.append((node.right,
                      node_x[~mask].copy(),
                      node_y[~mask].copy(),
                      right_w + prev_w))


    def fit(self, x_train, y_train):
        """
        Fit decision trees using x_train and objective
        """
        self.base_score = y_train.mean()
        for tree_idx, tree_root in enumerate(self.roots):
            # store current node (currently a leaf), its training data and its leaf weight
            nodes = [(tree_root, x_train.copy(), y_train.copy(), 0.0)]
            nb_nodes = 0
            # Add node to tree using bfs
            while nodes:
                node, node_x, node_y, prev_w = nodes.pop(0)
                node_x['pred'] = self.predict(node_x)
                split_column = self._pick_columns(x_train.columns) # XGBoost uses a smarter heuristic here
                best_split, split_value, left_w, right_w = self._find_best_split(split_column,
                                                                                 node_x, node_y)
                if best_split != -1:
                    self._add_child_nodes(node, nodes,
                                          node_x, node_y,
                                          split_value, split_column,
                                          nb_nodes,
                                          left_w, right_w, prev_w)
                nb_nodes += 1
                if nb_nodes >= 2**self.max_depth-1:
                    break


    def _gain_and_weight(self, x_train, y_train):
        """
        Compute gain and leaf weight using automatic differentiation
        """
        pred = x_train['pred'].values
        G_i = self.grad(pred, y_train.values).sum()
        H_i = self.hessian(pred, y_train.values).sum()
        return -0.5 * G_i * G_i / (H_i + self.lbda) + self.gamma, -G_i / (H_i + self.lbda)

    def _find_best_split(self, col_name, node_x, node_y):
        """
        Compute best split
        """
        x_sorted = node_x.sort_values(by=col_name)
        y_sorted = node_y[x_sorted.index]
        current_gain, _ = self._gain_and_weight(x_sorted, node_y)
        gain = 0.0
        best_split = -1
        split_value, best_left_w, best_right_w = None, None, None
        for split_idx in range(1, x_sorted.shape[0]):
            left_data = x_sorted.iloc[:split_idx]
            right_data = x_sorted.iloc[split_idx:]
            left_y = y_sorted.iloc[:split_idx]
            right_y = y_sorted.iloc[split_idx:]
            left_gain, left_w = self._gain_and_weight(left_data, left_y)
            right_gain, right_w = self._gain_and_weight(right_data, right_y)
            if current_gain - (left_gain + right_gain) > gain:
                gain = current_gain - (left_gain + right_gain)
                best_split = split_idx
                split_value = x_sorted[col_name].iloc[split_idx]
                best_left_w = left_w
                best_right_w = right_w
        return best_split, split_value, best_left_w, best_right_w

    def predict(self, data):
        preds = []
        for _, row in data.iterrows():
            pred = 0.0
            for tree_idx, root in enumerate(self.roots):
                child = root
                while child and not child.is_leaf():
                    child = child.next(row)
                pred += child.value
            preds.append(pred)
        return np.array(preds) + self.base_score

    def show(self):
        print('not yet implemented')


def squared_error(y_pred, y_true):
    diff = y_true - y_pred
    return jnp.dot(diff, diff.T)

x_train = pd.DataFrame({"A" : [3.0, 2.0, 1.0, 4.0, 5.0, 6.0, 7.0]})
y_train = pd.DataFrame({"Y" : [3.0, 2.0, 1.0, 4.0, 5.0, 6.0, 7.0]})

tree = DecisionTree(squared_error, 1, 3)
tree.fit(x_train, y_train['Y'])
pred = tree.predict(pd.DataFrame({'A': [1., 2., 3., 4., 5., 6., 7.]}))
print(pred) #-> [1. 2. 3. 4. 5. 6. 7.]

tree = DecisionTree(squared_error, 2, 3)
tree.fit(x_train, y_train['Y'])
pred = tree.predict(pd.DataFrame({'A': [1., 2., 3., 4., 5., 6., 7.]}))
print(pred) #-> [1. 2. 3. 4. 5. 6. 7.]

tree = DecisionTree(squared_error, 4, 2)
tree.fit(x_train, y_train['Y'])
pred = tree.predict(pd.DataFrame({'A': [1., 2., 3., 4., 5., 6., 7.]}))
print(pred) # -> [1.        2.        3.        4.        5.        5.9999995 7.       ]

x_train = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
                              1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,],
                        'B': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                              1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,]})
y_train = pd.DataFrame({"Y" : [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
                               1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]})

tree = DecisionTree(squared_error, 1, 6)
tree.fit(x_train, y_train['Y'])
pred = tree.predict(pd.DataFrame({'A': [1., 2., 3., 4., 5., 6., 7.],
                                  'B': [0., 1., 0., 1., 0., 1., 0.]}))
print(pred) #-> [1.  2.5 3.  4.5 5.  6.5 7. ]

The core of the training is coded in the function _find_best_split. It essentially follows the steps detailed above.

Note that, to support any kind of objective without the pain of manually calculating the gradient and the hessian, we use automatic differentiation with the jax library to automate these computations.
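
For instance, assuming the same (y_pred, y_true) signature as squared_error above, a smoother alternative such as a pseudo-Huber loss could be passed to the DecisionTree constructor without deriving anything by hand. This is only a sketch of what is possible, not something shipped with the article’s code:

import jax.numpy as jnp

def pseudo_huber(y_pred, y_true, delta=1.0):
    """A smooth, twice-differentiable approximation of the absolute error."""
    diff = y_true - y_pred
    return jnp.sum(delta ** 2 * (jnp.sqrt(1.0 + (diff / delta) ** 2) - 1.0))

# tree = DecisionTree(pseudo_huber, 1, 3)  # same usage as with squared_error
# tree.fit(x_train, y_train['Y'])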

Initially, we start with a tree made of a single node whose leaf value is zero. As we mimic XGBoost, we also use a base score, which we set to the mean of the target values.

Also, note that in the fit method we stop building a tree once we reach the max depth defined when initializing it. We could have used other conditions, like a minimum number of samples per leaf or a minimum value for the new weight.

Another very important point is the choice of the feature used for splitting. Here, for the sake of simplicity, features are selected randomly, but we could use smarter strategies, like picking the feature with the greatest variance on the node’s data, as sketched below.
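
A hypothetical variant of _pick_columns could look like the following standalone helper (note that it needs the node’s data, so its signature differs from the method above):

import pandas as pd

def pick_column_by_variance(node_x: pd.DataFrame) -> str:
    """A hypothetical replacement for random selection: pick the feature with the
    largest variance on the data reaching the current node."""
    return max(node_x.columns, key=lambda col: node_x[col].var())

# Example: with two columns, the one spreading the most is selected
print(pick_column_by_variance(pd.DataFrame({'A': [1.0, 10.0], 'B': [1.0, 2.0]})))  # A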

Conclusion

In this article, we have seen how gradient boosting works to train decision trees. To further improve our understanding, we have written the minimal set of lines required to train an ensemble of decision trees and use them for prediction.

Having a deep understanding of the algorithms we use for machine learning is absolutely critical. It not only helps us build better models but, more importantly, allows us to bend these models to our needs.

For instance, in the case of gradient boosting, playing with the loss function is an excellent way to improve prediction quality.
