Weak and Strong Learners in Ensemble Learning

It is common to describe ensemble learning techniques in terms of weak and strong learners.

For example, we may desire to construct a strong learner from the predictions of many weak learners. In fact, this is the explicit goal of the boosting class of ensemble learning algorithms.

Although we may describe models as weak or strong generally, the terms have a specific formal definition and are used as the basis for an important finding from the field of computational learning theory.

In this tutorial, you will discover weak and strong learners and their relationship with ensemble learning.

After completing this tutorial, you will know:

  • Weak learners are models that perform slightly better than random guessing.
  • Strong learners are models that have arbitrarily good accuracy.
  • Weak and strong learners are tools from computational learning theory and provide the basis for the development of the boosting class of ensemble methods.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Weak Learners
  2. Strong Learners
  3. Weak vs. Strong Learners and Boosting

Weak Learners

A weak classifier is a model for binary classification that performs slightly better than random guessing.

A weak learner produces a classifier which is only slightly more accurate than random classification.

— Page 21, Pattern Classification Using Ensemble Methods, 2010.

This means that the model makes predictions with some known skill, although only a little, making the capabilities of the model weak. At the same time, it is not so weak as to have no skill at all, i.e. it still performs better than random guessing.

  • Weak Classifier: Formally, a classifier that achieves slightly better than 50 percent accuracy.

A weak classifier is sometimes called a “weak learner” or “base learner” and the concept can be generalized beyond binary classification.

Although the concept of a weak learner is well understood in the context of binary classification, it can be taken colloquially to mean any model that performs slightly better than a naive prediction method. In this sense, it is a useful tool for thinking about the capability of classifiers and the composition of ensembles.

  • Weak Learner: Colloquially, a model that performs slightly better than a naive model.
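
To make the formal definition concrete, the following is a minimal sketch (assuming scikit-learn and a synthetic dataset; all choices are illustrative and not taken from the article) that compares a chance-level baseline with a deliberately restricted model on a binary classification task.

```python
# A minimal sketch: comparing a chance-level baseline with a deliberately
# restricted model on a binary classification task (illustrative data and settings).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem with some label noise; parameters are arbitrary.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           flip_y=0.2, random_state=1)

# Chance baseline: uniform random guessing, expected accuracy near 0.5.
chance = DummyClassifier(strategy="uniform", random_state=1)

# Restricted model: a decision tree limited to a single split.
restricted = DecisionTreeClassifier(max_depth=1)

print("chance:     %.3f" % cross_val_score(chance, X, y, scoring="accuracy", cv=10).mean())
print("restricted: %.3f" % cross_val_score(restricted, X, y, scoring="accuracy", cv=10).mean())
```

The exact numbers depend on the data; the point is simply that the baseline hovers around the 50 percent line while a weak classifier sits somewhat above it.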

More formally, the notion has been generalized to multi-class classification, where the appropriate requirement is no longer simply better than 50 percent accuracy. With K classes, random guessing achieves an accuracy of only about 1/K, which makes the right threshold less obvious.

For binary classification, it is well known that the exact requirement for weak learners is to be better than random guess. […] Notice that requiring base learners to be better than random guess is too weak for multi-class problems, yet requiring better than 50% accuracy is too stringent.

— Page 46, Ensemble Methods, 2012.

The idea comes from formal computational learning theory, which defines a class of learning problems that possess weak learnability, meaning that a learner can do better than random guessing on them. Weak learnability was proposed as a relaxation of the more desirable strong learnability, in which a learner achieves arbitrarily good classification accuracy.

A weaker model of learnability, called weak learnability, drops the requirement that the learner be able to achieve arbitrarily high accuracy; a weak learning algorithm needs only output an hypothesis that performs slightly better (by an inverse polynomial) than random guessing.

— The Strength of Weak Learnability, 1990.

It is a useful concept as it is often used to describe the capabilities of contributing members of ensemble learning algorithms. For example, sometimes members of a bootstrap aggregation are referred to as weak learners as opposed to strong, at least in the colloquial meaning of the term.

More specifically, weak learners are the basis for the boosting class of ensemble learning algorithms.

The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners.

— Page 23, Ensemble Methods, 2012.

The most commonly used type of weak learning model is the decision tree. This is because the weakness of the tree can be controlled by the depth of the tree during construction.

For a binary classification task, the weakest decision tree consists of a single node that makes a decision on one input variable and outputs a binary prediction. This is generally referred to as a “decision stump.”

Here the weak classifier is just a “stump”: a two terminal-node classification tree.

— Page 339, The Elements of Statistical Learning, 2016.

It is used as a weak learner so often that decision stump and weak learner are practically synonyms.

  • Decision Stump: A decision tree with a single node operating on one input variable, the output of which makes a prediction directly.
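
As an illustration only (assuming scikit-learn; the dataset is synthetic), the sketch below fits a depth-1 decision tree and prints its structure, showing the single split on one input variable and the two terminal nodes that make it a stump.

```python
# A minimal sketch: a decision stump is a tree with a single split and two leaves.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=5, random_state=1)

# Limiting the depth to 1 yields a single decision node: a decision stump.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y)

# Print the learned rule: one test on one input variable, two terminal nodes.
print(export_text(stump, feature_names=["x%d" % i for i in range(X.shape[1])]))
```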

Nevertheless, other models can also be configured to be weak learners.

Because boosting requires a weak learner, almost any technique with tuning parameters can be made into a weak learner. Trees, as it turns out, make an excellent base learner for boosting …

— Page 205, Applied Predictive Modeling, 2013.

Although not formally known as weak learners, we can consider the following as candidate weak learning models:

  • k-Nearest Neighbors, with k=1 operating on one or a subset of input variables.
  • Multi-Layer Perceptron, with a single node operating on one or a subset of input variables.
  • Naive Bayes, operating on a single input variable.
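
The following sketch (assuming scikit-learn; all settings are illustrative, not prescriptions from the article) shows one way such models could be deliberately constrained, giving each only one or two input variables and minimal capacity, then scored with cross-validation.

```python
# A minimal sketch: configuring other models as candidate weak learners.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Each model is deliberately constrained and given only a slice of the inputs.
candidates = [
    ("knn, k=1, one variable", KNeighborsClassifier(n_neighbors=1), X[:, :1]),
    ("mlp, one hidden node, two variables",
     MLPClassifier(hidden_layer_sizes=(1,), max_iter=2000, random_state=1), X[:, :2]),
    ("naive bayes, one variable", GaussianNB(), X[:, :1]),
]

for name, model, features in candidates:
    score = cross_val_score(model, features, y, scoring="accuracy", cv=10).mean()
    print("%s: %.3f" % (name, score))
```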

Now that we are familiar with a weak learner, let’s take a closer look at strong learners.

Strong Learners

A strong classifier is a model for binary classification that achieves arbitrarily good accuracy, much better than random guessing.

A class of concepts is learnable (or strongly learnable) if there exists a polynomial-time algorithm that achieves low error with high confidence for all concepts in the class.

— The Strength of Weak Learnability, 1990.

This is sometimes interpreted to mean perfect skill on a training or holdout dataset, although in practice it more often refers to a “good” or “usefully skillful” model.

  • Strong Classifier: Formally, a classifier that achieves arbitrarily good accuracy.

We seek strong classifiers for predictive modeling problems. It is the goal of the modeling project to develop a strong classifier that makes mostly correct predictions with high confidence.

Again, although the concept of a strong classifier is well understood for binary classification, it can be generalized to other problem types and we can interpret the concept less formally as a well-performing model, perhaps near-optimal.

  • Strong Learner: Colloquially, a model that performs very well compared to a naive model.

We are attempting to develop a strong model when we fit a machine learning model directly on a dataset. For example, we might consider the following algorithms as techniques for fitting a strong model in the colloquial sense, where the hyperparameters of each method are tuned for the target problem:

  • Logistic Regression.
  • Support Vector Machine.
  • k-Nearest Neighbors.

This also includes many more methods, such as those listed in the previous section, or any others with which you may be familiar.
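
As a sketch of what “tuned for the target problem” might look like in practice (assuming scikit-learn; the grid and data are illustrative), a support vector machine can be wrapped in a small hyperparameter search:

```python
# A minimal sketch: fitting a candidate strong learner with tuned hyperparameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Scale the inputs, then tune the SVM regularization and kernel width.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = {"svm__C": [0.1, 1.0, 10.0], "svm__gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(pipeline, grid, scoring="accuracy", cv=10)
search.fit(X, y)
print("best accuracy: %.3f with %s" % (search.best_score_, search.best_params_))
```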

Strong learners are what we seek, and we can contrast their capability with that of weak learners, although we can also construct strong learners from weak learners.

Weak vs. Strong Learners and Boosting

We have established that weak learners perform slightly better than random guessing, that strong learners are good or even near-optimal, and that it is the latter we seek for a predictive modeling project.

In computational learning theory, specifically PAC (probably approximately correct) learning, the formal classes of weak and strong learnability were defined, leaving open the question of whether the two were equivalent.

The proof presented here is constructive; an explicit method is described for directly converting a weak learning algorithm into one that achieves arbitrary accuracy. The construction uses filtering to modify the distribution of examples in such a way as to force the weak learning algorithm to focus on the harder-to-learn parts of the distribution.

— The Strength of Weak Learnability, 1990.

It was then proved that the two are indeed equivalent: a strong learner, in the formal sense, can be constructed from many weak learners. This result provided the basis for the boosting class of ensemble learning methods.

The main result is a proof of the perhaps surprising equivalence of strong and weak learnability.

— The Strength of Weak Learnability, 1990.

Although this theoretical finding was established, it took several more years before the first viable boosting methods implementing the procedure were developed.

Most notably, Adaptive Boosting, referred to as AdaBoost, was the first successful boosting method. It later led to a large number of related methods, culminating today in highly successful techniques such as gradient boosting and implementations such as Extreme Gradient Boosting (XGBoost).

Ensembles of weak learners was mostly studied in the machine learning community. In this thread, researchers often work on weak learners and try to design powerful algorithms to boost the performance from weak to strong. This thread of work has led to the birth of famous ensemble methods such as AdaBoost, Bagging, etc., and theoretical understanding on why and how weak learners can be boosted to strong ones.

— Page 16, Ensemble Methods, 2012.

Generally, the goal of a boosting ensemble is to develop a large number of weak learners for a predictive learning problem, then combine them as effectively as possible in order to achieve a strong learner. This is a good goal because weak learners are easy to prepare but not desirable on their own, whereas strong learners are hard to prepare and highly desirable.

Since strong learners are desirable yet difficult to get, while weak learners are easy to obtain in real practice, this result opens a promising direction of generating strong learners by ensemble methods.

— Pages 16-17, Ensemble Methods, 2012.

  • Weak Learner: Easy to prepare, but not desirable due to their low skill.
  • Strong Learner: Hard to prepare, but desirable because of their high skill.

The procedure found to achieve this is to develop weak learners sequentially and add them to the ensemble, where each new weak learner is trained to pay more attention to those parts of the problem domain that prior models got wrong. All boosting techniques follow this general procedure, with their own specific differences and optimizations, yet the notion of weak and strong learners is a useful concept for machine learning and ensemble learning more generally.
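
A minimal sketch of this procedure (assuming scikit-learn; model sizes and data are illustrative) uses AdaBoost, whose default base learner in scikit-learn is a depth-1 decision tree, and compares a single stump with the boosted ensemble:

```python
# A minimal sketch: boosting decision stumps into a stronger ensemble with AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# A single decision stump: the weak learner.
stump = DecisionTreeClassifier(max_depth=1)

# AdaBoost fits stumps sequentially, re-weighting the examples that earlier
# stumps misclassified; its default base learner is a depth-1 decision tree.
ensemble = AdaBoostClassifier(n_estimators=100, random_state=1)

print("single stump:     %.3f" % cross_val_score(stump, X, y, cv=10).mean())
print("boosted ensemble: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())
```

On most datasets the boosted ensemble scores well above the single stump, which is the weak-to-strong conversion in action.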

For example, we have already seen that the goal of a predictive modeling project can be described as developing a strong model. It is common practice to evaluate the performance of a model against a baseline or naive model, such as random predictions for binary classification. A weak learner is very much like that naive model, although slightly skillful and using a minimum of information from the problem domain, as opposed to being completely naive.

Consider that although we do not technically construct weak learners in bootstrap aggregation (bagging), meaning the members are not decision stumps, we do aim to create weaker decision trees to make up the ensemble. This is often achieved by fitting each tree on a bootstrap sample of the data and not pruning the trees, allowing them to overfit the training data slightly.

For classification we can understand the bagging effect in terms of a consensus of independent weak learners

— Page 286, The Elements of Statistical Learning, 2016.

Both changes are made in order to produce less correlated trees, but they have the effect of training weaker, though perhaps not weak, models to make up the ensemble.

  • Bagging: explicitly trains weaker (but not weak) learners.
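
A sketch of this configuration (assuming scikit-learn; sizes are illustrative) bags unpruned decision trees, each fit on a bootstrap sample of the training data:

```python
# A minimal sketch: bagging unpruned decision trees fit on bootstrap samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Each member is an unpruned decision tree (the default base estimator) fit on a
# bootstrap sample: individually somewhat overfit ("weaker"), useful in aggregate.
ensemble = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=1)

print("bagged trees: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())
```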

Consider stacked generalization (stacking) that trains a model to best combine the predictions from multiple different models fit on the same training dataset. Each contributing level-0 model is in effect a strong learner, and the meta level-1 model seeks to make a stronger model by combining the predictions from the strong models.

  • Stacking: explicitly combines the predictions from strong learners.
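
A sketch of stacking (assuming scikit-learn; the choice of level-0 models and meta-model is illustrative) might look like the following:

```python
# A minimal sketch: stacking several capable level-0 models with a level-1 meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# Level-0 models: each a capable model in its own right.
level0 = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC()),
]

# Level-1 meta-model: learns how best to combine the level-0 predictions.
ensemble = StackingClassifier(estimators=level0, final_estimator=LogisticRegression(), cv=5)

print("stacking: %.3f" % cross_val_score(ensemble, X, y, cv=10).mean())
```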

Mixture of experts (MoE) operates in a similar way, training multiple strong models (the experts) whose predictions are combined into a hopefully stronger model via a meta-model (the gating network) and a combining method.

Mixture-of-experts can also be seen as a classifier selection algorithm, where individual classifiers are trained to become experts in some portion of the feature space. In this setting, individual classifiers are indeed trained to become experts, and hence are usually not weak classifiers

— Page 16, Ensemble Machine Learning, 2012.

This highlights that although weak and strong learnability are an important theoretical finding and the basis for boosting, the more general ideas of weak and strong learners are useful tools for thinking about, designing, and selecting ensemble methods.

Summary

In this tutorial, you discovered weak and strong learners and their relationship with ensemble learning.

Specifically, you learned:

  • Weak learners are models that perform slightly better than random guessing.
  • Strong learners are models that have arbitrarily good accuracy.
  • Weak and strong learners are tools from computational learning theory and provide the basis for the development of the boosting class of ensemble methods.
