Running the example first summarizes the class distribution, confirming the imbalance was created as expected.
    Counter({0: 9990, 1: 10})
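The class counts can be turned into the outlier fraction that the later hyperparameters refer to. This is a minimal sketch that uses only the counts reported above; the variable names are illustrative.

```python
from collections import Counter

# class counts as reported by the summary above
counts = Counter({0: 9990, 1: 10})

total = sum(counts.values())
# fraction of all examples that belong to the minority class
minority_fraction = counts[1] / total
# number of majority examples for each minority example
majority_per_minority = counts[0] / counts[1]

print('Minority fraction: %.4f (%.2f%%)' % (minority_fraction, minority_fraction * 100))
print('Approximate ratio: 1 to %d' % round(majority_per_minority))
```

Note that a fraction of 0.001 is 0.1 percent, whereas a value of 0.01 corresponds to 1 percent.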
Next, a scatter plot is created and examples are plotted as points colored by their class label, showing a large mass for the majority class (blue) and a few dots for the minority class (orange).
This severe class imbalance, with so few examples in the positive class and no apparent structure among them, might make a good basis for using one-class classification methods.
Scatter Plot of a Binary Classification Problem With a 1 to 1000 Class Imbalance

One-Class Support Vector Machines
The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.
If used for imbalanced classification, it is a good idea to evaluate the standard SVM and weighted SVM on your dataset before testing the one-class version.
When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.
… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.
— Estimating the Support of a High-Dimensional Distribution, 2001.
The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.
The main difference from a standard SVM is that it is fit in an unsupervised manner and does not provide the usual hyperparameters for tuning the margin, such as C. Instead, it provides a hyperparameter “nu” that sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. It should be tuned to the approximate ratio of outliers in the data, e.g. 0.01 (1 percent).
    ...
    # define outlier detection model
    model = OneClassSVM(gamma='scale', nu=0.01)
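To get a feel for what “nu” does, the following sketch fits the model with two different values and reports the fraction of the training data that each fitted model labels as an outlier. The dataset here is an assumption made for illustration (500 points from a 2-D standard Gaussian), not the tutorial's dataset, and the exact fractions will depend on the data.

```python
from numpy.random import RandomState
from sklearn.svm import OneClassSVM

# illustrative "normal" data: 500 points from a 2-D standard Gaussian
rng = RandomState(1)
X = rng.normal(size=(500, 2))

flagged_by_nu = {}
for nu in [0.01, 0.1]:
    model = OneClassSVM(gamma='scale', nu=nu)
    model.fit(X)
    # fraction of the training data labeled as outliers (-1)
    flagged_by_nu[nu] = (model.predict(X) == -1).mean()
    print('nu=%.2f flagged=%.3f' % (nu, flagged_by_nu[nu]))
```

A larger “nu” should flag a larger share of the training data, which is why it is usually set near the expected proportion of outliers.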
The model can be fit on all examples in the training dataset or just those examples in the majority class. Perhaps try both on your problem.
In this case, we will try fitting on just those examples in the training set that belong to the majority class.
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
Once fit, the model can be used to identify outliers in new data.
When calling the predict() function on the model, it will output a +1 for normal examples, so-called inliers, and a -1 for outliers.
- Inlier Prediction: +1
- Outlier Prediction: -1
    ...
    # detect outliers in the test set
    yhat = model.predict(testX)
If we want to evaluate the performance of the model as a binary classifier, we must change the labels in the test dataset from 0 and 1 for the majority and minority classes respectively, to +1 and -1.
    ...
    # mark inliers 1, outliers -1
    testy[testy == 1] = -1
    testy[testy == 0] = 1
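The order of the two in-place assignments matters: if the majority class were relabeled to 1 first, the second assignment would then turn every label into -1. A minimal sketch of an order-independent alternative is shown below, using plain Python lists and made-up labels; the dictionary names are illustrative. For NumPy arrays, numpy.where(testy == 1, -1, 1) achieves the same result without modifying the original array.

```python
# hypothetical test labels: 0 = majority class, 1 = minority class
testy = [0, 0, 1, 0, 1]

# map class labels to the outlier-detection convention in a single pass
to_outlier_label = {0: 1, 1: -1}
testy_marked = [to_outlier_label[label] for label in testy]
print(testy_marked)

# the reverse mapping turns model predictions back into class labels
to_class_label = {1: 0, -1: 1}
yhat = [1, -1, -1, 1, 1]
yhat_classes = [to_class_label[label] for label in yhat]
print(yhat_classes)
```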
We can then compare the predictions from the model to the expected target values and calculate a score. Given that we have crisp class labels, we might use a score like precision, recall, or a combination of both, such as the F-measure (F1-score).
In this case, we will use the F-measure, which is the harmonic mean of precision and recall. We can calculate the F-measure using the f1_score() function and specify the label of the minority class as -1 via the “pos_label” argument.
    ...
    # calculate score
    score = f1_score(testy, yhat, pos_label=-1)
    print('F1 Score: %.3f' % score)
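To make the role of “pos_label” concrete, the calculation can be worked by hand on a small set of made-up labels. This sketch counts true positives, false positives, and false negatives with -1 treated as the positive class; the label values are hypothetical.

```python
# hypothetical labels using the outlier convention: +1 inlier, -1 outlier
y_true = [1, 1, 1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, 1, -1, -1, -1, 1]
pos_label = -1

pairs = list(zip(y_true, y_pred))
# outliers correctly flagged
tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
# inliers wrongly flagged as outliers
fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
# outliers that were missed
fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print('Precision: %.3f, Recall: %.3f, F1: %.3f' % (precision, recall, f1))
```

Here one of two outliers is found (recall 0.5) and one of three flagged examples is a real outlier (precision 0.333), giving an F1 of 0.4.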
Tying this together, we can evaluate the one-class SVM algorithm on our synthetic dataset. We will split the dataset in two and use half to train the model in an unsupervised manner and the other half to evaluate it.
The complete example is listed below.
    # one-class svm for imbalanced binary classification
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from sklearn.svm import OneClassSVM
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # define outlier detection model
    model = OneClassSVM(gamma='scale', nu=0.01)
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
    # detect outliers in the test set
    yhat = model.predict(testX)
    # mark inliers 1, outliers -1
    testy[testy == 1] = -1
    testy[testy == 0] = 1
    # calculate score
    score = f1_score(testy, yhat, pos_label=-1)
    print('F1 Score: %.3f' % score)
Running the example fits the model on the input examples from the majority class in the training set. The model is then used to classify examples in the test set as inliers and outliers.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.123 is achieved.
    F1 Score: 0.123
Isolation Forest
Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.
… Isolation Forest (iForest) which detects anomalies purely based on the concept of isolation without employing any distance or density measure
— Isolation-Based Anomaly Detection, 2012.
It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.
… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.
— Isolation Forest, 2008.
Tree structures are created to isolate anomalies. The result is that isolated examples have a relatively short depth in the trees, whereas normal data is less isolated and has a greater depth in the trees.
… a tree structure can be constructed effectively to isolate every single instance. Because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree; whereas normal points are isolated at the deeper end of the tree.
— Isolation Forest, 2008.
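The isolation idea can be illustrated with a small sketch in pure Python. It is a deliberate simplification and not the scikit-learn implementation: the data is one-dimensional, no subsampling is used, and only the path to a single point is followed rather than building whole trees. The function names and the data are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    # number of random splits needed to separate `point` from the rest of `data`
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the side of the split that contains the point
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=1):
    # average the depth over many random "trees"
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# illustrative data: 256 "normal" values plus one value far from the rest
rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(256)]
outlier = 8.0
data = normal + [outlier]
# use the median of the normal values as a typical inlier
inlier = sorted(normal)[len(normal) // 2]

outlier_depth = mean_depth(outlier, data)
inlier_depth = mean_depth(inlier, data)
print('Outlier mean depth: %.2f' % outlier_depth)
print('Inlier mean depth: %.2f' % inlier_depth)
```

The value far from the others tends to be cut off after only a few random splits, while a value in the dense middle of the data needs many more.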
The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.
Perhaps the most important hyperparameters of the model are the “n_estimators” argument that sets the number of trees to create and the “contamination” argument, which is used to help define the number of outliers in the dataset.
We know the dataset contains about 1 positive case per 1,000 negative cases, a ratio of roughly 0.001 (0.1 percent). Here we set the “contamination” argument to the slightly larger value of 0.01 (1 percent).
    ...
    # define outlier detection model
    model = IsolationForest(contamination=0.01)
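The effect of “contamination” can be checked directly: after fitting, roughly that fraction of the training data should be labeled as outliers. The sketch below assumes an illustrative dataset of 1,000 points from a 2-D standard Gaussian and a fixed random_state so that the run is repeatable; neither is part of the tutorial's setup.

```python
from numpy.random import RandomState
from sklearn.ensemble import IsolationForest

# illustrative data: 1,000 points from a 2-D standard Gaussian
rng = RandomState(1)
X = rng.normal(size=(1000, 2))

model = IsolationForest(n_estimators=100, contamination=0.05, random_state=1)
model.fit(X)
# fraction of the training data labeled as outliers (-1)
flagged = (model.predict(X) == -1).mean()
print('Flagged fraction: %.3f' % flagged)
```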
The model is probably best trained on examples that exclude outliers. In this case, we fit the model on the input features for examples from the majority class only.
    ...
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
Like one-class SVM, the model will predict an inlier with a label of +1 and an outlier with a label of -1; therefore, the labels of the test set must be changed before evaluating the predictions.
Tying this together, the complete example is listed below.
    # isolation forest for imbalanced classification
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from sklearn.ensemble import IsolationForest
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # define outlier detection model
    model = IsolationForest(contamination=0.01)
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
    # detect outliers in the test set
    yhat = model.predict(testX)
    # mark inliers 1, outliers -1
    testy[testy == 1] = -1
    testy[testy == 0] = 1
    # calculate score
    score = f1_score(testy, yhat, pos_label=-1)
    print('F1 Score: %.3f' % score)
Running the example fits the isolation forest model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.154 is achieved.
    F1 Score: 0.154
Note: the contamination is quite low and may result in many runs with an F1 Score of 0.0.
To improve the stability of the method on this dataset, try increasing the contamination to 0.05 or even 0.1 and re-run the example.
Minimum Covariance Determinant
If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.
For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.
This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.
It is unusual to have such well-behaved data, but if this is the case for your dataset, or you can use power transforms to make the variables Gaussian, then this approach might be appropriate.
The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.
— Minimum Covariance Determinant and Extensions, 2017.
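The idea of an ellipsoid around Gaussian data can be made concrete with the Mahalanobis distance, which measures how far a point is from the center in units of the data's spread. The sketch below is a simplification: it uses the ordinary sample mean and covariance rather than the robust MCD estimate, and the data and function name are hypothetical.

```python
import random

rng = random.Random(1)
# illustrative "normal" data: two independent Gaussian inputs with different spreads
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 3.0)) for _ in range(1000)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
# entries of the 2x2 sample covariance matrix
var_x = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
det = var_x * var_y - cov_xy ** 2

def mahalanobis(point):
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # squared distance d^T S^-1 d, with the 2x2 inverse written out by hand
    squared = (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
    return squared ** 0.5

# two points at the same Euclidean distance from the center
dist_narrow = mahalanobis((3.0, 0.0))
dist_wide = mahalanobis((0.0, 3.0))
print('Distance along the narrow axis: %.2f' % dist_narrow)
print('Distance along the wide axis: %.2f' % dist_wide)
```

Both points are 3 units from the center, but the one along the narrow axis sits about three standard deviations out and is the more plausible outlier. A fixed threshold on this distance defines the ellipsoid.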
The scikit-learn library provides access to this method via the EllipticEnvelope class.
It provides the “contamination” argument that defines the expected ratio of outliers to be observed in practice. In our synthetic dataset the true ratio is about 0.001 (0.1 percent); as with the previous model, we set the argument to 0.01.
    ...
    # define outlier detection model
    model = EllipticEnvelope(contamination=0.01)
The model can be fit on the input data from the majority class only in order to estimate the distribution of “normal” data in an unsupervised manner.
    ...
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
The model will then be used to classify new examples as either normal (+1) or outliers (-1).
    ...
    # detect outliers in the test set
    yhat = model.predict(testX)
Tying this together, the complete example of using the elliptic envelope outlier detection model for imbalanced classification on our synthetic binary classification dataset is listed below.
    # elliptic envelope for imbalanced classification
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from sklearn.covariance import EllipticEnvelope
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # define outlier detection model
    model = EllipticEnvelope(contamination=0.01)
    # fit on majority class
    trainX = trainX[trainy==0]
    model.fit(trainX)
    # detect outliers in the test set
    yhat = model.predict(testX)
    # mark inliers 1, outliers -1
    testy[testy == 1] = -1
    testy[testy == 0] = 1
    # calculate score
    score = f1_score(testy, yhat, pos_label=-1)
    print('F1 Score: %.3f' % score)
Running the example fits the elliptic envelope model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.157 is achieved.
    F1 Score: 0.157
Local Outlier Factor
A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.
This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.
The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Examples with the largest scores are more likely to be outliers.
We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.
— LOF: Identifying Density-based Local Outliers, 2000.
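The scoring can be sketched in pure Python on a tiny dataset. This is an illustrative implementation of the usual LOF definition (k-distance, reachability distance, local reachability density), not scikit-learn's code; the function names, the choice of k=3, and the grid data are assumptions. It requires Python 3.8 or later for math.dist().

```python
import math

def knn(points, i, k):
    # (distance, index) pairs for the k nearest neighbors of points[i]
    dists = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return dists[:k]

def lof_scores(points, k=3):
    neighbors = [knn(points, i, k) for i in range(len(points))]
    # distance from each point to its k-th nearest neighbor
    k_distance = [nbrs[-1][0] for nbrs in neighbors]
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for nbrs in neighbors:
        reach = [max(k_distance[j], d) for d, j in nbrs]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbors relative to the point's own density
    return [sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
            for i, nbrs in enumerate(neighbors)]

# illustrative data: a 3x3 grid of points plus one point far from the grid
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
for point, score in zip(points, scores):
    print(point, '%.2f' % score)
```

Points inside the grid score close to 1 because their density matches that of their neighbors, while the distant point scores far above 1.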
The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.
The model can be defined and requires that the expected proportion of outliers in the dataset be indicated via the “contamination” argument. We again use 0.01, although the true ratio in our synthetic dataset is closer to 0.001 (0.1 percent).
    ...
    # define outlier detection model
    model = LocalOutlierFactor(contamination=0.01)
In its default configuration, the model does not provide a separate predict() function for new data. Instead, a “normal” dataset is used as the basis for identifying outliers in new data via a single call to fit_predict().
To use this model to identify outliers in our test dataset, we must first prepare the training dataset to only have input examples from the majority class.
    ...
    # get examples for just the majority class
    trainX = trainX[trainy==0]
Next, we can concatenate these examples with the input examples from the test dataset.
    ...
    # create one large dataset
    composite = vstack((trainX, testX))
We can then make a prediction by calling fit_predict() and retrieve only those labels for the examples in the test set.
    ...
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # get just the predictions on the test set
    yhat = yhat[len(trainX):]
To make things easier, we can wrap this up into a new function with the name lof_predict() listed below.
    # make a prediction with a lof model
    def lof_predict(model, trainX, testX):
        # create one large dataset
        composite = vstack((trainX, testX))
        # make prediction on composite dataset
        yhat = model.fit_predict(composite)
        # return just the predictions on the test set
        return yhat[len(trainX):]
The predicted labels will be +1 for normal and -1 for outliers, like the other outlier detection algorithms in scikit-learn.
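As an alternative to the composite dataset, scikit-learn's LocalOutlierFactor also accepts novelty=True, in which case the model is fit on the normal data and predict() is called on new data. This is a different mode from the one used in this tutorial, so results are not expected to match the composite approach exactly. The data below is made up for illustration.

```python
from numpy import array
from numpy.random import RandomState
from sklearn.neighbors import LocalOutlierFactor

# illustrative "normal" training data: 200 points from a 2-D standard Gaussian
rng = RandomState(1)
trainX = rng.normal(size=(200, 2))
# two new points: one in the middle of the data, one far away
testX = array([[0.0, 0.0], [10.0, 10.0]])

# novelty=True enables fit() on normal data followed by predict() on new data
model = LocalOutlierFactor(novelty=True)
model.fit(trainX)
yhat = model.predict(testX)
print(yhat)
```

In this mode, predict() is intended for new data only and should not be called on the training data.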
Tying this together, the complete example of using the LOF outlier detection algorithm for classification with a skewed class distribution is listed below.
    # local outlier factor for imbalanced classification
    from numpy import vstack
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    from sklearn.neighbors import LocalOutlierFactor

    # make a prediction with a lof model
    def lof_predict(model, trainX, testX):
        # create one large dataset
        composite = vstack((trainX, testX))
        # make prediction on composite dataset
        yhat = model.fit_predict(composite)
        # return just the predictions on the test set
        return yhat[len(trainX):]

    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # define outlier detection model
    model = LocalOutlierFactor(contamination=0.01)
    # get examples for just the majority class
    trainX = trainX[trainy==0]
    # detect outliers in the test set
    yhat = lof_predict(model, trainX, testX)
    # mark inliers 1, outliers -1
    testy[testy == 1] = -1
    testy[testy == 0] = 1
    # calculate score
    score = f1_score(testy, yhat, pos_label=-1)
    print('F1 Score: %.3f' % score)
Running the example uses the local outlier factor model with the training dataset in an unsupervised manner to classify examples in the test set as inliers and outliers, then scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.138 is achieved.
    F1 Score: 0.138
Summary
In this tutorial, you discovered how to use one-class classification algorithms for datasets with severely skewed class distributions.
Specifically, you learned:
- One-class classification is a field of machine learning that provides techniques for outlier and anomaly detection.
- How to adapt one-class classification algorithms for imbalanced classification with a severely skewed class distribution.
- How to fit and evaluate one-class classification algorithms such as SVM, isolation forest, elliptic envelope and local outlier factor.