Running the example first summarizes the class distribution, confirming the imbalance was created as expected.
Counter({0: 9990, 1: 10}) 
Next, a scatter plot is created and examples are plotted as points colored by their class label, showing a large mass for the majority class (blue) and a few dots for the minority class (orange).
This severe class imbalance, with so few examples in the positive class and no obvious structure among them, might make a good basis for using one-class classification methods.
One-Class Support Vector Machines
The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.
If used for imbalanced classification, it is a good idea to evaluate the standard SVM and weighted SVM on your dataset before testing the one-class version.
When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.
… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.
- Estimating the Support of a High-Dimensional Distribution, 2001.
The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.
The main difference from a standard SVM is that it is fit in an unsupervised manner and does not provide the normal hyperparameters for tuning the margin like C. Instead, it provides a hyperparameter “nu” that controls the sensitivity of the support vectors and should be tuned to the approximate ratio of outliers in the data, e.g. 0.01 (1 percent).
...
# define outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
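To get a feel for what “nu” does, the sketch below fits the model with a few values and reports the share of training points that each fit flags as outliers. The Gaussian data, the random seed, and the list of “nu” values are assumptions chosen only for illustration; they are not part of the tutorial's dataset.

```python
# sketch: how nu changes the share of training points flagged as outliers
import numpy as np
from sklearn.svm import OneClassSVM

# hypothetical stand-in for "normal" data: 1,000 points from a 2D Gaussian
rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 2))

flagged = {}
for nu in [0.01, 0.05, 0.1]:
    model = OneClassSVM(gamma='scale', nu=nu)
    model.fit(X)
    # predict() returns -1 for outliers and +1 for inliers
    flagged[nu] = float(np.mean(model.predict(X) == -1))
    print('nu=%.2f flagged=%.3f' % (nu, flagged[nu]))
```

Larger values of “nu” should flag a larger share of the training data, which is why the value is usually set near the expected outlier ratio.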
The model can be fit on all examples in the training dataset or just those examples in the majority class. Perhaps try both on your problem.
In this case, we will try fitting on just those examples in the training set that belong to the majority class.
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
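Since both fitting strategies are worth trying, a rough way to compare them is sketched below. It reuses the tutorial's dataset and split settings; the dictionary of candidate training sets and the variable names are assumptions made for this illustration, and no particular scores are implied.

```python
# sketch: compare fitting on all training examples vs. the majority class only
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

# generate dataset and split it, as in the tutorial
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5,
    random_state=2, stratify=y)

# relabel the test set: majority (0) -> +1, minority (1) -> -1
testy_pm = np.where(testy == 1, -1, 1)

# hypothetical names for the two training sets being compared
candidates = {'all examples': trainX, 'majority only': trainX[trainy == 0]}
scores = {}
for name, data in candidates.items():
    model = OneClassSVM(gamma='scale', nu=0.01)
    model.fit(data)
    yhat = model.predict(testX)
    scores[name] = f1_score(testy_pm, yhat, pos_label=-1)
    print('%s: F1=%.3f' % (name, scores[name]))
```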
Once fit, the model can be used to identify outliers in new data.
When calling the predict() function on the model, it will output +1 for normal examples, so-called inliers, and -1 for outliers.
 Inlier Prediction: +1
 Outlier Prediction: -1
...
# detect outliers in the test set
yhat = model.predict(testX)
If we want to evaluate the performance of the model as a binary classifier, we must change the labels in the test dataset from 0 and 1 for the majority and minority classes respectively, to +1 and -1.
...
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
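The order of the two in-place assignments matters: the minority class must be relabeled first, otherwise every label ends up as -1. The sketch below shows the pitfall on a small made-up label array and an order-independent alternative using numpy.where(); the array values and variable names are illustrative assumptions.

```python
# sketch: relabeling 0/1 targets to +1/-1 without depending on statement order
import numpy as np

# small made-up label array: 0 = majority, 1 = minority
testy = np.array([0, 0, 1, 0, 1])

# one-step mapping: minority (1) -> -1, everything else -> +1
mapped = np.where(testy == 1, -1, 1)
print(mapped)

# pitfall: relabeling the majority first turns every label into 1,
# so the second statement then marks every example as an outlier
wrong = testy.copy()
wrong[wrong == 0] = 1
wrong[wrong == 1] = -1
print(wrong)
```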
We can then compare the predictions from the model to the expected target values and calculate a score. Given that we have crisp class labels, we might use a score like precision, recall, or a combination of both, such as the F-measure (F1-score).
In this case, we will use the F-measure, which is the harmonic mean of precision and recall. We can calculate the F-measure using the f1_score() function and specify the label of the minority class as -1 via the “pos_label” argument.
...
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
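To see why the “pos_label” argument matters, the F-measure can be worked out by hand on a tiny example. The helper f1_for_label() and the two label lists below are hypothetical and exist only for this illustration; the function counts true positives, false positives, and false negatives for the chosen label and then takes the harmonic mean of precision and recall.

```python
# sketch: F-measure computed by hand for a chosen positive label
def f1_for_label(y_true, y_pred, pos_label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        # no correct positive predictions: precision or recall is zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# made-up labels: +1 = inlier, -1 = outlier
y_true = [1, 1, 1, 1, -1, -1, 1, 1]
y_pred = [1, -1, 1, 1, -1, 1, -1, 1]

print('F1 (outliers as positive): %.3f' % f1_for_label(y_true, y_pred, -1))  # 0.400
print('F1 (inliers as positive): %.3f' % f1_for_label(y_true, y_pred, 1))    # 0.727
```

Scoring with the inlier label as positive gives a much higher number on the same predictions, which is why the minority label must be passed explicitly.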
Tying this together, we can evaluate the one-class SVM algorithm on our synthetic dataset. We will split the dataset in two and use half to train the model in an unsupervised manner and the other half to evaluate it.
The complete example is listed below.
# one-class svm for imbalanced binary classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
Running the example fits the model on the input examples from the majority class in the training set. The model is then used to classify examples in the test set as inliers and outliers.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.123 is achieved.
F1 Score: 0.123 
Isolation Forest
Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.
… Isolation Forest (iForest) which detects anomalies purely based on the concept of isolation without employing any distance or density measure
- Isolation-Based Anomaly Detection, 2012.
It is based on modeling the normal data in such a way to isolate anomalies that are both few in number and different in the feature space.
… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.
— Isolation Forest, 2008.
Tree structures are created to isolate anomalies. The result is that isolated examples have a relatively short depth in the trees, whereas normal data is less isolated and has a greater depth in the trees.
… a tree structure can be constructed effectively to isolate every single instance. Because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree; whereas normal points are isolated at the deeper end of the tree.
— Isolation Forest, 2008.
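The effect of isolation depth can be seen through the anomaly scores the model assigns. The sketch below fits an IsolationForest on made-up Gaussian data and scores one point near the center and one far away; the data, the seed, and the two query points are assumptions for illustration. In scikit-learn, score_samples() returns lower values for examples that are easier to isolate.

```python
# sketch: points that are easy to isolate receive lower anomaly scores
import numpy as np
from sklearn.ensemble import IsolationForest

# hypothetical "normal" data: 500 points from a 2D Gaussian
rng = np.random.RandomState(1)
normal = rng.normal(size=(500, 2))

model = IsolationForest(n_estimators=100, random_state=1)
model.fit(normal)

# one point in the middle of the data and one far from it
points = np.array([[0.0, 0.0], [6.0, 6.0]])
scores = model.score_samples(points)
print('central point score: %.3f' % scores[0])
print('distant point score: %.3f' % scores[1])
```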
The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.
Perhaps the most important hyperparameters of the model are the “n_estimators” argument that sets the number of trees to create and the “contamination” argument, which is used to help define the number of outliers in the dataset.
The ratio of positive to negative cases in our dataset is about 1:1000 (0.1 percent). Here we set the “contamination” argument to 0.01 (1 percent), a somewhat looser threshold than the true ratio.
...
# define outlier detection model
model = IsolationForest(contamination=0.01)
The model is probably best trained on examples that exclude outliers. In this case, we fit the model on the input features for examples from the majority class only.
...
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
Like one-class SVM, the model will predict an inlier with a label of +1 and an outlier with a label of -1; therefore, the labels of the test set must be changed before evaluating the predictions.
Tying this together, the complete example is listed below.
# isolation forest for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = IsolationForest(contamination=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
Running the example fits the isolation forest model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.154 is achieved.
F1 Score: 0.154 
Note: the contamination is quite low and may result in many runs with an F1 Score of 0.0.
To improve the stability of the method on this dataset, try increasing the contamination to 0.05 or even 0.1 and rerun the example.
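One way to try this is to loop over a few contamination values and score each fit, as in the sketch below. It reuses the tutorial's dataset and split; the fixed random_state on the model and the results dictionary are assumptions added here to make runs repeatable, and no particular scores are implied.

```python
# sketch: score the isolation forest for several contamination values
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest

# generate dataset and split it, as in the tutorial
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5,
    random_state=2, stratify=y)

# keep majority-class inputs for fitting; relabel the test set to +1/-1
majority = trainX[trainy == 0]
testy_pm = np.where(testy == 1, -1, 1)

results = {}
for contamination in [0.01, 0.05, 0.1]:
    # random_state is fixed here only so that repeated runs agree
    model = IsolationForest(contamination=contamination, random_state=1)
    model.fit(majority)
    yhat = model.predict(testX)
    results[contamination] = f1_score(testy_pm, yhat, pos_label=-1)
    print('contamination=%.2f F1=%.3f' % (contamination, results[contamination]))
```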
Minimum Covariance Determinant
If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.
For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multidimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.
This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.
It is unusual to have such well-behaved data, but if this is the case for your dataset, or you can use power transforms to make the variables Gaussian, then this approach might be appropriate.
The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.
— Minimum Covariance Determinant and Extensions, 2017.
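The underlying idea can be sketched with NumPy alone: estimate a mean and covariance from normal data, measure each new point's squared Mahalanobis distance, and flag points that fall outside an ellipse covering most of the distribution. Note that this sketch uses the classical (non-robust) mean and covariance, not the MCD estimate itself; the data, the 99 percent cutoff, and the helper name squared_mahalanobis() are assumptions for illustration.

```python
# sketch: elliptical outlier boundary from a classical Gaussian fit (not MCD)
import numpy as np

# hypothetical "normal" data: correlated 2D Gaussian
rng = np.random.RandomState(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)

# classical estimates of location and scatter
mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

def squared_mahalanobis(points):
    # (x - mean)^T * inv_cov * (x - mean) for each row of points
    diff = points - mean
    return np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# for two Gaussian inputs the squared distance follows a chi-squared
# distribution with 2 degrees of freedom, whose quantile at p is -2*ln(1-p)
threshold = -2.0 * np.log(1.0 - 0.99)

new_points = np.array([[0.5, 0.4], [3.0, -3.0]])
is_outlier = squared_mahalanobis(new_points) > threshold
print(is_outlier)
```

The second point lies against the direction of correlation in the data, so it falls well outside the ellipse even though each of its coordinates is only three standard deviations from the mean.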
The scikit-learn library provides access to this method via the EllipticEnvelope class.
It provides the “contamination” argument that defines the expected ratio of outliers to be observed in practice. We set it to 0.01 (1 percent), somewhat above the roughly 0.1 percent of outliers in our synthetic dataset.
...
# define outlier detection model
model = EllipticEnvelope(contamination=0.01)
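As a rough check of what “contamination” does, the sketch below fits the model with two values on made-up Gaussian data and reports the share of training points flagged as outliers in each case. The data, the seed, and the chosen values are assumptions for illustration only.

```python
# sketch: contamination sets how much of the training data is flagged
import numpy as np
from sklearn.covariance import EllipticEnvelope

# hypothetical "normal" data: 1,000 points from a 2D Gaussian
rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 2))

fractions = {}
for contamination in [0.01, 0.05]:
    model = EllipticEnvelope(contamination=contamination, random_state=1)
    model.fit(X)
    # predict() returns -1 for outliers and +1 for inliers
    fractions[contamination] = float(np.mean(model.predict(X) == -1))
    print('contamination=%.2f flagged=%.3f' % (contamination, fractions[contamination]))
```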
The model can be fit on the input data from the majority class only in order to estimate the distribution of “normal” data in an unsupervised manner.
...
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
The model will then be used to classify new examples as either normal (+1) or outliers (-1).
...
# detect outliers in the test set
yhat = model.predict(testX)
Tying this together, the complete example of using the elliptic envelope outlier detection model for imbalanced classification on our synthetic binary classification dataset is listed below.
# elliptic envelope for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.covariance import EllipticEnvelope
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = EllipticEnvelope(contamination=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
Running the example fits the elliptic envelope model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.157 is achieved.
F1 Score: 0.157 
Local Outlier Factor
A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.
This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.
The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples with the largest scores are more likely to be outliers.
We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.
- LOF: Identifying Density-Based Local Outliers, 2000.
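The scoring idea can be illustrated on made-up data with one obviously isolated point. In the sketch below, the cluster, the extra point at (8, 8), and the n_neighbors value are assumptions for illustration. scikit-learn stores the scores negated in the negative_outlier_factor_ attribute, so the most isolated example has the lowest value.

```python
# sketch: the most isolated example receives the most extreme LOF score
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# hypothetical data: a Gaussian cluster plus one distant point (last row)
rng = np.random.RandomState(1)
cluster = rng.normal(size=(200, 2))
X = np.vstack((cluster, [[8.0, 8.0]]))

model = LocalOutlierFactor(n_neighbors=20)
labels = model.fit_predict(X)

# scores are negated LOF values: lower means more outlying
scores = model.negative_outlier_factor_
most_isolated = int(np.argmin(scores))
print('most isolated row: %d, label: %d' % (most_isolated, labels[most_isolated]))
```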
The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.
The model can be defined with the expected proportion of outliers in the dataset; here we use 0.01 (1 percent) for our synthetic dataset.
...
# define outlier detection model
model = LocalOutlierFactor(contamination=0.01)
The model is not fit. Instead, a “normal” dataset is used as the basis for identifying outliers in new data via a call to fit_predict().
To use this model to identify outliers in our test dataset, we must first prepare the training dataset to only have input examples from the majority class.
...
# get examples for just the majority class
trainX = trainX[trainy==0]
Next, we can concatenate these examples with the input examples from the test dataset.
...
# create one large dataset
composite = vstack((trainX, testX))
We can then make a prediction by calling fit_predict() and retrieve only those labels for the examples in the test set.
...
# make prediction on composite dataset
yhat = model.fit_predict(composite)
# get just the predictions on the test set
yhat = yhat[len(trainX):]
To make things easier, we can wrap this up into a new function with the name lof_predict() listed below.
# make a prediction with a lof model
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions on the test set
    return yhat[len(trainX):]
The predicted labels will be +1 for normal and -1 for outliers, like the other outlier detection algorithms in scikit-learn.
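As an alternative to concatenating the datasets, the LocalOutlierFactor class in more recent scikit-learn releases accepts a “novelty” argument that allows the model to be fit on normal data and then applied to unseen data with predict(). The sketch below is an assumption-laden illustration of that mode on made-up Gaussian data, not the approach used in this tutorial, and it assumes a scikit-learn version that supports the argument.

```python
# sketch: LOF in novelty mode, fit on normal data and predict on new points
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# hypothetical "normal" data: 500 points from a 2D Gaussian
rng = np.random.RandomState(1)
normal = rng.normal(size=(500, 2))

# novelty=True enables predict() on data not seen during fit()
model = LocalOutlierFactor(n_neighbors=20, novelty=True)
model.fit(normal)

# one point inside the cluster and one far outside it
new_points = np.array([[0.0, 0.0], [8.0, 8.0]])
yhat = model.predict(new_points)
print(yhat)
```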
Tying this together, the complete example of using the LOF outlier detection algorithm for classification with a skewed class distribution is listed below.
# local outlier factor for imbalanced classification
from numpy import vstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# make a prediction with a lof model
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions on the test set
    return yhat[len(trainX):]

# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = LocalOutlierFactor(contamination=0.01)
# get examples for just the majority class
trainX = trainX[trainy==0]
# detect outliers in the test set
yhat = lof_predict(model, trainX, testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
Running the example uses the local outlier factor model with the training dataset in an unsupervised manner to classify examples in the test set as inliers and outliers, then scores the result.
Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.
In this case, an F1 score of 0.138 is achieved.
F1 Score: 0.138 
Summary
In this tutorial, you discovered how to use one-class classification algorithms for datasets with severely skewed class distributions.
Specifically, you learned:
 One-class classification is a field of machine learning that provides techniques for outlier and anomaly detection.
 How to adapt one-class classification algorithms for imbalanced classification with a severely skewed class distribution.
 How to fit and evaluate one-class classification algorithms such as SVM, isolation forest, elliptic envelope and local outlier factor.