
Reducing Overfitting in Machine Learning with Adversarial Validation


Overfitting a model to your data is one of the most common challenges you will face as a data scientist. Sometimes the problem is obvious, such as when the model performs incredibly well on the training data but poorly on the test data. When this happens, you know the model has overfitted and you will try to fix it with cross-validation or hyperparameter tuning. But sometimes overfitting is subtle and not easily noticed.

Consider a situation where the test and training data are not obtained from the same source, so the feature distributions of the two sets may differ. In such cases, cross-validation will not help solve the overfitting problem, because the validation data is drawn from the training set and therefore follows the training distribution. Adversarial validation is used to overcome this.

In this article, we will focus on understanding how to use adversarial validation to detect and reduce this kind of overfitting, and we will implement it on a sample dataset.

What is adversarial validation?

Adversarial validation is a technique applied to the data to help detect and reduce this kind of overfitting. It was inspired by a post on FastML and is popular in Kaggle competitions. The idea is to measure how similar the training and test data are by analyzing their feature distributions. To do this, we build a classifier that predicts which set each row comes from, labelling rows from the training set 0 and rows from the test set 1. If the distributions of the train and test sets differ, this classifier will be able to pick up on the differences. The better the classifier distinguishes the two sets, the bigger the problem you have.
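
The whole idea fits in a few lines. Here is a minimal sketch using scikit-learn, assuming two hypothetical DataFrames X_train and X_test that share the same (numeric or already encoded) feature columns; the article's own CatBoost implementation follows below.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# X_train / X_test: hypothetical DataFrames with identical feature columns
# label each row with its origin (0 = train, 1 = test) and stack the two sets
combined = pd.concat([X_train.assign(dataset_label=0), X_test.assign(dataset_label=1)], ignore_index=True)
X = combined.drop(columns='dataset_label')
y = combined['dataset_label']
# cross-validated AUC near 0.5: the sets look alike; near 1.0: they are easy to tell apart
auc = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5, scoring='roc_auc').mean()
print(f'adversarial validation AUC: {auc:.3f}')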

Implementation of the adversarial validation model

Because this type of validation is mostly used in Kaggle competitions, let us pick a competition dataset and see how the adversarial model performs on it. The dataset used for this demonstration is found here.

We will download the data and perform the basic pre-processing required to get it into a usable format.

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from catboost import Pool, CatBoostClassifier
from sklearn.metrics import roc_curve, roc_auc_score
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head()

[Output: first five rows of the training data]

We will keep the target column aside, drop the columns that are not needed as features, and fill missing values in both sets with a default value.

target = train_data['price_doc']
train_data.drop(['id', 'timestamp', 'price_doc'], axis=1, inplace=True)
test_data.drop(['id', 'timestamp'], axis=1, inplace=True)
# fill missing values in both sets with the same default so that NaNs
# do not become an artificial signal for the adversarial classifier
train_data.fillna(-1, inplace=True)
test_data.fillna(-1, inplace=True)

Now we will add a label column for our adversarial validation classifier, containing 0 for rows of training data and 1 for rows of test data, and combine the two datasets into one.

# 0 marks rows from the training set, 1 marks rows from the test set
train_data['dataset_label'] = 0
test_data['dataset_label'] = 1
combined_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)

Next, we one-hot encode the categorical columns so the classifier can use them. The original list of categorical columns is not shown, so here we take the object-typed columns as the categorical ones.

categorical = combined_data.select_dtypes(include='object').columns.tolist()
final_dataset = pd.get_dummies(combined_data, columns=categorical)

Now it is time to build the classifier and train it. For convenience, let us use the CatBoost classifier.

def adversarial_validation(dataset):
    # shuffle the combined data and split it into a training part and a holdout part
    dataset = dataset.sample(frac=1)
    av_train = dataset[:len(train_data)]
    av_test = dataset[len(train_data):]
    return av_train, av_test
av_train, av_test = adversarial_validation(final_dataset)
train_pool = Pool(
    data=av_train.drop(columns='dataset_label'),
    label=av_train['dataset_label']
)
remaining_data = Pool(
    data=av_test.drop(columns='dataset_label'),
    label=av_test['dataset_label']
)
col_to_list = final_dataset.columns.tolist()
col_to_list.remove('dataset_label')
metrics = {
    'iterations': 500,
    'eval_metric': 'AUC'
}
model = CatBoostClassifier(**metrics)
_ = model.fit(train_pool, eval_set=remaining_data, plot=True, verbose=False)

Plotting a ROC curve will tell us how well the classifier separates the two sets.

def rocgraph(y_trues, y_preds, labels, x_max=1.0):
    fig, ax = plt.subplots()
    for i, y_pred in enumerate(y_preds):
        y_true = y_trues[i]
        fpr, tpr, thresholds = roc_curve(y_true, y_pred)
        auc = roc_auc_score(y_true, y_pred)
        ax.plot(fpr, tpr, label='%s; AUC=%.3f' % (labels[i], auc), marker='o', markersize=1)
    ax.legend()
    ax.grid()
    ax.plot(np.linspace(0, 1, 20), np.linspace(0, 1, 20), linestyle='--')  # diagonal = random classifier
    ax.set_title('ROC curve')
    ax.set_xlabel('False Positive Rate')
    ax.set_xlim([-0.01, x_max])
    _ = ax.set_ylabel('True Positive Rate')
rocgraph(
    [remaining_data.get_label().astype('int')],
    [model.predict_proba(remaining_data)[:,1]],
    ['Baseline']
)

[Figure: ROC curve for the baseline adversarial classifier, AUC = 0.990]

The graph shows an AUC of 0.990, which means the adversarial classifier separates the two sets almost perfectly. Remember that the better this classifier performs, the more the train and test distributions differ. Such a large gap means a model fitted on the training set is likely to perform poorly on the test set, because it overfits to patterns that do not carry over.
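
To make this interpretation concrete, here is a small self-contained check on synthetic data (not the competition dataset): when two samples come from the same distribution the adversarial AUC stays close to 0.5, and it climbs towards 1.0 as the distributions drift apart.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(0)
def adversarial_auc(sample_a, sample_b):
    # cross-validated AUC of a classifier that tries to separate the two samples
    X = np.vstack([sample_a, sample_b])
    y = np.concatenate([np.zeros(len(sample_a)), np.ones(len(sample_b))])
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='roc_auc').mean()
same = adversarial_auc(rng.normal(0, 1, (1000, 5)), rng.normal(0, 1, (1000, 5)))
shifted = adversarial_auc(rng.normal(0, 1, (1000, 5)), rng.normal(1.5, 1, (1000, 5)))
print(f'same distribution:    AUC = {same:.2f}')     # close to 0.5
print(f'shifted distribution: AUC = {shifted:.2f}')  # close to 1.0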

By plotting the feature importances of the adversarial classifier (here, SHAP values), we can identify which features it relies on most to tell the two sets apart.

def plot_features(model, remaining_data, features):
    # CatBoost returns SHAP values with one column per feature plus a final
    # column holding the expected (base) value
    featurevalues = model.get_feature_importance(remaining_data, type='ShapValues')
    expected_value = featurevalues[0, -1]
    featurevalues = featurevalues[:, :-1]
    shap.summary_plot(featurevalues, feature_names=features, plot_type='bar')
plot_features(model, remaining_data, col_to_list)

[Figure: SHAP feature importance of the adversarial classifier]

With this information, we can go ahead and try dropping a few features and evaluating our model again. Our aim is to make it as difficult as possible for the adversarial classifier to distinguish between train and test points.

Here, I have dropped two of the features the classifier leans on, 'kitch_sq' and 'full_sq', to see how this impacts the classification model.

metric1 = dict(metrics)
metric1.update({"ignored_features": ['kitch_sq', 'full_sq']})
model2 = CatBoostClassifier(**metric1)
_ = model2.fit(train_pool, eval_set=remaining_data, verbose=False)
rocgraph(
    [remaining_data.get_label().astype('int')]*2,
    [model.predict_proba(remaining_data)[:,1], model2.predict_proba(remaining_data)[:,1]],
    ['Baseline', 'Dropped features']
)

[Figure: ROC curves for the baseline classifier and the classifier with the two features removed]

The AUC dropped to roughly 0.96. That is still a fairly strong adversarial classifier, but by removing or transforming more of the drifting features it is possible to push the value lower.

Though adversarial validation is a great way to detect differences between the train and test distributions, the method itself does not prescribe how to fix them. What we can do is analyse the adversarial model and look at its most important features: they are the ones that let it tell the two sets apart, and therefore the natural candidates to drop.
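
As a sketch of how this analysis could be automated (this is not part of the original article), the helper below repeatedly retrains the adversarial classifier and drops its single most important feature until the AUC falls below a chosen threshold. The 0.6 threshold and the 70/30 split are illustrative choices, and params is expected to be a CatBoost parameter dict like the metrics dict defined earlier.

import numpy as np
from catboost import Pool, CatBoostClassifier
from sklearn.metrics import roc_auc_score
def prune_drifting_features(dataset, params, target_auc=0.6, max_drops=10):
    # dataset: the combined, one-hot encoded frame with a 'dataset_label' column
    dropped, auc = [], None
    for _ in range(max_drops):
        shuffled = dataset.sample(frac=1, random_state=42)
        split = int(len(shuffled) * 0.7)
        av_train, av_test = shuffled[:split], shuffled[split:]
        train_pool = Pool(av_train.drop(columns='dataset_label'), av_train['dataset_label'])
        test_pool = Pool(av_test.drop(columns='dataset_label'), av_test['dataset_label'])
        clf = CatBoostClassifier(**params)
        clf.fit(train_pool, eval_set=test_pool, verbose=False)
        auc = roc_auc_score(av_test['dataset_label'], clf.predict_proba(test_pool)[:, 1])
        if auc <= target_auc:
            break
        # drop the single feature the classifier leans on most and try again
        features = av_train.drop(columns='dataset_label').columns
        top = features[int(np.argmax(clf.get_feature_importance()))]
        dropped.append(top)
        dataset = dataset.drop(columns=top)
    return dataset, dropped, auc
# example usage (each iteration retrains the model, so this can be slow):
# pruned, dropped_features, final_auc = prune_drifting_features(final_dataset, metrics)

Keep in mind that every feature removed this way is also a feature the downstream model can no longer use, so this kind of pruning is a trade-off rather than a free fix.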

Conclusion

Adversarial validation can help identify the not-so-obvious reasons why a model performs well on training data but poorly on test data. Using it, you can build machine learning models that hold up better in the real world, which is why it is so popular among Kaggle competitors. Its drawback is that it only diagnoses distribution problems; it does not, by itself, provide a fix for them.
