Overfitting a model to your data is one of the most common challenges you will face as a data scientist. Sometimes the problem is obvious, such as when the model performs extremely well on the training data but poorly on the test data. When this happens, you know the model has overfitted and can try to fix it with cross-validation or hyperparameter tuning. But sometimes overfitting is subtle and not easily noticed.
Consider a situation where the training and test data are not drawn from the same source, so their feature distributions differ. In such cases, cross-validation will not reveal the problem, because the validation folds come from the training set and share its distribution. Adversarial validation is used to overcome this.
In this article, we will focus on understanding how adversarial validation helps detect and reduce this kind of overfitting, and implement it on a sample dataset.
What is adversarial validation?
Adversarial validation is a technique applied to the data to help detect and reduce this kind of overfitting. It was inspired by the FastML blog and is popular in Kaggle competitions. The idea is to measure how similar the training and test data are by comparing their feature distributions. To do this, we build a classifier that predicts whether a row comes from the training set or the test set: rows from the training set are labelled 1 and rows from the test set are labelled 0. If the distributions of the two sets differ, this classifier will be able to pick up on those differences. The better the classifier can distinguish the two sets, the bigger the problem you have.
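Before walking through the full example below, here is a minimal sketch of the idea using scikit-learn. The train_df and test_df names and the choice of a random forest are illustrative assumptions, not part of the original article:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def adversarial_auc(train_df, test_df):
    # Label the origin of each row: 1 = train, 0 = test.
    combined = pd.concat([train_df.assign(dataset_label=1),
                          test_df.assign(dataset_label=0)], axis=0)
    X = combined.drop(columns='dataset_label')
    y = combined['dataset_label']
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    # Cross-validated AUC: ~0.5 means train and test look alike,
    # values near 1.0 mean the sets are easy to tell apart.
    return cross_val_score(clf, X, y, cv=5, scoring='roc_auc').mean()

An AUC close to 0.5 means the two sets are essentially indistinguishable, which is what we want.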
Implementation of the adversarial validation model
Because this type of validation is mostly used in Kaggle competitions, let us pick a competition dataset and see how the technique performs on it. The dataset used for this demonstration can be found here.
We will download the data and perform all the basic pre-processing required to get the data into a usable format.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt  # used later for the ROC plot
import seaborn as sns
import shap  # used later for the feature-importance plot
from catboost import Pool, CatBoostClassifier
from sklearn.metrics import roc_curve, roc_auc_score

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head()
We will separate out the target variable and drop the columns we do not need. We will also fill missing values with a default value.
target = train_data['price_doc']
train_data.drop(['id', 'timestamp', 'price_doc'], axis=1, inplace=True)
train_data.fillna(-1, inplace=True)
test_data.drop(['id', 'timestamp'], axis=1, inplace=True)
Now we will create a separate label column for our adversarial validation classifier, containing 1 for rows from the training data and 0 for rows from the test data, and combine both datasets into one.
train_data['dataset_label'] = 1
test_data['dataset_label'] = 0
combined_data = pd.concat([train_data, test_data], axis=0)
# The list of categorical columns is not shown in the original snippet;
# infer it from the column dtypes before one-hot encoding.
categorical = combined_data.select_dtypes(include='object').columns.tolist()
final_dataset = pd.get_dummies(combined_data, columns=categorical)
We have one-hot encoded the categorical features. Now it is time to build and train the classifier. For convenience, let us use the CatBoost classifier.
def adversarial_validation(dataset):
    # Shuffle the combined data, then split it back into two parts
    # of the original train/test sizes.
    dataset = dataset.sample(frac=1)
    av_train = dataset[:len(train_data)]
    av_test = dataset[len(train_data):]
    return av_train, av_test

av_train, av_test = adversarial_validation(final_dataset)

train_data = Pool(
    data=av_train.drop('dataset_label', axis=1),
    label=av_train['dataset_label']
)
remaining_data = Pool(
    data=av_test.drop('dataset_label', axis=1),
    label=av_test['dataset_label']
)

col_to_list = final_dataset.columns.tolist()
col_to_list.remove('dataset_label')

metrics = {
    'iterations': 500,
    'eval_metric': 'AUC'
}

model = CatBoostClassifier(**metrics)
_ = model.fit(train_data, eval_set=remaining_data, plot=True, verbose=False)
Plotting the ROC curve tells us how well the classifier is performing.
def rocgraph(y_trues, y_preds, labels, x_max=1.0):
    fig, ax = plt.subplots()
    for i, y_pred in enumerate(y_preds):
        y_true = y_trues[i]
        fpr, tpr, thresholds = roc_curve(y_true, y_pred)
        auc = roc_auc_score(y_true, y_pred)
        ax.plot(fpr, tpr, label='%s; AUC=%.3f' % (labels[i], auc), marker='o', markersize=1)
    ax.legend()
    ax.grid()
    ax.plot(np.linspace(0, 1, 20), np.linspace(0, 1, 20), linestyle='--')
    ax.set_title('ROC curve')
    ax.set_xlabel('False Positive Rate')
    ax.set_xlim([-0.01, x_max])
    _ = ax.set_ylabel('True Positive Rate')

rocgraph(
    [remaining_data.get_label().astype('int')],
    [model.predict_proba(remaining_data)[:, 1]],
    ['Baseline']
)
The graph above shows an AUC of 0.990, which means the classifier separates the two sets almost perfectly. But remember: the better this classifier performs, the more the train and test distributions differ. A model trained on this training set is therefore likely to perform poorly on the test set, because it will overfit to patterns that exist only in the training data.
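If you prefer a single number to a plot, the same AUC can be computed directly from the fitted model and the validation Pool (a small sketch reusing the objects defined above):

# Compute the adversarial validation AUC directly instead of reading it off the plot.
val_auc = roc_auc_score(
    remaining_data.get_label().astype('int'),
    model.predict_proba(remaining_data)[:, 1]
)
print('Adversarial AUC: %.3f' % val_auc)  # ~0.99 here; ~0.5 would indicate similar distributions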
By plotting the feature importances (here, SHAP values) we can identify which features the classifier relies on most.
def plot_features(model, remaining_data, features):
    # SHAP values from CatBoost; the last column holds the expected value.
    featurevalues = model.get_feature_importance(remaining_data, type='ShapValues')
    expected_value = featurevalues[0, -1]
    featurevalues = featurevalues[:, :-1]
    shap.summary_plot(featurevalues, remaining_data, feature_names=features, plot_type='bar')

plot_features(model, remaining_data, col_to_list)
With this information, we can go ahead and try dropping a few features and evaluating our model again. Our aim is to make it as difficult as possible for the adversarial classifier to distinguish between train and test rows.
Here, I have ignored two of the features ('kitch_sq' and 'full_sq') to see how that impacts our classification model.
metric1 = dict(metrics)
metric1.update({"ignored_features": ['kitch_sq', 'full_sq']})

model2 = CatBoostClassifier(**metric1)
_ = model2.fit(train_data, eval_set=remaining_data)

rocgraph(
    [remaining_data.get_label().astype('int')] * 2,
    [model.predict_proba(remaining_data)[:, 1],
     model2.predict_proba(remaining_data)[:, 1]],
    ['Baseline', 'dropped']
)
The AUC dropped to about 0.96. That is still a fairly strong classifier, but by applying the same idea to more features it is possible to push the value lower.
Though adversarial validation is a great way to detect a mismatch between the train and test distributions, it does not by itself fix that mismatch. What we can do is analyze the adversarial model by looking at its feature importances: the features it relies on most are the ones that differ between the two sets, so those are the first candidates to drop or transform.
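A hedged sketch of that idea, reusing the train_data and remaining_data Pools, the metrics dictionary and CatBoostClassifier from above (the round limit of five and the 0.6 stopping threshold are arbitrary choices, not part of the original article): repeatedly retrain while ignoring the strongest feature until the AUC approaches 0.5.

import numpy as np

ignored = []
for _ in range(5):  # at most five rounds of pruning
    clf = CatBoostClassifier(**metrics, ignored_features=list(ignored))
    clf.fit(train_data, eval_set=remaining_data, verbose=False)
    auc = roc_auc_score(remaining_data.get_label().astype('int'),
                        clf.predict_proba(remaining_data)[:, 1])
    print('ignored=%s AUC=%.3f' % (ignored, auc))
    if auc < 0.6:  # close enough to random guessing
        break
    # Ignore the single most informative feature and retrain.
    importances = clf.get_feature_importance()
    ignored.append(clf.feature_names_[int(np.argmax(importances))])

Each round removes the feature that most helps the adversarial classifier, so the AUC should fall; once it is near 0.5, the remaining features are distributed similarly in both sets.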
Conclusion
Adversarial validation can help identify the not-so-obvious reasons why a model performs well on the training data but poorly on the test data. Using this method, it is possible to build machine learning models that generalise better to real-world data, which is why it is so popular among Kaggle competitors. Its main drawback is that it only diagnoses the distribution mismatch; it does not yet offer a way to fix it.