There are many sources of overfitting, but an important one is when your training and test data do not come from the same distribution.
Unfortunately, this is not an uncommon problem. For example, training a model with data collected during a different period than the test or production data can lead to poor performance. Even slight differences can considerably affect your results, yet many people struggle to identify this issue and decide how to move forward.
That's where Adversarial Validation comes in.
Imagine you are training a model with data that significantly differs from the data used to evaluate the model's performance. You will not be surprised when your model makes wacky predictions! The model won't generalize to samples that are too far from the training data, so the results won't be good.
Fortunately, determining whether your splits come from the same distribution is not time-consuming; you can build a model to answer this question.
The intuition is simple: if all the data comes from the same distribution, this model will do no better than random guessing, but if there are differences between the training and test data, it will learn to tell them apart.
That's the fundamental idea behind Adversarial Validation.
To understand whether your training and test splits are in good shape, you can take your training data, remove the target variable, and mix it with the test data. Then, you create a new binary target that differentiates training samples from test samples.
After preparing this new dataset, you can build a classification model to determine how easy it is to differentiate training samples from test samples.
Using a ROC curve, you can measure how easily the binary model identifies the source of each sample. Ideally, you want the area under the curve (AUC) to be around 0.5, which means the model can't tell the two splits apart. As the AUC gets closer to 1.0, the model finds it easier to identify whether a particular instance comes from the training or the test set.
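Here is a minimal sketch of the whole process using pandas and scikit-learn. The `train` and `test` DataFrames below are synthetic stand-ins for your real splits, the `target` column name is an assumption, and the random forest is just an illustrative choice of classifier:

```python
# A minimal sketch of Adversarial Validation with pandas and scikit-learn.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for your real splits. One feature is deliberately
# shifted in the test set so the two distributions differ.
rng = np.random.default_rng(42)
train = pd.DataFrame({"feature_a": rng.normal(0.0, 1.0, 1000),
                      "feature_b": rng.normal(0.0, 1.0, 1000),
                      "target": rng.integers(0, 2, 1000)})
test = pd.DataFrame({"feature_a": rng.normal(1.0, 1.0, 500),
                     "feature_b": rng.normal(0.0, 1.0, 500)})

# Drop the original target, label each sample by its source split, and mix.
adversarial = pd.concat(
    [train.drop(columns=["target"]).assign(is_test=0),
     test.assign(is_test=1)],
    ignore_index=True,
)
X = adversarial.drop(columns=["is_test"])
y = adversarial["is_test"]

# Cross-validated ROC AUC: ~0.5 means the splits are indistinguishable;
# values near 1.0 mean the model can easily tell them apart.
model = RandomForestClassifier(n_estimators=100, random_state=42)
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"Adversarial validation AUC: {auc:.3f}")
```

Because of the shifted feature, this example will report an AUC well above 0.5; run it on splits that truly share a distribution and the score should hover near 0.5.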
Adversarial Validation will help you understand whether there's a problem with your training and test splits, but that's not all: it will also help you find out what features are causing the issue.
You can list every feature and its importance in predicting whether a sample belongs to the training or the test split. The more predictive power a feature has, the more likely it contributes to the drift between your training and test data.
You can use this list to investigate what specific characteristics in your data the model is using and find a way to eliminate them.
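Continuing the hedged sketch above (reusing the same `model`, `X`, and `y`), tree-based classifiers expose feature importances you can use to rank the likely culprits:

```python
# Continuing the sketch above: fit the adversarial model on the mixed
# dataset and rank features by how much they help separate the splits.
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
# In the synthetic example, `feature_a` (the shifted one) should dominate.
```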
One of the main benefits of Adversarial Validation is how simple it is to implement and how quickly you can use it to determine whether you have a problem and where you should look to fix it.
Adversarial Validation is a widespread technique in Kaggle competitions. It is, however, useful well beyond Kaggle, and it's an excellent approach for dealing with real-life situations where you suspect that drift is affecting your model.