Early Stopping

One of the most effective, easy-to-implement regularization techniques when training machine learning models.

Train a model for too long, and it will stop generalizing well. Don't train it long enough, and it won't learn the patterns in the data.

That's a critical tradeoff when building a machine learning model, and finding the perfect number of iterations is essential to achieving the results we expect.

Determining when to stop

We want to stop the training process as soon as the model starts overfitting.

A reliable way to detect when the model starts overfitting is to evaluate it on a separate validation set after each iteration. The performance on the training set will typically keep improving, but as soon as the model starts overfitting, the performance on the validation set will begin degrading.

A validation set and a specific metric to measure the model's performance are the keys to determining when to stop the training process. For example, the validation loss is an excellent metric to understand whether the model is progressing in the right direction. We can also compute and track any other metric that captures the model's performance.
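To make the pattern described above concrete, here is a minimal sketch in Python. It uses a deliberately tiny, hypothetical "model" (a single weight `w`) whose training data is best fit by `w = 3.0` while the validation data is best fit by `w = 2.0`. The numbers and function names are assumptions chosen for illustration; they don't come from a real dataset or library.

```python
# Toy setup (all values are illustrative assumptions): the training set is
# best fit by w = 3.0, while the validation set is best fit by w = 2.0.
TRAIN_OPTIMUM = 3.0
VALIDATION_OPTIMUM = 2.0


def train_one_iteration(w):
    """Move the weight halfway toward the training optimum."""
    return w + 0.5 * (TRAIN_OPTIMUM - w)


def training_loss(w):
    """Squared distance from the weight that best fits the training set."""
    return (w - TRAIN_OPTIMUM) ** 2


def validation_loss(w):
    """Squared distance from the weight that best fits the validation set."""
    return (w - VALIDATION_OPTIMUM) ** 2


def record_history(iterations, w=0.0):
    """Train for a fixed number of iterations, recording both losses."""
    history = []
    for iteration in range(1, iterations + 1):
        w = train_one_iteration(w)
        history.append((iteration, training_loss(w), validation_loss(w)))
    return history


if __name__ == "__main__":
    for iteration, train, val in record_history(6):
        print(f"iteration {iteration}: train={train:.4f}  validation={val:.4f}")
```

In this toy run, the training loss shrinks on every iteration, while the validation loss bottoms out at the second iteration and rises from there. That turning point is the signal we are looking for.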

Stopping the process manually

With a validation set and a metric measuring the model's performance, we can treat the number of iterations or epochs as another hyperparameter and evaluate different values to find what works best. The model with the best performance will determine the correct number of iterations we should use.

This approach works. However, it's inefficient and needs manual involvement. Finding the correct number of iterations will require multiple training rounds, which will take a long time. Also, if we change anything else on the model, we might need to repeat the process to find a new stopping point.
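The manual search can be sketched as follows. This is only an illustration: `validation_loss_after` is a hypothetical stand-in for a complete training run followed by an evaluation on the validation set, and the toy model inside it (a single weight) is an assumption made to keep the example self-contained.

```python
def validation_loss_after(iterations):
    """Train a fresh toy model for `iterations` steps and score it.

    Hypothetical stand-in for a real training run: a single weight `w`
    moves halfway toward 3.0 (the training optimum) on every step, while
    the validation loss is lowest at w = 2.0.
    """
    w = 0.0
    for _ in range(iterations):
        w += 0.5 * (3.0 - w)
    return (w - 2.0) ** 2


def search_iteration_count(candidates):
    """Try each candidate count and return the best one plus the total cost."""
    losses = {n: validation_loss_after(n) for n in candidates}
    best = min(losses, key=losses.get)
    total_iterations = sum(candidates)  # every candidate retrains from scratch
    return best, total_iterations


if __name__ == "__main__":
    best, cost = search_iteration_count([1, 2, 4, 8, 16])
    print(f"best iteration count: {best}, iterations trained in total: {cost}")
```

In this toy run, the search trains for 31 iterations in total to learn that 2 is the best of the five candidates, and the whole search would have to be repeated after any change to the model.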

Storing the best model

To avoid manually finding the best number of iterations, we could use a different approach: Train the model for a long time (longer than what we expect it to take) and store the best-performing model after each iteration using the chosen metric.

Let's say we think that overfitting will start happening around 50 iterations, but we don't know precisely when. We could train the model for 100 iterations and, after every iteration, save the model weights if they improve on the previously saved copy. At the end of this process, we will have the model with the best performance on the validation set.

This approach doesn't require manual involvement, but it can waste a lot of resources training the model for many iterations past the overfitting point. Since we don't know when to stop training, we need to overshoot to make sure the run includes the point where the model starts overfitting.
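A minimal sketch of this "train long, keep the best" loop might look like the code below. The function `train_and_keep_best`, the dictionary used as a model, and the two toy helpers are all hypothetical names introduced for illustration; a real project would save weights through its framework's own checkpointing mechanism.

```python
import copy


def train_and_keep_best(model, train_one_iteration, validation_loss, iterations):
    """Train for a fixed number of iterations, keeping the best snapshot."""
    best_loss = float("inf")
    best_model = None
    best_iteration = 0
    for iteration in range(1, iterations + 1):
        train_one_iteration(model)  # updates the model in place
        loss = validation_loss(model)
        if loss < best_loss:
            best_loss = loss
            best_iteration = iteration
            # Copy the weights, because the next iteration overwrites them.
            best_model = copy.deepcopy(model)
    return best_model, best_iteration, best_loss


# Toy stand-ins (assumptions for illustration): the training data pulls the
# single weight toward 3.0, while the validation loss is lowest at 2.0.
def toy_train(model):
    model["w"] += 0.5 * (3.0 - model["w"])


def toy_validation_loss(model):
    return (model["w"] - 2.0) ** 2


if __name__ == "__main__":
    best_model, best_iteration, best_loss = train_and_keep_best(
        {"w": 0.0}, toy_train, toy_validation_loss, iterations=100
    )
    print(f"best weights {best_model} came from iteration {best_iteration}")
```

In this toy run, the best weights are saved at iteration 2, yet the loop still runs all 100 iterations, which illustrates the waste described above.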

Stopping the process automatically

Instead of training the model for a predetermined number of iterations while we keep storing the best set of weights, we could automatically stop training as soon as we detect that the model has started overfitting.

For example, if the model's performance on the validation set degrades for a few consecutive iterations, we have probably passed the point of no return. At that point, we can safely stop the training process and use the best set of weights we previously stored.

The more noise we find in the training process, the longer we should wait for the model's performance to recover before deciding to stop training. Specific implementations of early stopping use the term "patience" to refer to how many iterations we are willing to wait without seeing an improvement before stopping the process.
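The stopping rule with patience can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of any particular library: `train_with_early_stopping`, its parameters, and the toy helpers are hypothetical names, and the model is a plain dictionary holding a single weight.

```python
import copy


def train_with_early_stopping(model, train_one_iteration, validation_loss,
                              max_iterations, patience):
    """Stop once `patience` iterations pass without a new best loss."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    best_iteration = 0
    stopped_at = 0
    for iteration in range(1, max_iterations + 1):
        stopped_at = iteration
        train_one_iteration(model)  # updates the model in place
        loss = validation_loss(model)
        if loss < best_loss:
            # New best: remember it and take a snapshot of the weights.
            best_loss = loss
            best_iteration = iteration
            best_model = copy.deepcopy(model)
        elif iteration - best_iteration >= patience:
            break  # no improvement for `patience` iterations in a row
    return best_model, best_iteration, stopped_at


# Toy stand-ins (assumptions for illustration): the training data pulls the
# single weight toward 3.0, while the validation loss is lowest at 2.0.
def toy_train(model):
    model["w"] += 0.5 * (3.0 - model["w"])


def toy_validation_loss(model):
    return (model["w"] - 2.0) ** 2


if __name__ == "__main__":
    best_model, best_iteration, stopped_at = train_with_early_stopping(
        {"w": 0.0}, toy_train, toy_validation_loss,
        max_iterations=100, patience=3,
    )
    print(f"stopped at iteration {stopped_at}; "
          f"best weights {best_model} from iteration {best_iteration}")
```

With the toy model and a patience of 3, the loop stops after 5 iterations instead of running all 100, and returns the weights saved at iteration 2. A larger patience tolerates noisier validation curves at the cost of a few extra iterations.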

Early stopping roundup

We need four components: a holdout validation set, a specific metric to measure the model's performance, a process to store the best weights of the model, and a trigger to automatically stop the training process as soon as the model stops improving. We call this technique "early stopping."

Early stopping is one of the most popular regularization techniques to train machine learning models. It's both easy to implement and very effective.

Fortunately, most major machine learning libraries support early stopping for many of the algorithms that train iteratively. We configure it, and the library does the work for us.
