A large part of the success of supervised machine learning systems comes from the availability of large quantities of labeled data. Unfortunately, in many cases, creating these labels is difficult, expensive, and time-consuming.
An obvious solution is to use machine learning to aid in the creation of the labels, but this presents a chicken-and-egg problem: how do we build a model to create labels before labeling our data to train that model?
Can we build a model using the fewest training labels we possibly can?
With active learning, we can build a model that will achieve better performance with fewer labeled samples by allowing the algorithm to choose the data that will provide the most information to its training process.
In other words, active learning will help us identify the most valuable samples from a dataset to build a model without having to create labels for the entire dataset.
The process starts by labeling a small percentage of the data to kick off the training of an initial model. We can select these samples randomly or strategically choose instances that we think will provide the most information to the model.
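To make that first step concrete, here is a minimal sketch in Python. The make_classification call merely simulates an unlabeled pool, and the held-out y_oracle stands in for the human annotator we will ask for labels later; in practice, both come from your own data and labeling process.

```python
import numpy as np
from sklearn.datasets import make_classification

# Simulated stand-in for a real unlabeled pool; in practice X_pool is your own
# data and y_oracle is the human annotator you ask for labels.
X_pool, y_oracle = make_classification(n_samples=5000, n_features=20, random_state=0)

# Label a small random seed set (here 2% of the pool) to train the initial model.
rng = np.random.default_rng(0)
seed_idx = rng.choice(len(X_pool), size=int(0.02 * len(X_pool)), replace=False)
labeled_idx = set(seed_idx.tolist())
```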
After training a model, we can use it to predict the labels for the rest of the dataset. We can then use these predictions and their associated confidence in two ways:

- We can add any samples the model predicted with high confidence to the labeled dataset. This step assumes we trust the model's predictions enough; a valid approach is to skip the first few iterations before automatically labeling data.
- We can request labels for any samples with low-confidence predictions. This step is crucial and is the engine that makes active learning work.
High-confidence samples aren't a challenge for our model, so the amount of information we would get from labeling them is low. Low-confidence samples indicate that our model needs help, so we'll focus our labeling efforts exclusively on them.
We can then use the improved labeled dataset to train a new model and repeat the process for as many rounds as necessary.
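Putting the whole loop together, here is a hedged sketch that continues from the simulated pool above. The 0.95 auto-labeling threshold, the batch of 100 queried samples per round, the five rounds, and the choice of LogisticRegression are all arbitrary choices for illustration rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

AUTO_LABEL_THRESHOLD = 0.95   # confidence needed to trust a prediction (arbitrary)
QUERY_BATCH_SIZE = 100        # low-confidence samples sent to annotators per round
N_ROUNDS = 5

y_labels = {i: y_oracle[i] for i in labeled_idx}   # labels gathered so far

for round_number in range(N_ROUNDS):
    # Train on everything labeled so far.
    train_idx = sorted(labeled_idx)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pool[train_idx], [y_labels[i] for i in train_idx])

    # Predict on the samples that are still unlabeled.
    unlabeled = np.array([i for i in range(len(X_pool)) if i not in labeled_idx])
    if len(unlabeled) == 0:
        break   # nothing left to label
    proba = model.predict_proba(X_pool[unlabeled])
    confidence = proba.max(axis=1)

    # 1) Keep high-confidence predictions as labels, skipping the first few
    #    rounds while the model is still too weak to trust.
    if round_number >= 2:
        confident = unlabeled[confidence >= AUTO_LABEL_THRESHOLD]
        if len(confident) > 0:
            for i, label in zip(confident.tolist(), model.predict(X_pool[confident])):
                y_labels[i] = label
            labeled_idx.update(confident.tolist())

    # 2) Request labels for the lowest-confidence samples -- the ones the
    #    model struggles with. Here y_oracle simulates the human annotator.
    query = unlabeled[np.argsort(confidence)[:QUERY_BATCH_SIZE]]
    for i in query.tolist():
        y_labels[i] = y_oracle[i]
    labeled_idx.update(query.tolist())
```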
Active learning allows us to build a better-performing machine learning model using fewer training labels by strategically choosing the samples to train the model.
The process above is an example of "pool-based active learning," in which we decide which samples to label from a pool of unlabeled data. It relies on an "uncertainty sampling" strategy, which uses the confidence of the model's predictions to determine which samples would improve the model the most if we labeled them.
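Uncertainty itself can be scored in several standard ways. Here is a small sketch of three common scores (least confidence, margin, and entropy), assuming proba is the (n_samples, n_classes) output of a classifier's predict_proba:

```python
import numpy as np

def least_confidence(proba):
    # Uncertainty grows as the probability of the most likely class shrinks.
    return 1.0 - proba.max(axis=1)

def margin_score(proba):
    # Gap between the two most likely classes; a small gap means the model
    # is torn between them, so smaller = more uncertain.
    top_two = np.sort(proba, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def entropy_score(proba):
    # Entropy of the predicted distribution; higher = more uncertain.
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Example: indices of the 100 samples each score would send to annotators.
# most_uncertain = np.argsort(least_confidence(proba))[-100:]
# most_uncertain = np.argsort(margin_score(proba))[:100]
# most_uncertain = np.argsort(entropy_score(proba))[-100:]
```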
Other approaches include "stream-based selective sampling" and "membership query synthesis." There are also different sampling strategies depending on the nature of the data, but the fundamental idea of active learning stays consistent.
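For contrast, here is a rough sketch of how stream-based selective sampling differs: samples arrive one at a time, and the learner decides on the spot whether to pay for a label. The model, stream, and oracle arguments and the 0.7 confidence threshold are illustrative assumptions, not prescribed values.

```python
def stream_selective_sampling(model, stream, oracle, threshold=0.7):
    """Decide, one sample at a time, whether to request a label.

    model     : a fitted classifier exposing predict_proba
    stream    : an iterable yielding one feature vector at a time
    oracle    : a callable returning the true label for a sample
    threshold : below this confidence we ask the oracle for a label
    """
    newly_labeled = []
    for x in stream:
        confidence = model.predict_proba([x]).max()
        if confidence < threshold:
            # The model is unsure, so this sample is worth paying to label.
            newly_labeled.append((x, oracle(x)))
        # Confident samples are simply passed over instead of being queried.
    return newly_labeled
```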