A large part of the success of supervised machine learning systems comes from the availability of large quantities of labeled data. Unfortunately, in many cases, creating these labels is difficult, expensive, and time-consuming.
An obvious solution is to use machine learning to aid in the creation of the labels, but this presents a chicken-and-egg problem: how do we build a model to create labels before labeling our data to train that model?
Can we build a model using the fewest training labels we possibly can?
With active learning, we can build a model that will achieve better performance with fewer labeled samples by allowing the algorithm to choose the data that will provide the most information to its training process.
In other words, active learning will help us identify the most valuable samples from a dataset to build a model without having to create labels for the entire dataset.
The process starts by labeling a small percentage of the data to kick off the training of an initial model. We can select these samples randomly or strategically choose instances that we think will provide the most information to the model.
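To make that first step concrete, here is a minimal sketch in Python. The make_classification call merely simulates an unlabeled pool, and the held-out y_oracle stands in for the human annotator we will ask for labels later; in practice, both come from your own data and labeling process.

```python
import numpy as np
from sklearn.datasets import make_classification

# Simulated stand-in for a real unlabeled pool; in practice X_pool is your own
# data and y_oracle is the human annotator you ask for labels.
X_pool, y_oracle = make_classification(n_samples=5000, n_features=20, random_state=0)

# Label a small random seed set (here 2% of the pool) to train the initial model.
rng = np.random.default_rng(0)
seed_idx = rng.choice(len(X_pool), size=int(0.02 * len(X_pool)), replace=False)
labeled_idx = set(seed_idx.tolist())
```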
After training a model, we can use it to predict the labels for the rest of the dataset. We can then use these predictions and their associated confidence in two ways:

- We can add any samples the model predicted with high confidence to the labeled dataset. This step assumes we trust the model's predictions enough; a valid approach is to skip the first few iterations before automatically labeling data.
- We can request labels for any samples with low-confidence predictions. This step is crucial and is the engine that makes active learning work.
High-confidence samples aren't a challenge for our model, so the amount of information we would get from labeling them is low. Low-confidence samples indicate that our model needs help, so we'll focus our labeling efforts exclusively on them.
We can then use the improved labeled dataset to train a new model and repeat the process for as many rounds as necessary.
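Putting the whole loop together, here is a hedged sketch that continues from the simulated pool above. The 0.95 auto-labeling threshold, the batch of 100 queried samples per round, the five rounds, and the choice of LogisticRegression are all arbitrary choices for illustration rather than recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

AUTO_LABEL_THRESHOLD = 0.95   # confidence needed to trust a prediction (arbitrary)
QUERY_BATCH_SIZE = 100        # low-confidence samples sent to annotators per round
N_ROUNDS = 5

y_labels = {i: y_oracle[i] for i in labeled_idx}   # labels gathered so far

for round_number in range(N_ROUNDS):
    # Train on everything labeled so far.
    train_idx = sorted(labeled_idx)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_pool[train_idx], [y_labels[i] for i in train_idx])

    # Predict on the samples that are still unlabeled.
    unlabeled = np.array([i for i in range(len(X_pool)) if i not in labeled_idx])
    if len(unlabeled) == 0:
        break   # nothing left to label
    proba = model.predict_proba(X_pool[unlabeled])
    confidence = proba.max(axis=1)

    # 1) Keep high-confidence predictions as labels, skipping the first few
    #    rounds while the model is still too weak to trust.
    if round_number >= 2:
        confident = unlabeled[confidence >= AUTO_LABEL_THRESHOLD]
        if len(confident) > 0:
            for i, label in zip(confident.tolist(), model.predict(X_pool[confident])):
                y_labels[i] = label
            labeled_idx.update(confident.tolist())

    # 2) Request labels for the lowest-confidence samples -- the ones the
    #    model struggles with. Here y_oracle simulates the human annotator.
    query = unlabeled[np.argsort(confidence)[:QUERY_BATCH_SIZE]]
    for i in query.tolist():
        y_labels[i] = y_oracle[i]
    labeled_idx.update(query.tolist())
```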
Active learning allows us to build a better-performing machine learning model using fewer training labels by strategically choosing the samples to train the model.
The process above is an example of "pool-based active learning," in which we decide which samples to label from a pool of unlabeled data. It relies on an "uncertainty sampling" strategy, which uses the confidence of the model's predictions to determine which samples would improve the model the most if we labeled them.
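Uncertainty itself can be scored in several standard ways. Here is a small sketch of three common scores (least confidence, margin, and entropy), assuming proba is the (n_samples, n_classes) output of a classifier's predict_proba:

```python
import numpy as np

def least_confidence(proba):
    # Uncertainty grows as the probability of the most likely class shrinks.
    return 1.0 - proba.max(axis=1)

def margin_score(proba):
    # Gap between the two most likely classes; a small gap means the model
    # is torn between them, so smaller = more uncertain.
    top_two = np.sort(proba, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def entropy_score(proba):
    # Entropy of the predicted distribution; higher = more uncertain.
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Example: indices of the 100 samples each score would send to annotators.
# most_uncertain = np.argsort(least_confidence(proba))[-100:]
# most_uncertain = np.argsort(margin_score(proba))[:100]
# most_uncertain = np.argsort(entropy_score(proba))[-100:]
```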
Other approaches include "stream-based selective sampling" and "membership query synthesis." There are also different sampling strategies depending on the nature of the data, but the fundamental idea of active learning stays consistent.
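For contrast, here is a rough sketch of how stream-based selective sampling differs: samples arrive one at a time, and the learner decides on the spot whether to pay for a label. The model, stream, and oracle arguments and the 0.7 confidence threshold are illustrative assumptions, not prescribed values.

```python
def stream_selective_sampling(model, stream, oracle, threshold=0.7):
    """Decide, one sample at a time, whether to request a label.

    model     : a fitted classifier exposing predict_proba
    stream    : an iterable yielding one feature vector at a time
    oracle    : a callable returning the true label for a sample
    threshold : below this confidence we ask the oracle for a label
    """
    newly_labeled = []
    for x in stream:
        confidence = model.predict_proba([x]).max()
        if confidence < threshold:
            # The model is unsure, so this sample is worth paying to label.
            newly_labeled.append((x, oracle(x)))
        # Confident samples are simply passed over instead of being queried.
    return newly_labeled
```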