\(\renewcommand{\vec}[1]{\mathbf{#1}}\)

\(\newcommand{\vecw}{\vec{w}}\) \(\newcommand{\vecx}{\vec{x}}\) \(\newcommand{\vecu}{\vec{u}}\) \(\newcommand{\veca}{\vec{a}}\) \(\newcommand{\vecb}{\vec{b}}\)

\(\newcommand{\vecwi}{\vecw^{(i)}}\) \(\newcommand{\vecwip}{\vecw^{(i+1)}}\) \(\newcommand{\vecwim}{\vecw^{(i-1)}}\) \(\newcommand{\norm}[1]{\lVert #1 \rVert}\)

Exploration 1.2: Basic Machine Learning Concepts

In this Exploration, we will explain some of the most important and useful concepts in machine learning.

Training, Test, and Generalization Errors

First, let us define the three types of datasets and the corresponding errors. The training set is the data the model is fit on, and the training error is the model's error on that data. The dev set (also called the validation set) is a small portion held out from the training data; the dev error measured on it is used during development to simulate the test error. The test set contains unseen data, and the test error on it approximates the generalization error, i.e., the expected error on new examples drawn from the same distribution.
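As a concrete illustration, here is a minimal Python sketch (my own, not part of this Exploration) that splits a synthetic dataset 80/10/10 and measures all three errors for a simple linear model. The data generator and the least-squares fit are arbitrary choices made purely for illustration:

```python
# A minimal sketch (illustrative assumptions, not from the course text)
# of the standard train / dev / test split and the three errors.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # hypothetical features
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)        # hypothetical noisy labels

# Shuffle the indices, then split: 80% train, 10% dev, 10% test.
idx = rng.permutation(len(X))
train, dev, test = idx[:800], idx[800:900], idx[900:]

# Fit a linear model on the training set only.
w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def mse(rows):
    """Mean squared error of the fitted model on a subset of the data."""
    return np.mean((X[rows] @ w - y[rows]) ** 2)

print(f"training error: {mse(train):.4f}")  # error on the data the model was fit on
print(f"dev error:      {mse(dev):.4f}")    # simulates the test error during development
print(f"test error:     {mse(test):.4f}")   # estimates the generalization error
```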

Underfitting and Overfitting

As mentioned above, ultimately we care about the test error, not the training error. As training progresses (e.g., over more and more passes through the training data, or using larger and larger models), we will see that the training error keeps decreasing. However, the test error (e.g., as simulated by the dev error) first decreases, but after a certain point starts to increase.

Overfitting occurs as a result of overtraining, where the training error keeps decreasing but the test error increases

This phenomenon is called “overfitting,” which means the model is starting to memorize the idiosyncrasies (such as outliers or noise) of the training set, rather than learning something fundamental about the prediction task. Overfitting is the single most common pitfall in machine learning.

The overfitted model follows the training data closely, but is unlikely to generalize well to unseen data
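To see these curves concretely, here is a small sketch (an illustration under my own assumptions, not the author's code): a high-degree polynomial is fit by gradient descent, and the training and dev errors are tracked so that the step with the lowest dev error can serve as a stopping point. The data generator, degree, learning rate, and step counts are all arbitrary:

```python
# A sketch (illustrative, not from the course text) of the typical
# train/dev error curves: the training error keeps falling while the
# dev error eventually bottoms out.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + 0.3 * rng.normal(size=n)  # noisy ground truth

x_tr, y_tr = make_data(20)      # small, noisy training set
x_dev, y_dev = make_data(200)   # held-out dev set

degree = 12
Phi_tr = np.vander(x_tr, degree + 1)    # polynomial features
Phi_dev = np.vander(x_dev, degree + 1)

w = np.zeros(degree + 1)
lr = 0.05
best_dev, best_step = np.inf, 0
for step in range(1, 20001):
    # Gradient of the mean squared error on the training set.
    grad = 2 * Phi_tr.T @ (Phi_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad
    if step % 2000 == 0:
        tr = np.mean((Phi_tr @ w - y_tr) ** 2)
        dv = np.mean((Phi_dev @ w - y_dev) ** 2)
        print(f"step {step:6d}  train MSE {tr:.4f}  dev MSE {dv:.4f}")
        if dv < best_dev:
            best_dev, best_step = dv, step
print(f"dev error was lowest around step {best_step}; "
      "stopping there avoids training into the overfit regime")
```

The exact curves depend on the random draw, but the training error decreases monotonically by construction, while the dev error is the one worth monitoring.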

Overfitting occurs not only due to overtraining, but also due to excessive model capacity. For example, in this curve-fitting example, if the data has some noise and you use a polynomial of very high degree, it will fit the training data very well (in the extreme case, with zero training error). But such a high-degree polynomial is very unlikely to capture the real distribution behind the training data, and therefore unlikely to generalize well.

In curve-fitting, a higher-order polynomial can always fit the training data better, but does not necessarily generalize better
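The following sketch makes this concrete (a hypothetical setup of my own, not from the Exploration): we fit polynomials of increasing degree to ten noisy samples of a sine curve and compare training and test errors. With ten points, a degree-9 polynomial interpolates them exactly, yet typically generalizes worst:

```python
# A minimal sketch (illustrative, not the author's example) of how higher
# polynomial degree drives training error to zero while test error worsens.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)  # noisy curve

x_tr, y_tr = sample(10)     # 10 noisy training points
x_te, y_te = sample(1000)   # large test sample for a stable error estimate

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")

# With 10 points, the degree-9 polynomial interpolates them exactly
# (train MSE ~ 0), yet its test MSE is far worse than degree 3's.
```

Degree 1 underfits (both errors high), degree 3 strikes a good balance, and degree 9 overfits, mirroring the figure above.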

On the opposite end, if the model capacity is too small or training is insufficient (e.g., too few iterations), we will see “underfitting”. But generally speaking, this problem is not as severe as overfitting.

Methods to Prevent Overfitting

There are three common ways to prevent overfitting:

Cross-Validation and Leave-One-Out

While it is common to set aside a small portion of the training data as the dev set, one might wonder: what if that portion is not very representative of the whole set? It would certainly be better to try every portion as a dev set, and this is the idea behind “cross-validation.” We divide the data into, say, \(N=10\) subsets, train on the first 9 subsets and test on the last one, then train on subsets 1, 2, …, 8 and 10 and test on subset 9, and so on. In other words, we always leave out one subset and train on the remaining \(N-1\) subsets. In the extreme case where \(N\) equals the number of training examples, each subset is a single example, and this is known as “leave-one-out” cross-validation. As \(N\) increases, the cross-validation error becomes the best approximation of the generalization error we can obtain from the data we have (assuming the training and test data follow the same distribution).

In cross-validation, we divide the data into \(N\) pieces, and iteratively train on \(N-1\) pieces and test on the remaining one
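Here is a minimal NumPy sketch of \(N\)-fold cross-validation (my illustration; the dataset and the degree-3 polynomial “model” are arbitrary stand-ins for whatever model is being evaluated):

```python
# A sketch (illustrative assumptions, not from the course text) of
# N-fold cross-validation with a degree-3 polynomial as the model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=100)  # hypothetical data

N = 10                                            # number of folds
folds = np.array_split(rng.permutation(len(x)), N)

errors = []
for i in range(N):
    test_idx = folds[i]                                    # the left-out fold
    train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # the other N-1 folds
    coeffs = np.polyfit(x[train_idx], y[train_idx], 3)
    errors.append(np.mean((np.polyval(coeffs, x[test_idx]) - y[test_idx]) ** 2))

print(f"{N}-fold cross-validation error: {np.mean(errors):.4f}")
# Leave-one-out is the special case N = len(x): each fold is a single example.
```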
