Exploration 3.2: Regularization

Cost Function

We introduced the mathematical setting of linear regression in the last section. Because a linear model is determined by its weight vector, we can think of it as a function of \(\mathbf{x}\) characterized by \(\mathbf{w}\):

\[ f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} \]

Given a training set \(D = \{(\mathbf{x}^{(i)}, y^{(i)}) \mid i = 1 \ldots |D|\}\), a set of training examples \(\mathbf{x}=(1, x_1, \ldots, x_d)\) and their corresponding labels \(y \in \mathbb{R}\), the goal is to fit a weight vector \(\mathbf{w}\) so that \(f_{\mathbf{w}}(\mathbf{x})\) is close to \(y\). To formalize this, we need a function that measures, for each choice of \(\mathbf{w}\), how close the predictions \(f_{\mathbf{w}}(\mathbf{x})\) are to the corresponding labels \(y\). We define the cost function:

\[ J(\mathbf{w}) = \frac{1}{2}\sum_{(\mathbf{x}, y)\in D} \big(f_{\mathbf{w}}(\mathbf{x})-y\big)^2 \]
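As a concrete illustration, here is a minimal NumPy sketch of this cost function; the toy data and variable names are made up for the example:

import numpy as np

# Toy training set: each row of X is an example (1, x_1, ..., x_d),
# with the leading 1 accounting for the bias term; y holds the labels.
X = np.array([[1.0, 0.5],
              [1.0, 1.5],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 4.0])

def predict(w, X):
    # f_w(x) = w . x for every example in X
    return X @ w

def cost(w, X, y):
    # J(w) = 1/2 * sum over D of (f_w(x) - y)^2
    residuals = predict(w, X) - y
    return 0.5 * np.sum(residuals ** 2)

w = np.array([0.5, 1.0])  # an arbitrary weight vector
print(cost(w, X, y))      # cost of this w on the toy training set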

Overfitting and Underfitting

In machine learning, the goal is to build models that can generalize well to new data. However, sometimes models can either be too simple or too complex, leading to underfitting or overfitting, respectively. Underfitting occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on the training data as well as on new data. On the other hand, overfitting occurs when a model is too complex and fits the training data too closely, resulting in excellent performance on the training data but poor performance on new data. Overfitting is a common problem in machine learning, especially when working with high-dimensional data or limited amounts of training data.
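To make this concrete, the following sketch (not from the original text; it uses NumPy's polyfit on synthetic data) fits polynomials of increasing degree to the same small, noisy sample and compares training and test errors. On a run like this, the low-degree fit underfits (both errors are high), while the high-degree fit overfits (training error is near zero but test error grows):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a smooth underlying function plus noise, with only
# a handful of training points so that overfitting is easy to provoke.
true_f = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 10)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 100)
y_test = true_f(x_test)

for degree in (0, 3, 9):
    # Fit a polynomial of the given degree by least squares.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")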

Regularization

Regularization is a technique that can be used to prevent overfitting in linear regression models and improve their generalization performance. It works by adding a penalty term to the cost function that the model is trying to minimize. The penalty term is based on the magnitude of the weights (coefficients) in the model. By adding this penalty, the model is encouraged to use smaller coefficients, which results in a simpler model that is less likely to overfit the data.

The most widely used type of regularization in linear regression is \(\ell_2\) regularization, which adds a penalty proportional to the square of the coefficients:

\[ J(\mathbf{w}) = \frac{1}{2}\sum_{(\mathbf{x}, y)\in D} \big(f_{\mathbf{w}}(\mathbf{x})-y\big)^2 + \lambda\Vert\mathbf{w}\Vert_2^2 \]

where \(\lambda \geq 0\) is a hyperparameter called the regularization strength, which determines how much weight is given to the penalty term relative to the original cost function.
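As an illustration, here is a small NumPy sketch of this \(\ell_2\)-regularized cost, together with the closed-form minimizer obtained by setting its gradient to zero (the function names and toy usage are made up for the example; note that in practice the bias weight is often left out of the penalty, while this sketch penalizes all weights exactly as in the formula above):

import numpy as np

def ridge_cost(w, X, y, lam):
    # J(w) = 1/2 * sum (w . x - y)^2 + lambda * ||w||_2^2
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2) + lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    # Gradient: X^T (X w - y) + 2*lambda*w = 0
    # => (X^T X + 2*lambda*I) w = X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)

# Toy usage with made-up data (the first column of ones is the bias feature).
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y = np.array([1.0, 2.0, 4.0])
w = ridge_fit(X, y, lam=0.1)
print(w, ridge_cost(w, X, y, lam=0.1))

Increasing \(\lambda\) shrinks the learned weights toward zero; with \(\lambda = 0\) the solution reduces to ordinary least squares.
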
Another common type of regularization is \(\ell_1\) regularization, which adds a penalty proportional to the absolute value of the coefficients:

\[ J(\mathbf{w}) = \frac{1}{2}\sum_{(\mathbf{x}, y)\in D} \big(f_{\mathbf{w}}(\mathbf{x})-y\big)^2 + \lambda\Vert\mathbf{w}\Vert_1 \]
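For comparison, here is a sketch of the \(\ell_1\)-regularized cost. Because the absolute value is not differentiable at zero, the simple subgradient-descent fitter below is only an illustrative stand-in (the learning rate, step count, and function names are assumptions for this example; practical solvers usually use coordinate descent, as in scikit-learn's Lasso):

import numpy as np

def lasso_cost(w, X, y, lam):
    # J(w) = 1/2 * sum (w . x - y)^2 + lambda * ||w||_1
    residuals = X @ w - y
    return 0.5 * np.sum(residuals ** 2) + lam * np.sum(np.abs(w))

def lasso_fit_subgradient(X, y, lam, lr=0.01, steps=5000):
    # Subgradient of the L1 term is sign(w), taken as 0 at w = 0.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * np.sign(w)
        w -= lr * grad
    return w

Unlike the \(\ell_2\) penalty, the \(\ell_1\) penalty tends to drive some coefficients exactly to zero, which yields sparse models.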

Overall, regularization is an important technique for improving the performance of linear regression models, especially when dealing with high-dimensional data or limited amounts of training data.

Videos