Exploration Page 4: Normal Equation

Gradient descent gives one way of minimizing \(J(\mathbf{w})\). In addition to this iterative algorithm, there is another approach that performs the minimization explicitly, in closed form: we take the derivatives of \(J(\mathbf{w})\) with respect to \(\mathbf{w}\) and set them to zero. To keep the algebra manageable, we use notation designed for calculus with matrices and vectors.

Matrix and Vector Representation

Using matrix and vector notation, we can express linear regression in a concise and elegant way. We represent the input data as an \(n \times m\) matrix \(\mathbf{X}\), where each row corresponds to a training example and each column corresponds to a feature, i.e., \(n=|D|\) and \(m=d+1\).

\[ \mathbf{X}=\left[\begin{array}{c} -\left(\mathbf{x}^{(1)}\right)^\top- \\ -\left(\mathbf{x}^{(2)}\right)^\top- \\ \vdots \\ -\left(\mathbf{x}^{(n)}\right)^\top- \end{array}\right] \]

The target values are represented as an \(n \times 1\) column vector \(\mathbf{y}\).

\[ \mathbf{y}=\left[\begin{array}{c} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{array}\right] \]

Using this notation, the cost function (without regularization) can be written compactly as

\[ J(\mathbf{w}) = \frac{1}{2} \Vert \mathbf{y}-\mathbf{X} \mathbf{w}\Vert_2^2 \]
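As a concrete illustration, the following NumPy sketch evaluates this cost on a tiny, hypothetical dataset; the column of ones prepended to the raw feature reflects the assumption (suggested by \(m=d+1\)) that a constant bias feature is included in \(\mathbf{X}\).

```python
import numpy as np

# Tiny hypothetical dataset: n = 4 examples, d = 1 raw feature.
raw = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Design matrix X: prepend a column of ones so that m = d + 1
# and the first entry of w acts as the intercept.
X = np.hstack([np.ones((raw.shape[0], 1)), raw])

def cost(w, X, y):
    """J(w) = 1/2 * ||y - X w||_2^2 (no regularization)."""
    r = y - X @ w
    return 0.5 * r @ r

w = np.zeros(X.shape[1])
print(cost(w, X, y))  # at w = 0 the cost equals 1/2 * ||y||^2
```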

Normal Equation

Consider the unconstrained optimization problem

\[ \min_{\mathbf{w}\in \mathbb{R}^m} J(\mathbf{w})\]

where \(J: \mathbb{R}^m \rightarrow \mathbb{R}\) is continuously differentiable. Suppose \(J\) is convex (we skip the definition here). Then a point \(\mathbf{w}^\star\) is an optimal solution if and only if \(\nabla J(\mathbf{w}^\star)=0\).

In fact, the cost function \(J(\mathbf{w})\) is a quadratic function whose Hessian \(\mathbf{X}^\top\mathbf{X}\) is positive semidefinite, so it is convex. Therefore, we can minimize \(J(\mathbf{w})\) by setting its gradient with respect to \(\mathbf{w}\) to \(0\). The gradient is

\[ \begin{aligned} \nabla_\mathbf{w} J(\mathbf{w}) & =\nabla_\mathbf{w} \frac{1}{2}(\mathbf{X} \mathbf{w}-\mathbf{y})^\top(\mathbf{X} \mathbf{w}-\mathbf{y}) \\ & =\frac{1}{2} \nabla_\mathbf{w}\left((\mathbf{X} \mathbf{w})^\top \mathbf{X} \mathbf{w}-(\mathbf{X} \mathbf{w})^\top \mathbf{y}-\mathbf{y}^\top(\mathbf{X} \mathbf{w})+\mathbf{y}^\top \mathbf{y}\right) \\ & =\frac{1}{2} \nabla_\mathbf{w}\left(\mathbf{w}^\top\left(\mathbf{X}^\top \mathbf{X}\right) \mathbf{w}-2\left(\mathbf{X}^\top \mathbf{y}\right)^\top \mathbf{w}+\mathbf{y}^\top \mathbf{y}\right) \\ & =\frac{1}{2}\left(2 \mathbf{X}^\top \mathbf{X} \mathbf{w}-2 \mathbf{X}^\top \mathbf{y}\right) \\ & =\mathbf{X}^\top \mathbf{X} \mathbf{w}-\mathbf{X}^\top \mathbf{y}, \end{aligned} \]

where we used \((\mathbf{X}\mathbf{w})^\top\mathbf{y} = \mathbf{y}^\top(\mathbf{X}\mathbf{w})\) (a scalar equals its own transpose) and the fact that the constant term \(\mathbf{y}^\top\mathbf{y}\) has zero gradient.
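As a quick sanity check, the derived expression can be compared against a finite-difference approximation of the gradient. The sketch below uses randomly generated, hypothetical data; both quantities should agree up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # hypothetical design matrix, n = 50, m = 3
y = rng.normal(size=50)
w = rng.normal(size=3)

# Analytic gradient derived above: X^T X w - X^T y.
grad = X.T @ X @ w - X.T @ y

def J(w):
    """Cost J(w) = 1/2 * ||y - X w||_2^2."""
    r = y - X @ w
    return 0.5 * r @ r

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([(J(w + eps * e) - J(w - eps * e)) / (2 * eps)
                    for e in np.eye(len(w))])

print(np.max(np.abs(grad - numeric)))  # should be close to zero
```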

Assume that \(\mathbf{X}\) has full column rank, so that \(\mathbf{X}^\top\mathbf{X}\) is nonsingular. Setting the gradient above to \(0\) and solving for \(\mathbf{w}\), we have

\[ \mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
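In practice it is usually better not to form the inverse explicitly; one can instead solve the linear system \(\mathbf{X}^\top\mathbf{X}\,\mathbf{w}=\mathbf{X}^\top\mathbf{y}\) or call a least-squares routine. A minimal sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))             # hypothetical design matrix, n = 200, m = 4
w_true = np.array([1.0, -2.0, 0.5, 3.0])  # ground-truth weights for this toy example
y = X @ w_true + 0.01 * rng.normal(size=200)

# Normal equation: solve (X^T X) w = X^T y rather than inverting X^T X.
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically more robust) least-squares solver.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_ne, w_ls))  # True: both approximately recover w_true
```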

\(\ell_2\) Regularization

If the cost function includes \(\ell_2\) regularization, a closed-form solution can still be obtained in the same way. The cost function becomes

\[ J(\mathbf{w}) = \frac{1}{2} \Vert \mathbf{y}-\mathbf{X} \mathbf{w}\Vert_2^2 + \frac{1}{2}\lambda \Vert \mathbf{w} \Vert_2^2 \]

The gradient is

\[ \begin{aligned} \nabla_\mathbf{w} J(\mathbf{w}) & =\mathbf{X}^\top \mathbf{X} \mathbf{w}-\mathbf{X}^\top \mathbf{y} + \lambda \mathbf{w} . \end{aligned} \]

Setting it to \(0\) gives the closed-form solution

\[ \mathbf{w} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}\]

where \(\mathbf{I}\) denotes the \(m \times m\) identity matrix. Note that for \(\lambda > 0\), the matrix \(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\) is positive definite and hence invertible, even when \(\mathbf{X}\) does not have full column rank.
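The regularized solution can be computed in the same way, again by solving a linear system instead of forming the inverse. A minimal sketch on hypothetical data (following the formula above, all entries of \(\mathbf{w}\) are regularized equally):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))   # hypothetical design matrix, n = 100, m = 5
y = rng.normal(size=100)
lam = 0.1                       # regularization strength lambda (hypothetical value)

m = X.shape[1]
# Ridge solution: w = (X^T X + lambda * I)^{-1} X^T y,
# computed by solving the corresponding linear system.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(w_ridge)
```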

Comparison with Gradient Descent

The normal equation provides a closed-form solution that directly computes the optimal model parameters by solving a system of linear equations. However, \(\mathbf{X}^\top\mathbf{X}\) is an \(m \times m\) matrix, so forming it costs \(O(nm^2)\) time and inverting it costs \(O(m^3)\) time, which becomes extremely slow or even infeasible when the number of features is large. For example, with \(m=100{,}000\) features, simply storing \(\mathbf{X}^\top\mathbf{X}\) in double precision requires about \(74\) GB of RAM. The gradient descent algorithm, on the other hand, needs very little memory and can handle large datasets, but it requires choosing an appropriate learning rate and number of iterations. Overall, the choice between the two methods depends on the specific problem and the dataset size.
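To make the comparison concrete, the sketch below fits a small synthetic dataset with both methods; the learning rate and iteration count for gradient descent are hand-picked for this toy problem and would need tuning on real data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))  # hypothetical design matrix, n = 1000, m = 10
w_true = rng.normal(size=10)
y = X @ w_true + 0.05 * rng.normal(size=1000)

# Closed-form solution via the normal equation.
w_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(w) = 1/2 * ||y - X w||^2.
w_gd = np.zeros(10)
lr = 5e-4          # learning rate, hand-picked for this toy problem
for _ in range(5000):
    grad = X.T @ (X @ w_gd - y)   # gradient derived earlier
    w_gd -= lr * grad

print(np.max(np.abs(w_gd - w_ne)))  # small if lr and iterations are adequate
```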