\(\renewcommand{\vec}[1]{\mathbf{#1}}\)

\(\newcommand{\vecw}{\vec{w}}\) \(\newcommand{\vecx}{\vec{x}}\) \(\newcommand{\vecy}{\vec{y}}\)

\(\newcommand{\sigmoidfnc}[1]{\frac{1}{1 + e^{-#1}}}\) \(\newcommand{\tanhfnc}[1]{\frac{e^{#1} - e^{-#1}}{e^{#1} + e^{-#1}}}\) \(\newcommand{\relufnc}[1]{\max(0, #1)}\)

\(\newcommand{\norm}[1]{\lVert #1 \rVert}\) \(\newcommand{\normone}[1]{\norm{#1}_1}\) \(\newcommand{\normtwo}[1]{\norm{#1}_2}\)

Exploration 4.1: Multilayer Neural Networks

Multilayer Neural Networks, also known as Artificial Neural Networks (ANNs) or Deep Neural Networks (DNNs), are a powerful class of machine learning algorithms that can learn complex patterns and representations from input data. They are a generalization of the single-layer perceptron that we studied earlier. In this exploration, we will cover the basics of multilayer neural networks and how they are constructed.

Multilayer Neural Network Architecture

A multilayer neural network consists of multiple layers of interconnected nodes or neurons. Each neuron computes a weighted sum of its input values and passes it through an activation function to produce an output value. The layers can be categorized into three types:

  1. Input Layer: The first layer that receives input data.
  2. Hidden Layers: The layers between the input and output layers. There can be multiple hidden layers in a neural network.
  3. Output Layer: The final layer that produces the output of the network.
Typical structure of a Multilayer Neural Network: This illustration depicts the standard structure of a feed-forward neural network with two hidden layers. The network begins with an input layer that receives the variables \(x_1, x_2, \ldots, x_n\). These inputs traverse the hidden layers, where the network’s learning and adjustments occur, and finally reach the output layer, yielding the results \(y_1, \ldots, y_m\). Each layer consists of interconnected neurons, each performing its own computation, which enables complex patterns and relationships in the data to be captured.
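To make the architecture concrete, here is a minimal NumPy sketch that allocates the parameters of a small fully connected network. The layer sizes (4 inputs, hidden layers of 8 and 6 neurons, 3 outputs) and the variable names are arbitrary choices for illustration, not values used elsewhere in this exploration.

```python
import numpy as np

# Hypothetical layer sizes: 4 input features, two hidden layers with 8 and 6
# neurons, and 3 output neurons. Each layer after the input owns a weight
# matrix and a bias vector connecting it to the previous layer.
layer_sizes = [4, 8, 6, 3]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((n_in, n_out)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

for i, (W, b) in enumerate(zip(weights, biases), start=1):
    print(f"layer {i}: weights {W.shape}, biases {b.shape}")
```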

Neurons and Activation Functions

A neuron in a neural network computes a weighted sum of its input values and applies an activation function to produce an output value. The activation function introduces non-linearity into the network, allowing it to learn complex relationships in the data. Common activation functions include:

  1. Sigmoid: \(\sigma(x) = \sigmoidfnc{x}\)
  2. Hyperbolic Tangent (tanh): \(\tanh(x) = \tanhfnc{x}\)
  3. Rectified Linear Unit (ReLU): \(\text{ReLU}(x) = \relufnc{x}\)
Varieties of Activation Functions: The illustration features three fundamental activation functions used in neural networks. (a) The sigmoid function, ranging between 0 and 1, is typically used for binary classification problems. (b) The tanh function, with its output range from -1 to 1, is zero-centered, which often accelerates convergence. (c) The ReLU (Rectified Linear Unit) function passes positive values through unchanged and outputs zero for negative inputs.
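For reference, here is one possible NumPy implementation of these three activation functions; the function names and the sample input are our own illustrative choices.

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)); squashes values into (-1, 1)
    return np.tanh(x)

def relu(x):
    # ReLU(x) = max(0, x); keeps positive values, zeroes out the rest
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```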

Forward Propagation

Forward propagation is the process of computing the output of a neural network given an input. The input values are passed through the network layer by layer, with each neuron computing its output value from the weighted sum of its inputs and applying the activation function. The output of the final layer is the output of the network. The forward propagation algorithm can be summarized as follows:

  1. For each neuron in the input layer, set the output equal to the input feature value.
  2. For each neuron in the hidden and output layers, calculate the weighted sum of the outputs from the previous layer and apply the activation function to obtain the output.
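
The two steps above can be written as a short loop over layers. The sketch below assumes the network is stored as lists of weight matrices and bias vectors (as in the earlier sketch) and, purely for simplicity, applies the sigmoid activation in every layer, including the output layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    # Propagate an input vector layer by layer: weighted sum, then activation.
    a = x
    for W, b in zip(weights, biases):
        z = a @ W + b      # weighted sum of the previous layer's outputs
        a = sigmoid(z)     # activation function (sigmoid here for every layer)
    return a               # output of the final layer

# Illustrative parameters, initialized as in the architecture sketch above.
rng = np.random.default_rng(0)
layer_sizes = [4, 8, 6, 3]
weights = [rng.standard_normal((n_in, n_out)) * 0.1
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.standard_normal(4)   # one example with 4 input features
print(forward(x, weights, biases))
```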

Loss Functions

In order to train a neural network, we need to define a loss function that quantifies the difference between the predicted output and the true output. The objective of training is to minimize this loss function. Some common loss functions include:

  1. Mean Squared Error (MSE) Loss: Used for regression tasks, this loss function calculates the squared difference between the predicted and true output values and averages it over all the examples. Mathematically, the MSE loss is defined as:

    \[\ell(\vecy, \hat{\vecy}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2\]

    where \(N\) is the number of examples, \(y_i\) is the true output, and \(\hat{y}_i\) is the predicted output.

  2. Cross-Entropy Loss: Used for classification tasks, this loss function calculates the negative log-likelihood of the true class for each example and averages it over all the examples. Mathematically, the cross-entropy loss for a single example is defined as:

    \[\ell(\vecy, \hat{\vecy}) = -\sum_{i=1}^C y_i \log(\hat{y}_i)\]

    where \(C\) is the number of classes, \(y_i\) is the true class label (one-hot encoded), and \(\hat{y}_i\) is the predicted probability for class \(i\).
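Both losses are straightforward to express in NumPy. In the sketch below, `y_true` holds one-hot labels, `y_pred` holds predicted class probabilities, and the small `eps` clip is our own guard against taking \(\log(0)\); it is not part of the definition above.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean squared error: (1/N) * sum_i (y_i - y_hat_i)^2
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # Average negative log-likelihood of the true class over the batch.
    y_pred = np.clip(y_pred, eps, 1.0)                  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # predicted probabilities
print(mse_loss(np.array([1.0, 2.0]), np.array([0.9, 2.3])))
print(cross_entropy_loss(y_true, y_pred))
```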

Backward Propagation

Backward propagation, also known as backpropagation, is an algorithm used to train multilayer neural networks. It involves computing the gradient of the loss function with respect to each weight by applying the chain rule. The gradient is then used to update the weights in the network to minimize the loss function. The backpropagation algorithm can be summarized as follows:

  1. For each neuron in the output layer, compute the gradient of the loss with respect to its output.
  2. For each neuron in the hidden layers, compute the gradient of the loss with respect to its output using the chain rule and the gradients computed for the neurons in the next layer.
  3. Update the weights using the computed gradients and a learning rate.
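
The following sketch carries out these three steps for a network with one hidden layer, using a sigmoid hidden activation, a linear output, and the MSE loss. The function name `backprop_step`, the shapes, and the learning rate are illustrative choices; a practical implementation would batch examples and handle arbitrary depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward pass: one sigmoid hidden layer, then a linear output layer.
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2

    # Backward pass: chain rule from the MSE loss back to each weight.
    d_yhat = 2.0 * (y_hat - y) / y.size   # dL/dy_hat for L = mean((y_hat - y)^2)
    dW2 = np.outer(a1, d_yhat)            # gradient w.r.t. output-layer weights
    db2 = d_yhat
    d_a1 = d_yhat @ W2.T                  # gradient flowing into the hidden layer
    d_z1 = d_a1 * a1 * (1.0 - a1)         # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = np.outer(x, d_z1)               # gradient w.r.t. hidden-layer weights
    db1 = d_z1

    # Gradient-descent update (the parameter arrays are modified in place).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return np.mean((y_hat - y) ** 2)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 3)) * 0.1, np.zeros(3)
x, y = rng.standard_normal(4), np.array([1.0, 0.0, 0.0])
for _ in range(100):
    loss = backprop_step(x, y, W1, b1, W2, b2)
print(loss)   # the loss on this single example shrinks over the 100 updates
```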

Training a Multilayer Neural Network

Training a multilayer neural network involves the following steps:

  1. Initialize the weights and biases of the network.
  2. Perform forward propagation to compute the output of the network.
  3. Calculate the loss function based on the predicted output and true target values.
  4. Perform backward propagation to compute the gradient of the loss function with respect to the weights.
  5. Update the weights using the computed gradient and a learning rate.
  6. Repeat steps 2-5 for multiple epochs or until a stopping criterion is met.
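
Putting the steps together, the sketch below trains a one-hidden-layer network on the classic XOR problem (which a single-layer perceptron cannot solve), using sigmoid activations, the MSE loss, and full-batch gradient descent. The hidden size, learning rate, and epoch count are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dataset: the XOR function on two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialize the weights and biases.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):              # Step 6: repeat for many epochs
    # Step 2: forward propagation.
    A1 = sigmoid(X @ W1 + b1)
    Y_hat = sigmoid(A1 @ W2 + b2)

    # Step 3: mean squared error loss.
    loss = np.mean((Y_hat - Y) ** 2)

    # Step 4: backward propagation (chain rule through both layers).
    dY_hat = 2.0 * (Y_hat - Y) / Y.size
    dZ2 = dY_hat * Y_hat * (1.0 - Y_hat)
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1.0 - A1)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Step 5: gradient-descent weight update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)
print(Y_hat.round(2))   # ideally close to [0, 1, 1, 0]
```

Because the entire (tiny) dataset is processed at once, this is full-batch gradient descent; for larger datasets the examples are usually split into mini-batches, and one epoch covers every batch once.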

Overfitting and Regularization

Overfitting occurs when a neural network learns the noise in the training data instead of the underlying patterns. This can lead to poor generalization to new, unseen data. Regularization techniques can be used to prevent overfitting. Common regularization methods include \(\ell_1\) and \(\ell_2\) regularization, dropout, and early stopping.

  1. \(\ell_1\) and \(\ell_2\) Regularization: These techniques add a penalty term to the loss function based on the magnitude of the weights. \(\ell_1\) regularization adds the sum of the absolute values of the weights, while \(\ell_2\) regularization adds the sum of the squared values of the weights. This encourages the model to have smaller weights, making it less likely to overfit. Mathematically, the loss function with \(\ell_1\) or \(\ell_2\) regularization is defined as:

    \[\ell_{\text{reg}}(\vecy, \hat{\vecy}) = \ell(\vecy, \hat{\vecy}) + \lambda \normone{\vecw} \quad \text{or} \quad \ell_{\text{reg}}(\vecy, \hat{\vecy}) = \ell(\vecy, \hat{\vecy}) + \lambda \normtwo{\vecw}^2\]

    where \(\ell(\vecy, \hat{\vecy})\) is the original loss function, \(\vecw\) is the vector of the model's weights, and \(\lambda\) is the regularization parameter controlling the strength of the penalty (a code sketch of this and the following techniques appears after this list).

  2. Dropout: Dropout is a regularization technique that involves randomly “dropping out” (setting to zero) a fraction of the neurons in a layer during training. This prevents the model from relying too much on any single neuron and encourages it to learn more robust representations. Dropout is applied only during training; at inference all neurons are used, with their outputs (or weights) scaled by the keep probability \(1 - p\), where \(p\) is the dropout rate. Equivalently, in the common “inverted dropout” variant, the surviving activations are scaled up by \(1 / (1 - p)\) during training, so no scaling is needed at inference.

  3. Early Stopping: Early stopping involves monitoring the performance of the model on a validation set during training and stopping the training process when the performance on the validation set starts to degrade, indicating that the model is starting to overfit. This technique helps to find the optimal point in training where the model has the best generalization performance.
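
Below is a minimal NumPy sketch of the three techniques: penalty terms that can be added to the data loss, an (inverted) dropout mask applied to activations during training, and a patience-based early-stopping check. The function names and the `patience` rule are our own illustrative choices.

```python
import numpy as np

def l1_penalty(weights, lam):
    # lambda * sum of |w| over all weight matrices (L1 regularization term)
    return lam * sum(np.sum(np.abs(W)) for W in weights)

def l2_penalty(weights, lam):
    # lambda * sum of w^2 over all weight matrices (L2 regularization term)
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout(a, rate, training=True):
    # Inverted dropout: during training, zero out a fraction `rate` of the
    # activations and rescale the survivors, so inference needs no extra scaling.
    if not training or rate == 0.0:
        return a
    mask = np.random.default_rng().random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def should_stop(val_losses, patience=10):
    # Early stopping: stop once the best validation loss has not improved
    # for `patience` consecutive epochs.
    best_epoch = int(np.argmin(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience

weights = [np.ones((2, 3)), np.ones((3, 1))]
print(l2_penalty(weights, lam=0.01))                          # 0.01 * (6 + 3) = 0.09
print(dropout(np.ones(6), rate=0.5).round(1))                 # surviving entries become 2.0
print(should_stop([0.9, 0.5, 0.4, 0.41, 0.42], patience=2))   # True
```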

By applying one or a combination of these regularization techniques, overfitting can be mitigated, leading to better generalization of the neural network to new, unseen data.