Exploration 1.3: Nearest Neighbor Classifier

Introduction

The nearest neighbor classifier is a simple machine learning algorithm that is often used for classification tasks. It works by finding the “nearest” training example (i.e., the one with the smallest distance) to a new, unlabeled example in the feature space, and assigning the label of that nearest example to the new example.

In other words, the nearest neighbor classifier makes predictions based on the similarity of a new data point to the data points in the training set. The algorithm is called “lazy” because it does not actually learn a model from the training data; instead, it simply memorizes the training examples and uses them to make predictions at test time.
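To make the “memorize and compare” idea concrete, here is a minimal numpy sketch of a 1-nearest-neighbor prediction (my own illustration; the function name and the toy data are made up, not part of the course materials):

```python
import numpy as np

def predict_1nn(X_train, y_train, x_test):
    """Return the label of the training example closest to x_test (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training example
    return y_train[np.argmin(dists)]                   # copy the label of the nearest one

# Toy usage: two classes of 2-D points; "training" is just storing the arrays.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 1, 1])
print(predict_1nn(X_train, y_train, np.array([0.8, 0.9])))   # -> 1
```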

k-nearest neighbor classification example. The green circle can be classified as red for k=3 or blue for k=5.

The choice of distance metric is critical in nearest neighbor classification, as it determines how the algorithm defines similarity between examples. Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity.

Despite its simplicity, the nearest neighbor classifier can be surprisingly effective, especially for low-dimensional data and in cases where the decision boundary is highly irregular. However, it can be computationally expensive for large datasets, and may suffer from the “curse of dimensionality” when the feature space is high-dimensional.

The choice of distance metric

There are two commonly used distance metrics: the Euclidean distance and the Manhattan distance. In fact, both are special cases of the \(\ell_p\)-norm, or Minkowski distance. In a \(d\)-dimensional space, given two vectors \(x=(x_1, \ldots, x_d)\) and \(y=(y_1, \ldots, y_d)\), their \(\ell_p\)-norm or Minkowski distance is defined as:

\[ \ell_p (x, y) = \big[ |x_1 - y_1| ^p + \ldots + |x_d - y_d| ^p \big] ^{1/p} \]
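As a quick sanity check, this general \(\ell_p\) distance takes only a couple of lines of numpy (a minimal sketch; the function name `minkowski` is my own choice):

```python
import numpy as np

def minkowski(x, y, p):
    """General ell_p (Minkowski) distance between two 1-D vectors."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Quick check against numpy's built-in Euclidean norm (the p = 2 case).
x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 2.0, 1.0])
print(minkowski(x, y, 2), np.linalg.norm(x - y))   # both are about 3.606
```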

Now the Euclidean distance is simply the \(\ell_2\)-norm:

\[ \ell_2 (x, y) = \sqrt{|x_1 - y_1| ^2 + \ldots + |x_d - y_d| ^2} \]

and the Manhattan distance is simply the \(\ell_1\)-norm:

\[ \ell_1 (x, y) = |x_1 - y_1| + \ldots + |x_d - y_d|\]

As two extreme cases, we can also use the \(\ell_\infty\) norm, also known as the Chebyshev norm:

\[ \ell_\infty (x, y) = \max\{|x_1 - y_1|, \ldots, |x_d - y_d|\}\]

or the \(\ell_0\) “norm”, which simply counts the number of dimensions in which the two vectors differ (with the convention \(0^0 = 0\)):

\[ \ell_0 (x, y) = |x_1 - y_1|^0 + \ldots + |x_d - y_d|^0\]
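These special cases are easy to compute directly with numpy; the snippet below (a sketch with made-up example vectors) shows all four on the same pair of vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 2.0, 1.0])
d = x - y                               # elementwise differences: [-3, 0, 2]

print(np.linalg.norm(d, ord=2))         # Euclidean:  sqrt(9 + 0 + 4) ~ 3.606
print(np.linalg.norm(d, ord=1))         # Manhattan:  3 + 0 + 2 = 5
print(np.linalg.norm(d, ord=np.inf))    # Chebyshev (ell_inf): max(3, 0, 2) = 3
print(np.count_nonzero(d))              # ell_0: number of differing dimensions = 2
```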

Here is a visualization of the Manhattan and Euclidean distances. You can think of the Manhattan distance as the walking distance along Manhattan's street grid, which is always greater than or equal to the Euclidean distance (the “flying distance”).

Manhattan vs. Euclidean Distances

The choice of \(k\)

In practice, to be more robust, we often take a majority vote over the labels of the \(k\) nearest neighbors to predict the label of a test example.

For example, in the following figure, if \(k=1\), the prediction will be red; with \(k=3\) it will still be red; but with \(k=5\) it will become blue.

k-NN example with k=3 and k=5
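A minimal numpy sketch of this majority-vote rule (my own illustration; the function and variable names are not from the course materials):

```python
import numpy as np

def predict_knn(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_test, axis=1)   # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # most frequent label among the neighbors

# Toy usage mirroring the figure: the prediction flips as k grows.
X_train = np.array([[0.1, 0.0], [0.0, 0.1], [0.5, 0.5], [0.6, 0.4], [0.4, 0.6]])
y_train = np.array(["red", "red", "blue", "blue", "blue"])
x_test = np.array([0.15, 0.15])
print(predict_knn(X_train, y_train, x_test, k=3))   # "red"  (2 red vs. 1 blue)
print(predict_knn(X_train, y_train, x_test, k=5))   # "blue" (2 red vs. 3 blue)
```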

Note that for two-class problems, \(k\) is usually chosen to be an odd number so that the vote cannot end in a tie. But how do we choose the best value of \(k\)? We tune it on a held-out development (dev) set.
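Continuing the sketch above, the tuning loop is just a few more lines (again a sketch, not the official implementation; `X_dev` and `y_dev` are assumed to hold the dev examples and their labels):

```python
# Assumes predict_knn, X_train, and y_train from the previous sketch,
# plus hypothetical dev-set arrays X_dev and y_dev.
import numpy as np

def dev_accuracy(k):
    preds = np.array([predict_knn(X_train, y_train, x, k=k) for x in X_dev])
    return np.mean(preds == y_dev)

# Keep the odd k with the highest accuracy on the dev set.
best_k = max([1, 3, 5, 7, 9], key=dev_accuracy)
```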

The Decision Boundaries

The decision regions of 1-NN form a Voronoi diagram, a famous concept in computational geometry: each training point owns the cell of the feature space that lies closer to it than to any other training point, and the decision boundary runs along the cell edges between points with different labels. Here we showcase the Voronoi diagrams for the Euclidean (top) and Manhattan (bottom) distances:

Example of Voronoi Diagram for Euclidean Distances
Example of Voronoi Diagram for Manhattan Distances
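For the curious, a picture like these can be reproduced with a short numpy/matplotlib sketch (my own toy example with random points) that colors a dense grid of query locations by their 1-NN prediction:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(10, 2))     # 10 random 2-D training points
y = rng.integers(0, 3, size=10)         # 3 arbitrary class labels

# Evaluate the 1-NN (Euclidean) prediction on a dense grid of query points.
xx, yy = np.meshgrid(np.linspace(0, 1, 300), np.linspace(0, 1, 300))
grid = np.stack([xx.ravel(), yy.ravel()], axis=1)
dists = np.linalg.norm(grid[:, None, :] - X[None, :, :], axis=2)  # shape (90000, 10)
preds = y[dists.argmin(axis=1)].reshape(xx.shape)

# Each colored region is the union of the Voronoi cells owned by one class.
plt.pcolormesh(xx, yy, preds, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.title("1-NN decision regions (Euclidean distance)")
plt.show()
```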

Self-Check Exercises

Quick self-check questions:

  1. The two extreme cases for the choice of \(k\) are: \(k=1\) and \(k=\infty\). Which one is extreme overfitting and which one is extreme underfitting? What does \(k=\infty\) do? Explain.
  2. How does the Voronoi Diagram change when \(k\) increases from 1 to \(\infty\)?
  3. Is there a “training” component of \(k\)-NN? If not, is that odd?
  4. How can \(k\)-NN be implemented efficiently using numpy?

Videos