Curse of dimensionality
The headline failure mode of KNN (and many distance-based methods): in high dimensions, all pairwise distances become nearly equal, so “nearest neighbor” stops carrying information. The prof: “the dimensionality is a curse, and in this case, it’s a curse for K-nearest neighbours… makes it useless.”
Definition (prof’s framing)
“If there’s a number of dimensions and you look at the average distance between points, of course as you add more dimensions that average distance is going to be bigger. But the weird thing is that the variance, like how much those distances vary just by chance, gets smaller and smaller. … So that in this million-dimensional space, the points that are close together have almost the same distance value as points that are far away.” - L08-classif-2
“Because the nearest neighbors tend to be far away in high dimensions and the method is no longer local. This is referred to as the curse of dimensionality.” - slide deck
Notation & setup
- p: number of predictors / dimensions.
- n: number of observations.
- The “curse” kicks in when p grows large relative to n: there isn’t enough data to fill the space.
What goes wrong (mechanically)
Pick n points uniformly in the unit hypercube [0, 1]^p. As p grows:
- The volume of any local neighborhood shrinks rapidly. To enclose a fixed fraction of the data, the neighborhood radius must grow toward the full range of each predictor (made concrete just after this list). So “local” stops meaning local.
- The ratio of nearest-to-farthest pairwise distance approaches 1. The variance of pairwise distances shrinks relative to the mean: every point is roughly equally far from every other. Distance loses discriminating power.
- Most of the volume of the hypercube concentrates near the boundary. The vast majority of points sit close to the “edges” rather than in the interior, so KNN’s nearest-neighbor estimates end up extrapolating rather than interpolating.
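To make the first bullet concrete, a standard back-of-envelope calculation (not taken from the slides; ℓ is just my label for the required side length): for points uniform on the unit hypercube, a sub-cube that captures a fraction f of the data needs side length f^(1/p).

$$
\ell(f, p) = f^{1/p}, \qquad \ell(0.1,\ 2) \approx 0.32, \quad \ell(0.1,\ 10) \approx 0.79, \quad \ell(0.1,\ 100) \approx 0.98 .
$$

So in 100 dimensions, a “neighborhood” holding just 10% of the data spans nearly the entire range of every predictor.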
The prof’s recommendation: simulate it yourself. Sample n uniform points in p dimensions and plot the distribution of pairwise distances as p grows. The collapse onto a single mean value is striking.
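A minimal version of that simulation, assuming NumPy (my own sketch, not course code): sample n uniform points in p dimensions and summarize the pairwise Euclidean distances. The std/mean ratio and the gap between nearest and farthest pairs both shrink as p grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_summary(n=200, p=2):
    """Pairwise Euclidean distances among n uniform points in [0, 1]^p."""
    X = rng.uniform(size=(n, p))
    # squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b
    sq = np.sum(X**2, axis=1)
    d = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0, None))
    d = d[np.triu_indices(n, k=1)]        # keep each pair once
    return d.mean(), d.std(), d.min() / d.max()

for p in [2, 10, 100, 1000]:
    mean, std, ratio = distance_summary(p=p)
    print(f"p={p:5d}  mean={mean:7.2f}  std/mean={std/mean:.3f}  min/max={ratio:.3f}")
```

Plotting a histogram of the distances for each p shows the collapse onto the mean directly.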
Insights & mental models
- Distance-based methods take the biggest hit. knn-classification, k-means clustering with Euclidean distance, kernel methods. Anything where “the closest point” is the answer.
- Parametric methods often survive better. logistic-regression, linear-discriminant-analysis: they make stronger assumptions, so they don’t have to estimate dense local neighborhoods.
- LDA actually thrives in high p. “Mapping it down to a dimension specifically to separate out these categories.” - L09-classif-3. K classes → at most K − 1 discriminant scores → the fit is confined to the relevant low-dimensional subspace.
- The fixes are all “use lower dimensions”:
- Dimensionality reduction: PCA, PCR, LDA-as-projection.
- Regularization: ridge, lasso; shrink the fit toward a low-dimensional effective subspace.
- Restrictive parametric models: naive-bayes (assume diagonal Σ), simpler families.
- Feature selection: drop irrelevant predictors before running KNN (demonstrated in the sketch after this list).
- High-dim multicollinearity is the regression-side analog. When p is large, predictors are increasingly likely to be near-collinear by chance, and XᵀX gets ill-conditioned even before p approaches n. See collinearity / high-dimensional-regression.
- The fix isn’t always more data. The ideal is n much larger than p, but in genomics (p in the tens of thousands) you can’t make n keep up. Methods need to assume structure.
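A sketch of the “biggest hit” claim and the feature-selection fix, using scikit-learn (my own toy example, not from the course; the class separation of 3.0 and k = 5 are arbitrary choices): one informative column, an increasing number of pure-noise columns, and KNN fit on all columns vs. on the signal column alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

def make_data(n=600, n_noise=0):
    """Two classes separated along one signal column, plus irrelevant noise columns."""
    y = rng.integers(0, 2, size=n)
    signal = (3.0 * y + rng.normal(size=n)).reshape(-1, 1)   # informative column
    noise = rng.normal(size=(n, n_noise))                    # pure-noise columns
    return np.hstack([signal, noise]), y

for n_noise in [0, 10, 100, 1000]:
    X, y = make_data(n_noise=n_noise)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5)
    acc_all = knn.fit(X_tr, y_tr).score(X_te, y_te)                  # every column
    acc_sig = knn.fit(X_tr[:, :1], y_tr).score(X_te[:, :1], y_te)    # signal only
    print(f"noise dims={n_noise:5d}  KNN(all)={acc_all:.2f}  KNN(signal only)={acc_sig:.2f}")
```

KNN on the signal column alone stays near the Bayes rate in every row, while KNN on all columns drifts toward coin-flipping as noise dimensions pile up: distance loses its discriminating power.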
Exam signals
“The dimensionality is a curse, and in this case, it’s a curse for K-nearest neighbours. … makes it useless.” - L08-classif-2
“Curse of dimensionality. Euclidean distance in high-dimensional spaces becomes large for everyone; this doesn’t work too well if we have many, many covariates.” - L07-classif-1
“The effectiveness of the KNN classifier falls quickly when the dimension of the predictor space is high.” - slide deck
The prof returned to it three times across module 4 and module 6 (high-dimensional-regression). The KNN-killing role is the most flagged framing.
Pitfalls
- Treating “high p” as a fixed cutoff. It’s relative to n. With modest p and a huge n, KNN can still work; with large p and small n, it can’t.
- Standardization isn’t a fix. Standardizing puts every dim on the same scale, which is necessary, but doesn’t change the underlying geometric concentration (see the check after this list).
- Switching to a different distance (cosine, Manhattan) helps marginally. Prof: “Cosine distance might help in some cases, something you could play with that might be fun.” But the underlying curse persists; you’ve just changed the rate.
- Confusing curse-of-dimensionality with multicollinearity. Related, but distinct. Multicollinearity is about correlated predictors; CoD is about geometric distance concentration. In high p, both happen and reinforce each other.
- Believing “all methods break in high p.” The lasso, LDA, and PCA-based methods are designed exactly for this regime.
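A quick check of the standardization pitfall, again a sketch in NumPy/scikit-learn rather than course code: standardize each column of high-dimensional uniform data and compare the concentration ratio std/mean of the pairwise distances before and after. In runs like this the two ratios come out essentially identical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 300))              # n = 200 points in p = 300 dimensions
Xs = StandardScaler().fit_transform(X)        # mean 0, variance 1 per column

def concentration(Z):
    """std/mean of all pairwise Euclidean distances (smaller = more concentrated)."""
    sq = np.sum(Z**2, axis=1)
    d = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0, None))
    d = d[np.triu_indices(len(Z), k=1)]
    return d.std() / d.mean()

print("raw data:         ", round(concentration(X), 3))
print("standardized data:", round(concentration(Xs), 3))
```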
Scope vs ISLP
- In scope: Definition (geometric concentration of distances), the KNN failure mode, why parametric methods are more robust, what the standard fixes are (dimensionality reduction, regularization).
- Look up in ISLP: §4.5.1, p. 161 (KNN-CoD discussion); §6.4 for the full high-dim regression story (module 6 territory).
- Skip in ISLP: Detailed concentration-of-measure proofs (measure-theoretic). Specific bounds on the rate of collapse.
Exercise instances
None: no exercise problem drills the curse directly. The concept appears as scaffolding inside knn-classification (“why does KNN fail in high p?”), as motivation for PCR / ridge / lasso in module 6, and as motivation for naive-bayes (diagonal Σ when p is large).
How it might appear on the exam
- Conceptual MCQ: “Which method is most affected by the curse of dimensionality?” → KNN. (LDA/QDA/logistic less so; PCA/lasso designed for it.)
- T/F: “Standardizing predictors removes the curse of dimensionality” → false (helps with comparability, but the geometry persists).
- Method-comparison: “For a problem with p much larger than n, which classifier(s) would you avoid?” → KNN; consider LDA, naive Bayes, logistic with regularization.
- Why naive Bayes wins in high p: the parameter count grows linearly in p rather than quadratically, dodging the variance explosion (see the quick count after this list).
- Why LDA survives: it projects to at most K − 1 dimensions and exploits the low-dimensional structure of the class means.
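A quick parameter count behind the naive Bayes and LDA bullets (Gaussian class-conditional models with K classes and p predictors; a back-of-envelope, not from the slides):

$$
\text{QDA: } K \cdot \tfrac{p(p+1)}{2} \text{ covariance parameters}, \qquad
\text{LDA: } \tfrac{p(p+1)}{2} \text{ (one shared } \Sigma), \qquad
\text{naive Bayes: } K \cdot p \text{ variances (diagonal } \Sigma).
$$

Quadratic versus linear growth in p is exactly the variance explosion the naive Bayes assumption dodges.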
Related
- knn-classification: the headline victim.
- linear-discriminant-analysis: the headline survivor (dimensionality-reduction method).
- dimensionality-reduction: the umbrella concept of mapping many predictors down to a few derived dimensions.
- principal-component-analysis: the standard unsupervised reduction.
- principal-component-regression: the regression-side application.
- high-dimensional-regression: the companion in module 6.
- naive-bayes: large-p rescue via the diagonal-Σ assumption.
- collinearity: the regression-side pathology that high p amplifies.