Dimensionality reduction
The umbrella concept underneath PCA / PCR / PLS / LDA-as-projection / NN-feature-extractor: collapse the $p$ original predictors into $M < p$ composite ones for visualization, downstream regression, or as a fix for collinearity / curse-of-dimensionality. The prof returns to it across modules; every method that takes a wide $X$ and produces a narrower $Z$ is an instance.
Definition (prof’s framing)
“Dimensionality reduction tries to reduce, collapse the dimensions … both dimensionality reduction and clustering approaches are trying to simplify the data, to give simple summaries. Dimensionality reduction looks for low-dimensional ones … clustering tries to find groups.” - L21-unsupervised-1 / L22-unsupervised-2
Construct new variables $Z_1, \dots, Z_M$ as linear (or nonlinear) functions of the original $X_1, \dots, X_p$, with $M < p$. Then use the $Z$’s in place of the $X$’s downstream: for visualization (project to 2 / 3 dims), for regression (PCR / PLS), or as input to another model.
“There’s a bunch of dimensionality reduction methods. [PCA] is by far the oldest and the simplest and often the most useful.” - L21-unsupervised-1
Notation & setup
- $\mathbf{X}$: the $n \times p$ original data matrix.
- $\mathbf{Z}$: the $n \times M$ reduced data matrix, $M < p$.
- For linear dim reduction, each $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$, a linear combination with weights $\phi_{1m}, \dots, \phi_{pm}$.
- For nonlinear dim reduction (NN feature extractor), $Z = g(X)$ for some learned $g$.
The unifying recipe
Linear dimensionality reduction is one trick repeated with different choices for the $\phi_{jm}$’s:
$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \qquad m = 1, \dots, M.$$
After reducing, you typically fit a model on $Z_1, \dots, Z_M$:
$$y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i.$$
Composing the two linear maps gives back implied coefficients on the original $X_j$’s:
$$\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}.$$
“You’re taking the X, you’re squishing it down to fit your model, and then you go backwards to the original model again.” - L14-modelsel-3
This is the formal equivalence that makes PCR / PLS feel like “regression with a different prior on $\beta$”: they constrain $\beta$ to live in an $M$-dimensional subspace.
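A minimal sketch of the recipe in Python (scikit-learn), with PCA as the direction-chooser; the data, $M$, and variable names are illustrative, not from the course:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                    # n = 100 observations, p = 8 predictors
y = X @ rng.normal(size=8) + rng.normal(size=100)

M = 3                                            # number of composite predictors Z_1..Z_M
Xs = StandardScaler().fit_transform(X)           # standardize first: variance-driven method

pca = PCA(n_components=M).fit(Xs)
Z = pca.transform(Xs)                            # n x M reduced matrix, Z = Xs @ Phi

reg = LinearRegression().fit(Z, y)               # regress y on the Z's: theta_1..theta_M
theta = reg.coef_

Phi = pca.components_.T                          # p x M loading matrix
beta_implied = Phi @ theta                       # beta_j = sum_m theta_m * phi_{jm}
print(beta_implied)                              # implied coefficients on the standardized X's
```

Swapping PCA for PLS (or any other linear front end) only changes how $\Phi$ is chosen; the compose-back step is identical.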
The instances in this course
| Method | $Z_m$ chosen to … | Supervised? | Module |
|---|---|---|---|
| PCA | maximize $\mathrm{Var}(Z_m)$ s.t. $\sum_j \phi_{jm}^2 = 1$ and $Z_m \perp Z_{m'}$ for $m' < m$ | No | 10 (and used in 6) |
| PCR | same as PCA, then regress $y$ on the first $M$ PCs | No (for the $Z$’s); yes for the downstream regression | 6 |
| PLS | maximize $\mathrm{Cov}(Z_m, Y)$ s.t. orthogonality | Yes | 6 |
| LDA | project to discriminant directions max-separating class means | Yes | 4 |
| NN feature extractor | last hidden layer of a trained network used as features for downstream regression | Yes (via task loss) | 11 |
“The PCR pattern, compress down to fewer features with some method, then regress, generalizes far beyond PCA. Example: video frames as $X$. They’re huge. Run them through a learned feature extractor (a neural net) to get a small vector per frame, then regress on that compressed representation.” - L15-modelsel-4
So the same idea connects an unsupervised classical algorithm (PCA), a supervised classical algorithm (PLS), a generative classifier (LDA-as-projection), and modern deep learning (transfer learning / feature extraction).
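As a hedged illustration of the “compress then regress” pattern with the two classical front ends, here is a PCR pipeline next to a PLS one in scikit-learn (synthetic data; $M$ chosen arbitrarily):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

M = 4
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())  # unsupervised front end
pls = make_pipeline(StandardScaler(), PLSRegression(n_components=M))            # supervised front end

for name, model in [("PCR", pcr), ("PLS", pls)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(name, round(mse, 3))
```

A NN feature extractor replaces the PCA/PLS step with a learned nonlinear encoder, but the downstream regression stays the same.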
Insights & mental models
Three reasons to do it.
- Visualization: plot observations in the first 2 PCs to see structure that’s invisible in $p$ dims (see the sketch after this list). “Who can think in eight dimensions?” - L21-unsupervised-1
- Defeat collinearity / well-pose regression. Orthogonal $Z$’s remove the redundancy that makes $\mathbf{X}^\top \mathbf{X}$ (nearly) singular. PCR / ridge / lasso all attack the same core problem; dim reduction is the “rotate it away” version.
- Compression for downstream models. Smaller, easier regressions; faster fits; sometimes better generalization (regularization-by-projection).
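For the visualization use case, a minimal sketch (synthetic data with two hidden groups; all names illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 8)),    # group 1 in p = 8 dimensions
               rng.normal(3, 1, size=(50, 8))])   # group 2, shifted mean

Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.scatter(Z[:, 0], Z[:, 1])                     # the two groups separate in 2 dims
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()
```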
Supervised vs unsupervised is the key axis. PCA / PCR pick directions using only $X$ → can miss directions that matter for $Y$ (“a key assumption … is that the response actually lives in the directions of largest $X$-variance” - L15-modelsel-4). PLS / LDA / NN-feature-extractor use $Y$ to guide selection → won’t miss it, but adds variance and risks overfitting to $Y$.
Linear vs nonlinear. PCA / PCR / PLS / LDA are all linear projections; they can’t capture curved structure. Nonlinear extensions (kernel PCA, autoencoders, NN feature extractors) handle that, but at the cost of interpretability.
Standardize first for any variance-driven method (PCA, PCR, PLS, k-means inside the pipeline). See standardization.
$M$ is the central tuning knob.
- Unsupervised: pick $M$ from scree / cumulative PVE; see explained-variance-and-scree-plot.
- Supervised: pick $M$ by cross-validating the downstream model. Treat $M$ like any other tuning parameter (sketch after this list).
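A sketch of both routes for choosing $M$ (scikit-learn; data and grid are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 10))
y = X[:, :3].sum(axis=1) + rng.normal(size=150)

# Unsupervised: cumulative proportion of variance explained (scree-style)
pve = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_
print(np.cumsum(pve))

# Supervised: cross-validate the downstream regression over M
pcr = make_pipeline(StandardScaler(), PCA(), LinearRegression())
grid = GridSearchCV(pcr, {"pca__n_components": list(range(1, 11))},
                    scoring="neg_mean_squared_error", cv=5).fit(X, y)
print(grid.best_params_)
```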
Dim reduction ≠ variable selection. PCR / PCA give combinations of all original variables; every $X_j$ appears in every $Z_m$ in general. If you need to drop variables (interpretability, cost of measurement), use lasso or subset-selection. The prof: “Use PCR/ridge when you want predictive power and don’t care which raw variables drive it. Use lasso or subset selection when you need this or that one.” - L15-modelsel-4
The cousin discipline: clustering also “simplifies” high-dim data, but produces a discrete summary (group label per point) rather than a continuous low-dim projection. The prof groups them as the two “summarize high-dim data” tools in module 10.
Returns across the course
- L08 (L08-classif-2): introduced as the fix for collinearity / curse of dimensionality. PCA is the canonical tool; LDA is previewed as another dim-reduction route.
- L09 (L09-classif-3): LDA framed as $K$ classes → $K-1$ discriminant scores. “This is why LDA is often listed as a dimensionality-reduction method.”
- L14 (L14-modelsel-3): PCR pipeline introduced: standardize → PCA → regress on the first $M$ PCs → back-transform. The unifying recipe (build $Z$ from $X$, fit, back out $\beta$) is laid down here.
- L15 (L15-modelsel-4): PCR finished; PLS as the supervised counterpart; the “PCR pattern generalizes to NN feature extractors” insight.
- L21 (L21-unsupervised-1): PCA covered in full, framed as “the first dim-reduction tool.” Pivot to clustering as the discrete-summary alternative.
Exam signals
“Dimensionality reduction looks for low-dimensional ones … clustering tries to find groups. So it’s more of a discretized thing. But in both cases, you’re looking at a short description of high-dimensional data.” - L21-unsupervised-1
“The PCR pattern, compress down to fewer features with some method, then regress, generalizes far beyond PCA.” - L15-modelsel-4
“By far the oldest and the simplest and often the most useful.” - L21-unsupervised-1 (on PCA among dim-reduction methods)
Pitfalls
- Confusing dim reduction with variable selection. Different goals, different tools.
- Using PCA when $Y$ matters. PCA is unsupervised; large-variance directions of $X$ may not be predictive directions for $Y$.
- Not standardizing. Dim reduction is variance-driven; mixed units → meaningless.
- Linear-only methods on curved data. PCA can’t capture nonlinear manifolds; if the data lies on a curve / circle / arc, PCA’s first two PCs may completely miss the structure (see the sketch after this list). Switch to a nonlinear method or accept the limitation.
- Picking $M$ unsupervised when there’s a supervised task. Cross-validate against the downstream model; much more principled than the elbow on a scree plot.
- Treating “low-dim” as “interpretable.” PCs are linear combinations of all original variables; loadings help interpretation but don’t give clean attribution to single $X_j$’s.
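A toy illustration of the curved-data pitfall, assuming nothing beyond NumPy and scikit-learn: points on a unit circle are intrinsically 1-D, yet PCA needs both linear directions to describe them.

```python
import numpy as np
from sklearn.decomposition import PCA

angles = np.random.default_rng(4).uniform(0, 2 * np.pi, size=500)
X = np.column_stack([np.cos(angles), np.sin(angles)])   # data lying on a unit circle

# Variance splits roughly [0.5, 0.5]: no single linear direction captures the structure
print(PCA().fit(X).explained_variance_ratio_)
```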
Scope vs ISLP
- In scope: the unified frame (build $Z$ from $X$, fit, back out $\beta$); the supervised/unsupervised axis; the four/five canonical methods (PCA, PCR, PLS, LDA-as-projection, NN feature extractor); when to use which; standardization mandate; visualization use case; collinearity / high-dim motivation.
- Look up in ISLP: §6.3 (the dimensionality-reduction frame for regression , PCR + PLS); §12.2 (PCA in the unsupervised-learning chapter); §4.4 (LDA, with the projection viewpoint).
- Skip in ISLP:
- Spectral / eigendecomposition derivations of PCA; out per L04-statlearn-3.
- Kernel PCA, ICA, manifold learning (t-SNE, UMAP, MDS); none lectured. Out.
- Detailed PLS history & chemometrics tuning; L15-modelsel-4: PCR is the workhorse; PLS gets the one-liner.
Exercise instances
(No exercises uniquely target the umbrella concept; every dim-reduction exercise lives under a specific instance atom: PCA in Exercise 10.1 / Exercise 6.7, PCR in Exercises 6.7–6.8, PLS in Exercise 6.9, LDA in Exercise 4.2.)
How it might appear on the exam
- Method comparison: PCA vs PLS vs LDA vs lasso; match each to its goal (unsupervised dim reduction; supervised dim reduction; class-separating projection; sparse variable selection).
- Conceptual MC:
- “PCA picks directions using $Y$.” → False.
- “PLS picks directions using $Y$.” → True.
- “Dim reduction performs variable selection.” → False (the $Z$’s are linear combinations of all $X$’s).
- Output interpretation: given a biplot or a 2D projection of NYT stories, identify what the projection has captured.
- “Why dim-reduce?” answer: visualization, defeat collinearity, well-pose and regularize the downstream model.
- Choosing $M$: when to use scree (unsupervised) vs CV (supervised).
Related
- principal-component-analysis: the canonical instance; unsupervised and linear.
- explained-variance-and-scree-plot: how to choose $M$ in the unsupervised case.
- principal-component-regression: PCA as the front end of regression; one of the module-6 dim-reduction routes.
- partial-least-squares: the supervised counterpart of PCR.
- linear-discriminant-analysis: supervised linear projection that doubles as $(K-1)$-dim reduction.
- curse-of-dimensionality: the high-$p$ pathology that dim reduction defends against.
- collinearity: the second pathology dim reduction handles by rotating to orthogonal $Z$’s.
- k-means-clustering / hierarchical-clustering: the discrete-summary cousins of dim reduction; also “simplify high-dim data.”