Dimensionality reduction

The umbrella concept underneath PCA / PCR / PLS / LDA-as-projection / NN-feature-extractor: collapse the predictors X_1, …, X_p into fewer composite variables Z_1, …, Z_M for visualization, downstream regression, or as a fix for collinearity / the curse of dimensionality. The prof returns to it across modules: every method that takes a wide X and produces a narrower Z is an instance.

Definition (prof’s framing)

“Dimensionality reduction tries to reduce, collapse the dimensions … both dimensionality reduction and clustering approaches are trying to simplify the data, to give simple summaries. Dimensionality reduction looks for low-dimensional ones … clustering tries to find groups.” - L21-unsupervised-1 / L22-unsupervised-2

Construct new variables Z_1, …, Z_M as linear (or nonlinear) functions of the original X_1, …, X_p, with M < p. Then use the Z’s in place of the X’s downstream: for visualization (project to 2 / 3 dims), for regression (PCR / PLS), or as input to another model.

“There’s a bunch of dimensionality reduction methods. [PCA] is by far the oldest and the simplest and often the most useful.” - L21-unsupervised-1

Notation & setup

  • X ∈ ℝ^{n×p}: original data matrix.
  • Z ∈ ℝ^{n×M}: reduced data matrix, M < p.
  • For linear dim reduction, each Z_m = Σ_{j=1}^p φ_{jm} X_j, a linear combination with weights φ_{1m}, …, φ_{pm}.
  • For nonlinear dim reduction (NN feature extractor), Z = g(X) for some learned g.

The unifying recipe

Linear dimensionality reduction is one trick repeated with different choices for the φ’s:

    Z_m = Σ_{j=1}^p φ_{jm} X_j,   m = 1, …, M.

After reducing, you typically fit a model on Z:

    y = θ_0 + Σ_{m=1}^M θ_m Z_m + ε.

Composing the two linear maps gives back implied coefficients on the original X’s:

    β_j = Σ_{m=1}^M θ_m φ_{jm}.

“You’re taking the X, you’re squishing it down to fit your model, and then you go backwards to the original model again.” - L14-modelsel-3

This is the formal equivalence that makes PCR / PLS feel like “regression with a different prior on β”: they constrain β to live in an M-dimensional subspace.
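A minimal numpy sketch of the recipe on synthetic data (PCA chosen here as the source of the φ’s; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 5, 2

X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

# Step 1: build the p x M weight matrix Phi; here the phi's come from PCA
# (the top M right singular vectors of the centered X).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Phi = Vt[:M].T                       # columns are phi_1, ..., phi_M

# Step 2: squish X down to Z = X Phi and fit the small regression y ~ Z.
Z = Xc @ Phi
theta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)

# Step 3: go backwards -- implied coefficients on the original X's:
# beta_j = sum_m theta_m * phi_jm.
beta_implied = Phi @ theta
assert beta_implied.shape == (p,)    # one implied coefficient per X_j
```

The back-transform `Phi @ theta` is the whole trick: the small regression on Z implies one coefficient per original X_j.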

The instances in this course

| Method | Z’s chosen to … | Supervised? | Module |
| --- | --- | --- | --- |
| PCA | maximize Var(Z_m) s.t. ‖φ_m‖ = 1 and Z_m ⊥ Z_{m'} for m' < m | No | 10 (and used in 6) |
| PCR | same as PCA, then regress y on first M PCs | No (for the Z’s); yes downstream | 6 |
| PLS | maximize Cov(Z_m, y) s.t. orthogonality | Yes | 6 |
| LDA | project to K − 1 discriminant directions max-separating class means | Yes | 4 |
| NN feature extractor | last hidden layer of a trained network used as features for downstream regression | Yes (via task loss) | 11 |

“The PCR pattern (compress down to fewer features with some method, then regress) generalizes far beyond PCA. Example: video frames as X. They’re huge. Run them through a learned feature extractor (a neural net) to get a small vector per frame, then regress on that compressed representation.” - L15-modelsel-4

So the same idea connects an unsupervised classical algorithm (PCA), a supervised classical algorithm (PLS), a generative classifier (LDA-as-projection), and modern deep learning (transfer learning / feature extraction).

Insights & mental models

Three reasons to do it.

  1. Visualization: plot observations in the first 2 PCs to see structure that’s invisible in p dims. “Who can think in eight dimensions?” - L21-unsupervised-1
  2. Defeat collinearity / well-pose regression. Orthogonal Z’s remove the redundancy that makes XᵀX singular. PCR / ridge / lasso all attack the same core problem; dim reduction is the “rotate it away” version.
  3. Compression for downstream models. Smaller, easier regressions; faster fits; sometimes better generalization (regularization-by-projection).
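Reason 2 can be checked directly: with an exactly collinear column, XᵀX is rank-deficient, but the leading PCs are orthogonal by construction. A numpy sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, 2 * x1, rng.normal(size=n)])  # X_2 = 2 X_1: exact collinearity

# The normal equations are ill-posed: X'X is rank-deficient.
assert np.linalg.matrix_rank(X.T @ X) < X.shape[1]

# Rotate the redundancy away: the PCs of X are orthogonal by construction.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                                      # first 2 PCs carry all the signal
assert np.linalg.matrix_rank(Z.T @ Z) == 2             # well-posed again
```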

Supervised vs unsupervised is the key axis. PCA / PCR pick directions using only X → can miss directions that matter for y (“a key assumption … is that the response actually lives in the directions of largest X-variance” - L15-modelsel-4). PLS / LDA / NN-feature-extractor use y to guide selection → won’t miss it, but add variance and risk overfitting to y.

Linear vs nonlinear. PCA / PCR / PLS / LDA are all linear projections: they can’t capture curved structure. Nonlinear extensions (kernel PCA, autoencoders, NN feature extractors) handle that, but at the cost of interpretability.

Standardize first for any variance-driven method (PCA, PCR, PLS, k-means inside the pipeline). See standardization.
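A quick sketch of why standardization matters (sklearn assumed; the unit mismatch is contrived):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)       # correlated with a: shared structure
X = np.column_stack([1000 * a, b])     # same structure, wildly different units

# Unstandardized: PC1 is essentially just the big-unit column.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
assert abs(raw_pc1[0]) > 0.99

# Standardized: both variables get comparable weight in PC1.
Xs = StandardScaler().fit_transform(X)
std_pc1 = PCA(n_components=1).fit(Xs).components_[0]
assert abs(abs(std_pc1[0]) - abs(std_pc1[1])) < 0.05
```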

M is the central tuning knob.

  • Unsupervised: pick M from the scree / cumulative-PVE plot; see explained-variance-and-scree-plot.
  • Supervised: pick M by cross-validating the downstream model. Treat M like any other tuning parameter.
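The supervised route can be sketched as an ordinary grid search, treating M as just another tuning parameter (sklearn Pipeline + GridSearchCV assumed; data synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 120, 6
X = rng.normal(size=(n, p))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=n)

# PCR as a pipeline: standardize -> PCA(M) -> OLS on the M scores.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# M (pca__n_components) is cross-validated like any other tuning parameter.
search = GridSearchCV(pcr, {"pca__n_components": list(range(1, p + 1))}, cv=5)
search.fit(X, y)
best_M = search.best_params_["pca__n_components"]
assert 1 <= best_M <= p
```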

Dim reduction ≠ variable selection. PCR / PCA give combinations of all original variables: every X_j appears in every Z_m in general. If you need to drop variables (interpretability, cost of measurement), use lasso or subset selection. The prof: “Use PCR/ridge when you want predictive power and don’t care which raw variables drive it. Use lasso or subset selection when you need this or that one.” - L15-modelsel-4
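The contrast is easy to see in code (sklearn assumed; data synthetic): lasso zeroes out coefficients, while a PC loads on every variable:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 0.3 * rng.normal(size=n)   # only X_0 actually matters

# Lasso does variable selection: most coefficients are exactly zero.
lasso = Lasso(alpha=0.5).fit(X, y)
assert (lasso.coef_ == 0).sum() >= p - 2

# A principal component does not: it loads on every variable.
pc1 = PCA(n_components=1).fit(X).components_[0]
assert np.all(pc1 != 0)
```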

The cousin discipline: clustering also “simplifies” high-dim data, but produces a discrete summary (group label per point) rather than a continuous low-dim projection. The prof groups them as the two “summarize high-dim data” tools in module 10.

Returns across the course

  • L08 (L08-classif-2): introduced as the fix for collinearity / curse of dimensionality. PCA is the canonical tool; LDA is previewed as another dim-reduction route.
  • L09 (L09-classif-3): LDA framed as K classes → K − 1 discriminant scores. “This is why LDA is often listed as a dimensionality-reduction method.”
  • L14 (L14-modelsel-3): PCR pipeline introduced: standardize → PCA → regress on first M PCs → back-transform. The unifying recipe (build Z from X, fit, back out β) is laid down here.
  • L15 (L15-modelsel-4): PCR finished; PLS as the supervised counterpart; the “PCR pattern generalizes to NN feature extractors” insight.
  • L21 (L21-unsupervised-1): PCA covered in full, framed as “the first dim-reduction tool.” Pivot to clustering as the discrete-summary alternative.

Exam signals

“Dimensionality reduction looks for low-dimensional ones … clustering tries to find groups. So it’s more of a discretized thing. But in both cases, you’re looking at a short description of high-dimensional data.” - L21-unsupervised-1

“The PCR pattern (compress down to fewer features with some method, then regress) generalizes far beyond PCA.” - L15-modelsel-4

“By far the oldest and the simplest and often the most useful.” - L21-unsupervised-1 (on PCA among dim-reduction methods)

Pitfalls

  • Confusing dim reduction with variable selection. Different goals, different tools.
  • Using PCA when y matters. PCA is unsupervised: large-variance directions of X may not be predictive directions for y.
  • Not standardizing. Dim reduction is variance-driven; with mixed units the variances (and hence the components) are meaningless.
  • Linear-only methods on curved data. PCA can’t capture nonlinear manifolds; if the data lies on a curve / circle / arc, PCA’s first two PCs may completely miss the structure. Switch to a nonlinear method or accept the limitation.
  • Picking M unsupervised when there’s a supervised task. Cross-validate against the downstream model; much more principled than the elbow on a scree plot.
  • Treating “low-dim” as “interpretable.” PCs are linear combinations of all original variables; loadings help interpretation but don’t give clean attribution to single X_j’s.
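The curved-data pitfall is easy to demonstrate: points on a circle are intrinsically one-dimensional, yet no single linear PC captures them (sklearn assumed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
n = 500
angle = rng.uniform(0, 2 * np.pi, size=n)
X = np.column_stack([np.cos(angle), np.sin(angle)])  # a 1-D manifold (a circle) in 2-D

pca = PCA(n_components=1).fit(X)
pve1 = pca.explained_variance_ratio_[0]
# The data is intrinsically one-dimensional, yet one linear PC explains
# only about half the variance: the circle has no dominant linear direction.
assert pve1 < 0.65
```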

Scope vs ISLP

  • In scope: the unified frame (build Z from X, fit, back out β); the supervised/unsupervised axis; the five canonical methods (PCA, PCR, PLS, LDA-as-projection, NN feature extractor); when to use which; the standardization mandate; the visualization use case; the collinearity / high-dim motivation.
  • Look up in ISLP: §6.3 (the dimensionality-reduction frame for regression , PCR + PLS); §12.2 (PCA in the unsupervised-learning chapter); §4.4 (LDA, with the projection viewpoint).
  • Skip in ISLP:
    • Spectral / eigendecomposition derivations of PCA, out per L04-statlearn-3.
    • Kernel PCA, ICA, manifold learning (t-SNE, UMAP, MDS) , none lectured. Out.
    • Detailed PLS history & chemometrics tuning , L15-modelsel-4: PCR is the workhorse; PLS gets the one-liner.

Exercise instances

(No exercises uniquely target the umbrella concept; every dim-reduction exercise lives under a specific instance atom: PCA in Exercise 10.1 / Exercise 6.7, PCR in Exercises 6.7–6.8, PLS in Exercise 6.9, LDA in Exercise 4.2.)

How it might appear on the exam

  • Method comparison: PCA vs PLS vs LDA vs lasso; match each to its goal (unsupervised dim reduction; supervised dim reduction; class-separating projection; sparse variable selection).
  • Conceptual MC:
    • “PCA picks directions using y.” → False.
    • “PLS picks directions using y.” → True.
    • “Dim reduction performs variable selection.” → False (the Z’s are linear combinations of all X’s).
  • Output interpretation: given a biplot or a 2D projection of NYT stories, identify what the projection has captured.
  • “Why dim-reduce?” answer: visualization, defeat collinearity, well-pose / regularize the downstream model.
  • Choosing M: when to use scree (unsupervised) vs CV (supervised).