Principal components regression (PCR)
The third of module 6’s three families (after subset-selection and shrinkage): transform the correlated predictors into orthogonal composite predictors (the principal components) via PCA, then fit standard OLS on those. The prof’s headline framing: PCR is a discretized ridge regression; both shrink the small-eigenvalue directions, but PCR does it abruptly (truncation) while ridge does it smoothly. “PCR can be seen as a discretized version of ridge regression.” - L15-modelsel-4.
Definition (prof’s framing)
“Principal Components Regression involves: constructing the first $M$ principal components, $Z_1, \dots, Z_M$, [and] using these components as the predictors in a standard linear regression model.”, slide deck (selection_regularization_presentation_lecture2.md)
The pipeline:
“Standardize $X$, run PCA, get $Z_1, \dots, Z_p$. Fit a standard linear regression of $y$ on $Z_1, \dots, Z_M$, where $M$ is a tuning parameter. Sweep $M$ from 1 up to $p$. Pick the $M$ that minimizes CV-MSE.” - L15-modelsel-4
“You’re taking the X, you’re squishing it down to fit your model, and then you go backwards to the original model again.” - L14-modelsel-3
Notation & setup
- $X$: design matrix (standardized, see below).
- $Z_m$: $m$-th principal component (linear combination of the original $X_j$’s, $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$ with $\sum_{j=1}^{p} \phi_{jm}^2 = 1$, ordered by decreasing variance, mutually orthogonal). See principal-component-analysis.
- $M$: number of PCs retained, the PCR tuning parameter, chosen by cross-validation.
- $\theta_0, \theta_1, \dots, \theta_M$: regression coefficients of $y$ on the $Z_m$’s.
- $\beta_j$ (back-transformed): coefficient on the original $X_j$, recovered as $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}$.
Formula(s) to know cold
Component construction: $Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$, $m = 1, \dots, M$.
Reduced regression: $y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon_i$.
Back-transform to original-$X$ coefficients (slide-flagged): $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}$.
This last formula is what the slides labeled the “constrained interpretation”: dimension reduction constrains the coefficients of a standard linear regression to live in the $M$-dimensional subspace spanned by the loading vectors $\phi_1, \dots, \phi_M$.
Fraction of $X$-variance captured by the first $M$ PCs (slide-flagged): $\sum_{m=1}^{M} \lambda_m \big/ \sum_{m=1}^{p} \lambda_m$, where $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of the covariance matrix of the standardized $X$ (= variances of the corresponding PCs).
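A one-liner check of this formula, with invented eigenvalues (the numbers are mine, purely for illustration):

```python
import numpy as np

lambdas = np.array([4.0, 2.0, 1.0, 1.0])   # hypothetical eigenvalues, decreasing
print(np.cumsum(lambdas) / lambdas.sum())  # PVE for M = 1..p: [0.5 0.75 0.875 1.]
```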
The pipeline (canonical procedure)
- Standardize $X$ (mean 0, sd 1 per column). PCA is not scale-invariant.
- Run PCA on the standardized $X$ → get $Z_1, \dots, Z_p$ in decreasing-variance order.
- Sweep $M$ from 1 to $p$. For each $M$:
  - Fit OLS of $y$ on $Z_1, \dots, Z_M$ → get $\theta_m$’s.
  - Compute CV-MSE.
- Pick $M$ at the CV-MSE minimum (or by the 1-SE rule / explained-variance threshold); see the sketch below.
- (Optional) back-transform $\theta_m$’s to $\beta_j$’s in original-$X$ space via $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}$.
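A minimal sketch of this pipeline, assuming Python with scikit-learn (the 2024 exam used R’s library(pls) instead; the data and all names below are hypothetical):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                      # hypothetical data
y = X @ rng.normal(size=8) + rng.normal(size=100)

pipe = Pipeline([
    ("scale", StandardScaler()),      # PCA is not scale-invariant
    ("pca", PCA()),                   # components in decreasing-variance order
    ("ols", LinearRegression()),      # OLS of y on the retained PCs
])
# Sweep M = 1..p; pick the M that minimizes CV-MSE
grid = {"pca__n_components": list(range(1, X.shape[1] + 1))}
search = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
M = search.best_params_["pca__n_components"]

# Back-transform: beta_j = sum_m theta_m * phi_jm (coefficients on standardized X)
pca = search.best_estimator_.named_steps["pca"]
theta = search.best_estimator_.named_steps["ols"].coef_
beta = pca.components_.T @ theta      # components_ rows are the loading vectors phi_m
```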
Insights & mental models
Why bother, the multicollinearity angle
PCR’s main reason to exist: handling collinearity.
“If two $X$’s are nearly identical (or strongly correlated more generally, multicollinearity), the OLS fit can’t tell them apart. Parameters can trade off of each other, which is bad… Squishing to orthogonal $Z$’s removes the redundancy. Each $Z_m$ is independent of the others by construction. Cleaner fit, lower variance.” - L14-modelsel-3
Three ways the course offers to handle multicollinearity (per L14-modelsel-3):
- L1 (lasso): pick one of the correlated variables, zero the others.
- L2 (ridge): hold both back, share the load.
- PCA / PCR: rotate to an orthogonal basis where the correlation is gone by construction.
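A tiny demo of the third option on made-up data (construction mine): after rotating into the PC basis, the composite predictors are uncorrelated by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)       # nearly identical to x1 -> multicollinearity
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first

# Loadings phi = eigenvectors of the correlation matrix; scores Z = X @ phi
_, phi = np.linalg.eigh(X.T @ X / len(X))
Z = X @ phi

print(np.round(np.corrcoef(X.T), 3))       # off-diagonal ~0.995: redundancy
print(np.round(np.corrcoef(Z.T), 3))       # off-diagonal ~0: gone by construction
```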
PCR ≈ a discretized ridge, the load-bearing analogy
The single most important conceptual point in this segment, from L15-modelsel-4:
“PCR can be seen as a discretized version of ridge regression. Ridge regression encourages ties, if two things explain the same thing, then have them share the weight. The point where they share is actually very similar to getting a principal component that captures the part that both share. So PCR is doing it more abruptly because it simply says okay, direction, direction, direction, new axes, new data, everything that’s shared go here. Ridge regression does it more continuously.”
Both methods place pressure on the least important principal directions:
- PCR: drop them outright (those past the cutoff $M$). Hard threshold.
- Ridge: shrink them most. Soft threshold. Slide gives the per-PC shrinkage factor as $d_m^2 / (d_m^2 + \lambda)$, where $d_m$ is the $m$-th singular value of $X$ (so $d_m^2$ is the corresponding eigenvalue of $X^\top X$); heavier shrinkage on smaller eigenvalues.
“Higher pressure on less important PCs. PCR discards the smallest eigenvalue components.”, slide deck
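Numerically, with hypothetical eigenvalues (my numbers, not the slide’s): ridge scales the coefficient along PC $m$ by $d_m^2 / (d_m^2 + \lambda)$, while PCR’s factor is 1 for the kept PCs and 0 past the cutoff.

```python
import numpy as np

d2 = np.array([5.0, 2.0, 0.5, 0.1])   # hypothetical eigenvalues d_m^2, decreasing
lam, M = 1.0, 2                        # ridge penalty; PCR cutoff

ridge = d2 / (d2 + lam)                        # smooth: [0.833 0.667 0.333 0.091]
pcr = (np.arange(len(d2)) < M).astype(float)   # abrupt: [1. 1. 0. 0.]
print(ridge, pcr)
```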
The two big PCR assumptions (both can fail)
Slide-flagged:
“Key assumptions: A small number of principal components suffice to explain (1) most of the variability in the data, [and] (2) the relationship with the response. … The assumptions above are not guaranteed to hold in every case. This is true specially for assumption 2 above. Since the PCs are selected via unsupervised learning.”, slide deck
The critical drawback verbatim from L15-modelsel-4:
“There’s no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.”
PCR’s bet: the high-variance directions in $X$-space are also the directions $y$ depends on. When that holds, PCR is great. When the response is driven by low-variance directions (e.g., a single rare-but-predictive feature), PCR misses it because it threw that direction away. PLS is the supervised fix: it uses $y$ to choose directions.
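A synthetic illustration of that failure mode (construction mine, not from the course): $y$ depends only on the low-variance second PC of two correlated predictors, so one-component PCR explains nothing while one-component PLS, which consults $y$, explains it all.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + 0.2 * rng.normal(size=300)       # strongly correlated pair
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] - X[:, 1]                      # y lives in the low-variance PC2 direction

Z1 = PCA(n_components=1).fit_transform(X)  # PCR keeps only the high-variance PC1
print(LinearRegression().fit(Z1, y).score(Z1, y))           # R^2 ~ 0: direction discarded
print(PLSRegression(n_components=1).fit(X, y).score(X, y))  # R^2 ~ 1: PLS used y
```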
PCR is not a variable-selection method
“Just like Ridge, it doesn’t actually select the parameters, it gives you components, and each component is a combination of the original axes.” - L15-modelsel-4
“Use PCR/ridge when you want predictive power and don’t care which raw variables drive it. Use lasso or subset selection when you need ‘this or that one.’” - L15-modelsel-4
After back-transforming, all original-$X$ coefficients are typically nonzero (since each PC is a combination of all $p$ predictors).
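A quick contrast on made-up data (construction mine): lasso returns exact zeros, while the back-transformed PCR betas are generically all nonzero.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3 * X[:, 0] + rng.normal(size=200)     # only predictor 0 actually matters

pca = PCA(n_components=3).fit(X)
theta = LinearRegression().fit(pca.transform(X), y).coef_
print(np.round(pca.components_.T @ theta, 3))         # PCR betas: all 6 nonzero
print(np.round(Lasso(alpha=0.1).fit(X, y).coef_, 3))  # lasso: exact zeros appear
```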
Bias / variance / test-error reading along $M$
The CV-MSE-vs-$M$ curve is U-shaped (similar in shape to ridge’s CV-vs-$\lambda$ curve):
- Bias drops fast at small $M$, then plateaus (more PCs → more parameters → already low bias).
- Variance rises with $M$ (more parameters, more noise-fitting).
- Test MSE dips through a minimum where bias and variance cross-trade.
“Of the many solutions that have a low bias, the mean squared error on the held out data is in a similar location to where the variance is minimized.” - L15-modelsel-4
Credit-data PCR result
“With $M = 1$ the original-variable betas are tiny, PC1 didn’t load heavily on the variables that drive $y$. A big jump at $M = 10$: that’s where income finally enters strongly. Income just happened not to vary as much or wasn’t as correlated as other things. … CV-MSE was decreasing slowly, then dropped massively at $M = 10$. Settle on $M = 10$.” - L15-modelsel-4
Slide line: “The lowest cross-validation error occurs when there are 10 components, almost no dimension reduction at all.” On the Credit dataset PCR did not outperform other methods because the response-relevant direction (income) had middling $X$-variance.
“Even though the things that you care about were just ended up being four variables, primarily, it took you 10 PCs to get there.” - L15-modelsel-4
This is a real-world cautionary tale for assumption (2) above.
PCR generalizes beyond linear PCA
“The PCR pattern, compress down to fewer features with some method, then regress, generalizes far beyond PCA. Example: video frames as $X$. They’re huge. Run them through a learned feature extractor (a neural net) to get a small vector per frame, then regress on that compressed representation.” - L15-modelsel-4
Same skeleton, nonlinear compressor. (Conceptual bridge to dimensionality-reduction more broadly.)
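The skeleton, spelled out (the encode() function below is a hypothetical stand-in for any compressor, linear or learned):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def encode(frames: np.ndarray) -> np.ndarray:
    """Hypothetical feature extractor: a fixed random projection standing in
    for PCA, an autoencoder, or a pretrained network."""
    rng = np.random.default_rng(4)
    W = rng.normal(size=(frames.shape[1], 16))  # huge input -> 16 features
    return frames @ W

frames = np.random.default_rng(5).normal(size=(50, 10_000))  # 50 "video frames"
y = np.random.default_rng(6).normal(size=50)

Z = encode(frames)                        # step 1: compress
model = LinearRegression().fit(Z, y)      # step 2: regress on the compressed rep
```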
Exam signals
“There’s no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.” - L15-modelsel-4
“PCR can be seen as a discretized version of ridge regression.” - L15-modelsel-4 (echoed verbatim on slide)
“Just like Ridge, it doesn’t actually select the parameters.” - L15-modelsel-4 (slide also: “Similar to Ridge, PCR does not perform feature selection, PCs are linear combination of all predictors.”)
“PCA is not scale invariant. … Standardize all the variables before applying PCA.”, slide deck
The PCR ↔ ridge analogy is the headline conceptual point. The PCR-as-not-variable-selection contrast with lasso is exam-flavoured. PCR was on the 2024 exam (called via library(pls)).
Pitfalls
- Forgetting to standardize. PCA / PCR are not scale-invariant. “PCA is not scale invariant. So if you don’t standardize them so that their mean is zero and their variance is one, then if one had a standard deviation of like a million, then that will be your strongest variable.” - L14-modelsel-3.
- Thinking PCR selects variables. It doesn’t: every back-transformed $\beta_j$ is typically nonzero.
- Trusting PCR when assumption (2) fails. If $y$ depends on a low-$X$-variance direction, PCR throws it away and underperforms. The Credit example illustrates this: it needed $M = 10$ of $p = 11$ PCs to capture income, almost no dimension reduction achieved.
- Confusing $M$ with $p$. $M$ is the number of PCs kept; $p$ is the total number of original predictors. PCR’s appeal is when $M \ll p$.
- Reading PCR coefficients as variable importance. They’re back-projected from the $\theta_m$’s; they describe the regression but are not interpretable as “predictor X matters this much.”
Scope vs ISLP
- In scope: the PCR pipeline (standardize → PCA → fit → back-transform); the back-transform formula $\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}$; choosing $M$ by CV; the two key assumptions and their failure modes; the PCR-as-discretized-ridge analogy; the Credit-data result; the contrast with lasso (PCR does not select variables); the PVE formula via eigenvalues.
- Look up in ISLP: §6.3.1 (pp. 281–286) for the PCR algorithm and the simulated PCR-vs-ridge-vs-lasso comparison figure. §10 for full PCA treatment.
- Skip in ISLP: the SVD derivation of PCA (the prof noted full PCA mechanics live in chapter 10; he didn’t redo them in module 6); detailed ridge-shrinkage-per-PC algebra (slide gave the factor, the prof noted “this is a confusing figure, I’ll try to make another one for next time” - L15-modelsel-4).
Exercise instances
- Exercise6.7, How many principal components should we use for the Credit dataset? Justify. (Read the explained-variance / CV-MSE curve, defend a cutoff.)
- Exercise6.8, Apply PCR on the Credit dataset and compare with the methods covered in Lecture 1 (OLS / subset-selection / ridge-regression / lasso).
(Slide labels these “Recommended exercise 7” and “Recommended exercise 11” respectively. Note: the slide-deck-labeled “Recommended exercise 11” maps to RecEx6 problem 8 in the actual exercise file.)
How it might appear on the exam
- Output interpretation: given a CV-MSE-vs-$M$ curve and / or a PVE / cumulative-variance plot, identify the optimal $M$ and explain reasoning (CV minimum or 90/95% variance threshold).
- Method comparison (recurring across past exams): PCR vs ridge vs lasso vs OLS, given a results table, explain which performed best and why. PCR ≈ ridge is the canonical answer; lasso wins if the truth is sparse.
- True / false: “PCR performs variable selection.” → False. “PCR can be seen as a discretized version of ridge regression.” → True. “PCR uses the response to choose components.” → False (PCA is unsupervised, that’s the contrast with PLS).
- Conceptual short answer: “What is PCR’s main weakness compared with PLS?” → PCA picks high-variance directions in $X$-space without consulting $y$, so if $y$ depends on a low-variance $X$-direction PCR misses it; PLS uses $y$ to choose directions instead.
- Pseudocode / equation-writing: “Write out the PCR procedure.” → standardize → PCA → CV over $M$ → fit OLS on first $M$ PCs → back-transform.
- Hand calculation (light): given eigenvalues, compute fraction of variance explained by first $M$ PCs via $\sum_{m=1}^{M} \lambda_m \big/ \sum_{m=1}^{p} \lambda_m$.
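A worked instance with invented eigenvalues: for $\lambda = (3, 1.5, 1, 0.5)$,
$$\text{PVE}(M = 2) = \frac{3 + 1.5}{3 + 1.5 + 1 + 0.5} = \frac{4.5}{6} = 0.75.$$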
Related
- principal-component-analysis: the unsupervised compression PCR rides on (deep treatment in module 10).
- ridge-regression: PCR’s smooth-shrinkage analogue; “PCR ≈ discretized ridge.”
- partial-least-squares: supervised cousin of PCR; uses $y$ to choose directions.
- lasso: the variable-selecting alternative; opposite philosophy from PCR.
- dimensionality-reduction: the broader family PCR belongs to; the same skeleton (“compress, then regress”) generalizes to nonlinear compressors.
- standardization: required preprocessing.
- collinearity: PCR’s main reason to exist.
- explained-variance-and-scree-plot: how PCA picks the “right” $M$ (variance threshold vs CV-MSE).
- cross-validation: for choosing honestly.