← Back to wiki
Module 06 — PCA, PCR & PLS
35 questions · 130 points · ~55 min
Focused drill on the dimension-reduction route through module 6:
principal-component analysis, principal-components regression, and partial least squares.
The deck mixes recall, computation, and scenario-interpretation questions with
Socratic understanding questions designed to probe the why behind each method —
active recall against ISLP §6.3 and §12.2.
Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.
In the PCR pipeline, the first principal component $Z_1 = \sum_j \phi_{j1} X_j$ of the standardized predictors is the linear combination that:
- A minimizes the within-cluster sum of squared distances from each $x_i$ to its nearest centroid.
- B maximizes $\text{Cov}(Z_1, Y)$ with $\sum_j \phi_{j1}^2 = 1$.
- C equals the single $X_j$ with the largest sample variance, the other loadings set to $0$.
- D maximizes $\text{Var}(Z_1)$ subject to $\sum_j \phi_{j1}^2 = 1$.
Show answer
Correct answer: D
The prof's verbatim framing: "find the best way of rotating our shit such that now it has a maximum variance." The unit-norm constraint exists exactly so the algorithm rotates rather than rescales.
A confuses PCA with K-means (within-cluster sum-of-squares). B describes PLS, which is supervised and uses $Y$ — PCA is unsupervised, so $Y$ never enters. C confuses "direction of largest variance" with "variable with largest variance"; PCA finds linear combinations, not single columns.
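A minimal NumPy sketch of the claim on synthetic data (not from the course): the leading eigenvector of the sample covariance is the unit-norm loading vector that maximizes $\text{Var}(Z_1)$, and no random unit vector beats it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # toy correlated predictors
X = (X - X.mean(0)) / X.std(0)                            # standardize

S = np.cov(X, rowvar=False)                               # sample covariance
vals, vecs = np.linalg.eigh(S)                            # eigenvalues in ascending order
phi1 = vecs[:, -1]                                        # PC1 loading vector (unit norm)

var_pc1 = phi1 @ S @ phi1                                 # Var(Z1) for the PCA solution
cand = rng.normal(size=(1000, 4))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)       # random unit-norm competitors
var_cand = np.einsum('ij,jk,ik->i', cand, S, cand)

print(var_pc1, var_cand.max())                            # PC1 wins over every competitor
```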
Atoms: principal-component-analysis, principal-component-regression. Lecture: L21-unsupervised-1. ISLP §12.2.1.
Suppose someone proposed dropping the unit-norm constraint $\sum_j \phi_{j1}^2 = 1$ from the PCA optimization. Why does the constraint exist — what specifically goes wrong without it?
- A Without the constraint, orthogonality between successive PCs breaks, so $Z_1$ and $Z_2$ carry overlapping information and the decorrelation guarantee that PCR relies on for stable downstream OLS is lost.
- B Without the constraint the optimization is unbounded — scaling $\phi_1$ by any positive constant inflates $\text{Var}(\phi_1^\top x)$ without bound, so "maximize variance" has no finite solution.
- C Without the constraint, PCA loses invariance to rotations of the coordinate system, so different column orderings of $X$ would yield genuinely different first principal components on the same dataset.
- D Without the constraint, the loadings $\phi_{j1}$ stop being interpretable as the correlations between $Z_1$ and the original $X_j$, which breaks the standard loading-as-correlation reading used in PCA biplots.
Show answer
Correct answer: B
The prof's verbatim explanation: "If you didn't subject that constraint, you could just make it … arbitrarily create high numbers, but also because we don't want to rescale the data." The unit-norm forces PCA to pick a direction, not a magnification — the variance comparison is meaningful only if all candidate $\phi$ live on the unit sphere.
A confuses what the constraint controls — orthogonality between PCs is enforced separately (each new PC is constrained orthogonal to previous ones), independent of unit-norm. C is wrong on dependency: column reordering doesn't change PCA's answer either way; orthonormal rotations preserve the constraint anyway. D inverts the relationship — the loading-as-correlation reading is a downstream property; the unit-norm constraint exists for boundedness, not interpretability.
Atoms: principal-component-analysis. Lecture: L21-unsupervised-1. ISLP §12.2.1.
Mark each statement about principal component analysis as true or false.
Show answer
- False — the prof's slide bullet read aloud: "PCA is not scale invariant." Without standardization, the largest-unit variable dominates PC1.
- True — loading vectors are unit norm by construction; this is the constraint that prevents PCA from cheating by inflating variance through scaling.
- True — "the principal component vector is unique. Of course, the sign flip is boring. It just means which direction it is."
- False — PCA is unsupervised; it sees only $X$. This is precisely the failure mode that motivates PLS.
- False — keeping all $p$ PCs is a rotation of coordinates. Reduction comes from truncating to the first $M < p$.
Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2.
PCA is sometimes called a "rotation." In what precise sense is this language exact, and where does it become merely useful shorthand?
- A PCA is exactly a rotation only on standardized data; on unstandardized data it becomes a stretching operation along the standard-deviation axis of each column, so the rotation language fails whenever variances differ across predictors.
- B The "rotation" is around the data centroid, applied to each observation in turn — after PCA the points sit at new coordinates while the original axes stay fixed in physical space.
- C Keeping all $p$ PCs is an exact orthonormal change of basis; truncating to $M < p$ adds the projection step. "Rotation" is shorthand for the whole rotate-then-truncate pipeline.
- D The rotation is exact only when no two eigenvalues are tied; ties between eigenvalues collapse the orthonormal basis into a degenerate form and break the rotation interpretation.
Show answer
Correct answer: C
The rotation is exact on the full set of $p$ PCs — you've just changed basis with an orthonormal matrix, which preserves all distances and total variance. The reduction comes from chopping at $M < p$ (the projection step). "PCA reduces dimension" is true only after the truncation, not at the rotation.
A confuses preprocessing with geometry — standardization affects which directions get picked, not whether the basis change is a rotation. B inverts what's rotated: PCA rotates the coordinate axes, leaving the points themselves fixed in physical space. D invents a degeneracy: tied eigenvalues leave the eigenvector basis non-unique within the tied subspace, but every choice is still orthonormal — the rotation interpretation survives.
Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2.2 develops the orthonormal-basis-and-projection geometry.
Question 5 · 5 points · Exam 2025 P2f
PCA is performed on a dataset with 6 standardized predictors. The eigenvalues of the sample covariance are:
$\lambda_1 = 3.0,\ \lambda_2 = 1.5,\ \lambda_3 = 0.8,\ \lambda_4 = 0.4,\ \lambda_5 = 0.2,\ \lambda_6 = 0.1.$
How many principal components must you keep to retain at least 90% of the total variance?
- A 4 components
- B 3 components
- C 5 components
- D 6 components
Show answer
Correct answer: A
Total variance $= 3.0 + 1.5 + 0.8 + 0.4 + 0.2 + 0.1 = 6$ (which equals $p$, since the variables are standardized). Cumulative PVE: $3.0/6 = 0.500$; $+1.5/6 = 0.750$; $+0.8/6 = 0.883$ (still under 0.90); $+0.4/6 = 0.950$ — threshold crossed at $M = 4$.
B stops at 0.883 and reads "almost 90%" as enough — 0.883 falls short of the 0.90 cutoff. C and D add components past the threshold; once 0.95 is reached, more components do not earn their keep.
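The same computation as a short NumPy check (eigenvalues taken from the question):

```python
import numpy as np

eigvals = np.array([3.0, 1.5, 0.8, 0.4, 0.2, 0.1])
pve = eigvals / eigvals.sum()           # proportion of variance explained per PC
cum = np.cumsum(pve)                    # ≈ [0.500, 0.750, 0.883, 0.950, 0.983, 1.000]
M = int(np.argmax(cum >= 0.90) + 1)     # first position where the running total reaches 0.90
print(M)                                # 4
```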
Atoms: explained-variance-and-scree-plot, principal-component-analysis. Lecture: L27-summary. ISLP §12.2.3.
Question 6 · 5 points · Exam 2025 P2f
Continuing the setup: the loading vector for PC1 is $\phi_1 = (0.7,\ 0.5,\ 0.4,\ 0.3,\ 0.1,\ 0.0)^\top$. For an observation with standardized values $x = (1,\ -1,\ 0.5,\ 0,\ 2,\ -2)$, what is the score $z_1 = \phi_1^\top x$?
- A $1.6$
- B $0.6$
- C $2.0$
- D $0.7$
Show answer
Correct answer: B
$z_1 = 0.7(1) + 0.5(-1) + 0.4(0.5) + 0.3(0) + 0.1(2) + 0.0(-2) = 0.7 - 0.5 + 0.2 + 0 + 0.2 + 0 = 0.6$. Show your work for partial credit.
A drops the signs on the negative entries of $x$ ($0.7 + 0.5 + 0.2 + 0 + 0.2 = 1.6$) — the absolute-value mistake. C ignores the multiplications and sums the loadings instead ($0.7 + 0.5 + 0.4 + 0.3 + 0.1 + 0 = 2.0$). D keeps only the largest contribution $\phi_{11} x_1 = 0.7$ and drops the rest.
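As a NumPy one-liner (values from the question), the score is just a dot product:

```python
import numpy as np

phi1 = np.array([0.7, 0.5, 0.4, 0.3, 0.1, 0.0])
x = np.array([1.0, -1.0, 0.5, 0.0, 2.0, -2.0])
z1 = phi1 @ x      # loadings dotted with the standardized observation
print(z1)          # 0.6 (up to floating-point rounding)
```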
Atoms: principal-component-analysis. Lecture: L27-summary. ISLP §12.2.1.
PCA on a centered dataset gives PC1 with variance $\lambda_1 = 7.2$, while the total variance of the data is $9.6$. What is the proportion of variance explained by PC1?
- A $0.50$
- B $0.25$
- C $0.75$
- D $0.87$
Show answer
Correct answer: C
$\text{PVE}_1 = \lambda_1 / \sum_k \lambda_k = 7.2 / 9.6 = 0.75$.
A arrives at 0.50 by mis-totalling the variance (e.g. using twice the true total: $7.2 / 14.4$). B reports the residual fraction $1 - \text{PVE}_1 = 2.4 / 9.6 = 0.25$ — the proportion not explained by PC1, not the proportion explained. D computes $\sqrt{0.75} \approx 0.87$ — confuses variance with standard deviation.
Atoms: explained-variance-and-scree-plot, principal-component-analysis. ISLP §12.2.3 (PVE definition and the $R^2 = 1 - \text{RSS}/\text{TSS}$ identity).
Suppose your $n$ observations lie almost exactly on a unit circle in $\mathbb{R}^2$, so the underlying structure is one-dimensional and curved. What does PCA report — and what does this tell you about its scope?
- A A single dominant PC capturing the circular structure exactly, since the data lie on an intrinsically 1-D manifold and PCA finds the manifold.
- B A first PC that traces along the circle non-linearly, with the second eigenvalue essentially zero — so the scree plot peaks sharply at PC1.
- C A flat scree plot with both eigenvalues equal to zero, since the data have no linear variance and PCA returns the null direction.
- D Two PCs of roughly equal variance — PCA's linear projections can't follow a curve, so the algorithm reports the data as effectively 2-D.
Show answer
Correct answer: D
PCA assumes linear structure. The prof: "PCA is kind of stuck just making linear stuff. It can't do anything non-linear." A circle is intrinsically 1-D but not linearly 1-D — the variance spreads across both x and y axes equally, so the scree plot looks flat and PCA gives no useful reduction. Nonlinear methods (kernel PCA, autoencoders) handle this; PCA does not.
A confuses "1-D manifold" with "1-D linear subspace" — only the latter is what PCA can find. B invents a non-linear PC: PCA's loadings define a linear combination of the original columns, never a curve. C is wrong on arithmetic: a unit circle has nonzero linear variance along both axes (that's the whole point — variance is spread, not absent).
Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2 (linear-projection framing throughout).
Mark each statement as true or false.
Show answer
- True — PCA is variance-driven; mixed units make the eigenvalue ranking meaningless without standardization.
- False — PCR inherits PCA's lack of scale invariance: rescaling one column changes which directions the PCs align with, and therefore the fit.
- True — every PC is a linear combination of all original $X_j$, so the back-transform $\hat\beta_j = \sum_m \hat\theta_m \phi_{jm}$ is generically nonzero. PCR is not a variable-selection method.
- True — the prof's USArrests demo: without scaling, PC1 loaded almost entirely on Assault (variance 6945) — a meaningless artifact of units (see the synthetic sketch after this list).
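A synthetic NumPy sketch of the same effect (made-up data standing in for USArrests): unscaled, PC1 chases the large-unit column; standardized, the loadings spread out.

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(scale=1.0, size=(200, 3))    # three predictors on a unit scale
big = rng.normal(scale=80.0, size=(200, 1))     # one predictor in much larger units
X = np.hstack([small, big])

def pc1_loadings(X):
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, -1]                           # leading eigenvector = PC1 loadings

print(pc1_loadings(X))                           # loads almost entirely on the large-unit column
print(pc1_loadings((X - X.mean(0)) / X.std(0)))  # after standardizing, the loadings spread out
```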
Atoms: standardization, principal-component-regression. Lecture: L14-modelsel-3. ISLP §12.2.4.
You run PCA on four standardized macroeconomic predictors ($X_1$ = GDP growth, $X_2$ = interest rate, $X_3$ = inflation, $X_4$ = unemployment) and obtain the first PC loadings $\phi_1 = (0.7,\ -0.6,\ 0.3,\ -0.2)^\top$. Which is the most accurate reading of PC1?
- A An "economic-activity" axis: high PC1 corresponds to high GDP growth and low interest rates, with smaller contributions from inflation and unemployment.
- B A weighted-average axis of all four variables, since the loadings are similar in absolute value and the unit-norm constraint forces near-equal weighting.
- C The response that PCR will regress $Y$ onto, since the loadings have been chosen to maximize $\text{Cov}(\phi_1^\top X, Y)$.
- D A three-variable axis: the negative loading on $X_2$ means interest rate has been excluded from the PCA, leaving GDP, inflation, and unemployment.
Show answer
Correct answer: A
Sign and magnitude both matter. Big positive loading on GDP and big negative loading on interest rate define the axis; smaller loadings contribute proportionally less. Standard prof move on USArrests: PC1 loads roughly equally on Murder/Assault/Rape (an "overall criminality" axis); same kind of reading here.
B is wrong on form: $|0.7|$ and $|0.2|$ differ by a factor of $3.5$ — not "roughly equal." C describes PLS, not PCA; PCA never sees $Y$. D confuses negative loadings with zero loadings; a negative loading means the variable contributes in the opposite direction, not that it is excluded.
Atoms: principal-component-analysis, dimensionality-reduction. ISLP §12.2.1 (loading interpretation, USArrests example).
Why is standardization mandatory for PCA / PCR / PLS but only cosmetic for OLS? Pick the explanation that most directly captures the asymmetry.
- A PCA's variance objective is unit-dependent — rescaling $X_j$ by $c$ multiplies its variance by $c^2$ and changes the PC ranking. OLS's $\hat\beta_j$ rescales by $1/c$ in compensation, so the fitted $\hat y$ is unchanged.
- B OLS uses an internal weighted-least-squares pass that adjusts for scale differences across columns of $X$, while PCA's eigendecomposition of the sample covariance has no such compensation step and so inherits raw-unit sensitivity.
- C OLS is fit by maximum likelihood under a Gaussian model, and Gaussian likelihoods are unit-invariant under linear rescaling by construction; PCA uses a quadratic-form covariance metric whose value depends on the absolute scale of each column, so it is not.
- D PCA's eigendecomposition only converges when all predictors share common units, otherwise the iteration on the covariance matrix fails to a numerical limit; OLS solves the closed-form normal equations directly and so handles mixed units natively.
Show answer
Correct answer: A
This is the load-bearing distinction between coefficient-fitting and direction-finding methods. OLS is equivariant under linear rescaling — the model accommodates the unit change automatically. PCA's objective is the variance of a particular linear combination, and variance scales with the square of any unit change, so the ranking of directions is unit-dependent. The prof's slide: "PCA is not scale invariant."
B fabricates an internal weighted-least-squares pass for OLS — there is none; OLS is a closed-form solver with no scale-correction machinery. C confuses likelihood with equivariance: under a Gaussian likelihood, OLS's invariance under rescaling is the same equivariance argument, not a likelihood-specific property — and PCA can be given a likelihood interpretation too (probabilistic PCA) without becoming scale-invariant. D fabricates a convergence dependency: eigendecomposition operates on a covariance matrix regardless of input units; the issue is interpretation of the eigenvalue ranking, not whether the algorithm runs.
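A short NumPy demonstration of the asymmetry on synthetic data: rescaling one column by $c$ leaves the OLS fitted values unchanged (its coefficient absorbs the factor $1/c$) while PC1 changes outright.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def pc1(X):
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs[:, -1]

c = 1000.0
Xr = X.copy()
Xr[:, 0] *= c                                  # change the units of X1

b, br = ols(X, y), ols(Xr, y)
print(np.isclose(br[0] * c, b[0]))             # True: the coefficient rescales by 1/c ...
print(np.allclose(X @ b, Xr @ br))             # True: ... so the fitted values are identical
print(pc1(X), pc1(Xr))                         # PC1 of the rescaled data points almost entirely along X1
```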
Atoms: standardization, principal-component-analysis. ISLP §12.2.4 ("Scaling the variables").
Which of the following describes the canonical PCR procedure with $M$ chosen by cross-validation?
- A Run PCA on the raw unstandardized $X$, then standardize the resulting components to unit variance, and fit OLS of $Y$ on the first $M$ standardized components.
- B Standardize $Y$, run PCA on the standardized response, then regress each $X_j$ on the first $M$ PCs of $Y$ and average the coefficients.
- C Fit OLS of $Y$ on $X$, run PCA on the residuals, and keep the first $M$ residual components as the new predictors in a refit.
- D Standardize $X$, run PCA, then for each $M = 1, \ldots, p$ fit OLS of $Y$ on the first $M$ PCs and pick the $M$ minimizing CV-MSE.
Show answer
Correct answer: D
The prof's verbatim recipe: "Standardize $X$, run PCA, get $Z_1, \ldots, Z_p$. Fit a standard linear regression of $Y$ on $Z_1, \ldots, Z_M$, where $M$ is a tuning parameter. Sweep $M$ from 1 up to $p$. Pick the $M$ that minimizes CV-MSE."
A standardizes the wrong object — components are already on the variance scale of the standardized $X$. B inverts the role of $X$ and $Y$ (PCA is on the predictors, not the response). C is residual-PCA — a different idea entirely; no method in the course works this way.
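A sketch of the recipe with scikit-learn, assuming a generic feature matrix `X` and response `y` (the toy arrays below are only stand-ins); `StandardScaler`, `PCA`, `LinearRegression`, and `GridSearchCV` are standard scikit-learn pieces, and the grid sweeps $M$ from $1$ to $p$.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))                       # stand-in predictors
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=150)

pcr = Pipeline([
    ("scale", StandardScaler()),                    # standardize X
    ("pca", PCA()),                                 # rotate to principal components
    ("ols", LinearRegression()),                    # regress y on the first M components
])

search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": range(1, X.shape[1] + 1)},   # sweep M = 1..p
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                          # the CV-selected M
```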
Atoms: principal-component-regression, cross-validation. Lecture: L15-modelsel-4. ISLP §6.3.1.
What is the relationship between PCA and PCR?
- A PCR is the supervised version of PCA: PCR maximizes $\text{Var}(Z \mid Y)$ to align the rotation with the response, whereas PCA uses unconditional $\text{Var}(Z)$.
- B PCA is unsupervised dimension reduction; PCR follows PCA with an OLS regression of $Y$ on the first $M$ components.
- C PCA produces orthogonal components but cannot regress on them; PCR is the modification adding an $L_2$ penalty on the PC scores to absorb collinearity.
- D PCR uses $\text{Cov}(X, Y)$ to choose its components, whereas PCA uses $\text{Var}(X)$ alone with no reference to $Y$.
Show answer
Correct answer: B
PCR = PCA front-end + OLS back-end. The PCA step is the same unsupervised algorithm covered in module 10; the regression step on the first $M$ PCs makes the pipeline supervised. The prof: "you're taking the X, you're squishing it down to fit your model, and then you go backwards to the original model again."
A is wrong on the criterion: PCA does not condition on $Y$ in any form — there is no $\text{Var}(Z \mid Y)$ step in either PCA or PCR; both use the unsupervised $\text{Var}(Z)$ objective on $X$ alone. C invents an $L_2$ penalty that PCR does not have — PCR's mechanism is hard truncation at $M$, not shrinkage on PC scores; PCA also handles correlated predictors fine. D describes PLS, not PCR; PCR uses PCA's variance-maximizing components.
Atoms: principal-component-regression, principal-component-analysis. Lecture: L14-modelsel-3. ISLP §6.3.1.
The prof framed PCR as "a discretized version of ridge regression." Which sentence best captures the analogy?
- A Both methods pressure the small-eigenvalue PC directions; PCR drops them past the cutoff $M$, while ridge shrinks them smoothly via $\lambda_j^2 / (\lambda_j^2 + \lambda)$, heaviest on the smallest eigenvalues.
- B Both methods solve the same penalized least-squares problem on the original predictors; ridge writes it in its standard Lagrangian form, while PCR is the equivalent budget-constraint form $\sum_j \beta_j^2 \leq M$ with $M$ playing the role of the budget.
- C PCR equals ridge regression exactly in the limit $\lambda \to 0$; outside that limit the two methods are mathematically unrelated and their fitted coefficients behave differently as functions of the tuning parameter.
- D Both methods perform exact variable selection on the original $X$ by zeroing out the smallest original-$X$ coefficients $\hat\beta_j$ once their estimated effect drops below a threshold set by the tuning parameter.
Show answer
Correct answer: A
Verbatim: "Higher pressure on less important PCs. PCR discards the $p - M$ smallest eigenvalue components." Ridge applies the smooth shrinkage factor $\lambda_j^2 / (\lambda_j^2 + \lambda)$ on each PC direction — heavier on small eigenvalues. Same target (small-variance directions), different shape (hard truncation vs. smooth shrinkage).
B invents a budget formulation: PCR has no penalized-loss representation; its mechanism is hard truncation in PC space, not a constraint on $\sum_j \beta_j^2$. C misreads the relationship — ridge at $\lambda = 0$ and PCR at $M = p$ both reduce to OLS, so the two coincide only at the no-shrinkage endpoint; away from it they remain closely analogous (same target directions, different shrinkage shape), not "mathematically unrelated." D is wrong on both counts: neither PCR nor ridge does exact variable selection — that is lasso's job.
Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1.
Both ridge and PCR put pressure on the small-eigenvalue PC directions. Why does this happen — what is the deeper mathematical reason these two methods end up doing similar things?
- A Both methods are penalized least-squares estimators sharing a common $L_2$ regularizer applied directly to the original-$X$ coefficients $\hat\beta_j$, with the tuning parameter ($\lambda$ for ridge, $M$ for PCR) controlling the strength of the same shared penalty term.
- B Small-eigenvalue PC directions are where OLS estimates are unstable — their variance scales as $\sigma^2 / \lambda_j$. Both methods damp this: ridge smoothly via $\lambda_j^2 / (\lambda_j^2 + \lambda)$, PCR by hard truncation past $M$.
- C Both methods preprocess $X$ via singular-value decomposition and then solve the downstream regression on the resulting orthogonal columns, which gives them mathematically identical bias profiles for any choice of their tuning parameters.
- D Both methods are limiting cases of lasso along the elastic-net family: ridge as $\alpha \to 0$ in the standard elastic-net mixing parameter, and PCR as $\alpha \to 1$ but with an additional hard-thresholding step applied to the principal components.
Show answer
Correct answer: B
This is the load-bearing reason behind the discretized-ridge framing. Decomposing OLS along the eigen-basis of $X^\top X$, the coefficient on the $j$-th eigendirection has variance proportional to $1/\lambda_j$ — small-eigenvalue directions are the noisy ones. Both regularizers know this implicitly: ridge attenuates each direction by a smooth factor that bites hardest on small $\lambda_j$, and PCR drops the smallest $\lambda_j$ directions outright. Same target, different shape.
A is wrong about PCR's formulation: ridge minimizes $\|y - X\beta\|^2 + \lambda\|\beta\|^2$, but PCR has no penalty term — it works by hard truncation in PC space, not by penalized loss. C is mechanically wrong: PCR uses an eigendecomposition / SVD of $X$, while standard ridge solvers do not require one. D inverts the family relations: lasso is its own animal, not a parent of ridge or PCR; ridge is not "elastic net at $\alpha = 0$" running through a lasso limit.
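A NumPy sketch of the shared eigen-direction view on synthetic centered data, writing $d_j$ for the singular values of $X$: ridge applies the smooth filter $d_j^2 / (d_j^2 + \lambda)$ along each direction, while PCR applies a 0/1 filter that keeps only the first $M$.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(0)                                     # centered predictors
y = rng.normal(size=100)
y = y - y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)      # X = U diag(d) Vt
uy = U.T @ y                                          # y's coordinates along the PC directions

lam, M = 10.0, 3
ridge_filter = d**2 / (d**2 + lam)                    # smooth shrinkage, heaviest where d_j is small
pcr_filter = (np.arange(len(d)) < M).astype(float)    # hard truncation: keep the first M directions

yhat_ridge = U @ (ridge_filter * uy)
yhat_pcr = U @ (pcr_filter * uy)

# Sanity checks: both filters reproduce the usual fits.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(yhat_ridge, X @ beta_ridge))        # True: closed-form ridge
Z = X @ Vt[:M].T                                      # scores on the first M PCs
theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(np.allclose(yhat_pcr, Z @ theta))               # True: OLS on the kept components
```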
Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.2.3 + §6.3.1 (the eigenbasis view of ridge and the PCR-as-discretized-ridge bridge).
Mark each statement about principal-components regression (PCR) as true or false.
Show answer
- False — every PC is a linear combination of all $X_j$, so all back-transformed $\hat\beta_j$ are typically nonzero. The prof: "Just like Ridge, [PCR] doesn't actually select the parameters." Lifted from Exam 2023 P5.
- True — verbatim slide quote: "PCR can be seen as a discretized version of ridge regression."
- True — direct from the slide-flagged "constrained interpretation": dimension reduction constrains $\hat\beta$ to live in the $M$-dimensional subspace spanned by the kept loadings.
- False — PCA (the front end of PCR) is unsupervised. This is exactly the no-guarantee-the-high-variance-directions-relate-to-$Y$ pitfall, fixed by PLS.
- True — at $M = p$ the rotation is invertible and OLS on the rotated predictors gives the same fit as OLS on the originals (under standardization, the $\hat y$'s match exactly).
Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1.
On the Credit dataset, the prof reported that PCR needed $M \approx 10$ components (out of $\sim 11$) to perform well, while ridge regression performed comparably keeping all directions but shrinking. Why doesn't ridge suffer the same failure?
- A Ridge places its hyperparameter $\lambda$ on the $L_1$ norm of $\hat\beta$, which automatically picks the right small set of components for the response.
- B Ridge runs an internal cross-validation pass for each predictor separately, so it can adapt to which raw $X_j$ drives $Y$ regardless of the predictor's $X$-variance and re-fit its shrinkage strength column-by-column on the Credit data.
- C Ridge keeps every PC direction, just shrinking the small-eigenvalue ones — a low-$X$-variance but $Y$-predictive direction is retained at smaller magnitude, while PCR's hard truncation drops it.
- D Ridge applies the same shrinkage factor to every direction, equalizing their importance across the eigen-spectrum and so picking up small-$X$-variance directions automatically alongside the dominant high-variance ones in the Credit fit.
Show answer
Correct answer: C
This is the resolution of the Credit puzzle. Income is a low-$X$-variance but $Y$-predictive direction. PCR's hard threshold throws it away unless $M$ reaches that direction's position in the variance ordering (which happened to be 10). Ridge keeps it in the model from the start, just shrunk — so a small-but-nonzero contribution is recovered at every $\lambda$. Smoothness wins exactly when the relevant direction sits low in the variance ranking.
A misnames ridge's penalty (it's $L_2$, not $L_1$) and conflates penalty type with selection — even with $L_1$ ridge wouldn't be picking "components" anyway. B fabricates a per-predictor CV: ridge does a single CV pass over $\lambda$, not one per variable. D contradicts the slide-flagged factor $\lambda_j^2 / (\lambda_j^2 + \lambda)$, which gives different shrinkage to different directions (heavier on small eigenvalues, not equal).
Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1, Figure 6.20 (PCR / PLS / ridge / lasso compared on simulated data).
On the Credit dataset, the prof reported that PCR's CV-MSE was minimized at $M = 10$ out of $\sim 11$ available components — almost no dimension reduction. What does this tell you?
- A PCR overfit the Credit data because $M$ was chosen too close to $p$, so the cross-validated test error climbed at large $M$ — the textbook overfitting signature of a model that retained too many components for its sample size.
- B The high-variance directions of $X$ were not the directions driving $Y$; income was a low-$X$-variance but predictive feature, so PCR needed many components before that signal entered.
- C The Credit predictors are so strongly collinear that each PC captures only a fragment of the $Y$-relevant direction, so $M$ has to be large to reassemble the signal.
- D PCA was run without standardization, so one large-unit predictor dominated the leading PCs and the rest of the components were needed to recover the signal.
Show answer
Correct answer: B
Verbatim: "Even though the things that you care about were just ended up being four variables, primarily, it took you 10 PCs to get there." Income happened not to vary much in $X$-space, so the relevant direction was buried late in the PC ordering. This is the canonical failure of PCR's assumption (2) — that high-$X$-variance directions are also $Y$-predictive.
A inverts cause and diagnosis: $M$ being close to $p$ is the symptom of the failure mode, not its cause, and overfitting would push CV-MSE up at large $M$, the opposite of what was observed. C describes a different failure mode (rank-deficient collinearity) and contradicts PCA's geometry — strong collinearity collapses signal into fewer PCs, not more. D is a real PCR pitfall but not the one at play here — the prof's pipeline standardized $X$ correctly.
Atoms: principal-component-regression, partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.1.
You run PCR with $M = 2$ on three standardized predictors. The first two loading vectors are $\phi_1 = (0.7,\ 0.5,\ 0.5)^\top$ and $\phi_2 = (-0.3,\ 0.6,\ 0.7)^\top$. The OLS regression of $Y$ on the components gives $\hat\theta_1 = 4$ and $\hat\theta_2 = 2$. What is the back-transformed coefficient $\hat\beta_2$ on the second original predictor?
- A $1.2$
- B $2.0$
- C $1.1$
- D $3.2$
Show answer
Correct answer: D
Back-transform formula: $\hat\beta_j = \sum_{m=1}^M \hat\theta_m \phi_{jm}$. So $\hat\beta_2 = \hat\theta_1 \phi_{21} + \hat\theta_2 \phi_{22} = 4(0.5) + 2(0.6) = 2 + 1.2 = 3.2$.
A keeps only the second-component term ($2 \cdot 0.6 = 1.2$), forgetting the contribution from $\hat\theta_1$. B keeps only the first-component term ($4 \cdot 0.5 = 2.0$). C adds the loadings without weighting by the $\hat\theta_m$'s ($0.5 + 0.6 = 1.1$).
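The back-transform as a matrix product (numbers from the question):

```python
import numpy as np

Phi = np.array([[ 0.7, -0.3],
                [ 0.5,  0.6],
                [ 0.5,  0.7]])       # columns are the loading vectors phi_1, phi_2
theta = np.array([4.0, 2.0])         # OLS coefficients on the two components

beta = Phi @ theta                   # beta_j = sum_m theta_m * phi_{jm}
print(beta)                          # [2.2, 3.2, 3.4]; beta_2 = 4(0.5) + 2(0.6) = 3.2
```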
Atoms: principal-component-regression. Lecture: L14-modelsel-3. ISLP §6.3.1.
PCR with $M = 1$ keeps only one component out of $p = 20$, producing a model with effectively one feature. Why is this still not a form of variable selection?
- A Variable selection requires explicit hypothesis testing or $p$-value thresholding on each individual $X_j$, machinery PCR omits entirely in favour of unsupervised variance-driven thresholds at the component level.
- B $M = 1$ produces a model too simple to count as having identified any predictors, regardless of which $X_j$ end up in the back-transform — no method with only one degree of freedom can be doing real selection.
- C PCR uses an $L_2$ penalty internally on the principal-component coefficients, and $L_2$ regularization cannot drive coefficients exactly to zero in finite samples — so no predictor is ever eliminated from the fitted model.
- D The single component is a linear combination of all $p$ predictors, so the back-transform $\hat\beta_j = \hat\theta_1 \phi_{j1}$ is nonzero on every $X_j$ — no predictor has been excluded.
Show answer
Correct answer: D
The core insight is the back-transform. The model in PC-space looks one-dimensional, but the equivalent model in $X$-space uses all 20 predictors with weights $\hat\theta_1 \phi_{j1}$ each. Variable selection requires actual zeros on $\hat\beta_j$ — and unless $\phi_{j1} = 0$ exactly (almost never the case), every original predictor stays in.
A confuses procedure with effect — variable selection can be achieved without hypothesis tests (lasso is the obvious example, and best-subset another). B is too vague; the question is about the structural property of the fit, not whether the model is good. C misnames PCR's mechanism: PCR has no explicit penalty, it has a hard truncation. The $L_2$-can't-zero claim is true for ridge but irrelevant to PCR.
Atoms: principal-component-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.3.1, closing: "PCR does not perform feature selection."
What is the principled way to choose the number of components $M$ in PCR?
- A Pick the $M$ at the elbow of the scree plot of $X$, since PCA's diagnostic identifies the natural variance cutoff.
- B Set $M = p$ to retain 100% of the variance in $X$ and let OLS handle the rest of the variance budget.
- C Cross-validate the downstream regression on $Y$ and pick the $M$ at the CV-MSE minimum, optionally via the 1-SE rule.
- D Pick the smallest $M$ for which the cumulative PVE in $X$ reaches a threshold like 95%, since that captures most of the input variance.
Show answer
Correct answer: C
For supervised use ($Y$ available), CV the downstream regression. This is what the prof actually does in PCR; from the dimensionality-reduction atom: "Treat $M$ like any other tuning parameter. … Cross-validate $M$ against the downstream model — much more principled than the elbow on a scree plot."
A is the unsupervised criterion; it ignores how well the kept components actually predict $Y$. B is no reduction at all — defeats PCR's purpose. D uses an $X$-only criterion, which can miss directions that drive $Y$ (the Credit example: 95% PVE arrives well before the income-driving direction, so the $X$-only cutoff would be wrong here).
Atoms: principal-component-regression, cross-validation, explained-variance-and-scree-plot. Lecture: L15-modelsel-4. ISLP §6.3.1.
Why does PCR with $M = p$ produce exactly the same fitted values $\hat y$ as OLS on the standardized predictors?
- A A rotation to all $p$ orthogonal PC directions preserves $\text{col}(X)$, so fitting $Y$ on the rotated columns gives the same projection of $Y$ onto $\text{col}(X)$ that OLS does.
- B With $M = p$, PCR's tuning parameter effectively equals zero, so the discretized-ridge analogy means it coincides with ridge at $\lambda = 0$, which itself equals OLS.
- C When $M = p$, the back-transform formula $\hat\beta_j = \sum_m \hat\theta_m \phi_{jm}$ collapses to the closed-form lasso solution because the loadings span all of $\mathbb{R}^p$.
- D When $M = p$, the back-transform sums over all $p$ components and so multiplies the OLS fit by an extra $\Phi^\top \Phi$ rotation factor, but the orthogonality of $\Phi$ cancels it.
Show answer
Correct answer: A
This is the linear-algebra anchor. OLS finds $\hat y$ as the projection of $Y$ onto the column space of $X$. A change of basis to orthogonal PCs does not change which subspace you are projecting onto — it just expresses the same projection in different coordinates. Reduction occurs only when $M < p$, because then you are projecting onto a strict subspace of $\text{col}(X)$.
B routes through ridge unnecessarily — both PCR ($M=p$) and ridge ($\lambda=0$) coincide with OLS, but the equivalence comes directly from the rotation argument, not from passing through ridge. C invokes lasso for no reason; PCR's back-transform is a sum over loadings, not a soft-thresholding solution. D invents a $\Phi^\top \Phi$ correction: $\Phi$ is orthonormal, so $\Phi^\top \Phi = I$ — there is no extra factor to cancel, the OLS fit is recovered immediately.
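A quick NumPy check on synthetic data: regressing on all $p$ component scores reproduces the OLS fitted values exactly, because the rotation preserves the column space.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
X = (X - X.mean(0)) / X.std(0)                    # standardized predictors
y = rng.normal(size=60)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T                                      # scores on all p PCs (M = p, nothing truncated)

def fitted(A, y):
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

print(np.allclose(fitted(X, y), fitted(Z, y)))    # True: same column space, same projection of y
```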
Atoms: principal-component-regression. Lecture: L14-modelsel-3. ISLP §6.3.1, opening pages.
Mark each statement as true or false.
Show answer
- True — L1 produces exact zeros at corners of the constraint region; PCR's back-transform sums over loadings and is generically nonzero.
- True — the canonical "discretized ridge" framing.
- False — the directions are reversed: lasso typically keeps one and zeroes the other (it picks); ridge typically keeps both, shrunk, and shares the weight between them.
- True — verbatim from the prof: "Use PCR/ridge when you want predictive power and don't care which raw variables drive it. Use lasso or subset selection when you need this or that one." PLS sits with PCR/ridge in this triage.
Atoms: principal-component-regression, ridge-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.2 + §6.3.
In partial least squares (PLS) regression, the first component $Z_1 = \sum_j \phi_{j1} X_j$ is constructed to maximize:
- A $\text{Var}(Z_1)$ subject to $\sum_j \phi_{j1}^2 = 1$.
- B $\sum_j |\phi_{j1}|$ subject to a fixed $\text{Var}(Z_1)$.
- C the joint log-likelihood of $(X, Y)$ under a Gaussian model.
- D $\text{Cov}(Z_1, Y)$ subject to $\sum_j \phi_{j1}^2 = 1$.
Show answer
Correct answer: D
Verbatim from the prof: "It's the same idea as the principal component analysis, only now you're finding the principal components not as the directions of maximal variance of $X$, but the maximal covariance of $X$ and $Y$." So PLS is supervised; PCA / PCR is unsupervised.
A is the PCA / PCR criterion (variance, no $Y$). B mixes PLS with the L1 penalty of lasso — PLS has no sparsity penalty. C invokes a generative likelihood that PLS does not assume; PLS is a deterministic algorithm, not a likelihood method.
Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.
In the PLS algorithm, the loading $\phi_{j1}$ for the first component is set to:
- A the $j$-th component of the eigenvector of $X^\top X$ corresponding to the largest eigenvalue, normalized to unit length.
- B the OLS coefficient from the simple regression of $Y$ on $X_j$ alone, which is proportional to $\text{Corr}(Y, X_j)$.
- C $1/p$ for every $j$, so PLS averages all $p$ predictors with equal weight when forming its first component $Z_1$.
- D zero for every $j$ except the predictor with the largest sample variance, with that single loading set to $1$.
Show answer
Correct answer: B
Verbatim slide: "$\phi_{j1}$ is the coefficient from the simple linear regression of $Y$ onto $X_j$. This coefficient is proportional to the correlation between $Y$ and $X_j$. PLS puts highest weight on the variables that are most strongly related to the response." That is the entire supervised twist over PCA.
A is the PCA recipe (variance, no $Y$). C is a flat-weight average that PLS does not use — and would carry no $Y$-information. D is variable selection, which neither PLS nor PCR performs.
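A minimal NumPy sketch of the recipe on synthetic standardized data: each loading is the simple-regression coefficient of $y$ on that column (proportional to the correlation), and $Z_1$ weights the predictors by that solo relevance. Any unit-norm rescaling of $\phi_1$ only rescales $Z_1$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
X = (X - X.mean(0)) / X.std(0)          # standardized predictors
y = 2 * X[:, 0] + X[:, 3] + rng.normal(size=200)
y = y - y.mean()

# Simple-regression coefficient of y on each X_j: <x_j, y> / <x_j, x_j>
phi1 = (X.T @ y) / (X**2).sum(axis=0)
print(phi1)                              # largest weights on the columns most correlated with y

Z1 = X @ phi1                            # first PLS component: predictors weighted by solo relevance
```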
Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.
PLS sets $\phi_{j1}$ proportional to the simple-regression coefficient of $Y$ on $X_j$. What is the intuition — why is this rule the right one for the supervised criterion PLS optimizes?
- A Simple regression of $Y$ on each $X_j$ is the closed-form solution to PCA's variance-maximization problem when the unit-norm constraint $\sum_j \phi_{j1}^2 = 1$ is dropped, and PLS adopts that closed-form because it inherits PCA's optimization directly with no further modification.
- B The rule is required for the unit-norm constraint $\sum_j \phi_{j1}^2 = 1$ to hold automatically without an explicit normalization step in the algorithm — simple-regression coefficients on standardized predictors already form a unit vector by construction.
- C To maximize $\text{Cov}(Z_1, Y)$ you weight each $X_j$ by how strongly it alone predicts $Y$, and the simple-regression coefficient of $Y$ on $X_j$ is exactly that relevance metric ($\propto \text{Corr}(Y, X_j)$).
- D Simple-regression coefficients are unit-invariant under linear rescaling of each column, so weighting by them makes PLS fully scale-invariant — unlike PCA, this removes any need to standardize the predictors before fitting the algorithm.
Show answer
Correct answer: C
This is the supervised twist: PLS asks "which $X_j$'s predict $Y$ on their own?" and uses the answer as loadings. Variables with strong simple-regression coefficients (equivalently, high $|\text{Corr}(Y, X_j)|$) dominate $Z_1$ — directly serving the $\text{Cov}(Z_1, Y)$ objective. PCA, by contrast, asks "which $X_j$'s vary most and co-vary together?" with no $Y$ reference.
A confuses PLS with PCA — and PCA without the unit-norm constraint is unbounded, not solved by simple regression. B is wrong on mechanics: the unit-norm constraint is enforced by an explicit normalization step in the algorithm, not by the choice of loading values. D is doubly wrong — PLS still requires standardization (same as PCA), and even if simple-regression coefficients were unit-invariant they would not make the method scale-invariant in the way claimed.
Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.
Mark each statement about partial least squares (PLS) as true or false.
Show answer
- True — PLS maximizes $\text{Cov}(\phi^\top X, Y)$ at every step; that is the entire reason PLS exists separate from PCR.
- True — like PCR, PLS rotates predictors into composite components, so back-transformed coefficients are linear combinations of all $X_j$ and generically nonzero.
- False — PLS is not a variable-selection method. Selecting variables (exact zeros) is what lasso does.
- True — slide-flagged: "We regress each variable on $Z_1$ and take the residuals … We then compute $Z_2$ using this orthogonalized data."
Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.
You are predicting $Y$ from standardized predictors. Through prior knowledge you know that $Y$ depends strongly on a particular predictor $X_3$ that happens to have a small variance relative to the other predictors. Compared with PCR, PLS will generally:
- A throw $X_3$ away earlier than PCR, since PLS prioritizes high-variance directions of $X$ and drops low-variance ones outright at each step.
- B behave identically to PCR on $X_3$, since both methods are unsupervised dimension-reduction tools that only see the design matrix $X$ when constructing components and so weight each column by exactly the same variance-driven criterion.
- C weight $X_3$ identically to PCR, since both methods rank predictors by sample variance after standardization brings every $X_j$ to unit variance.
- D weight $X_3$ heavily in $Z_1$ via its large simple-regression coefficient on $Y$, while PCR pushes $X_3$ into a late component because its $X$-variance is small.
Show answer
Correct answer: D
This is the textbook PCR-vs-PLS contrast. PCR picks directions by $\text{Var}(X)$ — small-variance predictive directions get buried late in the ordering (the Credit / income story). PLS picks by $\text{Cov}(X, Y)$ — a strong simple-regression coefficient on $Y$ feeds directly into a heavy loading on the first component.
A reverses the direction: PCR (not PLS) is the one prioritizing high-$X$-variance. B misclassifies PLS — PCA is unsupervised, but PLS is supervised, that is the whole point. C confuses standardization with relevance: standardization makes variances equal but not covariances with $Y$; PLS still picks by covariance with $Y$, PCR still by variance, so they disagree on $X_3$.
Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.
Mark each statement about PLS's empirical behavior as true or false.
Show answer
- False — slide-flagged and read aloud: "In practice, PLS often performs no better than ridge regression or PCR but it's Swedish, so it's like they're meatballs, they're not better but they sound good."
- True — slide bullet: "Supervised dimension reduction of PLS can reduce bias. It also has the potential to increase variance."
- True — verbatim slide: "PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps."
- True — same as PCR; the back-transform is a sum over loadings, generically nonzero.
Atoms: partial-least-squares, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.
Mark each statement about PCA, PCR, PLS, ridge, and lasso as true or false.
Show answer
- False — ridge handles collinearity by sharing weight smoothly across correlated predictors, and PCR is built explicitly to attack collinearity by rotating to orthogonal components. Reverses the truth.
- True — only the L1 penalty produces exact-zero solutions among these four; ridge shrinks to small but nonzero values, and PCR's back-transform is a sum over loadings.
- True — the canonical supervised-vs-unsupervised contrast.
- True — slide-flagged: lasso wins when sparsity is the right structural assumption; PCR / PLS / ridge are non-sparse smoothers and dilute the signal across components.
- False — PCA maximizes $\text{Var}(Z)$ on $X$ alone; PLS maximizes $\text{Cov}(Z, Y)$. Substituting $Y = X$ into the PLS criterion would give $\text{Cov}(Z, X)$, not $\text{Var}(Z)$ — and the PLS algorithm uses $Y$ to set $\phi_{j1}$ via the simple regression of $Y$ on $X_j$, which is degenerate when $Y = X$.
Atoms: principal-component-regression, partial-least-squares, ridge-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.3.
Two analyses fit the same regression: one with PCR ($M = 4$), one with ridge (with $\lambda$ chosen by CV). They report essentially identical CV-MSE. The prof would lean toward reporting which, and why?
- A PCR — keeping $M = 4$ components means a more parsimonious model than ridge's full-rank fit on every direction, and Occam's razor favours the smaller model when two candidates tie on cross-validated test error.
- B Ridge — its $\lambda$ moves continuously, so a small CV reshuffle doesn't flip the model; PCR's $M$ jumps in unit steps and can swing $\hat\beta$ across resamples.
- C PCR — the $M = 4$ representation is a discrete, easily reportable choice and so is more interpretable than a continuous $\lambda$.
- D PCR — its hard truncation produces a sparser back-transformed $\hat\beta$ in original-$X$ space than ridge does, giving a cleaner variable-importance story.
Show answer
Correct answer: B
When two methods tie on test performance, the prof prefers the more stable one — the one whose answer wouldn't shift under a small CV-fold reshuffle. Verbatim: "ridge regression behaves smoothly instead of this kind of discrete thing which PCR and PLS both have." Ridge wins on smoothness: $\lambda$ moves continuously, so a tiny change in CV doesn't flip the model. PCR's $M$ jumps in unit steps; a small CV change can switch $M$ from 4 to 5 and substantially change the back-transformed $\hat\beta$.
A confuses parsimony in PC-space with parsimony in $X$-space — PCR's back-transform is generically nonzero on every $X_j$, so neither model is more parsimonious in raw variables. C is a weak interpretability argument: $\lambda$ is just as reportable as $M$ once you understand each method, and stability is the more load-bearing tie-break. D is the canonical wrong belief tested in Q20: PCR does not produce sparse $\hat\beta_j$ — every back-transformed coefficient is generically nonzero.
Atoms: principal-component-regression, ridge-regression, cross-validation. Lecture: L15-modelsel-4. ISLP §6.3.1 + §6.6 (exercises 9–10 contrast PCR/PLS/ridge/lasso on identical datasets).
You are predicting credit-card spend from $p = 50$ candidate predictors, many of which are nearly collinear. A stated goal is interpretability: you want to report which raw variables drive $Y$. Which method best fits the goals?
- A Ridge regression: handles collinearity by shrinking smoothly and sharing weight across correlated predictors.
- B PCR: rotates correlated predictors into orthogonal PCs and drops the small-variance ones via the $M$ cutoff.
- C Lasso: handles collinearity by zeroing out non-driving predictors, leaving an interpretable subset of raw variables.
- D PLS: same skeleton as PCR but supervised, using $\text{Cov}(X, Y)$ to pick directions while still rotating predictors.
Show answer
Correct answer: C
"Use lasso or subset selection when you need 'this or that one.'" Lasso both regularizes and selects, so it is the only method here whose output points at specific raw variables.
A retains all 50 predictors with shrunken coefficients — every $\hat\beta_j$ is nonzero, so no variable selection. B and D both produce components that are linear combinations of all 50 variables, so the back-transformed $\hat\beta_j$ are generically nonzero — no clean variable interpretation. The prof's verbatim triage: "Use PCR/ridge when you want predictive power and don't care which raw variables drive it. Use lasso or subset selection when you need this or that one."
Atoms: lasso, principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.2 + §6.3.
Mark each statement about choosing $M$ and reading PCA diagnostics as true or false.
Show answer
- True — supervised task → cross-validate the supervised criterion. ISL is candid that the scree elbow "is inherently ad hoc"; CV is the principled alternative when $Y$ is available.
- False — ISL: "this type of visual analysis is inherently ad hoc … there is no well-accepted objective way to decide." Two readers can differ by ±1 PC.
- True — each added eigenvalue is non-negative, so the running sum cannot decrease. A drop signals a coding bug.
- False — the opposite. Flat scree means no real low-dim structure; PCA is not earning its keep. Strong structure shows up as a sharp early drop.
Atoms: explained-variance-and-scree-plot, principal-component-regression, cross-validation. ISLP §12.2.3 + §12.2.4.
Which of the following is best classified as variable selection rather than dimensionality reduction?
- A Lasso with $\lambda > 0$.
- B PLS with $M = 2$.
- C An autoencoder mapping $\mathbb{R}^{100} \to \mathbb{R}^{10}$.
- D LDA as a $K-1$ projection.
Show answer
Correct answer: A
The prof's distinction: dimensionality reduction "gives you components, and each component is a combination of the original axes," while variable selection (lasso, subset selection) "actually selects the parameters." Lasso's L1 penalty zeroes some $\hat\beta_j$ outright — that is selection.
B and C are dimensionality reduction (PLS produces composite components; the autoencoder produces 10 learned features that are nonlinear combinations of the 100 inputs). D is the LDA-as-projection view from module 4 — also dimension reduction; a $K-1$-dim projection of the predictors is a compressed representation, not a selection of original variables.
Atoms: lasso, dimensionality-reduction, partial-least-squares. ISLP §6.2.2 (lasso) + §6.3 (dim reduction umbrella).
When the true response depends on a small subset of the original predictors, which method's signature failure is to scatter that signal across many components, leaving no clean variable interpretation in the back-transformed coefficients?
- A Lasso, because the $L_1$ penalty enforces sparsity in $\hat\beta$ that mirrors the sparse-truth assumption directly.
- B Ridge regression, because its $L_2$ penalty shrinks every original-$X$ coefficient toward zero by the same proportional factor, blurring the boundary between the truly active subset and the truly null variables across the whole coefficient vector.
- C PCR, because every PC is a linear combination of all original predictors, so the back-transformed $\hat\beta_j$ are generically all nonzero.
- D Best-subset selection, because enumerating every subset spreads the variance across many candidate models and obscures the sparse truth.
Show answer
Correct answer: C
PCR rotates the predictors before regressing, so a sparse truth in original-$X$ space becomes a dense truth in PC space — and the back-transformed coefficients are generically all nonzero. PLS has the same issue. The prof's framing: "Use lasso or subset selection when you need this or that one."
A inverts the direction: lasso's sparsity is the cure, not the failure. B is wrong on details: ridge does not shrink every coefficient by the same proportion — the per-PC factor $\lambda_j^2/(\lambda_j^2 + \lambda)$ is heavier on small eigenvalues. D inverts what best-subset does: it picks one subset and reports its $\hat\beta$, giving clean variable identification (at compute cost) — it does not scatter signal across models.
Atoms: principal-component-regression, lasso, partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.1.