
Module 06 — PCA, PCR & PLS

35 questions · 130 points · ~55 min

Focused drill on the dimension-reduction route through module 6: principal-component analysis, principal-components regression, and partial least squares. The deck mixes recall, computation, and scenario-interpretation questions with Socratic understanding questions designed to probe the why behind each method — active recall against ISLP §6.3 and §12.2.

Question 1 4 points

In the PCR pipeline, the first principal component $Z_1 = \sum_j \phi_{j1} X_j$ of the standardized predictors is the linear combination that:

Correct answer: D

The prof's verbatim framing: "find the best way of rotating our shit such that now it has a maximum variance." The unit-norm constraint exists exactly so the algorithm rotates rather than rescales.

A confuses PCA with K-means (within-cluster sum-of-squares). B describes PLS, which is supervised and uses $Y$ — PCA is unsupervised, so $Y$ never enters. C confuses "direction of largest variance" with "variable with largest variance"; PCA finds linear combinations, not single columns.
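
A minimal numpy sketch of the claim, on synthetic toy data (all names and data illustrative): the top eigenvector of the sample covariance is the unit-norm loading $\phi_1$, and the variance of the scores equals the top eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))  # correlated toy data
X = (X - X.mean(axis=0)) / X.std(axis=0)                         # standardize

# eigh returns eigenvalues in ascending order; the last column is PC1's loading.
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
phi1 = evecs[:, -1]

z1 = X @ phi1                                # PC1 scores
print(np.isclose(phi1 @ phi1, 1.0))          # unit-norm constraint holds
print(np.var(z1, ddof=1), evals[-1])         # score variance equals the top eigenvalue
```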

Atoms: principal-component-analysis, principal-component-regression. Lecture: L21-unsupervised-1. ISLP §12.2.1.

Question 2 3 points

Suppose someone proposed dropping the unit-norm constraint $\sum_j \phi_{j1}^2 = 1$ from the PCA optimization. Why does the constraint exist — what specifically goes wrong without it?

Correct answer: B

The prof's verbatim explanation: "If you didn't subject that constraint, you could just make it … arbitrarily create high numbers, but also because we don't want to rescale the data." The unit-norm forces PCA to pick a direction, not a magnification — the variance comparison is meaningful only if all candidate $\phi$ live on the unit sphere.

A confuses what the constraint controls — orthogonality between PCs is enforced separately (each new PC is constrained orthogonal to previous ones), independent of unit-norm. C is wrong on dependency: column reordering doesn't change PCA's answer either way; orthonormal rotations preserve the constraint anyway. D inverts the relationship — the loading-as-correlation reading is a downstream property; the unit-norm constraint exists for boundedness, not interpretability.

Atoms: principal-component-analysis. Lecture: L21-unsupervised-1. ISLP §12.2.1.

Question 3 5 points

Mark each statement about principal component analysis as true or false.

  1. False — the prof's slide bullet read aloud: "PCA is not scale invariant." Without standardization, the largest-unit variable dominates PC1.
  2. True — loading vectors are unit norm by construction; this is the constraint that prevents PCA from cheating by inflating variance through scaling.
  3. True — "the principal component vector is unique. Of course, the sign flip is boring. It just means which direction it is."
  4. False — PCA is unsupervised; it sees only $X$. This is precisely the failure mode that motivates PLS.
  5. False — keeping all $p$ PCs is a rotation of coordinates. Reduction comes from truncating to the first $M < p$.

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2.

Question 4 3 points

PCA is sometimes called a "rotation." In what precise sense is this language exact, and where does it become merely useful shorthand?

Correct answer: C

The rotation is exact on the full set of $p$ PCs — you've just changed basis with an orthonormal matrix, which preserves all distances and total variance. The reduction comes from chopping at $M < p$ (the projection step). "PCA reduces dimension" is true only after the truncation, not at the rotation.

A confuses preprocessing with geometry — standardization affects which directions get picked, not whether the basis change is a rotation. B inverts what's rotated: PCA rotates the coordinate axes, leaving the points themselves fixed in physical space. D invents a degeneracy: tied eigenvalues leave the eigenvector basis non-unique within the tied subspace, but every choice is still orthonormal — the rotation interpretation survives.

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2.2 develops the orthonormal-basis-and-projection geometry.

Question 5 5 points Exam 2025 P2f

PCA is performed on a dataset with 6 standardized predictors. The eigenvalues of the sample covariance are:

$\lambda_1 = 3.0,\ \lambda_2 = 1.5,\ \lambda_3 = 0.8,\ \lambda_4 = 0.4,\ \lambda_5 = 0.2,\ \lambda_6 = 0.1.$

How many principal components must you keep to retain at least 90% of the total variance?

Correct answer: A

Total variance $= 3.0 + 1.5 + 0.8 + 0.4 + 0.2 + 0.1 = 6$ (which equals $p$, since the variables are standardized). Cumulative PVE: $3.0/6 = 0.500$; $+1.5/6 = 0.750$; $+0.8/6 = 0.883$ (still under 0.90); $+0.4/6 = 0.950$ — threshold crossed at $M = 4$.

B stops at 0.883 and reads "almost 90%" as enough, but 0.883 is still short of the 0.90 threshold. C and D add components past the threshold; once 0.95 is reached at $M = 4$, more components do not earn their keep.
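
The same arithmetic as a quick numpy check, using the eigenvalues from the stem:

```python
import numpy as np

lam = np.array([3.0, 1.5, 0.8, 0.4, 0.2, 0.1])
cum = np.cumsum(lam) / lam.sum()          # cumulative PVE
M = int(np.argmax(cum >= 0.90)) + 1       # first component crossing the 90% threshold
print(cum.round(3), M)                    # [0.5 0.75 0.883 0.95 0.983 1.], M = 4
```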

Atoms: explained-variance-and-scree-plot, principal-component-analysis. Lecture: L27-summary. ISLP §12.2.3.

Question 6 5 points Exam 2025 P2f

Continuing the setup: the loading vector for PC1 is $\phi_1 = (0.7,\ 0.5,\ 0.4,\ 0.3,\ 0.1,\ 0.0)^\top$. For an observation with standardized values $x = (1,\ -1,\ 0.5,\ 0,\ 2,\ -2)$, what is the score $z_1 = \phi_1^\top x$?

Correct answer: B

$z_1 = 0.7(1) + 0.5(-1) + 0.4(0.5) + 0.3(0) + 0.1(2) + 0.0(-2) = 0.7 - 0.5 + 0.2 + 0 + 0.2 + 0 = 0.6$. Show your work for partial credit.

A drops the signs on the negative entries of $x$ ($0.7 + 0.5 + 0.2 + 0 + 0.2 = 1.6$) — the absolute-value mistake. C ignores the multiplications and sums the loadings instead ($0.7 + 0.5 + 0.4 + 0.3 + 0.1 + 0 = 2.0$). D keeps only the largest contribution $\phi_{11} x_1 = 0.7$ and drops the rest.
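
A one-line numpy check of the dot product:

```python
import numpy as np

phi1 = np.array([0.7, 0.5, 0.4, 0.3, 0.1, 0.0])
x = np.array([1, -1, 0.5, 0, 2, -2])
print(phi1 @ x)   # 0.6
```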

Atoms: principal-component-analysis. Lecture: L27-summary. ISLP §12.2.1.

Question 7 3 points

PCA on a centered dataset gives PC1 with variance $\lambda_1 = 7.2$, while the total variance of the data is $9.6$. What is the proportion of variance explained by PC1?

Correct answer: C

$\text{PVE}_1 = \lambda_1 / \sum_k \lambda_k = 7.2 / 9.6 = 0.75$.

A halves to 0.50 by mis-totalling the variance (e.g. doubling: $7.2 / 14.4$). B reports the residual fraction $1 - \text{PVE}_1 = 2.4 / 9.6 = 0.25$ — the proportion not explained by PC1, not the proportion explained. D computes $\sqrt{0.75} \approx 0.87$ — confuses variance with standard deviation.

Atoms: explained-variance-and-scree-plot, principal-component-analysis. ISLP §12.2.3 (PVE definition and the $R^2 = 1 - \text{RSS}/\text{TSS}$ identity).

Question 8 3 points

Suppose your $n$ observations lie almost exactly on a unit circle in $\mathbb{R}^2$, so the underlying structure is one-dimensional and curved. What does PCA report — and what does this tell you about its scope?

Correct answer: D

PCA assumes linear structure. The prof: "PCA is kind of stuck just making linear stuff. It can't do anything non-linear." A circle is intrinsically 1-D but not linearly 1-D — the variance spreads across both x and y axes equally, so the scree plot looks flat and PCA gives no useful reduction. Nonlinear methods (kernel PCA, autoencoders) handle this; PCA does not.

A confuses "1-D manifold" with "1-D linear subspace" — only the latter is what PCA can find. B invents a non-linear PC: PCA's loadings define a linear combination of the original columns, never a curve. C is wrong on arithmetic: a unit circle has nonzero linear variance along both axes (that's the whole point — variance is spread, not absent).

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1. ISLP §12.2 (linear-projection framing throughout).

Question 9 4 points

Mark each statement as true or false.

  1. True — PCA is variance-driven; mixed units make the eigenvalue ranking meaningless without standardization.
  2. False — PCR inherits PCA's scale-(non)-invariance: rescaling one column changes which directions the PCs align with, and therefore the fit.
  3. True — every PC is a linear combination of all original $X_j$, so the back-transform $\hat\beta_j = \sum_m \hat\theta_m \phi_{jm}$ is generically nonzero. PCR is not a variable-selection method.
  4. True — the prof's USArrests demo: without scaling, PC1 loaded almost entirely on Assault (variance 6945) — meaningless artifact of units.

Atoms: standardization, principal-component-regression. Lecture: L14-modelsel-3. ISLP §12.2.4.

Question 10 4 points

You run PCA on four standardized macroeconomic predictors ($X_1$ = GDP growth, $X_2$ = interest rate, $X_3$ = inflation, $X_4$ = unemployment) and obtain the first PC loadings $\phi_1 = (0.7,\ -0.6,\ 0.3,\ -0.2)^\top$. Which is the most accurate reading of PC1?

Correct answer: A

Sign and magnitude both matter. Big positive loading on GDP and big negative loading on interest rate define the axis; smaller loadings contribute proportionally less. Standard prof move on USArrests: PC1 loads roughly equally on Murder/Assault/Rape (an "overall criminality" axis); same kind of reading here.

B is wrong on form: $|0.7|$ and $|0.2|$ differ by a factor of $3.5$ — not "roughly equal." C describes PLS, not PCA; PCA never sees $Y$. D confuses negative loadings with zero loadings; a negative loading means the variable contributes in the opposite direction, not that it is excluded.

Atoms: principal-component-analysis, dimensionality-reduction. ISLP §12.2.1 (loading interpretation, USArrests example).

Question 11 3 points

Why is standardization mandatory for PCA / PCR / PLS but only cosmetic for OLS? Pick the explanation that most directly captures the asymmetry.

Correct answer: A

This is the load-bearing distinction between coefficient-fitting and direction-finding methods. OLS is equivariant under linear rescaling — the model accommodates the unit change automatically. PCA's objective is the variance of a particular linear combination, and variance scales with the square of any unit change, so the ranking of directions is unit-dependent. The prof's slide: "PCA is not scale invariant."

B fabricates an internal CV pass for OLS — there is none; OLS is a closed-form solver with no scale-correction machinery. C confuses likelihood with equivariance: under a Gaussian likelihood, OLS's invariance under rescaling is the same equivariance argument, not a likelihood-specific property — and PCA can be given a likelihood interpretation too (probabilistic PCA) without becoming scale-invariant. D fabricates a convergence dependency: eigendecomposition operates on a covariance matrix regardless of input units; the issue is interpretation of the eigenvalue ranking, not whether the algorithm runs.

Atoms: standardization, principal-component-analysis. ISLP §12.2.4 ("Scaling the variables").

Question 12 4 points

Which of the following describes the canonical PCR procedure with $M$ chosen by cross-validation?

Correct answer: D

The prof's verbatim recipe: "Standardize $X$, run PCA, get $Z_1, \ldots, Z_p$. Fit a standard linear regression of $Y$ on $Z_1, \ldots, Z_M$, where $M$ is a tuning parameter. Sweep $M$ from 1 up to $p$. Pick the $M$ that minimizes CV-MSE."

A standardizes the wrong object — components are already on the variance scale of the standardized $X$. B inverts the role of $X$ and $Y$ (PCA is on the predictors, not the response). C is residual-PCA — a different idea entirely; no method in the course works this way.
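
A minimal sklearn sketch of that recipe, on synthetic data (all names and data illustrative); wrapping the standardize-PCA-OLS chain in a Pipeline keeps the scaling inside each CV fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(100)

pcr = Pipeline([("scale", StandardScaler()),
                ("pca", PCA()),
                ("ols", LinearRegression())])

# Sweep M = 1..p; pick the M that minimizes CV-MSE.
grid = GridSearchCV(pcr,
                    {"pca__n_components": list(range(1, X.shape[1] + 1))},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```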

Atoms: principal-component-regression, cross-validation. Lecture: L15-modelsel-4. ISLP §6.3.1.

Question 13 3 points

What is the relationship between PCA and PCR?

Correct answer: B

PCR = PCA front-end + OLS back-end. The PCA step is the same unsupervised algorithm covered in module 10; the regression step on the first $M$ PCs makes the pipeline supervised. The prof: "you're taking the X, you're squishing it down to fit your model, and then you go backwards to the original model again."

A is wrong on the criterion: PCA does not condition on $Y$ in any form — there is no $\text{Var}(Z \mid Y)$ step in either PCA or PCR; both use the unsupervised $\text{Var}(Z)$ objective on $X$ alone. C invents an $L_2$ penalty that PCR does not have — PCR's mechanism is hard truncation at $M$, not shrinkage on PC scores; PCA also handles correlated predictors fine. D describes PLS, not PCR; PCR uses PCA's variance-maximizing components.

Atoms: principal-component-regression, principal-component-analysis. Lecture: L14-modelsel-3. ISLP §6.3.1.

Question 14 4 points

The prof framed PCR as "a discretized version of ridge regression." Which sentence best captures the analogy?

Correct answer: A

Verbatim: "Higher pressure on less important PCs. PCR discards the $p - M$ smallest eigenvalue components." Ridge applies the smooth shrinkage factor $\lambda_j^2 / (\lambda_j^2 + \lambda)$ on each PC direction — heavier on small eigenvalues. Same target (small-variance directions), different shape (hard truncation vs. smooth shrinkage).

B invents a budget formulation: PCR has no penalized-loss representation; its mechanism is hard truncation in PC space, not a constraint on $\sum_j \beta_j^2$. C inverts the relationship — ridge with $\lambda = 0$ is OLS, and PCR with $M = p$ is also OLS, so the two families coincide only at the no-shrinkage endpoint; PCR is not recovered from ridge by taking a limit in $\lambda$. D is wrong on both counts: neither PCR nor ridge does exact variable selection — that is lasso's job.

Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1.

Question 15 3 points

Both ridge and PCR put pressure on the small-eigenvalue PC directions. Why does this happen — what is the deeper mathematical reason these two methods end up doing similar things?

Correct answer: B

This is the load-bearing reason behind the discretized-ridge framing. Decomposing OLS along the eigen-basis of $X^\top X$, the coefficient on the $j$-th eigendirection has variance proportional to $1/\lambda_j$ — small-eigenvalue directions are the noisy ones. Both regularizers know this implicitly: ridge attenuates each direction by a smooth factor that bites hardest on small $\lambda_j$, and PCR drops the smallest $\lambda_j$ directions outright. Same target, different shape.

A is wrong about PCR's formulation: ridge minimizes $\|y - X\beta\|^2 + \lambda\|\beta\|^2$, but PCR has no penalty term — it works by hard truncation in PC space, not by penalized loss. C is mechanically wrong: PCR uses an eigendecomposition / SVD of $X$, while standard ridge solvers do not require one. D inverts the family relations: lasso is its own animal, not a parent of ridge or PCR; ridge is not "elastic net at $\alpha = 0$" running through a lasso limit.
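
A small numpy sketch of the two shapes, reusing the eigen-spectrum from Q5 with an illustrative penalty $\lambda = 1$ and cutoff $M = 3$, and the slide's per-direction factor:

```python
import numpy as np

lam_j = np.array([3.0, 1.5, 0.8, 0.4, 0.2, 0.1])      # eigen-spectrum (from Q5)
lam, M = 1.0, 3                                       # illustrative penalty and cutoff

ridge_keep = lam_j**2 / (lam_j**2 + lam)              # smooth: bites hardest on small lam_j
pcr_keep = (np.arange(len(lam_j)) < M).astype(float)  # hard: keep first M, drop the rest
print(ridge_keep.round(3))   # [0.9   0.692 0.39  0.138 0.038 0.01 ]
print(pcr_keep)              # [1. 1. 1. 0. 0. 0.]
```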

Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.2.3 + §6.3.1 (the eigenbasis view of ridge and the PCR-as-discretized-ridge bridge).

Question 16 5 points

Mark each statement about principal-components regression (PCR) as true or false.

  1. False — every PC is a linear combination of all $X_j$, so all back-transformed $\hat\beta_j$ are typically nonzero. The prof: "Just like Ridge, [PCR] doesn't actually select the parameters." Lifted from Exam 2023 P5.
  2. True — verbatim slide quote: "PCR can be seen as a discretized version of ridge regression."
  3. True — direct from the slide-flagged "constrained interpretation": dimension reduction constrains $\hat\beta$ to live in the $M$-dimensional subspace spanned by the kept loadings.
  4. False — PCA (the front end of PCR) is unsupervised. This is exactly the no-guarantee-the-high-variance-directions-relate-to-$Y$ pitfall, fixed by PLS.
  5. True — at $M = p$ the rotation is invertible and OLS on the rotated predictors gives the same fit as OLS on the originals (under standardization, the $\hat y$'s match exactly).

Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1.

Question 17 3 points

On the Credit dataset, the prof reported that PCR needed $M \approx 10$ components (out of $\sim 11$) to perform well, while ridge regression performed comparably keeping all directions but shrinking. Why doesn't ridge suffer the same failure?

Correct answer: C

This is the resolution of the Credit puzzle. Income is a low-$X$-variance but $Y$-predictive direction. PCR's hard threshold throws it away unless $M$ reaches the rank of that direction (which happened to be 10). Ridge keeps it in the model from the start, just shrunk — so a small-but-nonzero contribution is recovered at every $\lambda$. Smoothness wins exactly when the relevant direction sits low in the variance ranking.

A misnames ridge's penalty (it's $L_2$, not $L_1$) and conflates penalty type with selection — even with $L_1$ ridge wouldn't be picking "components" anyway. B fabricates a per-predictor CV: ridge does a single CV pass over $\lambda$, not one per variable. D contradicts the slide-flagged factor $\lambda_j^2 / (\lambda_j^2 + \lambda)$, which gives different shrinkage to different directions (heavier on small eigenvalues, not equal).

Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.1, Figure 6.20 (PCR / PLS / ridge / lasso compared on simulated data).

Question 18 4 points

On the Credit dataset, the prof reported that PCR's CV-MSE was minimized at $M = 10$ out of $\sim 11$ available components — almost no dimension reduction. What does this tell you?

Correct answer: B

Verbatim: "Even though the things that you care about were just ended up being four variables, primarily, it took you 10 PCs to get there." Income happened not to vary much in $X$-space, so the relevant direction was buried late in the PC ordering. This is the canonical failure of PCR's central assumption — that high-$X$-variance directions are also $Y$-predictive.

A inverts cause and diagnosis: $M$ being close to $p$ is the symptom of the failure mode, not its cause, and overfitting would push CV-MSE up at large $M$, the opposite of what was observed. C describes a different failure mode (rank-deficient collinearity) and contradicts PCA's geometry — strong collinearity collapses signal into fewer PCs, not more. D is a real PCR pitfall but not the one at play here — the prof's pipeline standardized $X$ correctly.

Atoms: principal-component-regression, partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.1.

Question 19 5 points

You run PCR with $M = 2$ on three standardized predictors. The first two loading vectors are $\phi_1 = (0.7,\ 0.5,\ 0.5)^\top$ and $\phi_2 = (-0.3,\ 0.6,\ 0.7)^\top$. The OLS regression of $Y$ on the components gives $\hat\theta_1 = 4$ and $\hat\theta_2 = 2$. What is the back-transformed coefficient $\hat\beta_2$ on the second original predictor?

Correct answer: D

Back-transform formula: $\hat\beta_j = \sum_{m=1}^M \hat\theta_m \phi_{jm}$. So $\hat\beta_2 = \hat\theta_1 \phi_{21} + \hat\theta_2 \phi_{22} = 4(0.5) + 2(0.6) = 2 + 1.2 = 3.2$.

A keeps only the second-component term ($2 \cdot 0.6 = 1.2$), forgetting the contribution from $\hat\theta_1$. B keeps only the first-component term ($4 \cdot 0.5 = 2.0$). C adds the loadings without weighting by the $\hat\theta_m$'s ($0.5 + 0.6 = 1.1$).
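
The back-transform is a single matrix-vector product; a numpy check with the stem's numbers:

```python
import numpy as np

Phi = np.array([[0.7, -0.3],
                [0.5,  0.6],
                [0.5,  0.7]])    # column m holds the loading vector phi_m
theta = np.array([4.0, 2.0])

beta = Phi @ theta               # beta_j = sum_m theta_m * phi_jm
print(beta)                      # [2.2 3.2 3.4] -> beta_2 = 3.2
```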

Atoms: principal-component-regression. Lecture: L14-modelsel-3. ISLP §6.3.1.

Question 20 3 points

PCR with $M = 1$ keeps only one component out of $p = 20$, producing a model with effectively one feature. Why is this still not a form of variable selection?

Correct answer: D

The core insight is the back-transform. The model in PC-space looks one-dimensional, but the equivalent model in $X$-space uses all 20 predictors with weights $\hat\theta_1 \phi_{j1}$ each. Variable selection requires actual zeros on $\hat\beta_j$ — and unless $\phi_{j1} = 0$ exactly (almost never the case), every original predictor stays in.

A confuses procedure with effect — variable selection can be achieved without hypothesis tests (lasso is the obvious example, and best-subset another). B is too vague; the question is about the structural property of the fit, not whether the model is good. C misnames PCR's mechanism: PCR has no explicit penalty, it has a hard truncation. The $L_2$-can't-zero claim is true for ridge but irrelevant to PCR.

Atoms: principal-component-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.3.1, closing: "PCR does not perform feature selection."

Question 21 4 points

What is the principled way to choose the number of components $M$ in PCR?

Correct answer: C

For supervised use ($Y$ available), CV the downstream regression. This is what the prof actually does in PCR; from the dimensionality-reduction atom: "Treat $M$ like any other tuning parameter. … Cross-validate $M$ against the downstream model — much more principled than the elbow on a scree plot."

A is the unsupervised criterion; it ignores how well the kept components actually predict $Y$. B is no reduction at all — defeats PCR's purpose. D uses an $X$-only criterion, which can miss directions that drive $Y$ (the Credit example: 95% PVE arrives well before the income-driving direction, so the $X$-only cutoff would be wrong here).

Atoms: principal-component-regression, cross-validation, explained-variance-and-scree-plot. Lecture: L15-modelsel-4. ISLP §6.3.1.

Question 22 3 points

Why does PCR with $M = p$ produce exactly the same fitted values $\hat y$ as OLS on the standardized predictors?

Correct answer: A

This is the linear-algebra anchor. OLS finds $\hat y$ as the projection of $Y$ onto the column space of $X$. A change of basis to orthogonal PCs does not change which subspace you are projecting onto — it just expresses the same projection in different coordinates. Reduction occurs only when $M < p$, because then you are projecting onto a strict subspace of $\text{col}(X)$.

B routes through ridge unnecessarily — both PCR ($M=p$) and ridge ($\lambda=0$) coincide with OLS, but the equivalence comes directly from the rotation argument, not from passing through ridge. C invokes lasso for no reason; PCR's back-transform is a sum over loadings, not a soft-thresholding solution. D invents a $\Phi^\top \Phi$ correction: $\Phi$ is orthonormal, so $\Phi^\top \Phi = I$ — there is no extra factor to cancel, the OLS fit is recovered immediately.
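
A quick sklearn check of the $M = p$ equivalence on synthetic data (illustrative; sklearn's PCA centers internally and the regression intercept absorbs the means, so the fitted values coincide):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)

Z = PCA(n_components=X.shape[1]).fit_transform(X)   # full rotation: M = p
yhat_pcr = LinearRegression().fit(Z, y).predict(Z)
yhat_ols = LinearRegression().fit(X, y).predict(X)
print(np.allclose(yhat_pcr, yhat_ols))              # True: same column space
```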

Atoms: principal-component-regression. Lecture: L14-modelsel-3. ISLP §6.3.1, opening pages.

Question 23 4 points

Mark each statement as true or false.

  1. True — L1 produces exact zeros at corners of the constraint region; PCR's back-transform sums over loadings and is generically nonzero.
  2. True — the canonical "discretized ridge" framing.
  3. False — the directions are reversed: lasso typically keeps one and zeroes the other (it picks); ridge typically holds both back and shares the weight.
  4. True — verbatim from the prof: "Use PCR/ridge when you want predictive power and don't care which raw variables drive it. Use lasso or subset selection when you need this or that one." PLS sits with PCR/ridge in this triage.

Atoms: principal-component-regression, ridge-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.2 + §6.3.

Question 24 3 points

In partial least squares (PLS) regression, the first component $Z_1 = \sum_j \phi_{j1} X_j$ is constructed to maximize:

Correct answer: D

Verbatim from the prof: "It's the same idea as the principal component analysis, only now you're finding the principal components not as the directions of maximal variance of $X$, but the maximal covariance of $X$ and $Y$." So PLS is supervised; PCA / PCR is unsupervised.

A is the PCA / PCR criterion (variance, no $Y$). B mixes PLS with the L1 penalty of lasso — PLS has no sparsity penalty. C invokes a generative likelihood that PLS does not assume; PLS is a deterministic algorithm, not a likelihood method.

Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 25 3 points

In the PLS algorithm, the loading $\phi_{j1}$ for the first component is set to:

Correct answer: B

Verbatim slide: "$\phi_{j1}$ is the coefficient from the simple linear regression of $Y$ onto $X_j$. This coefficient is proportional to the correlation between $Y$ and $X_j$. PLS puts highest weight on the variables that are most strongly related to the response." That is the entire supervised twist over PCA.

A is the PCA recipe (variance, no $Y$). C is a flat-weight average that PLS does not use — and would carry no $Y$-information. D is variable selection, which neither PLS nor PCR performs.
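
A minimal numpy sketch of the rule on synthetic toy data (illustrative; the overall scale of $\phi_1$ is irrelevant because the downstream regression absorbs it, so the vector is normalized here only for readability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = X[:, 0] + 0.2 * X[:, 1] + rng.standard_normal(200)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize predictors
yc = y - y.mean()

# Simple-regression slope of y on each standardized X_j; with unit-variance
# columns this is proportional to Corr(y, X_j).
phi1 = Xs.T @ yc / (len(y) - 1)
phi1 /= np.linalg.norm(phi1)                 # cosmetic unit-norm
z1 = Xs @ phi1                               # first PLS component
print(phi1.round(2))                         # heaviest weight on the predictive first column
```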

Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 26 3 points

PLS sets $\phi_{j1}$ proportional to the simple-regression coefficient of $Y$ on $X_j$. What is the intuition — why is this rule the right one for the supervised criterion PLS optimizes?

Show answer
Correct answer: C

This is the supervised twist: PLS asks "which $X_j$'s predict $Y$ on their own?" and uses the answer as loadings. Variables with strong simple-regression coefficients (equivalently, high $|\text{Corr}(Y, X_j)|$) dominate $Z_1$ — directly serving the $\text{Cov}(Z_1, Y)$ objective. PCA, by contrast, asks "which $X_j$'s vary most and co-vary together?" with no $Y$ reference.

A confuses PLS with PCA — and PCA without the unit-norm constraint is unbounded, not solved by simple regression. B is wrong on mechanics: the unit-norm constraint is enforced by an explicit normalization step in the algorithm, not by the choice of loading values. D is doubly wrong — PLS still requires standardization (same as PCA), and even if simple-regression coefficients were unit-invariant they would not make the method scale-invariant in the way claimed.

Atoms: partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 27 4 points

Mark each statement about partial least squares (PLS) as true or false.

  1. True — PLS maximizes $\text{Cov}(\phi^\top X, Y)$ at every step; that is the entire reason PLS exists separate from PCR.
  2. True — like PCR, PLS rotates predictors into composite components, so back-transformed coefficients are linear combinations of all $X_j$ and generically nonzero.
  3. False — PLS is not a variable-selection method. Selecting variables (exact zeros) is what lasso does.
  4. True — slide-flagged: "We regress each variable on $Z_1$ and take the residuals … We then compute $Z_2$ using this orthogonalized data."

Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 28 4 points

You are predicting $Y$ from standardized predictors. Through prior knowledge you know that $Y$ depends strongly on a particular predictor $X_3$ that happens to have a small variance relative to the other predictors. Compared with PCR, PLS will generally:

Correct answer: D

This is the textbook PCR-vs-PLS contrast. PCR picks directions by $\text{Var}(X)$ — small-variance predictive directions get buried late in the ordering (the Credit / income story). PLS picks by $\text{Cov}(X, Y)$ — a strong simple-regression coefficient on $Y$ feeds directly into a heavy loading on the first component.

A reverses the direction: PCR (not PLS) is the one prioritizing high-$X$-variance. B misclassifies PLS — PCA is unsupervised, but PLS is supervised, that is the whole point. C confuses standardization with relevance: standardization makes variances equal but not covariances with $Y$; PLS still picks by covariance with $Y$, PCR still by variance, so they disagree on $X_3$.

Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 29 4 points

Mark each statement about PLS's empirical behavior as true or false.

  1. False — slide-flagged and read aloud: "In practice, PLS often performs no better than ridge regression or PCR but it's Swedish, so it's like they're meatballs, they're not better but they sound good."
  2. True — slide bullet: "Supervised dimension reduction of PLS can reduce bias. It also has the potential to increase variance."
  3. True — verbatim slide: "PLS, PCR and ridge regression tend to behave similarly. Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps."
  4. True — same as PCR; the back-transform is a sum over loadings, generically nonzero.

Atoms: partial-least-squares, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.3.2.

Question 30 5 points

Mark each statement about PCA, PCR, PLS, ridge, and lasso as true or false.

  1. False — ridge handles collinearity by sharing weight smoothly across correlated predictors, and PCR is built explicitly to attack collinearity by rotating to orthogonal components. Reverses the truth.
  2. True — only the L1 penalty produces exact-zero solutions among these four; ridge shrinks to small but nonzero values, and PCR's back-transform is a sum over loadings.
  3. True — the canonical supervised-vs-unsupervised contrast.
  4. True — slide-flagged: lasso wins when sparsity is the right structural assumption; PCR / PLS / ridge are non-sparse smoothers and dilute the signal across components.
  5. False — PCA maximizes $\text{Var}(Z)$ on $X$ alone; PLS maximizes $\text{Cov}(Z, Y)$. Substituting $Y = X$ into the PLS criterion would give $\text{Cov}(Z, X)$, not $\text{Var}(Z)$ — and the PLS algorithm uses $Y$ to set $\phi_{j1}$ via the simple regression of $Y$ on $X_j$, which is degenerate when $Y = X$.

Atoms: principal-component-regression, partial-least-squares, ridge-regression, lasso. Lecture: L15-modelsel-4. ISLP §6.3.

Question 31 3 points

Two analyses fit the same regression: one with PCR ($M = 4$), one with ridge (with $\lambda$ chosen by CV). They report essentially identical CV-MSE. The prof would lean toward reporting which, and why?

Correct answer: B

When two methods tie on test performance, the prof prefers the more stable one — the one whose answer wouldn't shift under a small CV-fold reshuffle. Verbatim: "ridge regression behaves smoothly instead of this kind of discrete thing which PCR and PLS both have." Ridge wins on smoothness: $\lambda$ moves continuously, so a tiny change in CV doesn't flip the model. PCR's $M$ jumps in unit steps; a small CV change can switch $M$ from 4 to 5 and substantially change the back-transformed $\hat\beta$.

A confuses parsimony in PC-space with parsimony in $X$-space — PCR's back-transform is generically nonzero on every $X_j$, so neither model is more parsimonious in raw variables. C is a weak interpretability argument: $\lambda$ is just as reportable as $M$ once you understand each method, and stability is the more load-bearing tie-break. D is the canonical wrong belief tested in Q20: PCR does not produce sparse $\hat\beta_j$ — every back-transformed coefficient is generically nonzero.

Atoms: principal-component-regression, ridge-regression, cross-validation. Lecture: L15-modelsel-4. ISLP §6.3.1 + §6.6 (exercises 9–10 contrast PCR/PLS/ridge/lasso on identical datasets).

Question 32 4 points

You are predicting credit-card spend from $p = 50$ candidate predictors, many of which are nearly collinear. A stated goal is interpretability: you want to report which raw variables drive $Y$. Which method best fits the goals?

Show answer
Correct answer: C

"Use lasso or subset selection when you need 'this or that one.'" Lasso both regularizes and selects, so it is the only method here whose output points at specific raw variables.

A retains all 50 predictors with shrunken coefficients — every $\hat\beta_j$ is nonzero, so no variable selection. B and D both produce components that are linear combinations of all 50 variables, so the back-transformed $\hat\beta_j$ are generically nonzero — no clean variable interpretation. The prof's verbatim triage: "Use PCR/ridge when you want predictive power and don't care which raw variables drive it. Use lasso or subset selection when you need this or that one."

Atoms: lasso, principal-component-regression, ridge-regression. Lecture: L15-modelsel-4. ISLP §6.2 + §6.3.

Question 33 4 points

Mark each statement about choosing $M$ and reading PCA diagnostics as true or false.

  1. True — supervised task → cross-validate the supervised criterion. ISL is candid that the scree elbow "is inherently ad hoc"; CV is the principled alternative when $Y$ is available.
  2. False — ISL: "this type of visual analysis is inherently ad hoc … there is no well-accepted objective way to decide." Two readers can differ by ±1 PC.
  3. True — each added eigenvalue is non-negative, so the running sum cannot decrease. A drop signals a coding bug.
  4. False — the opposite. Flat scree means no real low-dim structure; PCA is not earning its keep. Strong structure shows up as a sharp early drop.

Atoms: explained-variance-and-scree-plot, principal-component-regression, cross-validation. ISLP §12.2.3 + §12.2.4.

Question 34 3 points

Which of the following is best classified as variable selection rather than dimensionality reduction?

Correct answer: A

The prof's distinction: dimensionality reduction "gives you components, and each component is a combination of the original axes," while variable selection (lasso, subset selection) "actually selects the parameters." Lasso's L1 penalty zeroes some $\hat\beta_j$ outright — that is selection.

B and C are dimensionality reduction (PLS produces composite components; the autoencoder produces 10 learned features that are nonlinear combinations of the 100 inputs). D is the LDA-as-projection view from module 4 — also dimension reduction; a $K-1$-dim projection of the predictors is a compressed representation, not a selection of original variables.

Atoms: lasso, dimensionality-reduction, partial-least-squares. ISLP §6.2.2 (lasso) + §6.3 (dim reduction umbrella).

Question 35 3 points

When the true response depends on a small subset of the original predictors, which method's signature failure is to scatter that signal across many components, leaving no clean variable interpretation in the back-transformed coefficients?

Correct answer: C

PCR rotates the predictors before regressing, so a sparse truth in original-$X$ space becomes a dense truth in PC space — and the back-transformed coefficients are generically all nonzero. PLS has the same issue. The prof's framing: "Use lasso or subset selection when you need this or that one."

A inverts the direction: lasso's sparsity is the cure, not the failure. B is wrong on details: ridge does not shrink every coefficient by the same proportion — the per-PC factor $\lambda_j^2/(\lambda_j^2 + \lambda)$ is heavier on small eigenvalues. D inverts what best-subset does: it picks one subset and reports its $\hat\beta$, giving clean variable identification (at compute cost) — it does not scatter signal across models.

Atoms: principal-component-regression, lasso, partial-least-squares. Lecture: L15-modelsel-4. ISLP §6.3.1.