Module 06 — Model selection & regularisation
30 questions · 100 points · ~45 min
Click an option to lock the answer; correct turns green, wrong turns red,
and the explanation auto-opens. The score panel at the bottom-left tracks running points.
A dataset has $p = 10$ candidate predictors. How many distinct linear models must best-subset selection fit and compare across all subset sizes $k = 0, 1, \dots, p$, counting the intercept-only null model?
- A $1 + p(p+1)/2 = 56$
- B $2^p = 1024$
- C $p! = 3{,}628{,}800$
- D $2^p - 1 = 1023$
Show answer
Correct answer: B
Best-subset enumerates every subset of $\{1,\dots,p\}$, including the empty (intercept-only) model. The count is $\sum_{k=0}^{p} \binom{p}{k} = 2^p$. With $p = 10$ that's $1024$. The prof flagged the formula on the slide and added: "Slow as shit for $p$ big, or even like impossibly slow."
A is the count for forward / backward / hybrid stepwise ($1 + p(p+1)/2$). C confuses subsets with orderings (permutations) of predictors. D is the count of non-empty subsets, $2^p - 1$: off by one, because best-subset also fits and compares the intercept-only null model.
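A quick enumeration makes both counts concrete (a minimal Python sketch; the loop structure, not the library, is the point):

```python
from itertools import combinations

p = 10
# Best-subset: one model per subset of the p predictors, empty set included.
best_subset = sum(1 for k in range(p + 1)
                  for _ in combinations(range(p), k))
# Forward stepwise: the null model, then p - k candidate fits at step k + 1.
stepwise = 1 + sum(p - k for k in range(p))  # = 1 + p(p + 1)/2

print(best_subset)  # 1024 == 2**p
print(stepwise)     # 56
```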
Atoms: subset-selection. Lecture: L12-modelsel-1.
Question 2
4 points
ISLP §6 Q1
Best-subset, forward stepwise, and backward stepwise selection are run on the same dataset and produce a model of each size $k = 0, 1, \dots, p$. Mark each statement as true or false.
Show answer
- True — forward stepwise can only add, never drop, so each step's set strictly contains the previous.
- True — backward stepwise only drops; the smaller model is a subset of the larger.
- False — best-subset minimises RSS at each $k$ independently. The Credit dataset showed this: at $k=4$ best-subset drops rating and adds cards, so $\mathcal{M}_3 \not\subset \mathcal{M}_4$.
- False — backward starts from the full $p$-predictor OLS fit, which requires $X^\top X$ invertible, i.e. $n > p$. Forward stepwise has no such requirement.
Each sub-statement scores $1$ point.
Atoms: subset-selection, high-dimensional-regression. Lecture: L12-modelsel-1.
Which optimisation problem defines the ridge regression coefficients $\hat\beta^R_\lambda$?
- A $\arg\min_\beta \big\{\text{RSS} + \lambda \sum_{j=1}^p |\beta_j|\big\}$
- B $\arg\min_\beta \big\{\text{RSS} + \lambda \sum_{j=0}^p \beta_j^2\big\}$
- C $\arg\min_\beta \big\{\text{RSS} + \lambda \sum_{j=1}^p \beta_j^2\big\}$
- D $\arg\min_\beta \big\{\text{RSS} + \lambda \big(\sum_{j=1}^p |\beta_j|\big)^2\big\}$
Show answer
Correct answer: C
Ridge minimises RSS plus the $L^2$ penalty $\lambda \sum_{j=1}^p \beta_j^2$, where the sum runs over slope coefficients only — the intercept $\beta_0$ is not penalised (slide-flagged: "if we included the intercept, $\beta^R$ would depend on the average of the response").
A is the lasso ($L^1$). B mistakenly penalises the intercept, which makes the fit depend on the response mean. D uses the squared $L^1$ norm instead of $L^2$ (sum of squares); these aren't the same penalty — the squared $L^1$ couples coefficients (cross-terms $|\beta_j||\beta_k|$ enter the objective), which is not ridge.
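For the record, the penalised objective has a closed form on centred data; here is a minimal numpy sketch (the function name and test data are illustrative, not lecture code):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Slopes via (X^T X + lam*I)^{-1} X^T y on centred data; the
    intercept is recovered afterwards, so beta_0 is never penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta  # intercept from the sample means
    return beta0, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(size=50)
print(ridge_fit(X, y, lam=10.0))
```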
Atoms: ridge-regression. Lecture: L12-modelsel-1.
Mark each statement about ridge regression and its tuning parameter $\lambda$ as true or false.
Show answer
- True — ridge shrinks smoothly to zero only asymptotically (the squared penalty has zero gradient at $\beta=0$, so there's nothing pulling coefficients onto zero, only toward it).
- False — direction is reversed: more penalty $\Rightarrow$ less flexible $\Rightarrow$ variance down, (squared) bias up. Classic direction-of-effect trap.
- False — verbatim from the prof: "Importantly, ridge regression is not scale-invariant." Different units for $X_j$ change which coefficient the penalty hits hardest. Standardise first.
- True — adding $\lambda I$ to $X^\top X$ regularises the inverse, so $\hat\beta^R = (X^\top X + \lambda I)^{-1} X^\top y$ is unique even when $X^\top X$ is singular. The flagship reason ridge survives high-dimensional regression.
Each sub-statement scores $1$ point.
Atoms: ridge-regression, standardization, high-dimensional-regression. Lecture: L12-modelsel-1.
Question 5
3 points
ISLP §6 Q2
Relative to ordinary least squares, the lasso is:
- A Equally flexible; improves prediction accuracy whenever the design matrix has no exact collinearities.
- B More flexible; improves prediction accuracy when its increase in bias is less than its decrease in variance.
- C Less flexible; improves prediction accuracy whenever the irreducible error $\sigma^2$ exceeds the OLS training MSE.
- D Less flexible; improves prediction accuracy when its increase in bias is less than its decrease in variance.
Show answer
Correct answer: D
Lasso shrinks coefficients (and zeros some), so it is less flexible than unconstrained OLS. It improves prediction accuracy precisely when the bias bump from shrinkage is outweighed by the variance reduction. Verbatim L27 framing of the prof's preferred 2025 Q3a answer: "less flexible than LS, improved accuracy when the increase in bias is less than the decrease in variance."
A is wrong on flexibility and substitutes a collinearity claim for the bias-variance trade. B mis-classifies lasso as more flexible. C correctly identifies lasso as less flexible but the qualifier ("$\sigma^2$ exceeds OLS training MSE") is unrelated to when shrinkage helps — irreducible error is a property of the data, not a comparison criterion. Only D names the actual bias-variance inequality.
Atoms: lasso, bias-variance-tradeoff, regularization. Lecture: L27-summary.
In the special case $n = p$ with $X = I$, lasso has the closed-form solution $\hat\beta_j^L = \mathrm{sign}(\hat\beta_j^{\text{OLS}}) \cdot (|\hat\beta_j^{\text{OLS}}| - \lambda/2)_+$ (ISLP eq. 6.15). Suppose $\hat\beta_1^{\text{OLS}} = 0.6$, $\hat\beta_2^{\text{OLS}} = -0.3$, $\hat\beta_3^{\text{OLS}} = 0.8$, and $\lambda = 1.0$. What are the soft-thresholded lasso estimates $(\hat\beta_1^L, \hat\beta_2^L, \hat\beta_3^L)$?
- A $(0.10,\ 0,\ 0.30)$
- B $(0.6,\ -0.3,\ 0.8)$
- C $(-0.10,\ 0.20,\ -0.30)$
- D $(0,\ 0,\ 0)$
Show answer
Correct answer: A
Soft-threshold: subtract $\lambda/2 = 0.5$ from $|\hat\beta_j|$, clip negatives to zero, restore the sign.
- $j=1$: $|0.6| - 0.5 = 0.10 > 0$, keep sign $+$ $\Rightarrow 0.10$.
- $j=2$: $|-0.3| - 0.5 = -0.20 < 0$, clip to $0$.
- $j=3$: $|0.8| - 0.5 = 0.30 > 0$, keep sign $+$ $\Rightarrow 0.30$.
So $(0.10, 0, 0.30)$. B forgets the penalty entirely (returns OLS). C flips every sign and, instead of clipping the negative post-threshold value to zero, reflects it ($-0.20 \to 0.20$). D applies the full threshold $\lambda$ instead of $\lambda/2$ and zeros everything.
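The arithmetic is a one-liner to check (a sketch; `soft_threshold` is a hypothetical helper, not ISLP code):

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso in the n = p, X = I special case (ISLP eq. 6.15): shrink
    each magnitude by lam/2, clip at zero, restore the sign."""
    out = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2, 0.0)
    return out + 0.0  # turn any signed zero (-0.0) into 0.0

print(soft_threshold(np.array([0.6, -0.3, 0.8]), lam=1.0))  # [0.1 0.  0.3]
```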
Atoms: lasso, ridge-vs-lasso-geometry.
In the constrained-optimisation picture (ISL Fig 6.7), the RSS contours are concentric ellipses centred at $\hat\beta^{\text{OLS}}$, and the regularisation constraint defines a region centred at the origin. Which statement best explains why lasso produces sparse solutions but ridge does not?
- A The lasso region is smaller in volume than the ridge region at any $\lambda$, so its solution is forced to lie on a coordinate axis at every fit.
- B Ridge is convex while lasso is non-convex, so only lasso can settle at coordinate corners during the optimisation.
- C Lasso uses cross-validation internally to select variables, whereas ridge does not select variables at all.
- D Lasso's region is a diamond with corners on the axes (the RSS ellipse tends to first touch a corner), while ridge's region is a smooth circle.
Show answer
Correct answer: D
Geometric story: the $L^1$ ball is a diamond with sharp corners on the axes; the $L^2$ ball is smooth. The smallest RSS ellipse that touches the constraint region tends to land on a diamond corner (which forces some $\beta_j = 0$), whereas it touches the circle at a generic tangency point where no coordinate is zero.
A is wrong twice over: relative volume is irrelevant, and the solution is not forced onto an axis at every fit; it just typically lands there. B inverts convexity: both the $L^1$ and $L^2$ penalties define convex problems, so convexity isn't the discriminator. C confuses fitting (the penalty itself produces sparsity) with hyperparameter tuning. The corners-vs-smooth distinction is what matters.
Atoms: ridge-vs-lasso-geometry, lasso, ridge-regression. Lecture: L13-modelsel-2.
You are shown two coefficient-trace plots from the Credit dataset. In plot X, every coefficient curve approaches but never crosses the zero line as $\lambda$ grows. In plot Y, several coefficient curves drop to exactly zero at finite $\lambda$ values and stay there. Which method produced which plot?
- A X = elastic net (any $\alpha \in (0,1)$), Y = lasso — only elastic net has the smooth no-zero behaviour of plot X.
- B X = ridge, Y = lasso.
- C Both are ridge at different $\lambda$ resolutions.
- D Both are lasso; the difference is just the standardisation choice.
Show answer
Correct answer: B
The defining visual difference: ridge paths approach zero asymptotically but never touch (no kink in the $\beta^2$ penalty at zero); lasso paths drop to exactly zero and stay (the $|\beta|$ kink at zero is what produces sparsity).
A misattributes ridge's smooth-no-zeros pattern to elastic net — but elastic net has an $L^1$ component, so it still produces exact zeros (just less aggressively than pure lasso). C ignores the plot's distinguishing feature (zeros vs no zeros). D blames standardisation, which affects scale of the trace but not whether paths reach zero.
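The two trace plots are easy to regenerate on simulated data (a sketch assuming scikit-learn, whose API calls the penalty `alpha`; grid and seed are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge, lasso_path

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

alphas = np.logspace(-2, 2, 50)
_, lasso_coefs, _ = lasso_path(X, y, alphas=alphas)  # plot Y: exact zeros
ridge_coefs = np.array(
    [Ridge(alpha=a).fit(X, y).coef_ for a in alphas]).T  # plot X: no zeros

print((lasso_coefs == 0).any(axis=1))  # all True: each path hits exact zero
print((ridge_coefs == 0).any(axis=1))  # all False: ridge only approaches it
```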
Atoms: ridge-regression, lasso.
Mark each statement about elastic net as true or false.
Show answer
- True — that's the definition: $\text{RSS} + \lambda \sum \beta_j^2 + \gamma \sum |\beta_j|$.
- False — the $L^1$ component still gives sparsity. Some coefficients are zeroed, just less aggressively than with pure lasso.
- True — that's the practical pitch: "this is probably the one that people use the most", because the $L^2$ part rescues lasso's instability under correlated predictors.
Atoms: elastic-net, lasso, ridge-regression. Lecture: L13-modelsel-2.
Which of the following best describes the canonical PCR procedure?
- A Choose components by maximising $\mathrm{Cov}(X\phi, Y)$, regress $Y$ on the first $M$ components, pick $M$ by CV.
- B Standardise $X$, run PCA on $X$ alone, regress $Y$ on the first $M$ principal components, pick $M$ by cross-validation.
- C Standardise $X$, drop the original predictors $X_j$ whose marginal correlation with $Y$ is smallest, refit OLS on the survivors.
- D Apply a lasso penalty inside the PCA decomposition step so that some loading entries $\phi_{jm}$ are driven exactly to zero.
Show answer
Correct answer: B
PCR: standardise $X$, run PCA on $X$ (unsupervised — does not look at $Y$), regress $Y$ on the first $M$ PCs, sweep $M$ by CV.
A is partial least squares — it uses $Y$ via covariance maximisation. C is variable filtering by marginal correlation, a different (and cruder) idea. D mixes lasso onto the loadings — not a method covered.
Atoms: principal-component-regression, partial-least-squares, standardization. Lecture: L15-modelsel-4.
PCA on a standardised design matrix returns five eigenvalues $\lambda_1 = 2.5,\ \lambda_2 = 1.2,\ \lambda_3 = 0.7,\ \lambda_4 = 0.4,\ \lambda_5 = 0.2$. What fraction of the total variance is explained by the first three principal components?
- A $0.60$
- B $0.74$
- C $0.88$
- D $0.94$
Show answer
Correct answer: C
Total variance $\sum_j \lambda_j = 2.5 + 1.2 + 0.7 + 0.4 + 0.2 = 5.0$. First three: $2.5 + 1.2 + 0.7 = 4.4$. PVE $= 4.4 / 5.0 = 0.88$.
A is $3/5 = 0.60$ — the share if you (wrongly) just count "3 out of 5 components" instead of summing eigenvalues. B is the first two PCs ($3.7/5 = 0.74$). D ($0.94$) matches no partial sum here; even the first four PCs give $4.8/5 = 0.96$.
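The same computation in two lines (numpy sketch):

```python
import numpy as np

eigvals = np.array([2.5, 1.2, 0.7, 0.4, 0.2])
pve = np.cumsum(eigvals) / eigvals.sum()  # cumulative proportion explained
print(pve)  # [0.5  0.74 0.88 0.96 1.  ]
```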
Atoms: principal-component-regression, principal-component-analysis.
The prof framed PCR as a "discretised ridge regression." Which sentence best captures the analogy?
- A Both shrink the small-eigenvalue directions; PCR truncates them abruptly, ridge shrinks smoothly.
- B PCR and ridge shrink every coefficient estimate by the same factor for every value of $\lambda$.
- C Both methods perform automatic variable selection by zeroing out the small-magnitude coefficients.
- D Ridge regression is a special case of PCR taken with $M = p$ (no truncation at all).
Show answer
Correct answer: A
PCR drops the $p - M$ smallest-eigenvalue components outright (hard threshold). Ridge applies a smooth per-component shrinkage factor $d_j^2/(d_j^2 + \lambda)$, where $d_j$ is the $j$-th singular value of $X$, which hits the small-eigenvalue directions hardest. Same target, different shape.
B confuses "discretised analogue" with "identical fits." C is wrong on both counts: neither method does variable selection (that's lasso). D is the reverse implication: PCR with $M = p$ recovers OLS, not ridge.
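The two shrinkage profiles are easy to tabulate side by side (a sketch; the singular values are made up for illustration):

```python
import numpy as np

d = np.array([5.0, 3.0, 1.5, 0.6, 0.2])  # illustrative singular values of X
lam, M = 1.0, 3

ridge_factor = d**2 / (d**2 + lam)           # smooth per-component shrinkage
pcr_factor = (np.arange(len(d)) < M) * 1.0   # hard 0/1 truncation at M

print(np.round(ridge_factor, 2))  # [0.96 0.9  0.69 0.26 0.04]
print(pcr_factor)                 # [1. 1. 1. 0. 0.]
```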
Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4.
Mark each statement about partial least squares (PLS) as true or false.
Show answer
- False — that's PCR. PLS maximises $\mathrm{Cov}(X\phi, Y)$, which uses $Y$.
- True — supervised because $Y$ enters the choice of directions.
- False — neither PCR nor PLS does variable selection. After back-transformation, all original-$X$ coefficients are typically nonzero (because each component is a linear combination of all predictors).
Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4.
Question 14
4 points
Exam 2024 P2
For each method, decide whether it can in principle be fit when $p > n$.
Show answer
- False — $X^\top X$ is singular, infinitely many zero-RSS solutions, training $R^2 = 1$ for any random labels (see the sketch after this list).
- True — $\lambda I$ regularises the inverse: $(X^\top X + \lambda I)^{-1}$ is well-defined and the solution is unique.
- True — lasso fits fine at $p > n$; the active set is just bounded by $\min(n, p)$. Claiming it can't was the canonical wrong-answer trap on the 2024 exam.
- True — slide-flagged: forward stepwise survives high-dim by capping at $\mathcal{M}_{n-1}$. Backward stepwise does not (it needs the full OLS fit).
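A minimal simulation of the first point above (pure-noise labels on a simulated design; seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 100                      # p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)              # labels are pure noise

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # one of many zero-RSS solutions
print(np.allclose(y - X @ beta, 0))          # True: training R^2 = 1 on noise
```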
Atoms: high-dimensional-regression, ridge-regression, lasso, subset-selection. Lecture: L15-modelsel-4.
A regularised regression on $n = 200$ patients with $p = 5000$ candidate gene-expression predictors picks an "active set" of 30 features. What is the most defensible interpretation?
- A These 30 — or features correlated with the truly predictive ones — give one of many predictive models; the true set isn't identifiable.
- B These 30 are the truly causally predictive features; the other 4970 have no relationship at all to the response variable here.
- C Because $p \gg n$, no inference is possible whatsoever and the active set should be ignored as a methodological artifact.
- D The fitted model achieves training $R^2 \approx 1$, which is direct evidence that these 30 features are the right set.
Show answer
Correct answer: A
Slide-flagged verbatim: "In the high-dimensional setting, the multicollinearity problem is extreme. We can never know exactly which variables, if any, truly are predictive of the outcome … At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive." The active set is one of many suitable predictive models; "fishing around," not confirmation.
B overclaims causality and exclusivity. C overcorrects; the active set is informative even if not definitive. D mistakes a near-1 training $R^2$ (automatic at $p \gg n$) for evidence.
Atoms: high-dimensional-regression, lasso. Lecture: L15-modelsel-4.
The prof calls regularisation "the most important variant of model selection that we talk about throughout." Which sentence best captures why regularisation typically improves test error in module 6?
- A It eliminates the irreducible noise term $\sigma^2$ from the test-MSE decomposition entirely.
- B It increases the model's flexibility, lowering bias and variance simultaneously across the entire $\lambda$ range.
- C It guarantees that training MSE equals test MSE, so overfitting becomes structurally impossible.
- D It accepts a small increase in bias in exchange for a larger decrease in variance.
Show answer
Correct answer: D
Verbatim: "Often we can substantially reduce the variance at the cost of a negligible increase in bias." That bias-for-variance trade is the universal sales pitch behind ridge, lasso, smoothing splines, cost-complexity pruning, dropout, weight decay, and bagging.
A is wrong: $\sigma^2$ is irreducible by definition; no method removes it. B inverts the direction: regularisation makes the model less flexible (more constrained). C confuses regularisation with infinite data.
Atoms: regularization, bias-variance-tradeoff. Lecture: L14-modelsel-3.
You want to tune $\lambda$ for ridge regression on a training set $\mathcal{D}$ via $k$-fold CV. Which procedure is correct?
- A Refit ridge on $\mathcal{D}$ for many $\lambda$, pick the $\lambda$ giving the smallest training RSS, refit ridge at that $\lambda$ on $\mathcal{D}$.
- B Run lasso first to identify nonzero coefficients on $\mathcal{D}$, then run ridge with $\lambda$ chosen so all those coefficients are retained.
- C Pick $\lambda$ analytically using $C_p$ or BIC computed on $\mathcal{D}$, then refit ridge at the chosen $\lambda$ on $\mathcal{D}$.
- D For each $\lambda$ in a grid, $k$-fold-CV on $\mathcal{D}$, pick $\lambda$ minimising mean CV-MSE, refit ridge at $\hat\lambda$ on $\mathcal{D}$.
Show answer
Correct answer: D
Standard recipe: grid of $\lambda$, $k$-fold CV per grid point, pick $\hat\lambda$ minimising mean CV-MSE (or one-SE rule), refit on full training data with $\hat\lambda$.
A picks the unregularised end ($\lambda = 0$) every time — training RSS strictly decreases as $\lambda$ shrinks. B conflates two methods and breaks the bias-variance balance ridge is supposed to achieve. C uses penalty criteria the prof distrusts: "their assumptions are always wrong, and they're typically always wrong."
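The standard recipe, sketched with scikit-learn (which names the penalty `alpha`; grid and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

grid = {"alpha": np.logspace(-3, 3, 25)}
search = GridSearchCV(Ridge(), grid, cv=10,
                      scoring="neg_mean_squared_error")
search.fit(X, y)            # k-fold CV per grid point, then refit at the best
print(search.best_params_)  # lambda-hat; best_estimator_ is the final fit
```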
Atoms: cross-validation, ridge-regression, regularization. Lecture: L14-modelsel-3.
Question 18
4 points
Exam 2025 P6b
A 10-fold CV plot of MSE versus $\log(\lambda)$ for lasso on a regression problem shows the minimum CV-MSE at $\lambda_{\min}$ corresponding to lasso test MSE $\approx 50.80$. The unregularised OLS test MSE on the same hold-out is $\approx 50.78$. Which conclusion is best supported?
- A Lasso has decisively beaten OLS; switch to lasso for prediction.
- B The CV plot is broken; an MSE difference this small must be a numerical bug.
- C Lasso did not improve on OLS, suggesting the variance reduction from regularisation does not offset the bias bump — keep all parameters (use OLS).
- D Increasing $\lambda$ further would reduce the test MSE more, so $\lambda_{\min}$ is a local rather than a global minimum.
Show answer
Correct answer: C
Verbatim L27 framing: "normally with regularization … you trade off — by including it, you trade off between an increase in bias by getting a reduction in variance. But in this case, the reduction in variance is not offset by the increase in bias. So you don't want to add a regularizer, meaning you want to keep all the parameters."
A overclaims given the 0.02 gap (in lasso's disfavour). B is unfounded; both numbers can sit that close. D is internally inconsistent: $\lambda_{\min}$ is by construction the CV minimum across the grid; you don't pick $\lambda$ "further" without re-doing the CV.
Atoms: lasso, cross-validation, bias-variance-tradeoff, regularization. Lecture: L27-summary.
Among the candidate $\lambda$ values along the CV grid, the one-standard-error rule selects:
- A The smallest $\lambda$ whose mean CV-MSE is within one standard error of the minimum.
- B The $\lambda$ at which CV-MSE is minimised.
- C The largest $\lambda$ whose mean CV-MSE is within one standard error of the minimum.
- D The $\lambda$ at which the standard error of the per-fold MSEs is itself minimised.
Show answer
Correct answer: C
Largest $\lambda$ (= simplest model) whose mean CV-MSE is within one SE of the minimum. The 2024 and 2025 past exams both demanded this rule. Picks the simplest model that's statistically indistinguishable from the best.
A picks the smallest $\lambda$ within the one-SE band — that's near lambda.min on the low-$\lambda$ side, not the conservative simpler-model end. B is the standard lambda.min rule — fine, but not the one-SE rule (which deliberately picks a simpler model statistically indistinguishable from the best). D conflates SE of the CV-MSE estimate with the variance of fold-MSEs at a given $\lambda$ — that quantity tends to be smallest in regions where folds happen to agree, which has nothing to do with simplicity.
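The rule is a few lines of numpy given the per-fold MSEs (a sketch under the usual convention that the SE band is taken at the minimising $\lambda$):

```python
import numpy as np

def one_se_lambda(lambdas, fold_mse):
    """lambdas: (L,) grid; fold_mse: (k, L) per-fold CV MSEs.
    Return the largest lambda whose mean CV-MSE is within one
    standard error of the minimum."""
    mean = fold_mse.mean(axis=0)
    se = fold_mse.std(axis=0, ddof=1) / np.sqrt(fold_mse.shape[0])
    i_min = mean.argmin()
    within = mean <= mean[i_min] + se[i_min]
    return lambdas[within].max()
```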
Atoms: cross-validation, ridge-regression, lasso.
For each method, mark whether it requires standardising the predictors before fitting.
Show answer
- True — the $L^2$ penalty treats all $\beta_j$ symmetrically, so the $X_j$ must be on a common scale; otherwise small-unit predictors get crushed.
- True — same logic for $L^1$.
- True — PCA (the workhorse of PCR) is not scale-invariant: a kilometre vs centimetre column would dominate the first PC. Slide-flagged: "Standardize all the $p$ variables before applying PCA."
- False — OLS is scale-invariant in fit. The $\hat\beta_j$ rescale to compensate, but predictions are unchanged. Standardisation for OLS is a cosmetic / interpretation choice, not a correctness requirement.
Each sub-statement scores $1$ point.
Atoms: standardization, ridge-regression, lasso, principal-component-regression.
Question 21
4 points
ISLP §6 Q3
Estimate $\beta$ by minimising $\sum_i (y_i - \beta_0 - \sum_j \beta_j x_{ij})^2$ subject to $\sum_j |\beta_j| \le s$ (the constraint form of lasso). As $s$ increases from $0$ toward $\infty$, mark each statement as true or false.
Show answer
- True — relaxing the constraint $s$ enlarges the feasible region, so the optimiser can only do as well or better on training RSS.
- False — test RSS is U-shaped (initially decreases, then increases). At $s = 0$ all coefficients are zero (high bias, useless model); at $s = \infty$ you recover OLS, which can overfit.
- True — at $s = 0$ the model is constant (zero variance); as $s$ grows, the fitted function becomes more sensitive to data, variance grows.
- True — at $s = 0$ the model is maximally biased (constant); as $s$ grows toward OLS, bias shrinks toward the OLS bias.
Each sub-statement scores $1$ point.
Atoms: lasso, bias-variance-tradeoff.
As the ridge penalty $\lambda$ is increased from $0$, in what direction do the squared bias and variance of the ridge estimator typically move?
- A Squared bias goes up; variance goes down.
- B Both fall together initially, then both rise past a critical $\lambda$.
- C Squared bias stays constant; only variance moves, and it goes down.
- D Squared bias goes up; variance goes up.
Show answer
Correct answer: A
Increasing $\lambda$ shrinks coefficients toward zero, biasing the estimator (squared bias goes up) but stabilising it across resamples (variance goes down). The U-shape of test MSE is what you see when these two trends cross.
B describes a non-monotone path with no basis: both quantities are individually monotone in $\lambda$ for ridge. C falsely treats ridge as unbiased — ridge shrinks the OLS estimate by $1/(1+\lambda)$ even on orthogonal $X$, so it's biased the moment $\lambda > 0$. D contradicts the variance-reduction reason regularisation exists.
Atoms: bias-variance-tradeoff, ridge-regression, regularization.
"Double descent" / benign overfitting refers to the empirical observation that:
- A Test error always goes down monotonically with the number of fitted parameters in the model class.
- B Past the interpolation point ($p \approx n$), the bias-variance decomposition itself no longer holds for the test MSE.
- C Cross-validation error is always U-shaped in the model's degree of flexibility, no matter the size of $p$.
- D Past the interpolation point, test error can drop again because the optimisation picks a minimum-norm interpolator among infinitely many.
Show answer
Correct answer: D
The prof's headline framing: "the optimisation changes from 'fit + penalty' to 'min penalty subject to fitting'." Past $p \approx n$, infinitely many zero-training-error solutions exist; the pseudoinverse / SGD picks the smallest-norm one, which is implicit ridge. That's the "benign" in benign overfitting.
A is too strong — there's still the U-shape before the interpolation point. B is wrong: the decomposition stays exact at every $p$; double descent is just a non-U profile of test MSE $= \sigma^2 + \text{bias}^2 + \text{variance}$. C ignores the second descent entirely — that's the whole point of the phenomenon.
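A minimal simulation of the min-norm interpolator via the pseudoinverse (dimensions, signal, and seed all arbitrary); test MSE typically spikes near $p = n$ and then falls again as $p$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_test = 40, 500
beta = np.zeros(1000); beta[:10] = 1.0        # sparse true signal
X = rng.normal(size=(n, 1000))
Xt = rng.normal(size=(n_test, 1000))
y = X @ beta + rng.normal(size=n)
yt = Xt @ beta + rng.normal(size=n_test)

for p in [10, 30, 40, 100, 1000]:             # interpolation point at p = n
    bhat = np.linalg.pinv(X[:, :p]) @ y       # min-norm least squares
    print(p, round(np.mean((yt - Xt[:, :p] @ bhat) ** 2), 2))
```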
Atoms: double-descent, bias-variance-tradeoff, ridge-regression. Lecture: L13-modelsel-2.
On the Credit dataset, best-subset and forward stepwise both pick the same 3-variable model, but at $k = 4$ best-subset drops rating and adds cards, while forward stepwise keeps rating and adds limit. The most accurate explanation is:
- A Forward stepwise is greedy and cannot drop a previously-added predictor, so it locks in early choices that best-subset is free to revisit.
- B Forward stepwise standardises predictors internally while best-subset does not, so the two procedures see effectively different design matrices at $k = 4$.
- C The two methods are minimising different criteria at $k = 4$: best-subset minimises $C_p$, forward stepwise minimises training RSS, so divergence is expected.
- D Forward stepwise produced a wrong answer because the predictors are correlated; the best-subset answer is the unique correct one.
Show answer
Correct answer: A
Forward stepwise can only add. Once rating is in $\mathcal{M}_3$, it stays in $\mathcal{M}_4, \mathcal{M}_5, \dots$. Best-subset re-evaluates every $k$-subset independently, so it can swap variables between sizes. Hybrid / sequential-replacement stepwise was designed precisely to recover this flexibility.
B invents a standardisation difference that doesn't exist: both procedures operate on the same design matrix once you've decided to standardise (or not). C garbles the criterion at fixed $k$: both best-subset and forward stepwise compare candidate $k$-models by training RSS at that $k$; $C_p$ enters only later when picking across different $k$. The divergence at $k = 4$ comes from the search strategy, not the within-$k$ criterion. D claims a uniqueness that doesn't exist: under collinearity, multiple subsets are essentially indistinguishable on the data.
Atoms: subset-selection, lasso. Lecture: L12-modelsel-1.
Question 25
4 points
Exam 2024 P3v
Three columns of estimated coefficients on the same standardised predictors are reported:
| Predictor | OLS | Method P | Method Q |
| --- | --- | --- | --- |
| Income | -7.80 | -5.84 | -7.07 |
| Limit | 0.19 | 0.14 | 0.16 |
| Rating | 1.14 | 0.86 | 0.00 |
| Cards | 17.7 | 13.2 | 16.8 |
| Age | -0.61 | -0.45 | 0.00 |
Which assignment is most consistent with the patterns?
- A P = best-subset selection at $k = 5$ (all five variables retained), Q = lasso.
- B P = ridge, Q = lasso.
- C P = OLS on a smaller training set, Q = lasso.
- D P = elastic net with $\alpha = 0$, Q = elastic net with $\alpha = 1$ — but the column patterns would be reversed.
Show answer
Correct answer: B
Method P shrinks every coefficient toward zero but no coefficient is exactly zero — that's ridge. Method Q has two coefficients exactly at $0.00$ (Rating, Age) and the rest near OLS — that's lasso's variable-selection signature.
A is wrong on P: best-subset at $k = 5$ retaining all five variables would just give back the OLS column, not P's uniformly-shrunken pattern. C fails for the same reason: OLS refit on a smaller training set perturbs coefficients noisily in both directions, not proportionally toward zero as column P shows. D's parenthetical is internally contradictory: $\alpha = 0$ in the standard glmnet parameterisation is ridge (no sparsity) and $\alpha = 1$ is lasso (sparsity), which matches B rather than D's claimed reversal.
Atoms: ridge-regression, lasso, elastic-net.
A genomics study has $n = 80$ patients, $p = 2000$ gene expressions, and the analyst expects fewer than $\sim 20$ genes to actually drive the response. Interpretability of which genes matter is essential. Which method is most appropriate?
- A Plain OLS on all $p = 2000$ predictors.
- B Ridge regression with $\lambda$ chosen by 10-fold CV.
- C Lasso with $\lambda$ chosen by 10-fold CV.
- D Backward stepwise selection from the full $p = 2000$ model.
Show answer
Correct answer: C
Sparse truth + interpretability + $p > n$ is the canonical lasso use case. Lasso fits at $p > n$ (active set bounded by $n$), zeros most coefficients, leaves a small interpretable subset.
A: OLS at $p > n$ is degenerate (singular $X^\top X$). B: ridge fits but never zeros, so doesn't deliver interpretability. D: backward stepwise needs the full OLS fit and so fails at $p > n$.
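A sketch of the workflow with scikit-learn on data simulated to match the scenario (all names and sizes illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 80, 2000
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:20] = 2.0     # ~20 truly active "genes"
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=10).fit(X, y)          # lambda grid + 10-fold CV built in
print(len(np.flatnonzero(fit.coef_)))   # a small, interpretable active set
```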
Atoms: lasso, ridge-regression, high-dimensional-regression, subset-selection.
A simulation has $n = 100$ training observations with $20$ truly predictive features. The analyst pads the design matrix with random noise features and watches test MSE. Mark each statement as true or false.
Show answer
- True — slide-flagged: "adding noise features that are not associated with the response increases test error." Regularisation slows the climb but doesn't eliminate it.
- True — verbatim from L15: "When $p = 2000$ the lasso performed poorly regardless of the amount of regularization." Regularisation helps but cannot rescue an arbitrarily noise-padded design.
- False — at $p \ge n$, training $R^2 = 1$ for any random labels and any random design. It says nothing about generalisation.
Atoms: high-dimensional-regression, lasso, regularization.
Two predictors $X_1, X_2$ are nearly perfectly correlated. Each method is fit at a moderate penalty / constraint level. Which row best describes the typical behaviour?
- A Both methods give nearly identical fits — under collinearity, $L^1$ and $L^2$ become equivalent up to rescaling of $\lambda$.
- B Both methods zero both; only OLS keeps them at non-zero values.
- C Both methods average them; the only difference is the magnitude of the joint coefficient.
- D Lasso zeros one of the two; ridge averages them at moderate values.
Show answer
Correct answer: D
The prof's "capitalist vs socialist" framing: lasso picks one and zeros the other (corner of the diamond on an axis); ridge averages them at moderate values (smooth circle, no corners). The choice of which coefficient lasso zeros is data-dependent and unstable across resamples — this is the failure mode elastic net was designed to fix.
A invents an "$L^1 \approx L^2$ under collinearity" equivalence that doesn't exist — the geometric difference (corner vs smooth ball) is exactly what makes their behaviour diverge most sharply on collinear predictors. B zeros both, which neither method does at moderate $\lambda$. C misses lasso's sparsity.
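The contrast is easy to reproduce (a sketch; penalty levels are arbitrary, and which column lasso keeps can flip with the seed):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)   # nearly perfect correlation
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)   # typically one coefficient is 0.0
print(Ridge(alpha=10.0).fit(X, y).coef_)  # both moderate and roughly equal
```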
Atoms: lasso, ridge-regression, ridge-vs-lasso-geometry, elastic-net. Lecture: L13-modelsel-2.
Question 29
4 points
ISLP §6 Q4
Estimate $\beta$ by minimising $\sum_i (y_i - \beta_0 - \sum_j \beta_j x_{ij})^2 + \lambda \sum_j \beta_j^2$ for a particular $\lambda \ge 0$. As $\lambda$ increases from $0$, mark each statement as true or false.
Show answer
- True — increasing $\lambda$ tightens the constraint; the fit can only get worse on training data.
- True — at $\lambda = 0$ you have OLS (high variance); at $\lambda = \infty$ you have the all-zero model (high bias). The CV-MSE-vs-$\lambda$ curve dips through a minimum in between.
- True — bigger penalty ⇒ smaller coefficients ⇒ less wobble across resamples.
- False — irreducible error $\sigma^2$ is a property of the data-generating noise, not the model. No method changes it.
Each sub-statement scores $1$ point.
Atoms: ridge-regression, bias-variance-tradeoff, regularization.
The prof framed lasso as "subset selection without the combinatorial cost." Which sentence best captures the practical pitch?
- A Lasso always finds exactly the same model as best-subset selection at every dataset and every $\lambda$.
- B A single convex optimisation produces a sparse coefficient vector; the nonzero entries pick the active set without enumerating $2^p$ candidates.
- C Lasso runs forward stepwise and backward stepwise in parallel and returns whichever gives smaller training RSS.
- D Lasso prunes the design matrix beforehand and then runs ordinary least squares — there is no penalty in the actual fit.
Show answer
Correct answer: B
The selling point: one convex optimisation, sparsity for free. Verbatim: "Instead of trying many, many, many models, you just run lasso once. Boom, same place." A small caveat the prof notes: lasso's active set may differ from best-subset under collinearity, but for variable-selection purposes you get a comparable answer at far lower computational cost.
A overclaims equivalence — they often agree but aren't guaranteed to. C misdescribes the algorithm entirely. D is the optional refit-on-active-set step, which happens after lasso, not instead of it.
Atoms: lasso, subset-selection, regularization. Lecture: L13-modelsel-2.