
Module 06 — Model selection & regularisation

30 questions · 100 points · ~45 min

Click an option to lock the answer; correct turns green, wrong turns red, and the explanation auto-opens. The score panel at the bottom-left tracks running points.

Question 1 3 points

A dataset has $p = 10$ candidate predictors. How many distinct linear models must best-subset selection fit and compare across all subset sizes, counting the null (intercept-only) model?

Show answer
Correct answer: B

Best-subset enumerates every subset of $\{1,\dots,p\}$, including the empty (intercept-only) model. The count is $\sum_{k=0}^{p} \binom{p}{k} = 2^p$. With $p = 10$ that's $1024$. The prof flagged the formula on the slide and added: "Slow as shit for $p$ big, or even like impossibly slow."

A is the count for forward / backward / hybrid stepwise ($1 + p(p+1)/2$). C confuses subsets with orderings (permutations) of predictors. D is the count of non-empty subsets, $2^p - 1 = 1023$: off by one, because best-subset compares all $2^p$ subsets, the intercept-only model included.
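A one-line check of both counts (the best-subset total and the stepwise total from option A), using only the standard library:

```python
from math import comb

p = 10
# Best-subset fits every subset of {1, ..., p}, including the empty model.
n_models = sum(comb(p, k) for k in range(p + 1))
print(n_models)               # 1024, i.e. 2**p
# Stepwise (forward/backward) fits far fewer: 1 + p(p+1)/2
print(1 + p * (p + 1) // 2)   # 56
```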

Atoms: subset-selection. Lecture: L12-modelsel-1.

Question 2 4 points ISLP §6 Q1

Best-subset, forward stepwise, and backward stepwise selection are run on the same dataset and produce a model of each size $k = 0, 1, \dots, p$. Mark each statement as true or false.

Show answer
  1. True — forward stepwise can only add, never drop, so each step's set strictly contains the previous.
  2. True — backward stepwise only drops; the smaller model is a subset of the larger.
  3. False — best-subset minimises RSS at each $k$ independently. The Credit dataset showed this: at $k=4$ best-subset drops rating and adds cards, so $\mathcal{M}_3 \not\subset \mathcal{M}_4$.
  4. False — backward starts from the full $p$-predictor OLS fit, which requires $X^\top X$ invertible, i.e. $n > p$. Forward stepwise has no such requirement.

Each sub-statement scores $4/4 = 1$ point.

Atoms: subset-selection, high-dimensional-regression. Lecture: L12-modelsel-1.

Question 3 3 points

Which optimisation problem defines the ridge regression coefficients $\hat\beta^R_\lambda$?

Show answer
Correct answer: C

Ridge minimises RSS plus the $L^2$ penalty $\lambda \sum_{j=1}^p \beta_j^2$, where the sum runs over slope coefficients only — the intercept $\beta_0$ is not penalised (slide-flagged: "if we included the intercept, $\beta^R$ would depend on the average of the response").

A is the lasso ($L^1$). B mistakenly penalises the intercept, which makes the fit depend on the response mean. D uses the squared $L^1$ norm instead of $L^2$ (sum of squares); these aren't the same penalty — the squared $L^1$ couples coefficients (cross-terms $|\beta_j||\beta_k|$ enter the objective), which is not ridge.

Atoms: ridge-regression. Lecture: L12-modelsel-1.

Question 4 4 points

Mark each statement about ridge regression and its tuning parameter $\lambda$ as true or false.

Show answer
  1. True — ridge shrinks smoothly to zero only asymptotically (the squared penalty has zero gradient at $\beta=0$, so there's nothing pulling coefficients onto zero, only toward it).
  2. False — direction is reversed: more penalty $\Rightarrow$ less flexible $\Rightarrow$ variance down, (squared) bias up. Classic direction-of-effect trap.
  3. False — verbatim from the prof: "Importantly, ridge regression is not scale-invariant." Different units for $X_j$ change which coefficient the penalty hits hardest. Standardise first.
  4. True — adding $\lambda I$ to $X^\top X$ regularises the inverse, so $\hat\beta^R = (X^\top X + \lambda I)^{-1} X^\top y$ is unique even when $X^\top X$ is singular. The flagship reason ridge survives high-dimensional regression.
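Statement 4 can be verified numerically. A minimal sketch with synthetic data (intercept ignored for simplicity), assuming NumPy: $X^\top X$ is singular when $p > n$, yet the ridge normal equations still have a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 5, 20, 1.0            # p > n, so X^T X is singular
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # at most n = 5, far below p = 20

# Ridge normal equations: (X^T X + lam I) beta = X^T y, unique solution
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)            # (20,)
```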

Each sub-statement scores $1$ point.

Atoms: ridge-regression, standardization, high-dimensional-regression. Lecture: L12-modelsel-1.

Question 5 3 points ISLP §6 Q2

Relative to ordinary least squares, the lasso is:

Show answer
Correct answer: D

Lasso shrinks coefficients (and zeros some), so it is less flexible than unconstrained OLS. It improves prediction accuracy precisely when the bias bump from shrinkage is outweighed by the variance reduction. Verbatim L27 framing of the prof's preferred 2025 Q3a answer: "less flexible than LS, improved accuracy when the increase in bias is less than the decrease in variance."

A is wrong on flexibility and substitutes a collinearity claim for the bias-variance trade. B mis-classifies lasso as more flexible. C correctly identifies lasso as less flexible but the qualifier ("$\sigma^2$ exceeds OLS training MSE") is unrelated to when shrinkage helps — irreducible error is a property of the data, not a comparison criterion. Only D names the actual bias-variance inequality.

Atoms: lasso, bias-variance-tradeoff, regularization. Lecture: L27-summary.

Question 6 4 points

In the special case $n = p$ with $X = I$, lasso has the closed-form solution $\hat\beta_j^L = \mathrm{sign}(\hat\beta_j^{\text{OLS}}) \cdot (|\hat\beta_j^{\text{OLS}}| - \lambda/2)_+$ (ISLP eq. 6.15). Suppose $\hat\beta_1^{\text{OLS}} = 0.6$, $\hat\beta_2^{\text{OLS}} = -0.3$, $\hat\beta_3^{\text{OLS}} = 0.8$, and $\lambda = 1.0$. What are the soft-thresholded lasso estimates $(\hat\beta_1^L, \hat\beta_2^L, \hat\beta_3^L)$?

Show answer
Correct answer: A

Soft-threshold: subtract $\lambda/2 = 0.5$ from $|\hat\beta_j|$, clip negatives to zero, restore the sign.

  • $j=1$: $|0.6| - 0.5 = 0.10 > 0$, keep sign $+$ $\Rightarrow 0.10$.
  • $j=2$: $|-0.3| - 0.5 = -0.20 < 0$, clip to $0$.
  • $j=3$: $|0.8| - 0.5 = 0.30 > 0$, keep sign $+$ $\Rightarrow 0.30$.

So $(0.10, 0, 0.30)$. B forgets the penalty entirely (returns OLS). C applies ridge-style proportional shrinkage and gets the signs wrong. D forgets that the threshold is $\lambda/2$, not $\lambda$, and zeros everything.
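The soft-threshold rule is a one-liner; a sketch (the function name is illustrative) reproducing the three values above:

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Lasso closed form for X = I (ISLP eq. 6.15): sign(b) * max(|b| - lam/2, 0)."""
    b = np.asarray(beta_ols, dtype=float)
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)

print(soft_threshold([0.6, -0.3, 0.8], lam=1.0))  # approximately [0.1, 0.0, 0.3]
```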

Atoms: lasso, ridge-vs-lasso-geometry.

Question 7 4 points

In the constrained-optimisation picture (ISL Fig 6.7), the RSS contours are concentric ellipses centred at $\hat\beta^{\text{OLS}}$, and the regularisation constraint defines a region centred at the origin. Which statement best explains why lasso produces sparse solutions but ridge does not?

Show answer
Correct answer: D

Geometric story: the $L^1$ ball is a diamond with sharp corners on the axes; the $L^2$ ball is smooth. The smallest RSS ellipse that touches the constraint region tends to land on a diamond corner (which forces some $\beta_j = 0$), whereas it touches the smooth $L^2$ ball at a generic point where no coordinate is exactly zero.

A is wrong: nothing forces the lasso solution onto an axis every time; it just typically lands there. B inverts convexity: both the $L^1$ and $L^2$ penalties define convex problems; convexity isn't the discriminator. C confuses fitting (the penalty itself produces sparsity) with hyperparameter tuning. The corners-vs-smooth distinction is what matters.

Atoms: ridge-vs-lasso-geometry, lasso, ridge-regression. Lecture: L13-modelsel-2.

Question 8 3 points

You are shown two coefficient-trace plots from the Credit dataset. In plot X, every coefficient curve approaches but never crosses the zero line as $\lambda$ grows. In plot Y, several coefficient curves drop to exactly zero at finite $\lambda$ values and stay there. Which method produced which plot?

Show answer
Correct answer: B

The defining visual difference: ridge paths approach zero asymptotically but never touch (no kink in the $\beta^2$ penalty at zero); lasso paths drop to exactly zero and stay (the $|\beta|$ kink at zero is what produces sparsity).

A misattributes ridge's smooth-no-zeros pattern to elastic net — but elastic net has an $L^1$ component, so it still produces exact zeros (just less aggressively than pure lasso). C ignores the plot's distinguishing feature (zeros vs no zeros). D blames standardisation, which affects scale of the trace but not whether paths reach zero.
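The zeros-vs-no-zeros signature is easy to reproduce on synthetic data (not the Credit dataset itself), assuming scikit-learn. Note that scikit-learn calls the penalty weight `alpha` rather than $\lambda$:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)  # only 2 real signals

for alpha in [0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Lasso zeros coefficients exactly at finite alpha; ridge never does.
    print(alpha, np.sum(lasso.coef_ == 0), np.sum(ridge.coef_ == 0))
```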

Atoms: ridge-regression, lasso.

Question 9 3 points

Mark each statement about elastic net as true or false.

Show answer
  1. True — that's the definition: $\text{RSS} + \lambda \sum \beta_j^2 + \gamma \sum |\beta_j|$.
  2. False — the $L^1$ component still gives sparsity. Some coefficients are zeroed, just less aggressively than with pure lasso.
  3. True — that's the practical pitch: "this is probably the one that people use the most", because the $L^2$ part rescues lasso's instability under correlated predictors.

Atoms: elastic-net, lasso, ridge-regression. Lecture: L13-modelsel-2.

Question 10 3 points

Which of the following best describes the canonical PCR procedure?

Show answer
Correct answer: B

PCR: standardise $X$, run PCA on $X$ (unsupervised — does not look at $Y$), regress $Y$ on the first $M$ PCs, sweep $M$ by CV.

A is partial least squares — it uses $Y$ via covariance maximisation. C is variable filtering by marginal correlation, a different (and cruder) idea. D mixes lasso onto the loadings — not a method covered.
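One way to sketch the canonical PCR recipe in scikit-learn (synthetic data; the pipeline and grid are illustrative): standardise, run PCA, regress on the first $M$ PCs, and sweep $M$ by CV.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(120)

# PCA never sees y: the directions are chosen unsupervised.
pcr = make_pipeline(StandardScaler(), PCA(), LinearRegression())
grid = GridSearchCV(pcr, {"pca__n_components": range(1, 11)}, cv=5,
                    scoring="neg_mean_squared_error").fit(X, y)
print(grid.best_params_)   # CV-chosen number of components M
```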

Atoms: principal-component-regression, partial-least-squares, standardization. Lecture: L15-modelsel-4.

Question 11 3 points

PCA on a standardised design matrix returns five eigenvalues $\lambda_1 = 2.5,\ \lambda_2 = 1.2,\ \lambda_3 = 0.7,\ \lambda_4 = 0.4,\ \lambda_5 = 0.2$. What fraction of the total variance is explained by the first three principal components?

Show answer
Correct answer: C

Total variance $\sum_j \lambda_j = 2.5 + 1.2 + 0.7 + 0.4 + 0.2 = 5.0$. First three: $2.5 + 1.2 + 0.7 = 4.4$. PVE $= 4.4 / 5.0 = 0.88$.

A is $3/5 = 0.60$ — the share if you (wrongly) just count "3 out of 5 components" instead of summing eigenvalues. B returns the first two PCs ($3.7/5$). D returns the first four PCs ($4.8/5$).
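The arithmetic, spelled out:

```python
eigvals = [2.5, 1.2, 0.7, 0.4, 0.2]
# PVE of the first M components = sum of their eigenvalues / total variance
pve_first3 = sum(eigvals[:3]) / sum(eigvals)
print(round(pve_first3, 2))  # 0.88
```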

Atoms: principal-component-regression, principal-component-analysis.

Question 12 3 points

The prof framed PCR as a "discretised ridge regression." Which sentence best captures the analogy?

Show answer
Correct answer: A

PCR drops the $p - M$ smallest-eigenvalue components outright (hard threshold). Ridge applies a smooth per-component shrinkage factor $d_j^2/(d_j^2 + \lambda)$, where $d_j$ is the $j$-th singular value of $X$, which hits the small-eigenvalue directions hardest. Same target, different shape.

B confuses "discretised analogue" with "identical fits." C is wrong on both counts: neither method does variable selection (that's lasso). D is the reverse implication: PCR with $M = p$ recovers OLS, not ridge.
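The analogy in numbers, reusing hypothetical eigenvalues of $X^\top X$ (i.e. the $d_j^2$) and an illustrative $\lambda$ and $M$:

```python
import numpy as np

d2 = np.array([2.5, 1.2, 0.7, 0.4, 0.2])   # eigenvalues of X^T X, i.e. d_j^2
lam, M = 1.0, 2

ridge_factors = d2 / (d2 + lam)                        # smooth shrinkage per PC
pcr_factors = (np.arange(len(d2)) < M).astype(float)   # hard keep/drop threshold

print(np.round(ridge_factors, 3))  # decreases smoothly with the eigenvalue
print(pcr_factors)                 # [1. 1. 0. 0. 0.]
```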

Atoms: principal-component-regression, ridge-regression. Lecture: L15-modelsel-4.

Question 13 3 points

Mark each statement about partial least squares (PLS) as true or false.

Show answer
  1. False — that's PCR. PLS maximises $\mathrm{Cov}(X\phi, Y)$, which uses $Y$.
  2. True — supervised because $Y$ enters the choice of directions.
  3. False — neither PCR nor PLS does variable selection. After back-transformation, all original-$X$ coefficients are typically nonzero (because each component is a linear combination of all predictors).

Atoms: partial-least-squares, principal-component-regression. Lecture: L15-modelsel-4.

Question 14 4 points Exam 2024 P2

For each method, decide whether it can in principle be fit when $p > n$.

Show answer
  1. False — $X^\top X$ is singular, infinitely many zero-RSS solutions, training $R^2 = 1$ for any random labels.
  2. True — $\lambda I$ regularises the inverse: $(X^\top X + \lambda I)^{-1}$ is well-defined and the solution is unique.
  3. False — this was the canonical wrong-answer trap on the 2024 exam. Lasso fits fine at $p > n$; the active set is just bounded by $\min(n, p)$.
  4. True — slide-flagged: forward stepwise survives high-dim by capping at $\mathcal{M}_{n-1}$. Backward stepwise does not (it needs the full OLS fit).
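Statement 3's trap is cheap to sanity-check, assuming scikit-learn (synthetic data): lasso fits without complaint at $p > n$ and returns a small active set.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 30, 100                                  # p > n
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.standard_normal(n)

fit = Lasso(alpha=0.1).fit(X, y)                # fits fine despite p > n
nnz = int(np.sum(fit.coef_ != 0))
print(nnz)   # small active set, far below p
```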

Atoms: high-dimensional-regression, ridge-regression, lasso, subset-selection. Lecture: L15-modelsel-4.

Question 15 3 points

A regularised regression on $n = 200$ patients with $p = 5000$ candidate gene-expression predictors picks an "active set" of 30 features. What is the most defensible interpretation?

Show answer
Correct answer: A

Slide-flagged verbatim: "In the high-dimensional setting, the multicollinearity problem is extreme. We can never know exactly which variables, if any, truly are predictive of the outcome … At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive." The active set is one of many suitable predictive models; "fishing around," not confirmation.

B overclaims causality and exclusivity. C overcorrects; the active set is informative even if not definitive. D mistakes a near-1 training $R^2$ (automatic at $p \gg n$) for evidence.

Atoms: high-dimensional-regression, lasso. Lecture: L15-modelsel-4.

Question 16 3 points

The prof calls regularisation "the most important variant of model selection that we talk about throughout." Which sentence best captures why regularisation typically improves test error in module 6?

Show answer
Correct answer: D

Verbatim: "Often we can substantially reduce the variance at the cost of a negligible increase in bias." That bias-for-variance trade is the universal sales pitch behind ridge, lasso, smoothing splines, cost-complexity pruning, dropout, weight decay, and bagging.

A is wrong: $\sigma^2$ is irreducible by definition; no method removes it. B inverts the direction: regularisation makes the model less flexible (more constrained). C confuses regularisation with infinite data.

Atoms: regularization, bias-variance-tradeoff. Lecture: L14-modelsel-3.

Question 17 3 points

You want to tune $\lambda$ for ridge regression on a training set $\mathcal{D}$ via $k$-fold CV. Which procedure is correct?

Show answer
Correct answer: D

Standard recipe: grid of $\lambda$, $k$-fold CV per grid point, pick $\hat\lambda$ minimising mean CV-MSE (or one-SE rule), refit on full training data with $\hat\lambda$.

A picks the unregularised end ($\lambda = 0$) every time — training RSS strictly decreases as $\lambda$ shrinks. B conflates two methods and breaks the bias-variance balance ridge is supposed to achieve. C uses penalty criteria the prof distrusts: "their assumptions are always wrong, and they're typically always wrong."
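A minimal sketch of the D recipe using scikit-learn's `RidgeCV`, which runs CV over a $\lambda$ grid and refits on the full training data automatically (synthetic data; the grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.standard_normal(100)

alphas = np.logspace(-3, 3, 25)                   # grid of lambda values
model = RidgeCV(alphas=alphas, cv=10).fit(X, y)   # 10-fold CV per grid point,
                                                  # then refit on all the data
print(model.alpha_)                               # CV-chosen lambda
```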

Atoms: cross-validation, ridge-regression, regularization. Lecture: L14-modelsel-3.

Question 18 4 points Exam 2025 P6b

A 10-fold CV plot of MSE versus $\log(\lambda)$ for lasso on a regression problem shows the minimum CV-MSE at $\lambda_{\min}$ corresponding to lasso test MSE $\approx 50.80$. The unregularised OLS test MSE on the same hold-out is $\approx 50.78$. Which conclusion is best supported?

Show answer
Correct answer: C

Verbatim L27 framing: "normally with regularization … you trade off — by including it, you trade off between an increase in bias by getting a reduction in variance. But in this case, the reduction in variance is not offset by the increase in bias. So you don't want to add a regularizer, meaning you want to keep all the parameters."

A overclaims given the 0.02 gap (in lasso's disfavour). B is unfounded; both numbers can sit that close. D is internally inconsistent: $\lambda_{\min}$ is by construction the CV minimum across the grid; you don't pick $\lambda$ "further" without re-doing the CV.

Atoms: lasso, cross-validation, bias-variance-tradeoff, regularization. Lecture: L27-summary.

Question 19 3 points

Among the candidate $\lambda$ values along the CV grid, the one-standard-error rule selects:

Show answer
Correct answer: C

Largest $\lambda$ (= simplest model) whose mean CV-MSE is within one SE of the minimum. The 2024 and 2025 past exams both demanded this rule. Picks the simplest model that's statistically indistinguishable from the best.

A picks the smallest $\lambda$ within the one-SE band, i.e. the low-$\lambda$ side near lambda.min, not the conservative simpler-model end. B is the standard lambda.min rule: fine, but not the one-SE rule. D conflates the SE of the CV-MSE estimate with the variance of fold-MSEs at a given $\lambda$; that quantity tends to be smallest where the folds happen to agree, which has nothing to do with simplicity.
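The rule itself is a few lines of NumPy; a sketch with made-up grid values:

```python
import numpy as np

def one_se_lambda(lambdas, cv_mean, cv_se):
    """Largest lambda whose mean CV-MSE is within one SE of the minimum."""
    lambdas, cv_mean, cv_se = map(np.asarray, (lambdas, cv_mean, cv_se))
    i_min = np.argmin(cv_mean)
    threshold = cv_mean[i_min] + cv_se[i_min]
    return lambdas[cv_mean <= threshold].max()

lambdas = np.array([0.01, 0.1, 1.0, 10.0])
cv_mean = np.array([5.0, 4.0, 4.3, 6.0])   # minimum at lambda = 0.1
cv_se   = np.array([0.5, 0.5, 0.5, 0.5])   # 4.3 <= 4.0 + 0.5, so pick 1.0
print(one_se_lambda(lambdas, cv_mean, cv_se))  # 1.0
```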

Atoms: cross-validation, ridge-regression, lasso.

Question 20 4 points

For each method, mark whether it requires standardising the predictors before fitting.

Show answer
  1. True — the $L^2$ penalty treats all $\beta_j$ symmetrically, so the $X_j$ must be on a common scale; otherwise small-unit predictors get crushed.
  2. True — same logic for $L^1$.
  3. True — PCA (the workhorse of PCR) is not scale-invariant: a kilometre vs centimetre column would dominate the first PC. Slide-flagged: "Standardize all the $p$ variables before applying PCA."
  4. False — OLS fits are scale-invariant: the $\hat\beta_j$ rescale to compensate, and predictions are unchanged. Standardisation for OLS is a cosmetic / interpretation choice, not a correctness requirement.

Each sub-statement scores $1$ point.

Atoms: standardization, ridge-regression, lasso, principal-component-regression.

Question 21 4 points ISLP §6 Q3

Estimate $\beta$ by minimising $\sum_i (y_i - \beta_0 - \sum_j \beta_j x_{ij})^2$ subject to $\sum_j |\beta_j| \le s$ (the constraint form of lasso). As $s$ increases from $0$ toward $\infty$, mark each statement as true or false.

Show answer
  1. True — relaxing the constraint $s$ enlarges the feasible region, so the optimiser can only do as well or better on training RSS.
  2. False — test RSS is U-shaped (initially decreases, then increases). At $s = 0$ all coefficients are zero (high bias, useless model); at $s = \infty$ you recover OLS, which can overfit.
  3. True — at $s = 0$ the model is constant (zero variance); as $s$ grows, the fitted function becomes more sensitive to data, variance grows.
  4. True — at $s = 0$ the model is maximally biased (constant); as $s$ grows toward OLS, bias shrinks toward the OLS bias.

Each sub-statement scores $1$ point.

Atoms: lasso, bias-variance-tradeoff.

Question 22 3 points

As the ridge penalty $\lambda$ is increased from $0$, in what direction do the squared bias and variance of the ridge estimator typically move?

Show answer
Correct answer: A

Increasing $\lambda$ shrinks coefficients toward zero, biasing the estimator (squared bias goes up) but stabilising it across resamples (variance goes down). The U-shape of test MSE is what you see when these two trends cross.

B describes a non-monotone path with no basis: both quantities are individually monotone in $\lambda$ for ridge. C falsely treats ridge as unbiased — ridge shrinks the OLS estimate by $1/(1+\lambda)$ even on orthogonal $X$, so it's biased the moment $\lambda > 0$. D contradicts the variance-reduction reason regularisation exists.
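Both directions can be checked in the orthonormal-design special case, where $\hat\beta^R_j = \hat\beta^{\text{OLS}}_j/(1+\lambda)$, so squared bias is $(\lambda/(1+\lambda))^2\beta_j^2$ and variance is $\sigma^2/(1+\lambda)^2$ (a sketch; the numbers are illustrative):

```python
# Orthonormal design (X^T X = I): ridge estimate is beta_ols / (1 + lam)
beta, sigma2 = 2.0, 1.0
for lam in [0.0, 0.5, 2.0]:
    sq_bias = (lam / (1 + lam)) ** 2 * beta ** 2   # grows with lam
    var = sigma2 / (1 + lam) ** 2                  # shrinks with lam
    print(lam, round(sq_bias, 3), round(var, 3))
```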

Atoms: bias-variance-tradeoff, ridge-regression, regularization.

Question 23 3 points

"Double descent" / benign overfitting refers to the empirical observation that:

Show answer
Correct answer: D

The prof's headline framing: "the optimisation changes from 'fit + penalty' to 'min penalty subject to fitting'." Past $p \approx n$, infinitely many zero-training-error solutions exist; the pseudoinverse / SGD picks the smallest-norm one, which is implicit ridge. That's the "benign" in benign overfitting.

A is too strong — there's still the U-shape before the interpolation point. B is wrong: the decomposition stays exact at every $p$; double descent is just a non-U profile of test MSE $= \sigma^2 + \text{bias}^2 + \text{variance}$. C ignores the second descent entirely — that's the whole point of the phenomenon.

Atoms: double-descent, bias-variance-tradeoff, ridge-regression. Lecture: L13-modelsel-2.

Question 24 3 points

On the Credit dataset, best-subset and forward stepwise both pick the same 3-variable model, but at $k = 4$ best-subset drops rating and adds cards, while forward stepwise keeps rating and adds limit. The most accurate explanation is:

Show answer
Correct answer: A

Forward stepwise can only add. Once rating is in $\mathcal{M}_3$, it stays in $\mathcal{M}_4, \mathcal{M}_5, \dots$. Best-subset re-evaluates every $k$-subset independently, so it can swap variables between sizes. Hybrid / sequential-replacement stepwise was designed precisely to recover this flexibility.

B invents a standardisation difference that doesn't exist: both procedures operate on the same design matrix once you've decided to standardise (or not). C garbles the criterion at fixed $k$: both best-subset and forward stepwise compare candidate $k$-models by training RSS at that $k$; $C_p$ enters only later when picking across different $k$. The divergence at $k = 4$ comes from the search strategy, not the within-$k$ criterion. D claims a uniqueness that doesn't exist: under collinearity, multiple subsets are essentially indistinguishable on the data.

Atoms: subset-selection, lasso. Lecture: L12-modelsel-1.

Question 25 4 points Exam 2024 P3v

Three columns of estimated coefficients on the same standardised predictors are reported:

Predictor     OLS      Method P    Method Q
Income       -7.80      -5.84       -7.07
Limit         0.19       0.14        0.16
Rating        1.14       0.86        0.00
Cards        17.70      13.20       16.80
Age          -0.61      -0.45        0.00

Which assignment is most consistent with the patterns?

Show answer
Correct answer: B

Method P shrinks every coefficient toward zero but no coefficient is exactly zero — that's ridge. Method Q has two coefficients exactly at $0.00$ (Rating, Age) and the rest near OLS — that's lasso's variable-selection signature.

A is wrong on P: best-subset at $k = 5$ retaining all five variables would just give back the OLS column, not P's uniformly-shrunken pattern. C ignores that ridge-style uniform shrinkage on a single training fit is precisely what column P shows; OLS on a smaller subset would not produce that proportionality. D's parenthetical is internally contradictory: $\alpha = 0$ in the standard glmnet parameterisation is ridge (no sparsity), $\alpha = 1$ is lasso (sparsity) — that matches B, not D's claim.

Atoms: ridge-regression, lasso, elastic-net.

Question 26 3 points

A genomics study has $n = 80$ patients, $p = 2000$ gene expressions, and the analyst expects fewer than $\sim 20$ genes to actually drive the response. Interpretability of which genes matter is essential. Which method is most appropriate?

Show answer
Correct answer: C

Sparse truth + interpretability + $p > n$ is the canonical lasso use case. Lasso fits at $p > n$ (active set bounded by $n$), zeros most coefficients, leaves a small interpretable subset.

A: OLS at $p > n$ is degenerate (singular $X^\top X$). B: ridge fits but never zeros, so doesn't deliver interpretability. D: backward stepwise needs the full OLS fit and so fails at $p > n$.

Atoms: lasso, ridge-regression, high-dimensional-regression, subset-selection.

Question 27 3 points

A simulation has $n = 100$ training observations with $20$ truly predictive features. The analyst pads the design matrix with random noise features and watches test MSE. Mark each statement as true or false.

Show answer
  1. True — slide-flagged: "adding noise features that are not associated with the response increases test error." Regularisation slows the climb but doesn't eliminate it.
  2. True — verbatim from L15: "When $p = 2000$ the lasso performed poorly regardless of the amount of regularization." Regularisation helps but cannot rescue an arbitrarily noise-padded design.
  3. False — at $p \ge n$, training $R^2 = 1$ for any random labels and any random design. It says nothing about generalisation.

Atoms: high-dimensional-regression, lasso, regularization.

Question 28 3 points

Two predictors $X_1, X_2$ are nearly perfectly correlated. Each method is fit at a moderate penalty / constraint level. Which row best describes the typical behaviour?

Show answer
Correct answer: D

The prof's "capitalist vs socialist" framing: lasso picks one and zeros the other (corner of the diamond on an axis); ridge averages them at moderate values (smooth circle, no corners). The choice of which coefficient lasso zeros is data-dependent and unstable across resamples — this is the failure mode elastic net was designed to fix.

A invents an "$L^1 \approx L^2$ under collinearity" equivalence that doesn't exist — the geometric difference (corner vs smooth ball) is exactly what makes their behaviour diverge most sharply on collinear predictors. B zeros both, which neither method does at moderate $\lambda$. C misses lasso's sparsity.

Atoms: lasso, ridge-regression, ridge-vs-lasso-geometry, elastic-net. Lecture: L13-modelsel-2.

Question 29 4 points ISLP §6 Q4

Estimate $\beta$ by minimising $\sum_i (y_i - \beta_0 - \sum_j \beta_j x_{ij})^2 + \lambda \sum_j \beta_j^2$ for a particular $\lambda \ge 0$. As $\lambda$ increases from $0$, mark each statement as true or false.

Show answer
  1. True — increasing $\lambda$ tightens the constraint; the fit can only get worse on training data.
  2. True — at $\lambda = 0$ you have OLS (high variance); at $\lambda = \infty$ you have the all-zero model (high bias). The CV-MSE-vs-$\lambda$ curve dips through a minimum in between.
  3. True — bigger penalty ⇒ smaller coefficients ⇒ less wobble across resamples.
  4. False — irreducible error $\sigma^2$ is a property of the data-generating noise, not the model. No method changes it.

Each sub-statement scores $1$ point.

Atoms: ridge-regression, bias-variance-tradeoff, regularization.

Question 30 3 points

The prof framed lasso as "subset selection without the combinatorial cost." Which sentence best captures the practical pitch?

Show answer
Correct answer: B

The selling point: one convex optimisation, sparsity for free. Verbatim: "Instead of trying many, many, many models, you just run lasso once. Boom, same place." A small caveat the prof notes: lasso's active set may differ from best-subset under collinearity, but for variable-selection purposes you get a comparable answer at far lower computational cost.

A overclaims equivalence — they often agree but aren't guaranteed to. C misdescribes the algorithm entirely. D is the optional refit-on-active-set step, which happens after lasso, not instead of it.

Atoms: lasso, subset-selection, regularization. Lecture: L13-modelsel-2.