
Module 02 — Statistical Learning

25 questions · 100 points · ~40 min


Question 1 4 points

Which of the following methods is best described as nonparametric?

Correct answer: C

KNN assumes no functional form for $f$; the "model" is the training data plus the rule "average / vote over the $K$ nearest neighbours." That is the prof's canonical nonparametric example.

A is parametric: even with polynomial features the model is linear in $\beta$ and you estimate a finite parameter vector. B (LDA) and D (logistic regression) are also parametric — both posit a specific Gaussian / Bernoulli-GLM form for the data and estimate a finite-dimensional $\theta$. The "$K$" in KNN is a hyperparameter, not a structural parameter of an assumed family.

Atoms: parametric-vs-nonparametric, knn-classification.

Question 2 4 points

Assume $Y = f(X) + \varepsilon$ with $\mathbb E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon \perp X$. Predict $\hat Y = \hat f(X)$ at a fixed $x$. Which expression equals $\mathbb E[(Y - \hat Y)^2 \mid X = x]$ before taking any further expectation over the training set?

Correct answer: D

Substitute $Y = f(x) + \varepsilon$ and expand the square: the cross term $-2(f(x)-\hat f(x))\varepsilon$ vanishes under expectation because $\mathbb E[\varepsilon] = 0$, and $\mathbb E[\varepsilon^2] = \mathrm{Var}(\varepsilon) = \sigma^2$. So the pointwise error splits into a reducible $(f - \hat f)^2$ and an irreducible $\sigma^2$.

A drops the noise floor entirely; the irreducible term never goes away. B subtracts the variance, which sign-flips the noise contribution. C double-counts $\sigma^2$, the standard mistake when students confuse $\mathbb E[\varepsilon^2]$ with $2\,\mathrm{Var}(\varepsilon)$.

Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.
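
The decomposition can be checked by simulation. A minimal Monte Carlo sketch, with assumed illustrative values for $f(x)$, $\hat f(x)$, and $\sigma^2$ (none of these come from the question):

```python
import numpy as np

# Assumed values for illustration: f(x) = 2, a fixed fitted value
# f_hat(x) = 1.5, and noise variance sigma^2 = 4.
rng = np.random.default_rng(0)
f_x, fhat_x, sigma2 = 2.0, 1.5, 4.0

eps = rng.normal(0.0, np.sqrt(sigma2), size=1_000_000)
y = f_x + eps                            # Y = f(x) + eps at the fixed x
mc_mse = np.mean((y - fhat_x) ** 2)      # Monte Carlo E[(Y - Y_hat)^2 | X = x]

theory = (f_x - fhat_x) ** 2 + sigma2    # reducible + irreducible
print(mc_mse, theory)
```

The cross term averages to zero over $\varepsilon$, so the simulated error matches $(f(x) - \hat f(x))^2 + \sigma^2$ up to Monte Carlo noise.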

Question 3 4 points CE1 P1b

In the bias-variance derivation, after writing $f(x_0) - \hat f(x_0) = (f(x_0) - \mathbb E[\hat f(x_0)]) + (\mathbb E[\hat f(x_0)] - \hat f(x_0))$ and squaring, the cross term vanishes because:

Correct answer: B

The cross term is $2(f(x_0) - \mathbb E[\hat f(x_0)]) \cdot (\mathbb E[\hat f(x_0)] - \hat f(x_0))$. The first factor is a deterministic constant. Taking expectation over the training set, the second factor's mean is $\mathbb E[\hat f(x_0)] - \mathbb E[\hat f(x_0)] = 0$. That kills the cross term and leaves Bias$^2$ + Variance.

A confuses $f$ (the unknown truth) with $\hat f$ (a function of the training data) — they're not independent in the relevant sense. C is the cross term that vanishes in the earlier step (separating reducible vs irreducible), not in the bias/variance split. D is irrelevant: the decomposition does not require Gaussian errors at all.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error. Lecture: L03-statlearn-2.

Question 4 4 points

A polynomial regression is fit to a fixed dataset for degrees $d = 1, 2, \dots, 20$. Mark each statement as true or false.

  1. True — each more-flexible model contains the previous as a special case (set the higher coefficients to zero), so on the data it was fit to, MSE never increases.
  2. False — test MSE is U-shaped: it falls while bias dominates, then rises as variance takes over.
  3. True — more flexibility lets $\mathbb E[\hat f]$ approach the truth; bias drops with $d$.
  4. False — variance rises with flexibility because the high-degree fit chases noise; this is the right side of the U.

Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff, polynomial-regression. Lecture: L03-statlearn-2.
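
The nesting argument in statement 1 is easy to verify numerically. A sketch with assumed toy data (noisy sine, degrees 1–8):

```python
import numpy as np

# Assumed toy data: noisy sine on [-1, 1].
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3 * x) + rng.normal(0.0, 0.3, size=x.size)

train_mse = []
for d in range(1, 9):
    coef = np.polyfit(x, y, deg=d)       # least-squares polynomial fit
    train_mse.append(np.mean((np.polyval(coef, x) - y) ** 2))

# Degree d nests degree d-1 (zero out the top coefficient), so the
# training MSE sequence never increases.
print(train_mse)
```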

Question 5 4 points Ex4.1

A 7-point training set has predictors $(X_1, X_2)$ and class $Y$: $(3,3,A), (2,0,A), (1,1,A), (0,1,B), (-1,0,B), (2,1,B), (1,0,B)$. Using Euclidean distance and $K = 4$ majority vote, classify the test point $(X_1, X_2) = (1, 2)$.

Correct answer: B

Euclidean distances to $(1,2)$: $(3,3)\!\to\!\sqrt 5$, $(2,0)\!\to\!\sqrt 5$, $(1,1)\!\to\!1$, $(0,1)\!\to\!\sqrt 2$, $(-1,0)\!\to\!\sqrt 8$, $(2,1)\!\to\!\sqrt 2$, $(1,0)\!\to\!2$. Sorted ascending: $1,\sqrt 2,\sqrt 2, 2, \sqrt 5, \sqrt 5, \sqrt 8$. The four nearest are $(1,1)$ class A, $(0,1)$ class B, $(2,1)$ class B, $(1,0)$ class B → vote 1 A vs 3 B → predict B.

A overstates A's support — only one of the four neighbours, $(1,1)$, is class A; a 4-0 sweep would require all four to share that class. C invents a tie that doesn't occur with $K = 4$ on this data. D is the $K = 1$ answer ($(1,1)$ is closest, class A) — correct only if you forgot the question said $K = 4$, the canonical "wrong-$K$" trap from this exercise.

Atoms: knn-classification.
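
The worked distances and the vote can be reproduced in a few lines:

```python
import numpy as np
from collections import Counter

# The 7-point training set from the question.
train = np.array([[3, 3], [2, 0], [1, 1], [0, 1],
                  [-1, 0], [2, 1], [1, 0]], dtype=float)
labels = ["A", "A", "A", "B", "B", "B", "B"]
x0 = np.array([1.0, 2.0])
K = 4

dist = np.linalg.norm(train - x0, axis=1)   # Euclidean distances to (1, 2)
nearest = np.argsort(dist)[:K]              # indices of the K closest points
vote = Counter(labels[i] for i in nearest)
pred = vote.most_common(1)[0][0]
print(np.sort(dist).round(3), dict(vote), pred)
```

The four nearest are $(1,1)$, $(0,1)$, $(2,1)$, $(1,0)$, so the vote is 3–1 for B.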

Question 6 4 points

A KNN classifier is fit to a fixed training set. Mark each statement about increasing $K$ as true or false.

  1. True — averaging over more neighbours blurs out local wobbles.
  2. False — small $K$ is the flexible regime (KNN's flexibility knob is inverted; that's the canonical T/F trap).
  3. True — averaging more $y$'s reduces $\mathrm{Var}(\hat f(x_0))$; this is the standard variance-reduction-by-averaging story.
  4. True — at $K = n$ every point has the same neighbour set (the entire training data), so the prediction is the global majority class everywhere.

Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 7 4 points CE1 P1f

Let $\mathbf X$ be a 2-dimensional random vector with covariance matrix $$\boldsymbol\Sigma = \begin{bmatrix} 16 & 0.6 \\ 0.6 & 9 \end{bmatrix}.$$ What is the correlation $\rho_{12}$ between $X_1$ and $X_2$?

Correct answer: A

$\rho_{12} = \sigma_{12} / \sqrt{\sigma_1^2 \sigma_2^2} = 0.6 / \sqrt{16 \cdot 9} = 0.6 / \sqrt{144} = 0.6 / 12 = 0.05$.

B comes from dividing by the sum of standard deviations, $\sigma_1 + \sigma_2 = 4 + 3 = 7$, giving $0.6/7 \approx 0.086$ — the wrong combination of scales. C divides by $\sigma_1^2 \sigma_2^2 = 144$ instead of its square root: $0.6/144 \approx 0.0042$ — the canonical "forgot the square root" trap. D divides by a single standard deviation, $0.6/3 = 0.2$, dropping the other scale entirely.

Atoms: random-vector-and-covariance, multivariate-normal.
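
As a one-liner:

```python
import numpy as np

# Covariance matrix from the question; rho = sigma_12 / (sigma_1 * sigma_2).
Sigma = np.array([[16.0, 0.6],
                  [0.6,  9.0]])
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(rho)  # ≈ 0.05  (0.6 / 12)
```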

Question 8 4 points CE1 P1g

$\mathbf X$ is bivariate normal with mean $\boldsymbol\mu = (0, 0)^\top$ and covariance $$\boldsymbol\Sigma = \begin{bmatrix} 1 & -1.5 \\ -1.5 & 4 \end{bmatrix}.$$ Which qualitative description matches the contour plot of the density?

Correct answer: C

The diagonal entries $1, 4$ say $X_2$ has larger variance, so the ellipse stretches along $X_2$. The off-diagonal $-1.5$ is negative, so $\rho < 0$ and the ellipse tilts upper-left to lower-right (negative-correlation diagonal).

A would correspond to $\boldsymbol\Sigma = c\,\mathbf I$ (equal variances, zero correlation). B drops the negative covariance and gives the axis-aligned shape that occurs only when $\rho = 0$. D has the tilt direction backwards and the wrong stretching axis (would require $\sigma_1^2 > \sigma_2^2$ and $\rho > 0$).

Atoms: multivariate-normal, random-vector-and-covariance. Lecture: L05-linreg-1.

Question 9 4 points

Mark each statement about a $p$-dimensional random vector $\mathbf X$ with covariance matrix $\boldsymbol\Sigma$ as true or false.

  1. True — direct from $\mathrm{Cov}(\mathbf Z) = \mathbb E[(\mathbf Z - \mathbb E\mathbf Z)(\mathbf Z - \mathbb E\mathbf Z)^\top]$ with $\mathbf Z = C\mathbf X$.
  2. False — covariance only measures linear co-variation. Zero covariance implies independence only under joint normality. The classic counterexample is $Y = X^2$ with $X$ symmetric around zero: $\mathrm{Cov}(X, Y) = \mathbb E[X^3] = 0$, yet $Y$ is a deterministic function of $X$.
  3. True — by definition $\Sigma_{ii} = \mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$.
  4. False — if some linear combination $\mathbf b^\top \mathbf X$ has zero variance (i.e. one component is a deterministic linear function of the others), $\boldsymbol\Sigma$ is singular. This is the multicollinearity / "$|\boldsymbol\Sigma| = 0$" pathology.

Atoms: random-vector-and-covariance, multivariate-normal, collinearity. Lecture: L04-statlearn-3.
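
Statement 1 can be sanity-checked empirically. A sketch using the $\boldsymbol\Sigma$ from Question 10 and an arbitrary (assumed) transform $C$:

```python
import numpy as np

# Assumed example: Sigma from Question 10, an arbitrary 2x2 transform C.
rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
C = np.array([[1.0, -1.0],
              [0.5,  2.0]])

X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
Z = X @ C.T                              # each row is C x
empirical = np.cov(Z, rowvar=False)
theory = C @ Sigma @ C.T                 # Cov(CX) = C Sigma C^T
print(empirical.round(2))
print(theory)
```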

Question 10 4 points

Let $\mathbf X = (X_1, X_2)^\top$ have covariance matrix $\boldsymbol\Sigma = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}$, and define the contrast $Y = X_1 - X_2$ (so $C = (1, -1)$). What is $\mathrm{Var}(Y)$?

Correct answer: A

$\mathrm{Var}(Y) = C\,\boldsymbol\Sigma\,C^\top = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} = 4 + 9 - 2(1) = 11$.

B forgets the minus sign on the cross term: $4 + 9 + 2(1) = 13$ — that's $\mathrm{Var}(X_1 + X_2)$, not the contrast. C drops the second variance entirely. D mishandles the sign convention on the cross term and does not equal $C\,\boldsymbol\Sigma\,C^\top$ for this contrast.

Atoms: contrasts, random-vector-and-covariance. Lecture: L04-statlearn-3.
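
The quadratic form in one line:

```python
import numpy as np

# Contrast variance Var(C X) = C Sigma C^T with C = (1, -1).
Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
C = np.array([1.0, -1.0])
var_Y = C @ Sigma @ C        # 4 + 9 - 2*1 = 11
print(var_Y)  # 11.0
```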

Question 11 4 points CE1 P1d

Mark each statement about the bias-variance tradeoff as true or false.

  1. False — bias-variance is fundamentally about prediction error $\mathbb E[(y_0 - \hat f(x_0))^2]$. Inference focuses on the sampling distribution of $\hat\beta$, which is a different concern.
  2. False — the $\sigma^2$ floor stays fixed regardless of $n$. Variance shrinks with more data, but the irreducible noise is "irreducible" by definition.
  3. False — lower bias often comes paired with higher variance. The relevant comparison is the sum bias$^2$ + variance + $\sigma^2$, not bias alone.
  4. False — large $\sigma^2$ means the noise floor is already high; adding flexibility just inflates variance on top. You want a less flexible (more biased, lower variance) method.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error, flexibility-overfitting-underfitting.

Question 12 4 points CE1 P1e

On a bias-variance plot for KNN regression with $K$ on the $x$-axis, four curves are drawn: squared bias, variance, irreducible error, and total expected test error. Which curve is monotonically decreasing as $K$ increases (within a moderate range)?

Correct answer: B

Larger $K$ averages over more neighbours, shrinking $\mathrm{Var}(\hat f(x_0))$. Variance falls monotonically with $K$.

A goes the wrong direction: bias grows with $K$ because a smoother fit cannot track local wiggles in $f$. C is the flat horizontal line $\sigma^2$ (the floor; not strictly monotone). D is the U-shape — falls then rises — not monotone.

Atoms: bias-variance-tradeoff, knn-regression.
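
A simulation sketch of the variance curve, under an assumed flat truth ($f \equiv 0$, $\sigma = 1$) with the neighbour set of $x_0$ held fixed across training replicates:

```python
import numpy as np

# Assumed setup: f = 0 everywhere, noise sd 1, fixed neighbour set at x0.
# Each replicate redraws the training noise; the prediction is the mean of
# the K nearest y-values, so its variance across replicates is ~ sigma^2/K.
rng = np.random.default_rng(3)
R = 5_000                                 # training-set replicates

pred_var = {}
for K in (1, 5, 25):
    y_neighbours = rng.normal(0.0, 1.0, size=(R, K))
    pred_var[K] = y_neighbours.mean(axis=1).var()
print(pred_var)   # shrinks roughly like 1/K
```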

Question 13 4 points

For KNN regression at a test point $x_0$ with neighbour set $\mathcal N_0$ of size $K$, the prediction is

Correct answer: B

KNN regression averages the $y$-values of the $K$ nearest training points: $\hat f(x_0) = \frac{1}{K}\sum_{\mathcal N_0} y_i$.

A is the KNN classification rule (majority vote, $\arg\max$ over class labels). C centres the neighbour average on the global mean $\bar y$, so its expected value is zero — not a prediction of $y$. D returns the smallest $y$ among the neighbours, ignoring all but one point and discarding the averaging benefit.

Atoms: knn-regression, knn-classification. Lecture: L10-resample-1.

Question 14 4 points

On a plot of expected test MSE vs. flexibility (showing bias$^2$, variance, and a horizontal dashed line that the total error curve never crosses), what does that horizontal dashed line represent?

Correct answer: D

The horizontal asymptote is $\sigma^2 = \mathrm{Var}(\varepsilon)$. No estimator $\hat f$ can drive expected squared error below it because it captures noise that is, by assumption, independent of the predictors.

A is wrong: training MSE is on a different curve (always falling) and isn't an asymptote of the test curve. B confuses bias-at-infinite-flexibility with the noise floor — a fully flexible model can drive bias to zero, but $\sigma^2$ remains. C is a CV bookkeeping quantity, not a feature of the bias-variance plot.

Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.

Question 15 4 points

A linear regression and a KNN regression are both fit to the same training set on $[0, 5]$. You then predict at $x_0 = 12$, far outside the training range. Which behaviour best describes the two predictions?

Correct answer: A

The linear model has a global form ($\hat\beta_0 + \hat\beta_1 x$) so it keeps going in the same direction past the data — sometimes useful, sometimes nonsense. KNN has no global structure: at $x_0 = 12$ its "nearest neighbours" are still the rightmost training points and the prediction is just their average. So KNN flatlines off to the side instead of extrapolating.

B confuses parametric extrapolation with nonparametric averaging — KNN doesn't know about a trend. C reverses the truth: nonparametric methods are worse at extrapolation, not better; "no assumption" cuts both ways. D is wrong on both halves: linear regression happily extrapolates (that's the trap, not the safety), and KNN returns a local average, not the global mean (unless $K = n$).

Atoms: parametric-vs-nonparametric, knn-regression.

Question 16 4 points

Two regression methods, $A$ and $B$, are fit to the same dataset. Method $A$ has training MSE $0.20$ and test MSE $0.35$. Method $B$ has training MSE $0.05$ and test MSE $0.60$. Which diagnosis is most consistent with these numbers?

Correct answer: C

$B$'s training error is much lower than $A$'s, but its test error is much higher — the canonical overfit signature. $A$'s smaller train/test gap and lower test MSE make it the better predictor.

A misreads both methods as underfit; the differences in training MSE rule that out. B reverses which method is overfitting (overfit = good train, bad test, i.e. $B$). D is the standard "training MSE is what counts" trap; training MSE always falls with flexibility and is not a model-selection signal.

Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 17 4 points

Mark each direction-of-effect statement as true or false.

  1. True — flexibility ↑ → variance ↑ in the classical regime; degree-20 polynomials wobble enormously across resamples.
  2. False — variance falls with $K$ because averaging more neighbours stabilises the prediction. KNN's flexibility knob is inverted.
  3. True — both errors high and roughly equal points to bias dominating: the model class can't capture the truth even on its own training data.
  4. False — nonparametric methods (KNN, smoothing splines, GAMs) still have hyperparameters ($K$, $\lambda$, df). What's missing is a global parametric form for $f$, not all knobs.

Atoms: flexibility-overfitting-underfitting, knn-regression, parametric-vs-nonparametric, bias-variance-tradeoff.

Question 18 4 points

A binary classification problem has a Bayes decision boundary that is highly non-linear and curls back on itself in several places. Among $K = 1$, $K = 7$, and $K = 50$ in KNN (training size $n = 200$), which is most likely to be near-optimal in test error, all else equal?

Correct answer: D

A wiggly Bayes boundary needs a flexible classifier — small $K$ — but $K = 1$ over-commits to individual points and inflates variance. The intermediate $K \approx 7$ trades a little bias for much less variance and is typically near the U's minimum.

A confuses "low bias" with "low test error" — variance dominates at $K = 1$. B ignores the bias side: $K = 50$ on a wiggly boundary smooths the truth away. C is wrong: KNN's optimal $K$ depends on how complex the boundary is (small $K$ for wiggly, large $K$ for nearly-linear).

Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 19 4 points

The prof said in lecture: "if you increase the bias a little bit, you can reduce the variance a lot, because of the squared term." Which fact about the bias-variance decomposition does this argument rely on?

Correct answer: C

Test MSE = $\sigma^2$ + Bias$^2$ + Variance. Bias enters squared, so a small absolute increase in bias has a tiny effect on MSE; variance enters linearly, so a comparable absolute decrease moves MSE much more. That asymmetry is why ridge / lasso / smoothing splines / dropout / bagging all work.

A confuses "uncorrelated" with "additive in MSE"; bias and variance are not random variables being correlated — they are deterministic summands in the decomposition. B is empirical and not the mechanism — it's a consequence of being near the U's minimum, not a fact about the decomposition. D invents a non-existent cross term: in the standard derivation cross terms vanish exactly, not by inequality.

Atoms: bias-variance-tradeoff, regularization. Lecture: L13-modelsel-2.
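
The asymmetry in numbers — illustrative values, not from the lecture:

```python
# Bias enters the test MSE squared, variance linearly, so trading a small
# bias increase for a large variance decrease lowers the total.
sigma2 = 1.0

bias_a, var_a = 0.0, 0.50    # flexible method: unbiased, high variance
bias_b, var_b = 0.20, 0.20   # regularised method: a little bias, much less variance

mse_a = sigma2 + bias_a**2 + var_a   # 1.50
mse_b = sigma2 + bias_b**2 + var_b   # 1.24: bias^2 rose by only 0.04
print(mse_a, mse_b)
```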

Question 20 4 points

The prof showed simulations where polynomials of degree up to $100{,}000$ were fit to $n = 100$ points using the pseudoinverse. Test error rose sharply near $d \approx n$ and then fell again in the heavily over-parameterised regime. Which statement best captures his explanation of why this "second descent" happens?

Correct answer: A

Past $p \approx n$ the data-fit constraint is satisfied by infinitely many models; the pseudoinverse / SGD selects the smallest-norm one. That implicit norm penalty is itself a variance-control mechanism — the prof's headline framing of double descent.

B is wrong: the decomposition stays exact at every $p$. The U-shape just isn't the only possible profile. C is wrong: the truth is generally not in the model class (the misspecified-model regime), and the prof explicitly showed that when the truth is in the class (e.g. $f(x) = x^2$), the second descent disappears. D confuses training fit with the irreducible floor — $\sigma^2 = \mathrm{Var}(\varepsilon)$ is a property of the data-generating process and does not change with $p$; what changes is how much of the noise the model absorbs in-sample.

Atoms: double-descent, bias-variance-tradeoff, regularization. Lecture: L04-statlearn-3.
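
The minimum-norm selection itself is easy to see. A sketch with assumed random data ($n = 20$, $p = 100$): any null-space component can be added without changing the fit, and the pseudoinverse solution is the one with none.

```python
import numpy as np

# Under-determined least squares: infinitely many w satisfy Aw = y.
rng = np.random.default_rng(4)
n, p = 20, 100
A = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ y            # pseudoinverse interpolator
# Add a null-space component: still interpolates, but with a larger norm.
v = rng.normal(size=p)
w_alt = w_min + (np.eye(p) - np.linalg.pinv(A) @ A) @ v

resid_min = np.linalg.norm(A @ w_min - y)
resid_alt = np.linalg.norm(A @ w_alt - y)
print(resid_min, resid_alt)              # both ~ 0: both fit the data exactly
print(np.linalg.norm(w_min), np.linalg.norm(w_alt))
```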

Question 21 4 points

Mark each statement about a multivariate normal $\mathbf X \sim N_p(\boldsymbol\mu, \boldsymbol\Sigma)$ as true or false.

  1. True — one of the four key MVN properties: $C\mathbf X \sim N(C\boldsymbol\mu, C\boldsymbol\Sigma C^\top)$.
  2. True — marginals of an MVN are normal (just project onto the relevant coordinate axis in the density's argument).
  3. False — joint normality of $(X_1, X_2)$ is strictly stronger than each component being marginally normal. Counter-examples exist (e.g. construct $(X, Y)$ where each is standard normal but the joint isn't).
  4. True — the density has $|\boldsymbol\Sigma|^{1/2}$ in the denominator and $\boldsymbol\Sigma^{-1}$ in the exponent. If $\boldsymbol\Sigma$ is singular, the formula breaks down on $\mathbb R^p$ (the distribution lives on a lower-dimensional subspace).

Atoms: multivariate-normal, random-vector-and-covariance.
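
Statement 4 can be seen concretely: make one component an exact linear function of another and the density formula's $\boldsymbol\Sigma^{-1}$ no longer exists.

```python
import numpy as np

# Degenerate case: X2 = 2*X1 exactly, so Sigma = [[1, 2], [2, 4]] is singular.
Sigma = np.array([[1.0, 2.0],
                  [2.0, 4.0]])

det = np.linalg.det(Sigma)               # 1*4 - 2*2 = 0
try:
    np.linalg.inv(Sigma)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
print(det, invertible)                   # the MVN density formula breaks down
```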

Question 22 4 points

The prof is on record as critical of calling bias-variance a "trade-off". Which best captures his objection?

Correct answer: B

The decomposition is exact at every $p$. The prof's objection is to the implication that movement on bias requires opposite movement on variance: regularisation can flatten the variance curve, and double descent shows both shrinking past the interpolation point. So "trade-off" is locally accurate near the U-minimum, but misleads as a global rule.

A reverses the prof's actual position: he cites the decomposition as exact when defending it. C invents a hierarchy between bias and variance that he never claims. D is wrong: in simulations where the truth is known you can estimate both terms exactly (Exercise 2.5). The objection is conceptual, not estimation-based.

Atoms: bias-variance-tradeoff, double-descent, regularization. Lecture: L04-statlearn-3.

Question 23 4 points Ex2.5

At a fixed test point $x_0$, $M = 100$ training-set replicates produce predictions $\hat f^{(m)}(x_0)$ for $m = 1, \dots, 100$. The empirical mean of those predictions is $1.80$, the empirical variance is $0.40$, the true value is $f(x_0) = 2.00$, and the noise variance is $\sigma^2 = 4$. Estimate the expected test MSE at $x_0$.

Correct answer: A

$\mathbb E[(y_0 - \hat f(x_0))^2] = \sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}(\hat f) = 4 + (2.00 - 1.80)^2 + 0.40 = 4 + 0.04 + 0.40 = 4.44$.

B drops $\sigma^2$ and reports only the reducible part, $0.04 + 0.40 = 0.44$. C lands at $4.40$ by dropping the squared bias ($4 + 0.40$) or by sign-flipping it ($4.44 - 0.04$) — either mistake loses the $0.04$. D omits the variance term entirely: $4 + 0.04 = 4.04$, the standard "treated $\hat f$ as deterministic" mistake.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error.
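
Plugging the replicate summaries in:

```python
# Numbers from the question: mean prediction 1.80, prediction variance 0.40,
# truth f(x0) = 2.00, noise variance sigma^2 = 4.
f_true, mean_hat, var_hat, sigma2 = 2.00, 1.80, 0.40, 4.0

bias2 = (f_true - mean_hat) ** 2     # (0.20)^2 = 0.04
mse = sigma2 + bias2 + var_hat       # 4 + 0.04 + 0.40 = 4.44
print(mse)
```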

Question 24 4 points

You consider KNN classification with $n = 500$ training points in $p$ predictors. Holding $K$ fixed, you observe that test error grows steeply as $p$ goes from 5 to 50. Which mechanism best explains this?

Correct answer: D

The curse of dimensionality: as $p$ grows, the distribution of pairwise distances collapses around its mean, so the "$K$ nearest" set is barely closer than a random sample. KNN's mechanism (locality) breaks down.

A invents a direct $p$-bias relationship that doesn't exist; bias depends on the truth and on $K$, not on $p$ alone. B reverses the standard pattern (training error need not rise with $p$ for KNN). C conflates the curse with the Bayes error rate; the Bayes floor depends on overlap of class densities, not on the metric's behaviour.

Atoms: curse-of-dimensionality, knn-classification.
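
The concentration effect is easy to simulate. A sketch with assumed uniform data ($n = 500$, comparing $p = 5$ to $p = 50$):

```python
import numpy as np

# Relative contrast (d_max - d_min) / d_min of distances from a query point:
# as p grows it collapses, so the "nearest" neighbours stop being special.
rng = np.random.default_rng(5)
n = 500

def relative_contrast(p):
    X = rng.uniform(size=(n, p))     # training points in [0, 1]^p
    x0 = rng.uniform(size=p)         # query point
    d = np.linalg.norm(X - x0, axis=1)
    return (d.max() - d.min()) / d.min()

c5, c50 = relative_contrast(5), relative_contrast(50)
print(c5, c50)    # the contrast at p = 50 is far smaller
```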

Question 25 4 points

Two methods are compared on a regression task. Method $A$ has lower squared bias but higher variance than method $B$ at the same flexibility. Method $B$ is being considered as a regularised variant of $A$. Which conclusion is best supported by the bias-variance decomposition alone?

Correct answer: C

The decomposition is $\sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}$; the noise floor is the same for both methods, so the comparison reduces to bias$^2$ + variance. Whichever sum is smaller wins, and that's an empirical question. The prof's framing of the 2025 lasso question is exactly this: "improved accuracy when the increase in bias is less than the decrease in variance."

A ignores variance — the standard "bias-only" trap from CE1.1d (iii). B asserts a universal regularisation win that is false in general (e.g. lasso lost to OLS in the prof's L27 walkthrough on Boston Housing). D confuses the noise floor with the total error: identical $\sigma^2$ does not imply identical MSE.

Atoms: bias-variance-tradeoff, regularization. Lecture: L27-summary.