Module 02 — Statistical Learning
25 questions · 100 points · ~40 min
Which of the following methods is best described as nonparametric?
- A Linear regression with a polynomial basis of fixed degree.
- B Linear discriminant analysis with pooled covariance.
- C $K$-nearest neighbors with $K$ chosen by cross-validation.
- D Logistic regression with two predictors.
Correct answer: C
KNN assumes no functional form for $f$; the "model" is the training data plus the rule "average / vote over the $K$ nearest neighbors." That is the prof's canonical nonparametric example.
A is parametric: even with polynomial features the model is linear in $\beta$ and you estimate a finite parameter vector. B (LDA) and D (logistic regression) are also parametric — both posit a specific Gaussian / Bernoulli-GLM form for the data and estimate a finite-dimensional $\theta$. The "$K$" in KNN is a hyperparameter, not a structural parameter of an assumed family.
Atoms: parametric-vs-nonparametric, knn-classification.
Assume $Y = f(X) + \varepsilon$ with $\mathbb E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon \perp X$. Predict $\hat Y = \hat f(X)$ at a fixed $x$. Which expression equals $\mathbb E[(Y - \hat Y)^2 \mid X = x]$ before taking any further expectation over the training set?
- A $(f(x) - \hat f(x))^2$
- B $(f(x) - \hat f(x))^2 - \sigma^2$
- C $(f(x) - \hat f(x))^2 + 2\sigma^2$
- D $(f(x) - \hat f(x))^2 + \sigma^2$
Correct answer: D
Substitute $Y = f(x) + \varepsilon$ and expand the square: the cross term $-2(f(x)-\hat f(x))\varepsilon$ vanishes under expectation because $\mathbb E[\varepsilon] = 0$, and $\mathbb E[\varepsilon^2] = \mathrm{Var}(\varepsilon) = \sigma^2$. So the pointwise error splits into a reducible $(f - \hat f)^2$ and an irreducible $\sigma^2$.
A drops the noise floor entirely; the irreducible term never goes away. B subtracts the variance, which sign-flips the noise contribution. C double-counts $\sigma^2$, the standard mistake when students confuse $\mathbb E[\varepsilon^2]$ with $2\,\mathrm{Var}(\varepsilon)$.
Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.
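A quick Monte Carlo check of the identity (a minimal numpy sketch; the truth $f$, the frozen fit $\hat f$, and all constants are invented illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.5
f    = lambda x: 2 + 3 * x      # hypothetical true regression function
fhat = lambda x: 1.5 + 3.2 * x  # hypothetical fitted function, held fixed

x0 = 2.0
eps = rng.normal(0, sigma, size=1_000_000)
y = f(x0) + eps                 # draws of Y at X = x0

mc = np.mean((y - fhat(x0)) ** 2)             # Monte Carlo E[(Y - Yhat)^2 | X = x0]
exact = (f(x0) - fhat(x0)) ** 2 + sigma ** 2  # reducible + irreducible
print(mc, exact)                              # agree up to simulation error
```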
Question 3 · 4 points · CE1 P1b
In the bias-variance derivation, after writing $f(x_0) - \hat f(x_0) = (f(x_0) - \mathbb E[\hat f(x_0)]) + (\mathbb E[\hat f(x_0)] - \hat f(x_0))$ and squaring, the cross term vanishes because:
- A $f(x_0)$ and $\hat f(x_0)$ are independent across resamples of the training set.
- B Taking expectation over training sets, $\mathbb E[\,\mathbb E[\hat f(x_0)] - \hat f(x_0)\,] = 0$.
- C The noise $\varepsilon$ in the response has expected value zero by assumption.
- D The training observations are assumed to be jointly Gaussian distributed.
Correct answer: B
The cross term is $2(f(x_0) - \mathbb E[\hat f(x_0)]) \cdot (\mathbb E[\hat f(x_0)] - \hat f(x_0))$. The first factor is a deterministic constant. Taking expectation over the training set, the second factor's mean is $\mathbb E[\hat f(x_0)] - \mathbb E[\hat f(x_0)] = 0$. That kills the cross term and leaves Bias$^2$ + Variance.
A confuses $f$ (the unknown truth) with $\hat f$ (a function of the training data) — they're not independent in the relevant sense. C is the cross term that vanishes in the earlier step (separating reducible vs irreducible), not in the bias/variance split. D is irrelevant: the decomposition does not require Gaussian errors at all.
Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error. Lecture: L03-statlearn-2.
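The same vanishing-cross-term argument can be checked empirically: refit an estimator on many simulated training sets and compare $\mathbb E[(f(x_0) - \hat f(x_0))^2]$ with Bias$^2$ + Variance. A sketch under invented settings (a deliberately biased linear fit to a sine truth):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)   # hypothetical truth
sigma, n, M, x0 = 0.5, 50, 2000, 0.8

preds = np.empty(M)
for m in range(M):            # M independent training sets
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    beta = np.polyfit(x, y, deg=1)   # a deliberately biased (linear) fit
    preds[m] = np.polyval(beta, x0)

mse_f = np.mean((f(x0) - preds) ** 2)   # E[(f(x0) - fhat(x0))^2]
bias2 = (f(x0) - preds.mean()) ** 2
var   = preds.var()
print(mse_f, bias2 + var)               # equal up to Monte Carlo error
```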
A polynomial regression is fit to a fixed dataset for degrees $d = 1, 2, \dots, 20$. Mark each statement as true or false.
- True — each more-flexible model contains the previous as a special case (set the higher coefficients to zero), so on the data it was fit to, MSE never increases.
- False — test MSE is U-shaped: it falls while bias dominates, then rises as variance takes over.
- True — more flexibility lets $\mathbb E[\hat f]$ approach the truth; bias drops with $d$.
- False — variance rises with flexibility because the high-degree fit chases noise; this is the right side of the U.
Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff, polynomial-regression. Lecture: L03-statlearn-2.
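A small simulation of the degree sweep (invented truth and noise level; `np.polyfit` may warn about conditioning at high degree, and training MSE is non-increasing only up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)   # hypothetical truth
n, sigma = 60, 0.4
x,  xt = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y,  yt = f(x) + rng.normal(0, sigma, n), f(xt) + rng.normal(0, sigma, n)

for d in range(1, 21):
    beta = np.polyfit(x, y, deg=d)
    train = np.mean((y  - np.polyval(beta, x))  ** 2)  # non-increasing in d
    test  = np.mean((yt - np.polyval(beta, xt)) ** 2)  # falls, then rises
    print(d, round(train, 3), round(test, 3))
```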
Question 5 · 4 points · Ex4.1
A 7-point training set has predictors $(X_1, X_2)$ and class $Y$:
$(3,3,A), (2,0,A), (1,1,A), (0,1,B), (-1,0,B), (2,1,B), (1,0,B)$.
Using Euclidean distance and $K = 4$ majority vote, classify the test point $(X_1, X_2) = (1, 2)$.
- A Class A, by a 4-to-0 unanimous vote among the four nearest neighbours.
- B Class B, by a 3-to-1 vote among the four nearest neighbours.
- C Tied 2-to-2; the prediction is undetermined without a tiebreaker.
- D Class A, because the closest single training point belongs to class A.
Correct answer: B
Euclidean distances to $(1,2)$: $(3,3)\!\to\!\sqrt 5$, $(2,0)\!\to\!\sqrt 5$, $(1,1)\!\to\!1$, $(0,1)\!\to\!\sqrt 2$, $(-1,0)\!\to\!\sqrt 8$, $(2,1)\!\to\!\sqrt 2$, $(1,0)\!\to\!2$. Sorted ascending: $1,\sqrt 2,\sqrt 2, 2, \sqrt 5, \sqrt 5, \sqrt 8$. The four nearest are $(1,1)$ class A, $(0,1)$ class B, $(2,1)$ class B, $(1,0)$ class B → vote 1 A vs 3 B → predict B.
A overstates A's support — only one of the four neighbours, $(1,1)$, is class A; a 4-0 sweep would require all four to share that class. C invents a tie that doesn't occur with $K = 4$ on this data. D is the $K = 1$ answer ($(1,1)$ is closest, class A) — correct only if you forgot the question said $K = 4$, the canonical "wrong-$K$" trap from this exercise.
Atoms: knn-classification.
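The same computation in a few lines of numpy, reproducing the distance table and the 3-to-1 vote:

```python
import numpy as np

X = np.array([(3, 3), (2, 0), (1, 1), (0, 1), (-1, 0), (2, 1), (1, 0)])
y = np.array(["A", "A", "A", "B", "B", "B", "B"])
x0 = np.array([1, 2])

d = np.linalg.norm(X - x0, axis=1)   # Euclidean distances to (1, 2)
nn = np.argsort(d)[:4]               # indices of the K = 4 nearest
print(np.round(d, 3))                # [2.236 2.236 1.    1.414 2.828 1.414 2.   ]
print(y[nn])                         # ['A' 'B' 'B' 'B']  ->  majority vote: B
```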
A KNN classifier is fit to a fixed training set. Mark each statement about increasing $K$ as true or false.
- True — averaging over more neighbours blurs out local wobbles.
- False — small $K$ is the flexible regime (KNN's flexibility knob is inverted; that's the canonical T/F trap).
- True — averaging more $y$'s reduces $\mathrm{Var}(\hat f(x_0))$; this is the standard variance-reduction-by-averaging story.
- True — at $K = n$ every point has the same neighbour set (the entire training data), so the prediction is the global majority class everywhere.
Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.
Question 7 · 4 points · CE1 P1f
Let $\mathbf X$ be a 2-dimensional random vector with covariance matrix
$$\boldsymbol\Sigma = \begin{bmatrix} 16 & 0.6 \\ 0.6 & 9 \end{bmatrix}.$$
What is the correlation $\rho_{12}$ between $X_1$ and $X_2$?
- A $0.05$
- B $0.15$
- C $0.0042$
- D $0.20$
Correct answer: A
$\rho_{12} = \sigma_{12} / \sqrt{\sigma_1^2 \sigma_2^2} = 0.6 / \sqrt{16 \cdot 9} = 0.6 / \sqrt{144} = 0.6 / 12 = 0.05$.
B divides by only one standard deviation: $0.6 / \sqrt{16} = 0.6/4 = 0.15$. C divides by $\sigma_1^2 \sigma_2^2 = 144$ instead of its square root: $0.6/144 \approx 0.0042$ — the canonical "forgot the square root" trap. D divides by the other standard deviation alone: $0.6 / \sqrt{9} = 0.6/3 = 0.20$.
Atoms: random-vector-and-covariance, multivariate-normal.
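The arithmetic, as a short numpy check:

```python
import numpy as np

Sigma = np.array([[16.0, 0.6],
                  [0.6,  9.0]])
sd = np.sqrt(np.diag(Sigma))         # standard deviations (4, 3)
rho = Sigma[0, 1] / (sd[0] * sd[1])  # 0.6 / 12
print(rho)                           # 0.05
```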
Question 8 · 4 points · CE1 P1g
$\mathbf X$ is bivariate normal with mean $\boldsymbol\mu = (0, 0)^\top$ and covariance
$$\boldsymbol\Sigma = \begin{bmatrix} 1 & -1.5 \\ -1.5 & 4 \end{bmatrix}.$$
Which qualitative description matches the contour plot of the density?
- A Circular contours centred at the origin, no diagonal pull.
- B Axis-aligned ellipse stretched along the $X_2$ axis, no diagonal tilt.
- C Tilted ellipse stretched along $X_2$, with the long axis pointing upper-left to lower-right.
- D Tilted ellipse stretched along $X_1$, with the long axis pointing lower-left to upper-right.
Correct answer: C
The diagonal entries $1, 4$ say $X_2$ has larger variance, so the ellipse stretches along $X_2$. The off-diagonal $-1.5$ is negative, so $\rho < 0$ and the ellipse tilts upper-left to lower-right (negative-correlation diagonal).
A would correspond to $\boldsymbol\Sigma = c\,\mathbf I$ (equal variances, zero correlation). B drops the negative covariance and gives the axis-aligned shape that occurs only when $\rho = 0$. D has the tilt direction backwards and the wrong stretching axis (would require $\sigma_1^2 > \sigma_2^2$ and $\rho > 0$).
Atoms: multivariate-normal, random-vector-and-covariance. Lecture: L05-linreg-1.
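One way to verify the orientation without plotting: the eigenvector belonging to the larger eigenvalue of $\boldsymbol\Sigma$ is the long axis of the contour ellipse. A minimal sketch:

```python
import numpy as np

Sigma = np.array([[ 1.0, -1.5],
                  [-1.5,  4.0]])
evals, evecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
major = evecs[:, np.argmax(evals)]    # direction of the long axis
print(np.round(evals, 2))             # [0.38 4.62]
print(np.round(major, 2))             # opposite-signed components (upper-left /
                                      # lower-right tilt), larger weight on X2
```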
Mark each statement about a $p$-dimensional random vector $\mathbf X$ with covariance matrix $\boldsymbol\Sigma$ as true or false.
- True — direct from $\mathrm{Cov}(\mathbf Z) = \mathbb E[(\mathbf Z - \mathbb E\mathbf Z)(\mathbf Z - \mathbb E\mathbf Z)^\top]$ with $\mathbf Z = C\mathbf X$.
- False — covariance only measures linear co-variation. Zero covariance implies independence only under joint normality. The classic counterexample is $Y = X^2$ with $X$ symmetric around zero (see the sketch after this question).
- True — by definition $\Sigma_{ii} = \mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$.
- False — if some linear combination $\mathbf b^\top \mathbf X$ has zero variance (i.e. one component is a deterministic linear function of the others), $\boldsymbol\Sigma$ is singular. This is the multicollinearity / "$|\boldsymbol\Sigma| = 0$" pathology.
Atoms: random-vector-and-covariance, multivariate-normal, collinearity. Lecture: L04-statlearn-3.
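A sketch of the $Y = X^2$ counterexample referenced above (sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=1_000_000)            # symmetric around zero
y = x ** 2                                # fully determined by x...

print(np.cov(x, y)[0, 1])                 # ...yet covariance ~ 0: no linear co-variation
print(y.mean(), y[np.abs(x) > 2].mean())  # ~1.0 overall vs ~5.7 in the tails: dependent
```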
Let $\mathbf X = (X_1, X_2)^\top$ have covariance matrix
$\boldsymbol\Sigma = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}$,
and define the contrast $Y = X_1 - X_2$ (so $C = (1, -1)$). What is $\mathrm{Var}(Y)$?
- A $11$
- B $13$
- C $5$
- D $15$
Correct answer: A
$\mathrm{Var}(Y) = C\,\boldsymbol\Sigma\,C^\top = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} = 4 + 9 - 2(1) = 11$.
B ignores the covariance entirely: $4 + 9 = 13$, as if $X_1$ and $X_2$ were uncorrelated. C subtracts the variances, $9 - 4 = 5$, as if the minus sign in the contrast carried through to them. D forgets the minus sign on the cross term: $4 + 9 + 2(1) = 15$ — that's $\mathrm{Var}(X_1 + X_2)$, not the contrast.
Atoms: contrasts, random-vector-and-covariance. Lecture: L04-statlearn-3.
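The matrix form $\mathrm{Var}(C\mathbf X) = C\,\boldsymbol\Sigma\,C^\top$, checked in numpy:

```python
import numpy as np

Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
C = np.array([[1.0, -1.0]])   # the contrast Y = X1 - X2
print(C @ Sigma @ C.T)        # [[11.]]
```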
Question 11 · 4 points · CE1 P1d
Mark each statement about the bias-variance tradeoff as true or false.
- False — bias-variance is fundamentally about prediction error $\mathbb E[(y_0 - \hat f(x_0))^2]$. Inference focuses on the sampling distribution of $\hat\beta$, which is a different concern.
- False — the $\sigma^2$ floor stays fixed regardless of $n$. Variance shrinks with more data, but the irreducible noise is "irreducible" by definition.
- False — lower bias often comes paired with higher variance. The relevant comparison is the sum bias$^2$ + variance + $\sigma^2$, not bias alone.
- False — large $\sigma^2$ means the noise floor is already high; adding flexibility just inflates variance on top. You want a less flexible (more biased, lower variance) method.
Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error, flexibility-overfitting-underfitting.
Question 12 · 4 points · CE1 P1e
On a bias-variance plot for KNN regression with $K$ on the $x$-axis, four curves are drawn: squared bias, variance, irreducible error, and total expected test error. Which curve is monotonically decreasing as $K$ increases (within a moderate range)?
- A Squared bias.
- B Variance.
- C Irreducible error.
- D Total expected test error.
Correct answer: B
Larger $K$ averages over more neighbours, shrinking $\mathrm{Var}(\hat f(x_0))$. Variance falls monotonically with $K$.
A goes the wrong direction: bias grows with $K$ because a smoother fit cannot track local wiggles in $f$. C is the flat horizontal line $\sigma^2$ (the floor; not strictly monotone). D is the U-shape — falls then rises — not monotone.
Atoms: bias-variance-tradeoff, knn-regression.
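A sketch of the variance-falls / bias-rises pattern for one-dimensional KNN regression (truth, noise level, and all sizes are invented illustration values):

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(4 * x)   # hypothetical truth
n, sigma, M, x0 = 100, 0.5, 500, 0.5

for K in (1, 5, 25, 75):
    preds = np.empty(M)
    for m in range(M):        # M independent training sets
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        nn = np.argsort(np.abs(x - x0))[:K]
        preds[m] = y[nn].mean()                      # KNN regression at x0
    print(K, round(preds.var(), 4),                  # variance: falls with K
             round((f(x0) - preds.mean()) ** 2, 4))  # bias^2: grows with K
```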
For KNN regression at a test point $x_0$ with neighbour set $\mathcal N_0$ of size $K$, the prediction is
- A $\hat f(x_0) = \arg\max_j \frac{1}{K}\sum_{i \in \mathcal N_0} I(y_i = j)$.
- B $\hat f(x_0) = \frac{1}{K} \sum_{i \in \mathcal N_0} y_i$.
- C $\hat f(x_0) = \frac{1}{K}\sum_{i \in \mathcal N_0} (y_i - \bar y)$.
- D $\hat f(x_0) = \min_{i \in \mathcal N_0} y_i$.
Correct answer: B
KNN regression averages the $y$-values of the $K$ nearest training points: $\hat f(x_0) = \frac{1}{K}\sum_{\mathcal N_0} y_i$.
A is the KNN classification rule (majority vote, $\arg\max$ over class labels). C averages deviations from the global mean $\bar y$, so it predicts a near-zero deviation rather than the response itself — not a regression prediction. D returns the smallest $y$ among the neighbours, ignoring all but one point and discarding the averaging benefit.
Atoms: knn-regression, knn-classification. Lecture: L10-resample-1.
On a plot of expected test MSE vs. flexibility (showing bias$^2$, variance, and a horizontal dashed line that the total error curve never crosses), what does that horizontal dashed line represent?
- A The training MSE at the optimal flexibility.
- B The squared bias at infinite flexibility.
- C The cross-validated standard error of the optimal model.
- D The irreducible error $\mathrm{Var}(\varepsilon)$.
Correct answer: D
The horizontal asymptote is $\sigma^2 = \mathrm{Var}(\varepsilon)$. No estimator $\hat f$ can drive expected squared error below it because it captures noise that is, by assumption, independent of the predictors.
A is wrong: training MSE is on a different curve (always falling) and isn't an asymptote of the test curve. B confuses bias-at-infinite-flexibility with the noise floor — a fully flexible model can drive bias to zero, but $\sigma^2$ remains. C is a CV bookkeeping quantity, not a feature of the bias-variance plot.
Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.
A linear regression and a KNN regression are both fit to the same training set on $[0, 5]$. You then predict at $x_0 = 12$, far outside the training range. Which behaviour best describes the two predictions?
- A Linear regression extrapolates along its fitted slope; KNN averages the rightmost training points and tracks no trend.
- B Both methods extrapolate along the local slope implied by the rightmost training points.
- C KNN extrapolates more accurately because it assumes nothing about the global form of $f$.
- D Linear regression refuses to predict outside the training range; KNN returns the global mean of the training $y$'s.
Correct answer: A
The linear model has a global form ($\hat\beta_0 + \hat\beta_1 x$) so it keeps going in the same direction past the data — sometimes useful, sometimes nonsense. KNN has no global structure: at $x_0 = 12$ its "nearest neighbours" are still the rightmost training points and the prediction is just their average. So KNN flatlines off to the side instead of extrapolating.
B confuses parametric extrapolation with nonparametric averaging — KNN doesn't know about a trend. C reverses the truth: nonparametric methods are worse at extrapolation, not better; "no assumption" cuts both ways. D is wrong on both halves: linear regression happily extrapolates (that's the trap, not the safety), and KNN returns a local average, not the global mean (unless $K = n$).
Atoms: parametric-vs-nonparametric, knn-regression.
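A minimal sketch of the two behaviours at $x_0 = 12$ (invented linear truth; plain one-dimensional neighbour search):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, 80)
y = 1 + 2 * x + rng.normal(0, 1, 80)  # hypothetical truth on [0, 5]

x0 = 12.0                             # far outside the training range
beta = np.polyfit(x, y, deg=1)
lin = np.polyval(beta, x0)            # follows the fitted slope: ~25

nn = np.argsort(np.abs(x - x0))[:5]   # K = 5: the five rightmost training points
knn = y[nn].mean()                    # flatlines near f(5): ~11
print(lin, knn)
```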
Two regression methods, $A$ and $B$, are fit to the same dataset. Method $A$ has training MSE $0.20$ and test MSE $0.35$. Method $B$ has training MSE $0.05$ and test MSE $0.60$. Which diagnosis is most consistent with these numbers?
- A Both methods underfit; the test/training gap reflects only the irreducible error $\sigma^2$.
- B Method $A$ is overfitting and method $B$ is underfitting; $A$'s smaller train/test gap suggests bias.
- C Method $B$ is overfitting relative to $A$, and $A$ is the safer choice for prediction.
- D Method $B$ should be preferred because it achieves the lowest training MSE on this dataset.
Correct answer: C
$B$'s training error is much lower than $A$'s, but its test error is much higher — the canonical overfit signature. $A$'s smaller train/test gap and lower test MSE make it the better predictor.
A misreads both methods as underfit; the differences in training MSE rule that out. B reverses which method is overfitting (overfit = good train, bad test, i.e. $B$). D is the standard "training MSE is what counts" trap; training MSE always falls with flexibility and is not a model-selection signal.
Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff.
Mark each direction-of-effect statement as true or false.
- True — flexibility ↑ → variance ↑ in the classical regime; degree-20 polynomials wobble enormously across resamples.
- False — variance falls with $K$ because averaging more neighbours stabilises the prediction. KNN's flexibility knob is inverted.
- True — both errors high and roughly equal points to bias dominating: the model class can't capture the truth even on its own training data.
- False — nonparametric methods (KNN, smoothing splines, GAMs) still have hyperparameters ($K$, $\lambda$, df). What's missing is a global parametric form for $f$, not all knobs.
Atoms: flexibility-overfitting-underfitting, knn-regression, parametric-vs-nonparametric, bias-variance-tradeoff.
A binary classification problem has a Bayes decision boundary that is highly non-linear and curls back on itself in several places. Among $K = 1$, $K = 7$, and $K = 50$ in KNN (training size $n = 200$), which is most likely to be near-optimal in test error, all else equal?
- A $K = 1$, because it has the lowest bias and tracks every wiggle.
- B $K = 50$, because larger $K$ wins by reducing variance.
- C All three should give roughly the same test error; the Bayes boundary's shape doesn't matter to KNN.
- D $K = 7$, because a small-but-not-tiny $K$ tracks the wiggly truth without chasing every noisy point.
Correct answer: D
A wiggly Bayes boundary needs a flexible classifier — small $K$ — but $K = 1$ over-commits to individual points and inflates variance. The intermediate $K \approx 7$ trades a little bias for much less variance and is typically near the U's minimum.
A confuses "low bias" with "low test error" — variance dominates at $K = 1$. B ignores the bias side: $K = 50$ on a wiggly boundary smooths the truth away. C is wrong: KNN's optimal $K$ depends on how complex the boundary is (small $K$ for wiggly, large $K$ for nearly-linear).
Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.
The prof said in lecture: "if you increase the bias a little bit, you can reduce the variance a lot, because of the squared term." Which fact about the bias-variance decomposition does this argument rely on?
- A Bias and variance are uncorrelated, so any change in one cancels the other.
- B Variance is always larger than bias$^2$ at the optimal flexibility.
- C A small absolute change in bias contributes only its square to the MSE, while variance enters linearly.
- D The cross term in the decomposition is bounded by twice the geometric mean of bias and variance.
Correct answer: C
Test MSE = $\sigma^2$ + Bias$^2$ + Variance. Bias enters squared, so a small absolute increase in bias has a tiny effect on MSE; variance enters linearly, so a comparable absolute decrease moves MSE much more. That asymmetry is why ridge / lasso / smoothing splines / dropout / bagging all work.
A confuses "uncorrelated" with "additive in MSE"; bias and variance are not random variables being correlated — they are deterministic summands in the decomposition. B is empirical and not the mechanism — it's a consequence of being near the U's minimum, not a fact about the decomposition. D invents a non-existent cross term: in the standard derivation cross terms vanish exactly, not by inequality.
Atoms: bias-variance-tradeoff, regularization. Lecture: L13-modelsel-2.
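The asymmetry in two lines of arithmetic (all numbers invented for illustration):

```python
# Triple the bias (0.1 -> 0.3) while cutting variance by 0.5.
bias, var, sigma2 = 0.1, 0.9, 1.0
mse_before = sigma2 + bias ** 2 + var                  # 1.91
mse_after  = sigma2 + (bias + 0.2) ** 2 + (var - 0.5)  # 1.49
print(mse_before, mse_after)  # bias^2 costs only 0.08; the variance drop saves 0.50
```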
The prof showed simulations where polynomials of degree up to $100{,}000$ were fit to $n = 100$ points using the pseudoinverse. Test error rose sharply near $d \approx n$ and then fell again in the heavily over-parameterised regime. Which statement best captures his explanation of why this "second descent" happens?
- A Past the interpolation point, the optimisation picks the minimum-norm solution among infinitely many interpolators, which acts like implicit ridge regularisation.
- B The bias-variance decomposition stops holding past the interpolation point, allowing test error to drop without a corresponding increase elsewhere.
- C The model recovers the true $f$ exactly once the polynomial degree exceeds the number of data points.
- D The irreducible error $\sigma^2$ shrinks past the interpolation point because the model fits the noise exactly on the training set.
Correct answer: A
Past $p \approx n$ the data-fit constraint is satisfied by infinitely many models; the pseudoinverse / SGD selects the smallest-norm one. That implicit norm penalty is itself a variance-control mechanism — the prof's headline framing of double descent.
B is wrong: the decomposition stays exact at every $p$. The U-shape just isn't the only possible profile. C is wrong: the truth is generally not in the model class (the misspecified-model regime), and the prof explicitly showed that when the truth is in the class (e.g. $f(x) = x^2$), the second descent disappears. D confuses training fit with the irreducible floor — $\sigma^2 = \mathrm{Var}(\varepsilon)$ is a property of the data-generating process and does not change with $p$; what changes is how much of the noise the model absorbs in-sample.
Atoms: double-descent, bias-variance-tradeoff, regularization. Lecture: L04-statlearn-3.
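A sketch in the spirit of that simulation — not the prof's actual code — using minimum-norm least squares via the pseudoinverse on a Legendre basis. How sharply the spike and second descent show up depends on the basis, the noise level, and the test design, so read the output qualitatively:

```python
import numpy as np
from numpy.polynomial import legendre as leg

rng = np.random.default_rng(6)
f = lambda x: np.sin(np.pi * x)  # hypothetical truth, not a polynomial
n, sigma = 30, 0.3
x  = rng.uniform(-1, 1, n);   y  = f(x)  + rng.normal(0, sigma, n)
xt = rng.uniform(-1, 1, 500); yt = f(xt) + rng.normal(0, sigma, 500)

for d in (2, 10, 25, 29, 40, 100, 300):
    # pinv gives the least-squares fit below the interpolation point and
    # the minimum-norm interpolator once d + 1 >= n
    beta = np.linalg.pinv(leg.legvander(x, d)) @ y
    test = np.mean((leg.legvander(xt, d) @ beta - yt) ** 2)
    print(d, round(test, 3), round(np.linalg.norm(beta), 2))
```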
Mark each statement about a multivariate normal $\mathbf X \sim N_p(\boldsymbol\mu, \boldsymbol\Sigma)$ as true or false.
- True — one of the four key MVN properties: $C\mathbf X \sim N(C\boldsymbol\mu, C\boldsymbol\Sigma C^\top)$.
- True — marginals of an MVN are normal (just project onto the relevant coordinate axis in the density's argument).
- False — joint normality of $(X_1, X_2)$ is strictly stronger than each component being marginally normal. Counter-examples exist — e.g. construct $(X, Y)$ where each is standard normal but the joint isn't (see the sketch after this question).
- True — the density has $|\boldsymbol\Sigma|^{1/2}$ in the denominator and $\boldsymbol\Sigma^{-1}$ in the exponent. If $\boldsymbol\Sigma$ is singular, the formula breaks down on $\mathbb R^p$ (the distribution lives on a lower-dimensional subspace).
Atoms: multivariate-normal, random-vector-and-covariance.
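One standard construction for the counterexample flagged in the third statement (a random sign flip; sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1_000_000)
s = rng.choice([-1.0, 1.0], size=x.size)  # independent random sign
y = s * x                                 # Y is also exactly standard normal

# If (X, Y) were jointly normal, X + Y would be normal; instead it has a
# point mass at 0 (whenever s = -1), so the joint cannot be Gaussian.
print(np.mean(x + y == 0))                # ~0.5
```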
The prof is on record as critical of calling bias-variance a "trade-off". Which best captures his objection?
- A The decomposition is mathematically incorrect; bias and variance do not add up to test MSE in the way the formula suggests.
- B The decomposition is exact, but a regularised flexible class can shrink variance without paying full bias cost, and double descent reduces both at once.
- C The "trade-off" framing implies bias is more important than variance; in modern machine learning practice, the opposite is true.
- D Bias and variance cannot in principle be estimated from any finite dataset, so any "trade-off" claim is unfalsifiable from data.
Correct answer: B
The decomposition is exact at every $p$. The prof's objection is to the implication that movement on bias requires opposite movement on variance: regularisation can flatten the variance curve, and double descent shows both shrinking past the interpolation point. So "trade-off" is locally accurate near the U-minimum, but misleads as a global rule.
A reverses the prof's actual position: he cites the decomposition as exact when defending it. C invents a hierarchy between bias and variance that he never claims. D is wrong: in simulations where the truth is known you can estimate both terms exactly (Exercise 2.5). The objection is conceptual, not estimation-based.
Atoms: bias-variance-tradeoff, double-descent, regularization. Lecture: L04-statlearn-3.
Question 23 · 4 points · Ex2.5
At a fixed test point $x_0$, $M = 100$ training-set replicates produce predictions $\hat f^{(m)}(x_0)$ for $m = 1, \dots, 100$. The empirical mean of those predictions is $1.80$, the empirical variance is $0.40$, the true value is $f(x_0) = 2.00$, and the noise variance is $\sigma^2 = 4$. Estimate the expected test MSE at $x_0$.
- A $4.44$
- B $0.44$
- C $4.40$
- D $4.04$
Correct answer: A
$\mathbb E[(y_0 - \hat f(x_0))^2] = \sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}(\hat f) = 4 + (2.00 - 1.80)^2 + 0.40 = 4 + 0.04 + 0.40 = 4.44$.
B drops $\sigma^2$ and keeps only the reducible part: $0.04 + 0.40 = 0.44$. C omits the squared bias: $4 + 0.40 = 4.40$. D omits the variance term: $4 + 0.04 = 4.04$, the standard "treated $\hat f$ as deterministic" mistake.
Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error.
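The plug-in computation, spelled out:

```python
sigma2, f_true = 4.0, 2.00
mean_pred, var_pred = 1.80, 0.40  # empirical mean / variance of the 100 replicates
mse = sigma2 + (f_true - mean_pred) ** 2 + var_pred
print(mse)                        # 4 + 0.04 + 0.40 = 4.44
```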
You consider KNN classification with $n = 500$ training points in $p$ predictors. Holding $K$ fixed, you observe that test error grows steeply as $p$ goes from 5 to 50. Which mechanism best explains this?
- A KNN's bias scales with $p$, so each new predictor mechanically adds bias to the prediction at any test point.
- B KNN's training error rises with $p$; the rise in training error is what causes the corresponding decline in test performance.
- C Adding predictors changes the Bayes error rate of the underlying problem, raising the irreducible classification floor.
- D In high dimensions pairwise Euclidean distances become concentrated; "nearest neighbour" stops carrying local information.
Correct answer: D
The curse of dimensionality: as $p$ grows, the distribution of pairwise distances collapses around its mean, so the "$K$ nearest" set is barely closer than a random sample. KNN's mechanism (locality) breaks down.
A invents a direct $p$-bias relationship that doesn't exist; bias depends on the truth and on $K$, not on $p$ alone. B reverses the standard pattern (training error need not rise with $p$ for KNN). C conflates the curse with the Bayes error rate; the Bayes floor depends on overlap of class densities, not on the metric's behaviour.
Atoms: curse-of-dimensionality, knn-classification.
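A sketch of the concentration effect (uniform data, arbitrary sizes): the spread of pairwise distances relative to their mean shrinks as $p$ grows, so "nearest" and "typical" distances converge:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 500
for p in (5, 50, 500):
    X = rng.uniform(size=(n, p))
    G = X @ X.T                                 # Gram matrix
    sq = np.diag(G)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0)
    d = np.sqrt(D2)[np.triu_indices(n, k=1)]   # all pairwise distances
    print(p, round(d.mean(), 2), round(d.std() / d.mean(), 3))  # ratio shrinks
```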
Two methods are compared on a regression task. Method $A$ has lower squared bias but higher variance than method $B$ at the same flexibility. Method $B$ is being considered as a regularised variant of $A$. Which conclusion is best supported by the bias-variance decomposition alone?
- A $A$ always gives better predictions because it has lower squared bias and bias dominates the decomposition.
- B $B$ always gives better predictions because regularisation reduces test MSE relative to the unregularised version.
- C Whichever method's bias$^2$ + variance sum is smaller at the test point wins; that depends on whether $B$'s variance drop exceeds the bias it pays.
- D The two methods must give identical test MSE because the irreducible noise floor $\sigma^2$ is shared by both.
Correct answer: C
The decomposition is $\sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}$; the noise floor is the same for both methods, so the comparison reduces to bias$^2$ + variance. Whichever sum is smaller wins, and that's an empirical question. The prof's framing of the 2025 lasso question is exactly this: "improved accuracy when the increase in bias is less than the decrease in variance."
A ignores variance — the standard "bias-only" trap from CE1.1d (iii). B asserts a universal regularisation win that is false in general (e.g. lasso lost to OLS in the prof's L27 walkthrough on Boston Housing). D confuses the noise floor with the total error: identical $\sigma^2$ does not imply identical MSE.
Atoms: bias-variance-tradeoff, regularization. Lecture: L27-summary.