
Module 02 — Statistical Learning

25 questions · 100 points · ~40 min


Question 1 4 points

Which of the following methods is best described as nonparametric?

Correct answer: C

KNN assumes no functional form for $f$; the "model" is the training data plus the rule "average / vote over the $K$ nearest neighbours." That is the prof's canonical nonparametric example.

A is parametric: even with polynomial features the model is linear in $\beta$ and you estimate a finite parameter vector. B (LDA) and D (logistic regression) are also parametric — both posit a specific Gaussian / Bernoulli-GLM form for the data and estimate a finite-dimensional $\theta$. The "$K$" in KNN is a hyperparameter, not a structural parameter of an assumed family.

Atoms: parametric-vs-nonparametric, knn-classification.

Question 2 4 points

Assume $Y = f(X) + \varepsilon$ with $\mathbb E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon \perp X$. Predict $\hat Y = \hat f(X)$ at a fixed $x$. Which expression equals $\mathbb E[(Y - \hat Y)^2 \mid X = x]$ before taking any further expectation over the training set?

Correct answer: D

Substitute $Y = f(x) + \varepsilon$ and expand the square: the cross term $-2(f(x)-\hat f(x))\varepsilon$ vanishes under expectation because $\mathbb E[\varepsilon] = 0$, and $\mathbb E[\varepsilon^2] = \mathrm{Var}(\varepsilon) = \sigma^2$. So the pointwise error splits into a reducible $(f - \hat f)^2$ and an irreducible $\sigma^2$.

A drops the noise floor entirely; the irreducible term never goes away. B subtracts the variance, which sign-flips the noise contribution. C double-counts $\sigma^2$, the standard mistake when students confuse $\mathbb E[\varepsilon^2]$ with $2\,\mathrm{Var}(\varepsilon)$.

Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.
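
The decomposition can be checked by simulation. A minimal Monte Carlo sketch, with assumed illustrative values for $f(x)$, $\hat f(x)$, and $\sigma^2$ (none of these come from the question):

```python
import numpy as np

# Assumed values for illustration: f(x) = 2, a fixed fitted value
# f_hat(x) = 1.5, and noise variance sigma^2 = 4.
rng = np.random.default_rng(0)
f_x, fhat_x, sigma2 = 2.0, 1.5, 4.0

eps = rng.normal(0.0, np.sqrt(sigma2), size=1_000_000)
y = f_x + eps                            # Y = f(x) + eps at the fixed x
mc_mse = np.mean((y - fhat_x) ** 2)      # Monte Carlo E[(Y - Y_hat)^2 | X = x]

theory = (f_x - fhat_x) ** 2 + sigma2    # reducible + irreducible
print(mc_mse, theory)
```

The cross term averages to zero over $\varepsilon$, so the simulated error matches $(f(x) - \hat f(x))^2 + \sigma^2$ up to Monte Carlo noise.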

Question 3 4 points CE1 P1b

In the bias-variance derivation, after writing $f(x_0) - \hat f(x_0) = (f(x_0) - \mathbb E[\hat f(x_0)]) + (\mathbb E[\hat f(x_0)] - \hat f(x_0))$ and squaring, the cross term vanishes because:

Correct answer: B

The cross term is $2(f(x_0) - \mathbb E[\hat f(x_0)]) \cdot (\mathbb E[\hat f(x_0)] - \hat f(x_0))$. The first factor is a deterministic constant. Taking expectation over the training set, the second factor's mean is $\mathbb E[\hat f(x_0)] - \mathbb E[\hat f(x_0)] = 0$. That kills the cross term and leaves Bias$^2$ + Variance.

A confuses $f$ (the unknown truth) with $\hat f$ (a function of the training data) — they're not independent in the relevant sense. C is the cross term that vanishes in the earlier step (separating reducible vs irreducible), not in the bias/variance split. D is irrelevant: the decomposition does not require Gaussian errors at all.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error. Lecture: L03-statlearn-2.

Question 4 4 points

A polynomial regression is fit to a fixed dataset for degrees $d = 1, 2, \dots, 20$. Mark each statement as true or false.

  1. True — each more-flexible model contains the previous as a special case (set the higher coefficients to zero), so on the data it was fit to, MSE never increases.
  2. False — test MSE is U-shaped: it falls while bias dominates, then rises as variance takes over.
  3. True — more flexibility lets $\mathbb E[\hat f]$ approach the truth; bias drops with $d$.
  4. False — variance rises with flexibility because the high-degree fit chases noise; this is the right side of the U.

Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff, polynomial-regression. Lecture: L03-statlearn-2.
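
The nesting argument in statement 1 is easy to verify numerically. A sketch with assumed toy data (noisy sine, degrees 1–8):

```python
import numpy as np

# Assumed toy data: noisy sine on [-1, 1].
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(3 * x) + rng.normal(0.0, 0.3, size=x.size)

train_mse = []
for d in range(1, 9):
    coef = np.polyfit(x, y, deg=d)       # least-squares polynomial fit
    train_mse.append(np.mean((np.polyval(coef, x) - y) ** 2))

# Degree d nests degree d-1 (zero out the top coefficient), so the
# training MSE sequence never increases.
print(train_mse)
```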

Question 5 4 points Ex4.1

A 7-point training set has predictors $(X_1, X_2)$ and class $Y$: $(3,3,A), (2,0,A), (1,1,A), (0,1,B), (-1,0,B), (2,1,B), (1,0,B)$. Using Euclidean distance and $K = 4$ majority vote, classify the test point $(X_1, X_2) = (1, 2)$.

Correct answer: B

Euclidean distances to $(1,2)$: $(3,3)\!\to\!\sqrt 5$, $(2,0)\!\to\!\sqrt 5$, $(1,1)\!\to\!1$, $(0,1)\!\to\!\sqrt 2$, $(-1,0)\!\to\!\sqrt 8$, $(2,1)\!\to\!\sqrt 2$, $(1,0)\!\to\!2$. Sorted ascending: $1,\sqrt 2,\sqrt 2, 2, \sqrt 5, \sqrt 5, \sqrt 8$. The four nearest are $(1,1)$ class A, $(0,1)$ class B, $(2,1)$ class B, $(1,0)$ class B → vote 1 A vs 3 B → predict B.

A overstates A's support — only one of the four neighbours, $(1,1)$, is class A; a 4-0 sweep would require all four to share that class. C invents a tie that doesn't occur with $K = 4$ on this data. D is the $K = 1$ answer ($(1,1)$ is closest, class A) — correct only if you forgot the question said $K = 4$, the canonical "wrong-$K$" trap from this exercise.

Atoms: knn-classification.
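
The worked distances and the vote can be reproduced in a few lines:

```python
import numpy as np
from collections import Counter

# The 7-point training set from the question.
train = np.array([[3, 3], [2, 0], [1, 1], [0, 1],
                  [-1, 0], [2, 1], [1, 0]], dtype=float)
labels = ["A", "A", "A", "B", "B", "B", "B"]
x0 = np.array([1.0, 2.0])
K = 4

dist = np.linalg.norm(train - x0, axis=1)   # Euclidean distances to (1, 2)
nearest = np.argsort(dist)[:K]              # indices of the K closest points
vote = Counter(labels[i] for i in nearest)
pred = vote.most_common(1)[0][0]
print(np.sort(dist).round(3), dict(vote), pred)
```

The four nearest are $(1,1)$, $(0,1)$, $(2,1)$, $(1,0)$, so the vote is 3–1 for B.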

Question 6 4 points

A KNN classifier is fit to a fixed training set. Mark each statement about increasing $K$ as true or false.

  1. True — averaging over more neighbours blurs out local wobbles.
  2. False — small $K$ is the flexible regime (KNN's flexibility knob is inverted; that's the canonical T/F trap).
  3. True — averaging more $y$'s reduces $\mathrm{Var}(\hat f(x_0))$; this is the standard variance-reduction-by-averaging story.
  4. True — at $K = n$ every point has the same neighbour set (the entire training data), so the prediction is the global majority class everywhere.

Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 7 4 points CE1 P1f

Let $\mathbf X$ be a 2-dimensional random vector with covariance matrix $$\boldsymbol\Sigma = \begin{bmatrix} 16 & 0.6 \\ 0.6 & 9 \end{bmatrix}.$$ What is the correlation $\rho_{12}$ between $X_1$ and $X_2$?

Correct answer: A

$\rho_{12} = \sigma_{12} / \sqrt{\sigma_1^2 \sigma_2^2} = 0.6 / \sqrt{16 \cdot 9} = 0.6 / \sqrt{144} = 0.6 / 12 = 0.05$.

B comes from dividing by the sum of standard deviations, $\sigma_1 + \sigma_2 = 4 + 3 = 7$, giving $0.6/7 \approx 0.086$ — the wrong combination of scales. C divides by $\sigma_1^2 \sigma_2^2 = 144$ instead of its square root: $0.6/144 \approx 0.0042$ — the canonical "forgot the square root" trap. D divides by a single standard deviation, $0.6/3 = 0.2$, dropping the other scale entirely.

Atoms: random-vector-and-covariance, multivariate-normal.
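
As a one-liner:

```python
import numpy as np

# Covariance matrix from the question; rho = sigma_12 / (sigma_1 * sigma_2).
Sigma = np.array([[16.0, 0.6],
                  [0.6,  9.0]])
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(rho)  # ≈ 0.05  (0.6 / 12)
```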

Question 8 4 points CE1 P1g

$\mathbf X$ is bivariate normal with mean $\boldsymbol\mu = (0, 0)^\top$ and covariance $$\boldsymbol\Sigma = \begin{bmatrix} 1 & -1.5 \\ -1.5 & 4 \end{bmatrix}.$$ Which qualitative description matches the contour plot of the density?

Correct answer: C

The diagonal entries $1, 4$ say $X_2$ has larger variance, so the ellipse stretches along $X_2$. The off-diagonal $-1.5$ is negative, so $\rho < 0$ and the ellipse tilts upper-left to lower-right (negative-correlation diagonal).

A would correspond to $\boldsymbol\Sigma = c\,\mathbf I$ (equal variances, zero correlation). B drops the negative covariance and gives the axis-aligned shape that occurs only when $\rho = 0$. D has the tilt direction backwards and the wrong stretching axis (would require $\sigma_1^2 > \sigma_2^2$ and $\rho > 0$).

Atoms: multivariate-normal, random-vector-and-covariance. Lecture: L05-linreg-1.

Question 9 4 points

Mark each statement about a $p$-dimensional random vector $\mathbf X$ with covariance matrix $\boldsymbol\Sigma$ as true or false.

  1. True — direct from $\mathrm{Cov}(\mathbf Z) = \mathbb E[(\mathbf Z - \mathbb E\mathbf Z)(\mathbf Z - \mathbb E\mathbf Z)^\top]$ with $\mathbf Z = C\mathbf X$.
  2. False — covariance only measures linear co-variation. Zero covariance implies independence only under joint normality. The classic counterexample is $Y = X^2$ with $X$ symmetric around zero: $\mathrm{Cov}(X, Y) = \mathbb E[X^3] = 0$, yet $Y$ is a deterministic function of $X$.
  3. True — by definition $\Sigma_{ii} = \mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$.
  4. False — if some linear combination $\mathbf b^\top \mathbf X$ has zero variance (i.e. one component is a deterministic linear function of the others), $\boldsymbol\Sigma$ is singular. This is the multicollinearity / "$|\boldsymbol\Sigma| = 0$" pathology.

Atoms: random-vector-and-covariance, multivariate-normal, collinearity. Lecture: L04-statlearn-3.
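
Statement 1 can be sanity-checked empirically. A sketch using the $\boldsymbol\Sigma$ from Question 10 and an arbitrary (assumed) transform $C$:

```python
import numpy as np

# Assumed example: Sigma from Question 10, an arbitrary 2x2 transform C.
rng = np.random.default_rng(2)
Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
C = np.array([[1.0, -1.0],
              [0.5,  2.0]])

X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)
Z = X @ C.T                              # each row is C x
empirical = np.cov(Z, rowvar=False)
theory = C @ Sigma @ C.T                 # Cov(CX) = C Sigma C^T
print(empirical.round(2))
print(theory)
```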

Question 10 4 points

Let $\mathbf X = (X_1, X_2)^\top$ have covariance matrix $\boldsymbol\Sigma = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}$, and define the contrast $Y = X_1 - X_2$ (so $C = (1, -1)$). What is $\mathrm{Var}(Y)$?

Correct answer: A

$\mathrm{Var}(Y) = C\,\boldsymbol\Sigma\,C^\top = \sigma_1^2 + \sigma_2^2 - 2\sigma_{12} = 4 + 9 - 2(1) = 11$.

B forgets the minus sign on the cross term: $4 + 9 + 2(1) = 13$ — that's $\mathrm{Var}(X_1 + X_2)$, not the contrast. C drops the second variance entirely. D mishandles the sign convention on the cross term and does not equal $C\,\boldsymbol\Sigma\,C^\top$ for this contrast.

Atoms: contrasts, random-vector-and-covariance. Lecture: L04-statlearn-3.
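
The quadratic form in one line:

```python
import numpy as np

# Contrast variance Var(C X) = C Sigma C^T with C = (1, -1).
Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])
C = np.array([1.0, -1.0])
var_Y = C @ Sigma @ C        # 4 + 9 - 2*1 = 11
print(var_Y)  # 11.0
```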

Question 11 4 points CE1 P1d

Mark each statement about the bias-variance tradeoff as true or false.

  1. False — bias-variance is fundamentally about prediction error $\mathbb E[(y_0 - \hat f(x_0))^2]$. Inference focuses on the sampling distribution of $\hat\beta$, which is a different concern.
  2. False — the $\sigma^2$ floor stays fixed regardless of $n$. Variance shrinks with more data, but the irreducible noise is "irreducible" by definition.
  3. False — lower bias often comes paired with higher variance. The relevant comparison is the sum bias$^2$ + variance + $\sigma^2$, not bias alone.
  4. False — large $\sigma^2$ means the noise floor is already high; adding flexibility just inflates variance on top. You want a less flexible (more biased, lower variance) method.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error, flexibility-overfitting-underfitting.

Question 12 4 points CE1 P1e

On a bias-variance plot for KNN regression with $K$ on the $x$-axis, four curves are drawn: squared bias, variance, irreducible error, and total expected test error. Which curve is monotonically decreasing as $K$ increases (within a moderate range)?

Correct answer: B

Larger $K$ averages over more neighbours, shrinking $\mathrm{Var}(\hat f(x_0))$. Variance falls monotonically with $K$.

A goes the wrong direction: bias grows with $K$ because a smoother fit cannot track local wiggles in $f$. C is the flat horizontal line $\sigma^2$ (the floor; not strictly monotone). D is the U-shape — falls then rises — not monotone.

Atoms: bias-variance-tradeoff, knn-regression.
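
A simulation sketch of the variance curve, under an assumed flat truth ($f \equiv 0$, $\sigma = 1$) with the neighbour set of $x_0$ held fixed across training replicates:

```python
import numpy as np

# Assumed setup: f = 0 everywhere, noise sd 1, fixed neighbour set at x0.
# Each replicate redraws the training noise; the prediction is the mean of
# the K nearest y-values, so its variance across replicates is ~ sigma^2/K.
rng = np.random.default_rng(3)
R = 5_000                                 # training-set replicates

pred_var = {}
for K in (1, 5, 25):
    y_neighbours = rng.normal(0.0, 1.0, size=(R, K))
    pred_var[K] = y_neighbours.mean(axis=1).var()
print(pred_var)   # shrinks roughly like 1/K
```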

Question 13 4 points

For KNN regression at a test point $x_0$ with neighbour set $\mathcal N_0$ of size $K$, the prediction is

Correct answer: B

KNN regression averages the $y$-values of the $K$ nearest training points: $\hat f(x_0) = \frac{1}{K}\sum_{\mathcal N_0} y_i$.

A is the KNN classification rule (majority vote, $\arg\max$ over class labels). C centres the neighbour average on the global mean $\bar y$, so its expected value is zero — not a prediction of $y$. D returns the smallest $y$ among the neighbours, ignoring all but one point and discarding the averaging benefit.

Atoms: knn-regression, knn-classification. Lecture: L10-resample-1.

Question 14 4 points

On a plot of expected test MSE vs. flexibility (showing bias$^2$, variance, and a horizontal dashed line that the total error curve never crosses), what does that horizontal dashed line represent?

Correct answer: D

The horizontal asymptote is $\sigma^2 = \mathrm{Var}(\varepsilon)$. No estimator $\hat f$ can drive expected squared error below it because it captures noise that is, by assumption, independent of the predictors.

A is wrong: training MSE is on a different curve (always falling) and isn't an asymptote of the test curve. B confuses bias-at-infinite-flexibility with the noise floor — a fully flexible model can drive bias to zero, but $\sigma^2$ remains. C is a CV bookkeeping quantity, not a feature of the bias-variance plot.

Atoms: reducible-vs-irreducible-error, bias-variance-tradeoff.

Question 15 4 points

A linear regression and a KNN regression are both fit to the same training set on $[0, 5]$. You then predict at $x_0 = 12$, far outside the training range. Which behaviour best describes the two predictions?

Correct answer: A

The linear model has a global form ($\hat\beta_0 + \hat\beta_1 x$) so it keeps going in the same direction past the data — sometimes useful, sometimes nonsense. KNN has no global structure: at $x_0 = 12$ its "nearest neighbours" are still the rightmost training points and the prediction is just their average. So KNN flatlines off to the side instead of extrapolating.

B confuses parametric extrapolation with nonparametric averaging — KNN doesn't know about a trend. C reverses the truth: nonparametric methods are worse at extrapolation, not better; "no assumption" cuts both ways. D is wrong on both halves: linear regression happily extrapolates (that's the trap, not the safety), and KNN returns a local average, not the global mean (unless $K = n$).

Atoms: parametric-vs-nonparametric, knn-regression.

Question 16 4 points

Two regression methods, $A$ and $B$, are fit to the same dataset. Method $A$ has training MSE $0.20$ and test MSE $0.35$. Method $B$ has training MSE $0.05$ and test MSE $0.60$. Which diagnosis is most consistent with these numbers?

Correct answer: C

$B$'s training error is much lower than $A$'s, but its test error is much higher — the canonical overfit signature. $A$'s smaller train/test gap and lower test MSE make it the better predictor.

A misreads both methods as underfit; the differences in training MSE rule that out. B reverses which method is overfitting (overfit = good train, bad test, i.e. $B$). D is the standard "training MSE is what counts" trap; training MSE always falls with flexibility and is not a model-selection signal.

Atoms: flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 17 4 points

Mark each direction-of-effect statement as true or false.

  1. True — flexibility ↑ → variance ↑ in the classical regime; degree-20 polynomials wobble enormously across resamples.
  2. False — variance falls with $K$ because averaging more neighbours stabilises the prediction. KNN's flexibility knob is inverted.
  3. True — both errors high and roughly equal points to bias dominating: the model class can't capture the truth even on its own training data.
  4. False — nonparametric methods (KNN, smoothing splines, GAMs) still have hyperparameters ($K$, $\lambda$, df). What's missing is a global parametric form for $f$, not all knobs.

Atoms: flexibility-overfitting-underfitting, knn-regression, parametric-vs-nonparametric, bias-variance-tradeoff.

Question 18 4 points

A binary classification problem has a Bayes decision boundary that is highly non-linear and curls back on itself in several places. Among $K = 1$, $K = 7$, and $K = 50$ in KNN (training size $n = 200$), which is most likely to be near-optimal in test error, all else equal?

Correct answer: D

A wiggly Bayes boundary needs a flexible classifier — small $K$ — but $K = 1$ over-commits to individual points and inflates variance. The intermediate $K \approx 7$ trades a little bias for much less variance and is typically near the U's minimum.

A confuses "low bias" with "low test error" — variance dominates at $K = 1$. B ignores the bias side: $K = 50$ on a wiggly boundary smooths the truth away. C is wrong: KNN's optimal $K$ depends on how complex the boundary is (small $K$ for wiggly, large $K$ for nearly-linear).

Atoms: knn-classification, flexibility-overfitting-underfitting, bias-variance-tradeoff.

Question 19 4 points

The prof said in lecture: "if you increase the bias a little bit, you can reduce the variance a lot, because of the squared term." Which fact about the bias-variance decomposition does this argument rely on?

Correct answer: C

Test MSE = $\sigma^2$ + Bias$^2$ + Variance. Bias enters squared, so a small absolute increase in bias has a tiny effect on MSE; variance enters linearly, so a comparable absolute decrease moves MSE much more. That asymmetry is why ridge / lasso / smoothing splines / dropout / bagging all work.

A confuses "uncorrelated" with "additive in MSE"; bias and variance are not random variables being correlated — they are deterministic summands in the decomposition. B is empirical and not the mechanism — it's a consequence of being near the U's minimum, not a fact about the decomposition. D invents a non-existent cross term: in the standard derivation cross terms vanish exactly, not by inequality.

Atoms: bias-variance-tradeoff, regularization. Lecture: L13-modelsel-2.
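
The asymmetry in numbers — illustrative values, not from the lecture:

```python
# Bias enters the test MSE squared, variance linearly, so trading a small
# bias increase for a large variance decrease lowers the total.
sigma2 = 1.0

bias_a, var_a = 0.0, 0.50    # flexible method: unbiased, high variance
bias_b, var_b = 0.20, 0.20   # regularised method: a little bias, much less variance

mse_a = sigma2 + bias_a**2 + var_a   # 1.50
mse_b = sigma2 + bias_b**2 + var_b   # 1.24: bias^2 rose by only 0.04
print(mse_a, mse_b)
```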

Question 20 4 points

The prof showed simulations where polynomials of degree up to $100{,}000$ were fit to $n = 100$ points using the pseudoinverse. Test error rose sharply near $d \approx n$ and then fell again in the heavily over-parameterised regime. Which statement best captures his explanation of why this "second descent" happens?

Correct answer: A

Past $p \approx n$ the data-fit constraint is satisfied by infinitely many models; the pseudoinverse / SGD selects the smallest-norm one. That implicit norm penalty is itself a variance-control mechanism — the prof's headline framing of double descent.

B is wrong: the decomposition stays exact at every $p$. The U-shape just isn't the only possible profile. C is wrong: the truth is generally not in the model class (the misspecified-model regime), and the prof explicitly showed that when the truth is in the class (e.g. $f(x) = x^2$), the second descent disappears. D confuses training fit with the irreducible floor — $\sigma^2 = \mathrm{Var}(\varepsilon)$ is a property of the data-generating process and does not change with $p$; what changes is how much of the noise the model absorbs in-sample.

Atoms: double-descent, bias-variance-tradeoff, regularization. Lecture: L04-statlearn-3.
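
The minimum-norm selection itself is easy to see. A sketch with assumed random data ($n = 20$, $p = 100$): any null-space component can be added without changing the fit, and the pseudoinverse solution is the one with none.

```python
import numpy as np

# Under-determined least squares: infinitely many w satisfy Aw = y.
rng = np.random.default_rng(4)
n, p = 20, 100
A = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_min = np.linalg.pinv(A) @ y            # pseudoinverse interpolator
# Add a null-space component: still interpolates, but with a larger norm.
v = rng.normal(size=p)
w_alt = w_min + (np.eye(p) - np.linalg.pinv(A) @ A) @ v

resid_min = np.linalg.norm(A @ w_min - y)
resid_alt = np.linalg.norm(A @ w_alt - y)
print(resid_min, resid_alt)              # both ~ 0: both fit the data exactly
print(np.linalg.norm(w_min), np.linalg.norm(w_alt))
```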

Question 21 4 points

Mark each statement about a multivariate normal $\mathbf X \sim N_p(\boldsymbol\mu, \boldsymbol\Sigma)$ as true or false.

  1. True — one of the four key MVN properties: $C\mathbf X \sim N(C\boldsymbol\mu, C\boldsymbol\Sigma C^\top)$.
  2. True — marginals of an MVN are normal (just project onto the relevant coordinate axis in the density's argument).
  3. False — joint normality of $(X_1, X_2)$ is strictly stronger than each component being marginally normal. Counter-examples exist (e.g. construct $(X, Y)$ where each is standard normal but the joint isn't).
  4. True — the density has $|\boldsymbol\Sigma|^{1/2}$ in the denominator and $\boldsymbol\Sigma^{-1}$ in the exponent. If $\boldsymbol\Sigma$ is singular, the formula breaks down on $\mathbb R^p$ (the distribution lives on a lower-dimensional subspace).

Atoms: multivariate-normal, random-vector-and-covariance.
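
Statement 4 can be seen concretely: make one component an exact linear function of another and the density formula's $\boldsymbol\Sigma^{-1}$ no longer exists.

```python
import numpy as np

# Degenerate case: X2 = 2*X1 exactly, so Sigma = [[1, 2], [2, 4]] is singular.
Sigma = np.array([[1.0, 2.0],
                  [2.0, 4.0]])

det = np.linalg.det(Sigma)               # 1*4 - 2*2 = 0
try:
    np.linalg.inv(Sigma)
    invertible = True
except np.linalg.LinAlgError:
    invertible = False
print(det, invertible)                   # the MVN density formula breaks down
```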

Question 22 4 points

The prof is on record as critical of calling bias-variance a "trade-off". Which best captures his objection?

Correct answer: B

The decomposition is exact at every $p$. The prof's objection is to the implication that movement on bias requires opposite movement on variance: regularisation can flatten the variance curve, and double descent shows both shrinking past the interpolation point. So "trade-off" is locally accurate near the U-minimum, but misleads as a global rule.

A reverses the prof's actual position: he cites the decomposition as exact when defending it. C invents a hierarchy between bias and variance that he never claims. D is wrong: in simulations where the truth is known you can estimate both terms exactly (Exercise 2.5). The objection is conceptual, not estimation-based.

Atoms: bias-variance-tradeoff, double-descent, regularization. Lecture: L04-statlearn-3.

Question 23 4 points Ex2.5

At a fixed test point $x_0$, $M = 100$ training-set replicates produce predictions $\hat f^{(m)}(x_0)$ for $m = 1, \dots, 100$. The empirical mean of those predictions is $1.80$, the empirical variance is $0.40$, the true value is $f(x_0) = 2.00$, and the noise variance is $\sigma^2 = 4$. Estimate the expected test MSE at $x_0$.

Correct answer: A

$\mathbb E[(y_0 - \hat f(x_0))^2] = \sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}(\hat f) = 4 + (2.00 - 1.80)^2 + 0.40 = 4 + 0.04 + 0.40 = 4.44$.

B drops $\sigma^2$ and reports only the reducible part, $0.04 + 0.40 = 0.44$. C lands at $4.40$ by dropping the squared bias ($4 + 0.40$) or by sign-flipping it ($4.44 - 0.04$) — either mistake loses the $0.04$. D omits the variance term entirely: $4 + 0.04 = 4.04$, the standard "treated $\hat f$ as deterministic" mistake.

Atoms: bias-variance-tradeoff, reducible-vs-irreducible-error.
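
Plugging the replicate summaries in:

```python
# Numbers from the question: mean prediction 1.80, prediction variance 0.40,
# truth f(x0) = 2.00, noise variance sigma^2 = 4.
f_true, mean_hat, var_hat, sigma2 = 2.00, 1.80, 0.40, 4.0

bias2 = (f_true - mean_hat) ** 2     # (0.20)^2 = 0.04
mse = sigma2 + bias2 + var_hat       # 4 + 0.04 + 0.40 = 4.44
print(mse)
```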

Question 24 4 points

You consider KNN classification with $n = 500$ training points in $p$ predictors. Holding $K$ fixed, you observe that test error grows steeply as $p$ goes from 5 to 50. Which mechanism best explains this?

Correct answer: D

The curse of dimensionality: as $p$ grows, the distribution of pairwise distances collapses around its mean, so the "$K$ nearest" set is barely closer than a random sample. KNN's mechanism (locality) breaks down.

A invents a direct $p$-bias relationship that doesn't exist; bias depends on the truth and on $K$, not on $p$ alone. B reverses the standard pattern (training error need not rise with $p$ for KNN). C conflates the curse with the Bayes error rate; the Bayes floor depends on overlap of class densities, not on the metric's behaviour.

Atoms: curse-of-dimensionality, knn-classification.
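
The concentration effect is easy to simulate. A sketch with assumed uniform data ($n = 500$, comparing $p = 5$ to $p = 50$):

```python
import numpy as np

# Relative contrast (d_max - d_min) / d_min of distances from a query point:
# as p grows it collapses, so the "nearest" neighbours stop being special.
rng = np.random.default_rng(5)
n = 500

def relative_contrast(p):
    X = rng.uniform(size=(n, p))     # training points in [0, 1]^p
    x0 = rng.uniform(size=p)         # query point
    d = np.linalg.norm(X - x0, axis=1)
    return (d.max() - d.min()) / d.min()

c5, c50 = relative_contrast(5), relative_contrast(50)
print(c5, c50)    # the contrast at p = 50 is far smaller
```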

Question 25 4 points

Two methods are compared on a regression task. Method $A$ has lower squared bias but higher variance than method $B$ at the same flexibility. Method $B$ is being considered as a regularised variant of $A$. Which conclusion is best supported by the bias-variance decomposition alone?

Correct answer: C

The decomposition is $\sigma^2 + \mathrm{Bias}^2 + \mathrm{Var}$; the noise floor is the same for both methods, so the comparison reduces to bias$^2$ + variance. Whichever sum is smaller wins, and that's an empirical question. The prof's framing of the 2025 lasso question is exactly this: "improved accuracy when the increase in bias is less than the decrease in variance."

A ignores variance — the standard "bias-only" trap from CE1.1d (iii). B asserts a universal regularisation win that is false in general (e.g. lasso lost to OLS in the prof's L27 walkthrough on Boston Housing). D confuses the noise floor with the total error: identical $\sigma^2$ does not imply identical MSE.

Atoms: bias-variance-tradeoff, regularization. Lecture: L27-summary.