
Module 05 — Resampling

27 questions · 100 points · ~40 min


Question 1 3 points

In the three-partition framing of training, validation, and test data, which job belongs to the validation set?

Correct answer: C

Three partitions, three jobs: training fits, validation selects, test assesses. Reusing the test set for selection makes the reported performance "too optimistic", the cardinal sin module 5 is built around.

A names the training set's job. B names the test set's job. D conflates an irreducible-error estimate (a property of the data) with a partition role; no partition is dedicated to estimating $\sigma^2$.

Atom: training-validation-test-split. Lecture: L10-resample-1.

Question 2 3 points

A colleague evaluates ten polynomial regression models with the validation-set approach (single random 50/50 split). They rerun the random split a second time and the curve of validation MSE vs. degree changes substantially, picking a different "best" degree. What is the most direct cause?

Correct answer: A

The headline drawback of the validation-set approach is the high variance of the estimate across splits: the canonical Auto-data slide demo ("10 reruns, 10 different curves"). Different observations land in the validation half each time, and the model is fit on only ~$n/2$ observations, so the validation MSE jumps around.

B inverts the meaning of bias (and "validation cannot detect bias of an unbiased estimator" is incoherent). C confuses random splitting with a violation of independence between observations; random splits are the right thing to do for IID data. D is a CE1-4b trap: the validation-set approach is not the same as 2-fold CV.

Atom: validation-set-approach. Lecture: L10-resample-1.

Question 3 6 points

Mark each statement comparing LOOCV with $k$-fold CV ($k=5$ or $k=10$) as true or false.

  1. True — training on nearly the full sample makes LOOCV essentially unbiased for the model fit on $n$ observations; 5-fold trains on $4n/5$ and so overestimates test error slightly.
  2. False — direction reversed. LOOCV training sets share $n-2$ of $n-1$ points, so per-fold errors are highly correlated and the average has high variance. $k$-fold de-correlates and wins on variance.
  3. True — the shortcut $\text{CV}_{(n)} = \frac{1}{n}\sum ((y_i-\hat y_i)/(1-h_{ii}))^2$ uses only the full-data residuals and the hat-matrix diagonal; no $n$ refits.

Atoms: leave-one-out-cv, k-fold-cv, design-matrix-and-hat-matrix.
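
The shortcut in statement 3 is easy to verify numerically. A minimal numpy sketch (the simulated data and plain OLS fit are illustrative assumptions) comparing brute-force LOOCV against the hat-matrix closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# Brute force: refit OLS n times, each time holding out one observation.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_brute = np.mean(errs)

# Shortcut: one full-data fit, inflate each residual by 1/(1 - h_ii).
H = X @ np.linalg.inv(X.T @ X) @ X.T
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_full
cv_shortcut = np.mean((r / (1 - np.diag(H))) ** 2)

print(np.isclose(cv_brute, cv_shortcut))  # True
```

The agreement is exact up to floating point, but only for OLS; the shortcut does not transfer to other model classes.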

Question 4 3 points

For $k$-fold CV with equal-sized folds and squared-error loss in regression, which expression equals $\text{CV}_{(k)}$?

Correct answer: D

Each fold $j$ contributes the held-out MSE of the model trained on the other $k-1$ folds; CV is the average across folds. With unequal fold sizes one uses the weighted form $\frac{1}{n}\sum_j n_j\,\text{MSE}_j$, which reduces to D when all folds have size $n/k$.

A is the in-sample training MSE, no held-out structure at all. B invents "fold representatives" and averages only $k$ residuals out of $n$; the held-out test in fold $j$ uses every observation in $C_j$. C is the formula for the per-fold sample variance $\widehat{\text{SE}}^2$, which feeds the one-SE rule, not the CV estimate itself.

Atom: k-fold-cv.

Question 5 4 points CE1 P4

You want pseudocode for choosing $K$ in KNN regression by 10-fold CV. Which option correctly describes the procedure?

Correct answer: A

Standard $k$-fold CV pseudocode (CE1 problem 4a): partition once, then for each hyperparameter sweep all $k$ folds, fitting the model on the other $k-1$ each time and evaluating on the held-out fold. CV($K$) is the average; pick the minimizer.

B never holds anything out; the residuals come from a model that already saw fold $j$, so this is just training error in disguise. C confuses CV with bootstrap aggregation: bootstrap samples are not the same as fold partitions, and "median predicted $K$" doesn't define a CV procedure. D is the validation-set approach with a single 10% chunk; that's not CV (no rotation, high split-to-split variance).

Atoms: k-fold-cv, cross-validation.
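
Spelled out as runnable code rather than pseudocode, a sketch with a hand-rolled one-predictor KNN regressor (the simulated data and the $K$ grid are illustrative assumptions):

```python
import numpy as np

def knn_predict(x_tr, y_tr, x_te, K):
    """Mean of the K nearest training responses (1-D predictor, Euclidean distance)."""
    d = np.abs(x_te[:, None] - x_tr[None, :])
    nearest = np.argsort(d, axis=1)[:, :K]
    return y_tr[nearest].mean(axis=1)

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)

k_folds = 10
folds = np.array_split(rng.permutation(n), k_folds)  # partition ONCE, reuse for every K

cv = {}
for K in [1, 3, 5, 10, 25, 50]:
    fold_mse = []
    for j in range(k_folds):
        te = folds[j]
        tr = np.setdiff1d(np.arange(n), te)          # train on the other k-1 folds
        pred = knn_predict(x[tr], y[tr], x[te], K)   # evaluate on the held-out fold
        fold_mse.append(np.mean((y[te] - pred) ** 2))
    cv[K] = np.mean(fold_mse)                        # CV(K) = average over folds

best_K = min(cv, key=cv.get)                         # pick the minimiser
print(best_K, round(cv[best_K], 3))
```

Note the one structural point the distractors all get wrong: the held-out fold never touches the fit that predicts it.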

Question 6 6 points CE1 P4

Mark each statement about resampling schemes as true or false.

  1. False — directions reversed. LOOCV trains on more data → lower bias; LOOCV folds are highly correlated → higher variance. So 5-fold has more bias and less variance than LOOCV.
  2. True — 10 fits versus $n$ fits in general; only OLS gets the hat-matrix shortcut.
  3. False — 2-fold CV swaps the two halves and averages two MSEs; the validation-set approach uses one half to fit and one half to evaluate, with no swap. Different procedures, different statistics.
  4. False — LOOCV partitions without replacement (each obs held out exactly once); bootstrap samples with replacement.

Atoms: leave-one-out-cv, k-fold-cv, validation-set-approach, bootstrap.

Question 7 4 points

For an OLS fit on $n=4$ observations the residuals and hat-matrix diagonals are $r_i = y_i - \hat y_i = (2,\,-1,\,1,\,0)$ and $h_{ii} = (0.5,\,0.5,\,0.5,\,0.5)$. Using the LOOCV hat-matrix shortcut, what is $\text{CV}_{(n)}$?

Correct answer: C

Apply $\text{CV}_{(n)} = \frac{1}{n}\sum_i \big(\frac{r_i}{1-h_{ii}}\big)^2$. With $h_{ii}=0.5$ uniformly, $1 - h_{ii} = 0.5$, so each inflated residual is $r_i / 0.5 = 2 r_i$, giving $(4,-2,2,0)$. Squared: $(16, 4, 4, 0)$, sum $=24$. Divide by $n=4$: $\text{CV}_{(n)} = 6$.

A is the plain training MSE $(4+1+1+0)/4 = 1.5$, encoding "forgot the $1/(1-h_{ii})$ inflation." B is half of the correct answer, encoding "forgot to square the inflation factor" (computed $\frac{1}{n}\sum r_i^2/(1-h_{ii})$ instead of $\frac{1}{n}\sum r_i^2/(1-h_{ii})^2$). D forgets the $1/n$ averaging step ($\sum r_i^2/(1-h_{ii})^2 = 24$ unaveraged).

Atoms: leave-one-out-cv, design-matrix-and-hat-matrix.
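
The arithmetic checks in a few lines of numpy, including distractor A's training MSE:

```python
import numpy as np

r = np.array([2.0, -1.0, 1.0, 0.0])   # full-data residuals from the question
h = np.full(4, 0.5)                   # hat-matrix diagonal
cv_n = np.mean((r / (1 - h)) ** 2)    # inflate, square, average
train_mse = np.mean(r ** 2)           # distractor A: forgot the inflation
print(cv_n, train_mse)  # 6.0 1.5
```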

Question 8 4 points

A 10-fold CV sweep over polynomial degrees gives the table below. The standard error at the minimum is $\widehat{\text{SE}} = 0.20$. Applying the one-standard-error rule, which degree do you choose?

Degree 1: CV = 1.95  |  Degree 2: CV = 1.50  |  Degree 3: CV = 1.45  |  Degree 4: CV = 1.40  |  Degree 5: CV = 1.42  |  Degree 6: CV = 1.55

Correct answer: B

The minimum is degree 4 with CV $=1.40$. The 1-SE bound is $1.40 + 0.20 = 1.60$. Walk in the direction of "simpler" (smaller degree) and stop at the first model that exceeds the bound. Degree 3 ($1.45$) is in; degree 2 ($1.50$) is in; degree 1 ($1.95$) is out. The simplest model still inside the bound is degree 2.

A always picks the simplest model regardless of the CV evidence; the rule requires staying within the SE band. C is just the CV minimum; the 1-SE rule deliberately moves toward simpler. D walks in the wrong direction: the rule biases you toward simpler models, not toward equally flexible neighbours.

Atoms: one-standard-error-rule, k-fold-cv.
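
A sketch of the rule applied to this table (the walk-toward-simpler convention follows the explanation above):

```python
import numpy as np

degrees = np.array([1, 2, 3, 4, 5, 6])
cv = np.array([1.95, 1.50, 1.45, 1.40, 1.42, 1.55])
se_at_min = 0.20

i_min = np.argmin(cv)               # degree 4, CV = 1.40
bound = cv[i_min] + se_at_min       # 1.60
# simplest (lowest-degree) model, no more complex than the minimiser,
# whose CV error stays within one SE of the minimum
ok = (cv <= bound) & (degrees <= degrees[i_min])
choice = degrees[ok].min()
print(int(choice))  # 2
```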

Question 9 3 points

The slide deck flags the standard-error estimate underlying the one-SE rule as "strictly speaking, not quite valid." Why?

Correct answer: C

The SE is computed on the very fold-MSEs that were minimized to choose $\hat\theta$, so it isn't an independent uncertainty estimate. The prof leaves it as a thought question on the slide; the answer is the same selection-bias logic that motivates nested CV.

A is irrelevant: the sample variance with $n-1$ in the denominator is unbiased under standard conditions, and that is not the slide's concern anyway. B describes a possible issue for confidence-interval shape, but the prof's footnote is about reuse of data, not normality. D is wrong: no-replacement partitioning is exactly what CV does, and the SE formula does not require i.i.d. fold errors.

Atoms: one-standard-error-rule, k-fold-cv. Lecture: L10-resample-1.

Question 10 3 points

In lecture the prof said: "I almost always in almost everything I do, I use cross-validation of some sort … because assumptions are always wrong, right? They're just so wrong." Which best summarises why he prefers CV to AIC, BIC, and Mallows' $C_p$?

Correct answer: B

The prof's verbatim defence: "your assumptions have to be right. And they're not. They're not typically right." Information criteria assume a correct distributional family, a correctly specified model, and i.i.d. samples; resampling makes none of these assumptions and is robust to violations.

A overstates: CV has its own biases (slight upward bias for $k$-fold; see Q3/Q6), and AIC/BIC are not "biased estimators" in any simple sense; they target different quantities. C invents a coverage limitation that does not exist: AIC/BIC are usable for many model classes, including linear regression. D is reversed: AIC, BIC, and $C_p$ in fact break down when $p>n$ (no $\hat\sigma^2$), while CV remains usable there.

Atoms: aic-bic-conceptual, cross-validation. Lecture: L10-resample-1.

Question 11 3 points

Among the conceptual claims about AIC and BIC that are still in scope (formulas are not), which one is correct?

Correct answer: D

The standard one-line comparison: BIC's complexity penalty grows with $\log n$, AIC's does not. So as $n$ grows, BIC penalises additional parameters more harshly and selects smaller models.

A invents a sample-size threshold and inverts the direction; AIC's penalty is $2p$ regardless of $n$ and is always less aggressive than BIC's $\log(n) p$ for $n \ge 8$. B is wrong, the penalties differ in functional form (a $\log n$ factor on $p$), not just a leading constant. C inverts the construction: both criteria start from the training error and add a penalty to estimate test error.

Atom: aic-bic-conceptual.
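
A quick numerical check of the per-parameter penalties (AIC adds $2$ per parameter, BIC adds $\log n$), confirming the crossover at $n = 8$:

```python
import math

for n in [5, 7, 8, 100, 10_000]:
    print(n, round(math.log(n), 3), math.log(n) > 2)  # BIC overtakes AIC's 2 from n = 8 on
```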

Question 12 4 points Ex5.3

A genomics paper reports the following pipeline on $n = 50$ samples and $p = 5{,}000$ predictors with random 0/1 labels (so the true Bayes error is 50%): (i) compute the correlation between every predictor and $y$, keep the top $d=25$; (ii) run 10-fold CV on a logistic regression using only those 25 predictors. The reported CV misclassification error is ~0%. What does this 0% tell you?

Correct answer: B

Selection bias / "wrong-way CV". Step (i) uses $y$ on all observations, so by the time the CV loop in step (ii) runs, the chosen 25 predictors already encode information about every observation, including those that are about to be "held out." The fix is to redo the correlation filter inside each training fold on training data only; the right-way CV recovers the true ~50% misclassification.

A misses the entire trap; this is the canonical "lying with statistics" example. C is a CV-mechanics red herring: LOOCV with the same outside-the-loop filter is just as compromised. D invents a high-dimensional fix; switching the classifier doesn't undo the data leak in the selection step.

Atom: nested-cv-and-cv-pitfalls. Lecture: L11-resample-2.

Question 13 4 points

Continuing Q12: which pseudocode for "filter then classify" gives an honest CV estimate of generalisation error?

Correct answer: B

The discipline: anything that uses $y$ counts as training and must live inside the CV loop. Refit the filter on each training fold using only training data, then evaluate on the held-out fold with the predictors picked from training. The $d$ chosen predictors can change across folds — that's expected.

A is the wrong-way pipeline from Q12. C reports training error and never holds anything out. D inverts the leakage direction: now the held-out fold drives the selection, which is even worse and breaks the whole point of holding it out.

Atoms: nested-cv-and-cv-pitfalls, cross-validation.
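
The trap reproduces readily in simulation. A sketch substituting a nearest-centroid classifier for logistic regression (an assumption made purely to keep the code dependency-free); with random labels, the honest error should sit near 50% while the leaky pipeline reports far less:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d, k_folds = 50, 5000, 25, 10
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n)            # random labels: true error rate is 50%

def top_d(X, y, d):
    """Indices of the d predictors most correlated (in absolute value) with y."""
    yc = y - y.mean()
    score = (X - X.mean(0)).T @ yc / (X.std(0) * n)   # constants don't change the ranking
    return np.argsort(-np.abs(score))[:d]

def centroid_predict(X_tr, y_tr, X_te):
    mu0, mu1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    return (((X_te - mu1) ** 2).sum(1) < ((X_te - mu0) ** 2).sum(1)).astype(int)

folds = np.array_split(rng.permutation(n), k_folds)

def cv_error(filter_inside_fold):
    sel_all = top_d(X, y, d)                          # leaky: filter saw every y
    errs = []
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        sel = top_d(X[tr], y[tr], d) if filter_inside_fold else sel_all
        pred = centroid_predict(X[tr][:, sel], y[tr], X[te][:, sel])
        errs.append(np.mean(pred != y[te]))
    return np.mean(errs)

wrong_way = cv_error(filter_inside_fold=False)
right_way = cv_error(filter_inside_fold=True)
print(round(wrong_way, 2), round(right_way, 2))  # leaky error far below the honest ~50%
```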

Question 14 4 points

You apply LOOCV to a regression model on hourly weather measurements from a single sensor. Adjacent hours are strongly correlated. Compared to its behaviour on i.i.d. data, what happens to the LOOCV estimate?

Correct answer: C

The prof's warning: "two points right next to each other, one in your training, one in your validation, it's the same damn thing." The held-out hour is well-predicted by its left and right neighbours that are still in training, so the fold error doesn't reflect generalisation to a genuinely new region of the predictor space. CV therefore picks the most complex model possible. The fix is to chunk by the dependency dimension, e.g. block by day or week, so adjacent folds are temporally separated.

A confuses "uses $n-1$ points" with independence; bias and dependence are unrelated dimensions of the problem, and the bias-vs-dependence pairing here is incoherent. B is the canonical wrong intuition the prof flagged. D claims LOOCV is unbiased here and recommends a final-20% hold-out as a fix; both halves are wrong: LOOCV is biased downward by leakage (not unbiased), and a chronological hold-out without blocking the training side still suffers the same near-neighbour leakage at the seam.

Atoms: leave-one-out-cv, cross-validation.

Question 15 4 points ISLP §5 Q2

Drawing a bootstrap sample of size $n$ from $n$ observations (with replacement). For large $n$, what is the approximate probability that any given original observation $i$ appears at least once in the bootstrap sample?

Correct answer: C

$P(\text{not picked on one draw}) = 1 - 1/n$, raised to $n$ independent draws. As $n\to\infty$, $(1-1/n)^n \to 1/e \approx 0.368$, so $P(\text{in bootstrap}) = 1 - 1/e \approx 0.632$.

A is the OOB fraction (the complement, ~37% are not in any given bootstrap, used in OOB error). B is the "I split it 50/50" misintuition that ignores with-replacement sampling. D would correspond to sampling without replacement at full size, where every observation appears exactly once.

Atoms: bootstrap, out-of-bag-error.
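
Both the simulation and the limit fit in one short script (the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 2000
hits = 0
for _ in range(B):
    sample = rng.integers(0, n, n)    # bootstrap: n draws with replacement
    hits += 0 in sample               # did observation 0 make it in?
p_in = hits / B

print(round(p_in, 2), round(1 - (1 - 1/n) ** n, 3))  # both near 1 - 1/e ≈ 0.632
```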

Question 16 4 points

Mark each statement about the bootstrap as true or false.

  1. True — same size $n$ keeps the variance scale right; with-replacement is what makes successive samples genuinely different (without-replacement at full size just permutes the data).
  2. False — the bootstrap quantifies variability of $\hat\theta$, not its bias relative to the unknown true $\theta$. The bootstrap distribution is centred near $\hat\theta$, not near $\theta$.

Atom: bootstrap.

Question 17 4 points Ex5.5

Which procedure correctly estimates the bootstrap standard error of an OLS coefficient $\hat\beta_j$ when you do not want to rely on the closed-form $\sigma^2(X^\top X)^{-1}$ formula?

Correct answer: D

Standard bootstrap recipe (Exercise 5.5 / 5.6): resample rows with replacement at size $n$, refit, collect the resulting $\hat\beta_j^{*b}$ values, and use their sample SD as the bootstrap SE estimate. Use $B$ in the 1{,}000–10{,}000 range.

A uses subsets of size $n/2$ without replacement; that is subsampling, not bootstrapping, and the SE has a different scale. B is permutation testing of the design matrix and breaks the relationship between $X$ and $y$, which is not what we want for a coefficient SE. C misses the point entirely: a single bootstrap sample tells you nothing; the estimate comes from the spread across $B$ resamples.

Atom: bootstrap.
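
A sketch of the recipe on simulated data (the data-generating model and $B = 2000$ are assumptions; the closed-form comparison is included only because this simulation happens to satisfy its Gaussian assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)                          # resample rows, size n, with replacement
    boot[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0][1]  # refit, collect the slope
se_boot = boot.std(ddof=1)                               # sample SD across resamples

# closed-form sigma^2 (X'X)^{-1} comparison, valid here by construction
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2 = resid @ resid / (n - 2)
se_formula = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
print(round(se_boot, 3), round(se_formula, 3))  # the two should be close
```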

Question 18 4 points

$B = 1000$ bootstrap replicates of an estimator give the following ordered quantiles: $\hat\theta^*_{(25)} = 0.18$, $\hat\theta^*_{(50)} = 0.20$, $\hat\theta^*_{(500)} = 0.30$, $\hat\theta^*_{(950)} = 0.42$, $\hat\theta^*_{(975)} = 0.45$. Using the percentile method, what is a 95% bootstrap CI for $\theta$?

Correct answer: B

The percentile method takes the 2.5% and 97.5% quantiles of the bootstrap distribution. With $B = 1000$, those are positions 25 and 975, giving $[0.18,\,0.45]$.

A uses the median (position 50) and the 95% upper, mixing two different thresholds. C uses the right tail at 95% (position 950) instead of 97.5%, so the CI is too narrow on the upper end. D names the normal-approximation CI rather than the percentile CI; it could give a different answer, and it is not what "percentile method" denotes.

Atom: bootstrap.
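
The same order-statistic indexing on actual bootstrap replicates (simulated skewed data and a bootstrap of the mean; all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=80)   # skewed data, where percentile CIs earn their keep
B = 1000
boot_means = np.sort([rng.choice(data, size=data.size, replace=True).mean()
                      for _ in range(B)])
lo, hi = boot_means[24], boot_means[974]     # positions 25 and 975, as in the question
print(round(lo, 2), round(hi, 2))
```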

Question 19 4 points CE1 P4

You have fit a logistic regression of $\texttt{chd}$ on $\texttt{sbp}$ and $\texttt{sex}$, and want a 95% CI for the predicted probability $\hat P(\texttt{chd}=1\mid \texttt{sbp}=140, \texttt{sex}=\text{male})$. Why is the bootstrap a natural tool here?

Correct answer: D

Predicted probabilities are derived nonlinear quantities — $\sigma(\hat\beta_0 + \hat\beta_1 \cdot 140 + \hat\beta_2 \cdot 1)$. There is no clean closed-form SE, so resample $(X,y)$ rows with replacement, refit, recompute $\hat P$ at the same predictor values each time, take the SD across resamples for the SE and the 2.5%/97.5% quantiles for a percentile CI.

A fabricates a normality claim about predicted probabilities. B confuses confidence intervals with training-error bands. C is wrong — bootstrap CIs and Wald CIs agree only when the underlying assumptions hold; the whole reason to bootstrap is that they often don't.

Atom: bootstrap.

Question 20 3 points

For $n$ large, what fraction of the original observations are out-of-bag for any given bootstrap sample (i.e. not drawn into it at all)?

Correct answer: B

$P(\text{not in bootstrap}) = (1-1/n)^n \to 1/e \approx 0.368$. Roughly 37% of observations are OOB for any given tree, which becomes a free per-tree validation set in bagging and random forests.

A is the validation-set-approach intuition leaking in. C is the in-bag fraction, the complement. D inverts the limit; with-replacement sampling cannot draw every distinct observation in $n$ draws.

Atoms: out-of-bag-error, bootstrap.

Question 21 4 points

Mark each statement about bagging as true or false.

  1. False — variance of the bagged predictor is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The first term is a floor that doesn't shrink with $B$; correlated bootstrap samples make $\rho > 0$. Random forests address the floor by decorrelating trees with random predictor subsets.
  2. True — averaging only helps when the base learners are themselves noisy. A single deep tree is high-variance and benefits from averaging; OLS is already low variance, so bagging adds little.

Atoms: bagging, out-of-bag-error.
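
Tabulating the variance identity with illustrative values $\rho = 0.5$, $\sigma^2 = 1$ makes the floor visible:

```python
rho, sigma2 = 0.5, 1.0
for B in [1, 10, 100, 10_000]:
    var_bagged = rho * sigma2 + (1 - rho) / B * sigma2
    print(B, round(var_bagged, 5))  # tends to the floor rho * sigma2 = 0.5, never below
```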

Question 22 3 points

Which statement most accurately distinguishes bagging from boosting?

Correct answer: C

Bagging = parallel + bootstrap + average; the goal is variance reduction. Boosting = sequential + residual-fitting + weighted ensemble; the goal is bias reduction. Different mechanisms, different tuning behaviour ($B$ doesn't overfit in bagging but does in boosting).

A is wrong on the parallelism of boosting: boosting cannot be fit in parallel because each tree depends on the residuals of the ensemble so far, and its weighting scheme is not a learned weight on bootstrap predictions but a sequential gradient-descent step on the loss. B asserts that bagging and boosting are equivalent in expectation, which they are not: one targets variance reduction at fixed bias, the other targets bias reduction at fixed (or even higher) variance. D invents a single-tree/multi-tree distinction that does not exist (both fit many trees).

Atoms: bagging, boosting.

Question 23 3 points

You're choosing the size of a regression model on a non-Gaussian dataset with mild collinearity. AIC suggests one model size, 10-fold CV suggests another. What does the prof's framing recommend?

Correct answer: D

The prof's running default: CV is robust to assumption violations, AIC/BIC/$C_p$ aren't. With non-Gaussian data, AIC's basis collapses; CV still does the right job because it actually evaluates predictions on held-out folds.

A treats two non-equivalent criteria as interchangeable and invents a "pick the smaller one" rule the prof never endorsed. B replaces both selection criteria with significance testing, which is a different problem entirely (significance answers "is $\beta_j \ne 0$", not "what model size minimises test error"). C fabricates a property of AIC ("mathematically optimal") that holds only under exact assumptions, which the scenario violates.

Atoms: cross-validation, aic-bic-conceptual.

Question 24 4 points Exam 2025 P6b

A 10-fold CV-vs-$\log\lambda$ plot for a lasso regression has its minimum at $\log\lambda^* \approx -3$ with CV-MSE = 50.8. The unregularised OLS baseline (which corresponds to $\lambda \to 0$, i.e. far left of the plot) has test MSE = 50.78. Following the prof's interpretation in the L27 walkthrough, what do you conclude?

Correct answer: A

The prof's bias-variance tie-in: regularisation trades off bias up for variance down. If after CV-tuning the regularised model is no better than the unregularised one, the variance reduction wasn't enough to offset the bias gain. Keep all parameters and don't add a regulariser.

B inverts the comparison; the lasso's MSE is slightly worse here. C dismisses a real, interpretable result as a bug. D rejects CV outright; CV is exactly how we know the lasso isn't helping, and the prof's whole argument depends on trusting the CV curve.

Atoms: cross-validation, regularization, bias-variance-tradeoff. Lecture: L27-summary.

Question 25 3 points

Why is the validation-set approach not the same procedure as 2-fold cross-validation, even though both split the data in roughly two?

Correct answer: A

2-fold CV is symmetric: fold 1 trains, fold 2 validates, then swap, then average the two MSEs. The validation-set approach commits to one direction and never swaps, so its variance behaviour and bias are both worse than 2-fold CV.

B confuses CV with bootstrap; both 2-fold CV and the validation-set approach are partition-based, without replacement. C is wrong: 2-fold and LOOCV are at opposite ends of the $k$-fold spectrum. D invents a "full-data fit" step that neither procedure includes.

Atoms: validation-set-approach, k-fold-cv.

Question 26 3 points

Order the three CV schemes — LOOCV, 10-fold CV, validation-set approach — by their variance as estimators of test error (lowest variance first), assuming i.i.d. data and the same model.

Correct answer: D

10-fold CV has lower variance than LOOCV because its training sets share less data (folds are less correlated). LOOCV's $n$ training sets share $n-2$ of $n-1$ points → highly correlated fold errors → high-variance average. Validation-set is the worst: a single split is wildly variable across reruns.

A inverts the LOOCV / 10-fold direction (the canonical CE1-4b trap). B inverts the LOOCV / validation-set direction; LOOCV is variable, but the validation-set approach is even worse. C ignores the bias-variance discussion that motivates picking $k = 5$ or $10$.

Atoms: leave-one-out-cv, k-fold-cv, validation-set-approach.

Question 27 3 points

You have $n=1000$ observations and want both to select a hyperparameter and to report an honest test error. Which pipeline is consistent with the prof's framing?

Correct answer: A

Three jobs, three partitions. CV does the validation/selection job inside the 800-point training pool; the held-out 200 is touched only once at the end for assessment. (Nested CV is the equivalent answer when you can't afford a fixed test set.)

B reuses the same CV error for both selection and assessment, biasing the assessment downward. C is the same sin made worse: peeking at the test result and retuning is exactly the data-reuse pitfall the prof flagged ("don't make the dumb mistakes — it's embarrassing"). D tunes directly on the test set, which makes the test set part of training.

Atoms: training-validation-test-split, nested-cv-and-cv-pitfalls, cross-validation.