Module 05 — Resampling
27 questions · 100 points · ~40 min
In the three-partition framing of training, validation, and test data, which job belongs to the validation set?
- A Fit the parameters of each candidate model.
- B Report the final, headline performance number.
- C Choose between candidate models or hyperparameters.
- D Estimate the irreducible noise variance $\sigma^2$.
Show answer
Correct answer: C
Three partitions, three jobs: training fits, validation selects, test assesses. Reusing the test set for selection makes the reported performance "too optimistic", the cardinal sin module 5 is built around.
A names the training set's job. B names the test set's job. D conflates an irreducible-error estimate (a property of the data) with a partition role; no partition is dedicated to estimating $\sigma^2$.
Atom: training-validation-test-split. Lecture: L10-resample-1.
A colleague evaluates ten polynomial regression models with the validation-set approach (single random 50/50 split). They rerun the random split a second time and the curve of validation MSE vs. degree changes substantially, picking a different "best" degree. What is the most direct cause?
- A The validation MSE has high variance across random splits because each split fits on only half the data.
- B Polynomial regression is biased; the validation set cannot detect the bias of an unbiased estimator.
- C The independence assumption is violated whenever data is split randomly.
- D Validation-set MSE is mathematically equal to LOOCV, so the change is numerical noise.
Show answer
Correct answer: A
The headline drawback of the validation-set approach is the high variance of the estimate across splits, the canonical "10 reruns, 10 different curves" Auto-data slide demo. Different observations land in the validation half each time, and the model is fit on only ~$n/2$ observations, so the validation MSE jumps around.
B inverts the meaning of bias (and "validation cannot detect bias of an unbiased estimator" is incoherent). C confuses random splitting with a violation of independence between observations; random splits are the right thing to do for IID data. D is a CE1-4b trap: the validation-set approach is not the same as 2-fold CV.
Atom: validation-set-approach. Lecture: L10-resample-1.
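To see the split-to-split variance for yourself, here is a minimal sketch on synthetic data (not the Auto dataset from the slides), using only numpy:

```python
# Rerunning a single random 50/50 split a few times on synthetic data (not the
# Auto dataset from the slides): the validation-MSE-vs-degree curve, and the
# "best" degree it picks, move around from rerun to rerun.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=2.0, size=n)   # true signal is degree 2

def val_mse_by_degree(x, y, degrees, rng):
    idx = rng.permutation(len(x))
    train, val = idx[: len(x) // 2], idx[len(x) // 2 :]           # one random 50/50 split
    mses = []
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], deg=d)             # fit on one half
        pred = np.polyval(coefs, x[val])                          # evaluate on the other half
        mses.append(np.mean((y[val] - pred) ** 2))
    return mses

degrees = range(1, 7)
for rerun in range(3):
    mses = val_mse_by_degree(x, y, degrees, rng)
    best = degrees[int(np.argmin(mses))]                          # the chosen degree jumps around
    print(f"rerun {rerun}: best degree = {best}, MSEs = {np.round(mses, 2)}")
```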
Mark each statement comparing LOOCV with $k$-fold CV ($k=5$ or $k=10$) as true or false.
Show answer
- True — training on nearly the full sample makes LOOCV essentially unbiased for the model fit on $n$ observations; 5-fold trains on $4n/5$ and so overestimates test error slightly.
- False — direction reversed. LOOCV training sets share $n-2$ of $n-1$ points, so per-fold errors are highly correlated and the average has high variance. $k$-fold de-correlates and wins on variance.
- True — the shortcut $\text{CV}_{(n)} = \frac{1}{n}\sum ((y_i-\hat y_i)/(1-h_{ii}))^2$ uses only the full-data residuals and the hat-matrix diagonal; no $n$ refits.
Atoms: leave-one-out-cv, k-fold-cv, design-matrix-and-hat-matrix.
For $k$-fold CV with equal-sized folds and squared-error loss in regression, which expression equals $\text{CV}_{(k)}$?
- A $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$ where $\hat y_i$ is the prediction from a single full-data fit averaged across folds.
- B $\frac{1}{k}\sum_{j=1}^{k}(y_j-\hat y_j)^2$, evaluated only at the $k$ "fold representatives" (one observation per fold).
- C $\frac{1}{k-1}\sum_{j=1}^{k}(\text{MSE}_j-\overline{\text{MSE}})^2$, the between-fold sample variance of the held-out MSEs.
- D $\frac{1}{k}\sum_{j=1}^{k}\text{MSE}_j$, where $\text{MSE}_j = \frac{1}{n_j}\sum_{i\in C_j}(y_i-\hat y_i^{(-j)})^2$.
Show answer
Correct answer: D
Each fold $j$ contributes the held-out MSE of the model trained on the other $k-1$ folds; CV is the average across folds. With unequal fold sizes the weighted form $\frac{1}{n}\sum_j n_j\,\text{MSE}_j$ applies; it reduces to D when all folds have the same size.
A is the in-sample training MSE, no held-out structure at all. B invents "fold representatives" and averages only $k$ residuals out of $n$; the held-out test in fold $j$ uses every observation in $C_j$. C is the formula for the per-fold sample variance $\widehat{\text{SE}}^2$, which feeds the one-SE rule, not the CV estimate itself.
Atom: k-fold-cv.
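For concreteness, option D written out as code; a sketch on synthetic data, assuming scikit-learn's `KFold` and `LinearRegression`:

```python
# Option D written out: per-fold held-out MSEs averaged into CV_(k).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

k = 5
fold_mses = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on the other k-1 folds
    resid = y[test_idx] - model.predict(X[test_idx])            # residuals on the held-out fold
    fold_mses.append(np.mean(resid ** 2))                       # MSE_j

cv_k = np.mean(fold_mses)                                       # CV_(k) = (1/k) * sum_j MSE_j
print(round(cv_k, 3))
```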
Question 5
4 points
CE1 P4
You want pseudocode for choosing $K$ in KNN regression by 10-fold CV. Which option correctly describes the procedure?
- A Randomly partition the training observations into 10 folds. For each candidate $K$ and each fold $j$: fit KNN-$K$ on the other 9 folds, predict on fold $j$, compute $\text{MSE}_j$. Average the 10 fold-MSEs to get $\text{CV}(K)$, then choose $K^*=\arg\min_K \text{CV}(K)$.
- B For each candidate $K$, fit KNN-$K$ on the full training data once and store the predictions; then for each fold $j$ compute the residuals on fold $j$ from that single full-data model and sum the squared residuals across folds to get $\text{CV}(K)$, then choose the $K$ that minimises the sum.
- C Randomly partition the training observations into 10 folds; for each candidate $K$, fit KNN-$K$ ten times on different bootstrap samples drawn with replacement at size $n$ from the union of the 10 folds, then take the median predicted $K$ across the runs as the cross-validated choice.
- D Hold out a single random 10% validation chunk once at the start; for each candidate $K$ fit KNN-$K$ on the remaining 90% and pick the $K$ whose validation MSE is smallest, treating this single split as a Monte-Carlo approximation to 10-fold CV.
Show answer
Correct answer: A
Standard $k$-fold CV pseudocode (CE1 problem 4a): partition once, then for each hyperparameter sweep all $k$ folds, fitting the model on the other $k-1$ each time and evaluating on the held-out fold. CV($K$) is the average; pick the minimizer.
B never holds anything out: the residuals come from a model that already saw fold $j$, so this is just training error in disguise. C confuses CV with bootstrap aggregation; bootstrap samples are not the same as fold partitions, and "median predicted $K$" doesn't define a CV procedure. D is the validation-set approach with a single 10% chunk; that's not CV (no rotation, high split-to-split variance).
Atoms: k-fold-cv, cross-validation.
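Option A made executable; a sketch assuming scikit-learn's `KNeighborsRegressor` and `KFold`, with synthetic data standing in for the training set:

```python
# 10-fold CV sweep over candidate K for KNN regression (option A).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)           # partition once
candidate_K = [1, 3, 5, 10, 20, 40]
cv_mse = {}
for K in candidate_K:
    fold_mses = []
    for train_idx, val_idx in kf.split(X):
        knn = KNeighborsRegressor(n_neighbors=K).fit(X[train_idx], y[train_idx])  # fit on 9 folds
        fold_mses.append(np.mean((y[val_idx] - knn.predict(X[val_idx])) ** 2))    # MSE_j on fold j
    cv_mse[K] = np.mean(fold_mses)                              # CV(K) = average of the 10 fold-MSEs

K_star = min(cv_mse, key=cv_mse.get)                            # K* = argmin_K CV(K)
print(K_star, round(cv_mse[K_star], 3))
```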
Question 6
6 points
CE1 P4
Mark each statement about resampling schemes as true or false.
Show answer
- False — directions reversed. LOOCV trains on more data → lower bias; LOOCV folds are highly correlated → higher variance. So 5-fold has more bias and less variance than LOOCV.
- True — 10 fits versus $n$ fits in general; only OLS gets the hat-matrix shortcut.
- False — 2-fold CV swaps the two halves and averages two MSEs; the validation-set approach uses one half to fit and one half to evaluate, with no swap. Different procedures, different statistics.
- False — LOOCV partitions without replacement (each obs held out exactly once); bootstrap samples with replacement.
Atoms: leave-one-out-cv, k-fold-cv, validation-set-approach, bootstrap.
For an OLS fit on $n=4$ observations the residuals and hat-matrix diagonals are
$r_i = y_i - \hat y_i = (2,\,-1,\,1,\,0)$ and $h_{ii} = (0.5,\,0.5,\,0.5,\,0.5)$.
Using the LOOCV hat-matrix shortcut, what is $\text{CV}_{(n)}$?
- A $1.5$
- B $3.0$
- C $6.0$
- D $24.0$
Show answer
Correct answer: C
Apply $\text{CV}_{(n)} = \frac{1}{n}\sum_i \big(\frac{r_i}{1-h_{ii}}\big)^2$. With $h_{ii}=0.5$ uniformly, $1 - h_{ii} = 0.5$, so each inflated residual is $r_i / 0.5 = 2 r_i$, giving $(4,-2,2,0)$. Squared: $(16, 4, 4, 0)$, sum $=24$. Divide by $n=4$: $\text{CV}_{(n)} = 6$.
A is the plain training MSE $(4+1+1+0)/4 = 1.5$, encoding "forgot the $1/(1-h_{ii})$ inflation." B is half of the correct answer, encoding "forgot to square the inflation factor" (computed $\frac{1}{n}\sum r_i^2/(1-h_{ii})$ instead of $\frac{1}{n}\sum r_i^2/(1-h_{ii})^2$). D forgets the $1/n$ averaging step ($\sum r_i^2/(1-h_{ii})^2 = 24$ unaveraged).
Atoms: leave-one-out-cv, design-matrix-and-hat-matrix.
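The same arithmetic, checked numerically; no refits needed:

```python
# LOOCV hat-matrix shortcut: CV_(n) = (1/n) * sum_i (r_i / (1 - h_ii))^2.
import numpy as np

r = np.array([2.0, -1.0, 1.0, 0.0])       # full-data residuals
h = np.array([0.5, 0.5, 0.5, 0.5])        # hat-matrix diagonal

print(np.mean((r / (1 - h)) ** 2))        # 6.0  = CV_(n), the correct answer
print(np.mean(r ** 2))                    # 1.5  plain training MSE (option A)
print(np.mean(r ** 2 / (1 - h)))          # 3.0  forgot to square the inflation (option B)
print(np.sum((r / (1 - h)) ** 2))         # 24.0 forgot the 1/n (option D)
```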
A 10-fold CV sweep over polynomial degrees gives the table below. The standard error at the minimum is $\widehat{\text{SE}} = 0.20$. Applying the one-standard-error rule, which degree do you choose?
| Degree | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| CV | 1.95 | 1.50 | 1.45 | 1.40 | 1.42 | 1.55 |
- A Degree 1, since the one-SE rule defaults to the simplest candidate model whenever the SE is non-negligible.
- B Degree 2, the simplest model whose CV error is within $0.20$ of the minimum.
- C Degree 4, the polynomial with the lowest CV error among the six candidates.
- D Degree 5, since the CV curve is flat between degrees 4 and 5 and the one-SE rule prefers the more flexible neighbour.
Show answer
Correct answer: B
The minimum is degree 4 with CV $=1.40$. The 1-SE bound is $1.40 + 0.20 = 1.60$. Walk toward simpler models (smaller degree) for as long as the CV error stays within the bound: degree 3 ($1.45$) is in, degree 2 ($1.50$) is in, degree 1 ($1.95$) is out. The simplest model still inside the bound is degree 2.
A always picks the simplest model regardless of the CV evidence; the rule requires staying within the SE band. C is just the CV minimum; the 1-SE rule deliberately moves toward simpler. D walks the wrong direction, the rule biases you toward simpler, not toward equally-flexible neighbors.
Atoms: one-standard-error-rule, k-fold-cv.
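The rule applied to the table above, as a small numpy sketch:

```python
# One-SE rule: minimum CV plus one SE defines a band; pick the simplest model inside it.
import numpy as np

degrees = np.array([1, 2, 3, 4, 5, 6])
cv = np.array([1.95, 1.50, 1.45, 1.40, 1.42, 1.55])
se_at_min = 0.20

best = np.argmin(cv)                      # degree 4, CV = 1.40
bound = cv[best] + se_at_min              # 1.40 + 0.20 = 1.60
within = degrees[cv <= bound]             # every degree whose CV is inside the band
choice = within.min()                     # simplest such model: degree 2
print(round(bound, 2), within, choice)
```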
The slide deck flags the standard-error estimate underlying the one-SE rule as "strictly speaking, not quite valid." Why?
- A Sample standard deviations are biased downward in finite samples, so the SE band is systematically too tight.
- B The per-fold MSEs are not normally distributed, so the implicit Gaussian SE band has the wrong tail probabilities.
- C The held-out folds used to compute the SE are also used to pick the optimal hyperparameter, so the SE is not an independent estimate.
- D The folds are sampled without replacement, which violates the i.i.d. requirement of the SE formula.
Show answer
Correct answer: C
The SE is computed on the very fold-MSEs that were minimized to choose $\hat\theta$, so it isn't an independent uncertainty estimate. The prof leaves it as a thought question on the slide; the answer is the same selection-bias logic that motivates nested CV.
A is irrelevant: the sample variance with $n-1$ in the denominator is unbiased under standard conditions, and the small finite-sample bias of the sample SD is not the slide's concern. B describes a possible issue for confidence-interval shape, but the prof's footnote is about reuse of data, not normality. D is wrong: partitioning without replacement is exactly how CV constructs folds and is not what compromises the SE.
Atoms: one-standard-error-rule, k-fold-cv. Lecture: L10-resample-1.
In lecture the prof said: "I almost always in almost everything I do, I use cross-validation of some sort … because assumptions are always wrong, right? They're just so wrong." Which best summarises why he prefers CV to AIC, BIC, and Mallows' $C_p$?
- A CV is provably unbiased for the test error at any sample size, whereas AIC, BIC, and $C_p$ are biased estimators of test error.
- B AIC, BIC, and $C_p$ depend on distributional and model-specification assumptions that often fail in real data, while CV requires fewer such assumptions.
- C AIC, BIC, and $C_p$ apply only to logistic regression and tree-based methods, whereas CV is the only criterion that works for OLS regression.
- D AIC, BIC, and $C_p$ are exact only in the high-dimensional regime $p > n$, where CV is computationally infeasible.
Show answer
Correct answer: B
The prof's verbatim defence: "your assumptions have to be right. And they're not. They're not typically right." Information criteria assume a correct distributional family, a correctly specified model, and i.i.d. samples; resampling makes none of these assumptions and is robust to violations.
A overstates: CV has its own biases (slight upward bias for $k$-fold, see Q3/Q6), and AIC/BIC are not "biased estimators" in any simple sense; they target different quantities. C invents a coverage limitation that does not exist; AIC/BIC are usable for many model classes including linear regression. D is reversed: AIC/BIC and $C_p$ in fact break down when $p>n$ (no $\hat\sigma^2$), while CV remains usable there.
Atoms: aic-bic-conceptual, cross-validation. Lecture: L10-resample-1.
Among the conceptual claims about AIC and BIC that are still in scope (formulas are not), which one is correct?
- A AIC and BIC are equivalent for $n \le 7$ and diverge only above that threshold, where AIC becomes the more aggressive penaliser.
- B Both penalise complexity by the same amount as a function of $p$; the only difference is the leading constant in front of the log-likelihood term.
- C Both criteria penalise the test error directly without ever using the training residuals, which is what distinguishes them from CV.
- D BIC penalises model complexity more aggressively than AIC, so it tends to favour smaller models.
Show answer
Correct answer: D
The standard one-line comparison: BIC's complexity penalty grows with $\log n$, AIC's does not. So as $n$ grows, BIC penalises additional parameters more harshly and selects smaller models.
A invents a sample-size threshold and inverts the direction; AIC's penalty is $2p$ regardless of $n$ and is always less aggressive than BIC's $\log(n) p$ for $n \ge 8$. B is wrong, the penalties differ in functional form (a $\log n$ factor on $p$), not just a leading constant. C inverts the construction: both criteria start from the training error and add a penalty to estimate test error.
Atom: aic-bic-conceptual.
Question 12
4 points
Ex5.3
A genomics paper reports the following pipeline on $n = 50$ samples and $p = 5{,}000$ predictors with random 0/1 labels (so the true Bayes error is 50%):
(i) compute the correlation between every predictor and $y$, keep the top $d=25$;
(ii) run 10-fold CV on a logistic regression using only those 25 predictors.
The reported CV misclassification error is ~0%. What does this 0% tell you?
- A The pipeline is correct; the high-dimensional logistic fit on the screened top-25 has discovered a sparse signal that simpler methods miss.
- B The CV estimate is biased downward because the correlation filter used $y$ on the full data; held-out folds were already inside the selection step, so they aren't truly held out.
- C 10-fold CV is too few folds; LOOCV with the same outside-the-loop screening pipeline would give a more honest estimate of the test error.
- D The 0% error is real but an artefact of $p \gg n$ where logistic interpolates the labels; switching to ridge would fix the optimism.
Show answer
Correct answer: B
Selection bias / "wrong-way CV". Step (i) uses $y$ on all observations, so by the time the CV loop in step (ii) runs, the chosen 25 predictors already encode information about every observation, including those that are about to be "held out." The fix is to redo the correlation filter inside each training fold on training data only; the right-way CV recovers the true ~50% misclassification.
A misses the entire trap; this is the canonical "lying with statistics" example. C is a CV-mechanics red herring; LOOCV with the same outside-the-loop filter is just as compromised. D invents a high-dimensional fix; switching the classifier doesn't undo the data leak in the selection step.
Atom: nested-cv-and-cv-pitfalls. Lecture: L11-resample-2.
Continuing Q12: which pseudocode for "filter then classify" gives an honest CV estimate of generalisation error?
- A Compute correlations on all $n$ observations once and keep the top $d$; then for $j = 1,\dots,k$ train logistic on the training folds restricted to those $d$ columns, predict on $C_j$, and average held-out errors.
- B For $j=1,\dots,k$: on training folds only, compute correlations and pick top $d$; train logistic on those $d$ in the training folds; predict on $C_j$ using the same $d$ predictors; average errors.
- C Compute correlations on all $n$ observations and keep the top $d$; train one logistic regression on the full dataset and report training-set misclassification error as the generalisation estimate.
- D For $j = 1,\dots,k$ compute correlations on the held-out fold $C_j$ only, keep the top $d$ selected on $C_j$, train logistic on the remaining folds, and average errors.
Show answer
Correct answer: B
The discipline: anything that uses $y$ counts as training and must live inside the CV loop. Refit the filter on each training fold using only training data, then evaluate on the held-out fold with the predictors picked from training. The $d$ chosen predictors can change across folds — that's expected.
A is the wrong-way pipeline from Q12. C reports training error and never holds anything out. D inverts the leakage direction: now the held-out fold drives the selection, which is even worse and breaks the whole point of holding it out.
Atoms: nested-cv-and-cv-pitfalls, cross-validation.
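A compact simulation of the Q12/Q13 setup showing the two pipelines side by side; a sketch assuming scikit-learn's `LogisticRegression` and `StratifiedKFold`, with the correlation filter written by hand:

```python
# Pure-noise labels: n = 50, p = 5000, random 0/1 y, keep the top d = 25
# predictors by |correlation with y|. "Wrong way" screens once on all data
# before CV; "right way" screens inside each training fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n, p, d = 50, 5000, 25
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)                    # random labels: true error is 50%

def top_d_by_abs_corr(X, y, d):
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(np.abs(corr))[-d:]          # indices of the d strongest predictors

def cv_error(screen_inside):
    errs = []
    for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        cols = top_d_by_abs_corr(X[tr], y[tr], d) if screen_inside else top_d_by_abs_corr(X, y, d)
        clf = LogisticRegression(max_iter=1000).fit(X[np.ix_(tr, cols)], y[tr])
        errs.append(np.mean(clf.predict(X[np.ix_(te, cols)]) != y[te]))
    return np.mean(errs)

print("wrong-way CV error:", round(cv_error(screen_inside=False), 2))  # far too optimistic
print("right-way CV error:", round(cv_error(screen_inside=True), 2))   # close to 0.5, the honest answer
```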
You apply LOOCV to a regression model on hourly weather measurements from a single sensor. Adjacent hours are strongly correlated. Compared to its behaviour on i.i.d. data, what happens to the LOOCV estimate?
- A LOOCV becomes effectively unbiased in this setting because each fold uses a maximally large training set of $n-1$ observations, and bias is the only property of CV that depends on the dependence structure between observations.
- B LOOCV is unaffected by temporal correlation as long as the regression model itself is well-specified; correlation between observations only matters for the bootstrap, which assumes i.i.d. draws.
- C LOOCV underestimates the test error because the held-out point is essentially predicted by its near-neighbour, so per-fold errors are artificially small and CV picks an overly complex model.
- D LOOCV is unbiased for the conditional test error at observed time stamps but is too noisy to be useful, so the prof's recommendation is to fall back to a hold-out validation set with the final 20% of hours treated as the test set.
Show answer
Correct answer: C
The prof's warning: "two points right next to each other, one in your training, one in your validation, it's the same damn thing." The held-out hour is well-predicted by its left and right neighbours that are still in training, so the fold error doesn't reflect generalisation to a genuinely new region of the predictor space. CV therefore picks the most complex model possible. The fix is to chunk by the dependency dimension, e.g. block by day or week, so adjacent folds are temporally separated.
A confuses "uses $n-1$ points" with independence; bias and dependence are unrelated dimensions of the problem, and the bias-vs-dependence pairing here is incoherent. B is the canonical wrong intuition the prof flagged. D claims LOOCV is unbiased here and recommends a final-20% hold-out as a fix; both halves are wrong: LOOCV is biased downward by leakage (not unbiased), and a chronological hold-out without blocking the training side still suffers the same near-neighbour leakage at the seam.
Atoms: leave-one-out-cv, cross-validation.
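One way to implement the "chunk by the dependency dimension" fix; a sketch assuming scikit-learn's `GroupKFold`, with a synthetic hourly series and the day index as the group label:

```python
# Block the folds by day so no day is split between training and validation.
import numpy as np
from sklearn.model_selection import GroupKFold

n_hours = 24 * 60                                   # 60 days of hourly observations
hour = np.arange(n_hours)
day = hour // 24                                    # group label: the day each hour belongs to
X = hour.reshape(-1, 1).astype(float)               # placeholder feature
y = np.sin(2 * np.pi * hour / 24)                   # placeholder response

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=day):
    # whole days are held out together, so a held-out hour is not sitting
    # right next to its training neighbours (leakage remains only at the day
    # seams; block by week for stronger separation)
    assert set(day[train_idx]).isdisjoint(day[val_idx])
```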
Question 15
4 points
ISLP §5 Q2
Drawing a bootstrap sample of size $n$ from $n$ observations (with replacement). For large $n$, what is the approximate probability that any given original observation $i$ appears at least once in the bootstrap sample?
- A $\approx 0.368$
- B $\approx 0.500$
- C $\approx 0.632$
- D $\approx 1.000$
Show answer
Correct answer: C
$P(\text{not picked on one draw}) = 1 - 1/n$, raised to $n$ independent draws. As $n\to\infty$, $(1-1/n)^n \to 1/e \approx 0.368$, so $P(\text{in bootstrap}) = 1 - 1/e \approx 0.632$.
A is the OOB fraction (the complement, ~37% are not in any given bootstrap, used in OOB error). B is the "I split it 50/50" misintuition that ignores with-replacement sampling. D would correspond to sampling without replacement at full size, where every observation appears exactly once.
Atoms: bootstrap, out-of-bag-error.
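The limit is easy to check numerically:

```python
# (1 - 1/n)^n -> 1/e: ~36.8% of observations never appear in a given
# bootstrap sample, so ~63.2% appear at least once.
for n in (10, 100, 10_000):
    p_out = (1 - 1 / n) ** n                        # P(observation i never drawn)
    print(n, round(p_out, 4), round(1 - p_out, 4))  # -> ~0.368 and ~0.632
```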
Mark each statement about the bootstrap as true or false.
Show answer
- True — same size $n$ keeps the variance scale right; with-replacement is what makes successive samples genuinely different (without-replacement at full size just permutes the data).
- False — the bootstrap quantifies variability of $\hat\theta$, not its bias relative to the unknown true $\theta$. The bootstrap distribution is centred near $\hat\theta$, not near $\theta$.
Atom: bootstrap.
Question 17
4 points
Ex5.5
Which procedure correctly estimates the bootstrap standard error of an OLS coefficient $\hat\beta_j$ when you do not want to rely on the closed-form $\sigma^2(X^\top X)^{-1}$ formula?
- A Refit OLS $B$ times on $B$ random subsets of size $n/2$ drawn without replacement; take the sample SD of the $B$ values of $\hat\beta_j$ as the bootstrap SE estimate.
- B Permute the rows of $X$ relative to $y$ (without replacement) $B$ times, refit OLS, and use the sample SD of $\hat\beta_j$ across the permutations as the bootstrap SE.
- C Take a single bootstrap sample of size $n$ from $(X,y)$ with replacement, refit OLS, and report the residual standard error of that single fit as the bootstrap SE.
- D For $b = 1,\dots,B$: draw $n$ rows from $(X,y)$ with replacement; refit OLS; store $\hat\beta_j^{*b}$. Bootstrap SE is the sample SD of $\{\hat\beta_j^{*1},\dots,\hat\beta_j^{*B}\}$.
Show answer
Correct answer: D
Standard bootstrap recipe (Exercise 5.5 / 5.6): resample rows with replacement at size $n$, refit, collect the resulting $\hat\beta_j^{*b}$ values, and use their sample SD as the bootstrap SE estimate. Use $B$ in the 1,000–10,000 range.
A uses subsets of size $n/2$ without replacement; that is subsampling, not bootstrapping, and the SE has a different scale. B is permutation testing of the design matrix and breaks the relationship between $X$ and $y$, which is not what we want for a coefficient SE. C misses the point entirely: a single bootstrap sample tells you nothing about variability; the SE comes from the distribution of $\hat\beta_j$ across the $B$ resamples.
Atom: bootstrap.
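Option D as code; a minimal sketch of the pairs bootstrap on synthetic data, assuming scikit-learn's `LinearRegression` (plain `np.linalg.lstsq` would work just as well):

```python
# Pairs bootstrap for the SE of an OLS coefficient.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

B = 2000
beta1_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                 # draw n rows with replacement
    fit = LinearRegression().fit(X[idx], y[idx])     # refit on the bootstrap sample
    beta1_star[b] = fit.coef_[0]                     # store beta_1 hat for this resample

se_boot = beta1_star.std(ddof=1)                     # bootstrap SE = sample SD across resamples
print(round(se_boot, 4))
```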
$B = 1000$ bootstrap replicates of an estimator give the following ordered quantiles: $\hat\theta^*_{(25)} = 0.18$, $\hat\theta^*_{(50)} = 0.20$, $\hat\theta^*_{(500)} = 0.30$, $\hat\theta^*_{(950)} = 0.42$, $\hat\theta^*_{(975)} = 0.45$. Using the percentile method, what is a 95% bootstrap CI for $\theta$?
- A $[0.20, 0.42]$
- B $[0.18, 0.45]$
- C $[0.18, 0.42]$
- D $[0.30 \pm 1.96\,s]$.
Show answer
Correct answer: B
The percentile method takes the 2.5% and 97.5% quantiles of the bootstrap distribution. With $B = 1000$, those are positions 25 and 975, giving $[0.18,\,0.45]$.
A uses the median (position 50) and the 95% upper, mixing two different thresholds. C uses the right tail at 95% (position 950) instead of 97.5%, so the CI is too narrow on the upper end. D names the normal-approximation CI rather than the percentile CI; it could give a different answer, and it is not what "percentile method" denotes.
Atom: bootstrap.
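The percentile method is just two quantile calls on the vector of replicates; a sketch with synthetic stand-in replicates (not the numbers from the question):

```python
# Percentile-method CI: the 2.5% and 97.5% quantiles of the bootstrap replicates.
import numpy as np

theta_star = np.random.default_rng(4).normal(loc=0.30, scale=0.06, size=1000)  # stand-in replicates
lo, hi = np.quantile(theta_star, [0.025, 0.975])    # positions ~25 and ~975 of 1000 ordered values
print(round(lo, 3), round(hi, 3))
```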
Question 19
4 points
CE1 P4
You have fit a logistic regression of $\texttt{chd}$ on $\texttt{sbp}$ and $\texttt{sex}$, and want a 95% CI for the predicted probability $\hat P(\texttt{chd}=1\mid \texttt{sbp}=140, \texttt{sex}=\text{male})$. Why is the bootstrap a natural tool here?
- A Predicted probabilities under the logistic model are normally distributed in finite samples, and the bootstrap is required as an intermediate step to convert them to the log-odds scale where Wald-style CIs are valid before transforming back.
- B The bootstrap removes overfitting from the fitted model by resampling, so a 95% bootstrap CI for $\hat P$ is exactly the model's training-set error band, recentred at the held-out prediction at $(\texttt{sbp}=140, \texttt{sex}=\text{male})$.
- C Bootstrap percentile CIs always coincide with theoretical Wald CIs at any sample size whenever the underlying GLM has been correctly specified, so the bootstrap here is just a computationally cheaper alternative to the standard delta-method CI.
- D The closed-form $\sigma^2(X^\top X)^{-1}$ does not give an SE for nonlinear functions of the coefficients like a predicted probability; the bootstrap estimates the SE/CI of $\hat P$ directly without a closed form.
Show answer
Correct answer: D
Predicted probabilities are derived nonlinear quantities — $\sigma(\hat\beta_0 + \hat\beta_1 \cdot 140 + \hat\beta_2 \cdot 1)$. There is no clean closed-form SE, so resample $(X,y)$ rows with replacement, refit, recompute $\hat P$ at the same predictor values each time, take the SD across resamples for the SE and the 2.5%/97.5% quantiles for a percentile CI.
A fabricates a normality claim about predicted probabilities. B confuses confidence intervals with training-error bands. C is wrong — bootstrap CIs and Wald CIs agree only when the underlying assumptions hold; the whole reason to bootstrap is that they often don't.
Atom: bootstrap.
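The recipe from the explanation as code; a sketch on synthetic data, assuming scikit-learn's `LogisticRegression` (the names `sbp`/`sex` mirror the question, but the coefficients and data are made up):

```python
# Pairs bootstrap for a CI on a predicted probability from logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 500
sbp = rng.normal(135, 15, n)
sex = rng.integers(0, 2, n)                          # 1 = male
logit = -8.0 + 0.05 * sbp + 0.6 * sex
chd = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X = np.column_stack([sbp, sex])
x_new = np.array([[140.0, 1.0]])                     # sbp = 140, male

B = 1000
p_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                 # resample rows with replacement
    clf = LogisticRegression(max_iter=1000).fit(X[idx], chd[idx])
    p_star[b] = clf.predict_proba(x_new)[0, 1]       # recompute P(chd = 1 | x_new)

lo, hi = np.quantile(p_star, [0.025, 0.975])         # percentile CI for the predicted probability
print(round(lo, 3), round(hi, 3))
```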
For $n$ large, what fraction of the original observations are out-of-bag for any given bootstrap sample (i.e. not drawn into it at all)?
- A $\approx 1/2$
- B $\approx 1/e \approx 0.368$
- C $\approx 1 - 1/e \approx 0.632$
- D $0$, every observation is drawn at least once for $n$ large.
Show answer
Correct answer: B
$P(\text{not in bootstrap}) = (1-1/n)^n \to 1/e \approx 0.368$. Roughly 37% of observations are OOB for any given tree, which becomes a free per-tree validation set in bagging and random forests.
A is the validation-set-approach intuition leaking in. C is the in-bag fraction, the complement. D inverts the limit; with-replacement sampling cannot draw every distinct observation in $n$ draws.
Atoms: out-of-bag-error, bootstrap.
Mark each statement about bagging as true or false.
Show answer
- False — variance of the bagged predictor is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. The first term is a floor that doesn't shrink with $B$; correlated bootstrap samples make $\rho > 0$. Random forests address the floor by decorrelating trees with random predictor subsets.
- True — averaging only helps when the base learners are themselves noisy. A single deep tree is high-variance and benefits from averaging; OLS is already low variance, so bagging adds little.
Atoms: bagging, out-of-bag-error.
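Plugging numbers into the variance formula makes the correlation floor visible:

```python
# Var of the bagged predictor: rho*sigma^2 + (1 - rho)/B * sigma^2.
# The second term shrinks with B; the first term is the floor set by correlation.
sigma2 = 1.0
for rho in (0.0, 0.3, 0.6):
    for B in (1, 10, 100, 1000):
        var_bagged = rho * sigma2 + (1 - rho) / B * sigma2
        print(f"rho={rho:.1f}  B={B:>4}  Var={var_bagged:.3f}")
```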
Which statement most accurately distinguishes bagging from boosting?
- A Bagging and boosting both fit trees in parallel on bootstrap samples; the only difference is that bagging takes a simple average and boosting takes a weighted average learned from residuals.
- B Bagging and boosting are equivalent algorithms in expectation; the names refer to two implementations of the same ensemble idea over the same bootstrap draws.
- C Bagging fits trees in parallel on independent bootstrap samples and averages or votes; boosting fits trees sequentially, with each new tree correcting the previous fit.
- D Bagging is a single-tree algorithm using a bootstrap sample to pick splits; boosting is the multi-tree extension that fits one tree per residual round.
Show answer
Correct answer: C
Bagging = parallel + bootstrap + average; the goal is variance reduction. Boosting = sequential + residual-fitting + weighted ensemble; the goal is bias reduction. Different mechanisms, different tuning behaviour (increasing $B$ does not cause overfitting in bagging but can in boosting).
A is wrong about parallelism: boosting cannot be fit in parallel because each tree depends on the residuals of the ensemble so far, and its weighting is not a learned weight on bootstrap predictions but a sequential gradient-descent step on the loss. B asserts that bagging and boosting are equivalent in expectation, which they are not: one targets variance reduction at fixed bias, the other targets bias reduction at fixed (or even higher) variance. D invents a single-tree/multi-tree distinction that does not exist; both fit many trees.
Atoms: bagging, boosting.
You're choosing the size of a regression model on a non-Gaussian dataset with mild collinearity. AIC suggests one model size, 10-fold CV suggests another. What does the prof's framing recommend?
- A Use whichever criterion gives the smaller model: both AIC and CV are internally consistent estimators, so the more parsimonious choice is the safest default.
- B Refit OLS with collinearity-robust standard errors and pick the model whose coefficients are jointly significant; this replaces both criteria with hypothesis testing.
- C Trust the AIC choice: 10-fold CV is too noisy at $k = 10$, while AIC is mathematically optimal as a finite-sample estimator for any GLM.
- D Trust the 10-fold CV choice; AIC depends on distributional assumptions that this dataset likely violates, while CV makes fewer assumptions.
Show answer
Correct answer: D
The prof's running default: CV is robust to assumption violations, AIC/BIC/$C_p$ aren't. With non-Gaussian data, AIC's basis collapses; CV still does the right job because it actually evaluates predictions on held-out folds.
A treats two non-equivalent criteria as interchangeable and invents a "pick the smaller one" rule the prof never endorsed. B replaces both selection criteria with significance testing, which is a different problem entirely (significance answers "is $\beta_j \ne 0$", not "what model size minimises test error"). C fabricates a property of AIC ("mathematically optimal") that holds only under exact assumptions, which the scenario violates.
Atoms: cross-validation, aic-bic-conceptual.
Question 24
4 points
Exam 2025 P6b
A 10-fold CV-vs-$\log\lambda$ plot for a lasso regression has its minimum at $\log\lambda^* \approx -3$ with CV-MSE = 50.8. The unregularised OLS baseline (which corresponds to $\lambda \to 0$, i.e. far left of the plot) has test MSE = 50.78. Following the prof's interpretation in the L27 walkthrough, what do you conclude?
- A Lasso slightly underperforms the unregularised model on this dataset, suggesting that the bias added by regularisation isn't offset by a meaningful variance reduction; keep all parameters.
- B Lasso with the chosen $\lambda^*$ clearly beats OLS because the CV-MSE at the minimum is lower than the OLS test MSE; report the lasso model with zeroed coefficients dropped.
- C The $0.02$ MSE gap is so small it must reflect a numerical bug in the CV routine or the lasso solver; rerun with a different seed and trust only if the gap reverses sign.
- D CV is unreliable here because lasso's sparsity breaks the $k$-fold variance assumption; report the OLS solution and ignore the lasso curve until a sparsity-corrected SE is available.
Show answer
Correct answer: A
The prof's bias-variance tie-in: regularisation trades off bias up for variance down. If after CV-tuning the regularised model is no better than the unregularised one, the variance reduction wasn't enough to offset the bias gain. Keep all parameters and don't add a regulariser.
B inverts the comparison; lasso's MSE is slightly worse here. C dismisses a real, interpretable result as a bug. D rejects CV outright; CV is exactly how we know the lasso isn't helping, and the prof's whole argument depends on trusting the CV curve.
Atoms: cross-validation, regularization, bias-variance-tradeoff. Lecture: L27-summary.
Why is the validation-set approach not the same procedure as 2-fold cross-validation, even though both split the data in roughly two?
- A 2-fold CV fits the model twice (swapping train and validation roles) and averages the two fold-MSEs; the validation-set approach fits once and uses only one of those roles.
- B 2-fold CV uses bootstrap resampling with replacement to construct the two folds, whereas the validation-set approach partitions once without replacement and never resamples.
- C 2-fold CV is mathematically equivalent to LOOCV in the limit of large $n$; the validation-set approach is biased and not equivalent to either.
- D 2-fold CV first fits the model on the full data and then refits twice on each half for residual corrections; the validation-set approach skips the full-data fit.
Show answer
Correct answer: A
2-fold CV is symmetric: fold 1 trains, fold 2 validates, then swap, then average the two MSEs. The validation-set approach commits to one direction and never swaps, so its variance behaviour and bias are both worse than 2-fold CV.
B confuses CV with bootstrap; both 2-fold CV and the validation-set approach are partition-based without replacement. C is wrong, 2-fold and LOOCV are at opposite ends of the $k$-fold spectrum. D invents a "full-data fit" step that neither procedure includes.
Atoms: validation-set-approach, k-fold-cv.
Order the three CV schemes — LOOCV, 10-fold CV, validation-set approach — by their variance as estimators of test error (lowest variance first), assuming i.i.d. data and the same model.
- A LOOCV < 10-fold < validation-set.
- B Validation-set < 10-fold < LOOCV.
- C All three have the same variance because they are all CV.
- D 10-fold < LOOCV < validation-set.
Show answer
Correct answer: D
10-fold CV has lower variance than LOOCV because its training sets share less data (folds are less correlated). LOOCV's $n$ training sets share $n-2$ of $n-1$ points → highly correlated fold errors → high-variance average. Validation-set is the worst: a single split is wildly variable across reruns.
A inverts the LOOCV / 10-fold direction (the canonical CE1-4b trap). B inverts the LOOCV / validation-set direction; LOOCV is variable, but the validation-set approach is even worse. C ignores the bias-variance discussion that motivates picking $k = 5$ or $10$.
Atoms: leave-one-out-cv, k-fold-cv, validation-set-approach.
You have $n=1000$ observations and want both to select a hyperparameter and to report an honest test error. Which pipeline is consistent with the prof's framing?
- A Hold out 200 as a test set once; run 10-fold CV on the remaining 800 to pick the hyperparameter; refit on all 800 with the chosen value; report MSE on the held-out 200.
- B Run 10-fold CV on all 1000 observations once; pick the hyperparameter at the CV minimum and report that same CV error as the final test error, since CV is itself unbiased.
- C Run 10-fold CV on all 1000 observations; if the test MSE looks bad, retune the hyperparameter on the same 10-fold split and report whichever configuration gives the better MSE.
- D Hold out 200 as a test set; tune the hyperparameter by minimising the MSE on those 200 test observations across a grid; refit on all 1000 and report.
Show answer
Correct answer: A
Three jobs, three partitions. CV does the validation/selection job inside the 800-point training pool; the held-out 200 is touched only once at the end for assessment. (Nested CV is the equivalent answer when you can't afford a fixed test set.)
B reuses the same CV error for both selection and assessment, biasing the assessment downward. C is the same sin made worse: peeking at the test result and retuning is exactly the data-reuse pitfall the prof flagged ("don't make the dumb mistakes — it's embarrassing"). D tunes directly on the test set, which makes the test set part of training.
Atoms: training-validation-test-split, nested-cv-and-cv-pitfalls, cross-validation.
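Pipeline A as code; a sketch assuming scikit-learn's `GridSearchCV`, with ridge regression standing in for "a model with one hyperparameter":

```python
# Hold out a test set once, tune by 10-fold CV inside the training pool,
# refit with the chosen value, touch the test set exactly once at the end.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 20))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=2.0, size=1000)

# 1) one-time test hold-out of 200 observations
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=0)

# 2) 10-fold CV on the remaining 800 selects alpha; GridSearchCV then refits
#    on all 800 with the chosen value (refit=True is the default)
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=10, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# 3) the held-out 200 are used exactly once, for the headline number
test_mse = np.mean((y_test - search.predict(X_test)) ** 2)
print(search.best_params_, round(test_mse, 3))
```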