Module 05: Resampling — Book delta

Exam-time lookup reference for concrete artifacts that the prof taught in module 5 but that are not cleanly findable in ISLP Chapter 5. Each item is fully reproduced; cite back as [L10], [L11], [concept: ...], [slides].

The mapped ISLP chapter (wiki/book/05-resample.md) already contains: LOOCV formula (eq. 5.1), LOOCV OLS shortcut (eq. 5.2), $k$-fold CV (eq. 5.3), classification CV (eq. 5.4), the minimum-variance bootstrap example (eq. 5.6–5.7), the bootstrap SE formula (eq. 5.8), and the derivation in exercise 5.2. Those are not duplicated below.


1. Standard error of the CV estimate

Not in ISLP Ch. 5. ISLP states “k-fold CV has lower variance than LOOCV” but never writes the formula for $\widehat{\mathrm{SE}}(CV_k)$. The slide deck does, and the prof reproduces it; it is the only way to apply the one-standard-error rule.

For a given hyperparameter $\theta$ and a $k$-fold split with per-fold errors $\mathrm{err}_1(\theta), \dots, \mathrm{err}_k(\theta)$ and average $\overline{\mathrm{err}}(\theta) = CV_k(\theta)$:

$$\widehat{\mathrm{SE}}\big(CV_k(\theta)\big) = \sqrt{\frac{1}{k-1} \sum_{j=1}^{k} \big(\mathrm{err}_j(\theta) - \overline{\mathrm{err}}(\theta)\big)^{2}}$$

This is the sample standard deviation of the per-fold MSEs (the slide deck’s formula prints without the inner square; the prof’s lecture and the canonical form both square the residual inside the sum, as written above). [slides; L10; concept: k-fold-cv]
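
A minimal Python sketch of this computation (the per-fold numbers are made up; the point is the ddof=1 sample SD):

import numpy as np
# Hypothetical per-fold validation MSEs for one hyperparameter value (k = 5).
per_fold_errors = np.array([3.1, 2.8, 3.4, 2.9, 3.3])
cv_estimate = per_fold_errors.mean()          # CV_k(theta)
se_cv = per_fold_errors.std(ddof=1)           # sample SD: 1/(k-1) inside the sqrt
print(f"CV = {cv_estimate:.3f}, SE = {se_cv:.3f}")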

Why "not quite valid"

Slide-deck footnote, prof flagged as a thought question: “Strictly speaking, this estimate is not quite valid. Why?” — the same held-out folds are used to pick the minimum and to estimate the SE, so the SE is not an independent estimate of CV-error variability. [L10; concept: one-standard-error-rule]


2. The one-standard-error rule

Not in ISLP Ch. 5. ISLP introduces it briefly in §6.1.3 in the model-selection chapter, not in chapter 5. For module 5 it is a delta: the prof’s preferred selection criterion, defined in the resampling lecture.

Definition. Given a $k$-fold CV curve $CV_k(\theta)$ with associated SE $\widehat{\mathrm{SE}}(\theta)$:

  1. Find $\hat{\theta}_{\min} = \arg\min_{\theta} CV_k(\theta)$.
  2. Walk $\theta$ in the direction of “simpler” (lower polynomial degree, larger $K$ in KNN, larger $\lambda$ in ridge/lasso, fewer terminal nodes in trees, fewer predictors in subset selection) while the bound

$$CV_k(\theta) \le CV_k(\hat{\theta}_{\min}) + \widehat{\mathrm{SE}}(\hat{\theta}_{\min})$$

still holds.
  3. The simplest $\theta$ still satisfying the bound is $\hat{\theta}_{1\mathrm{SE}}$.

[slides; L10; concept: one-standard-error-rule]
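
A minimal sketch of the rule, assuming the candidate $\theta$ values are ordered from most complex (index 0) to simplest (all numbers illustrative):

import numpy as np
cv = np.array([2.10, 1.85, 1.80, 1.83, 1.95, 2.40])   # CV_k(theta), complex -> simple
se = np.array([0.08, 0.07, 0.06, 0.07, 0.09, 0.12])   # SE-hat(theta)
i_min = int(np.argmin(cv))                             # theta-hat_min
bound = cv[i_min] + se[i_min]                          # CV_k + one SE at the minimum
within = np.where(cv <= bound)[0]                      # thetas satisfying the bound
i_1se = int(within[within >= i_min].max())             # simplest one still within
print(f"theta_min at index {i_min}, theta_1SE at index {i_1se}")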


3. Variance-of-a-sum identity (the LOOCV-variance argument)

Not in ISLP Ch. 5. ISLP §5.1.4 gives the intuition “highly correlated outputs → high variance of the average” but does not write the underlying identity. The slide deck prints it explicitly to motivate LOOCV’s high variance, and the prof uses it in L10.

For real-valued random variables $X_1, \dots, X_k$ and constants $a_1, \dots, a_k$:

$$\mathrm{Var}\Big(\sum_{i=1}^{k} a_i X_i\Big) = \sum_{i=1}^{k} a_i^{2}\,\mathrm{Var}(X_i) + 2\sum_{i<j} a_i a_j\,\mathrm{Cov}(X_i, X_j)$$

Application to LOOCV: $X_i = \mathrm{err}_i$, $a_i = 1/n$. The training sets differ by only a single observation, so pairwise $\mathrm{Cov}(\mathrm{err}_i, \mathrm{err}_j)$ is large and positive. The cross-covariance term dominates, and the variance of $CV_{(n)}$ does not collapse to $\mathrm{Var}(\mathrm{err}_i)/n$ — it is large. In $k$-fold CV with small $k$, two training sets share only $(k-2)/k$ of the data (60% for $k=5$), so the cross-covariance term is much smaller. [slides; L10; concept: leave-one-out-cv]
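
A quick numeric check of the identity, using a shared component to induce the positive pairwise covariance (all numbers illustrative):

import numpy as np
rng = np.random.default_rng(0)
k = 10
shared = rng.normal(size=100_000)                      # common component -> Cov > 0
X = shared[:, None] + 0.5 * rng.normal(size=(100_000, k))
a = np.full(k, 1.0 / k)                                # a_i = 1/k: the average
lhs = (X @ a).var()                                    # Var(sum a_i X_i), empirically
rhs = a @ np.cov(X, rowvar=False) @ a                  # identity's right-hand side
print(f"LHS = {lhs:.4f}, RHS = {rhs:.4f}")             # both stay near Var(shared)=1:
# the cross-covariance floor does not shrink as k grows.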


4. Weighted $k$-fold CV formula (slide form)

Differs from ISLP. ISLP §5.1.3 prints $CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i$ (eq. 5.3), implicitly assuming equal-sized folds. The slide deck gives the size-weighted form, which is the same when folds are equal but technically more general:

$$CV_{(k)} = \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{MSE}_j, \qquad \mathrm{MSE}_j = \frac{1}{n_j}\sum_{i \in F_j} \big(y_i - \hat{y}_i^{(-j)}\big)^{2}$$

where $F_1, \dots, F_k$ are the index sets of the folds, $n_j = |F_j|$, and $\hat{y}_i^{(-j)}$ is the prediction from the model trained on all folds except $F_j$ (for $i \in F_j$). [slides; L10; concept: k-fold-cv]

For classification, swap squared error for misclassification:

$$CV_{(k)} = \sum_{j=1}^{k} \frac{n_j}{n}\,\mathrm{Err}_j, \qquad \mathrm{Err}_j = \frac{1}{n_j}\sum_{i \in F_j} I\big(y_i \ne \hat{y}_i^{(-j)}\big)$$
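
A sketch of the size-weighted form with deliberately unequal folds (scikit-learn’s KFold and LinearRegression are stand-ins for any splitter and model):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(103, 3))                          # n = 103: folds can't be equal
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=103)
n = len(y)
cv_weighted = 0.0
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    mse_j = np.mean((y[val_idx] - model.predict(X[val_idx])) ** 2)
    cv_weighted += (len(val_idx) / n) * mse_j          # weight by fold size n_j / n
print(f"size-weighted CV = {cv_weighted:.4f}")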


5. Nested cross-validation (selection + assessment)

Not in ISLP Ch. 5. ISLP §5.1.4’s bias-variance discussion implicitly assumes you only do one of selection or assessment with CV; nested CV is the prof’s solution when you must do both, drawn on the board in L11.

Why needed. Once you have used CV to select a hyperparameter, that same CV error cannot honestly assess the chosen model — you optimized against it. Per the slide deck: “using the test set for both model selection and estimation tends to overfit the test data, and the bias will be underestimated.” [slides; L11]

Procedure. Two layers of CV:

Input: data of size n, candidate hyperparameters θ ∈ Θ, outer folds k_out, inner folds k_in.
Randomly partition {1,...,n} into k_out outer folds O_1, ..., O_{k_out}.
For ℓ = 1, ..., k_out:                          # OUTER: assessment
    outer_train_ℓ = data with O_ℓ removed
    outer_test_ℓ  = data in O_ℓ
    Randomly partition outer_train_ℓ into k_in inner folds I_1, ..., I_{k_in}.
    For each θ ∈ Θ:                              # INNER: selection
        For m = 1, ..., k_in:
            Fit model M(θ) on outer_train_ℓ minus I_m.
            Compute err_m(θ) on I_m.
        Inner_CV(θ) = (1/k_in) Σ_m err_m(θ)
    Pick θ̂_ℓ = argmin_θ Inner_CV(θ).
    Fit M(θ̂_ℓ) on all of outer_train_ℓ.
    Score on outer_test_ℓ → err_ℓ.
Outer_CV = (1/k_out) Σ_ℓ err_ℓ        # ← honest assessment estimate

Mapped onto the three partitions: outer test fold = “test”, inner training folds = “training”, inner validation folds = “validation”. The outer CV’s score is the honest test error of the selection procedure (not of any single $\hat{\theta}$). If the inner CV picks the same $\hat{\theta}$ on most outer folds, fit the final model on all data with that $\hat{\theta}$ and report Outer_CV as the assessment. [L11; concept: nested-cv-and-cv-pitfalls]
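
The same two layers expressed with scikit-learn conveniences, GridSearchCV as the inner selection loop and cross_val_score as the outer assessment loop (Ridge and its alpha grid are illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)
# INNER: k_in-fold CV over the grid; refits the winner on the inner training set.
inner = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]},
                     cv=5, scoring="neg_mean_squared_error")
# OUTER: each outer fold scores the whole selection procedure, not one theta.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="neg_mean_squared_error")
print(f"Outer_CV (honest MSE estimate) = {-outer_scores.mean():.2f}")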


6. The right vs wrong way to do CV (selection-bias trap)

Not in ISLP Ch. 5 (it is in Elements of Statistical Learning §7.10, which the slide deck references). The prof devotes ~¼ of L11 to it and flags it as exam-bait.

Setup that exposes the trap

  • $n$ observations, $p \gg n$ predictors, all generated as iid noise. No relationship to $y$.
  • $y$ assigned as 50/50 random labels. True Bayes error = 50%.
  • Two-step pipeline:
    1. Compute the correlation between each predictor and $y$. Keep the top $d$.
    2. Fit logistic regression on those $d$ predictors.

Wrong way

Run step 1 on the FULL data → selected predictors S.
For j = 1, ..., k:
    Train logistic regression on data[-fold j, S], evaluate on data[fold j, S].
CV error ≈ 0% (or ~20% at best) on pure noise.

This is a lie: step 1 already used the labels on the full dataset, so the held-out fold has leaked into selection. “The correlation is already doing a bit of the work for you. It’s already a statistical model. You already are selecting parameters based on this for this specific data.” [L11]

Right way

For j = 1, ..., k:
    training_j  = data with fold j removed
    validation_j = fold j
    On training_j ONLY:
        Compute correlation between each predictor and y on training_j.
        Pick top d predictors → selected_j   (different per fold!)
        Fit logistic regression on training_j[, selected_j].
    Predict on validation_j[, selected_j]; compute misclassification on fold j.
CV error ≈ 50%, the truth.

General rule: anything that uses the data — correlation filter, variance filter, supervised PCA, even informal peeks at residual plots — is part of training and must live inside the CV loop. [L11; slides; concept: nested-cv-and-cv-pitfalls]
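
A sketch that reproduces the trap on synthetic noise (the dimensions and d are illustrative, not the slide deck’s exact numbers):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
rng = np.random.default_rng(0)
n, p, d = 50, 1000, 20
X = rng.normal(size=(n, p))                            # pure noise predictors
y = rng.integers(0, 2, size=n)                         # 50/50 random labels
def top_d(X_sub, y_sub):
    corr = np.abs(np.corrcoef(X_sub, y_sub[:, None], rowvar=False)[-1, :-1])
    return np.argsort(corr)[-d:]                       # indices of the top-d filter
def cv_error(select_inside):
    S_full = top_d(X, y)                               # WRONG: screening sees all labels
    errs = []
    for tr, va in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
        S = top_d(X[tr], y[tr]) if select_inside else S_full
        m = LogisticRegression(max_iter=1000).fit(X[tr][:, S], y[tr])
        errs.append(np.mean(m.predict(X[va][:, S]) != y[va]))
    return np.mean(errs)
print(f"wrong way: {cv_error(False):.2%} (optimistic)")
print(f"right way: {cv_error(True):.2%} (~50%, the truth)")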

How bad it gets, verbatim

“In the recommended exercises, one of the exercises is basically create fake data using data where there should be no relationship at all, but by pre-selecting which variables you use, you actually get a misclassification error of zero — suggesting like, this is an excellent model, when in reality we know that it’s crap.” [L11]


7. Three-partition rule (training / validation / test)

Not stated as a formal artifact in ISLP Ch. 5 — ISLP folds it into the chapter introduction. The prof treats it as the foundational discipline for everything in modules 5–11.

| Partition  | Job |
| ---------- | --- |
| Training   | Fit the model. |
| Validation | Select among candidate models / pick hyperparameters (model selection). |
| Test       | Report final performance (model assessment). |

Data-reuse principle (verbatim): “We will be too optimistic if we report the error on the test set when we have already used it to choose the best model… Don’t do that.” [L10; slides]

CV replaces the validation set, not the test set. Pipeline: (i) split off a test set once at the start; (ii) on the rest, run $k$-fold CV to pick the model; (iii) refit on all of “the rest” with the chosen hyperparameter; (iv) report performance on the held-out test set. [L10; concept: training-validation-test-split]
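
A sketch of that four-step pipeline (Ridge and its grid are stand-ins for whatever family is being selected):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
X, y = make_regression(n_samples=300, n_features=15, noise=5.0, random_state=0)
# (i) split off a test set ONCE, before anything else
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# (ii) k-fold CV on the rest to pick the hyperparameter
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_rest, y_rest)
# (iii) GridSearchCV already refit the winner on all of X_rest (refit=True)
# (iv) report performance on the untouched test set
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"chosen alpha = {search.best_params_['alpha']}, test MSE = {test_mse:.2f}")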


8. The independence trap (spatial / temporal correlation)

Not in ISLP Ch. 5. The prof’s longest tangent in L10, flagged as the headline pitfall of the whole module.

The problem. If observations are correlated (in space, in time, by family / clustering), random partitioning into train/validation gives two sets that leak into each other:

“You’re going to take two points that are right next to each other, but one in your training data, one in your validation data — it’s the same damn thing. You haven’t created two sets of data. You’ve just partitioned the same data twice, basically.” [L10]

Effect on LOOCV (extra bad). The held-out point is essentially predicted by its near-neighbor → fold error is artificially tiny → CV recommends the most complex model possible. “It will be basically identical to just using the likelihood without any penalization.” [L10]

Mitigation.

  • Chunk by the dependency dimension (whole spatial blocks, whole time windows) before splitting.
  • Prof’s neuro-data trick: throw away nearby time bins to enforce independence between fold boundaries — “I’ll throw away 80% or 90% of my data by taking a time bin and then jumping ahead another 10 time bins away and throwing all the stuff away in the middle.” [L10]
  • $k$-fold tolerates the fix better than LOOCV because you can deliberately construct folds that respect the dependency structure (put a whole block in one fold; see the sketch below).

[L10; concept: cross-validation]
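
A minimal sketch of blocked-fold construction, assuming rows are already sorted along the dependency dimension (time here):

import numpy as np
def blocked_folds(n, k):
    # Contiguous blocks of time-ordered indices per fold, instead of
    # scattering near-neighbors across training and validation.
    return np.array_split(np.arange(n), k)             # indices assumed time-sorted
for j, val_idx in enumerate(blocked_folds(n=100, k=5)):
    train_idx = np.setdiff1d(np.arange(100), val_idx)
    print(f"fold {j}: validate on t = {val_idx[0]}..{val_idx[-1]}")
# A stricter variant (the prof's neuro-data trick) would also drop a buffer
# of time bins on either side of each validation block.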


9. Bagging (preview from L11; full treatment is in module 8)

Not in ISLP Ch. 5. ISLP places bagging in §8.2.1. The prof previews it inside module 5 because the resampling machinery is identical to the bootstrap, and the variance formula uses the same correlated-mean argument used to explain LOOCV variance.

Procedure

Training data $\{(x_i, y_i)\}_{i=1}^{n}$. For $b = 1, \dots, B$:

  1. Draw a bootstrap sample of size $n$ with replacement → $\mathcal{D}^{*b}$.
  2. Fit the base model on it → $\hat{f}^{*b}$.

Aggregate:

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$$

For classification: majority vote across $\hat{f}^{*b}(x)$, or average class probabilities and take $\arg\max$. [L11; slides; concept: bagging]
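
A minimal bagging sketch for regression (the tree base learner and toy data are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
B, n = 100, len(y)
models = []
for b in range(B):
    idx = rng.integers(0, n, size=n)                   # n draws WITH replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
def f_bag(X_new):
    return np.mean([m.predict(X_new) for m in models], axis=0)   # average the B fits
print(f"f_bag(0) = {f_bag(np.array([[0.0]]))[0]:.3f} (true value sin(0) = 0)")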

Variance formula (the $\rho\sigma^2$ floor)

Let $\sigma^2 = \mathrm{Var}\big(\hat{f}^{*b}(x_0)\big)$ at a fixed query point $x_0$ (variance across bootstrap realizations of the fit), and let $\rho = \mathrm{Corr}\big(\hat{f}^{*b}(x_0), \hat{f}^{*b'}(x_0)\big)$ be the pairwise correlation for $b \ne b'$. Then:

$$\mathrm{Var}\big(\hat{f}_{\mathrm{bag}}(x_0)\big) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$$

Derivation (using the variance-of-sum identity in §3):

$$\mathrm{Var}\Big(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x_0)\Big) = \frac{1}{B^2}\Big(B\,\sigma^2 + B(B-1)\,\rho\sigma^2\Big) = \frac{\sigma^2}{B} + \frac{B-1}{B}\,\rho\sigma^2$$

Algebraic rearrangement gives $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. [L11; concept: bagging]

Consequences.

  • If $\rho = 0$ (IID samples), variance collapses to $\sigma^2/B$ — the textbook variance-of-the-mean.
  • If $\rho > 0$, the first term $\rho\sigma^2$ is a floor that does not shrink with $B$. This is the motivation for random forests (module 8), which decorrelate trees by randomizing the predictor subset at each split — pushing $\rho$ down.

When bagging works / fails

  • Works for high-variance, low-bias base learners (trees especially, KNN with small $K$, deep neural nets in some regimes).
  • Fails to help for already-low-variance learners (linear regression).
  • The “use enough $B$” discipline: $B$ is not a tuned hyperparameter for bagging / RF; pick a big number, no CV needed. (Contrast with boosting.)

“It’s actually using this bagging trick implicitly” — the prof’s aside on why very large over-parameterized single models can win: they implicitly bag across their many parameter subsets, the seed of the double-descent discussion. [L11]


10. Out-of-bag (OOB) error

Not in ISLP Ch. 5. ISLP places OOB in §8.2.1. The prof introduces it as the natural by-product of bagging because it reuses the with-replacement probability calculation from Ch. 5.

The 1/e result

For a bootstrap sample of size $n$ drawn with replacement from $n$ original observations:

$$P(\text{observation } i \text{ not in the sample}) = \Big(1 - \frac{1}{n}\Big)^{n} \longrightarrow e^{-1} \approx 0.368 \quad (n \to \infty)$$

So $\approx 36.8\%$ of observations are out-of-bag for any given bootstrap tree, and $\approx 63.2\%$ are in-bag. Convergence is fast: $n = 10$ gives $0.349$, $n = 100$ gives $0.366$.
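
Quick check of the limit:

for n in (10, 100, 10_000):
    print(n, (1 - 1/n) ** n)                           # 0.3487, 0.3660, 0.3679 -> 1/e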

(The derivation itself appears in ISLP §5.2 exercise 2; the use of it as a free test-set estimator is module-5 delta vis-à-vis Ch. 5.) [concept: out-of-bag-error]

OOB prediction and OOB error

For each $i$, define $B_{-i} = \{\, b : i \notin \mathcal{D}^{*b} \,\}$ — the trees that did not see observation $i$. The OOB prediction is

$$\hat{y}_i^{\mathrm{OOB}} = \frac{1}{|B_{-i}|}\sum_{b \in B_{-i}} \hat{f}^{*b}(x_i)$$

and the OOB error is

$$\mathrm{Err}_{\mathrm{OOB}} = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, \hat{y}_i^{\mathrm{OOB}}\big)$$

Key claim. OOB error replaces a separate test set / a separate $k$-fold pass for bagging-family methods. It is approximately equivalent to LOOCV in the limit of large $B$ — each observation is held out from a random subset of trees, and the OOB prediction averages over the trees that did not see it. [L11; concept: out-of-bag-error]
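
A self-contained sketch of the OOB bookkeeping under squared-error loss (tree base learner and toy data are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(0)
n, B = 200, 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
pred_sum = np.zeros(n)                                 # running sum of OOB predictions
counts = np.zeros(n)                                   # |B_{-i}|: trees that missed i
for b in range(B):
    idx = rng.integers(0, n, size=n)                   # bootstrap sample b
    oob = np.setdiff1d(np.arange(n), idx)              # observations NOT drawn
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    pred_sum[oob] += tree.predict(X[oob])
    counts[oob] += 1
seen = counts > 0                                      # ~all i once B is large
y_oob = pred_sum[seen] / counts[seen]                  # OOB prediction per observation
print(f"OOB MSE = {np.mean((y[seen] - y_oob) ** 2):.4f}")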

OOB only works for bagging-family methods. Boosting fits trees sequentially on a single re-weighted training set; no per-tree OOB concept exists.


11. Bootstrap pseudocode for regression coefficients

ISLP §5.2 gives the bootstrap SE formula (eq. 5.8) abstractly and demos it in the lab for the $\hat{\alpha}$ statistic (the minimum-variance example of eqs. 5.6–5.7). The regression-coefficient version (exercise 5.5–5.6 of the course and CE1 problem 4d in modified form) is the explicit pseudocode form the prof expects you to be able to write.

Input: data {(x_i, y_i)}_{i=1}^n, regression model formula, number of resamples B.
Fit OLS on the original data → β̂ = (XᵀX)⁻¹Xᵀy.
For b = 1, ..., B:
    Draw n row-indices from {1,...,n} with replacement → idx_b.
    Form (X*, y*) = (X[idx_b], y[idx_b]).         # SAME (x_i, y_i) pairs, just resampled
    Fit OLS: β̂*_b = ((X*)ᵀ X*)⁻¹ (X*)ᵀ y*.
    Store β̂*_b.
Compute, for each coefficient j:
    β̄_j*  = (1/B) Σ_b β̂*_{b,j}
    SE_boot(β̂_j) = sqrt( (1/(B−1)) Σ_b (β̂*_{b,j} − β̄_j*)² )
Confidence interval for β_j:
    Percentile method: 2.5% and 97.5% quantiles of {β̂*_{b,j}}_{b=1}^B
    Normal-approx:     β̄_j* ± 1.96 · SE_boot(β̂_j)

Compare against the closed-form $\widehat{\mathrm{SE}}(\hat{\beta}_j)$ from module 3. They agree when residuals satisfy the standard IID-Gaussian assumptions; they diverge when those fail, in which case the bootstrap is closer to the truth because the closed-form depends on the (potentially wrong) estimate $\hat{\sigma}^2$. [L11; concept: bootstrap]
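
A direct numpy translation of the pseudocode on simulated data (B, the design, and the coefficients are illustrative):

import numpy as np
rng = np.random.default_rng(0)
n, B = 100, 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one predictor
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n)
def ols(X_, y_):
    return np.linalg.lstsq(X_, y_, rcond=None)[0]
beta_hat = ols(X, y)
boot = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)                   # resample (x_i, y_i) PAIRS
    boot[b] = ols(X[idx], y[idx])
se_boot = boot.std(axis=0, ddof=1)
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)      # percentile-method CI
for j in range(X.shape[1]):
    print(f"beta_{j}: {beta_hat[j]:+.3f}, SE_boot {se_boot[j]:.3f}, "
          f"95% CI [{lo[j]:+.3f}, {hi[j]:+.3f}]")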

Bootstrap for a derived quantity (CE1 4d pattern)

Same recipe, with the statistic being a predicted probability rather than a coefficient:

Fit logistic regression on original data → p̂(x_0) for a specified x_0.
For b = 1, ..., B:
    Draw n rows with replacement.
    Refit logistic regression on the bootstrap rows.
    Compute p̂*_b(x_0).
SE_boot(p̂(x_0)) = sample SD of {p̂*_b(x_0)}_{b=1}^B
95% CI: 2.5% and 97.5% percentiles of {p̂*_b(x_0)}

This is the prof-flagged “bootstrap a derived quantity” pattern where no closed-form SE exists. [concept: bootstrap]
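
The same pattern in code, with the statistic a predicted probability at an illustrative query point x0 (scikit-learn’s LogisticRegression as the fitter):

import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
n, B = 200, 1000
X = rng.normal(size=(n, 2))
y = (X @ np.array([1.5, -1.0]) + rng.normal(size=n) > 0).astype(int)
x0 = np.array([[0.5, 0.5]])                            # query point for p-hat(x0)
p_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                   # resample rows with replacement
    m = LogisticRegression().fit(X[idx], y[idx])
    p_star[b] = m.predict_proba(x0)[0, 1]              # p-hat*_b(x0)
print(f"SE_boot = {p_star.std(ddof=1):.4f}, 95% CI = "
      f"[{np.percentile(p_star, 2.5):.3f}, {np.percentile(p_star, 97.5):.3f}]")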


12. Why-the-classification-CV-curve-rises explanation

Not in ISLP Ch. 5. ISLP §5.1.5 shows the classification CV plot (Figure 5.7) but explains the upturn only as a generic bias-variance shift. The prof gives the sharper explanation in L11.

Claim. The training error of logistic regression can rise with polynomial degree on the misclassification metric, even though training likelihood always improves with more parameters.

Reason. “The error rate that they use on the y is not actually computed the same way as the log likelihood or the likelihood of the model. So you’re not actually fitting the error rate directly when you’re fitting logistic regression. So even though you’re adding more parameters to your model, so you’re getting what should be a better fit in your logistic regression, it’s not actually minimizing the same thing as the misclassification rate.” [L11]

General principle. When the loss you optimize (likelihood / cross-entropy) differs from the metric you report (0/1 misclassification), “more flexibility” no longer guarantees “lower reported error.” Comes up later in every classification context where the model is fit by likelihood but evaluated by misclassification or AUC. [L11]
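
A sketch that puts the optimized loss and the reported metric side by side on the training data (synthetic; C=1e6 approximates the unpenalized MLE so the fit targets the likelihood):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import log_loss
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=1.5, size=100) > 0).astype(int)
for degree in (1, 2, 3, 5, 8):
    Xd = PolynomialFeatures(degree, include_bias=False).fit_transform(X)
    m = LogisticRegression(C=1e6, max_iter=5000).fit(Xd, y)
    nll = log_loss(y, m.predict_proba(Xd))             # what the fit minimizes
    err = np.mean(m.predict(Xd) != y)                  # what the plot reports
    print(f"degree {degree}: train log-loss {nll:.3f}, train 0/1 error {err:.2%}")
# The two columns need not move together: lowering the log-loss does not
# obligate the 0/1 error to fall monotonically.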


13. Bootstrap central idea, verbatim

Not in ISLP Ch. 5 in this form. ISLP describes the procedure operationally; the prof’s framing is the one to write on the A5 sheet:

“Your best model for the real world — like for the real data, not just the sample that you have but the bigger sample everywhere — your best model for that is the data itself. And so if you want to look at different realizations of the data, you resample from that same data with replacement, because it’s always going to be the best model for the world.” [L11]

The empirical distribution $\hat{F}_n$ puts mass $1/n$ on each observed point. Sampling from $\hat{F}_n$ with replacement at full size $n$ is the bootstrap. Without replacement at full $n$ it is a permutation — useless. The same-length-with-replacement requirement is what makes it different. [L11; concept: bootstrap]


14. Notation and naming differences

| Slides / prof | ISLP Ch. 5 | Notes |
| --- | --- | --- |
| $CV_{(k)} = \sum_j \frac{n_j}{n}\mathrm{MSE}_j$ (weighted) | $CV_{(k)} = \frac{1}{k}\sum_i \mathrm{MSE}_i$ (unweighted, eq. 5.3) | Same when folds equal-sized. Prof’s form is general. |
| “Validation set approach” | “Validation set approach” (§5.1.1) | Same. Prof emphasizes it is not strictly cross-validation. |
| LOOCV: $h_i$ for hat-matrix leverage | $h_i$ (eq. 5.2) | Same. Some treatments write $h_{ii}$. |
| Classification CV with indicator errors $\mathrm{Err}_i$ | Eq. 5.4 | Identical, indicator notation. |
| Bootstrap SE from $B$ resamples | Eq. 5.8 | Identical formula. |
| $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ (bagging variance) | ISLP §8.2.1 mentions decorrelation but does not print this formula | Formula is module 5 slide + L11 derivation. |
| “Wrong-way CV” / “selection bias in CV” | Hastie–Tibshirani §7.10 (ESL); not in ISLP Ch. 5 | Different book. The prof’s verbatim framing is “lying with statistics.” |
| Three-partition: training / validation / test | Same names | Identical. |
| “Inner / outer CV” for nested CV | Not formally named in ISLP Ch. 5 | Module 5 delta. |
| OOB fraction $\approx 1/e \approx 36.8\%$ | Exercise 5.2 derives the probability but OOB waits until §8.2.1 | Module 5 delta. |
| cv.glm / cv.tree / boot::boot (R) — function names | ISLP uses Python: cross_validate, KFold, ShuffleSplit, custom boot_SE | Out of scope per prof’s exam policy — no language, no package names. Listed only because the slide deck uses R names. |