Leave-one-out cross-validation (LOOCV)

The other extreme from the validation set approach: $n$-fold CV with one observation per fold. Low bias (train on $n-1$ observations each time), but high variance because the training sets are nearly identical → fold errors highly correlated → averaging them doesn’t help much. For OLS there’s a beautiful hat-matrix shortcut that needs only one fit.

Definition (prof’s framing)

For $i = 1, \dots, n$:

  1. Hold out observation $i$.
  2. Fit the model on the remaining $n-1$ observations.
  3. Predict $\hat{y}_{(-i)}(x_i)$ and compute $\mathrm{Err}_i = \big(y_i - \hat{y}_{(-i)}(x_i)\big)^2$ (regression) or $\mathrm{Err}_i = I\big(y_i \neq \hat{y}_{(-i)}(x_i)\big)$ (classification).

Average across all folds: $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Err}_i$.
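A minimal sketch of this loop in code, assuming scikit-learn’s fit/predict estimator interface (the helper name loocv_mse and the use of sklearn are my own, not from the course material):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_mse(model, X, y):
    """Plain LOOCV loop: n refits, one held-out squared error per fold."""
    errs = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])    # step 2: fit on the remaining n-1 points
        pred = model.predict(X[test_idx])        # step 3: predict the single held-out point
        errs.append((y[test_idx][0] - pred[0]) ** 2)
    return np.mean(errs)                         # CV_(n): average of the n fold errors
```

Any estimator exposing fit/predict drops in; for classification, swap the squared error for a 0/1 loss.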

Notation & setup

  • $n$ = sample size; LOOCV requires $n$ refits.
  • No randomness: the procedure is fully deterministic given the data.
  • For OLS, the closed-form shortcut below replaces the $n$ refits with a single fit.

Formula(s) to know cold

The general LOOCV estimator:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{y}_{(-i)}(x_i)\big)^2$$

where $\hat{y}_{(-i)}(x_i)$ is the prediction at $x_i$ from the model trained without observation $i$.

The hat-matrix shortcut (OLS only):

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$$

where $\hat{y}_i$ is the full-data fitted value (no leave-out) and $h_i$ is the $i$-th diagonal element of the hat matrix $H = X(X^\top X)^{-1}X^\top$. Only one fit needed.

“We have a nice shortcut for linear regression and in some other settings, but not all.” - L10-resample-1

This is derived in compulsory exercise 1 (per the slide deck pointer). The leverage $h_i$ measures how much obs $i$ pulls the fitted surface toward itself; dividing the residual by $1 - h_i$ inflates it to account for the fact that obs $i$ helped fit the surface.
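A quick numerical sanity check of the shortcut against the brute-force loop (numpy only; the toy data is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept

# Brute force: refit OLS n times, leaving one observation out each time.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ beta) ** 2)
cv_brute = np.mean(errs)

# Shortcut: one full-data fit plus the diagonal of the hat matrix.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_full
H = X @ np.linalg.solve(X.T @ X, X.T)          # H = X (X'X)^{-1} X'
h = np.diag(H)
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

print(cv_brute, cv_shortcut)                   # agree up to floating-point error
```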

Insights & mental models

  • No randomness: fully deterministic. The validation set approach varies wildly across splits; LOOCV doesn’t.
  • The “highly-correlated folds” intuition is the core variance argument:
    • Two LOOCV training sets share $n-2$ of their $n-1$ observations → essentially identical → the per-fold errors are highly correlated.
    • Variance of an average of correlated variables: $\mathrm{Var}\big(\frac{1}{n}\sum_i \mathrm{Err}_i\big) = \frac{1}{n^2}\sum_i \mathrm{Var}(\mathrm{Err}_i) + \frac{1}{n^2}\sum_{i\neq j}\mathrm{Cov}(\mathrm{Err}_i,\mathrm{Err}_j)$. The cross-covariance terms blow up when the $\mathrm{Err}_i$ are highly correlated, so the average has high variance, even though we averaged $n$ estimates (worked special case after this list).
  • Outlier sensitivity is a property to know: an extreme point hits exactly one LOOCV fold with a huge held-out error, which can dominate the average. The prof framed this as a kind of feature (it tells you which points are problematic) but mostly as a thing to be aware of.
  • Independence trap (extra-bad version): if your data is dependent in time/space, LOOCV is terrible: the model can predict the held-out point from its neighbors, so the fold error is artificially tiny and CV will recommend the most complex model. “It will be basically identical to just using the likelihood without any penalization.” - L10-resample-1
  • LOOCV = $k$-fold CV with $k = n$; it and the validation set approach are the two ends of the same spectrum.
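A worked special case of the variance formula above (an idealization, not from the slides: fold errors identically distributed with common variance $\sigma^2$ and common pairwise correlation $\rho$):

$$\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n}\mathrm{Err}_i\right) = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2 \;\to\; \sigma^2 \quad \text{as } \rho \to 1.$$

When the fold errors are nearly perfectly correlated, the average is barely better than a single fold error, no matter how large $n$ is; k-fold helps by pushing $\rho$ down.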

Pros and cons (prof’s slide form)

Pros:

  • No randomness in splits.
  • Low bias: trains on $n-1$ observations each time, almost the full data.

Cons:

  • Expensive: $n$ refits (unless you’re in OLS-land with the shortcut).
  • High variance: correlated folds inflate the variance of the average.
  • Extra-bad outlier and dependence sensitivity.

Exam signals

“Two training sets only differ by one observation, thus estimates from each fold highly correlated, which can lead to high variance in their average.” - slide deck (paraphrased in L10-resample-1)

“Leave one out will have a better bias. But the K-fold will likely have a much better variance. And typically in this setting, you’re winning by having less variance.” - L11-resample-2

CE1 problem 4b runs the comparison as true/false:

  • “5-fold CV will generally lead to an estimate of the prediction error with less bias, but more variance, than LOOCV” → false. LOOCV has lower bias (trains on more data); k-fold has lower variance (less correlated folds). The statement reverses both.
  • “LOOCV is a form of bootstrapping” → false. LOOCV partitions; bootstrap resamples with replacement (see the sketch below).
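To make that last distinction concrete, a throwaway sketch in code (indices only, numpy assumed; nothing here is from CE1 itself):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
idx = np.arange(n)

# Bootstrap: draw n indices WITH replacement -> duplicates appear, some points are omitted.
boot = rng.choice(idx, size=n, replace=True)
print(np.sort(boot))

# LOOCV: n disjoint test folds that together partition the data, no resampling at all.
folds = [(np.delete(idx, i), np.array([i])) for i in idx]
print(folds[0])   # train on {1,...,5}, test on {0}
```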

Pitfalls

  • Mixing up the bias and variance directions. LOOCV: low bias, high variance. The wrong-direction trap is on CE1 problem 4b, easy to flip.
  • Using LOOCV under temporal/spatial correlation. Worse than k-fold here, see independence trap above.
  • Forgetting the OLS-only restriction on the shortcut. Eq. (5.2) in ISLP works only for linear models fit by least squares (or with the appropriate generalization for other linear-projection methods). For trees, KNN, GAMs, neural nets, etc., you have to actually do the refits.

Scope vs ISLP

  • In scope: the algorithm, the bias/variance comparison with k-fold (CE1 4b), the hat-matrix shortcut (CE1 derives it), the independence trap.
  • Look up in ISLP: §5.1.2 (pp. 200–202), equation 5.2 is the shortcut. The footnote on the multiple-regression generalization of leverage is also there.

Exercise instances

  • CE1 problem 4b: true/false on bias and variance of LOOCV vs 5-fold CV (and “LOOCV is a form of bootstrapping”). The flipped-direction trap.
  • The OLS shortcut derivation is referenced in the slides as “see Compulsory exercise 1” (a historical pointer), but the shortcut formula itself is exam-worthy.

How it might appear on the exam

  • True/false on bias / variance / compute properties, direct CE1 problem 4b style. Easy to flip directions; write reasoning even for T/F.
  • “Compute LOOCV by hand” for a small dataset (small $n$, simple linear regression), feasible because the shortcut works.
  • “Why is k-fold preferred to LOOCV” → variance argument: nearly-identical training sets → highly-correlated fold errors → large variance of the average. K-fold de-correlates by making the training sets actually different.
  • Conceptual on the OLS shortcut: what does $h_i$ measure, why does $1 - h_i$ appear? Connects to leverage from module 3 (design-matrix-and-hat-matrix).
  • Independence-trap question: given temporally correlated data, would you prefer LOOCV or k-fold? Why?