Out-of-bag (OOB) error

The “free test set” that comes with every bootstrap sample. Roughly 37% of observations are not drawn into a given bootstrap sample; those are out-of-bag for that tree and serve as that tree’s validation set. Aggregate across trees → an honest test-error estimate with no separate test set required.

Definition (prof’s framing)

For each bootstrap sample of size $n$ (drawn with replacement from the $n$ original observations), some originals appear multiple times and others don’t appear at all. The roughly 1/3 left out are the out-of-bag (OOB) observations for that bootstrap sample. Predict on them with the tree fit on that bootstrap → per-tree validation error. Aggregate across all $B$ trees → OOB error estimate.

“It’s approximately B divided by 3.” - L19-boosting-1 (the prof’s quick verbal: each observation is out-of-bag for roughly $B/3$ of the trees, i.e. ~1/3 of observations are out-of-bag for any given tree)

“About 1/3 of the observations are not used to fit a particular bagged tree, and serve as a built-in test set for that tree. No dedicated test set required.” - paraphrase of the slide / L18-trees-2

Notation & setup

  • $n$ original observations; $B$ bootstrap samples (one tree per sample).
  • For bootstrap sample $b$, define $S_b$, the original indices drawn (with multiplicity) into bootstrap $b$; the tree fit on it is $\hat{f}^{*b}$.
  • $O_b = \{1, \dots, n\} \setminus S_b$: the indices not drawn.
  • For each $i$, collect the trees that didn’t see it (those with $i \in O_b$); their average prediction is the OOB prediction $\hat{f}_{\mathrm{OOB}}(x_i)$.
  • OOB error = average loss between $\hat{f}_{\mathrm{OOB}}(x_i)$ and $y_i$ across all $i$. (A minimal sketch of these mechanics follows below.)
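
A minimal sketch of those mechanics in Python — a hand-rolled bootstrap loop, with scikit-learn trees and synthetic data as stand-ins, not the course code:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, B = 200, 100                                   # observations, bootstrap samples/trees
X = rng.uniform(-3, 3, size=(n, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=n)

oob_sum = np.zeros(n)                             # running sum of OOB predictions per obs
oob_count = np.zeros(n)                           # number of trees for which obs i was OOB

for b in range(B):
    in_bag = rng.integers(0, n, size=n)           # S_b: indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), in_bag)      # O_b: indices not drawn (~37% of them)
    tree = DecisionTreeRegressor().fit(X[in_bag], y[in_bag])
    oob_sum[oob] += tree.predict(X[oob])          # per-tree predictions on its OOB set
    oob_count[oob] += 1

seen = oob_count > 0                              # with B = 100, essentially every obs
f_oob = oob_sum[seen] / oob_count[seen]           # average over trees that didn't see i
oob_mse = np.mean((y[seen] - f_oob) ** 2)         # the OOB error estimate
print(f"OOB MSE: {oob_mse:.3f}")
```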

Formula(s) to know cold

Probability obs $i$ is OOB for a single bootstrap sample (Exercise 5.4 → Exercise 8.1d):

$$P(i \notin S_b) = \left(1 - \frac{1}{n}\right)^{n} \xrightarrow{\,n \to \infty\,} e^{-1} \approx 0.368$$

So ~36.8% of observations are OOB for any given tree, and ~63.2% are in-bag ($1 - e^{-1} \approx 0.632$).
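
The one-line derivation behind it, written out (the standard argument Exercise 5.4 walks through; not quoted from the slides):

$$P(i \text{ not picked in one draw}) = 1 - \frac{1}{n}
\quad\Rightarrow\quad
P(i \notin S_b) = \left(1 - \frac{1}{n}\right)^{n}
= e^{\,n \ln\left(1 - \frac{1}{n}\right)} \xrightarrow{\,n \to \infty\,} e^{-1} \approx 0.368,$$

since the $n$ draws are made independently and with replacement.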

OOB prediction for observation $i$:

$$\hat{f}_{\mathrm{OOB}}(x_i) = \frac{1}{\lvert\{b : i \in O_b\}\rvert} \sum_{b :\, i \in O_b} \hat{f}^{*b}(x_i)$$

(Average for regression, majority vote for classification.)

OOB error:

$$\text{OOB error} = \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, \hat{f}_{\mathrm{OOB}}(x_i)\right), \quad \text{with } L \text{ squared error for regression or 0/1 loss for classification.}$$

Insights & mental models

  • The $\left(1 - \frac{1}{n}\right)^{n} \to e^{-1}$ result is exam-flagged. Hand-calculation pattern: “Show that $\left(1 - \frac{1}{n}\right)^{n} \approx e^{-1}$ for large $n$, and conclude that ~37% of observations are OOB.” Direct port of Exercise 5.4 to module 8.
  • OOB error ≈ leave-one-out CV error in the limit of large $B$: for each obs $i$, the OOB prediction averages over the trees that didn’t see $i$, which is the same in spirit as “leave $i$ out, fit on the rest.” With many bootstrap samples, this converges to a CV-like estimate.
  • No separate test set needed. The big practical win of bagging / RF: you don’t have to set aside a held-out test set, and you don’t have to do a separate k-fold CV pass. OOB gives you the assessment estimate as a byproduct of training.
  • Cheap, but not perfectly honest. “You have this strange dependency on the test error from your real error on your test error on how you sampled.” - L18-trees-2. The OOB error is computed from the same bootstrap samples used to train, so there’s some structural dependency. In practice it’s a very good test-error proxy, especially with large $B$.
  • Used for variable importance too. The randomization-based variable-importance flavor (permute a predictor on the OOB samples, measure the performance drop) is exactly OOB error with one column scrambled. See variable-importance.
  • The two-thirds / one-third split is approximate. The in-bag fraction $1 - \left(1 - \frac{1}{n}\right)^{n}$ converges to $1 - e^{-1} \approx 0.632$ from above. For $n = 10$: $\approx 0.651$. For $n = 100$: $\approx 0.634$. For $n = 1000$: $\approx 0.632$. Convergence is fast and from above (quick numerical check after this list).
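
A quick numerical check of that convergence (illustrative only):

```python
for n in [10, 100, 1000, 10_000]:
    oob = (1 - 1 / n) ** n                         # P(obs i is out-of-bag for one tree)
    print(f"n={n:>6}: in-bag = {1 - oob:.4f}, OOB = {oob:.4f}")
# In-bag fraction falls toward 1 - 1/e ≈ 0.6321; OOB fraction rises toward 1/e ≈ 0.3679.
```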

How it’s used in practice

  1. Replaces a separate test set / k-fold CV for bagged ensembles and random forests. Most random-forest implementations report OOB error automatically.
  2. Choosing $B$: plot OOB error vs $B$ and pick where the curve flattens. (Even though $B$ isn’t a “real” tuning parameter, this confirms you’ve used “enough” trees; see the sketch after this list.)
  3. Variable importance: the randomization-based version uses OOB samples to measure the drop in performance when each predictor is permuted.
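
A sketch of uses 1–2 in Python with scikit-learn (the course material uses R’s randomForest(); this is an assumed equivalent on synthetic data): oob_score=True makes the forest report an OOB estimate as a training byproduct, and warm_start lets you grow the same forest and trace OOB error against $B$. R’s randomForest() prints the analogous OOB estimate in its default output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# Use 1: OOB estimate as a byproduct of training -- no held-out test set needed.
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", round(rf.oob_score_, 3))                  # reported automatically
print("OOB MSE:", round(np.mean((y - rf.oob_prediction_) ** 2), 1))

# Use 2: OOB error vs B -- grow the same forest tree by tree, watch the curve flatten.
rf = RandomForestRegressor(oob_score=True, warm_start=True, random_state=0)
for B in range(25, 501, 25):
    rf.set_params(n_estimators=B)
    rf.fit(X, y)                                            # adds trees, keeps earlier ones
    oob_mse = np.mean((y - rf.oob_prediction_) ** 2)
    print(f"B={B:>3}  OOB MSE={oob_mse:.1f}")
```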

Exam signals

“It’s approximately B divided by 3.” - L19-boosting-1

“Each bootstrap sample uses ~2/3 of the data; the remaining ~1/3 (‘out of bag’) gives an honest validation set per tree.” - L19-boosting-1 paraphrase

The hint in Exercise 8.1d is explicit: “The result from RecEx5-Problem 4c can be used.” That is, the derivation from module 5 is reused. The cross-reference is itself a signal: the prof expects Anders to chain the two results.

Pitfalls

  • Confusing 0.632 (in-bag) with 0.368 (OOB). The prof’s “B divided by 3” line refers to the OOB fraction (1/3, give or take). The in-bag fraction is 2/3.
  • Forgetting it’s an aggregate over trees that didn’t see obs $i$. Each individual tree only votes on observations it didn’t train on; you collect those votes per observation and aggregate.
  • Treating OOB as identical to k-fold CV. They’re closely related, OOB is roughly equivalent to LOOCV in the limit, but technically different (each obs is OOB for a random subset of trees, not held out exactly once).
  • Using OOB with small $B$. With only a handful of trees, some observations may have very few (or no) OOB predictions to average; the per-observation prediction is noisy. With $B$ in the hundreds, this is rarely an issue.
  • Forgetting OOB only works for bagging-family methods. Boosting fits trees sequentially on a single training set (re-weighted in AdaBoost or on residuals in gradient boosting); there’s no per-tree OOB concept.

Scope vs ISLP

  • In scope: the $\left(1 - \frac{1}{n}\right)^{n} \to e^{-1}$ probability, the role of OOB as a free test-set estimate, the connection to variable importance (randomization flavor).
  • Look up in ISLP: §8.2.1 (p. 345), “Out-of-Bag Error Estimation” subsection. Brief; the prof’s treatment is similar in depth.
  • Skip in ISLP (book-only, prof excluded): formal proof that OOB ~ LOOCV, deeper convergence analysis.

Exercise instances

  • Exercise 8.1d: explain what an OOB sample is; compute what fraction of observations are OOB. The hint explicitly says to reuse the bootstrap derivation from Exercise 5.4 (the $\left(1 - \frac{1}{n}\right)^{n} \to e^{-1}$ result). This is the textbook hand-calculation for OOB.

(Other module-8 problems use OOB implicitly via randomForest() output, but Exercise 8.1d is the only one that asks for the conceptual derivation.)

How it might appear on the exam

  • Hand calculation: “What fraction of observations are out-of-bag for a given tree in a bagging procedure with $n$ large?” → derive $\left(1 - \frac{1}{n}\right)^{n} \to e^{-1} \approx 0.368$, conclude ~37% OOB. Direct port of Exercise 8.1d.
  • Conceptual / T/F:
    • “OOB error replaces the need for a separate test set in bagging / RF” → true (with mild caveats).
    • “OOB error is computed for each tree using observations not in its bootstrap sample” → true.
    • “Boosting models report an OOB error” → false; OOB is bagging-family only.
    • “OOB error is exactly equivalent to LOOCV” → false (closely related, not identical).
  • Use-case justification: “Why does bagging / RF not need a separate test set?” → because each observation is OOB for ~1/3 of the trees, providing per-observation held-out predictions.
  • Connect to variable importance: the randomization-based importance permutes a predictor in the OOB samples and measures the drop in performance; link the two concepts (minimal sketch below).
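
A minimal hand-rolled sketch of that randomization importance (synthetic data, illustrative only, not the randomForest() internals): permute one predictor at a time on each tree’s OOB rows and record the increase in that tree’s OOB error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p, B = 300, 5, 200
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)    # only features 0 and 1 matter

importance = np.zeros(p)
for b in range(B):
    in_bag = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), in_bag)
    tree = DecisionTreeRegressor().fit(X[in_bag], y[in_bag])
    base_mse = np.mean((y[oob] - tree.predict(X[oob])) ** 2)  # tree's own OOB error
    for j in range(p):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])          # scramble predictor j on OOB rows
        perm_mse = np.mean((y[oob] - tree.predict(X_perm)) ** 2)
        importance[j] += perm_mse - base_mse                  # drop in performance

importance /= B
print(np.round(importance, 2))    # features 0 and 1 should dominate
```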

Related concepts

  • bootstrap: supplies the $\left(1 - \frac{1}{n}\right)^{n} \to e^{-1}$ result
  • bagging: the procedure OOB is built on
  • random-forest: the natural home of OOB error in practice
  • cross-validation: the alternative test-error estimate; OOB is the bagging-family substitute