Module 08: Tree-Based Methods — Book delta
ISLP ch.8 (the trees / bagging / RF portion of wiki/book/08-trees.md) is the deep-treatment reference for module 8. Anders can flip to it for: the regression-tree split criterion (8.1)–(8.3), Algorithm 8.1 (build-out-then-prune-via-CV), the cost-complexity penalty (8.4), the three classification impurities (8.5)–(8.7), the bagged-predictor formula $\hat f_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f^{*b}(x)$, the verbal OOB description (“on average each bagged tree uses around two-thirds of the observations”), the impurity-based variable-importance plot, and the random-forest decorrelation trick with $m \approx \sqrt{p}$.
This file captures the concrete, lookup-able artifacts the prof taught that are not cleanly stated as a formula or derivation in ISLP §8.1–§8.2.2. The biggest items are the correlated-trees variance derivation (the whole point of why RF beats bagging — ISLP only states it verbally), the $(1 - 1/n)^n \to e^{-1}$ derivation for the OOB fraction, the explicit definition of permutation-based variable importance (ISLP body text covers only impurity-based), the $m \approx p/3$ default for regression random forests, the scaled-cross-entropy (“deviance”) formula, the weakest-link pruning mechanism, and the impurity-sensitivity worked example that motivates Gini/entropy over misclassification.
1. The correlated-trees variance derivation
The single load-bearing math result of the bagging → random-forest pivot. ISLP §8.2.2 explains the idea verbally (“averaging many highly correlated quantities does not lead to as large of a reduction in variance as averaging many uncorrelated quantities”) but does not write down the formula. The prof did it on the board in L19, and the slide deck has the derivation in full (“Recall: variance for independent / correlated datasets”).
Setup. Let $Z_1, \dots, Z_B$ be random variables with common variance $\sigma^2$ and pairwise correlation $\rho$, so $\operatorname{Var}(Z_b) = \sigma^2$ and $\operatorname{Cov}(Z_b, Z_{b'}) = \rho\sigma^2$ for $b \neq b'$.
Derivation. Expand the variance of the mean into diagonal + off-diagonal covariance terms:

$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} Z_b\right) = \frac{1}{B^2}\left(\sum_{b=1}^{B}\operatorname{Var}(Z_b) + \sum_{b \neq b'}\operatorname{Cov}(Z_b, Z_{b'})\right) = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right).$$

The off-diagonal sum has $B(B-1)$ ordered pairs, each contributing $\rho\sigma^2$, hence the $B(B-1)\rho\sigma^2$ factor before division by $B^2$.
The boxed result:

$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} Z_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.$$
Sanity checks (the slide deck flagged these explicitly, “Check: $\rho = 0$ and $\rho = 1$?”):
- $\rho = 0$ (IID): collapses to $\sigma^2/B$, the textbook result. Variance shrinks linearly in $1/B$.
- $\rho = 1$ (perfect correlation): collapses to $\sigma^2$. No variance reduction at all from averaging — every tree says the same thing.
- General $0 < \rho < 1$: a fixed floor $\rho\sigma^2$ that does not depend on $B$, plus a decaying term $\frac{1-\rho}{B}\sigma^2$.
The point of the derivation (the prof verbatim, L19):
“This thing does not shrink with respect to $B$. No matter how many times we make different trees, if they’re all correlated the same way, then there’s a portion that just never improves.”
So bagging trees has a hard floor at $\rho\sigma^2$, the residual variance of any single tree times the pairwise correlation. The only way past the floor is to lower $\rho$. Random forests do exactly that by restricting each split to a random subset of predictors, which prevents the strong predictors from being chosen at the same nodes in every tree, decorrelating the ensemble.
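A quick Monte Carlo check of the boxed formula (not from the lecture; a minimal numpy sketch in which the equicorrelated $Z_b$ are generated from a shared common factor):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, B, reps = 4.0, 0.6, 50, 200_000

# Equicorrelated Z_1..Z_B via a shared factor: Z_b = sqrt(rho)*W + sqrt(1-rho)*eps_b,
# which gives Var(Z_b) = sigma2 and Corr(Z_b, Z_b') = rho for b != b'.
W = rng.normal(0.0, np.sqrt(sigma2), size=(reps, 1))
eps = rng.normal(0.0, np.sqrt(sigma2), size=(reps, B))
Z = np.sqrt(rho) * W + np.sqrt(1 - rho) * eps

empirical = Z.mean(axis=1).var()                      # Monte Carlo Var of the B-tree average
theoretical = rho * sigma2 + (1 - rho) / B * sigma2   # rho*sigma^2 + (1-rho)/B * sigma^2
print(empirical, theoretical)                         # both ~2.43; the floor alone is rho*sigma2 = 2.4
```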
Why this is delta-worthy. ISLP §8.2.2 gives the qualitative claim but not the formula. The 2024/2025-style exam question “Why are random forests better than bagging?” expects the explicit $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ decomposition and the verbal “the first term is a floor that does not shrink with $B$” interpretation — the prof flagged this as one of the high-likelihood mathy theory questions.
2. The OOB fraction: $(1 - 1/n)^n \to e^{-1} \approx 0.368$
L18, L19, out-of-bag-error, bootstrap
ISLP §8.2.1 reports the result verbally (“on average, each bagged tree makes use of around two-thirds of the observations”) and footnotes Exercise 5.2 for the derivation. The actual derivation is exam-flagged (Exercise 8.1d explicitly says “the result from RecEx5-Problem 4c can be used”) so it lives here.
Setup. Bootstrap sample of size $n$ drawn with replacement from $n$ original observations. Probability observation $i$ is not drawn on any single draw is $1 - \frac{1}{n}$. Probability observation $i$ is not drawn in any of the $n$ draws:

$$P(i \notin \text{bootstrap sample}) = \left(1 - \frac{1}{n}\right)^{n}.$$

Limit. Using $\left(1 + \frac{x}{n}\right)^{n} \to e^{x}$ with $x = -1$:

$$\left(1 - \frac{1}{n}\right)^{n} \;\longrightarrow\; e^{-1} \approx 0.368.$$

Consequence. For large $n$:
- ~36.8% of observations are OOB for any given bootstrap sample / tree.
- ~63.2% of observations are in-bag for any given tree.
- For $B$ trees, each observation $i$ is OOB for roughly $e^{-1}B \approx B/3$ trees. The OOB prediction averages (regression) or majority-votes (classification) those trees’ predictions for $x_i$.
Convergence is from below and fast:
| $n$ | $(1 - 1/n)^n$ |
|---|---|
| 10 | 0.3487 |
| 100 | 0.3660 |
| 1000 | 0.3677 |
| $\infty$ (limit $e^{-1}$) | 0.3679 |
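The table values (and the prof’s $B/3$ shorthand) take a couple of lines to reproduce — a minimal check, not from the slides:

```python
import math

for n in (10, 100, 1000):
    print(n, round((1 - 1 / n) ** n, 4))   # 0.3487, 0.366, 0.3677 -- increasing toward the limit
print("limit", round(math.exp(-1), 4))     # 0.3679: ~36.8% OOB, ~63.2% in-bag, per tree
```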
The prof’s verbal short-hand from L19:
“It’s approximately B divided by 3.”
meaning, for any given observation $i$, roughly $B/3$ trees can vote on it because $i$ is OOB for that fraction of the trees.
Why this is delta-worthy. ISLP reports the conclusion but defers the derivation to an end-of-chapter exercise in ch.5. The prof drilled it as an in-class hand-calc, and Exercise 8.1d ports it directly into module 8.
3. Permutation-based (randomization) variable importance
ISLP §8.2.1 (“Variable Importance Measures”) covers only the impurity-based version: total decrease in RSS (regression) / Gini (classification) due to splits on $X_j$, averaged over all $B$ trees. The prof covered both flavors, and stated a preference for permutation importance, “because it makes more sense.” The permutation flavor is the delta.
Permutation importance, exact procedure (uses the OOB sample):
For each predictor $j$:
- Run the OOB observations through the trained ensemble. For each tree $b$, evaluate the per-tree OOB error on the OOB indices of that tree:
  $$\text{Err}_b = \frac{1}{|\text{OOB}_b|}\sum_{i \in \text{OOB}_b} L\big(y_i, \hat f_b(x_i)\big),$$
  where $L$ is squared error (regression) or 0–1 misclassification loss (classification).
- Permute the values of predictor $j$ across the OOB sample (whole column shuffled); call the permuted feature matrix $\tilde X^{(j)}$. Keep all other predictors, the labels, and the trained trees fixed.
- Re-evaluate the per-tree OOB error on the corrupted features:
  $$\widetilde{\text{Err}}_b^{(j)} = \frac{1}{|\text{OOB}_b|}\sum_{i \in \text{OOB}_b} L\big(y_i, \hat f_b(\tilde x_i^{(j)})\big).$$
- Compute the per-tree performance drop $d_b^{(j)} = \widetilde{\text{Err}}_b^{(j)} - \text{Err}_b$, average across trees, and (optionally) divide by the standard deviation of the differences:
  $$\text{VI}(j) = \frac{\frac{1}{B}\sum_{b=1}^{B} d_b^{(j)}}{\widehat{\operatorname{sd}}\big(d^{(j)}\big)}.$$
The reading. If $X_j$ carries real signal, scrambling its values destroys per-row predictive ability and the OOB error spikes ($\text{VI}(j) \gg 0$). If $X_j$ is irrelevant, scrambling does nothing ($\text{VI}(j) \approx 0$). Reported under the names MeanDecreaseAccuracy (classification) or %IncMSE (regression) in R’s randomForest output.
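A minimal Python sketch of the procedure above (illustrative synthetic data and helper names, not the prof’s code). It uses scikit-learn’s BaggingRegressor because its estimators_samples_ attribute exposes each tree’s in-bag rows, from which the per-tree OOB indices follow; as in the procedure, the whole column is shuffled before re-scoring on the OOB rows:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, p = 500, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=n)         # only the first two predictors carry signal

forest = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                          bootstrap=True, random_state=0).fit(X, y)

def per_tree_oob_mse(features):
    """Err_b for every tree, evaluated only on that tree's out-of-bag rows."""
    errs = []
    for tree, in_bag in zip(forest.estimators_, forest.estimators_samples_):
        oob = np.setdiff1d(np.arange(n), in_bag)        # this tree's OOB indices
        errs.append(np.mean((y[oob] - tree.predict(features[oob])) ** 2))
    return np.array(errs)

baseline = per_tree_oob_mse(X)
for j in range(p):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])        # scramble predictor j, leave everything else fixed
    drop = per_tree_oob_mse(X_perm) - baseline          # d_b^(j): per-tree OOB error increase
    print(f"predictor {j}: VI = {drop.mean() / drop.std():.2f}")
```

Predictors 0 and 1 should come out with large scaled importances; the rest should hover near zero.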
The prof’s verbal definition (L19):
“If $X_j$ is important, permuting the observations will decrease the performance a lot. If it doesn’t matter, then it won’t matter.”
Why prefer permutation over impurity-based? (Prof’s “makes more sense” substance, partly verbatim, partly the standard mechanism he gestured at.)
- Impurity-based importance is biased toward predictors with many possible split points. Continuous predictors and high-cardinality categoricals have more chances to be picked as a split and to reduce impurity, so they look artificially important. Permutation importance avoids this bias because it directly tests whether knowing the predictor’s value helps prediction in the OOB set.
- Permutation importance is independent of the split criterion used at training. Impurity-based importance changes if you switch Gini ↔ entropy at training; permutation importance reads off the same OOB sample either way.
When the two flavors disagree. Empirically: top predictors tend to agree across both flavors (slide deck and L19 showed both side-by-side on mtcars and the brain-injury data — top 2 always preserved, bottom predictor always preserved). The middle of the ranking is where the methods diverge. Prof’s framing (L19):
“They do measure slightly different things, so you don’t expect them to be the same. … They do capture different aspects of what’s being measured.”
Why this is delta-worthy. ISLP §8.2.1 body text describes only the impurity-based plot (Figure 8.9, Heart data, mean decrease in Gini). The permutation procedure, the OOB-permutation mechanism, and the prof’s preference are entirely absent from the body of ch.8 — they live in the slide deck and lectures.
4. Default $m \approx p/3$ for regression random forests
ISLP §8.2.2 names the classification default $m \approx \sqrt{p}$ (Heart data, “4 out of the 13 for the Heart data”) and notes “using a small value of $m$ in building a random forest will typically be helpful when we have a large number of correlated predictors,” but does not state the regression default explicitly in the main text. The prof was emphatic and the slide deck spelled it out:

$$m \approx \frac{p}{3} \quad \text{for regression random forests.}$$

(Bagging is the special case $m = p$.)
Direction-of-tuning. Smaller $m$ → more decorrelation between trees → lower $\rho$ in the variance formula of §1 → lower variance floor $\rho\sigma^2$. Recommended when predictors are highly correlated. Larger $m$ → individual trees are stronger but more correlated → variance floor stays high.
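In scikit-learn terms (a hedged sketch of the same defaults, not lecture code — note that recent scikit-learn versions default the regressor to all $p$ features, so the $p/3$ rule has to be set explicitly):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

p = 13  # number of predictors
rf_class = RandomForestClassifier(n_estimators=500, max_features="sqrt")           # m ~ sqrt(p)
rf_regr = RandomForestRegressor(n_estimators=500, max_features=max(1, p // 3))     # m ~ p/3
bagging = RandomForestRegressor(n_estimators=500, max_features=None)               # m = p: bagging
```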
Why this is delta-worthy. ISLP’s main text fixes $m \approx \sqrt{p}$ in its examples; the $p/3$ regression default is the kind of fact that’s specifically asked for in past exam keys (“justify your choice of mtry”). The 2025 exam key reproduced in L27 gives full marks for “$m \approx \sqrt{p}$ with the given $p$ → use 4 or 5”; the analogous regression answer needs $m \approx p/3$.
5. Weakest-link pruning: explicit mechanism
L17, L18, cost-complexity-pruning
ISLP §8.1.1 (after equation 8.4) says only:
“It turns out that as we increase $\alpha$ from zero in (8.4), branches get pruned from the tree in a nested and predictable fashion.”
That’s a black-box statement of the existence of a nested sequence. The prof spelled out the mechanism by which the sequence is generated (slide deck “weakest-link pruning” + L17/L18 verbal):
The procedure. Starting from the maximal tree $T_0$:
- For every internal node $t$, compute the increase in training loss that would result from collapsing $t$ (turning $t$ into a leaf, dropping the subtree rooted at $t$). For regression, that’s the increase in RSS; for classification (at the prune step), the increase in misclassification count.
- Among all internal nodes, pick the one whose collapse increases the loss the least — the “weakest link” — and collapse it. The result is the next subtree, with one fewer terminal node (or fewer, if the collapse merges multiple leaves).
- Repeat. Each step removes one weakest link and produces the next subtree in the sequence:

$$T_0 \supset T_1 \supset T_2 \supset \dots \supset \{\text{root}\}.$$
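A minimal Python sketch of the loop just described (a toy tree representation, not the tree/rpart implementation): each node stores the training loss it would incur as a leaf, and each iteration collapses the internal node whose collapse raises the total leaf loss the least.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    loss_as_leaf: float              # training loss (e.g. RSS) if this node were made a terminal node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None

def leaf_loss(t):
    """Total training loss over the current leaves of the subtree rooted at t."""
    return t.loss_as_leaf if t.is_leaf() else leaf_loss(t.left) + leaf_loss(t.right)

def n_leaves(t):
    return 1 if t.is_leaf() else n_leaves(t.left) + n_leaves(t.right)

def internal_nodes(t):
    return [] if t.is_leaf() else [t] + internal_nodes(t.left) + internal_nodes(t.right)

def weakest_link_prune(root):
    """Collapse one weakest link per step, printing (|T|, training loss) down the nested sequence."""
    print(n_leaves(root), leaf_loss(root))
    while not root.is_leaf():
        # weakest link = internal node whose collapse increases the training loss the least
        t = min(internal_nodes(root), key=lambda node: node.loss_as_leaf - leaf_loss(node))
        t.left = t.right = None
        print(n_leaves(root), leaf_loss(root))

# Toy maximal tree with illustrative RSS values at each node:
weakest_link_prune(Node(100.0,
                        Node(40.0, Node(10.0), Node(12.0)),
                        Node(35.0, Node(20.0), Node(5.0))))
# prints 4 47.0, 3 57.0, 2 75.0, 1 100.0: the training loss rises as slowly as possible per collapse
```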
Theorem (stated, not proved in lecture). For each $\alpha \geq 0$, the subtree that minimizes $C_\alpha(T) = \sum_m Q_m(T) + \alpha|T|$ over all subtrees $T \subseteq T_0$ is one of the trees in this nested sequence. Equivalently, every member of the weakest-link sequence is optimal for some interval of complexity penalties $\alpha$. The map $\alpha \mapsto T_\alpha$ is therefore a step function taking values in the finite set $\{T_0, T_1, T_2, \dots\}$.
The proof is explicitly out of scope (the prof linked Bo Lindqvist’s MA8701 note on the slide as the “for-those-who-want-it” reference and skipped it in lecture).
Practical consequence: plot tree size, not $\alpha$. Because the map $\alpha \mapsto T_\alpha$ is a step function, the relationship between $\alpha$ and the tree size $|T_\alpha|$ is one-to-one on the breakpoints but uninformative between them. CV plots therefore display held-out error against tree size $|T|$, not $\alpha$ directly (slide deck verbatim: “we can plot the CV error as a function of tree size (instead of $\alpha$ — why?)”). The CV minimum is read off the size axis directly.
Why this is delta-worthy. ISLP states the existence of the nested sequence; the prof gave the algorithm (collapse the weakest link, repeat) that generates it. Worth an exam-pseudocode answer.
6. The deviance / scaled-cross-entropy formula
ISLP §8.1.2 gives the cross-entropy as

$$D = -\sum_{k=1}^{K} \hat p_{mk}\log\hat p_{mk},$$

and the lab (§8.3.1) gives the deviance in passing as $-2\sum_m\sum_k n_{mk}\log\hat p_{mk}$. But the relationship between the two — the prof’s “the R tree package’s split='deviance' is cross-entropy, up to a constant factor and the population-vs-proportion scaling” — is a flagged exam trap not stated cleanly in ISLP’s main text.
The slide-deck definition. For region $R_m$ with $n_m$ observations and $n_{mk}$ observations of class $k$ (so $\hat p_{mk} = n_{mk}/n_m$), the deviance is

$$D_m = -2\sum_{k=1}^{K} n_{mk}\log\hat p_{mk},$$

which is $-2\log(\text{likelihood})$ for the multinomial likelihood of the node, treating $\hat p_{mk} = n_{mk}/n_m$ as the MLE.
Connection to cross-entropy. Substitute $n_{mk} = n_m\hat p_{mk}$:

$$D_m = -2\,n_m\sum_{k=1}^{K}\hat p_{mk}\log\hat p_{mk} = 2\,n_m\,D(R_m),$$

where $D(R_m) = -\sum_k \hat p_{mk}\log\hat p_{mk}$ is the cross-entropy of region $R_m$. So deviance is a scaled cross-entropy: scaled by the per-region $2 n_m$, which weights larger regions more heavily (the natural weighting for the global loss).
Implication for split="deviance". Calling tree(..., split="deviance") minimizes the total node-summed deviance across the resulting children, which (after the scaling collapses across the global sum) gives the same split decisions as splitting on weighted cross-entropy. Slide-deck verbatim: “thus split='deviance' implies that we split according to the entropy criterion.”
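A quick numeric check of the $D_m = 2\,n_m \times$ cross-entropy identity on one node (illustrative class counts, not lecture data):

```python
import numpy as np

n_mk = np.array([30, 10, 5])                     # class counts in one region (illustrative)
n_m, p_mk = n_mk.sum(), n_mk / n_mk.sum()
deviance = -2 * np.sum(n_mk * np.log(p_mk))      # slide-deck deviance of the node
cross_entropy = -np.sum(p_mk * np.log(p_mk))     # ISLP's cross-entropy (8.7)
print(deviance, 2 * n_m * cross_entropy)         # identical
```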
Why this is delta-worthy. ISLP gives the cross-entropy in the body and the deviance in the lab without a derivation linking them; the prof drilled the relationship as one of the named “this is what deviance is” facts for output interpretation.
7. The impurity-sensitivity example: why misclassification is a poor split criterion
ISLP §8.1.2 states the punchline:
“However, it turns out that classification error is not sufficiently sensitive for tree-growing, and in practice two other measures are preferable.”
and demonstrates the related “both children predict the same class” phenomenon on the Heart data (the RestECG<1 split). It does not give the canonical two-class numerical example that the prof did. The slide deck spells it out and the prof walked through it in L18.
The two-class setup. Parent node with 800 observations: 400 of class A, 400 of class B — written $(400, 400)$. Consider two candidate splits:
- Split 1. Left child: $(100, 300)$, right child: $(300, 100)$.
- Split 2. Left child: $(200, 400)$, right child: $(200, 0)$.
Misclassification error of each split (using $E = 1 - \max_k \hat p_k$, weighted by child node sizes):
- Split 1: left predicts B with $\hat p_B = 0.75$, $E = 0.25$; right predicts A with $\hat p_A = 0.75$, $E = 0.25$. Weighted: $\frac{400}{800}(0.25) + \frac{400}{800}(0.25) = 0.25$.
- Split 2: left predicts B with $\hat p_B = 2/3$, $E = 1/3$; right predicts A with $\hat p_A = 1$, $E = 0$. Weighted: $\frac{600}{800}\cdot\frac{1}{3} + \frac{200}{800}\cdot 0 = 0.25$.
Both splits have the same weighted misclassification rate of 25%. Misclassification cannot distinguish them.
Gini index of each split. Using $G = \sum_k \hat p_k(1 - \hat p_k) = 2\hat p(1 - \hat p)$ for two classes and weighting by child size:
- Split 1: each child has $G = 2(0.25)(0.75) = 0.375$. Weighted Gini $= 0.375$.
- Split 2: left $G = 2\cdot\frac{1}{3}\cdot\frac{2}{3} \approx 0.444$, right $G = 0$ (pure node). Weighted $= \frac{600}{800}(0.444) + \frac{200}{800}(0) \approx 0.333$.
Gini prefers Split 2 ($0.333 < 0.375$), because Split 2 produces a pure node (right child has zero class B). Misclassification can’t see this.
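The arithmetic above in a few lines (a minimal check, not lecture code):

```python
def misclass(a, b):
    return min(a, b) / (a + b)          # 1 - max(p_hat) for a two-class node with counts (a, b)

def gini(a, b):
    p = a / (a + b)
    return 2 * p * (1 - p)

def weighted(criterion, *children):
    n = sum(a + b for a, b in children)
    return sum((a + b) / n * criterion(a, b) for a, b in children)

split1 = [(100, 300), (300, 100)]
split2 = [(200, 400), (200, 0)]
print(weighted(misclass, *split1), weighted(misclass, *split2))   # 0.25 vs 0.25 -- a tie
print(weighted(gini, *split1), weighted(gini, *split2))           # 0.375 vs 0.333 -- Gini prefers split 2
```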
Cross-entropy is similar in shape. Both Gini and entropy are strictly concave in $\hat p$, both have global maximum at $\hat p = 0.5$ and zeroes at $\hat p \in \{0, 1\}$, so both penalize “in-the-middle” probabilities more than misclassification (which has a kink at $\hat p = 0.5$ and is otherwise piecewise linear).
The slide-deck plot ($p$ on the x-axis, three curves: misclassification $= 1 - \max(p, 1-p)$, Gini $= 2p(1-p)$, scaled entropy):
- Misclassification is a tent function, peak at $p = 0.5$, slope $\pm 1$ on either side.
- Gini is a parabola, peak at $p = 0.5$.
- Entropy (scaled to match the peak) is taller in the middle, drops more sharply near the boundaries.
The takeaway: Gini and entropy reward purity gains even when the prediction at each child doesn’t change, because the probabilities at the children get more extreme. Misclassification only sees whether the predicted class changed.
Why this is delta-worthy. The (400, 400) → (100, 300)/(300, 100) vs (200, 400)/(200, 0) example is the cleanest numerical demonstration of the impurity-sensitivity argument, and the prof flagged it as the load-bearing reason to use Gini or entropy for splitting. ISLP gives the punchline without this calculation.
8. The differentiability footnote on impurity criteria
The slide deck flags a second reason to prefer Gini / cross-entropy over misclassification:
“The Gini index and Entropy are differentiable (preferred for numerical optimization!)”
The math. Treating the class proportion $p$ as continuous (two-class case):
- Gini: $G(p) = 2p(1-p)$, so $G'(p) = 2 - 4p$. Continuous everywhere.
- Entropy: $D(p) = -p\log p - (1-p)\log(1-p)$, so $D'(p) = \log\frac{1-p}{p}$. Continuous on $(0, 1)$.
- Misclassification: $E(p) = 1 - \max(p, 1-p)$ has a kink at $p = 0.5$, i.e., where the argmax flips. Not differentiable there.
The prof flagged this as the weaker of the two reasons (L18):
“The reality is now actually [misclassification] would be also pretty easy to make differentiable. Like … we could just do a soft max.”
So the load-bearing reason is the impurity-sensitivity argument of §7. The differentiability point is a secondary, “easier-for-gradient-style-optimizers” argument that the prof noted is rendered moot by modern soft-relaxation tricks.
Why this is delta-worthy. ISLP doesn’t discuss differentiability of the three criteria at all; the slide deck does, and the prof has an opinion on which argument actually matters. Useful for a T/F exam question: “Gini is differentiable everywhere on $[0, 1]$” → True. “Misclassification has a kink at $p = 0.5$” → True (for $K = 2$).
9. Categorical predictors: the partition count and the binary-outcome ordering trick
ISLP §8.1.2 mentions that splits on qualitative variables “amount to assigning some of the qualitative values to one branch and the remaining to the other,” and the Thal:a / ChestPain:bc notations in Figure 8.6 show example splits. ISLP does not state the combinatorial count or the binary-outcome reduction trick. The slide deck does, and the prof flagged them.
Partition count. For a predictor with $q$ unordered levels, the number of nontrivial binary partitions is

$$2^{q-1} - 1.$$

(Each partition assigns each of the $q$ levels to one of two groups, $2^q$ total. Subtract 2 for the empty / full assignments and divide by 2 for the left-right symmetry: $(2^q - 2)/2 = 2^{q-1} - 1$.)
Implication. Categoricals with many levels are dangerous in two senses:
- Variance. With many possible splits, the tree is sensitive to small data perturbations — high-cardinality categoricals are a variance source. The slide deck verbatim: “Try to avoid predictors with very many levels!”
- Spurious importance. Impurity-based variable importance (see §3 above) is biased toward predictors with many split candidates, exactly the high-cardinality categoricals. Permutation importance is the fix.
The binary-outcome ordering trick. For a classification problem, you can avoid the exhaustive search:
- For each level $\ell$ of the categorical, compute the proportion of class 1 observations at that level: $\hat p_\ell = \widehat{\Pr}(Y = 1 \mid X = \ell)$, estimated from the observations at the node.
- Order the levels by $\hat p_\ell$.
- Treat the now-ordered levels as an ordinal predictor and search only the $q - 1$ ordered cuts.
Theorem (slide-deck claim, no proof given). This procedure recovers the same split as the exhaustive Gini-minimizing search.
For multi-class outcomes ($K > 2$) and continuous outcomes, the trick does not apply and the exhaustive search remains $2^{q-1} - 1$ candidates.
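Both facts are easy to check by brute force on synthetic binary-outcome data (illustrative helper names, not from the slides): enumerate the $2^{q-1} - 1$ partitions exhaustively, then compare against the ordered-cut search.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
q = 6
x = rng.integers(0, q, size=400)                       # categorical predictor with q unordered levels
y = rng.binomial(1, (x % 3 + 1) / 4)                   # binary outcome whose rate depends on the level

def weighted_gini(left_levels):
    mask = np.isin(x, list(left_levels))
    return sum(m.mean() * 2 * y[m].mean() * (1 - y[m].mean()) for m in (mask, ~mask))

# Exhaustive search: canonicalize by forcing level 0 into the left group -> 2^(q-1) - 1 candidates.
others = range(1, q)
candidates = [{0, *s} for r in range(q - 1) for s in itertools.combinations(others, r)]
print(len(candidates), 2 ** (q - 1) - 1)               # 31, 31 for q = 6

best_exhaustive = min(weighted_gini(c) for c in candidates)

# Ordering trick: sort the levels by their class-1 proportion, then check only the q - 1 ordered cuts.
order = np.argsort([y[x == level].mean() for level in range(q)])
best_ordered = min(weighted_gini(order[:k]) for k in range(1, q))
print(best_exhaustive, best_ordered)                   # same minimum weighted Gini
```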
Why this is delta-worthy. The $2^{q-1} - 1$ count is the kind of small combinatorial fact that’s exam-bait (T/F: “A 6-level categorical predictor has 31 possible binary splits” → True, since $2^{6-1} - 1 = 31$). ISLP doesn’t state it.
10. The “useless-looking split” mechanism
ISLP §8.1.2 describes the “both children predict the same class” split on the RestECG<1 example and explains it as a node-purity improvement. ISLP §8.1.1 (Tree Pruning paragraph) gives the “too short-sighted” argument for why you don’t stop early: “a seemingly worthless split early on in the tree might be followed by a very good split.” The prof linked these two into one mechanism (L18): the same split that looks useless by misclassification can be the gateway to a good later split, so any early-stopping rule based on misclassification will systematically fail.
Numerical example (the prof’s L18 walk-through). Parent node: 70 class A, 30 class B.
- Parent Gini: $2(0.7)(0.3) = 0.42$. Parent misclassification: $0.30$. Predicts A.
- Split candidate: left child (40A, 5B), right child (30A, 25B).
- Left child: $\hat p_A = 40/45 \approx 0.89$. Gini $= 2\cdot\frac{40}{45}\cdot\frac{5}{45} \approx 0.198$. Misclass: $5/45 \approx 0.11$. Predicts A.
- Right child: $\hat p_A = 30/55 \approx 0.55$. Gini $= 2\cdot\frac{30}{55}\cdot\frac{25}{55} \approx 0.496$. Misclass: $25/55 \approx 0.45$. Predicts A.
Weighted child misclassification $= \frac{45}{100}\cdot\frac{5}{45} + \frac{55}{100}\cdot\frac{25}{55} = \frac{30}{100} = 0.30$. Same as parent. No improvement by misclassification.
Weighted child Gini $= \frac{45}{100}(0.198) + \frac{55}{100}(0.496) \approx 0.36 < 0.42$. Improvement under Gini.
Both children still predict A — the leaf labels are unchanged — but the probability of being A is much more confident in the left child (89%) than the right (55%). The Gini-improvement comes entirely from the increased certainty on the left, even though the prediction label is the same.
Why this is delta-worthy / Where this lands. Combines the “useless-looking split” example into a single hand-calculation that demonstrates both:
- Why misclassification is a bad split criterion (it’s blind to this purity improvement).
- Why early stopping by RSS / misclassification threshold fails (the left child of this split now sits at 40A/5B — almost pure — and a single later split there can produce a 40A/0B leaf, paying off the parent split’s “uselessness”).
The prof’s verbatim L18 conclusion:
“It put kind of like the shitty part of the parent node in one side and the more confident version on the left side.”
ISLP gives the qualitative version of this on RestECG<1; the prof’s numerical-example version is the cleanest exam-ready demonstration.
11. Boston worked example: numerical hierarchy single-tree → bagging → RF
A concrete number sequence the prof anchored as the “this is what variance reduction looks like” headline (slide deck + L19 board work). ISLP shows the bagging-vs-RF curve on gene-expression data (Figure 8.10) and the Heart data (Figure 8.8) — not on Boston housing regression, where the contrast is cleaner. The Boston numbers are the prof’s canonical demonstration:
| Method | mtry ($m$) | Test MSE |
|---|---|---|
| Single regression tree (pruned to 6 leaves) | — | |
| Bagging (randomForest with mtry = 13) | 13 | |
| Random forest | | |
The numbers come from MASS::Boston, target medv (median home value), 13 predictors, and a train / test split with set.seed(1).
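A hedged scikit-learn sketch of the same single-tree → bagging → RF sequence (California housing stands in for Boston, which scikit-learn no longer ships; the exact MSEs depend on data and seed, but the qualitative ordering is the point):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
p = X.shape[1]

models = {
    "single tree (6 leaves)":  DecisionTreeRegressor(max_leaf_nodes=6, random_state=1),
    "bagging (m = p)":         RandomForestRegressor(500, max_features=None, random_state=1),
    "random forest (m = p/3)": RandomForestRegressor(500, max_features=max(1, p // 3), random_state=1),
}
for name, model in models.items():
    mse = mean_squared_error(y_te, model.fit(X_tr, y_tr).predict(X_te))
    print(f"{name:25s} test MSE = {mse:.3f}")
```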
The story (L19, prof verbatim):
“It’s improved with respect to simple bagging, it’s likely because of the more diversity in our trees. And now we are talking about a forest and no longer a tree, and we have trees, trees live in a forest, and we’re adding more diversity to our trees so that it gets better.”
Why this is delta-worthy. Specific MSE numbers for the standard sequence — useful as anchors for a method-comparison exam question of the form “ranked these three approaches by test MSE on Boston housing, explain why.” ISLP’s worked Boston example in the lab (§8.3.3) gives different numbers under a different split (28.07 / 14.63 / 20.04, with RF actually losing slightly to bagging in their seed) — the slide-deck numbers are what the prof drilled in lecture.
12. Hyperparameter taxonomy across the tree-ensemble family
A direct comparison table the prof emphasized verbally and that recurs in past-exam keys. ISLP discusses each hyperparameter in turn but does not lay them out side-by-side with the “tune vs. don’t tune” labels that the prof flagged as exam-bait.
| Method | Parameter | Role | Tune? | Default / typical |
|---|---|---|---|---|
| Single tree | $\alpha$ (cost-complexity) | Tree size penalty | Yes (K-fold CV) | Chosen by cv.tree |
| Bagging | $B$ = ntree | Number of trees | No (“use enough”) | 500–1000 |
| Random forest | $B$ = ntree | Number of trees | No | 500–1000 |
| Random forest | $m$ = mtry | Predictors per split | Yes (the real one) | $\sqrt{p}$ (classif), $p/3$ (regr) |
| Random forest | Tree depth / nodesize | Individual-tree complexity | No (grow unpruned) | Deep / unpruned |
The hard rule (prof verbatim, L19, used in 2023 / 2025 exam keys):
“It just has to be enough … you don’t typically estimate this, you don’t typically run cross-validation.”
referring specifically to $B$ in bagging / RF. Past exam keys deduct points for “I chose ntree=500 by CV” — the correct justification is “$B$ is not a tuning parameter; pick it large enough that the OOB error plateau is reached.”
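The same rule in code (a hedged sklearn sketch, not from the exam keys): the number of trees is fixed at something comfortably large and only mtry / max_features goes into the cross-validation grid.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = fetch_california_housing(return_X_y=True)

# B (n_estimators) is not tuned -- just set large enough; m (max_features) is the real tuning knob.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=1),
    param_grid={"max_features": [2, 3, 4, 6, None]},
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```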
Why this is delta-worthy. ISLP §8.2.1 says “The number of trees $B$ is not a critical parameter with bagging; using a very large value of $B$ will not lead to overfitting” — true but understated. The prof’s rule is sharper: don’t tune $B$, period; tune $m$. The taxonomy form is what the exam key expects to see.
13. Bias-reduction footnote for bagging
ISLP §8.2.1 frames bagging entirely as a variance-reduction procedure (its opening sentence: “Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.”) The prof added — and flagged — a less-commonly-stated point: bagging can also reduce bias, not just variance (L11):
“By doing that, even though you’re always just resampling the same data, you can actually remove bias from your model.”
The mechanism (slide-deck-implicit, not in ISLP): a single tree’s greedy splits make locally-optimal but globally-suboptimal choices. The greedy bias of any one tree is some function of which observations entered the bootstrap sample. Averaging across many bootstrap samples partially averages over this search-path bias, not just over noise in the data. The result is that the bagged ensemble can have lower bias than a single tree fit on the full data, in addition to lower variance.
The prof was clear this is the secondary argument; the variance-reduction story is the headline. But the bias point comes up as a T/F exam trap: “Bagging only reduces variance, not bias” → False.
Why this is delta-worthy. ISLP frames bagging exclusively as a variance device. The prof flagged the bias-reduction footnote as a real, examinable consequence.
Notation and naming differences
- Region index. ISLP uses $j$ (and sometimes $m$ in the regression-tree exposition); the prof uses both $j$ and $m$ interchangeably. The atom set follows the prof: $R_j$ in §8.1 for regression, $R_m$ in §8.1.2 for classification, with $\hat p_{mk}$ (or $\hat p_{jk}$) for the per-region class proportion. Both conventions occur in lecture and slides without ceremony.
- Cost-complexity penalty. ISLP equation 8.4 has $\sum_{m=1}^{|T|}\sum_{i:\,x_i \in R_m}(y_i - \hat y_{R_m})^2 + \alpha|T|$; the slide deck writes $C_\alpha(T) = \sum_m Q_m(T) + \alpha|T|$ with $Q_m$ as a generic cost. $Q_m$ is the regression RSS in §8.1.1 and the misclassification count in §8.1.2 (prune step), not Gini or entropy — the prof flagged this asymmetry verbally as the “$Q_m$ depends on whether you’re growing or pruning a classification tree” trap.
- $|T|$ notation. Both ISLP and the prof write the number of terminal nodes as $|T|$, but the prof flagged in L17 that this is the cardinality of the leaf set, not an absolute value or a vector norm: “that’s the zeroth norm.” The “zeroth norm” reading is the prof’s framing; ISLP just says “number of terminal nodes.”
- Boosting shrinkage. Out of scope for m08 (module 9), but the prof’s symbol for the learning rate in L19 differs from ISLP equation 8.10, which uses $\lambda$. Same quantity.
- tree’s dev field. Slide deck and lecture refer to the R output column dev as “deviance” in regression (= RSS up to a multiplicative constant) and “number of misclassifications” in cv.tree(..., FUN = prune.misclass). The single column has two meanings depending on what you fed in; ISLP doesn’t discuss tree-package R output.
- Split notation $R_1(j, s)$, $R_2(j, s)$. Both ISLP (eq. 8.2) and the prof define $R_1(j, s) = \{X \mid X_j < s\}$ and $R_2(j, s) = \{X \mid X_j \geq s\}$. Strict-vs-non-strict inequality at the boundary doesn’t matter in practice (a continuous $X_j$ has measure-zero ties), but the convention is left-strict / right-inclusive in both sources. The R tree package’s display labels splits at the midpoint between adjacent integer values for integer predictors (per ISLP footnote 1), which is cosmetic only.
- “Deviance” overloaded. Slide deck uses “deviance” to mean the cross-entropy-scaled $-2\sum_k n_{mk}\log\hat p_{mk}$ when fitting classification trees, and to mean “RSS up to a constant” for regression trees. ISLP follows the same overloading in §8.3.1 (lab). The prof flagged this in L18: “split=‘deviance’ is cross-entropy, not Gaussian deviance” — both are deviances in the GLM sense, scaled to match the respective likelihoods.
- “Out of bag” vs “OOB”. Used interchangeably across the slide deck, lectures, and ISLP.