
Module 08 — Tree-based methods

22 questions · 100 points · ~35 min

Trees, pruning, bagging, random forests, OOB, variable importance. Boosting lives in m09. Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 · 4 points · Exam 2024 P3d

At each internal node, a CART regression tree picks $(j, s)$ to minimise which quantity?

Show answer
Correct answer: A

Each candidate split partitions the parent region into $R_1$ and $R_2$; the predictions in each are the in-region means. The criterion is the sum of squared deviations, computed region-by-region against the in-region mean. The prof: "we're summing up the squared difference between each point and the average of all the shit in its region".

B measures deviations against the global mean — wrong reference; that's the total RSS, which the split can't change. C uses the mean of squared residuals (averages instead of sums); the prof flagged exactly this as a different algorithm that "wouldn't behave very well". D swaps squared deviations for absolute deviations — that's the L1/median-regression criterion, paired with median (not mean) predictions; CART uses squared error.
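
Written out in the textbook's $R_1(j,s)$, $R_2(j,s)$ notation, the quantity the split minimises is

$$\min_{j,\,s}\ \Bigg[\ \sum_{i:\, x_i \in R_1(j,s)} \big(y_i - \hat y_{R_1}\big)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \big(y_i - \hat y_{R_2}\big)^2\ \Bigg], \qquad \hat y_{R_k} = \operatorname{mean}\{\, y_i : x_i \in R_k(j,s) \,\}.$$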

Atoms: regression-tree. Lecture: L17-trees-1.

Question 2 · 5 points

Mark each statement about regression trees as true or false.

Show answer
  1. True — a leaf is reached only by satisfying every split condition along its path, so leaf membership encodes a logical conjunction over predictors. That's the prof's "older than X and hit more than Y" framing — the headline reason trees are not additive.
  2. False — the textbook prediction is the in-region mean. Median is a robust variant the prof mentioned in passing, but the algorithm and its analysis use the mean (squared-error loss).
  3. False — GAMs are additive ($f_1(x_1) + f_2(x_2) + \cdots$); a tree is a sum over indicators of region membership, $\sum_m c_m \mathbf{1}(x \in R_m)$, which can encode interactions but cannot be decomposed back to a per-predictor additive form.
  4. True — $|T|$ in the cost-complexity formula and "tree size" in CV plots both count terminal nodes (leaves), not internal split nodes.

Atoms: regression-tree, generalized-additive-models.

Question 3 · 5 points

A regression tree assigns six training observations to a single leaf $R$, with response values $\{ 12,\ 14,\ 15,\ 18,\ 20,\ 25 \}$. A new test point lands in this leaf. What does the tree predict for it?

Show answer
Correct answer: B

The CART regression-tree prediction in region $R_m$ is $\hat y_{R_m} = \frac{1}{|R_m|}\sum_{i \in R_m} y_i$. Here $\hat y_R = (12+14+15+18+20+25)/6 = 104/6 \approx 17.33$.

A picks the median, which is a robust alternative the prof named but not the textbook rule. C reports the midrange — a centre statistic, but not the one CART uses; the leaf prediction is the mean, not a min/max average. D forgets to divide by $|R_m|$ — the splitter minimises a sum of squared deviations, but the prediction itself is still the in-region mean.
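
A one-line sanity check in R, with the leaf values hard-coded from the question:

    y_leaf <- c(12, 14, 15, 18, 20, 25)
    mean(y_leaf)     # 17.33333 -- the CART leaf prediction (option B)
    median(y_leaf)   # 16.5 -- the robust alternative option A points at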

Atoms: regression-tree.

Question 4 · 5 points

Mark each statement about CART recursive binary splitting as true or false.

Show answer
  1. True — the prof: "trees don't go and change their mind and reconnect". Each split is locked once made.
  2. True — splits are univariate axis-aligned, "very easy" per the lectures. Multi-variable oblique splits exist as a research variant but aren't CART.
  3. False — a split only partitions the parent region. If cuts extended across the whole space, you'd recover the additive grid of GAMs. This is exactly why a tree can encode interactions while a GAM cannot.
  4. False — exhaustive optimal partitioning is computationally intractable; CART is greedy and only locally optimal. The prof: "we can't try every possible combination. It's just too much."

Atoms: regression-tree.

Question 5 · 4 points · Exam 2024 P2d

A regression tree predicting daily bike rentals has the following structure:

              temp < 10
              /        \
           50         humidity < 70
                       /         \
                    weekend?    300
                   /        \
                 200         420
      

Predict the number of bikes rented on a day with temp = 18, humidity = 60, weekend = no.

Show answer
Correct answer: A

Walk the splits with the test row: $\text{temp} = 18 \not< 10$ so go right; $\text{humidity} = 60 < 70$ so go left; $\text{weekend} = \text{no}$ so go left again. Leaf value = 200.

B misreads the first split direction (forces the cold branch). C ignores that 60 < 70 and takes the wrong humidity branch. D answers for a weekend day, getting the leaf for "yes" instead of "no".
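
A minimal hand-coded version of the same walk in R; the function name and arguments just mirror the diagram rather than a fitted tree object:

    predict_bikes <- function(temp, humidity, weekend) {
      if (temp < 10) return(50)          # left branch of the root split
      if (humidity >= 70) return(300)    # right branch of the humidity split
      if (weekend) 420 else 200          # weekend split: yes -> 420, no -> 200
    }
    predict_bikes(temp = 18, humidity = 60, weekend = FALSE)   # 200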

Atoms: regression-tree.

Question 6 · 4 points

Why do CART implementations grow the tree out fully and only then prune backwards, rather than stop growing early when a split's RSS reduction falls below a threshold?

Show answer
Correct answer: D

The prof verbatim: "Sometimes you actually do a seemingly worthless split, but then it's followed by a very good one. Because again, sometimes you need to cut one way and then cut the other way until you really get the benefit … these interactions are not super obvious." Build out, then prune.

A misuses the NP-hard fact (which applies to optimal global partitioning, not to evaluating a stopping threshold). B is direction-confused: more parameters means lower bias but higher variance, and the prune-after step removes parameters. C inverts the problem: early stopping yields trees that are too small, and weakest-link pruning is locally optimal over the nested sequence, not globally optimal over all subtrees.

Atoms: cost-complexity-pruning. Lecture: L18-trees-2.

Question 7 · 5 points

Mark each statement about cost-complexity pruning $C_\alpha(T) = \sum_{m=1}^{|T|}\sum_{i \in R_m}(y_i - \hat y_{R_m})^2 + \alpha |T|$ as true or false.

Show answer
  1. True — the slide deck explicitly draws this analogy ("Equation 8.4 is reminiscent of the lasso"). $|T|$ acts like an $L_0$ norm on the leaf count.
  2. False — $|T|$ is the cardinality (count of terminal nodes), not absolute value. The prof flagged this notation explicitly: it's "the zeroth norm".
  3. True — weakest-link pruning at increasing $\alpha$ produces a hierarchical sequence of subtrees. Once a leaf is collapsed it remains collapsed.
  4. False — minimising training RSS just selects the unpruned tree $T_0$ (more leaves always means lower training RSS). $\alpha$ is selected by $K$-fold cross-validation on held-out data.

Atoms: cost-complexity-pruning, cross-validation.

Question 8 · 5 points

You run cv.tree on a regression tree and get the CV deviance vs tree size curve below. The minimum sits at size 5; sizes 4 and 6 are within one standard error of size 5; sizes 1, 2 and 3 are clearly worse, and size 12 (the unpruned tree) is much worse. Which subtree should you select, given the prof's preference, and what does the choice imply about the cost-complexity penalty $\alpha$?

Show answer
Correct answer: C

The prof's preferred selection rule is the one-SE rule: among models within one SE of the CV minimum, pick the simplest. Sizes 4, 5, 6 are all within one SE of the minimum at 5, so 4 wins. A larger $\alpha$ corresponds to a smaller tree, so size 4 sits at a slightly larger $\alpha$ than size 5. From [[L18-trees-2]]: "we picked 7 because that was the smallest model with that level of misclassification error."

A picks the unpruned tree, which overfits — exactly the regime CV is supposed to avoid. B is the bare CV minimum (correct without the one-SE rule, but the prof's stated preference is the simpler tree within one SE). D ignores the CV curve entirely and always picks the root-only tree, which the curve shows is well outside the one-SE band.
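
Roughly how this looks with the tree package. Boston is just a stand-in dataset so the snippet runs; the size-4 choice comes from the question's CV curve, since cv.tree does not report standard errors and the one-SE band is read off the plot by hand:

    library(tree)
    library(MASS)                               # Boston housing data (the slide-deck example)
    fit    <- tree(medv ~ ., data = Boston)     # grow the full regression tree
    cv_out <- cv.tree(fit)                      # K-fold CV over the pruning sequence
    plot(cv_out$size, cv_out$dev, type = "b")   # CV deviance vs tree size (leaf count)
    pruned <- prune.tree(fit, best = 4)         # prune back to the one-SE choice (size 4 in the question)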

Atoms: cost-complexity-pruning, one-standard-error-rule, cross-validation.

Question 9 · 5 points

Mark each direction-of-effect claim as true or false.

Show answer
  1. True — larger $\alpha$ pays a heavier penalty per leaf, so the optimum drifts toward fewer leaves. Same direction as $\lambda$ in lasso/ridge: more regularisation, simpler model.
  2. True — with no penalty, more leaves always means lower training RSS, so the unpruned tree $T_0$ wins.
  3. False — pruning increases training RSS at every step (you remove flexibility). The held-out / CV error first decreases (overfitting reduced), then increases past the optimal size — but training RSS only ever goes up as you prune.
  4. True — fewer parameters means less data-set sensitivity (lower variance) at the cost of coarser in-region fits (higher bias). The prof: "how little bias you need to get a lot of reduction in variance."

Atoms: cost-complexity-pruning, bias-variance-tradeoff.

Question 10 · 5 points · ISLP §8 Q3

A binary classification region $R_m$ contains 70 class-A observations and 30 class-B observations. Compute the Gini index for this region.

Show answer
Correct answer: C

$G = \sum_{k=1}^K \hat p_{mk}(1 - \hat p_{mk}) = 0.7 \cdot 0.3 + 0.3 \cdot 0.7 = 0.21 + 0.21 = 0.42$. Equivalently, $1 - \sum_k \hat p_{mk}^2 = 1 - (0.49 + 0.09) = 0.42$.

A drops one of the two terms in the sum (binary classification still has two classes, both contribute). B reports $1 - \max_k \hat p_{mk} = 0.3$ — that's misclassification error, the criterion the prof flagged as too coarse for splitting. D is the cross-entropy (a different, also valid impurity measure) — close-but-different to Gini, and Gini is what the question asked for.
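
The same numbers in R, with the misclassification and cross-entropy distractors computed alongside:

    p <- c(A = 70, B = 30) / 100    # class proportions in the region
    sum(p * (1 - p))                # 0.42 -- Gini
    1 - sum(p^2)                    # 0.42 -- equivalent form
    1 - max(p)                      # 0.30 -- misclassification error (the option B trap)
    -sum(p * log(p))                # 0.611 -- cross-entropy (option D)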

Atoms: classification-tree.

Question 11 · 4 points

Mark each statement about classification-tree splitting and pruning criteria as true or false.

Show answer
  1. True — both impurity measures vanish exactly when one class has $\hat p = 1$ and are positive otherwise. This is what makes them sensible "node purity" scores.
  2. True — the prof's flagged asymmetry: "you build out the tree using deviance or Gini index but then you actually prune it using misclassification criteria … because that's ultimately what you care about."
  3. False — a split can leave both children predicting the same class and still reduce impurity, because one child ends up much purer than the other even though no predicted label changes. Such a split can also set up a strong follow-up split that early stopping on misclassification would miss.
  4. False — misclassification is too coarse: it can't distinguish between two splits with the same overall error rate but very different node purities. The prof: "the misclassification error is not sufficiently sensitive for tree growing."

Atoms: classification-tree, cost-complexity-pruning.

Question 12 · 4 points

Consider a parent node with 400 of class A and 400 of class B. Two candidate splits both yield 25% misclassification error overall. Split 1 produces children $(100\text{A}, 300\text{B})$ and $(300\text{A}, 100\text{B})$. Split 2 produces children $(200\text{A}, 0\text{B})$ and $(200\text{A}, 400\text{B})$. Which is the strongest single reason the prof gave for preferring Gini over misclassification at the growth step?

Show answer
Correct answer: D

Both splits have the same misclassification rate (25%), so misclassification can't choose between them. Gini sees the difference: Split 2's pure $(200\text{A}, 0\text{B})$ child has Gini 0, which is much purer than any node from Split 1. The prof's load-bearing argument: "the misclassification error is not sufficiently sensitive for tree growing" — it is the impurity-sensitivity argument, not differentiability.

A is the popular "differentiability" myth — CART does an exhaustive search over candidate cut-points, no gradient needed; the actual reason is sensitivity to purity, not smoothness. B is fabricated: the disagreement here happens with a perfectly 50/50 parent. C inverts the symptom: misclassification is impurity-insensitive, treating Split 1 and Split 2 as equally good; it doesn't preferentially pick one shape over the other.
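
Checking the question's counts in R (the helper functions are mine, not course code):

    gini  <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
    miscl <- function(counts) 1 - max(counts) / sum(counts)
    wavg  <- function(c1, c2, f) (sum(c1) * f(c1) + sum(c2) * f(c2)) / (sum(c1) + sum(c2))

    wavg(c(100, 300), c(300, 100), miscl)   # 0.25
    wavg(c(200,   0), c(200, 400), miscl)   # 0.25  -- misclassification can't separate the splits
    wavg(c(100, 300), c(300, 100), gini)    # 0.375
    wavg(c(200,   0), c(200, 400), gini)    # 0.333 -- Gini prefers Split 2's pure child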

Atoms: classification-tree. Lecture: L18-trees-2.

Question 13 · 5 points

Mark each statement about bagging as true or false.

Show answer
  1. True — that's the textbook recipe: $B$ bootstrap samples, fit a tree on each, average predictions for regression (or majority vote for classification). Compare boosting, which is sequential.
  2. False — the variance floor is $\rho \sigma^2$, where $\rho$ is the pairwise correlation between bootstrap-trained predictors. Adding more trees collapses the $(1-\rho)\sigma^2/B$ term but never the $\rho\sigma^2$ floor. This is precisely why random forests decorrelate trees via mtry.
  3. True — averaging 500 trees produces a model that you can no longer read off as a flow chart. The prof's "explainable AI" framing: importance plots give you which variables matter, even if not how they combine.
  4. False — bagging shines on high-variance, low-bias learners (deep trees, KNN with small $K$). Bagging linear regression barely helps because OLS is already low variance; the average of nearly-identical fits is the same fit.

Atoms: bagging, random-forest, variable-importance.

Question 14 · 4 points

On the South African heart-disease dataset with $p = 9$ predictors you fit a random forest classifier. Which value of mtry matches the textbook default?

Show answer
Correct answer: B

Classification: $m \approx \sqrt p$. Regression: $m = p/3$. With $p = 9$ and a binary classifier, $\sqrt 9 = 3$, so mtry = 3 is the textbook classification default.

A picks $m = p$, which is plain bagging — no decorrelation, the variance floor stays at $\rho\sigma^2$. C uses the regression default ($p/3$); it happens to land on 3 here, but the rule is for the wrong task type — easy way to lose a point on a question with $p = 12$ where the two defaults give different numbers (4 for regression, $\approx 3$ for classification). D invents a "use half the predictors" rule that isn't textbook; the canonical defaults are $\sqrt p$ and $p/3$, not $p/2$.
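
A quick check of the two defaults in R; heart and chd below are hypothetical stand-ins for the SA heart-disease data, so the model call is left as a comment:

    library(randomForest)
    p <- 9
    floor(sqrt(p))   # 3 -- classification default, m ~ sqrt(p)
    floor(p / 3)     # 3 -- regression default, m = p/3 (coincides here, differs at p = 12)
    # rf <- randomForest(chd ~ ., data = heart, mtry = floor(sqrt(p)), importance = TRUE)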

Atoms: random-forest.

Question 15 · 5 points · Exam 2025 P7

Mark each statement about random-forest hyperparameters as true or false.

Show answer
  1. False — $B$ is "use enough", not tuned. The 2025 exam key: "the number of trees is not a tuning parameter, but the students should mention that it should be chosen large enough." The 2023 exam deducts marks for treating $B$ as a tuning parameter.
  2. False — adding more trees never makes a random forest worse on test error; the variance just keeps shrinking until it hits the $\rho\sigma^2$ floor. Compare boosting, where too many trees does overfit.
  3. True — per-split, not per-tree. This maximises diversity even within a single tree, where different splits see different predictor pools.
  4. True — pruning individual RF trees is unnecessary because the ensemble average is already the regulariser. Each individual tree is allowed to overfit; the average doesn't.

Atoms: random-forest, out-of-bag-error.

Question 16 · 5 points · Ex8.1d

For a bootstrap sample of size $n$ drawn with replacement from $n$ training observations, what fraction of the original observations is left out (out-of-bag) for that sample, in the limit $n \to \infty$?

Show answer
Correct answer: B

Each of the $n$ draws independently fails to pick observation $i$ with probability $1 - 1/n$, so $P(i \text{ never drawn}) = (1 - 1/n)^n \to 1/e \approx 0.368$. So about 37% are out-of-bag for each tree, and about 63% are in-bag.

A guesses 50/50 with no derivation. D swaps in-bag and OOB — $1 - 1/e \approx 0.632$ is the in-bag fraction; OOB is the complement. C confuses the per-draw selection probability with the all-draws-fail probability, missing the exponentiation.
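
The limit is easy to see numerically in R:

    # P(observation i never drawn in n with-replacement draws) = (1 - 1/n)^n
    n <- c(10, 100, 1000, 1e6)
    (1 - 1/n)^n     # 0.3487 0.3660 0.3677 0.3679
    exp(-1)         # 0.3678794 -- the limiting OOB fraction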

Atoms: out-of-bag-error, bootstrap.

Question 17 · 4 points

Bagged tree predictions $\hat f^{*1}, \ldots, \hat f^{*B}$ have common variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is $\rho \sigma^2 + (1-\rho)\sigma^2/B$. Why does a random forest outperform plain bagging when there is one strong predictor in the data?

Show answer
Correct answer: C

The $\rho \sigma^2$ floor doesn't shrink with $B$. To beat it you have to reduce $\rho$ — the per-tree pairwise correlation. RF does exactly this by hiding the strong predictor from a random subset of splits; trees diverge from the very first split. The prof: "if you want to have a lot of opinions, don't clone the same person 50 times … figure out a way to add diversity into your pool."

A gets the mechanism wrong: per-tree variance $\sigma^2$ is roughly unchanged (RF trees can even be higher variance individually); the gain is decorrelation. B inverts the direction of $B$ (and treats $B$ as a tuning parameter, which it isn't). D mis-states the algorithm: RF still uses bootstrap with replacement (it is bagging plus per-split predictor restriction); subsampling without replacement is a separate variant, and even then the trees aren't actually independent — the gain is reducing $\rho$, not driving it to zero.

Atoms: random-forest, bagging.

Question 18 · 4 points

Mark each statement about out-of-bag (OOB) error as true or false.

Show answer
  1. True — that's the OOB construction: each observation is held out by roughly a third of the trees, so its prediction is unbiased w.r.t. those trees.
  2. True — with mild caveats. OOB gives an honest test-error proxy without needing a dedicated test set or a separate $k$-fold CV pass; the prof flagged the slight dependence on bootstrap structure but endorsed it for tree ensembles.
  3. False — boosting fits trees sequentially on a single re-weighted/residualised training set; there is no per-tree bootstrap and no per-tree OOB sample. OOB is a bagging-family construct.
  4. True — permutation importance permutes predictor $j$ on the OOB sample and measures the drop in OOB performance, exactly the OOB-vs-permuted-OOB comparison.

Atoms: out-of-bag-error, variable-importance.

Question 19 · 5 points

Suppose individual bagged-tree predictions at a fixed test point have common variance $\sigma^2 = 1$ and pairwise correlation $\rho = 0.4$. Using $\operatorname{Var}\!\left(\frac{1}{B}\sum_b \hat f^{*b}\right) = \rho\sigma^2 + (1-\rho)\sigma^2 / B$, what is the variance of the bagged predictor when $B = 100$, and what does it converge to as $B \to \infty$?

Show answer
Correct answer: A

$0.4 \cdot 1 + (1 - 0.4)\cdot 1 / 100 = 0.4 + 0.006 = 0.406$. As $B \to \infty$, the second term vanishes and the variance approaches the floor $\rho\sigma^2 = 0.4$. This is the very reason RF decorrelates trees via mtry — to push $\rho$ (and hence the floor) down.

B applies the formula at finite $B$ correctly but then forgets the floor at infinity, claiming variance goes to zero (the IID reasoning, which doesn't apply when bootstrap samples share data). C uses $\sigma^2/B$ alone (the IID variance, $0.01$), missing the $\rho\sigma^2$ contribution entirely. D inverts the role of $\rho$: it computes $(1-\rho) + \rho/B = 0.604$ at $B=100$ and converges to $1 - \rho = 0.6$ — wrong direction.
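
The same arithmetic as a tiny R helper:

    bag_var <- function(B, rho = 0.4, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
    bag_var(100)    # 0.406 -- option A's finite-B value
    bag_var(1e9)    # ~0.4  -- the rho * sigma^2 floor; it never reaches 0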

Atoms: bagging, random-forest.

Question 20 · 5 points · Exam 2023

Mark each statement about variable-importance plots from a tree ensemble as true or false.

Show answer
  1. True — that is the impurity-based ("MeanDecreaseGini" / RSS-based) definition. The 2023 exam key: "which three variables are most important to predict default, according to an importance measure based on node purity?"
  2. True — permutation importance permutes predictor $j$ on OOB observations and measures the drop in MSE / accuracy. The prof prefers it: "I would generally go with randomization one because it makes more sense."
  3. False — importance only ranks predictors by how much they matter; it doesn't say in which direction they push the response. For direction-of-effect, use partial dependence plots (m09).
  4. False — importance is for interpretation, not selection. Rankings are conditional on the fitted model; dropping low-importance predictors and refitting can break interactions and is a common mistake.

Atoms: variable-importance, out-of-bag-error.

Question 21 · 4 points

A teammate fits a random forest on a mix of continuous and many-level categorical predictors and asks whether to read off variable importance from MeanDecreaseGini or MeanDecreaseAccuracy. Which is the textbook reason to prefer the permutation-based version (MeanDecreaseAccuracy) here?

Show answer
Correct answer: D

Many-level / continuous predictors get more opportunities to reduce impurity by chance, so impurity-based importance ranks them artificially high. Permutation importance directly tests whether knowing the predictor's value helps prediction on held-out data, so it sidesteps this bias. The prof on his preference: "I would generally go with randomization one because it makes more sense."

A invents an unrelated efficiency claim. B wrongly excludes Gini-based importance, which is also valid (the 2023 exam key explicitly asked about it). C inverts the OOB construction: permutation importance uses OOB observations, not in-bag, exactly so the corruption test is honest.
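
What pulling both measures out of a fit looks like with the randomForest package; iris is only a stand-in dataset so the snippet runs, and without importance = TRUE the permutation column isn't computed:

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(rf)   # includes MeanDecreaseAccuracy (permutation, OOB-based) and MeanDecreaseGini (impurity-based)
    varImpPlot(rf)   # plots both measures side by side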

Atoms: variable-importance. Lecture: L19-boosting-1.

Question 22 · 4 points

Mark each statement as true or false.

Show answer
  1. True — splits depend only on the relative ordering of values along each axis, so monotone rescaling can't change the tree. The Specials atom on standardisation flags trees as the canonical "doesn't need it" example, in contrast to ridge / lasso / PCA / k-means / KNN / NNs.
  2. True — the slide-deck Boston example: single tree test MSE $\approx 35$, bagging $\approx 23.7$, RF with $m = p/3 \approx 18.9$. The expected ordering, and a canonical scenario answer.
  3. False — tree pruning is selected by $K$-fold CV on held-out data, not by AIC. AIC mechanics are explicitly out of scope per the prof; the textbook tool is cv.tree.
  4. True — when one predictor is dominant, every bootstrap tree picks it for the root split, so $\rho$ stays high and the bagging variance floor dominates. RF breaks the dominance by withholding the strong predictor at random splits, so trees actually decorrelate and the ensemble improves.

Atoms: standardization, random-forest, bagging, cost-complexity-pruning.