
Module 08 — Tree-based methods

22 questions · 100 points · ~35 min

Trees, pruning, bagging, random forests, OOB, variable importance. Boosting lives in m09. Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 · 4 points · Exam 2024 P3d

At each internal node, a CART regression tree picks $(j, s)$ to minimise which quantity?

Show answer
Correct answer: A

Each candidate split partitions the parent region into $R_1$ and $R_2$; the predictions in each are the in-region means. The criterion is the sum of squared deviations, computed region-by-region against the in-region mean. The prof: "we're summing up the squared difference between each point and the average of all the shit in its region".

B measures deviations against the global mean — wrong reference; that's the total RSS, which the split can't change. C uses the mean of squared residuals (averages instead of sums); the prof flagged exactly this as a different algorithm that "wouldn't behave very well". D swaps squared deviations for absolute deviations — that's the L1/median-regression criterion, paired with median (not mean) predictions; CART uses squared error.
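
Written out in the textbook's $R_1(j,s)$, $R_2(j,s)$ notation, the quantity the split minimises is

$$\min_{j,\,s}\ \Bigg[\ \sum_{i:\, x_i \in R_1(j,s)} \big(y_i - \hat y_{R_1}\big)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \big(y_i - \hat y_{R_2}\big)^2\ \Bigg], \qquad \hat y_{R_k} = \operatorname{mean}\{\, y_i : x_i \in R_k(j,s) \,\}.$$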

Atoms: regression-tree. Lecture: L17-trees-1.

Question 2 · 5 points

Mark each statement about regression trees as true or false.

Show answer
  1. True — a leaf is reached only by satisfying every split condition along its path, so leaf membership encodes a logical conjunction over predictors. That's the prof's "older than X and hit more than Y" framing — the headline reason trees are not additive.
  2. False — the textbook prediction is the in-region mean. Median is a robust variant the prof mentioned in passing, but the algorithm and its analysis use the mean (squared-error loss).
  3. False — GAMs are additive ($f_1(x_1) + f_2(x_2) + \cdots$); a tree is a sum over indicators of region membership, $\sum_m c_m \mathbf{1}(x \in R_m)$, which can encode interactions but cannot be decomposed back to a per-predictor additive form.
  4. True — $|T|$ in the cost-complexity formula and "tree size" in CV plots both count terminal nodes (leaves), not internal split nodes.

Atoms: regression-tree, generalized-additive-models.

Question 3 · 5 points

A regression tree assigns six training observations to a single leaf $R$, with response values $\{ 12,\ 14,\ 15,\ 18,\ 20,\ 25 \}$. A new test point lands in this leaf. What does the tree predict for it?

Show answer
Correct answer: B

The CART regression-tree prediction in region $R_m$ is $\hat y_{R_m} = \frac{1}{|R_m|}\sum_{i \in R_m} y_i$. Here $\hat y_R = (12+14+15+18+20+25)/6 = 104/6 \approx 17.33$.

A picks the median, which is a robust alternative the prof named but not the textbook rule. C reports the midrange — a centre statistic, but not the one CART uses; the leaf prediction is the mean, not a min/max average. D forgets to divide by $|R_m|$ — the splitter minimises a sum of squared deviations, but the prediction itself is still the in-region mean.
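
A one-line sanity check in R, with the leaf values hard-coded from the question:

    y_leaf <- c(12, 14, 15, 18, 20, 25)
    mean(y_leaf)     # 17.33333 -- the CART leaf prediction (option B)
    median(y_leaf)   # 16.5 -- the robust alternative option A points at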

Atoms: regression-tree.

Question 4 · 5 points

Mark each statement about CART recursive binary splitting as true or false.

Show answer
  1. True — the prof: "trees don't go and change their mind and reconnect". Each split is locked once made.
  2. True — splits are univariate axis-aligned, "very easy" per the lectures. Multi-variable oblique splits exist as a research variant but aren't CART.
  3. False — a split only partitions the parent region. If cuts extended across the whole space, you'd recover the additive grid of GAMs. This is exactly why a tree can encode interactions while a GAM cannot.
  4. False — exhaustive optimal partitioning is computationally intractable; CART is greedy and only locally optimal. The prof: "we can't try every possible combination. It's just too much."

Atoms: regression-tree.

Question 5 · 4 points · Exam 2024 P2d

A regression tree predicting daily bike rentals has the following structure:

              temp < 10
              /        \
           50         humidity < 70
                       /         \
                    weekend?    300
                   /        \
                 200         420
      

Predict the number of bikes rented on a day with temp = 18, humidity = 60, weekend = no.

Show answer
Correct answer: A

Walk the splits with the test row: $\text{temp} = 18 \not< 10$ so go right; $\text{humidity} = 60 < 70$ so go left; $\text{weekend} = \text{no}$ so go left again. Leaf value = 200.

B misreads the first split direction (forces the cold branch). C ignores that 60 < 70 and takes the wrong humidity branch. D answers for a weekend day, getting the leaf for "yes" instead of "no".
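
A minimal hand-coded version of the same walk in R; the function name and arguments just mirror the diagram rather than a fitted tree object:

    predict_bikes <- function(temp, humidity, weekend) {
      if (temp < 10) return(50)          # left branch of the root split
      if (humidity >= 70) return(300)    # right branch of the humidity split
      if (weekend) 420 else 200          # weekend split: yes -> 420, no -> 200
    }
    predict_bikes(temp = 18, humidity = 60, weekend = FALSE)   # 200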

Atoms: regression-tree.

Question 6 · 4 points

Why do CART implementations grow the tree out fully and only then prune backwards, rather than stop growing early when a split's RSS reduction falls below a threshold?

Show answer
Correct answer: D

The prof verbatim: "Sometimes you actually do a seemingly worthless split, but then it's followed by a very good one. Because again, sometimes you need to cut one way and then cut the other way until you really get the benefit … these interactions are not super obvious." Build out, then prune.

A misuses the NP-hard fact (which applies to optimal global partitioning, not to evaluating a stopping threshold). B is direction-confused: more parameters means lower bias but higher variance, and the prune-after step removes parameters. C inverts the problem: early stopping yields trees that are too small, and weakest-link pruning is locally optimal over the nested sequence, not globally optimal over all subtrees.

Atoms: cost-complexity-pruning. Lecture: L18-trees-2.

Question 7 · 5 points

Mark each statement about cost-complexity pruning $C_\alpha(T) = \sum_{m=1}^{|T|}\sum_{i \in R_m}(y_i - \hat y_{R_m})^2 + \alpha |T|$ as true or false.

Show answer
  1. True — the slide deck explicitly draws this analogy ("Equation 8.4 is reminiscent of the lasso"). $|T|$ acts like an $L_0$ norm on the leaf count.
  2. False — $|T|$ is the cardinality (count of terminal nodes), not absolute value. The prof flagged this notation explicitly: it's "the zeroth norm".
  3. True — weakest-link pruning at increasing $\alpha$ produces a hierarchical sequence of subtrees. Once a leaf is collapsed it remains collapsed.
  4. False — minimising training RSS just selects the unpruned tree $T_0$ (more leaves always means lower training RSS). $\alpha$ is selected by $K$-fold cross-validation on held-out data.

Atoms: cost-complexity-pruning, cross-validation.

Question 8 · 5 points

You run cv.tree on a regression tree and get the CV deviance vs tree size curve below. The minimum sits at size 5; sizes 4 and 6 are within one standard error of size 5; sizes 1, 2 and 3 are clearly worse, and size 12 (the unpruned tree) is much worse. Which subtree should you select, given the prof's preference, and what does the choice imply about the cost-complexity penalty $\alpha$?

Show answer
Correct answer: C

The prof's preferred selection rule is the one-SE rule: among models within one SE of the CV minimum, pick the simplest. Sizes 4, 5, 6 are all within one SE of the minimum at 5, so 4 wins. A larger $\alpha$ corresponds to a smaller tree, so size 4 sits at a slightly larger $\alpha$ than size 5. From [[L18-trees-2]]: "we picked 7 because that was the smallest model with that level of misclassification error."

A picks the unpruned tree, which overfits — exactly the regime CV is supposed to avoid. B is the bare CV minimum (correct without the one-SE rule, but the prof's stated preference is the simpler tree within one SE). D ignores the CV curve entirely and always picks the root-only tree, which the curve shows is well outside the one-SE band.
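
Roughly how this looks with the tree package. Boston is just a stand-in dataset so the snippet runs; the size-4 choice comes from the question's CV curve, since cv.tree does not report standard errors and the one-SE band is read off the plot by hand:

    library(tree)
    library(MASS)                               # Boston housing data (the slide-deck example)
    fit    <- tree(medv ~ ., data = Boston)     # grow the full regression tree
    cv_out <- cv.tree(fit)                      # K-fold CV over the pruning sequence
    plot(cv_out$size, cv_out$dev, type = "b")   # CV deviance vs tree size (leaf count)
    pruned <- prune.tree(fit, best = 4)         # prune back to the one-SE choice (size 4 in the question)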

Atoms: cost-complexity-pruning, one-standard-error-rule, cross-validation.

Question 9 · 5 points

Mark each direction-of-effect claim as true or false.

Show answer
  1. True — larger $\alpha$ pays a heavier penalty per leaf, so the optimum drifts toward fewer leaves. Same direction as $\lambda$ in lasso/ridge: more regularisation, simpler model.
  2. True — with no penalty, more leaves always means lower training RSS, so the unpruned tree $T_0$ wins.
  3. False — pruning increases training RSS at every step (you remove flexibility). The held-out / CV error first decreases (overfitting reduced), then increases past the optimal size — but training RSS only ever goes up as you prune.
  4. True — fewer parameters means less data-set sensitivity (lower variance) at the cost of coarser in-region fits (higher bias). The prof: "how little bias you need to get a lot of reduction in variance."

Atoms: cost-complexity-pruning, bias-variance-tradeoff.

Question 10 · 5 points · ISLP §8 Q3

A binary classification region $R_m$ contains 70 class-A observations and 30 class-B observations. Compute the Gini index for this region.

Show answer
Correct answer: C

$G = \sum_{k=1}^K \hat p_{mk}(1 - \hat p_{mk}) = 0.7 \cdot 0.3 + 0.3 \cdot 0.7 = 0.21 + 0.21 = 0.42$. Equivalently, $1 - \sum_k \hat p_{mk}^2 = 1 - (0.49 + 0.09) = 0.42$.

A drops one of the two terms in the sum (binary classification still has two classes, both contribute). B reports $1 - \max_k \hat p_{mk} = 0.3$ — that's misclassification error, the criterion the prof flagged as too coarse for splitting. D is the cross-entropy (a different, also valid impurity measure) — close-but-different to Gini, and Gini is what the question asked for.
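
The same numbers in R, with the misclassification and cross-entropy distractors computed alongside:

    p <- c(A = 70, B = 30) / 100    # class proportions in the region
    sum(p * (1 - p))                # 0.42 -- Gini
    1 - sum(p^2)                    # 0.42 -- equivalent form
    1 - max(p)                      # 0.30 -- misclassification error (the option B trap)
    -sum(p * log(p))                # 0.611 -- cross-entropy (option D)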

Atoms: classification-tree.

Question 11 · 4 points

Mark each statement about classification-tree splitting and pruning criteria as true or false.

Show answer
  1. True — both impurity measures vanish exactly when one class has $\hat p = 1$ and are positive otherwise. This is what makes them sensible "node purity" scores.
  2. True — the prof's flagged asymmetry: "you build out the tree using deviance or Gini index but then you actually prune it using misclassification criteria … because that's ultimately what you care about."
  3. False — a split can leave both children predicting the same class and still reduce impurity, because one child ends up much purer than the other even though no predicted label changes. Such a split can also set up a strong follow-up split that early stopping on misclassification would miss.
  4. False — misclassification is too coarse: it can't distinguish between two splits with the same overall error rate but very different node purities. The prof: "the misclassification error is not sufficiently sensitive for tree growing."

Atoms: classification-tree, cost-complexity-pruning.

Question 12 · 4 points

Consider a parent node with 400 of class A and 400 of class B. Two candidate splits both yield 25% misclassification error overall. Split 1 produces children $(100\text{A}, 300\text{B})$ and $(300\text{A}, 100\text{B})$. Split 2 produces children $(200\text{A}, 0\text{B})$ and $(200\text{A}, 400\text{B})$. Which is the strongest single reason the prof gave for preferring Gini over misclassification at the growth step?

Show answer
Correct answer: D

Both splits have the same misclassification rate (25%), so misclassification can't choose between them. Gini sees the difference: Split 2's pure $(200\text{A}, 0\text{B})$ child has Gini 0, which is much purer than any node from Split 1. The prof's load-bearing argument: "the misclassification error is not sufficiently sensitive for tree growing" — it is the impurity-sensitivity argument, not differentiability.

A is the popular "differentiability" myth — CART does an exhaustive search over candidate cut-points, no gradient needed; the actual reason is sensitivity to purity, not smoothness. B is fabricated: the disagreement here happens with a perfectly 50/50 parent. C inverts the symptom: misclassification is impurity-insensitive, treating Split 1 and Split 2 as equally good; it doesn't preferentially pick one shape over the other.
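
Checking the question's counts in R (the helper functions are mine, not course code):

    gini  <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
    miscl <- function(counts) 1 - max(counts) / sum(counts)
    wavg  <- function(c1, c2, f) (sum(c1) * f(c1) + sum(c2) * f(c2)) / (sum(c1) + sum(c2))

    wavg(c(100, 300), c(300, 100), miscl)   # 0.25
    wavg(c(200,   0), c(200, 400), miscl)   # 0.25  -- misclassification can't separate the splits
    wavg(c(100, 300), c(300, 100), gini)    # 0.375
    wavg(c(200,   0), c(200, 400), gini)    # 0.333 -- Gini prefers Split 2's pure child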

Atoms: classification-tree. Lecture: L18-trees-2.

Question 13 · 5 points

Mark each statement about bagging as true or false.

Show answer
  1. True — that's the textbook recipe: $B$ bootstrap samples, fit a tree on each, average predictions for regression (or majority vote for classification). Compare boosting, which is sequential.
  2. False — the variance floor is $\rho \sigma^2$, where $\rho$ is the pairwise correlation between bootstrap-trained predictors. Adding more trees collapses the $(1-\rho)\sigma^2/B$ term but never the $\rho\sigma^2$ floor. This is precisely why random forests decorrelate trees via mtry.
  3. True — averaging 500 trees produces a model that you can no longer read off as a flow chart. The prof's "explainable AI" framing: importance plots give you which variables matter, even if not how they combine.
  4. False — bagging shines on high-variance, low-bias learners (deep trees, KNN with small $K$). Bagging linear regression barely helps because OLS is already low variance; the average of nearly-identical fits is the same fit.

Atoms: bagging, random-forest, variable-importance.

Question 14 · 4 points

On the South African heart-disease dataset with $p = 9$ predictors you fit a random forest classifier. Which value of mtry matches the textbook default?

Show answer
Correct answer: B

Classification: $m \approx \sqrt p$. Regression: $m = p/3$. With $p = 9$ and a binary classifier, $\sqrt 9 = 3$, so mtry = 3 is the textbook classification default.

A picks $m = p$, which is plain bagging — no decorrelation, the variance floor stays at $\rho\sigma^2$. C uses the regression default ($p/3$); it happens to land on 3 here, but the rule is for the wrong task type — easy way to lose a point on a question with $p = 12$ where the two defaults give different numbers (4 for regression, $\approx 3$ for classification). D invents a "use half the predictors" rule that isn't textbook; the canonical defaults are $\sqrt p$ and $p/3$, not $p/2$.
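
A quick check of the two defaults in R; heart and chd below are hypothetical stand-ins for the SA heart-disease data, so the model call is left as a comment:

    library(randomForest)
    p <- 9
    floor(sqrt(p))   # 3 -- classification default, m ~ sqrt(p)
    floor(p / 3)     # 3 -- regression default, m = p/3 (coincides here, differs at p = 12)
    # rf <- randomForest(chd ~ ., data = heart, mtry = floor(sqrt(p)), importance = TRUE)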

Atoms: random-forest.

Question 15 · 5 points · Exam 2025 P7

Mark each statement about random-forest hyperparameters as true or false.

Show answer
  1. False — $B$ is "use enough", not tuned. The 2025 exam key: "the number of trees is not a tuning parameter, but the students should mention that it should be chosen large enough." The 2023 exam deducts marks for treating $B$ as a tuning parameter.
  2. False — adding more trees never makes a random forest worse on test error; the variance just keeps shrinking until it hits the $\rho\sigma^2$ floor. Compare boosting, where too many trees does overfit.
  3. True — per-split, not per-tree. This maximises diversity even within a single tree, where different splits see different predictor pools.
  4. True — pruning individual RF trees is unnecessary because the ensemble average is already the regulariser. Each individual tree is allowed to overfit; the average doesn't.

Atoms: random-forest, out-of-bag-error.

Question 16 · 5 points · Ex8.1d

For a bootstrap sample of size $n$ drawn with replacement from $n$ training observations, what fraction of the original observations is left out (out-of-bag) for that sample, in the limit $n \to \infty$?

Show answer
Correct answer: B

Each of the $n$ draws independently fails to pick observation $i$ with probability $1 - 1/n$, so $P(i \text{ never drawn}) = (1 - 1/n)^n \to 1/e \approx 0.368$. So about 37% are out-of-bag for each tree, and about 63% are in-bag.

A guesses 50/50 with no derivation. D swaps in-bag and OOB — $1 - 1/e \approx 0.632$ is the in-bag fraction; OOB is the complement. C confuses the per-draw selection probability with the all-draws-fail probability, missing the exponentiation.
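
The limit is easy to see numerically in R:

    # P(observation i never drawn in n with-replacement draws) = (1 - 1/n)^n
    n <- c(10, 100, 1000, 1e6)
    (1 - 1/n)^n     # 0.3487 0.3660 0.3677 0.3679
    exp(-1)         # 0.3678794 -- the limiting OOB fraction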

Atoms: out-of-bag-error, bootstrap.

Question 17 · 4 points

Bagged tree predictions $\hat f^{*1}, \ldots, \hat f^{*B}$ have common variance $\sigma^2$ and pairwise correlation $\rho$. The variance of their average is $\rho \sigma^2 + (1-\rho)\sigma^2/B$. Why does a random forest outperform plain bagging when there is one strong predictor in the data?

Show answer
Correct answer: C

The $\rho \sigma^2$ floor doesn't shrink with $B$. To beat it you have to reduce $\rho$ — the per-tree pairwise correlation. RF does exactly this by hiding the strong predictor from a random subset of splits; trees diverge from the very first split. The prof: "if you want to have a lot of opinions, don't clone the same person 50 times … figure out a way to add diversity into your pool."

A gets the mechanism wrong: per-tree variance $\sigma^2$ is roughly unchanged (RF trees can even be higher variance individually); the gain is decorrelation. B inverts the direction of $B$ (and treats $B$ as a tuning parameter, which it isn't). D mis-states the algorithm: RF still uses bootstrap with replacement (it is bagging plus per-split predictor restriction); subsampling without replacement is a separate variant, and even then the trees aren't actually independent — the gain is reducing $\rho$, not driving it to zero.

Atoms: random-forest, bagging.

Question 18 · 4 points

Mark each statement about out-of-bag (OOB) error as true or false.

Show answer
  1. True — that's the OOB construction: each observation is held out by roughly a third of the trees, so its prediction is unbiased w.r.t. those trees.
  2. True — with mild caveats. OOB gives an honest test-error proxy without needing a dedicated test set or a separate $k$-fold CV pass; the prof flagged the slight dependence on bootstrap structure but endorsed it for tree ensembles.
  3. False — boosting fits trees sequentially on a single re-weighted/residualised training set; there is no per-tree bootstrap and no per-tree OOB sample. OOB is a bagging-family construct.
  4. True — permutation importance permutes predictor $j$ on the OOB sample and measures the drop in OOB performance, exactly the OOB-vs-permuted-OOB comparison.

Atoms: out-of-bag-error, variable-importance.

Question 19 · 5 points

Suppose individual bagged-tree predictions at a fixed test point have common variance $\sigma^2 = 1$ and pairwise correlation $\rho = 0.4$. Using $\operatorname{Var}\!\left(\frac{1}{B}\sum_b \hat f^{*b}\right) = \rho\sigma^2 + (1-\rho)\sigma^2 / B$, what is the variance of the bagged predictor when $B = 100$, and what does it converge to as $B \to \infty$?

Show answer
Correct answer: A

$0.4 \cdot 1 + (1 - 0.4)\cdot 1 / 100 = 0.4 + 0.006 = 0.406$. As $B \to \infty$, the second term vanishes and the variance approaches the floor $\rho\sigma^2 = 0.4$. This is the very reason RF decorrelates trees via mtry — to push $\rho$ (and hence the floor) down.

B applies the formula at finite $B$ correctly but then forgets the floor at infinity, claiming variance goes to zero (the IID reasoning, which doesn't apply when bootstrap samples share data). C uses $\sigma^2/B$ alone (the IID variance, $0.01$), missing the $\rho\sigma^2$ contribution entirely. D inverts the role of $\rho$: it computes $(1-\rho) + \rho/B = 0.604$ at $B=100$ and converges to $1 - \rho = 0.6$ — wrong direction.
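
The same arithmetic as a tiny R helper:

    bag_var <- function(B, rho = 0.4, sigma2 = 1) rho * sigma2 + (1 - rho) * sigma2 / B
    bag_var(100)    # 0.406 -- option A's finite-B value
    bag_var(1e9)    # ~0.4  -- the rho * sigma^2 floor; it never reaches 0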

Atoms: bagging, random-forest.

Question 20 · 5 points · Exam 2023

Mark each statement about variable-importance plots from a tree ensemble as true or false.

Show answer
  1. True — that is the impurity-based ("MeanDecreaseGini" / RSS-based) definition. The 2023 exam key: "which three variables are most important to predict default, according to an importance measure based on node purity?"
  2. True — permutation importance permutes predictor $j$ on OOB observations and measures the drop in MSE / accuracy. The prof prefers it: "I would generally go with randomization one because it makes more sense."
  3. False — importance only ranks predictors by how much they matter; it doesn't say in which direction they push the response. For direction-of-effect, use partial dependence plots (m09).
  4. False — importance is for interpretation, not selection. Rankings are conditional on the fitted model; dropping low-importance predictors and refitting can break interactions and is a common mistake.

Atoms: variable-importance, out-of-bag-error.

Question 21 · 4 points

A teammate fits a random forest on a mix of continuous and many-level categorical predictors and asks whether to read off variable importance from MeanDecreaseGini or MeanDecreaseAccuracy. Which is the textbook reason to prefer the permutation-based version (MeanDecreaseAccuracy) here?

Show answer
Correct answer: D

Many-level / continuous predictors get more opportunities to reduce impurity by chance, so impurity-based importance ranks them artificially high. Permutation importance directly tests whether knowing the predictor's value helps prediction on held-out data, so it sidesteps this bias. The prof on his preference: "I would generally go with randomization one because it makes more sense."

A invents an unrelated efficiency claim. B wrongly excludes Gini-based importance, which is also valid (the 2023 exam key explicitly asked about it). C inverts the OOB construction: permutation importance uses OOB observations, not in-bag, exactly so the corruption test is honest.
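
What pulling both measures out of a fit looks like with the randomForest package; iris is only a stand-in dataset so the snippet runs, and without importance = TRUE the permutation column isn't computed:

    library(randomForest)
    rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(rf)   # includes MeanDecreaseAccuracy (permutation, OOB-based) and MeanDecreaseGini (impurity-based)
    varImpPlot(rf)   # plots both measures side by side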

Atoms: variable-importance. Lecture: L19-boosting-1.

Question 22 · 4 points

Mark each statement as true or false.

Show answer
  1. True — splits depend only on the relative ordering of values along each axis, so monotone rescaling can't change the tree. The Specials atom on standardisation flags trees as the canonical "doesn't need it" example, in contrast to ridge / lasso / PCA / k-means / KNN / NNs.
  2. True — the slide-deck Boston example: single tree test MSE $\approx 35$, bagging $\approx 23.7$, RF with $m = p/3 \approx 18.9$. The expected ordering, and a canonical scenario answer.
  3. False — tree pruning is selected by $K$-fold CV on held-out data, not by AIC. AIC mechanics are explicitly out of scope per the prof; the textbook tool is cv.tree.
  4. True — when one predictor is dominant, every bootstrap tree picks it for the root split, so $\rho$ stays high and the bagging variance floor dominates. RF breaks the dominance by withholding the strong predictor at random splits, so trees actually decorrelate and the ensemble improves.

Atoms: standardization, random-forest, bagging, cost-complexity-pruning.