
Module 09 — Boosting and Additive Trees

29 questions · 100 points · ~45 min


Question 1 3 points Ex9.1

Which statement best captures the conceptual difference between bagging and boosting as ensemble methods?

Show answer
Correct answer: A

Bagging is parallel and aimed at variance reduction (averaging deep, low-bias trees grown on bootstrap samples). Boosting is sequential and aimed at bias reduction (each new weak learner corrects the previous ensemble).

B inverts the bias-variance roles. C confuses the two methods — boosting is the one that fits residuals. D describes the random-forest decorrelation trick (restricting each split to a random subset of $m$ of the $p$ predictors), not the bagging-vs-boosting axis.

Atoms: boosting, bagging. Lecture: L19-boosting-1.

Question 2 4 points Exam 2025 P6c

You fit a gradient-boosting model and need to specify the number of trees $M$, the tree depth $d$, and the shrinkage $\nu$. Which approach for picking these values does the prof advocate?

Show answer
Correct answer: C

The prof's verbatim answer to this exact question (2025 Q6c walkthrough): cross-validation, with $\nu$ fixed small and $M$ chosen at the CV minimum. `gbm()`'s `cv.folds` argument and `xgb.cv` are the canonical hooks.

A skips the data-driven step; $M$ in particular overfits if set too large, so defaults won't generalise across datasets. B optimises against the model's own training noise; training error keeps falling as $M$ grows and never picks a meaningful stopping point. D applies the prof's least trusted model-selection family — he flagged AIC/BIC as making assumptions that "probably won't hold" and recommends CV instead.
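
A minimal sketch of this recipe with the `gbm` package, assuming training and test data frames `train` and `test` with a numeric response `y` (names are illustrative; the depth $d$ would be compared across a small grid the same way):

```r
library(gbm)

set.seed(1)
fit <- gbm(
  y ~ ., data = train,
  distribution = "gaussian",   # squared-error loss
  n.trees = 5000,              # generous upper bound on M
  shrinkage = 0.01,            # nu fixed small, not tuned jointly with M
  interaction.depth = 2,       # tree depth d
  cv.folds = 10                # 10-fold CV error tracked per tree
)

M_star <- gbm.perf(fit, method = "cv")            # M chosen at the CV minimum
pred   <- predict(fit, newdata = test, n.trees = M_star)
```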

Atoms: gradient-boosting, cross-validation. Lecture: L27-summary.

Question 3 3 points

For the squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, the negative gradient $-\partial L / \partial f$ that gradient boosting fits each new tree to is:

Show answer
Correct answer: D

$\partial L / \partial f = -(y - f)$, so $-\partial L / \partial f = y - f$. This is exactly the residual, which is why "fit a tree to the residuals" and "fit a tree to the negative gradient" coincide for squared-error loss.

A flips the sign — that's the positive gradient, which would push the fit away from the data. B is the loss itself, not its derivative. C is the form of the AdaBoost exponent inside $\exp(-yf)$, not a gradient.
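
A quick base-R check of the identity, comparing a finite-difference gradient to the residual (the numbers are arbitrary):

```r
L <- function(y, f) 0.5 * (y - f)^2

y <- 3.0; f <- 1.2; h <- 1e-6
neg_grad <- -(L(y, f + h) - L(y, f - h)) / (2 * h)   # numerical -dL/df
c(neg_grad, y - f)                                   # both ~1.8: negative gradient = residual
```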

Atoms: gradient-boosting, boosting-loss-functions. Lecture: L20-boosting-2.

Question 4 4 points

Mark each statement about gradient-boosting hyperparameters as true or false.

Show answer
  1. True — smaller per-tree contribution means each step is a smaller move toward the minimum, so more steps are needed. $M$ and $\nu$ are coupled.
  2. False — boosting overfits if $M$ is too large because residuals eventually become noise. The "use enough" rule applies to bagging/RF, not boosting.
  3. True — a tree of depth $d$ uses at most $d$ predictors on any path, so it captures up to $d$-way interactions. Stumps (depth 1) → at most one variable per term → additive model (cf. ISLP §8.4 Q2).
  4. False — the rule of thumb is $\nu \le 0.1$ (typically $0.1, 0.05, 0.01, 0.001$), never near $1$. Setting $\nu \approx 1$ would kill the "small step" property the algorithm relies on.

Atoms: weak-learner-and-learning-rate, gradient-boosting. Lecture: L20-boosting-2.

Question 5 3 points

In the bias-variance picture, the core mechanism by which the basic boosting algorithm improves over a single tree is best described as:

Show answer
Correct answer: C

The prof's framing: bagging/RF start with high-variance/low-bias trees and drive variance down by averaging. Boosting starts from the opposite end — high-bias/low-variance weak learners — and drives bias down by accumulating their corrections.

A is the bagging/random-forest mechanism, not boosting's. B confuses model error with the irreducible noise floor — no algorithm reduces $\sigma^2$. D ignores that $M$ too large overfits, increasing variance again.

Atoms: boosting, bias-variance-tradeoff. Lecture: L19-boosting-1.

Question 6 3 points

In AdaBoost, after fitting the $m$-th weak classifier $G_m$ with weighted error $\text{err}_m$, what happens to a training observation that $G_m$ correctly classifies, in the weight update $w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot \mathbb{1}[y_i \neq G_m(x_i)])$?

Show answer
Correct answer: C

For a correctly classified point, $\mathbb{1}[y_i \neq G_m(x_i)] = 0$, so the multiplier is $\exp(\alpha_m \cdot 0) = 1$. Misclassified points get $\exp(\alpha_m)$, which boosts their weight; correct points are simply left alone (weights are renormalised after the step).

A is the multiplier for misclassified points — the question asks about the correct ones. B describes a symmetric "boost good / cut bad" scheme that AdaBoost.M1 doesn't actually use. D conflates initialisation with the per-iteration update.

Atoms: adaboost. Lecture: L19-boosting-1.

Question 7 4 points

Mark each statement about AdaBoost as true or false.

Show answer
  1. False — AdaBoost uses the $\pm 1$ coding (essential for the exponential-loss formulation) and outputs the sign of the $\alpha$-weighted vote, not the raw sum and not a $\{0,1\}$ probability.
  2. True — the equivalence was discovered five years after AdaBoost itself ("most things are very obvious after the fact"). Plug exponential loss into Algorithm 10.3 and AdaBoost falls out.
  3. False — AdaBoost uses stumps ("essentially a bias term, very, very small trees"). Strong individual learners would over-explain each round and break the "weak learners that build on each other" logic.
  4. False — the inequality is reversed. $\alpha_m > 0$ when $\text{err}_m < 0.5$ (better than chance, gets a positive vote); $\alpha_m < 0$ when $\text{err}_m > 0.5$ (worse than chance, vote is flipped).

Atoms: adaboost, boosting-loss-functions. Lecture: L20-boosting-2.

Question 8 4 points Exam 2025 P6d

On the Boston housing data with 13 predictors, you fit a linear regression (test MSE 21.7), a GAM with B-splines (test MSE 14.5), and a gradient-boosted tree model (test MSE 11.2). Which is the most defensible one-sentence interpretation?

Show answer
Correct answer: D

This is exactly the prof's expected interpretation from the 2025 walkthrough: "It's probably because there's some non-linear interactions that weren't captured in the linear regression model." GAMs handle marginal non-linearities ($\sum f_j(X_j)$) but lack interaction terms; boosted depth-$\ge 2$ trees naturally model both.

A confuses training and test error — more parameters always lower training MSE, but the question is about test MSE, where the extra flexibility can help or hurt. B confuses boosting's machinery with ridge: boosting does not shrink regression coefficients; it adds many trees, each shrunk by $\nu$. C ignores the magnitudes of the gaps (≈10 MSE units between LR and boosting on a response of this scale is meaningful).

Atoms: gradient-boosting, boosting. Lecture: L27-summary.

Question 9 3 points Exam 2024 P2a

A genomic dataset has $n = 1000$ individuals and $p = 10000$ predictors. Which of the following statements is correct?

Show answer
Correct answer: B

Each split in each tree uses a single predictor (drawn from a possibly subsampled candidate set) — there is no $p \times p$ matrix to invert and no need for $p < n$. The 2024 exam Q2a explicitly listed "Boosted regression trees allow for $p>n$" as one of the correct statements. This is also why Exercise 9.4 uses a $p \gg n$ genomic simulation.

A reflects the OLS / forward-selection constraint — irrelevant for trees. C is unnecessary; trees do their own implicit feature selection. D wrongly restricts to stumps; deeper trees still work fine when $p > n$.

Atoms: boosting, gradient-boosting.

Question 10 3 points Ex9.3

In the gradient tree-boosting algorithm, where in the update would you insert the learning rate $\nu$? The current update reads $f_m(x) = f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$. The corrected update is:

Show answer
Correct answer: C

$\nu$ shrinks only the new tree's contribution, leaving the existing ensemble alone (Algorithm 10.3 step 2(d), Exercise 9.3). Operationally: take a small step in the direction of the freshly-fit tree.

A would shrink the previous ensemble at every iteration, decaying the model. B inverts the role — dividing by $\nu$ would amplify each new tree, the opposite of what shrinkage means. D shrinks the entire ensemble each iteration, decaying earlier trees exponentially with $m$.

Atoms: weak-learner-and-learning-rate, gradient-boosting.

Question 11 4 points

Mark each statement about boosting loss functions as true or false.

Show answer
  1. True — squaring a residual of magnitude 100 gives 10,000, which dominates the sum and bends the fit toward outliers. Absolute and Huber are linear in big residuals, so they downweight outlier influence.
  2. False — multinomial deviance fits $K$ trees per round, one per class, each fit to its class-specific gradient $\mathbb{1}(y = k) - p_k(x)$. This is the most-flagged exam trap in this atom.
  3. True — exactly Huber's design: quadratic on $|y-f| \le \delta$, linear beyond, with value and slope matching at the seam by construction; $\delta$ only sets where the seam sits.
  4. False — binomial deviance is the same negative log-likelihood as logistic regression, and its negative gradient is $y - p(x)$ (indicator minus probability), the prof's point that it "ends up sounding very similar to residuals".
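
A quick base-R check of item 4's claim, again via a finite difference (values are arbitrary):

```r
sigmoid <- function(f) 1 / (1 + exp(-f))
# Binomial deviance as a function of the score f, with p = sigmoid(f)
dev <- function(y, f) -(y * log(sigmoid(f)) + (1 - y) * log(1 - sigmoid(f)))

y <- 1; f <- 0.4; h <- 1e-6
neg_grad <- -(dev(y, f + h) - dev(y, f - h)) / (2 * h)
c(neg_grad, y - sigmoid(f))   # both ~0.401: indicator minus probability
```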

Atoms: boosting-loss-functions, gradient-boosting, logistic-regression.

Question 12 3 points

Stochastic gradient boosting modifies the standard algorithm in which way, and to what end?

Show answer
Correct answer: B

Friedman 2002 stochastic GBM: at each iteration, draw a fraction (typically $\le 0.5$) of rows without replacement, and optionally a fraction of columns, then fit the next tree on this subsample. Each tree sees different data, so the trees are more diverse and the ensemble's variance drops.

A describes neural-network mini-batch SGD — different "stochastic" mechanism. C claims bootstrap (with replacement); the prof flagged that this is exactly not what stochastic GBM does — it's without replacement. D describes the dropout trick from XGBoost, not stochastic subsampling.
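
Both subsampling hooks are exposed in the standard packages; a hedged sketch, with `train` and `dtrain` illustrative and argument names as I understand the `gbm` and `xgboost` interfaces:

```r
library(gbm)

# gbm: each tree is fit on a random 50% of rows, drawn without replacement
fit <- gbm(y ~ ., data = train, distribution = "gaussian",
           n.trees = 3000, shrinkage = 0.01, interaction.depth = 2,
           bag.fraction = 0.5)

# xgboost: per-tree row and column subsampling
params <- list(objective = "reg:squarederror", eta = 0.05, max_depth = 3,
               subsample = 0.5, colsample_bytree = 0.7)
# fit_xgb <- xgboost::xgb.train(params, dtrain, nrounds = 3000)
```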

Atoms: stochastic-gradient-boosting, regularization.

Question 13 4 points

Mark each statement about XGBoost as true or false. (Internals are out of scope; "in scope" means the conceptual menu of what it adds on top of vanilla GBM.)

Show answer
  1. True — XGBoost's "Newton" step uses both the gradient (direction) and the Hessian (curvature), giving better-sized steps and typically fewer iterations.
  2. False — same gradient-boosting skeleton, just engineered (parallelism, approximate split search) and more regularized (leaf $L_1/L_2$, dropout). "Same algorithm, more engineering."
  3. True — the penalty $\Omega(f) = \gamma|T| + \tfrac12 \lambda \|w\|^2$ (or with $\|w\|_1$) imports ridge/lasso onto the leaf-value vector $w$.
  4. False — XGBoost natively exposes both `subsample` (row) and `colsample_bytree` (column) parameters; it inherits everything stochastic GBM does and adds more.

Atoms: xgboost, stochastic-gradient-boosting. Lecture: L20-boosting-2.

Question 14 3 points

For a fitted black-box model $\hat f$, the empirical partial dependence of $\hat f$ on a single predictor $X_j$ at a value $x_j$ is computed as:

Show answer
Correct answer: A

You sweep $X_j$ over a grid of values, and at each $x_j$ replace the $j$-th coordinate of every training row with $x_j$ while keeping the rest as observed, then average the predictions. The empirical $X_{-j}$ stands in for the unknown joint distribution.

B plugs in the marginal mean of $X_{-j}$ — that's a "predict at the centroid" trick, not a PDP, and it ignores the joint distribution. C usually has zero or one matching observations, so it's degenerate (and confounded with $X_j$'s marginal density). D plugs in zeros, an arbitrary point that may be far outside the data.
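
A base-R sketch of this estimator for any fitted model with a `predict` method (the model, data frame, and variable names are illustrative; a custom `predict_fun` is allowed because e.g. `gbm` needs `n.trees`):

```r
partial_dependence <- function(model, data, var, grid,
                               predict_fun = function(m, d) predict(m, newdata = d)) {
  sapply(grid, function(x_j) {
    data_j <- data
    data_j[[var]] <- x_j                 # pin X_j at x_j in every row
    mean(predict_fun(model, data_j))     # average over the empirical X_{-j}
  })
}

# Illustrative usage:
# grid <- seq(min(train$rm), max(train$rm), length.out = 50)
# pd   <- partial_dependence(fit, train, "rm", grid,
#                            predict_fun = function(m, d) predict(m, d, n.trees = M_star))
# plot(grid, pd, type = "l", xlab = "rm", ylab = "partial dependence")
```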

Atoms: partial-dependence-plots. Lecture: L21-unsupervised-1.

Question 15 3 points

Mark each statement about partial dependence plots (PDPs) as true or false.

Show answer
  1. False — a PDP is a marginal average of model predictions; it tells you what the model learned, not what would happen in the world if $X_j$ changed and nothing else did.
  2. True — the prof's "tiny house with 50 bedrooms" caution: the PDP estimator pins $X_j$ and averages over the marginal of $X_{-j}$, which under high correlation creates implausible synthetic rows.
  3. True — the standard pairing for tree-ensemble interpretability: importance plots rank variables, PDPs reveal the shape of the dependence on the top-ranked ones.

Atoms: partial-dependence-plots, variable-importance.

Question 16 4 points

A boosted tree fit on Boston housing produces a variable-importance plot showing rm (rooms) and lstat (% lower-SES) at the top, well above crim and nox. What is the intended interpretation?

Show answer
Correct answer: D

Variable importance for a tree ensemble aggregates either total impurity reduction across all splits using that variable, or out-of-bag performance drop when that variable is permuted. Both flavors say "this predictor was relied on most by the model."

A overclaims: importance is correlational, not causal — the prof distinguished this from PDPs and from causal questions repeatedly. B confuses ensembles with linear regression; boosted trees have no scalar "coefficients" per predictor. C imports a hypothesis-test framework that doesn't apply to tree-based importance.
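
With a `gbm` fit like the one sketched under Question 2, the standard calls are (a hedged sketch; `fit` and `M_star` as before):

```r
# Relative influence: summed empirical improvement over all splits on each variable
summary(fit, n.trees = M_star)

# Follow up with partial dependence on the top-ranked predictors
plot(fit, i.var = "rm",    n.trees = M_star)
plot(fit, i.var = "lstat", n.trees = M_star)
```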

Atoms: variable-importance, partial-dependence-plots. Lecture: L19-boosting-1.

Question 17 4 points

A gradient-boosting fit on Ames housing reports training MSE and 10-fold CV MSE as functions of the number of trees $M$. Training MSE drops monotonically; CV MSE drops sharply, hits a minimum near $M^\star = 1100$, then slowly rises. What is the recommended choice and why?

Show answer
Correct answer: B

Standard early-stopping recipe: $\hat M = \arg\min_M \mathrm{CV}(M)$. Past the minimum, residuals are mostly noise and the new trees fit it, so generalization deteriorates. The U-shape on the CV curve is exactly the signal you're using.

A is the bagging/RF rule, not boosting's — for boosting, $M$ overfits. C is meaningless: CV MSE is essentially always above training MSE; the gap measures generalisation cost, not an optimum. D doesn't happen with iid samples (CV MSE typically stays above training throughout).
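
In code the rule is an argmin over the stored CV curve; a sketch assuming a vector `cv_mse` whose $M$-th entry is the 10-fold CV MSE after $M$ trees:

```r
M_star <- which.min(cv_mse)   # ~1100 in this scenario
# For a gbm fit with cv.folds > 1 the same curve lives in fit$cv.error, so
# M_star <- which.min(fit$cv.error), which is what gbm.perf(fit, method = "cv") reports.
```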

Atoms: gradient-boosting, cross-validation. Lecture: L20-boosting-2.

Question 18 3 points ISLP §8 Q2

Boosting with depth-one trees (stumps) leads to an additive model $f(X) = \sum_{j=1}^p f_j(X_j)$. Why?

Show answer
Correct answer: C

A depth-1 tree picks one split on one predictor and outputs two leaf values — that is, its contribution depends on a single $X_j$. Summing $M$ such trees gives $\sum_j f_j(X_j)$, where $f_j$ collects the contributions of all stumps split on $X_j$. (ISLP §8.4 Q2.)

A overstates: stumps are non-linear (step functions). The point isn't linearity, it's that no interaction terms appear. B is irrelevant — $\nu$ scales each tree but does not change which variables the tree uses. D gets the leaf count right (two regions per stump) but the conclusion wrong: additivity comes from "one variable per term," not from leaf count — a depth-2 tree also has a small leaf count yet introduces interactions.

Atoms: boosting, weak-learner-and-learning-rate.

Question 19 4 points

Which of the following is not a regularization mechanism present in standard gradient-boosted machines (or their XGBoost extension)?

Show answer
Correct answer: D

Deeper trees do the opposite of regularizing — each individual tree becomes a stronger learner, increasing variance per step and undermining the "many small corrections" philosophy. The other three (shrinkage, early stopping, subsampling) are exactly the prof's "three regularization strategies" in boosting; XGBoost adds two more (leaf $L_1/L_2$, dropout).

A, B, and C are textbook regularizers in this context. D is the trap — it's the random-forest reflex applied to the wrong setting. RF wants high-variance/low-bias trees; boosting wants the opposite.

Atoms: regularization, weak-learner-and-learning-rate, stochastic-gradient-boosting.

Question 20 3 points

XGBoost's penalty $\Omega(f) = \gamma |T| + \tfrac{1}{2} \lambda \|w\|^2$ (with $|T|$ the number of leaves and $w$ the leaf-value vector) is best described as:

Show answer
Correct answer: A

$\gamma|T|$ is exactly cost-complexity pruning ($\alpha|T|$ from module 8) applied per boosting iteration; $\tfrac12 \lambda \|w\|^2$ is ridge applied to leaf weights. The prof's framing: same regularizers as elsewhere in the course, only here the parameter vector is the leaf values rather than regression coefficients.

B confuses XGBoost with linear models — there are no $\beta$'s here. C overlays a Bayesian story the prof explicitly excluded ("really don't think I'd put this on the test"). D substitutes a classification loss for a regularization penalty; the two are different terms in the objective.
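
In the `xgboost` R interface both terms map onto explicit parameters; a hedged sketch (the `dtrain` matrix is illustrative):

```r
params <- list(
  objective = "reg:squarederror",
  eta       = 0.05,   # shrinkage nu
  max_depth = 3,
  gamma     = 1.0,    # per-leaf penalty: the gamma * |T| term (cost-complexity analogue)
  lambda    = 1.0,    # L2 penalty on leaf values w (ridge analogue)
  alpha     = 0       # optional L1 penalty on leaf values (lasso analogue)
)
# cv <- xgboost::xgb.cv(params, dtrain, nrounds = 2000, nfold = 10)
```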

Atoms: xgboost, regularization, cost-complexity-pruning.

Question 21 3 points

In AdaBoost, weak classifier $G_3$ has weighted error $\text{err}_3 = 0.20$. What is its classifier weight $\alpha_3 = \log\!\big((1 - \text{err}_3)/\text{err}_3\big)$, and what does its sign mean?

Show answer
Correct answer: A

$(1-0.20)/0.20 = 4$, so $\alpha_3 = \log 4 \approx 1.39$. Because $\text{err}_3 < 0.5$, the ratio is $> 1$ and $\alpha_3 > 0$: a better-than-chance classifier gets a positive weight in the final $\operatorname{sign}(\sum_m \alpha_m G_m(x))$ vote.

B inverts the ratio inside the log — the formula is $(1-\text{err})/\text{err}$, not $\text{err}/(1-\text{err})$. C confuses $\alpha_m$ with $\text{err}_m$. D mixes up AdaBoost.M1's $\alpha$ with the $\tfrac12 \log(\cdot)$ convention that appears in other formulations of AdaBoost (and even there, the ratio inside the log is unchanged).
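
The arithmetic in base R (this also covers the Question 6 weight-update rule):

```r
err_m   <- 0.20
alpha_m <- log((1 - err_m) / err_m)   # log(4) ~ 1.386 > 0: better than chance

# Per-observation weight multipliers for the next round (before renormalisation):
exp(alpha_m * 1)   # misclassified point: weight multiplied by 4
exp(alpha_m * 0)   # correctly classified point: weight unchanged (multiplier 1)
```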

Atoms: adaboost.

Question 22 4 points

Mark each statement as true or false.

Show answer
  1. False — bagging's $B$ is "use enough": extra trees average more with diminishing returns, but never inflate OOB error. There is no overfitting-in-$B$ for bagging.
  2. False — boosting overfits if $M$ is too large because each new tree fits whatever residual is left, and after the signal is exhausted that's just noise.
  3. True — the standard contrast: RF wants high-variance/low-bias deep trees to average; boosting wants high-bias/low-variance weak learners to sum.
  4. False — the bias-variance roles are inverted. RF reduces variance via decorrelated averaging (each tree is low-bias/high-variance); boosting reduces bias via sequential corrections (each tree is high-bias/low-variance).

Atoms: boosting, random-forest, bias-variance-tradeoff.

Question 23 3 points Ex9.2

In Algorithm 10.3 (gradient tree boosting) at iteration $m$, after computing the negative-gradient pseudo-residuals $r_{im}$, the next step (2b) is:

Show answer
Correct answer: D

Step 2b of Algorithm 10.3: fit a regression tree to the negative-gradient pseudo-residuals by least squares to obtain the regions $R_{jm}$. Step 2c then fills in the leaf values $\gamma_{jm}$ using the original loss $L$ on those regions. The least-squares fit on residuals is the "smooth proxy" the prof emphasized.

A would just refit the original problem, wasting the gradient information. B is the AdaBoost-style sample reweighting, which is a different mechanism (and applies to classification, not arbitrary loss). C is the initialization step (step 1, $f_0$), not the per-iteration tree fit.
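
A minimal sketch of the loop for squared-error loss, where the pseudo-residuals are plain residuals and the least-squares leaf means already solve step 2c (uses `rpart` as the weak learner; all names are illustrative, and `maxdepth = 1` would give the stump/additive special case):

```r
library(rpart)

boost_sketch <- function(train, M = 500, nu = 0.1, depth = 2) {
  y <- train$y
  f <- rep(mean(y), nrow(train))                 # step 1: f_0 = mean(y) for squared error
  trees <- vector("list", M)
  for (m in seq_len(M)) {
    train$r <- y - f                             # step 2a: pseudo-residuals = residuals
    tree <- rpart(r ~ . - y, data = train,       # step 2b: least-squares tree on residuals
                  control = rpart.control(maxdepth = depth, cp = 0))
    f <- f + nu * predict(tree, train)           # steps 2c-2d: leaf means, shrunk by nu
    trees[[m]] <- tree
  }
  list(f0 = mean(y), nu = nu, trees = trees)     # new-data prediction: f0 + nu * sum of tree predictions
}
```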

Atoms: gradient-boosting.

Question 24 4 points

You have a regression dataset where a sensor occasionally records implausible spikes — a few $y$ values are roughly 100× the typical scale. You want a boosting fit that is not heavily distorted by these outliers but still fits the bulk of the data smoothly. Which loss is the most appropriate?

Show answer
Correct answer: B

Huber is exactly the engineered compromise: quadratic ($(y-f)^2$) for $|y-f| \le \delta$ (smooth, doesn't aggressively zero things), linear ($2\delta|y-f| - \delta^2$) beyond (doesn't blow up on outliers). Designed for this scenario.

A is the loss the question's setup is warning against: 100×-scale residuals get squared into 10,000×-scale penalties, which yanks the fit toward the outliers. C is a binary-classification loss (and exponential growth makes outliers even worse, not better). D is for $K$-class classification — wrong problem entirely.
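
A short base-R version of this loss (ESL Table 10.2 form) and its value at a 100x-scale residual:

```r
huber <- function(y, f, delta = 1) {
  r <- y - f
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)   # quadratic inside, linear outside
}

huber(y = c(0.5, 1, 100), f = 0, delta = 1)
# 0.25  1  199  -- the 100x outlier costs ~199 here versus 10,000 under squared error
```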

Atoms: boosting-loss-functions. Lecture: L20-boosting-2.

Question 25 4 points

Mark each statement as true or false.

Show answer
  1. True — the prof's canonical strategy. Joint-tuning $M$ and $\nu$ wastes compute because they are coupled along an iso-fit curve.
  2. True — with $\nu = 1$ each tree gets full weight; the ensemble jumps to a near-perfect training fit immediately and loses the "many small corrections" diversity.
  3. False — depth controls the interaction order the model can capture and how strong each individual learner is. Bias-variance is precisely the lens.
  4. True — the four canonical hyperparameters. XGBoost adds more (leaf $L_1/L_2$, $\gamma$, dropout, subsample fractions), but the GBM core is these four.

Atoms: weak-learner-and-learning-rate, gradient-boosting.

Question 26 3 points Exam 2024 P5d

On a binary classification dataset (orange-juice purchase, $n_{\text{train}} = 800$, $\sim 60/40$ class split), you fit a boosted tree with `distribution = "bernoulli"`, `n.trees = 1000`, `shrinkage = 0.01`, `interaction.depth = 2`, `cv.folds = 10`. The CV-optimal number of trees is 280, giving a test misclassification error of 0.18 — slightly higher than logistic regression on the same data. The most defensible interpretation is:

Show answer
Correct answer: B

From the 2024 exam answer key: most students find boosting's error no lower than logistic regression / GAM here, and the prof's expected interpretation is that the data is largely linear in the log-odds, so boosting's extra complexity isn't paying off; a simpler model can, and arguably should, be preferred (Occam's razor).

A is the textbook over-claim — boosting wins often, not always; the bias reduction has to outweigh the added variance before the extra flexibility pays off. C overstates: the imbalance here (60/40) is mild, and the error rate is fine to report alongside other metrics. D is fabricated — `cv.folds` is the canonical hook the slides and the textbook recommend.

Atoms: gradient-boosting, cross-validation, bias-variance-tradeoff.

Question 27 3 points

Consider the regression-boosting update from Algorithm 8.2: $$\hat f(x) \leftarrow \hat f(x) + \nu\, \hat f^b(x), \qquad r_i \leftarrow r_i - \nu\, \hat f^b(x_i).$$ Suppose at iteration $b$, the new tree's prediction at observation $i$ is $\hat f^b(x_i) = 2$, the current residual is $r_i = 5$, and $\nu = 0.1$. After this step, the residual $r_i$ becomes:

Show answer
Correct answer: A

$r_i \leftarrow r_i - \nu\, \hat f^b(x_i) = 5 - 0.1 \cdot 2 = 5 - 0.2 = 4.8$. Each step nudges the residual by a small fraction of the new tree's prediction; that's the whole point of $\nu$ — keep the corrections small.

B flips the sign — the residual is being reduced by what the new tree explains, not increased. C forgets to apply $\nu$ and subtracts the full tree contribution, defeating shrinkage. D wrongly multiplies the residual by $\nu$ as if $\nu$ scaled the existing residual rather than the new tree.

Atoms: gradient-boosting, weak-learner-and-learning-rate.

Question 28 3 points

You compare two GBM runs on the same dataset, both with depth-3 trees and 10-fold CV: run A uses $\nu = 0.001$, run B uses $\nu = 0.1$. Holding everything else fixed, you'd expect:

Show answer
Correct answer: C

Smaller $\nu$ means each tree contributes less per step, so more steps are needed to traverse the same fit landscape. The optimal $M$ scales roughly as $1/\nu$ along the iso-fit curve.

A overstates: a small $\nu$ does slow learning, but the CV curve still hits a minimum once enough trees are added — it just happens at a much larger $M$. B ignores the $M$–$\nu$ coupling that the prof spent significant time on. D claims $\nu$ amplifies the trees, but $\nu$ is a multiplier in $(0,1]$ that shrinks them; small $\nu$ = small contribution per step, the opposite of overfit-friendly.

Atoms: weak-learner-and-learning-rate, cross-validation. Lecture: L20-boosting-2.

Question 29 4 points

A student replaces the depth-4 trees in a gradient-boosting fit with much deeper trees (depth 20) of comparable interpretation cost, keeping $\nu$ and $M$ as before. CV error rises sharply. The most accurate diagnosis:

Show answer
Correct answer: A

Boosting's whole point is that each tree is a weak learner — a small step in the gradient direction. A depth-20 tree is essentially a strong learner, fitting most of the residual at its own iteration; subsequent trees see noise, the ensemble's variance shoots up. The prof flagged this directly: "We want learners that make good progress and a step in the right direction, but not ones that are going to throw you out."

B reflexively claims "deeper = lower bias" without thinking about the variance pay-back, the textbook bias-variance trap. C is wrong: deeper trees do not turn boosting into RF; sequential vs parallel structure is unchanged, and $M$ still matters. D conflates the model's variance with the data's irreducible error — the latter is fixed by $\sigma^2$ and cannot be moved by the model.

Atoms: weak-learner-and-learning-rate, bias-variance-tradeoff, boosting. Lecture: L20-boosting-2.