Module 09 — Boosting and Additive Trees
29 questions · 100 points · ~45 min
Question 1
3 points
Ex9.1
Which statement best captures the conceptual difference between bagging and boosting as ensemble methods?
- A Bagging fits trees in parallel on bootstrap samples (variance reduction); boosting fits trees sequentially, each correcting the previous fit (bias reduction).
- B Bagging averages many low-variance trees fit on the original data; boosting averages many high-variance trees fit on bootstrap samples.
- C Bagging and boosting both fit trees sequentially, but bagging uses the residuals while boosting uses the raw response.
- D Bagging uses random subsets of predictors at each split; boosting uses random subsets of observations at each split.
Show answer
Correct answer: A
Bagging is parallel and aimed at variance reduction (averaging deep, low-bias trees grown on bootstrap samples). Boosting is sequential and aimed at bias reduction (each new weak learner corrects the previous ensemble).
B inverts the bias-variance roles. C confuses the two methods — boosting is the one that fits residuals. D describes the random-forest decorrelation trick (bagging restricted via $m$), not the bagging-vs-boosting axis.
Atoms: boosting, bagging. Lecture: L19-boosting-1.
Question 2
4 points
Exam 2025 P6c
You fit a gradient-boosting model and need to specify the number of trees $M$, the tree depth $d$, and the shrinkage $\nu$. Which approach for picking these values does the prof advocate?
- A Set them to fixed defaults from the textbook ($M=1000$, $d=5$, $\nu=0.1$); the prof's slides explicitly recommend skipping tuning because boosting is robust to hyperparameter choices.
- B Pick the combination that minimises training MSE on the full training set, since boosting's regularization machinery already prevents the ensemble from chasing training noise.
- C Use cross-validation (or early stopping on a held-out validation curve) to pick the combination that minimises out-of-sample error.
- D Use AIC or BIC with the effective degrees of freedom of the boosted ensemble to penalise the joint complexity of $M$, $d$, and $\nu$, then minimise the criterion over a grid.
Show answer
Correct answer: C
The prof's verbatim answer to this exact question (2025 Q6c walkthrough): cross-validation, with $\nu$ fixed small and $M$ chosen at the CV minimum. `gbm()`'s `cv.folds` and `xgb.cv` are the canonical hooks.
A skips the data-driven step, but $M$ in particular overfits if too large — defaults won't generalise across datasets. B optimises against the model's own training noise; training error keeps falling as $M$ grows, never picking a meaningful stop. D applies the prof's least trusted model-selection family — he flagged AIC/BIC as making assumptions that "probably won't hold" and recommends CV instead.
Atoms: gradient-boosting, cross-validation. Lecture: L27-summary.
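A minimal sketch of this recipe in R. The data below is simulated purely for illustration (the exam's dataset is not reproduced); `gbm()` and `gbm.perf()` are the real hooks named above.

```r
library(gbm)

# Illustrative data -- a stand-in for whatever training frame you actually have.
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- sin(train$x1) + 0.5 * train$x2^2 + rnorm(n, sd = 0.3)

# Fix the shrinkage small, grow a generous number of trees,
# and let 10-fold CV pick M at the minimum of the validation curve.
fit <- gbm(y ~ ., data = train,
           distribution = "gaussian",
           n.trees = 3000,          # upper bound on M
           shrinkage = 0.01,        # nu fixed small
           interaction.depth = 2,   # d
           cv.folds = 10)

# Number of trees at the CV minimum (also plots the training and CV curves).
M_star <- gbm.perf(fit, method = "cv")
M_star
```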
Question 3
For the squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, the negative gradient $-\partial L / \partial f$ that gradient boosting fits each new tree to is:
- A $f - y$ (prediction minus response).
- B $(y - f)^2$ (the squared residual).
- C $-y \cdot f$ (negative product).
- D $y - f$ (the residual).
Show answer
Correct answer: D
$\partial L / \partial f = -(y - f)$, so $-\partial L / \partial f = y - f$. This is exactly the residual, which is why "fit a tree to the residuals" and "fit a tree to the negative gradient" coincide for squared-error loss.
A flips the sign — that's the positive gradient, which would push the fit away from the data. B is the loss itself, not its derivative. C is the form of the AdaBoost exponent inside $\exp(-yf)$, not a gradient.
Atoms: gradient-boosting, boosting-loss-functions. Lecture: L20-boosting-2.
Question 4
Mark each statement about gradient-boosting hyperparameters as true or false.
Show answer
- True — smaller per-tree contribution means each step is a smaller move toward the minimum, so more steps are needed. $M$ and $\nu$ are coupled.
- False — boosting overfits if $M$ is too large because residuals eventually become noise. The "use enough" rule applies to bagging/RF, not boosting.
- True — a tree of depth $d$ uses at most $d$ predictors on any path, so it captures up to $d$-way interactions. Stumps (depth 1) → at most one variable per term → additive model (cf. ISLP §8.4 Q2).
- False — the rule of thumb is $\nu \le 0.1$ (typically $0.1, 0.05, 0.01, 0.001$), never near $1$. Setting $\nu \approx 1$ would kill the "small step" property the algorithm relies on.
Atoms: weak-learner-and-learning-rate, gradient-boosting. Lecture: L20-boosting-2.
Question 5
In the bias-variance picture, the core mechanism by which the basic boosting algorithm improves over a single tree is best described as:
- A Reducing variance by averaging many low-bias deep trees.
- B Reducing the irreducible error $\sigma^2$ via repeated denoising of the response.
- C Reducing bias by summing many high-bias weak learners that each correct the previous fit.
- D Eliminating both bias and variance by choosing a sufficiently large number of trees.
Show answer
Correct answer: C
The prof's framing: bagging/RF start with high-variance/low-bias trees and drive variance down by averaging. Boosting starts from the opposite end — high-bias/low-variance weak learners — and drives bias down by accumulating their corrections.
A is the bagging/random-forest mechanism, not boosting's. B confuses model error with the irreducible noise floor — no algorithm reduces $\sigma^2$. D ignores that $M$ too large overfits, increasing variance again.
Atoms: boosting, bias-variance-tradeoff. Lecture: L19-boosting-1.
Question 6
In AdaBoost, after fitting the $m$-th weak classifier $G_m$ with weighted error $\text{err}_m$, the weight update is $w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot \mathbb{1}[y_i \neq G_m(x_i)])$. What happens to the weight of a training observation that $G_m$ classifies correctly?
- A Its weight is multiplied by $e^{\alpha_m}$, increasing it.
- B Its weight is multiplied by $e^{-\alpha_m}$, decreasing it.
- C Its weight is multiplied by $e^0 = 1$, so it is unchanged.
- D Its weight is set back to the initial value $1/N$.
Show answer
Correct answer: C
For a correctly classified point, $\mathbb{1}[y_i \neq G_m(x_i)] = 0$, so the multiplier is $\exp(\alpha_m \cdot 0) = 1$. Misclassified points get $\exp(\alpha_m)$, which boosts their weight; correct points are simply left alone (weights are renormalised after the step).
A is the multiplier for misclassified points — the question asks about the correct ones. B describes a symmetric "boost good / cut bad" scheme that AdaBoost.M1 doesn't actually use. D conflates initialisation with the per-iteration update.
Atoms: adaboost. Lecture: L19-boosting-1.
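A toy numeric sketch of this reweighting step. The labels, predictions, and weights below are made up for illustration; only the update from the question is shown, not a full AdaBoost run.

```r
# One AdaBoost.M1 reweighting step on made-up numbers.
y  <- c(1, 1, -1, -1, 1)        # true labels in {-1, +1}
Gm <- c(1, -1, -1, -1, -1)      # m-th weak classifier's predictions
w  <- rep(1 / 5, 5)             # current observation weights

miss    <- as.numeric(y != Gm)            # indicator 1[y_i != G_m(x_i)]
err_m   <- sum(w * miss) / sum(w)         # weighted error
alpha_m <- log((1 - err_m) / err_m)       # classifier weight

# Correct points get multiplier exp(alpha_m * 0) = 1; misses get exp(alpha_m).
w_new <- w * exp(alpha_m * miss)
w_new <- w_new / sum(w_new)               # renormalise after the step
rbind(before = w, after = round(w_new, 3))
```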
Question 7
Mark each statement about AdaBoost as true or false.
Show answer
- False — AdaBoost uses the $\pm 1$ coding (essential for the exponential-loss formulation) and outputs the sign of the $\alpha$-weighted vote, not the raw sum and not a $\{0,1\}$ probability.
- True — the equivalence was discovered five years after AdaBoost itself ("most things are very obvious after the fact"). Plug exponential loss into Algorithm 10.3 and AdaBoost falls out.
- False — AdaBoost uses stumps ("essentially a bias term, very, very small trees"). Strong individual learners would over-explain each round and break the "weak learners that build on each other" logic.
- False — the inequality is reversed. $\alpha_m > 0$ when $\text{err}_m < 0.5$ (better than chance, gets a positive vote); $\alpha_m < 0$ when $\text{err}_m > 0.5$ (worse than chance, vote is flipped).
Atoms: adaboost, boosting-loss-functions. Lecture: L20-boosting-2.
Question 8
4 points
Exam 2025 P6d
On the Boston housing data with 13 predictors, you fit a linear regression (test MSE 21.7), a GAM with B-splines (test MSE 14.5), and a gradient-boosted tree model (test MSE 11.2). Which is the most defensible one-sentence interpretation?
- A The boosted model wins because it has strictly more effective parameters than linear regression and the GAM, and a model with more parameters mechanically fits any held-out test set better than a simpler one.
- B The boosted model wins because the shrinkage parameter $\nu$ behaves like the ridge penalty $\lambda$ on a hidden vector of regression coefficients, so the gain over linear regression is the same coefficient-shrinkage effect that drove the ridge improvements in module 6.
- C The gaps between the three test MSEs are within Monte-Carlo noise from the random train/test split; the three methods should be regarded as effectively tied on this dataset and the choice between them comes down to interpretability.
- D The boosted model wins because there are non-linear interactions between predictors that linear regression cannot capture, while GAM captures non-linearities but not the interactions.
Show answer
Correct answer: D
This is exactly the prof's expected interpretation from the 2025 walkthrough: "It's probably because there's some non-linear interactions that weren't captured in the linear regression model." GAMs handle marginal non-linearities ($\sum f_j(X_j)$) but lack interaction terms; boosted depth-$\ge 2$ trees naturally model both.
A confuses training and test error — more parameters always lower training MSE, but the question is about test MSE, where overfitting can win or lose. B confuses boosting's machinery with ridge: boosting doesn't shrink regression coefficients, it adds many trees with shrinkage on the trees. C ignores the magnitudes of the gaps (≈10 MSE units between LR and boosting on a response with this scale is meaningful).
Atoms: gradient-boosting, boosting. Lecture: L27-summary.
Question 9
3 points
Exam 2024 P2a
A genomic dataset has $n = 1000$ individuals and $p = 10000$ predictors. Which of the following statements is correct?
- A Boosted regression trees cannot be fit in this setting; like OLS, the routine requires $p < n$.
- B Boosted regression trees can be fit; they handle $p > n$ because each tree splits on at most a handful of predictors.
- C Boosted regression trees can be fit, but only after first reducing dimensionality via PCA onto the top $k \ll n$ components.
- D Boosted regression trees can only be fit if you restrict the ensemble to stumps ($d = 1$); deeper trees would have too few observations per terminal node when $p > n$.
Show answer
Correct answer: B
Each tree at each split picks one predictor (or a few, with column subsampling) — no $p \times p$ matrix to invert, no need for $p < n$. The 2024 exam Q2a explicitly listed "Boosted regression trees allow for $p>n$" as one of the correct statements. This is also why Exercise 9.4 uses a $p \gg n$ genomic simulation.
A reflects the OLS / forward-selection constraint — irrelevant for trees. C is unnecessary; trees do their own implicit feature selection. D wrongly restricts to stumps; deeper trees still work fine when $p > n$.
Atoms: boosting, gradient-boosting.
Question 10
3 points
Ex9.3
In the gradient tree-boosting algorithm, where in the update would you insert the learning rate $\nu$? The current update reads $f_m(x) = f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$. The corrected update is:
- A $f_m(x) = \nu \cdot f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$.
- B $f_m(x) = f_{m-1}(x) + \sum_j (\gamma_{jm}/\nu)\,\mathbb{1}(x \in R_{jm})$.
- C $f_m(x) = f_{m-1}(x) + \nu \cdot \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$.
- D $f_m(x) = \nu \cdot \big(f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})\big)$.
Show answer
Correct answer: C
$\nu$ shrinks only the new tree's contribution, leaving the existing ensemble alone (Algorithm 10.3 step 2(d), Exercise 9.3). Operationally: take a small step in the direction of the freshly-fit tree.
A would shrink the previous ensemble at every iteration, decaying the model. B inverts the role — dividing by $\nu$ would amplify each new tree, the opposite of what shrinkage means. D shrinks the entire ensemble each iteration, decaying earlier trees exponentially with $m$.
Atoms: weak-learner-and-learning-rate, gradient-boosting.
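A compact sketch of the loop with the shrinkage in place, for squared-error loss with `rpart` trees on simulated data. Everything below is illustrative, not the course's code.

```r
library(rpart)

# Toy gradient boosting for squared-error loss on simulated data.
set.seed(1)
n   <- 300
dat <- data.frame(x = runif(n, -3, 3))
y   <- sin(dat$x) + rnorm(n, sd = 0.3)

nu <- 0.1               # learning rate
M  <- 200               # number of trees
f  <- rep(mean(y), n)   # f_0: the optimal constant under squared error

for (m in 1:M) {
  dat$r <- y - f        # pseudo-residuals = negative gradient for squared error
  tree  <- rpart(r ~ x, data = dat,
                 control = rpart.control(maxdepth = 2, cp = 0))
  # Shrink ONLY the new tree's contribution (option C); f_{m-1} is left untouched.
  f <- f + nu * predict(tree, dat)
}

mean((y - f)^2)         # training MSE after M shrunken steps
```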
Question 11
Mark each statement about boosting loss functions as true or false.
Show answer
- True — squaring a residual of magnitude 100 gives 10,000, which dominates the sum and bends the fit toward outliers. Absolute and Huber are linear in big residuals, so they downweight outlier influence.
- False — multinomial deviance fits $K$ trees per round, one per class, each fit to its class-specific gradient $\mathbb{1}(y = k) - p_k(x)$. This is the most-flagged exam trap in this atom.
- True — exactly Huber's design: quadratic on $|y-f| \le \delta$, linear beyond, smooth at the seam if $\delta$ is chosen well.
- False — binomial deviance is the same negative log-likelihood as logistic regression, and its negative gradient is $y - p(x)$ (indicator minus probability), which is the prof's "indicator minus probability ends up sounding very similar to residuals".
Atoms: boosting-loss-functions, gradient-boosting, logistic-regression.
Question 12
Stochastic gradient boosting modifies the standard algorithm in which way, and to what end?
- A It replaces the deterministic gradient with a noisy mini-batch estimate, mirroring SGD in neural networks.
- B It subsamples rows (and optionally columns) without replacement before each tree, encouraging diversity and reducing variance.
- C It bootstraps the training data with replacement before each tree, mirroring bagging.
- D It randomly drops a fraction of already-fit trees during prediction to reduce reliance on early trees.
Show answer
Correct answer: B
Friedman 2002 stochastic GBM: at each iteration, draw a fraction (typically $\le 0.5$) of rows without replacement, and optionally a fraction of columns, then fit the next tree on this subsample. Each tree sees different data, the ensemble is more diverse, ensemble variance drops.
A describes neural-network mini-batch SGD — different "stochastic" mechanism. C claims bootstrap (with replacement); the prof flagged that this is exactly not what stochastic GBM does — it's without replacement. D describes the dropout trick from XGBoost, not stochastic subsampling.
Atoms: stochastic-gradient-boosting, regularization.
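In `gbm` the row-subsampling fraction is exposed as `bag.fraction` (the without-replacement fraction drawn before each tree). A sketch, reusing the simulated `train` frame from the Question 2 sketch above:

```r
library(gbm)

# Stochastic GBM: each tree is fit on a random half of the rows,
# drawn WITHOUT replacement (Friedman 2002), not a bootstrap sample.
fit_sgb <- gbm(y ~ ., data = train,
               distribution = "gaussian",
               n.trees = 3000,
               shrinkage = 0.01,
               interaction.depth = 2,
               bag.fraction = 0.5,   # row subsample fraction per tree
               cv.folds = 10)
gbm.perf(fit_sgb, method = "cv")
```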
Question 13
Mark each statement about XGBoost as true or false. (Internals are out of scope; "in scope" means the conceptual menu of what it adds on top of vanilla GBM.)
Show answer
- True — XGBoost's "Newton" step uses both the gradient (direction) and the Hessian (curvature), giving better-sized steps and typically fewer iterations.
- False — same gradient-boosting skeleton, just engineered (parallelism, approximate split search) and more regularized (leaf $L_1/L_2$, dropout). "Same algorithm, more engineering."
- True — the penalty $\Omega(f) = \gamma|T| + \tfrac12 \lambda \|w\|^2$ (or with $\|w\|_1$) imports ridge/lasso onto the leaf-value vector $w$.
- False — XGBoost natively exposes both `subsample` (row) and `colsample_bytree` (column) parameters; it inherits everything stochastic GBM does and adds more.
Atoms: xgboost, stochastic-gradient-boosting. Lecture: L20-boosting-2.
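A sketch of where these knobs live in the R `xgboost` interface (parameter names as exposed by that package; the simulated regression data is illustrative only):

```r
library(xgboost)

# Illustrative regression data.
set.seed(1)
X <- matrix(rnorm(500 * 10), ncol = 10)
y <- sin(X[, 1]) + X[, 2] * X[, 3] + rnorm(500, sd = 0.3)
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(
  objective        = "reg:squarederror",
  eta              = 0.05,  # shrinkage (nu)
  max_depth        = 3,     # tree depth (d)
  subsample        = 0.5,   # row subsampling, as in stochastic GBM
  colsample_bytree = 0.8,   # column subsampling per tree
  lambda           = 1,     # L2 penalty on leaf weights w
  alpha            = 0,     # L1 penalty on leaf weights w
  gamma            = 0      # penalty per extra leaf: gamma * |T|
)

# Early stopping on the CV curve picks the number of rounds (M).
cv <- xgb.cv(params, dtrain, nrounds = 2000, nfold = 10,
             early_stopping_rounds = 50, verbose = 0)
cv$best_iteration
```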
Question 14
For a fitted black-box model $\hat f$, the empirical partial dependence of $\hat f$ on a single predictor $X_j$ at a value $x_j$ is computed as:
- A $\bar f_j(x_j) = \frac{1}{N} \sum_{i=1}^N \hat f(x_j, x_{i, -j})$, averaging over the empirical $X_{-j}$.
- B $\bar f_j(x_j) = \hat f(x_j, \bar x_{-j})$, evaluating at the mean of the other predictors.
- C $\bar f_j(x_j) = \frac{1}{N} \sum_{i: x_{ij} = x_j} \hat f(x_i)$, restricting to observations whose $j$-th value equals $x_j$.
- D $\bar f_j(x_j) = \hat f(x_j, 0, 0, \ldots, 0)$, setting all other predictors to zero.
Show answer
Correct answer: A
You sweep $X_j$ over a grid of values, and at each $x_j$ replace the $j$-th coordinate of every training row with $x_j$ while keeping the rest as observed, then average the predictions. The empirical $X_{-j}$ stands in for the unknown joint distribution.
B plugs in the marginal mean of $X_{-j}$ — that's a "predict at the centroid" trick, not a PDP, and it ignores the joint distribution. C usually has zero or one matching observations, so it's degenerate (and confounded with $X_j$'s marginal density). D plugs in zeros, an arbitrary point that may be far outside the data.
Atoms: partial-dependence-plots. Lecture: L21-unsupervised-1.
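The recipe is short enough to write directly. The helper below is a hypothetical illustration of option A's formula (any fitted model with a predict function works; here an `lm` on the built-in `mtcars` data stands in for a boosted ensemble):

```r
# Empirical partial dependence of a fitted model on one predictor:
# sweep a grid of x_j values, overwrite that column in every training row,
# and average the predictions.
partial_dependence <- function(predict_fun, data, var, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[var]] <- v            # pin X_j = v for every row, keep X_{-j} as observed
    mean(predict_fun(data_v))     # average over the empirical distribution of X_{-j}
  })
}

# Example with a linear model; a gbm or random forest slots in the same way.
fit  <- lm(mpg ~ ., data = mtcars)
grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
pd   <- partial_dependence(function(d) predict(fit, newdata = d), mtcars, "hp", grid)
plot(grid, pd, type = "l", xlab = "hp", ylab = "partial dependence of predicted mpg")
```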
Question 15
Mark each statement about partial dependence plots (PDPs) as true or false.
Show answer
- False — a PDP is a marginal average of model predictions; it tells you what the model learned, not what would happen in the world if $X_j$ changed and nothing else did.
- True — the prof's "tiny house with 50 bedrooms" caution: the PDP estimator pins $X_j$ and averages over the marginal of $X_{-j}$, which under high correlation creates implausible synthetic rows.
- True — the standard pairing for tree-ensemble interpretability: importance plots rank variables, PDPs reveal the shape of the dependence on the top-ranked ones.
Atoms: partial-dependence-plots, variable-importance.
Question 16
A boosted tree fit on Boston housing produces a variable-importance plot showing `rm` (rooms) and `lstat` (% lower-SES) at the top, well above `crim` and `nox`. What is the intended interpretation?
- A `rm` and `lstat` have the largest causal effects on median house price after adjusting for the other predictors in the joint distribution.
- B `rm` and `lstat` are the predictors whose partial-dependence curves have the largest average slope, which is what the importance algorithm aggregates.
- C `crim` and `nox` failed an internal Wald-style test at each split and were therefore demoted at the 5% level.
- D `rm` and `lstat` contribute the most to reducing impurity (or to OOB performance under permutation) across the ensemble's splits.
Show answer
Correct answer: D
Variable importance for a tree ensemble aggregates either total impurity reduction across all splits using that variable, or out-of-bag performance drop when that variable is permuted. Both flavors say "this predictor was relied on most by the model."
A overclaims: importance is correlational, not causal — the prof distinguished this from PDPs and from causal questions repeatedly. B confuses ensembles with linear regression; boosted trees have no scalar "coefficients" per predictor. C imports a hypothesis-test framework that doesn't apply to tree-based importance.
Atoms: variable-importance, partial-dependence-plots. Lecture: L19-boosting-1.
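For a `gbm` fit, the impurity-reduction flavour is what `summary()` reports as relative influence. A one-liner, reusing `fit` and `M_star` from the Question 2 sketch above (the permutation flavour is the `randomForest`-side analogue):

```r
# Relative influence: total reduction in squared error attributable to each predictor,
# summed over all splits in the ensemble and normalised to sum to 100.
summary(fit, n.trees = M_star, plotit = FALSE)

# Permutation-based importance is the randomForest analogue:
# randomForest::importance(rf_fit, type = 1)   # mean decrease in accuracy under OOB permutation
```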
Question 17
A gradient-boosting fit on Ames housing reports training MSE and 10-fold CV MSE as functions of the number of trees $M$. Training MSE drops monotonically; CV MSE drops sharply, hits a minimum near $M^\star = 1100$, then slowly rises. What is the recommended choice and why?
- A Pick $M$ as large as compute allows; later trees keep refining without overfitting in boosting.
- B Pick $M = 1100$ (the CV minimum); past it, additional trees fit residual noise and CV error rises.
- C Pick $M$ so that training MSE matches CV MSE; that's where bias and variance balance.
- D Pick the smallest $M$ for which CV MSE first falls below training MSE.
Show answer
Correct answer: B
Standard early-stopping recipe: $\hat M = \arg\min_M \mathrm{CV}(M)$. Past the minimum, residuals are mostly noise and the new trees fit it, so generalization deteriorates. The U-shape on the CV curve is exactly the signal you're using.
A is the bagging/RF rule, not boosting's — for boosting $M$ overfits. C is meaningless: CV MSE is essentially always above training MSE, the gap is the generalisation cost, not the optimum. D doesn't happen with iid samples (CV MSE typically stays above training throughout).
Atoms: gradient-boosting, cross-validation. Lecture: L20-boosting-2.
Question 18
3 points
ISLP §8 Q2
Boosting with depth-one trees (stumps) leads to an additive model $f(X) = \sum_{j=1}^p f_j(X_j)$. Why?
- A Stumps can only represent strictly linear functions of a single predictor at a time, so summing them mechanically produces a linear-in-$X$ model with no curvature in any $f_j$.
- B The shrinkage parameter $\nu$ projects each tree's contribution onto the additive subspace, removing any interaction component before the new tree is added to the ensemble.
- C Each stump splits on exactly one variable, so its contribution is a function of one $X_j$ only; the sum of such terms is, by definition, additive.
- D Stumps have exactly two terminal nodes, and any two-leaf tree is mathematically equivalent to a single additive term, so summing them keeps the ensemble in the additive class.
Show answer
Correct answer: C
A depth-1 tree picks one split on one predictor and outputs two leaf values — that is, its contribution depends on a single $X_j$. Summing $M$ such trees gives $\sum_j f_j(X_j)$, where $f_j$ collects the contributions of all stumps split on $X_j$. (ISLP §8.4 Q2.)
A overstates: stumps are non-linear (step functions). The point isn't linearity, it's that no interaction terms appear. B is irrelevant — $\nu$ scales each tree but does not change which variables the tree uses. D gets the leaf count right (two regions per stump) but the conclusion wrong: additivity comes from "one variable per term," not from leaf count — a depth-2 tree also has a small leaf count yet introduces interactions.
Atoms: boosting, weak-learner-and-learning-rate.
Question 19
Which of the following is not a regularization mechanism present in standard gradient-boosted machines (or their XGBoost extension)?
- A Shrinking each new tree's contribution by the learning rate $\nu \le 0.1$ to take many small gradient steps.
- B Choosing $M$ by early stopping at the minimum of a held-out validation curve rather than the training one.
- C Subsampling rows (without replacement) and optionally columns before each tree to diversify successive trees.
- D Increasing each tree's depth so that early trees can capture the bulk of the signal in a few iterations.
Show answer
Correct answer: D
Deeper trees do the opposite of regularizing — each individual tree becomes a stronger learner, increasing variance per step and undermining the "many small corrections" philosophy. The other three (shrinkage, early stopping, subsampling) are exactly the prof's "three regularization strategies" in boosting; XGBoost adds two more (leaf $L_1/L_2$, dropout).
A, B, and C are textbook regularizers in this context. D is the trap — it's the random-forest reflex applied to the wrong setting. RF wants high-variance/low-bias trees; boosting wants the opposite.
Atoms: regularization, weak-learner-and-learning-rate, stochastic-gradient-boosting.
Question 20
XGBoost's penalty $\Omega(f) = \gamma |T| + \tfrac{1}{2} \lambda \|w\|^2$ (with $|T|$ the number of leaves and $w$ the leaf-value vector) is best described as:
- A Cost-complexity pruning $+$ ridge regression, with both penalties applied to the tree's leaves.
- B Lasso applied to a hidden vector of linear coefficients learned alongside the tree ensemble.
- C A Bayesian prior on the tree structure with a Laplace likelihood on leaf values.
- D Cross-entropy classification loss plus an extra bias term absorbing the ensemble's constant offset.
Show answer
Correct answer: A
$\gamma|T|$ is exactly cost-complexity pruning ($\alpha|T|$ from module 8) applied per boosting iteration; $\tfrac12 \lambda \|w\|^2$ is ridge applied to leaf weights. The prof's framing: same regularizers as elsewhere in the course, only here the parameter vector is the leaf values rather than regression coefficients.
B confuses XGBoost with linear models — there are no $\beta$'s here. C overlays a Bayesian story the prof explicitly excluded ("really don't think I'd put this on the test"). D substitutes a classification loss for a regularization penalty; the two are different terms in the objective.
Atoms: xgboost, regularization, cost-complexity-pruning.
Question 21
In AdaBoost, weak classifier $G_3$ has weighted error $\text{err}_3 = 0.20$. What is its classifier weight $\alpha_3 = \log\!\big((1 - \text{err}_3)/\text{err}_3\big)$, and what does its sign mean?
- A $\alpha_3 \approx \log 4 \approx 1.39$, positive — $G_3$ is better than chance and gets a positive vote in the final ensemble.
- B $\alpha_3 \approx \log(1/4) \approx -1.39$; the negative sign tells you AdaBoost has detected an over-confident weak learner and is flipping its predictions before adding them to the ensemble.
- C $\alpha_3 \approx 0.20$, equal to the weighted error itself; AdaBoost makes each classifier's voice in the final vote directly proportional to the error it just achieved.
- D $\alpha_3 = \tfrac12 \log(1 - 0.20) \approx 0.11$, positive but small because the standard formula uses a one-half log of $(1 - \text{err}_m)$ instead of a ratio.
Show answer
Correct answer: A
$(1-0.20)/0.20 = 4$, so $\alpha_3 = \log 4 \approx 1.39$. Because $\text{err}_3 < 0.5$, the ratio is $> 1$ and $\alpha_3 > 0$: a better-than-chance classifier gets a positive weight in the final $\operatorname{sign}(\sum_m \alpha_m G_m(x))$ vote.
B inverts the ratio inside the log — the formula is $(1-\text{err})/\text{err}$, not $\text{err}/(1-\text{err})$. C confuses $\alpha_m$ with $\text{err}_m$. D mixes up AdaBoost's $\alpha$ formula with the $\tfrac12 \log(\cdot)$ variant from the SAMME / "discrete AdaBoost" alternative coding (and even there, the ratio inside the log is unchanged).
Atoms: adaboost.
Question 22
Mark each statement as true or false.
Show answer
- False — bagging's $B$ is "use enough": extra trees average more with diminishing returns, but never inflate OOB error. There is no overfitting-in-$B$ for bagging.
- False — boosting overfits if $M$ is too large because each new tree fits whatever residual is left, and after the signal is exhausted that's just noise.
- True — the standard contrast: RF wants high-variance/low-bias deep trees to average; boosting wants high-bias/low-variance weak learners to sum.
- False — the bias-variance roles are inverted. RF reduces variance via decorrelated averaging (each tree is low-bias/high-variance); boosting reduces bias via sequential corrections (each tree is high-bias/low-variance).
Atoms: boosting, random-forest, bias-variance-tradeoff.
Question 23
3 points
Ex9.2
In Algorithm 10.3 (gradient tree boosting) at iteration $m$, after computing the negative-gradient pseudo-residuals $r_{im}$, the next step (2b) is:
- A Fit a regression tree to the original responses $y_i$ using least squares.
- B Fit a classification tree to $\operatorname{sign}(r_{im})$ to determine which observations to upweight next round.
- C Solve $\arg\min_\gamma \sum_i L(y_i, f_{m-1}(x_i) + \gamma)$ to obtain a single constant update.
- D Fit a regression tree to the pseudo-residuals $r_{im}$ using least squares, obtaining regions $R_{jm}$.
Show answer
Correct answer: D
Step 2b of Algorithm 10.3: fit a regression tree to the negative-gradient pseudo-residuals by least squares to obtain the regions $R_{jm}$. Step 2c then fills in the leaf values $\gamma_{jm}$ using the original loss $L$ on those regions. The least-squares fit on residuals is the "smooth proxy" the prof emphasized.
A would just refit the original problem, wasting the gradient information. B is the AdaBoost-style sample reweighting, which is a different mechanism (and applies to classification, not arbitrary loss). C collapses the update to a single constant — that is the form of the initialization step (step 1, $f_0 = \arg\min_\gamma \sum_i L(y_i, \gamma)$), not the per-iteration fit, which needs a tree to give region-specific updates $\gamma_{jm}$.
Atoms: gradient-boosting.
Question 24
You have a regression dataset where a sensor occasionally records implausible spikes — a few $y$ values are roughly 100× the typical scale. You want a boosting fit that is not heavily distorted by these outliers but still fits the bulk of the data smoothly. Which loss is the most appropriate?
- A Squared-error (Gaussian) loss — the default and the most efficient under Gaussian noise.
- B Huber loss — quadratic for small residuals, linear for large ones; smooth and outlier-tolerant.
- C Exponential loss — its sharp penalty isolates the outliers from the rest of the fit.
- D Multinomial deviance — the multi-class generalization handles the outlier "class" of spikes naturally.
Show answer
Correct answer: B
Huber is exactly the engineered compromise: quadratic ($(y-f)^2$) for $|y-f| \le \delta$ (smooth, doesn't aggressively zero things), linear ($2\delta|y-f| - \delta^2$) beyond (doesn't blow up on outliers). Designed for this scenario.
A is the loss the question's setup is warning against: 100×-scale residuals get squared into 10,000×-scale penalties, which yanks the fit toward the outliers. C is a binary-classification loss (and exponential growth makes outliers even worse, not better). D is for $K$-class classification — wrong problem entirely.
Atoms: boosting-loss-functions. Lecture: L20-boosting-2.
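A quick visual sketch of the comparison, using the same loss conventions as the explanation above; nothing here is course code.

```r
# Squared, absolute, and Huber loss as functions of the residual r = y - f.
huber <- function(r, delta = 1) {
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)  # quadratic core, linear tails
}

r <- seq(-10, 10, length.out = 400)
plot(r, r^2, type = "l", xlab = "residual y - f", ylab = "loss")  # squared: explodes on outliers
lines(r, abs(r), lty = 2)                                         # absolute: linear throughout
lines(r, huber(r, delta = 1), lty = 3)                            # Huber: quadratic near 0, linear beyond delta
legend("top", c("squared", "absolute", "Huber (delta = 1)"), lty = 1:3)
```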
Question 25
Mark each statement as true or false.
Show answer
- True — the prof's canonical strategy. Joint-tuning $M$ and $\nu$ wastes compute because they are coupled along an iso-fit curve.
- True — with $\nu = 1$ each tree gets full weight; the ensemble jumps to a near-perfect training fit immediately and loses the "many small corrections" diversity.
- False — depth controls the interaction order the model can capture and how strong each individual learner is. Bias-variance is precisely the lens.
- True — the four canonical hyperparameters. XGBoost adds more (leaf $L_1/L_2$, $\gamma$, dropout, subsample fractions), but the GBM core is these four.
Atoms: weak-learner-and-learning-rate, gradient-boosting.
Question 26
3 points
Exam 2024 P5d
On a binary classification dataset (orange-juice purchase, $n_{\text{train}} = 800$, $\sim 60/40$ class split), you fit a boosted tree with `distribution = "bernoulli"`, `n.trees = 1000`, `shrinkage = 0.01`, `interaction.depth = 2`, `cv.folds = 10`. The CV-optimal number of trees is 280, giving a test misclassification error of 0.18 — slightly higher than logistic regression on the same data. The most defensible interpretation is:
- A Boosting must be misconfigured; given a sufficiently large $M$ and a small $\nu$, boosting is guaranteed to dominate logistic regression on any binary classification problem because it is a universal function approximator.
- B The boosting fit is fine, but the data may be well-described by a (near-)linear / additive logistic model, so the extra flexibility brings little benefit and possibly more variance.
- C The misclassification rate is meaningless under any class imbalance, so the comparison itself is invalid; only the AUC under the ROC curve should be reported and compared across methods.
- D The prof would deduct for using `cv.folds` on a boosted fit; OOB error is the only acceptable validation strategy for boosting and the resulting CV-optimal $M$ is therefore not a defensible early-stopping target.
Show answer
Correct answer: B
From the 2024 exam answer key: most students find boosting's error not lower than logistic regression / GAM here, and the prof's expected interpretation is that the data is largely linear in the log-odds, so boosting's extra complexity isn't paying off. Simpler models can or should be used (Occam).
A is the textbook over-claim — boosting wins often, not always; bias-variance has to break even to pay for the extra flexibility. C overstates: the imbalance here (60/40) is mild, error rate is fine to report alongside other metrics. D is fabricated — `cv.folds` is the canonical hook the slides and the textbook recommend.
Atoms: gradient-boosting, cross-validation, bias-variance-tradeoff.
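A sketch of the exam's configuration in R. Treating the orange-juice data as ISLR's `OJ` set is an assumption here (the exam data isn't reproduced); the fitting settings follow the question verbatim.

```r
library(gbm)
library(ISLR)   # assuming the orange-juice data is ISLR's OJ -- an assumption, not stated above

set.seed(1)
idx   <- sample(nrow(OJ), 800)
train <- OJ[idx, ];  test <- OJ[-idx, ]
train$y <- as.numeric(train$Purchase == "MM")   # gbm's bernoulli wants a 0/1 response
train$Purchase <- NULL                          # keep only predictors plus the 0/1 response

fit <- gbm(y ~ ., data = train,
           distribution = "bernoulli",
           n.trees = 1000, shrinkage = 0.01,
           interaction.depth = 2, cv.folds = 10)

M_star <- gbm.perf(fit, method = "cv")                     # CV-optimal number of trees
p_hat  <- predict(fit, newdata = test, n.trees = M_star,
                  type = "response")                       # P(Purchase = "MM")
mean((p_hat > 0.5) != (test$Purchase == "MM"))             # test misclassification error
```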
Question 27
Consider the regression-boosting update from Algorithm 8.2:
$$\hat f(x) \leftarrow \hat f(x) + \nu\, \hat f^b(x), \qquad r_i \leftarrow r_i - \nu\, \hat f^b(x_i).$$
Suppose at iteration $b$, the new tree's prediction at observation $i$ is $\hat f^b(x_i) = 2$, the current residual is $r_i = 5$, and $\nu = 0.1$. After this step, the residual $r_i$ becomes:
- A $4.8$ (we subtract $\nu \cdot \hat f^b(x_i) = 0.2$).
- B $5.2$ (we add $\nu \cdot \hat f^b(x_i)$).
- C $3$ (we subtract the full prediction $\hat f^b(x_i) = 2$).
- D $0.5$ (residual is multiplied by $\nu = 0.1$).
Show answer
Correct answer: A
$r_i \leftarrow r_i - \nu\, \hat f^b(x_i) = 5 - 0.1 \cdot 2 = 5 - 0.2 = 4.8$. Each step nudges the residual by a small fraction of the new tree's prediction; that's the whole point of $\nu$ — keep the corrections small.
B flips the sign — the residual is being reduced by what the new tree explains, not increased. C forgets to apply $\nu$ and subtracts the full tree contribution, defeating shrinkage. D wrongly multiplies the residual by $\nu$ as if $\nu$ scaled the existing residual rather than the new tree.
Atoms: gradient-boosting, weak-learner-and-learning-rate.
Question 28
You compare two GBM runs on the same dataset, both with depth-3 trees and 10-fold CV: run A uses $\nu = 0.001$, run B uses $\nu = 0.1$. Holding everything else fixed, you'd expect:
- A Run A's CV curve plateaus before ever reaching a minimum, because $\nu = 0.001$ is too small to make measurable progress within any practical number of trees.
- B Both runs reach their CV minimum at roughly the same $M$, since $\nu$ rescales each tree but does not change the iteration at which CV error stops decreasing.
- C Run A reaches its CV minimum at a far larger $M$ than run B.
- D Run A overfits at very small $M$ because its small $\nu$ amplifies each tree's contribution.
Show answer
Correct answer: C
Smaller $\nu$ means each tree contributes less per step, so more steps are needed to traverse the same fit landscape. The optimal $M$ scales roughly as $1/\nu$ along the iso-fit curve.
A overstates: a small $\nu$ does slow learning, but the CV curve still hits a minimum once enough trees are added — it just happens at a much larger $M$. B ignores the $M$–$\nu$ coupling that the prof spent significant time on. D claims $\nu$ amplifies the trees, but $\nu$ is a multiplier in $(0,1]$ that shrinks them; small $\nu$ = small contribution per step, the opposite of overfit-friendly.
Atoms: weak-learner-and-learning-rate, cross-validation. Lecture: L20-boosting-2.
Question 29
A student replaces the depth-4 trees in a gradient-boosting fit with much deeper trees (depth 20) of comparable interpretation cost, keeping $\nu$ and $M$ as before. CV error rises sharply. The most accurate diagnosis:
- A Each individual tree now over-explains the residual at its own step, leaving little signal for later trees and inviting variance, which breaks the additive-correction logic.
- B Deeper trees can only lower bias and never raise variance, so a rise in CV error is impossible under correct implementation; the student must have introduced a coding error somewhere in the CV split or the residual update.
- C Depth-20 trees on small bootstrap-like subsets effectively converge to an unweighted random forest, so the CV error is now the random-forest error and is mathematically independent of the boosting iteration count $M$.
- D Replacing depth-4 with depth-20 trees increases the irreducible noise $\sigma^2$ of the data-generating distribution, since deeper trees absorb noise into the residual at training time and inflate the test-time error floor.
Show answer
Correct answer: A
Boosting's whole point is that each tree is a weak learner — a small step in the gradient direction. A depth-20 tree is essentially a strong learner, fitting most of the residual at its own iteration; subsequent trees see noise, the ensemble's variance shoots up. The prof flagged this directly: "We want learners that make good progress and a step in the right direction, but not ones that are going to throw you out."
B reflexively claims "deeper = lower bias" without thinking about the variance pay-back, the textbook bias-variance trap. C is wrong: deeper trees do not turn boosting into RF; sequential vs parallel structure is unchanged, and $M$ still matters. D conflates the model's variance with the data's irreducible error — the latter is fixed by $\sigma^2$ and cannot be moved by the model.
Atoms: weak-learner-and-learning-rate, bias-variance-tradeoff, boosting. Lecture: L20-boosting-2.