Module 09 — Boosting and Additive Trees
29 questions · 100 points · ~45 min
Question 1
3 points
Ex9.1
Which statement best captures the conceptual difference between bagging and boosting as ensemble methods?
- A Bagging fits trees in parallel on bootstrap samples (variance reduction); boosting fits trees sequentially, each correcting the previous fit (bias reduction).
- B Bagging averages many low-variance trees fit on the original data; boosting averages many high-variance trees fit on bootstrap samples.
- C Bagging and boosting both fit trees sequentially, but bagging uses the residuals while boosting uses the raw response.
- D Bagging uses random subsets of predictors at each split; boosting uses random subsets of observations at each split.
Show answer
Correct answer: A
Bagging is parallel and aimed at variance reduction (averaging deep, low-bias trees grown on bootstrap samples). Boosting is sequential and aimed at bias reduction (each new weak learner corrects the previous ensemble).
B inverts the bias-variance roles. C confuses the two methods — boosting is the one that fits residuals. D describes the random-forest decorrelation trick (bagging restricted via $m$), not the bagging-vs-boosting axis.
Atoms: boosting, bagging. Lecture: L19-boosting-1.
Question 2
4 points
Exam 2025 P6c
You fit a gradient-boosting model and need to specify the number of trees $M$, the tree depth $d$, and the shrinkage $\nu$. Which approach for picking these values does the prof advocate?
- A Set them to fixed defaults from the textbook ($M=1000$, $d=5$, $\nu=0.1$); the prof's slides explicitly recommend skipping tuning because boosting is robust to hyperparameter choices.
- B Pick the combination that minimises training MSE on the full training set, since boosting's regularization machinery already prevents the ensemble from chasing training noise.
- C Use cross-validation (or early stopping on a held-out validation curve) to pick the combination that minimises out-of-sample error.
- D Use AIC or BIC with the effective degrees of freedom of the boosted ensemble to penalise the joint complexity of $M$, $d$, and $\nu$, then minimise the criterion over a grid.
Show answer
Correct answer: C
The prof's verbatim answer to this exact question (2025 Q6c walkthrough): cross-validation, with $\nu$ fixed small and $M$ chosen at the CV minimum. `gbm()`'s `cv.folds` and `xgb.cv` are the canonical hooks.
A skips the data-driven step, but $M$ in particular overfits if too large — defaults won't generalise across datasets. B optimises against the model's own training noise; training error keeps falling as $M$ grows, never picking a meaningful stop. D applies the prof's least trusted model-selection family — he flagged AIC/BIC as making assumptions that "probably won't hold" and recommends CV instead.
Atoms: gradient-boosting, cross-validation. Lecture: L27-summary.
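A minimal sketch of this recipe in R. The data below is simulated purely for illustration (the exam's dataset is not reproduced); `gbm()` and `gbm.perf()` are the real hooks named above.

```r
library(gbm)

# Illustrative data -- a stand-in for whatever training frame you actually have.
set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
train$y <- sin(train$x1) + 0.5 * train$x2^2 + rnorm(n, sd = 0.3)

# Fix the shrinkage small, grow a generous number of trees,
# and let 10-fold CV pick M at the minimum of the validation curve.
fit <- gbm(y ~ ., data = train,
           distribution = "gaussian",
           n.trees = 3000,          # upper bound on M
           shrinkage = 0.01,        # nu fixed small
           interaction.depth = 2,   # d
           cv.folds = 10)

# Number of trees at the CV minimum (also plots the training and CV curves).
M_star <- gbm.perf(fit, method = "cv")
M_star
```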
Question 3
For the squared-error loss $L(y, f) = \tfrac{1}{2}(y - f)^2$, the negative gradient $-\partial L / \partial f$ that gradient boosting fits each new tree to is:
- A $f - y$ (prediction minus response).
- B $(y - f)^2$ (the squared residual).
- C $-y \cdot f$ (negative product).
- D $y - f$ (the residual).
Show answer
Correct answer: D
$\partial L / \partial f = -(y - f)$, so $-\partial L / \partial f = y - f$. This is exactly the residual, which is why "fit a tree to the residuals" and "fit a tree to the negative gradient" coincide for squared-error loss.
A flips the sign — that's the positive gradient, which would push the fit away from the data. B is the loss itself, not its derivative. C is the form of the AdaBoost exponent inside $\exp(-yf)$, not a gradient.
Atoms: gradient-boosting, boosting-loss-functions. Lecture: L20-boosting-2.
Question 4
Mark each statement about gradient-boosting hyperparameters as true or false.
Show answer
- True — smaller per-tree contribution means each step is a smaller move toward the minimum, so more steps are needed. $M$ and $\nu$ are coupled.
- False — boosting overfits if $M$ is too large because residuals eventually become noise. The "use enough" rule applies to bagging/RF, not boosting.
- True — a tree of depth $d$ uses at most $d$ predictors on any path, so it captures up to $d$-way interactions. Stumps (depth 1) → at most one variable per term → additive model (cf. ISLP §8.4 Q2).
- False — the rule of thumb is $\nu \le 0.1$ (typically $0.1, 0.05, 0.01, 0.001$), never near $1$. Setting $\nu \approx 1$ would kill the "small step" property the algorithm relies on.
Atoms: weak-learner-and-learning-rate, gradient-boosting. Lecture: L20-boosting-2.
Question 5
In the bias-variance picture, the core mechanism by which the basic boosting algorithm improves over a single tree is best described as:
- A Reducing variance by averaging many low-bias deep trees.
- B Reducing the irreducible error $\sigma^2$ via repeated denoising of the response.
- C Reducing bias by summing many high-bias weak learners that each correct the previous fit.
- D Eliminating both bias and variance by choosing a sufficiently large number of trees.
Show answer
Correct answer: C
The prof's framing: bagging/RF start with high-variance/low-bias trees and drive variance down by averaging. Boosting starts from the opposite end — high-bias/low-variance weak learners — and drives bias down by accumulating their corrections.
A is the bagging/random-forest mechanism, not boosting's. B confuses model error with the irreducible noise floor — no algorithm reduces $\sigma^2$. D ignores that $M$ too large overfits, increasing variance again.
Atoms: boosting, bias-variance-tradeoff. Lecture: L19-boosting-1.
Question 6
In AdaBoost, after fitting the $m$-th weak classifier $G_m$ with weighted error $\text{err}_m$, the weight update is $w_i \leftarrow w_i \cdot \exp(\alpha_m \cdot \mathbb{1}[y_i \neq G_m(x_i)])$. What happens to the weight of a training observation that $G_m$ classifies correctly?
- A Its weight is multiplied by $e^{\alpha_m}$, increasing it.
- B Its weight is multiplied by $e^{-\alpha_m}$, decreasing it.
- C Its weight is multiplied by $e^0 = 1$, so it is unchanged.
- D Its weight is set back to the initial value $1/N$.
Show answer
Correct answer: C
For a correctly classified point, $\mathbb{1}[y_i \neq G_m(x_i)] = 0$, so the multiplier is $\exp(\alpha_m \cdot 0) = 1$. Misclassified points get $\exp(\alpha_m)$, which boosts their weight; correct points are simply left alone (weights are renormalised after the step).
A is the multiplier for misclassified points — the question asks about the correct ones. B describes a symmetric "boost good / cut bad" scheme that AdaBoost.M1 doesn't actually use. D conflates initialisation with the per-iteration update.
Atoms: adaboost. Lecture: L19-boosting-1.
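A toy numeric sketch of this reweighting step. The labels, predictions, and weights below are made up for illustration; only the update from the question is shown, not a full AdaBoost run.

```r
# One AdaBoost.M1 reweighting step on made-up numbers.
y  <- c(1, 1, -1, -1, 1)        # true labels in {-1, +1}
Gm <- c(1, -1, -1, -1, -1)      # m-th weak classifier's predictions
w  <- rep(1 / 5, 5)             # current observation weights

miss    <- as.numeric(y != Gm)            # indicator 1[y_i != G_m(x_i)]
err_m   <- sum(w * miss) / sum(w)         # weighted error
alpha_m <- log((1 - err_m) / err_m)       # classifier weight

# Correct points get multiplier exp(alpha_m * 0) = 1; misses get exp(alpha_m).
w_new <- w * exp(alpha_m * miss)
w_new <- w_new / sum(w_new)               # renormalise after the step
rbind(before = w, after = round(w_new, 3))
```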
Question 7
Mark each statement about AdaBoost as true or false.
Show answer
- False — AdaBoost uses the $\pm 1$ coding (essential for the exponential-loss formulation) and outputs the sign of the $\alpha$-weighted vote, not the raw sum and not a $\{0,1\}$ probability.
- True — the equivalence was discovered five years after AdaBoost itself ("most things are very obvious after the fact"). Plug exponential loss into Algorithm 10.3 and AdaBoost falls out.
- False — AdaBoost uses stumps ("essentially a bias term, very, very small trees"). Strong individual learners would over-explain each round and break the "weak learners that build on each other" logic.
- False — the inequality is reversed. $\alpha_m > 0$ when $\text{err}_m < 0.5$ (better than chance, gets a positive vote); $\alpha_m < 0$ when $\text{err}_m > 0.5$ (worse than chance, vote is flipped).
Atoms: adaboost, boosting-loss-functions. Lecture: L20-boosting-2.
Question 8
4 points
Exam 2025 P6d
On the Boston housing data with 13 predictors, you fit a linear regression (test MSE 21.7), a GAM with B-splines (test MSE 14.5), and a gradient-boosted tree model (test MSE 11.2). Which is the most defensible one-sentence interpretation?
- A The boosted model wins because it has strictly more effective parameters than linear regression and the GAM, and a model with more parameters mechanically fits any held-out test set better than a simpler one.
- B The boosted model wins because the shrinkage parameter $\nu$ behaves like the ridge penalty $\lambda$ on a hidden vector of regression coefficients, so the gain over linear regression is the same coefficient-shrinkage effect that drove the ridge improvements in module 6.
- C The gaps between the three test MSEs are within Monte-Carlo noise from the random train/test split; the three methods should be regarded as effectively tied on this dataset and the choice between them comes down to interpretability.
- D The boosted model wins because there are non-linear interactions between predictors that linear regression cannot capture, while GAM captures non-linearities but not the interactions.
Show answer
Correct answer: D
This is exactly the prof's expected interpretation from the 2025 walkthrough: "It's probably because there's some non-linear interactions that weren't captured in the linear regression model." GAMs handle marginal non-linearities ($\sum f_j(X_j)$) but lack interaction terms; boosted depth-$\ge 2$ trees naturally model both.
A confuses training and test error — more parameters always lower training MSE, but the question is about test MSE, where overfitting can win or lose. B confuses boosting's machinery with ridge: boosting doesn't shrink regression coefficients, it adds many trees with shrinkage on the trees. C ignores the magnitudes of the gaps (≈10 MSE units between LR and boosting on a response with this scale is meaningful).
Atoms: gradient-boosting, boosting. Lecture: L27-summary.
Question 9
3 points
Exam 2024 P2a
A genomic dataset has $n = 1000$ individuals and $p = 10000$ predictors. Which of the following statements is correct?
- A Boosted regression trees cannot be fit in this setting; like OLS, the routine requires $p < n$.
- B Boosted regression trees can be fit; they handle $p > n$ because each tree splits on at most a handful of predictors.
- C Boosted regression trees can be fit, but only after first reducing dimensionality via PCA onto the top $k \ll n$ components.
- D Boosted regression trees can only be fit if you restrict the ensemble to stumps ($d = 1$); deeper trees would have too few observations per terminal node when $p > n$.
Show answer
Correct answer: B
Each tree at each split picks one predictor (or a few, with column subsampling) — no $p \times p$ matrix to invert, no need for $p < n$. The 2024 exam Q2a explicitly listed "Boosted regression trees allow for $p>n$" as one of the correct statements. This is also why Exercise 9.4 uses a $p \gg n$ genomic simulation.
A reflects the OLS / forward-selection constraint — irrelevant for trees. C is unnecessary; trees do their own implicit feature selection. D wrongly restricts to stumps; deeper trees still work fine when $p > n$.
Atoms: boosting, gradient-boosting.
Question 10
3 points
Ex9.3
In the gradient tree-boosting algorithm, where in the update would you insert the learning rate $\nu$? The current update reads $f_m(x) = f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$. The corrected update is:
- A $f_m(x) = \nu \cdot f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$.
- B $f_m(x) = f_{m-1}(x) + \sum_j (\gamma_{jm}/\nu)\,\mathbb{1}(x \in R_{jm})$.
- C $f_m(x) = f_{m-1}(x) + \nu \cdot \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})$.
- D $f_m(x) = \nu \cdot \big(f_{m-1}(x) + \sum_j \gamma_{jm}\,\mathbb{1}(x \in R_{jm})\big)$.
Show answer
Correct answer: C
$\nu$ shrinks only the new tree's contribution, leaving the existing ensemble alone (Algorithm 10.3 step 2(d), Exercise 9.3). Operationally: take a small step in the direction of the freshly-fit tree.
A would shrink the previous ensemble at every iteration, decaying the model. B inverts the role — dividing by $\nu$ would amplify each new tree, the opposite of what shrinkage means. D shrinks the entire ensemble each iteration, decaying earlier trees exponentially with $m$.
Atoms: weak-learner-and-learning-rate, gradient-boosting.
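A compact sketch of the loop with the shrinkage in place, for squared-error loss with `rpart` trees on simulated data. Everything below is illustrative, not the course's code.

```r
library(rpart)

# Toy gradient boosting for squared-error loss on simulated data.
set.seed(1)
n   <- 300
dat <- data.frame(x = runif(n, -3, 3))
y   <- sin(dat$x) + rnorm(n, sd = 0.3)

nu <- 0.1               # learning rate
M  <- 200               # number of trees
f  <- rep(mean(y), n)   # f_0: the optimal constant under squared error

for (m in 1:M) {
  dat$r <- y - f        # pseudo-residuals = negative gradient for squared error
  tree  <- rpart(r ~ x, data = dat,
                 control = rpart.control(maxdepth = 2, cp = 0))
  # Shrink ONLY the new tree's contribution (option C); f_{m-1} is left untouched.
  f <- f + nu * predict(tree, dat)
}

mean((y - f)^2)         # training MSE after M shrunken steps
```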
Question 11
Mark each statement about boosting loss functions as true or false.
Show answer
- True — squaring a residual of magnitude 100 gives 10,000, which dominates the sum and bends the fit toward outliers. Absolute and Huber are linear in big residuals, so they downweight outlier influence.
- False — multinomial deviance fits $K$ trees per round, one per class, each fit to its class-specific gradient $\mathbb{1}(y = k) - p_k(x)$. This is the most-flagged exam trap in this atom.
- True — exactly Huber's design: quadratic on $|y-f| \le \delta$, linear beyond, smooth at the seam if $\delta$ is chosen well.
- False — binomial deviance is the same negative log-likelihood as logistic regression, and its negative gradient is $y - p(x)$ (indicator minus probability), which is the prof's "indicator minus probability ends up sounding very similar to residuals".
Atoms: boosting-loss-functions, gradient-boosting, logistic-regression.
Question 12
Stochastic gradient boosting modifies the standard algorithm in which way, and to what end?
- A It replaces the deterministic gradient with a noisy mini-batch estimate, mirroring SGD in neural networks.
- B It subsamples rows (and optionally columns) without replacement before each tree, encouraging diversity and reducing variance.
- C It bootstraps the training data with replacement before each tree, mirroring bagging.
- D It randomly drops a fraction of already-fit trees during prediction to reduce reliance on early trees.
Show answer
Correct answer: B
Friedman 2002 stochastic GBM: at each iteration, draw a fraction (typically $\le 0.5$) of rows without replacement, and optionally a fraction of columns, then fit the next tree on this subsample. Each tree sees different data, the ensemble is more diverse, ensemble variance drops.
A describes neural-network mini-batch SGD — different "stochastic" mechanism. C claims bootstrap (with replacement); the prof flagged that this is exactly not what stochastic GBM does — it's without replacement. D describes the dropout trick from XGBoost, not stochastic subsampling.
Atoms: stochastic-gradient-boosting, regularization.
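In `gbm` the row-subsampling fraction is exposed as `bag.fraction` (the without-replacement fraction drawn before each tree). A sketch, reusing the simulated `train` frame from the Question 2 sketch above:

```r
library(gbm)

# Stochastic GBM: each tree is fit on a random half of the rows,
# drawn WITHOUT replacement (Friedman 2002), not a bootstrap sample.
fit_sgb <- gbm(y ~ ., data = train,
               distribution = "gaussian",
               n.trees = 3000,
               shrinkage = 0.01,
               interaction.depth = 2,
               bag.fraction = 0.5,   # row subsample fraction per tree
               cv.folds = 10)
gbm.perf(fit_sgb, method = "cv")
```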
Question 13
Mark each statement about XGBoost as true or false. (Internals are out of scope; "in scope" means the conceptual menu of what it adds on top of vanilla GBM.)
Show answer
- True — XGBoost's "Newton" step uses both the gradient (direction) and the Hessian (curvature), giving better-sized steps and typically fewer iterations.
- False — same gradient-boosting skeleton, just engineered (parallelism, approximate split search) and more regularized (leaf $L_1/L_2$, dropout). "Same algorithm, more engineering."
- True — the penalty $\Omega(f) = \gamma|T| + \tfrac12 \lambda \|w\|^2$ (or with $\|w\|_1$) imports ridge/lasso onto the leaf-value vector $w$.
- False — XGBoost natively exposes both `subsample` (row) and `colsample_bytree` (column) parameters; it inherits everything stochastic GBM does and adds more.
Atoms: xgboost, stochastic-gradient-boosting. Lecture: L20-boosting-2.
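A sketch of where these knobs live in the R `xgboost` interface (parameter names as exposed by that package; the simulated regression data is illustrative only):

```r
library(xgboost)

# Illustrative regression data.
set.seed(1)
X <- matrix(rnorm(500 * 10), ncol = 10)
y <- sin(X[, 1]) + X[, 2] * X[, 3] + rnorm(500, sd = 0.3)
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(
  objective        = "reg:squarederror",
  eta              = 0.05,  # shrinkage (nu)
  max_depth        = 3,     # tree depth (d)
  subsample        = 0.5,   # row subsampling, as in stochastic GBM
  colsample_bytree = 0.8,   # column subsampling per tree
  lambda           = 1,     # L2 penalty on leaf weights w
  alpha            = 0,     # L1 penalty on leaf weights w
  gamma            = 0      # penalty per extra leaf: gamma * |T|
)

# Early stopping on the CV curve picks the number of rounds (M).
cv <- xgb.cv(params, dtrain, nrounds = 2000, nfold = 10,
             early_stopping_rounds = 50, verbose = 0)
cv$best_iteration
```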
Question 14
For a fitted black-box model $\hat f$, the empirical partial dependence of $\hat f$ on a single predictor $X_j$ at a value $x_j$ is computed as:
- A $\bar f_j(x_j) = \frac{1}{N} \sum_{i=1}^N \hat f(x_j, x_{i, -j})$, averaging over the empirical $X_{-j}$.
- B $\bar f_j(x_j) = \hat f(x_j, \bar x_{-j})$, evaluating at the mean of the other predictors.
- C $\bar f_j(x_j) = \frac{1}{N} \sum_{i: x_{ij} = x_j} \hat f(x_i)$, restricting to observations whose $j$-th value equals $x_j$.
- D $\bar f_j(x_j) = \hat f(x_j, 0, 0, \ldots, 0)$, setting all other predictors to zero.
Show answer
Correct answer: A
You sweep $X_j$ over a grid of values, and at each $x_j$ replace the $j$-th coordinate of every training row with $x_j$ while keeping the rest as observed, then average the predictions. The empirical $X_{-j}$ stands in for the unknown joint distribution.
B plugs in the marginal mean of $X_{-j}$ — that's a "predict at the centroid" trick, not a PDP, and it ignores the joint distribution. C usually has zero or one matching observations, so it's degenerate (and confounded with $X_j$'s marginal density). D plugs in zeros, an arbitrary point that may be far outside the data.
Atoms: partial-dependence-plots. Lecture: L21-unsupervised-1.
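The recipe is short enough to write directly. The helper below is a hypothetical illustration of option A's formula (any fitted model with a predict function works; here an `lm` on the built-in `mtcars` data stands in for a boosted ensemble):

```r
# Empirical partial dependence of a fitted model on one predictor:
# sweep a grid of x_j values, overwrite that column in every training row,
# and average the predictions.
partial_dependence <- function(predict_fun, data, var, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[var]] <- v            # pin X_j = v for every row, keep X_{-j} as observed
    mean(predict_fun(data_v))     # average over the empirical distribution of X_{-j}
  })
}

# Example with a linear model; a gbm or random forest slots in the same way.
fit  <- lm(mpg ~ ., data = mtcars)
grid <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 25)
pd   <- partial_dependence(function(d) predict(fit, newdata = d), mtcars, "hp", grid)
plot(grid, pd, type = "l", xlab = "hp", ylab = "partial dependence of predicted mpg")
```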
Question 15
Mark each statement about partial dependence plots (PDPs) as true or false.
Show answer
- False — a PDP is a marginal average of model predictions; it tells you what the model learned, not what would happen in the world if $X_j$ changed and nothing else did.
- True — the prof's "tiny house with 50 bedrooms" caution: the PDP estimator pins $X_j$ and averages over the marginal of $X_{-j}$, which under high correlation creates implausible synthetic rows.
- True — the standard pairing for tree-ensemble interpretability: importance plots rank variables, PDPs reveal the shape of the dependence on the top-ranked ones.
Atoms: partial-dependence-plots, variable-importance.
Question 16
A boosted tree fit on Boston housing produces a variable-importance plot showing `rm` (rooms) and `lstat` (% lower-SES) at the top, well above `crim` and `nox`. What is the intended interpretation?
- A `rm` and `lstat` have the largest causal effects on median house price after adjusting for the other predictors in the joint distribution.
- B `rm` and `lstat` are the predictors whose partial-dependence curves have the largest average slope, which is what the importance algorithm aggregates.
- C `crim` and `nox` failed an internal Wald-style test at each split and were therefore demoted at the 5% level.
- D `rm` and `lstat` contribute the most to reducing impurity (or to OOB performance under permutation) across the ensemble's splits.
Show answer
Correct answer: D
Variable importance for a tree ensemble aggregates either total impurity reduction across all splits using that variable, or out-of-bag performance drop when that variable is permuted. Both flavors say "this predictor was relied on most by the model."
A overclaims: importance is correlational, not causal — the prof distinguished this from PDPs and from causal questions repeatedly. B confuses ensembles with linear regression; boosted trees have no scalar "coefficients" per predictor. C imports a hypothesis-test framework that doesn't apply to tree-based importance.
Atoms: variable-importance, partial-dependence-plots. Lecture: L19-boosting-1.
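For a `gbm` fit, the impurity-reduction flavour is what `summary()` reports as relative influence. A one-liner, reusing `fit` and `M_star` from the Question 2 sketch above (the permutation flavour is the `randomForest`-side analogue):

```r
# Relative influence: total reduction in squared error attributable to each predictor,
# summed over all splits in the ensemble and normalised to sum to 100.
summary(fit, n.trees = M_star, plotit = FALSE)

# Permutation-based importance is the randomForest analogue:
# randomForest::importance(rf_fit, type = 1)   # mean decrease in accuracy under OOB permutation
```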
Question 17
A gradient-boosting fit on Ames housing reports training MSE and 10-fold CV MSE as functions of the number of trees $M$. Training MSE drops monotonically; CV MSE drops sharply, hits a minimum near $M^\star = 1100$, then slowly rises. What is the recommended choice and why?
- A Pick $M$ as large as compute allows; later trees keep refining without overfitting in boosting.
- B Pick $M = 1100$ (the CV minimum); past it, additional trees fit residual noise and CV error rises.
- C Pick $M$ so that training MSE matches CV MSE; that's where bias and variance balance.
- D Pick the smallest $M$ for which CV MSE first falls below training MSE.
Show answer
Correct answer: B
Standard early-stopping recipe: $\hat M = \arg\min_M \mathrm{CV}(M)$. Past the minimum, residuals are mostly noise and the new trees fit it, so generalization deteriorates. The U-shape on the CV curve is exactly the signal you're using.
A is the bagging/RF rule, not boosting's — for boosting $M$ overfits. C is meaningless: CV MSE is essentially always above training MSE, the gap is the generalisation cost, not the optimum. D doesn't happen with iid samples (CV MSE typically stays above training throughout).
Atoms: gradient-boosting, cross-validation. Lecture: L20-boosting-2.
Question 18
3 points
ISLP §8 Q2
Boosting with depth-one trees (stumps) leads to an additive model $f(X) = \sum_{j=1}^p f_j(X_j)$. Why?
- A Stumps can only represent strictly linear functions of a single predictor at a time, so summing them mechanically produces a linear-in-$X$ model with no curvature in any $f_j$.
- B The shrinkage parameter $\nu$ projects each tree's contribution onto the additive subspace, removing any interaction component before the new tree is added to the ensemble.
- C Each stump splits on exactly one variable, so its contribution is a function of one $X_j$ only; the sum of such terms is, by definition, additive.
- D Stumps have exactly two terminal nodes, and any two-leaf tree is mathematically equivalent to a single additive term, so summing them keeps the ensemble in the additive class.
Show answer
Correct answer: C
A depth-1 tree picks one split on one predictor and outputs two leaf values — that is, its contribution depends on a single $X_j$. Summing $M$ such trees gives $\sum_j f_j(X_j)$, where $f_j$ collects the contributions of all stumps split on $X_j$. (ISLP §8.4 Q2.)
A overstates: stumps are non-linear (step functions). The point isn't linearity, it's that no interaction terms appear. B is irrelevant — $\nu$ scales each tree but does not change which variables the tree uses. D gets the leaf count right (two regions per stump) but the conclusion wrong: additivity comes from "one variable per term," not from leaf count — a depth-2 tree also has a small leaf count yet introduces interactions.
Atoms: boosting, weak-learner-and-learning-rate.
Question 19
Which of the following is not a regularization mechanism present in standard gradient-boosted machines (or their XGBoost extension)?
- A Shrinking each new tree's contribution by the learning rate $\nu \le 0.1$ to take many small gradient steps.
- B Choosing $M$ by early stopping at the minimum of a held-out validation curve rather than the training one.
- C Subsampling rows (without replacement) and optionally columns before each tree to diversify successive trees.
- D Increasing each tree's depth so that early trees can capture the bulk of the signal in a few iterations.
Show answer
Correct answer: D
Deeper trees do the opposite of regularizing — each individual tree becomes a stronger learner, increasing variance per step and undermining the "many small corrections" philosophy. The other three (shrinkage, early stopping, subsampling) are exactly the prof's "three regularization strategies" in boosting; XGBoost adds two more (leaf $L_1/L_2$, dropout).
A, B, and C are textbook regularizers in this context. D is the trap — it's the random-forest reflex applied to the wrong setting. RF wants high-variance/low-bias trees; boosting wants the opposite.
Atoms: regularization, weak-learner-and-learning-rate, stochastic-gradient-boosting.
Question 20
XGBoost's penalty $\Omega(f) = \gamma |T| + \tfrac{1}{2} \lambda \|w\|^2$ (with $|T|$ the number of leaves and $w$ the leaf-value vector) is best described as:
- A Cost-complexity pruning $+$ ridge regression, with both penalties applied to the tree's leaves.
- B Lasso applied to a hidden vector of linear coefficients learned alongside the tree ensemble.
- C A Bayesian prior on the tree structure with a Laplace likelihood on leaf values.
- D Cross-entropy classification loss plus an extra bias term absorbing the ensemble's constant offset.
Show answer
Correct answer: A
$\gamma|T|$ is exactly cost-complexity pruning ($\alpha|T|$ from module 8) applied per boosting iteration; $\tfrac12 \lambda \|w\|^2$ is ridge applied to leaf weights. The prof's framing: same regularizers as elsewhere in the course, only here the parameter vector is the leaf values rather than regression coefficients.
B confuses XGBoost with linear models — there are no $\beta$'s here. C overlays a Bayesian story the prof explicitly excluded ("really don't think I'd put this on the test"). D substitutes a classification loss for a regularization penalty; the two are different terms in the objective.
Atoms: xgboost, regularization, cost-complexity-pruning.
Question 21
In AdaBoost, weak classifier $G_3$ has weighted error $\text{err}_3 = 0.20$. What is its classifier weight $\alpha_3 = \log\!\big((1 - \text{err}_3)/\text{err}_3\big)$, and what does its sign mean?
- A $\alpha_3 \approx \log 4 \approx 1.39$, positive — $G_3$ is better than chance and gets a positive vote in the final ensemble.
- B $\alpha_3 \approx \log(1/4) \approx -1.39$; the negative sign tells you AdaBoost has detected an over-confident weak learner and is flipping its predictions before adding them to the ensemble.
- C $\alpha_3 \approx 0.20$, equal to the weighted error itself; AdaBoost makes each classifier's voice in the final vote directly proportional to the error it just achieved.
- D $\alpha_3 = \tfrac12 \log(1 - 0.20) \approx 0.11$, positive but small because the standard formula uses a one-half log of $(1 - \text{err}_m)$ instead of a ratio.
Show answer
Correct answer: A
$(1-0.20)/0.20 = 4$, so $\alpha_3 = \log 4 \approx 1.39$. Because $\text{err}_3 < 0.5$, the ratio is $> 1$ and $\alpha_3 > 0$: a better-than-chance classifier gets a positive weight in the final $\operatorname{sign}(\sum_m \alpha_m G_m(x))$ vote.
B inverts the ratio inside the log — the formula is $(1-\text{err})/\text{err}$, not $\text{err}/(1-\text{err})$. C confuses $\alpha_m$ with $\text{err}_m$. D mixes up AdaBoost's $\alpha$ formula with the $\tfrac12 \log(\cdot)$ variant from the SAMME / "discrete AdaBoost" alternative coding (and even there, the ratio inside the log is unchanged).
Atoms: adaboost.
Question 22
Mark each statement as true or false.
Show answer
- False — bagging's $B$ is "use enough": extra trees average more with diminishing returns, but never inflate OOB error. There is no overfitting-in-$B$ for bagging.
- False — boosting overfits if $M$ is too large because each new tree fits whatever residual is left, and after the signal is exhausted that's just noise.
- True — the standard contrast: RF wants high-variance/low-bias deep trees to average; boosting wants high-bias/low-variance weak learners to sum.
- False — the bias-variance roles are inverted. RF reduces variance via decorrelated averaging (each tree is low-bias/high-variance); boosting reduces bias via sequential corrections (each tree is high-bias/low-variance).
Atoms: boosting, random-forest, bias-variance-tradeoff.
Question 23
3 points
Ex9.2
In Algorithm 10.3 (gradient tree boosting) at iteration $m$, after computing the negative-gradient pseudo-residuals $r_{im}$, the next step (2b) is:
- A Fit a regression tree to the original responses $y_i$ using least squares.
- B Fit a classification tree to $\operatorname{sign}(r_{im})$ to determine which observations to upweight next round.
- C Solve $\arg\min_\gamma \sum_i L(y_i, f_{m-1}(x_i) + \gamma)$ to obtain a single constant update.
- D Fit a regression tree to the pseudo-residuals $r_{im}$ using least squares, obtaining regions $R_{jm}$.
Show answer
Correct answer: D
Step 2b of Algorithm 10.3: fit a regression tree to the negative-gradient pseudo-residuals by least squares to obtain the regions $R_{jm}$. Step 2c then fills in the leaf values $\gamma_{jm}$ using the original loss $L$ on those regions. The least-squares fit on residuals is the "smooth proxy" the prof emphasized.
A would just refit the original problem, wasting the gradient information. B is the AdaBoost-style sample reweighting, which is a different mechanism (and applies to classification, not arbitrary loss). C collapses the update to a single constant — that is the form of the initialization step (step 1, $f_0 = \arg\min_\gamma \sum_i L(y_i, \gamma)$), not the per-iteration fit, which needs a tree to give region-specific updates $\gamma_{jm}$.
Atoms: gradient-boosting.
Question 24
You have a regression dataset where a sensor occasionally records implausible spikes — a few $y$ values are roughly 100× the typical scale. You want a boosting fit that is not heavily distorted by these outliers but still fits the bulk of the data smoothly. Which loss is the most appropriate?
- A Squared-error (Gaussian) loss — the default and the most efficient under Gaussian noise.
- B Huber loss — quadratic for small residuals, linear for large ones; smooth and outlier-tolerant.
- C Exponential loss — its sharp penalty isolates the outliers from the rest of the fit.
- D Multinomial deviance — the multi-class generalization handles the outlier "class" of spikes naturally.
Show answer
Correct answer: B
Huber is exactly the engineered compromise: quadratic ($(y-f)^2$) for $|y-f| \le \delta$ (smooth, doesn't aggressively zero things), linear ($2\delta|y-f| - \delta^2$) beyond (doesn't blow up on outliers). Designed for this scenario.
A is the loss the question's setup is warning against: 100×-scale residuals get squared into 10,000×-scale penalties, which yanks the fit toward the outliers. C is a binary-classification loss (and exponential growth makes outliers even worse, not better). D is for $K$-class classification — wrong problem entirely.
Atoms: boosting-loss-functions. Lecture: L20-boosting-2.
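A quick visual sketch of the comparison, using the same loss conventions as the explanation above; nothing here is course code.

```r
# Squared, absolute, and Huber loss as functions of the residual r = y - f.
huber <- function(r, delta = 1) {
  ifelse(abs(r) <= delta, r^2, 2 * delta * abs(r) - delta^2)  # quadratic core, linear tails
}

r <- seq(-10, 10, length.out = 400)
plot(r, r^2, type = "l", xlab = "residual y - f", ylab = "loss")  # squared: explodes on outliers
lines(r, abs(r), lty = 2)                                         # absolute: linear throughout
lines(r, huber(r, delta = 1), lty = 3)                            # Huber: quadratic near 0, linear beyond delta
legend("top", c("squared", "absolute", "Huber (delta = 1)"), lty = 1:3)
```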
Question 25
Mark each statement as true or false.
Show answer
- True — the prof's canonical strategy. Joint-tuning $M$ and $\nu$ wastes compute because they are coupled along an iso-fit curve.
- True — with $\nu = 1$ each tree gets full weight; the ensemble jumps to a near-perfect training fit immediately and loses the "many small corrections" diversity.
- False — depth controls the interaction order the model can capture and how strong each individual learner is. Bias-variance is precisely the lens.
- True — the four canonical hyperparameters. XGBoost adds more (leaf $L_1/L_2$, $\gamma$, dropout, subsample fractions), but the GBM core is these four.
Atoms: weak-learner-and-learning-rate, gradient-boosting.
Question 26
3 points
Exam 2024 P5d
On a binary classification dataset (orange-juice purchase, $n_{\text{train}} = 800$, $\sim 60/40$ class split), you fit a boosted tree with `distribution = "bernoulli"`, `n.trees = 1000`, `shrinkage = 0.01`, `interaction.depth = 2`, `cv.folds = 10`. The CV-optimal number of trees is 280, giving a test misclassification error of 0.18 — slightly higher than logistic regression on the same data. The most defensible interpretation is:
- A Boosting must be misconfigured; given a sufficiently large $M$ and a small $\nu$, boosting is guaranteed to dominate logistic regression on any binary classification problem because it is a universal function approximator.
- B The boosting fit is fine, but the data may be well-described by a (near-)linear / additive logistic model, so the extra flexibility brings little benefit and possibly more variance.
- C The misclassification rate is meaningless under any class imbalance, so the comparison itself is invalid; only the AUC under the ROC curve should be reported and compared across methods.
- D The prof would deduct for using `cv.folds` on a boosted fit; OOB error is the only acceptable validation strategy for boosting and the resulting CV-optimal $M$ is therefore not a defensible early-stopping target.
Show answer
Correct answer: B
From the 2024 exam answer key: most students find boosting's error not lower than logistic regression / GAM here, and the prof's expected interpretation is that the data is largely linear in the log-odds, so boosting's extra complexity isn't paying off. Simpler models can or should be used (Occam).
A is the textbook over-claim — boosting wins often, not always; bias-variance has to break even to pay for the extra flexibility. C overstates: the imbalance here (60/40) is mild, error rate is fine to report alongside other metrics. D is fabricated — `cv.folds` is the canonical hook the slides and the textbook recommend.
Atoms: gradient-boosting, cross-validation, bias-variance-tradeoff.
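A sketch of the exam's configuration in R. Treating the orange-juice data as ISLR's `OJ` set is an assumption here (the exam data isn't reproduced); the fitting settings follow the question verbatim.

```r
library(gbm)
library(ISLR)   # assuming the orange-juice data is ISLR's OJ -- an assumption, not stated above

set.seed(1)
idx   <- sample(nrow(OJ), 800)
train <- OJ[idx, ];  test <- OJ[-idx, ]
train$y <- as.numeric(train$Purchase == "MM")   # gbm's bernoulli wants a 0/1 response
train$Purchase <- NULL                          # keep only predictors plus the 0/1 response

fit <- gbm(y ~ ., data = train,
           distribution = "bernoulli",
           n.trees = 1000, shrinkage = 0.01,
           interaction.depth = 2, cv.folds = 10)

M_star <- gbm.perf(fit, method = "cv")                     # CV-optimal number of trees
p_hat  <- predict(fit, newdata = test, n.trees = M_star,
                  type = "response")                       # P(Purchase = "MM")
mean((p_hat > 0.5) != (test$Purchase == "MM"))             # test misclassification error
```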
Question 27
Consider the regression-boosting update from Algorithm 8.2:
$$\hat f(x) \leftarrow \hat f(x) + \nu\, \hat f^b(x), \qquad r_i \leftarrow r_i - \nu\, \hat f^b(x_i).$$
Suppose at iteration $b$, the new tree's prediction at observation $i$ is $\hat f^b(x_i) = 2$, the current residual is $r_i = 5$, and $\nu = 0.1$. After this step, the residual $r_i$ becomes:
- A $4.8$ (we subtract $\nu \cdot \hat f^b(x_i) = 0.2$).
- B $5.2$ (we add $\nu \cdot \hat f^b(x_i)$).
- C $3$ (we subtract the full prediction $\hat f^b(x_i) = 2$).
- D $0.5$ (residual is multiplied by $\nu = 0.1$).
Show answer
Correct answer: A
$r_i \leftarrow r_i - \nu\, \hat f^b(x_i) = 5 - 0.1 \cdot 2 = 5 - 0.2 = 4.8$. Each step nudges the residual by a small fraction of the new tree's prediction; that's the whole point of $\nu$ — keep the corrections small.
B flips the sign — the residual is being reduced by what the new tree explains, not increased. C forgets to apply $\nu$ and subtracts the full tree contribution, defeating shrinkage. D wrongly multiplies the residual by $\nu$ as if $\nu$ scaled the existing residual rather than the new tree.
Atoms: gradient-boosting, weak-learner-and-learning-rate.
Question 28
You compare two GBM runs on the same dataset, both with depth-3 trees and 10-fold CV: run A uses $\nu = 0.001$, run B uses $\nu = 0.1$. Holding everything else fixed, you'd expect:
- A Run A's CV curve plateaus before ever reaching a minimum, because $\nu = 0.001$ is too small to make measurable progress within any practical number of trees.
- B Both runs reach their CV minimum at roughly the same $M$, since $\nu$ rescales each tree but does not change the iteration at which CV error stops decreasing.
- C Run A reaches its CV minimum at a far larger $M$ than run B.
- D Run A overfits at very small $M$ because its small $\nu$ amplifies each tree's contribution.
Show answer
Correct answer: C
Smaller $\nu$ means each tree contributes less per step, so more steps are needed to traverse the same fit landscape. The optimal $M$ scales roughly as $1/\nu$ along the iso-fit curve.
A overstates: a small $\nu$ does slow learning, but the CV curve still hits a minimum once enough trees are added — it just happens at a much larger $M$. B ignores the $M$–$\nu$ coupling that the prof spent significant time on. D claims $\nu$ amplifies the trees, but $\nu$ is a multiplier in $(0,1]$ that shrinks them; small $\nu$ = small contribution per step, the opposite of overfit-friendly.
Atoms: weak-learner-and-learning-rate, cross-validation. Lecture: L20-boosting-2.
Question 29
A student replaces the depth-4 trees in a gradient-boosting fit with much deeper trees (depth 20) of comparable interpretation cost, keeping $\nu$ and $M$ as before. CV error rises sharply. The most accurate diagnosis:
- A Each individual tree now over-explains the residual at its own step, leaving little signal for later trees and inviting variance, which breaks the additive-correction logic.
- B Deeper trees can only lower bias and never raise variance, so a rise in CV error is impossible under correct implementation; the student must have introduced a coding error somewhere in the CV split or the residual update.
- C Depth-20 trees on small bootstrap-like subsets effectively converge to an unweighted random forest, so the CV error is now the random-forest error and is mathematically independent of the boosting iteration count $M$.
- D Replacing depth-4 with depth-20 trees increases the irreducible noise $\sigma^2$ of the data-generating distribution, since deeper trees absorb noise into the residual at training time and inflate the test-time error floor.
Show answer
Correct answer: A
Boosting's whole point is that each tree is a weak learner — a small step in the gradient direction. A depth-20 tree is essentially a strong learner, fitting most of the residual at its own iteration; subsequent trees see noise, the ensemble's variance shoots up. The prof flagged this directly: "We want learners that make good progress and a step in the right direction, but not ones that are going to throw you out."
B reflexively claims "deeper = lower bias" without thinking about the variance pay-back, the textbook bias-variance trap. C is wrong: deeper trees do not turn boosting into RF; sequential vs parallel structure is unchanged, and $M$ still matters. D conflates the model's variance with the data's irreducible error — the latter is fixed by $\sigma^2$ and cannot be moved by the model.
Atoms: weak-learner-and-learning-rate, bias-variance-tradeoff, boosting. Lecture: L20-boosting-2.