Gradient boosting

The general boosting algorithm: at each step, fit a tree to the negative gradient of the loss with respect to the current ensemble’s predictions, then add a shrunken version. The prof’s punchline: fitting residuals in the squared-error case is literally taking a gradient-descent step in function space, and once you see this, the whole framework generalizes to any differentiable loss.

Definition (prof’s framing)

The compact verbal definition the prof landed on:

“By updating according to the residuals, what really we were doing is actually fitting the gradient of the function that we’re optimizing. We’re estimating the direction we have to go in. We’re updating the gradient that will get us into a better direction.” - L19-boosting-1

The rephrasing one lecture later:

“The main conceptual idea is that basically when we were boosting the stuff for regression we were using the residuals which seemed like a hack that someone figured out, it was actually estimating the gradient. And so more generally you can just say, I let R be the gradient or the negative gradient, and then fit a tree to that negative gradient.” - L20-boosting-2

So gradient boosting = forward-stagewise additive modeling where each new tree is fit by least squares to the negative-gradient pseudo-residuals of any chosen loss.

Notation & setup

  • $L(y, f(x))$, the loss. Free choice (squared, absolute, Huber, deviance, exponential, …); see boosting-loss-functions.
  • $f_m(x)$, the boosted model after $m$ trees; $f_0(x)$, a constant baseline.
  • $g_{im} = \left[\partial L(y_i, f(x_i)) / \partial f(x_i)\right]_{f = f_{m-1}}$, the gradient of the loss at observation $i$ with respect to the function value $f(x_i)$, evaluated at the current ensemble.
  • $r_{im} = -g_{im}$, the negative-gradient pseudo-residuals.
  • $T_m(x) = \sum_{j=1}^{J} \gamma_{jm}\,\mathbf{1}(x \in R_{jm})$, the tree fit at step $m$, with regions $R_{jm}$ and leaf values $\gamma_{jm}$.
  • $J$, number of terminal nodes in tree $m$ (typically 4–8; same across $m$).
  • $\lambda$, learning rate / shrinkage. See weak-learner-and-learning-rate.

Formula(s) to know cold

The simple regression-tree algorithm (ISL Algorithm 8.2)

Concrete special case the prof teaches first (L19-boosting-1):

Initialize $\hat f(x) = 0$ and residuals $r_i = y_i$ for all $i$ in the training set. For $b = 1, 2, \dots, B$:

  1. Fit a tree $\hat f^b$ with $d$ splits ($d + 1$ terminal nodes) to the training data $(X, r)$.
  2. Update $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f^b(x)$.
  3. Update the residuals $r_i \leftarrow r_i - \lambda \hat f^b(x_i)$.

Output the boosted model $\hat f(x) = \sum_{b=1}^{B} \lambda \hat f^b(x)$.
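
A minimal R sketch of this loop, assuming rpart as the weak learner; the function and object names (boost_ls, train, y) are placeholders, not anything from the lecture code.

```r
library(rpart)

# Fit B small trees to successively updated residuals (Algorithm 8.2 sketch).
boost_ls <- function(train, B = 1000, lambda = 0.01, d = 2) {
  r <- train$y                              # f_hat = 0, so residuals start as the raw responses
  trees <- vector("list", B)
  for (b in seq_len(B)) {
    fit_data <- train
    fit_data$y <- r                         # the b-th weak learner targets the current residuals
    trees[[b]] <- rpart(y ~ ., data = fit_data,   # small tree; maxdepth stands in for "d splits"
                        control = rpart.control(maxdepth = d, cp = 0))
    r <- r - lambda * predict(trees[[b]], newdata = train)   # shrink, then update residuals
  }
  trees
}

# Prediction: f_hat(x) = sum over b of lambda * f_b(x)
predict_boost <- function(trees, newdata, lambda = 0.01) {
  rowSums(sapply(trees, predict, newdata = newdata)) * lambda
}
```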

The general gradient tree boosting algorithm (Elements Algorithm 10.3)

The version that lifts the idea to any loss (L20-boosting-2):

  1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$.
  2. For $m = 1, \dots, M$:
    • (a) Compute $r_{im} = -\left[\partial L(y_i, f(x_i)) / \partial f(x_i)\right]_{f = f_{m-1}}$ for $i = 1, \dots, n$.
    • (b) Fit a regression tree to the $r_{im}$ by least squares, giving regions $R_{jm}$ for $j = 1, \dots, J_m$.
    • (c) For each region, compute $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\bigl(y_i, f_{m-1}(x_i) + \gamma\bigr)$.
    • (d) Update $f_m(x) = f_{m-1}(x) + \lambda \sum_{j=1}^{J_m} \gamma_{jm}\,\mathbf{1}(x \in R_{jm})$ (shrinkage $\lambda$ folded into the update).
  3. Output $\hat f(x) = f_M(x)$.
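
To see where the loss actually enters, here is a hedged sketch of the same skeleton for absolute-error loss: the negative gradient is the sign of the residual and the leaf values are re-optimized as medians. rpart, the object names, and the depth control are my assumptions, not the lecture's code, and prediction on new data is omitted for brevity.

```r
library(rpart)

# Algorithm 10.3 sketch for L(y, f) = |y - f| (least absolute deviations).
gbt_lad <- function(X, y, M = 500, lambda = 0.1, depth = 2) {
  f <- rep(median(y), length(y))            # 1. f_0 = argmin_gamma sum_i |y_i - gamma|
  for (m in seq_len(M)) {
    r <- sign(y - f)                        # 2(a) negative gradient of |y - f|
    tree <- rpart(r ~ ., data = data.frame(X, r = r),   # 2(b) least-squares tree fit to the r_im
                  control = rpart.control(maxdepth = depth, cp = 0))
    leaf <- tree$where                      # terminal node of each training point
    gamma <- tapply(y - f, leaf, median)    # 2(c) leaf values re-optimized under the original loss
    f <- f + lambda * gamma[as.character(leaf)]          # 2(d) shrunken update
  }
  f                                         # fitted values on the training data
}
```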

The squared-error sanity check

Plug $L(y, f(x)) = \tfrac{1}{2}\,(y - f(x))^2$ into 2(a):

$r_{im} = -\left[\partial\, \tfrac{1}{2}(y_i - f(x_i))^2 / \partial f(x_i)\right]_{f = f_{m-1}} = y_i - f_{m-1}(x_i),$

the ordinary residual. So Algorithm 10.3 with squared-error loss reduces exactly to the regression-tree algorithm: fitting residuals = fitting the gradient.

Insights & mental models

  • Residuals = gradient. This is the prof’s headline “ah-ha” of the module. Once you see it, every loss becomes pluggable, and that’s why “gradient boosting” is the umbrella name for the family.
  • Steepest descent in function space.

    “The beauty of calculus is that it really points you in the direction you should go.” - L20-boosting-2 Gradient descent in parameter space says “move $\theta$ in the direction of $-\nabla_\theta L$.” Gradient boosting in function space says “move $f$ in the direction of $-\partial L / \partial f$, but project that direction onto the space of trees by fitting a tree to it via least squares.”

  • Two sub-problems decoupled. Regions $R_{jm}$ are found via a smooth proxy (a least-squares fit to the $r_{im}$); leaf values $\gamma_{jm}$ are found via the original loss $L$. The prof: “we’re going to find the cute function we can minimize to find these regions, right, these R-tildes, and then we’re going to use those to find the gammas.” (L20-boosting-2)
  • Three ingredients (the prof’s recap):
    1. Weak learners (small trees).
    2. A loss function (free choice, must be differentiable).
    3. An additive way to combine them (sum + shrinkage).
  • Four hyperparameters (the canonical exam-flagged structure):
    1. Tree depth $d$ (or number of leaves $J$), typically small (4–8 terminal nodes).
    2. Minimum observations per terminal node, less critical for big data.
    3. Number of trees $B$, overfits if too large; tune by CV / early stopping.
    4. Learning rate $\lambda$, typically small (≈ 0.1 or less). See weak-learner-and-learning-rate.
  • $\lambda$ and $B$ are coupled. Smaller $\lambda$ ⇒ larger $B$ needed. Standard strategy: fix $\lambda$ small (≈ 0.1), then choose $B$ by early stopping on a held-out validation set; see the sketch after this list.
  • Boosting reduces bias; variance reduction comes from subsampling and XGBoost-style regularization.
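
A quick gbm illustration of the coupling between $\lambda$ and $B$; the data frame train and response y are placeholders, and the exact numbers will vary by dataset. The point is only that the CV-optimal number of trees grows as the shrinkage gets smaller.

```r
library(gbm)

for (lam in c(0.1, 0.01)) {
  fit <- gbm(y ~ ., data = train, distribution = "gaussian",
             n.trees = 5000, shrinkage = lam, interaction.depth = 3,
             cv.folds = 5)
  best_B <- gbm.perf(fit, method = "cv", plot.it = FALSE)   # CV-optimal number of trees
  cat("shrinkage =", lam, "-> CV-optimal n.trees =", best_B, "\n")
}
```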

Worked example, Boston housing (slide deck)

gbm(medv~., distribution="gaussian", n.trees=5000, interaction.depth=4). Trains both with the default shrinkage and with a larger value; “doesn’t make a big difference” on this dataset. Test MSE is comparable to RF, better than a single tree, and better than bagging. A hedged reconstruction of the run follows.
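
The sketch below fills in the surrounding code; the seed, the 50/50 split, and the use of ISLR2's Boston data are my assumptions, not necessarily the slide's.

```r
library(gbm)
library(ISLR2)                              # provides the Boston data frame

set.seed(1)
idx <- sample(nrow(Boston), nrow(Boston) / 2)
boost_fit <- gbm(medv ~ ., data = Boston[idx, ], distribution = "gaussian",
                 n.trees = 5000, interaction.depth = 4)      # default shrinkage
pred <- predict(boost_fit, newdata = Boston[-idx, ], n.trees = 5000)
mean((pred - Boston$medv[-idx])^2)          # test MSE, to set against RF / bagging / a single tree
```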

Worked example, Ames housing (slide deck)

gbm(Sale_Price ~ ., distribution="gaussian", n.trees=3000, shrinkage=0.1, interaction.depth=3, n.minobsinnode=10, cv.folds=10). Diagnose via gbm.perf(..., method="cv"): training error drops monotonically, while CV error traces the U: it drops sharply, hits a minimum, then slowly creeps back up. Pick the optimal number of trees as which.min(cv.error). Switching to bag.fraction = 0.5 (stochastic GBM) trims the RMSE slightly. A hedged sketch of the workflow follows.
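
In the sketch below, ames_train stands in for however the slide deck built its training split; everything else follows the call quoted above.

```r
library(gbm)

set.seed(123)
ames_gbm <- gbm(Sale_Price ~ ., data = ames_train, distribution = "gaussian",
                n.trees = 3000, shrinkage = 0.1, interaction.depth = 3,
                n.minobsinnode = 10, cv.folds = 10)

best_B <- gbm.perf(ames_gbm, method = "cv")   # plots training vs CV error, returns the CV minimum
best_B                                        # same as which.min(ames_gbm$cv.error)
sqrt(ames_gbm$cv.error[best_B])               # CV RMSE at the chosen number of trees

# Stochastic variant: add row subsampling via bag.fraction, as in the note above
ames_sgbm <- gbm(Sale_Price ~ ., data = ames_train, distribution = "gaussian",
                 n.trees = 3000, shrinkage = 0.1, interaction.depth = 3,
                 n.minobsinnode = 10, cv.folds = 10, bag.fraction = 0.5)
```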

Exam signals

“I won’t have you memorize the names of the R functions, of course, but you should know what tree boosting is.” - L27-summary

“We need to specify the number of trees, the depth, and the shrinkage. How do you determine good values?” - L27-summary

(Cross-validation is the answer he wanted on Q6c of the 2025 walk-through.)

“It’s probably because there’s some non-linear interactions that weren’t captured in the linear regression model.” - L27-summary

(The kind of one-sentence interpretation the prof wanted when contrasting boosting test MSE with linear-regression test MSE on Boston.)

Pitfalls

  • Confusing the regression-tree special case with the general algorithm. The “fit residuals” recipe is just squared-error loss plugged into Algorithm 10.3. The general algorithm fits the gradient, which equals the residual only for squared error.
  • Treating $B$ as “use enough” the way you would for RF. $B$ in boosting is a real tuning parameter; too large overfits.
  • Dropping the learning rate. Without $\lambda$, each tree contributes its full predictions, the ensemble jumps to a near-perfect training fit immediately, no diversity, terrible generalization. The prof: “each step should be a small step in the right direction.”
  • Picking $\lambda$ and $B$ independently. They’re coupled; smaller $\lambda$ requires larger $B$. The standard recipe: fix $\lambda$ (≈ 0.1), tune $B$.
  • Mistakenly thinking gradient boosting only works for regression. The same framework works for classification, just plug binomial / multinomial deviance into Algorithm 10.3 (see boosting-loss-functions).
  • Forgetting that for $K$-class classification you fit $K$ trees per round (one per class, see boosting-loss-functions).

Scope vs ISLP

  • In scope: the conceptual chain residuals → gradient → general gradient boosting; the squared-error sanity check; the four hyperparameters and how $\lambda$ / $B$ are coupled; CV / early stopping for tuning; the contrast with bagging / RF; the practical demos (Boston, Ames); generalization to any differentiable loss; partial dependence plots for interpretation.
  • Look up in ISLP: §8.2.3 (book’s Algorithm 8.2, squared-error special case), pp. 343–347. For the general Algorithm 10.3 and the gradient/loss derivations, see Elements ch. 10 (Anders does not need this).
  • Skip in ISLP (book-only, prof excluded):
    • Detailed boosting pseudocode line by line - L27-summary: “you should know what tree boosting is” but not the steps verbatim.
    • BART (book §8.2.4), never lectured.
    • Heavy formal derivations of why steepest descent in function space works: the prof gives the intuition, no proofs.

Exercise instances

  • Exercise9.2, explain the meaning of $B$, $d$, $\lambda$, and the minimum node size in the gradient tree boosting algorithm; what changing each does; how to choose them.
  • Exercise9.4a, fit gbm(y ~ ., distribution="gaussian", n.trees=3, shrinkage=0.1, interaction.depth=7, n.minobsinnode=10, cv.folds=10) on simulated genomic data (the prompt also asks how to make it less computationally intensive, answer: subsample rows and/or columns).
  • Exercise9.4b, see stochastic-gradient-boosting for the h2o stochastic-GBM grid search.
  • Exercise9.4c, see xgboost for the xgb.cv setup.
  • Exercise9.4d, find the best hyperparameter combination, then fit on the full training data with no CV.

How it might appear on the exam

  • Conceptual short-answer (à la 2025 Q6c): “we need to specify $B$, the depth, and $\lambda$; how would you determine good values?” Expected answer: cross-validation (or early stopping), with $\lambda$ fixed small and $B$ then chosen at the CV minimum.
  • Output interpretation: given a gbm.perf()-style training-error-vs-CV-error plot, identify the optimal $B$ at the CV minimum, explain the U-shape (training error falls monotonically; CV error rises again past the minimum because the new trees start fitting noise in the residuals).
  • Method comparison: given a table of test MSEs (linear regression, GAM, boosting), say which performs best and why (boosting wins when there are non-linear interactions; per the 2025 Q6c walk-through).
  • Conceptual T/F: “fitting trees to residuals is gradient descent in function space for squared-error loss” (true), “gradient boosting works only for squared-error loss” (false), “in $K$-class classification gradient boosting fits one tree per round” (false, $K$ trees per round).
  • Mathy-ish: derive the negative gradient for squared-error loss and observe it equals the residual; or compute the gradient for binomial deviance and observe it equals “indicator − probability.” A short derivation sketch follows.
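
For that last bullet, a compact derivation sketch (standard algebra, not tied to a particular slide):

```latex
% Squared error: L(y, f) = (1/2)(y - f)^2  gives  -dL/df = y - f, the residual.
% Binomial deviance with y in {0,1}, f the log-odds, p = 1 / (1 + e^{-f}):
\[
  L(y, f) = -\,y f + \log\bigl(1 + e^{f}\bigr)
  \quad\Longrightarrow\quad
  -\frac{\partial L}{\partial f} = y - \frac{e^{f}}{1 + e^{f}} = y - p ,
\]
% i.e. the pseudo-residual is "indicator minus probability".
```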