Weak learners and the learning rate

The two “force everything to be small” hyperparameters that make boosting work. Weak learner = a deliberately shallow / dumb base model (typically a tree with few splits) that individually does barely better than random. Learning rate η = a scalar in (0, 1] multiplying each tree’s contribution before it’s added to the ensemble, so each step is a small step. Together they are why boosting reduces bias without exploding variance.

Definition (prof’s framing)

“We want weak learners, and this way we’re kind of forcing them to be weak by not letting them have a strong vote.” - L19-boosting-1

Restated more carefully a lecture later, in the steepest-descent picture:

“Weak learners sound like a weird thing to want, but if you remember, the thing we really don’t want to do is overfit. So weak learners won’t overfit. And also, in the gradient descent picture, we don’t want to make that huge jump.” - L20-boosting-2

Notation & setup

  • Weak learner = the per-iteration base model. For tree boosting, a small tree characterized by:
    • Tree depth d (or equivalently the number of leaves): typically d between 1 and 5 in practice. d = 1 → a stump (one split). d = 2 → captures pairwise interactions. Rarely deeper.
    • Minimum observations per terminal node: secondary, less important for big data.
  • Learning rate η (also written ν; the prof switches mid-lecture): the scalar multiplying each new tree before it’s added to the running model, f_m(x) = f_{m−1}(x) + η · T_m(x). Empirical rule: keep η small, at most about 0.1. Common values: 0.1, 0.05, 0.01, 0.001.
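
Quick mapping to keep handy (my own sketch, not the course code): where each of these knobs lives among gbm()'s arguments, with the usual xgboost counterparts noted in comments. Boston_train is a placeholder training data frame and the values are illustrative, not tuned.

```r
# Sketch: where the hyperparameters live in gbm() (xgboost counterparts in comments).
# Boston_train is a placeholder data frame; values below are illustrative, not tuned.
library(gbm)

fit <- gbm(
  medv ~ ., data = Boston_train,
  distribution      = "gaussian",
  n.trees           = 5000,  # B, number of boosting iterations (xgboost: nrounds)
  interaction.depth = 2,     # d, splits per tree = weak-learner size (xgboost: max_depth)
  shrinkage         = 0.1,   # the learning rate (xgboost: eta)
  n.minobsinnode    = 10,    # minimum observations per terminal node (secondary knob)
  cv.folds          = 5      # lets gbm.perf() pick B at the CV minimum afterwards
)
```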

Formula(s) to know cold

Where η enters the gradient tree boosting algorithm: modify step 2(d) of Algorithm 10.3 so that the update becomes

  f_m(x) = f_{m−1}(x) + η · Σ_{j=1}^{J_m} γ_{jm} · 1(x ∈ R_{jm})

Equivalently, for the regression-tree special case (Algorithm 8.2 from the slides), with λ playing the role of η:

  f̂(x) ← f̂(x) + λ · f̂^b(x),   r_i ← r_i − λ · f̂^b(x_i)

The trade-off (L19-boosting-1 / L20-boosting-2):

“If you had a very weak eta… like 0.00001 then you’d need to do this a whole bunch of times before you get anywhere… whereas if it was bigger then you’d need fewer.” - L20-boosting-2

Operational consequence: smaller η ⇒ larger B needed to reach the same fit. Standard recipe: fix η small (≈ 0.1), then tune B via early stopping on a validation set or CV.
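
To see the shrunken update and the η–B coupling mechanically, here is a minimal hand-rolled least-squares boosting loop with rpart stumps (my own illustration on simulated data, not the slides' code):

```r
# Minimal hand-rolled least-squares boosting with stumps (my own sketch, not course code).
# Shows exactly where eta multiplies in, and why a smaller eta needs a larger B.
library(rpart)

boost_ls <- function(df, eta = 0.1, B = 200, depth = 1) {
  f <- rep(mean(df$y), nrow(df))                     # f_0: constant start
  for (b in seq_len(B)) {
    df$r <- df$y - f                                 # residuals = neg. gradient of squared loss
    tree <- rpart(r ~ . - y, data = df,
                  control = rpart.control(maxdepth = depth, cp = 0))
    f <- f + eta * predict(tree, df)                 # shrunken step: f_b = f_{b-1} + eta * T_b
  }
  mean((df$y - f)^2)                                 # training MSE after B small steps
}

set.seed(1)
df <- data.frame(x = runif(500))
df$y <- sin(6 * df$x) + rnorm(500, sd = 0.2)

boost_ls(df, eta = 0.1,  B = 200)   # gets close to the noise level
boost_ls(df, eta = 0.01, B = 200)   # same B, clearly underfits: would need many more trees
```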

Insights & mental models

  • The steepest-descent analogy. Each tree is one step on the loss landscape. η is the step size: too big and you overshoot the minimum or oscillate; too small and you take forever. The same intuition you have for vanilla gradient descent, only here the “step direction” is itself an estimated tree.
  • Forcing weakness is a feature, not a workaround. Strong individual learners would each over-explain the residual at their step, leaving little for subsequent trees → poor diversity → worse generalization. Weak learners + many of them = the bias gets aggregated down without inviting variance.

    “We want weak learners… we don’t want overly precise learners. We want learners that make good progress and a step in the right direction, but not ones that are going to throw you out into the, you know, too far or overfit. We’re specifically trying to reduce variance in this setting.” - L20-boosting-2

  • Tree depth controls interaction order. d splits ⇒ at most d variables in any root-to-leaf path ⇒ at most d-way interactions are modeled. So d = 1 (stumps) → purely additive; d = 2 → up to 2-way interactions; d = 5 → up to 5-way. The Elements rule of thumb of 4–8 leaves corresponds to “allow up to 3- to 7-way interactions” (see the simulation sketch after this list).
  • Same depth across all trees is the modern norm. Historically people grew big trees and pruned, but

    “they realized it didn’t help as much as they wanted.” - L20-boosting-2

  • η is regularization: it’s the boosting analog of ridge’s λ, the lasso’s λ, the smoothing spline’s λ, and the NN learning rate (where it doubles as both optimization step size and implicit regularizer). Smaller η ⇒ more regularization ⇒ more trees needed.
  • Why this is different from random forests. With RF you can keep adding trees almost for free: variance reduction asymptotes but doesn’t go negative. With boosting, residuals eventually become noise, and additional trees overfit. So B and η are real, coupled tuning parameters here, not “just use enough trees.”
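
To make the interaction-order point concrete, a small simulation (my own, assuming the standard gbm() interface): the signal is a pure x1·x2 interaction, which stumps can only approximate additively while depth-2 trees can represent it.

```r
# Sketch (my own simulation, standard gbm() interface): depth d controls interaction order.
library(gbm)
set.seed(1)
n   <- 2000
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 4 * dat$x1 * dat$x2 + rnorm(n, sd = 0.1)   # pure 2-way interaction signal
train <- sample(n, n / 2)

test_mse <- function(d) {
  fit <- gbm(y ~ ., data = dat[train, ], distribution = "gaussian",
             n.trees = 2000, interaction.depth = d, shrinkage = 0.1)
  mean((predict(fit, dat[-train, ], n.trees = 2000) - dat$y[-train])^2)
}

test_mse(1)   # stumps: additive in x1 and x2, cannot represent the product term
test_mse(2)   # depth 2: can, test MSE drops toward the noise variance
```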

Worked example: Boston via gbm() (slide deck)

Two runs with the same large number of trees B: one at gbm()’s default shrinkage, one at a different η. Test MSE comparable on Boston (“doesn’t make a big difference”) because B is plenty big in both cases. The point of the comparison: with enough B, the practical effect of η on the final fit is small; what shifts dramatically is the training time and the minimum-CV-error iteration.
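
A hedged reconstruction of that comparison, assuming the ISLR-style Boston split (the exact B and shrinkage values from the deck aren't reproduced here; this just shows the shape of the experiment):

```r
# Sketch of the comparison, assuming the ISLR-style Boston setup; not the slides' exact values.
library(gbm)
library(MASS)                                # Boston data
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)

run <- function(eta) {
  fit    <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                n.trees = 5000, interaction.depth = 2, shrinkage = eta, cv.folds = 5)
  best_B <- gbm.perf(fit, method = "cv", plot.it = FALSE)   # minimum-CV-error iteration
  pred   <- predict(fit, Boston[-train, ], n.trees = best_B)
  c(eta = eta, best_B = best_B, test_mse = mean((pred - Boston$medv[-train])^2))
}

run(0.1)    # CV minimum reached after relatively few trees
run(0.01)   # CV-optimal B shifts up a lot; test MSE ends up comparable
```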

Exam signals

“We need to specify the number of trees, the depth, and the shrinkage. How do you determine good values?” - L27-summary

The prof’s answer: cross-validation. Fix η small (e.g. 0.1), pick B at the CV minimum, sanity-check depth against bias-variance reasoning.

“You should be prepared to think about and discuss in like an exam setting for example: like why would we want the trees not to be too deep?” - L20-boosting-2

The bias-variance answer the prof gave himself: shallow trees are weak learners that take small, careful steps in the right direction; deep trees overfit per-step and break the additive-correction logic.

“It’s funny when you can only get the result that the people show if you set the seed equal to like 1, 2, 3 because it shouldn’t matter. You should get something very similar if you don’t set the seed.” - L20-boosting-2

(Honest skepticism: be wary of cherry-picked η / B choices that work only with a specific random seed.)

Pitfalls

  • Treating η and B independently. They’re coupled. The standard recipe is fix η, then tune B, not the other way around.
  • Setting η = 1. That kills the “small step” property: each tree contributes its full prediction, the ensemble jumps to a near-perfect training fit almost immediately, no diversity, no benefit over a single bigger tree.
  • Growing deep trees in boosting. This is the random-forest reflex transferred to the wrong setting. RF wants high-variance/low-bias trees because it averages them. Boosting wants high-bias/low-variance weak learners because it sums their corrections.
  • Forgetting to do early stopping. With small η, you don’t know in advance what B should be; running for a fixed B may overfit or underfit. Use cv.folds (in gbm()) or early_stopping_rounds (in xgboost) to find the optimal B (see the sketch after this list).
  • Confusing the learning rate η with the subsample fraction in stochastic GBM. The prof uses η for both the learning rate (in some lectures) and the row-subsample fraction (Friedman 2002 notation). In these notes η = learning rate; elsewhere it depends on context, so check the deck.
  • Ignoring the seed. Stochastic-flavored boosting (subsampled rows / cols) can give visibly different results across seeds; set one and report it.
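
For the early-stopping pitfall above, a minimal sketch using xgboost's classic xgb.train() interface (X_train, y_train, X_valid, y_valid are assumed numeric data, not course objects):

```r
# Sketch: early stopping with xgboost's classic xgb.train() interface.
# X_train, y_train, X_valid, y_valid are assumed numeric data (not course objects).
library(xgboost)

dtrain <- xgb.DMatrix(as.matrix(X_train), label = y_train)
dvalid <- xgb.DMatrix(as.matrix(X_valid), label = y_valid)

fit <- xgb.train(
  params = list(objective = "reg:squarederror",
                eta       = 0.05,   # small learning rate ...
                max_depth = 2),     # ... and shallow (weak) trees
  data      = dtrain,
  nrounds   = 5000,                 # generous upper bound on B
  watchlist = list(valid = dvalid),
  early_stopping_rounds = 50        # stop once validation error stalls
)

fit$best_iteration                  # the B picked by early stopping
```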

Scope vs ISLP

  • In scope: the concept of weak learners (small trees), the role of tree depth d as interaction-order control, the empirical rule that trees stay shallow, the role of η as a step-size / shrinkage regularizer, the rule of keeping η at roughly 0.1 or below, the η–B coupling, the early-stopping recipe, the connection to gradient descent.
  • Look up in ISLP: §8.2.3 spells out shrinkage λ (the book’s notation for what the slides call η) and tree depth d, pp. 343–347. The empirical rule of 4–8 leaves and the deeper “interaction order = depth” framing live in Elements ch. 10 (reference, not exam material).
  • Skip in ISLP (book-only, prof excluded):
    • Detailed pseudocode of where exactly η multiplies in (L27-summary / L20-boosting-2): the concept matters, the line-by-line detail doesn’t.
    • Heavy theory of step-size choice / convergence rates: out per the prof’s “no fancy proofs” comment.

Exercise instances

  • Exercise 9.3: explain the learning rate η: what does it mean, how would one choose it, where does it enter Algorithm 10.3 (modify step 2(d)). The expected discussion: η shrinks each tree’s contribution → forces weak / slow learning → smaller η requires larger B → typically pick η small (≈ 0.1) and tune B by early stopping on a validation set.
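
For reference, the expected modification written against Elements' Algorithm 10.3 step 2(d) (Elements writes the learning rate as ν; shown here with η, the slides' symbol):

```latex
% Step 2(d) of Algorithm 10.3, before and after adding the learning rate
\text{original:}\quad f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}\, I\!\left(x \in R_{jm}\right)
\qquad
\text{modified:}\quad f_m(x) = f_{m-1}(x) + \eta \sum_{j=1}^{J_m} \gamma_{jm}\, I\!\left(x \in R_{jm}\right)
```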

How it might appear on the exam

  • Conceptual short-answer: “why do we want trees in boosting to be weak?” Expected answer: weak learners take small steps that don’t overshoot the gradient direction; many weak learners summed reduce bias without overfitting; deep trees individually overfit and break the additive-correction logic.
  • Hyperparameter explanation (à la 2025 Q6c): “how would you choose the learning rate and the number of trees?” Expected answer: fix η small (≈ 0.1), then choose B at the CV minimum / via early stopping; never tune one in isolation from the other.
  • True/false: “in boosting, increasing η requires fewer trees” (true), “η controls the depth of each tree” (false), “the trees in gradient boosting should be deep so they don’t underfit” (false, opposite).
  • Where-does-it-enter: given Algorithm 10.3 without shrinkage, modify the update step to incorporate η (Exercise 9.3 verbatim).
  • Bias-variance reasoning: given a training-error / CV-error vs. B plot, identify the optimal B and explain the U shape using “additional trees fit residual noise = variance increases past the minimum” (see the sketch below).
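
For the plot-reading question, a small sketch of how that curve is produced from a gbm object (here fit is a hypothetical model trained with cv.folds > 1):

```r
# Sketch: reading the B curve off a gbm object (fit: hypothetical model with cv.folds > 1).
B_star <- which.min(fit$cv.error)                 # optimal B sits at the CV-error minimum
plot(fit$cv.error, type = "l",
     xlab = "number of trees B", ylab = "squared error",
     ylim = range(c(fit$cv.error, fit$train.error)))
lines(fit$train.error, lty = 2)                   # training error keeps decreasing
abline(v = B_star, lty = 3)                       # past B_star, extra trees fit residual noise
```
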
Related concepts

  • boosting: the parent concept; weak learners and η are the two regularizers that make boosting work.
  • gradient-boosting: where η enters explicitly (Algorithm 10.3 step 2(d)).
  • adaboost: historically used full-strength stumps with no shrinkage; modern AdaBoost variants add a learning rate.
  • stochastic-gradient-boosting: the third regularizer (subsampling), on top of weak learners + η.
  • xgboost: adds L1/L2 leaf-weight penalties and dropout to the regularization stack.
  • regularization: η is just shrinkage in another guise; the prof groups it with ridge/lasso λ, smoothing-spline λ, NN weight decay.
  • bias-variance-tradeoff: weak learners reduce bias via accumulation; η trades a bit of training-fit speed for variance reduction.
  • cross-validation: the standard tool for picking B given a fixed η.