Bias-variance trade-off (decomposition)
The course’s running theme. The prof flagged it as guaranteed exam material, multiple times, across multiple lectures, and called it a decomposition, not a “trade-off,” to push back on the implication that lower variance must cost you bias. The decomposition is mathematically exact; the trade-off framing is the lens you apply in the U-shape regime, but regularization and over-parameterization can drive both bias and variance down at once (see double-descent).
Definition (prof’s framing)
“I most likely will put an exam question about bias variance. Maybe something like, why am I critical of the word trade-off? Even though it’s not wrong.” - L03-statlearn-2
“I mentioned that this is definitely going to be on the exam. I mean, just this concept. And I don’t want to say how the question will be, but this is an important concept to really wrap your head around because it’s deceptively confusing. It seems really simple. It’s not. It’s like an onion with layers.” - L13-modelsel-2
“I think the first day they’re definitely going to be a question about bias variance in the test because i think that concept is the kind of the running theme through the course in all the models we looked at, we thought we talked about at least briefly or not. In what way is this affecting the trade-off between the bias and the variance?” - L26-nnet-3
The expected squared prediction error of $\hat f$ at a fixed test point $x_0$ decomposes into three terms:

$$E\big[(y_0 - \hat f(x_0))^2\big] = \mathrm{Var}\big(\hat f(x_0)\big) + \big[E[\hat f(x_0)] - f(x_0)\big]^2 + \sigma^2.$$
Expectation is over training samples: bias asks “where does the model land on average across resamples”, variance asks “how much does the model jitter around that average across resamples”, noise is the irreducible floor.
Returns in other modules
- L02-statlearn-1: first mention. Sets up the U-shape narrative; the prof says he “doesn’t like the word trade-off” and flags an exam question about that critique.
- L03-statlearn-2: full board derivation of the three-term decomposition with the polynomial-regression toy (simulate from a known true model; fit polys of degree 1, 2, 10, 20). Standalone exam-flag.
- L04-statlearn-3: recap, then the double-descent / benign-overfitting digression. Bias and variance still sum to the test MSE in the over-parameterized regime; what changes is the U-shape, see double-descent.
- L05-linreg-1: variance reframed as “how the fitted line wiggles across resamples”; the standard error tells you how to design an experiment to reduce variance.
- L09-classif-3: the LDA-fitted-vs-Bayes-optimal-on-training confusion. In-sample LDA “beats” the Bayes classifier because it chases noise (lower error on the training data); out of sample, Bayes wins. QDA vs LDA = canonical bias-variance argument: QDA estimates a covariance matrix per class ($K\,p(p+1)/2$ parameters) vs LDA’s single shared one ($p(p+1)/2$), so QDA’s lower bias gets eaten by variance when $n$ is small.
- L10-resample-1: the U-shape recap motivates CV. KNN-regression as the running example: K=1 → high variance, K=N → high bias.
- L11-resample-2: applied to CV variants: validation set = high bias / low variance; LOOCV = low bias / high variance (folds nearly identical → highly correlated estimates → the variance of the average blows up; see the identity after this list); k=5/10 is the sweet spot. “Typically in this setting, you’re winning by having less variance” because what you really want is generalization across data sets.
- L12-modelsel-1: explicitly sets up the rest of module 6 as variance reduction. “Often we can substantially reduce the variance at the cost of a negligible increase in bias. I think that’s been always surprising to me, just how little bias you need to get a lot of reduction in variance.”
- L13-modelsel-2: verbatim re-flagging as exam material; full thought experiment that “increasing bias a little can reduce variance a lot because of the squared term.”
- L14-modelsel-3: frames the whole regularization module as variance reduction. Cross-validation picks $\lambda$ at the bias-variance balance (bias rises with $\lambda$, variance drops, test MSE U-shapes through a minimum).
- L18-trees-2: “High variance is like death… that’s like what all of modern machine learning is like remove variance, remove variance. Bias is easy. Let’s get rid of variance.” Trees illustrate it perfectly: greedy splits give very high data-set sensitivity.
- L20-boosting-2: depth of weak learners is justified by variance: “We want weak learners… we don’t want overly precise learners. We want learners that make good progress and a step in the right direction… We’re specifically trying to reduce variance in this setting.”
- L23-nnet-1: mini-batch SGD has implicit L2 regularization → variance reduction in the over-parameterized regime.
- L24-nnet-2: explicit regularization menu (L1/L2, dropout, augmentation, early stopping, transfer learning) all framed as variance reduction without paying full bias cost. Benign overfitting (training fit perfect, generalization still good) is the prof’s headline NN claim.
- L26-nnet-3: the full double descent treatment. Bias-variance still adds up exactly past the interpolation point; the model picks the minimum-norm interpolator. The prof explicitly restates the exam guarantee.
- L27-summary: restated for the third time: “There will be a question on the bias-variance decomposition. So if you haven’t learned anything in the course, I recommend reading that part of the book.” Q3a of the 2025 walkthrough is “lasso correct because increase in bias is less than decrease in variance”, a direct application.
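The LOOCV variance argument in the L11 bullet above (and the bagging/random-forest “floor” cited later in these notes) both come down to one standard identity: for $k$ identically distributed fold estimates $\hat e_i$ with common variance $\tau^2$ and pairwise correlation $\rho$,

$$\mathrm{Var}\Big(\frac{1}{k}\sum_{i=1}^{k}\hat e_i\Big) = \rho\,\tau^2 + \frac{1-\rho}{k}\,\tau^2.$$

LOOCV’s folds share almost all their training data, so $\rho \approx 1$ and the first term dominates however large $k$ gets; k = 5/10 keeps the folds less correlated.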
Notation & setup
- Truth: $y_0 = f(x_0) + \epsilon$, $E[\epsilon] = 0$, $\mathrm{Var}(\epsilon) = \sigma^2$, $\epsilon$ independent of the training data.
- Fitted model: $\hat f$, trained on a random training set. The randomness in $\hat f$ comes from the training sample.
- Expectation: $E[\cdot]$ is taken jointly over the random training set and the noise $\epsilon$ in $y_0$. Two important sub-expectations:
- $E[\hat f(x_0)]$: average prediction at $x_0$ across many independent training samples.
- $\mathrm{Var}\big(\hat f(x_0)\big) = E\big[(\hat f(x_0) - E[\hat f(x_0)])^2\big]$: variance of the prediction at $x_0$ across many independent training samples.
Formula(s) to know cold
The decomposition the prof derives on the board (and CE1 problem 1b makes you re-derive):

$$E\big[(y_0 - \hat f(x_0))^2\big] = \underbrace{\mathrm{Var}\big(\hat f(x_0)\big)}_{\text{variance}} + \underbrace{\big[E[\hat f(x_0)] - f(x_0)\big]^2}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{irreducible}}.$$
Derivation (board version, L03-statlearn-2)
Substitute $y_0 = f(x_0) + \epsilon$:

$$E\big[(y_0 - \hat f(x_0))^2\big] = E\big[(f(x_0) - \hat f(x_0))^2\big] + 2\,E\big[\epsilon\,(f(x_0) - \hat f(x_0))\big] + E[\epsilon^2].$$
The cross term vanishes because $E[\epsilon] = 0$ and $\epsilon$ is independent of $\hat f(x_0)$ (the noise in $y_0$ has nothing to do with the training set); $E[\epsilon^2] = \sigma^2$. So

$$E\big[(y_0 - \hat f(x_0))^2\big] = E\big[(f(x_0) - \hat f(x_0))^2\big] + \sigma^2,$$

the reducible / irreducible split.
Now decompose the reducible part by adding and subtracting $E[\hat f(x_0)]$:

$$f(x_0) - \hat f(x_0) = \big(f(x_0) - E[\hat f(x_0)]\big) + \big(E[\hat f(x_0)] - \hat f(x_0)\big).$$
Square it. The cross term is zero because $E\big[E[\hat f(x_0)] - \hat f(x_0)\big] = 0$ (and $f(x_0) - E[\hat f(x_0)]$ is a constant that factors out). What’s left:

$$E\big[(f(x_0) - \hat f(x_0))^2\big] = \big[f(x_0) - E[\hat f(x_0)]\big]^2 + \mathrm{Var}\big(\hat f(x_0)\big).$$
Two cross-term cancellations, one substitution, done. “We don’t like cross terms, it’s too much work to keep track of them. So conveniently we now have… those first two terms.” - L03-statlearn-2.
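A minimal Monte-Carlo check of the decomposition (my sketch, not course code; the quadratic truth, noise level, degree, and test point are all assumptions): refit on many independent training draws, then compare bias² + variance + σ² against the simulated test MSE at one $x_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 1.0 + 2.0 * x - 3.0 * x**2       # assumed quadratic truth
sigma2, n, degree, x0 = 0.25, 100, 10, 0.5     # assumed noise var, sample size, poly degree, test point

# Record f_hat(x0) across many independent training samples (the source of variance).
preds = []
for _ in range(5000):
    x = rng.uniform(-1, 1, n)
    y = f(x) + rng.normal(0, np.sqrt(sigma2), n)
    preds.append(np.polyval(np.polyfit(x, y, degree), x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2            # [E f_hat(x0) - f(x0)]^2
var = preds.var()                              # Var(f_hat(x0)) over training draws
# Expected test MSE at x0: average over training draws AND fresh noise in y0.
y0 = f(x0) + rng.normal(0, np.sqrt(sigma2), preds.size)
mse = ((y0 - preds) ** 2).mean()

print(f"bias^2={bias2:.4f}  var={var:.4f}  sigma^2={sigma2}  "
      f"sum={bias2 + var + sigma2:.4f}  simulated MSE={mse:.4f}")
```

The two totals agree up to Monte-Carlo error; rerunning with degree 1 moves the error from the variance term into bias², which is the L03 toy’s point.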
Classification analogue
Same idea: an irreducible Bayes error rate plays the role of $\sigma^2$. The Bayes classifier (assigns each $x$ to the class maximizing $P(Y = k \mid X = x)$) is the optimal classifier; its error is the floor any classifier must respect. See classification-setup.
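A hedged sketch of the floor (a made-up 1-D toy, not the prof’s example): two Gaussian classes with equal priors, where the Bayes rule and its error are known in closed form, versus a boundary estimated from training data.

```python
import numpy as np
from scipy.stats import norm

# Assumed toy: class A ~ N(-1, 1), class B ~ N(+1, 1), equal priors.
# The Bayes rule thresholds at x = 0; its error rate Phi(-1) is the floor,
# the classification analogue of sigma^2.
bayes_error = norm.cdf(-1.0)

rng = np.random.default_rng(1)
n, errs = 50, []
for _ in range(2000):
    xa = rng.normal(-1.0, 1.0, n)              # training sample, class A
    xb = rng.normal(+1.0, 1.0, n)              # training sample, class B
    t = (xa.mean() + xb.mean()) / 2            # 1-D LDA boundary: midpoint of estimated means
    # Exact out-of-sample error of the rule "classify B if x > t":
    errs.append(0.5 * norm.sf(t + 1.0) + 0.5 * norm.cdf(t - 1.0))

print(f"Bayes floor: {bayes_error:.4f}   fitted boundary, avg over resamples: {np.mean(errs):.4f}")
```

The fitted boundary’s average error sits slightly above the floor; the gap is the variance cost of estimating the threshold from data, and no amount of in-sample “beating” the Bayes rule changes that out of sample.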
Insights & mental models
Where each term comes from
- Irreducible: pure noise; can’t be removed by any model. “You can improve on that if you have more data, or if you can get better samples without noise.” - L04-statlearn-3.
- Bias²: the systematic error of using the wrong model class. If the truth is quadratic and you fit a line, $E[\hat f]$ is itself the wrong shape regardless of how much data you have. Bias absorbs both “wrong model class” and “wrong sample” contributions (L04-statlearn-3).
- Variance: how much $\hat f(x_0)$ wobbles when you re-draw the training data. Highly flexible models (KNN with K=1, deep trees, degree-10 or degree-20 polynomials) wobble a lot.
The U-shape
ISL Figure 2.12, repeatedly cited:
- x-axis = flexibility (polynomial degree, tree depth, $1/K$ for KNN, $1/\lambda$ for ridge…)
- bias↓ as flexibility↑
- variance↑ as flexibility↑
- test MSE = bias² + variance + $\sigma^2$, U-shaped, minimized somewhere in the middle.
- training MSE strictly decreases, never tells you where to stop.
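How the U-shape is generated empirically, in the spirit of CE1 1e (the sinusoidal truth and all constants are my assumptions), sweeping K for KNN regression at a single test point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)                    # assumed smooth truth
sigma2, n = 0.25, 100
x0 = np.array([[0.5]])                         # fixed test point
f0 = float(np.sin(3 * 0.5))

for K in (1, 2, 5, 10, 25, 50, 100):           # flexibility axis is 1/K
    preds = []
    for _ in range(500):
        X = rng.uniform(-1, 1, (n, 1))
        y = f(X).ravel() + rng.normal(0, np.sqrt(sigma2), n)
        preds.append(KNeighborsRegressor(n_neighbors=K).fit(X, y).predict(x0)[0])
    preds = np.array(preds)
    bias2, var = (preds.mean() - f0) ** 2, preds.var()
    print(f"K={K:3d}  bias^2={bias2:.4f}  var={var:.4f}  test MSE ~ {bias2 + var + sigma2:.4f}")
```

K=1 is the variance-dominated end; K=n predicts something close to the global mean (bias-dominated); the printed sum traces the U, while a training-MSE column would just keep falling with flexibility.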
Why the prof rejects “trade-off”
The decomposition is exact; the trade-off framing implies movement on bias necessarily forces opposite movement on variance. That isn’t true: you can pick a better model (e.g. add a regularizer to a flexible class) and flatten the variance curve without sacrificing bias:
“What you can do is you can change how your model behaves and how much variance it has to give up in order to reduce the bias or how much bias it has to give up to reduce the variance. Because ultimately, the goal is to minimize both of these terms. And one way to do that is actually change what model you’re fitting to your data.” - L03-statlearn-2
“If I made this same plot, you wouldn’t see the variance go up as much… the variance term would be more like that or something. The variance wouldn’t explode because it’s a different model.” - L03-statlearn-2
“The reason I don’t really like the word tradeoff is that it doesn’t always have to be a tradeoff. In fact, you can reduce both. For example, here it starts with degree 2, but if instead you start with degree 1 then you’d see that both the bias and the variance are decreasing.” - L04-statlearn-3
This same thread becomes double-descent in the over-parameterized regime: “a model that interpolates the training data perfectly can still generalize well, variance shrinks again because the optimization picks the minimum-norm solution among infinitely many interpolators.”
The “small bias buys big variance reduction” lever
Why so many modern methods work:
“If you reduce the bias a little bit, sorry, if you increase the bias a little bit, you can reduce the variance a lot. Because you have the squared term there.” - L13-modelsel-2
Bias enters the MSE squared, so starting from near-zero bias a small increase is a second-order cost: adding bias $b$ contributes only $b^2$ to the MSE, while the variance it buys off enters linearly. Variance can drop by orders of magnitude in exchange. This is the pitch for ridge/lasso, smoothing splines, dropout, mini-batch SGD, bagging, and pruning. Every regularizer in the course is implicitly making this trade.
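A sketch of the lever in isolation (assumed Gaussian design, truth, and penalty; not a course example): OLS vs ridge bias²/variance at one test point, in a small-n, many-predictor setting where OLS variance is large.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, lam = 20, 25, 5.0                        # assumed dimensions and penalty
beta = np.ones(p) / np.sqrt(p)                 # assumed true coefficients
x0 = np.ones(p) / np.sqrt(p)                   # fixed test point, so f(x0) = x0 @ beta = 1
f0 = x0 @ beta

def ridge_fit(X, y, lam):
    # Ridge estimate (lam = 0 recovers OLS): (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for name, l in (("OLS", 0.0), ("ridge", lam)):
    preds = []
    for _ in range(3000):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(0, 1.0, n)   # noise variance 1
        preds.append(x0 @ ridge_fit(X, y, l))
    preds = np.array(preds)
    print(f"{name:5s}  bias^2={(preds.mean() - f0) ** 2:.4f}  var={preds.var():.4f}")
```

The point is the asymmetry: ridge’s bias² rises by a small squared quantity while its variance drops by far more, so the sum falls.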
How to recognize each piece in a plot
Standard “ISL Figure 2.12 / 5.5 / 6.5” picture:
- dotted horizontal line = $\sigma^2$ (irreducible)
- one curve dropping monotonically = bias²
- one curve rising monotonically (often steeply at high flexibility) = variance
- their sum (plus the dotted line) = test MSE, U-shaped
In simulations where you know the truth (CE1 1e), you can plot all four. In real data you only ever see test MSE estimated by CV; the “bias” inside CV is the training-data bias, not the truth-relative bias (the prof flagged this distinction in L10-resample-1).
Exam signals
“I most likely will put an exam question about bias variance. Maybe something like, why am I critical of the word trade-off? Even though it’s not wrong.” - L03-statlearn-2
“I mentioned that this is definitely going to be on the exam. I mean, just this concept. And I don’t want to say how the question will be, but this is an important concept to really wrap your head around because it’s deceptively confusing. It seems really simple. It’s not. It’s like an onion with layers.” - L13-modelsel-2
“I think the first day they’re definitely going to be a question about bias variance in the test because i think that concept is the kind of the running theme through the course in all the models we looked at.” - L26-nnet-3
“There will be a question on the bias-variance decomposition. So if you haven’t learned anything in the course, I recommend reading that part of the book.” - L27-summary
“I think this is, of the theoretical stuff that we talk about [in] the course… a particularly interesting one.” - L03-statlearn-2 (recommending you re-derive it yourself)
The 2025 exam Q3a (lasso vs least squares) was solved via direct bias-variance reasoning (“less flexible than LS, improved accuracy when the increase in bias is less than the decrease in variance” - L27-summary). The 2024 exam’s mathy theory question was the MLE = OLS derivation under Gaussian errors, the second most likely “mathy” question after the bias-variance derivation itself.
Pitfalls
- Bias is NOT zero just because the noise $\epsilon$ vanishes in expectation. Bias is the truth-vs-expected-fit gap; the expected fit needn’t equal the truth.
- The cross terms vanish for clean reasons: $E[\epsilon] = 0$ (together with $\epsilon$’s independence from the fit) in the first cancellation; $E\big[\hat f(x_0) - E[\hat f(x_0)]\big] = 0$ in the second. Don’t offer “noise is independent of the fit” as the reason for the second one: it’s true, but it’s not what kills that cross term (L03-statlearn-2 is careful here).
- Variance is over the training distribution, not the noise distribution. A common confusion: students treat $\mathrm{Var}\big(\hat f(x_0)\big)$ as if it included $\mathrm{Var}(\epsilon)$. It doesn’t; those are separate terms.
- In CE1 1e, the “bias” in the slide is computed against the fitted data, not the true model. L10-resample-1 explicitly flags this: in real settings you never know the truth, so you estimate the test-MSE shape via CV, not via the bias decomposition directly.
- Bias-variance is more relevant for prediction than inference (CE1 1d (i) is FALSE). For inference you care about the sampling distribution of $\hat\beta$, not its squared prediction error. (Many students reverse this.)
- As $n \to \infty$, test MSE does NOT approach zero (CE1 1d (ii) is FALSE). Even with infinite data, $\sigma^2$ stays. Variance shrinks; bias can shrink to zero only if the model class contains the truth.
- Lower bias does NOT imply better predictions (CE1 1d (iii) is FALSE). Comparing two methods on bias alone ignores the variance term, and modern methods accept some bias for big variance reduction.
- In high-noise settings ($\sigma^2$ large), you do NOT want a more flexible model (CE1 1d (iv) is FALSE). The opposite: you want low-variance methods because the noise floor is high regardless, and flexibility just adds variance on top.
- Don’t confuse “bias-variance trade-off” with “overfitting”: they’re related but distinct. Overfitting describes the symptom (good train, bad test); bias-variance is the mechanism (variance > bias-improvement-from-flexibility past the U-minimum).
Scope vs ISLP
- In scope: the 3-term decomposition, the U-shape, why “trade-off” is a misleading label, applications across every method (regression, KNN, ridge/lasso, trees, RF, boosting, splines, NNs, double-descent).
- Look up in ISLP: §2.2.2 (the original derivation); §5.1.4 (bias-variance trade-off for k-fold CV); §6.2 introduction (regularization framed via bias-variance).
- Skip in ISLP (book-only, prof excluded): the formal derivation that the variance term scales with the number of parameters - L13-modelsel-2: “In the other book, written by the same authors, they go through a more formal decomposition of this to show that this variance term depends heavily on the number of parameters. We’re not going to go through that because it’s left out of this course.”
Exercise instances
- Exercise2.2: flexible vs rigid methods, test error, overfit/underfit relation to bias-variance. “Relate the problem of over- and underfitting to the bias-variance trade-off.”
- Exercise2.5: full polynomial regression simulation: simulate data from a known model, fit polynomials of degrees 1, 2, 10, 20, decompose test MSE empirically into bias², variance, and irreducible.
- Exercise5.2: bias-variance comparison across CV schemes (val-set, LOOCV, k-fold).
- Exercise8.1c: the role of bagging and random forest in attacking variance.
- CE1 problem 1a: write down the expected test MSE at $x_0$.
- CE1 problem 1b: derive the three-term decomposition (verbatim of the L03 board work).
- CE1 problem 1c: interpret each term in plain English.
- CE1 problem 1d: true/false on common misconceptions (see Pitfalls above).
- CE1 problem 1e: read a pre-made bias-variance plot for KNN over $K$.
How it might appear on the exam
- Mathy question: derive the three-term decomposition (the prof’s flagged “at least one mathy theory question”). Almost certainly the exam’s mathy slot if it’s not the MLE = OLS derivation. Show both cross-term cancellations and identify each piece.
- True/False on direction-of-effect: flexibility ↑ → bias↓, variance↑; $\lambda$↑ on ridge → bias↑, variance↓; KNN K ↑ → bias↑, variance↓.
- Why is the prof critical of the word “trade-off”? Two-perspective answer: (i) the decomposition is exact, and a good model choice (e.g. a regularized flexible class) flattens the variance curve without paying full bias cost; (ii) double descent in the over-parameterized regime, where bias and variance both shrink past the interpolation point.
- Method comparison: given two models’ train and test errors, which wins and why? The answer is always “less flexible model has more bias and less variance; whichever balance lands lower test error wins.”
- Read a plot: given a U-shape with overlaid bias², variance, and noise lines, identify each curve and predict the optimal flexibility.
- Why ridge/lasso work: “the increase in bias is less than the decrease in variance” (verbatim 2025 Q3a answer).
- Why bagging / random forests work: variance reduction through averaging; correlated trees still leave the $\rho\,\tau^2$ correlation floor (L19-boosting-1 derivation).
- Why deep NNs need regularization: without it, variance explodes; even with infinite parameters, regularizers (dropout, weight decay, mini-batch SGD’s implicit L2) keep variance manageable.
- Why double descent doesn’t break the decomposition: bias and variance still sum exactly to test MSE − $\sigma^2$. The U-shape is just one possible profile.
Related
- double-descent: the prof’s “second descent” past the interpolation point; bias and variance still add up, the U-shape just isn’t the only possibility
- regularization: every regularizer in the course attacks variance with a small bias cost; this is the lever that makes modern ML work
- cross-validation: how we estimate the U-shape in practice when we don’t know the truth
- ridge-regression: canonical “small bias, big variance reduction” via L2 shrinkage
- lasso: same idea via L1 (also does variable selection)
- k-fold-cv vs leave-one-out-cv: direct bias-variance argument (k=5/10 trades a little extra bias for much less variance than LOOCV)
- bagging / random-forest: variance reduction through averaging; the floor formula motivates random forest’s decorrelation trick
- boosting / gradient-boosting: bias reduction through sequential weak learners; weak learners chosen to keep variance low
- nn-regularization: explicit (L1/L2/dropout/augmentation/early stopping) and implicit (mini-batch SGD) variance control for NNs
- flexibility-overfitting-underfitting: the symptom; bias-variance is the mechanism
- reducible-vs-irreducible-error: the first decomposition (one term coarser); bias-variance refines the reducible part