Double descent / benign overfitting

The prof’s hobbyhorse: he returns to it five times across the course. Past the interpolation point (#parameters ≈ samples), test error climbs to a peak, then comes back down, sometimes below the classical U-shape minimum. This does not break the bias-variance decomposition: bias² and variance still sum exactly to test MSE − σ²; the U-shape just isn’t the only possibility. The mechanism the prof emphasizes: in the over-parameterized regime, the optimization changes from “fit + penalty” to “minimum-norm solution among infinitely many interpolators.” That implicit norm-minimization is what makes modern over-parameterized ML work.

Definition (prof’s framing)

“What they call this is actually benign overfitting. Benign being like, you know, it won’t hurt you. So you’re overfitting, but it’s okay.” - L04-statlearn-3

“It doesn’t break any of the math. It doesn’t break any of the statistics.” - L04-statlearn-3

“I think this is one of the main points of, or one of the main reasons machine learning works. … For me, the right way of looking at it is actually it’s changing the optimization. Because you in this regime where you can interpolate between an infinite number of correct solutions … so that part of the magic of the model.” - L26-nnet-3

The slide claim (L26-nnet-3 verbatim from deck): “Though double descent can sometimes occur in neural networks, we typically do not want to rely on this behavior.” The prof’s qualified agreement: “We don’t want to rely on this maybe yeah again it depends on what you’re trying to do.”

The phenomenon: as model complexity (#parameters p, polynomial degree, NN width, …) grows past the interpolation threshold (p ≈ n), test error follows a non-U shape:

  1. Classical regime (p < n): bias↓, variance↑, U-shape minimum at some intermediate complexity.
  2. Interpolation peak at p ≈ n: variance explodes, test error peaks.
  3. Second descent (p > n): test error drops again, often below the classical minimum.

Returns in other modules

  • L04-statlearn-3: first appearance and most extensive treatment. The prof’s own simulations: fit polynomials of degree up to 100,000 to a step function via the pseudoinverse. Test MSE explodes near the interpolation point, then drops below the small-degree minimum. He flags two mechanisms (both deferred to later modules): implicit ridge (“the parameters are being controlled in some way so that they don’t get out of hand”) and model averaging. He stresses the over-parameterized win only happens when the truth is not in the assumed function class: switch the truth from a step function to a low-degree polynomial and the second descent disappears (low-degree poly recovers the true model). A minimal simulation sketch in this spirit follows this list.
  • L11-resample-2: re-references the L04 simulation as “this bagging trick implicitly.” “You end up with so many different parameters that actually the collection of parameters finds different models and different parts of the parameters and then essentially averages them together. … It’s super weird, but it’s interesting. Comes back in the double-descent discussion much later.”
  • L13-modelsel-2: connection to ridge-regression in the over-parameterized regime: ridge maintains uniqueness when p > n, where OLS blows up, “and there’s a deeper connection too: ridge in the over-parameterised regime lets you average over multiple equally-good solutions, directly related to boosting.”
  • L24-nnet-2: benign overfitting treated head-on for NNs: “You can be in a regime where you’re actually fitting all of your data really well, so your training samples are actually perfectly explained by a model, which I mean as a statistician that sounds impossible, and yet it still can generalize well.” The trick that makes it possible: regularization (explicit + implicit). Statisticians are paranoid about overfitting; AI people aren’t, because they have regularization machinery the statistical tradition didn’t develop.
  • L26-nnet-3: the full double-descent treatment. ISL Fig 10.21 sine + spline example. Then the prof’s contrived square-pulse + Legendre polynomial example: train and test error vs polynomial degree, with test error showing the canonical “first descent → minimum at a moderate degree → blow-up at the interpolation point → second descent below the first minimum.” Bias², variance, and irreducible error computed empirically; they sum exactly to the test MSE. The over-parameterized solutions visualized: spikes at every data point, local mean elsewhere; they fit the training set perfectly while interpolating gently in between.
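
A minimal sketch in the spirit of the L04/L26 simulations, as promised above. It is not the prof’s actual code: the sample size, noise level, Legendre basis, and degree grid are illustrative assumptions, and the lecture pushes the degree far higher (up to 100,000).

```python
# Noisy samples of a step function, Legendre-polynomial features of growing degree,
# fitted by the pseudoinverse (the minimum-norm least-squares solution).
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n = 20
x_tr = np.sort(rng.uniform(-1, 1, n))
f = lambda x: np.where(x > 0, 1.0, 0.0)       # step-function truth (not a polynomial)
y_tr = f(x_tr) + rng.normal(0, 0.1, n)

x_te = np.linspace(-1, 1, 500)
y_te = f(x_te)                                # evaluate against the noiseless truth

for deg in [1, 3, 5, 10, 19, 50, 200, 1000]:  # deg = 19 is the interpolation point here
    X_tr = legendre.legvander(x_tr, deg)      # n x (deg + 1) design matrix
    beta = np.linalg.pinv(X_tr) @ y_tr        # min-norm least-squares coefficients
    X_te = legendre.legvander(x_te, deg)
    print(f"deg={deg:5d}  train MSE={np.mean((X_tr @ beta - y_tr) ** 2):.4f}  "
          f"test MSE={np.mean((X_te @ beta - y_te) ** 2):.4f}")
```

Expect the test column to dip, blow up around the interpolation degree, and come back down at high degree, though the exact shape depends on the seed, noise, and basis. Swapping f for a low-degree polynomial is the prof’s L04 contrast, where the second descent stops paying off.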

Notation & setup

  • n = training set size; p = number of parameters in the model (or polynomial degree, NN width, etc.).
  • Interpolation point: p ≈ n, where the model has just enough capacity to fit every training point exactly. For polynomials with n data points, degree n − 1 (n coefficients) is the smallest degree that interpolates.
  • Interpolating model: one with training error = 0, i.e. f̂(xᵢ) = yᵢ for all i. Past the interpolation point, infinitely many parameter vectors β achieve this.
  • Minimum-norm interpolator: among interpolators, the one with smallest ‖β‖ (or smallest norm in some function space). The pseudoinverse / SGD picks this one; see the sketch just below.
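
A tiny numerical illustration of that last bullet (assumed toy linear setup, not course code): when p > n there are infinitely many exact interpolators, and the pseudoinverse returns the one with the smallest L2 norm.

```python
# With p > n, any vector in the null space of X can be added to an interpolator and
# it still fits the training data exactly; the pseudoinverse solution has no
# null-space component, hence the smallest norm.
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_min = np.linalg.pinv(X) @ y               # minimum-norm interpolator
_, _, Vt = np.linalg.svd(X)                    # rows of Vt beyond rank(X) span the null space
beta_other = beta_min + 3.0 * Vt[-1]           # another exact interpolator

print(np.allclose(X @ beta_min, y), np.allclose(X @ beta_other, y))  # True True
print(np.linalg.norm(beta_min), np.linalg.norm(beta_other))          # min-norm is smaller
```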

Formula(s) to know cold

The optimization changes flavor across the interpolation point (L26-nnet-3):

Classical regime (p < n, or explicitly regularized): minimize RSS(β) + λ‖β‖², i.e. RSS + L2 penalty. The fit term is not zero.

Over-parameterized regime (p > n, post-interpolation): minimize ‖β‖² subject to fitting every training point exactly. The data-fit term becomes a hard constraint; the L2 penalty becomes the only objective.

This is the minimum-norm solution among infinitely many interpolators. The pseudoinverse (when p > n) gives exactly this. Mini-batch SGD also converges to it (L23-nnet-1: “Whenever you use this and you’re in a problem setting where there’s an infinite number of exact solutions, it will find the solution where the L2 norm is minimized.”).
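
A sketch of the implicit-L2 claim under a simplifying assumption (linear model, full-batch gradient descent rather than mini-batch SGD): started at zero, every gradient step stays in the row space of X, so the iterates converge to the same minimum-norm interpolator the pseudoinverse gives.

```python
# Gradient descent on 0.5 * ||X beta - y||^2 with more parameters than data,
# initialized at zero, converges to the minimum-norm least-squares solution.
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.zeros(p)
lr = 5e-3                                      # small enough for stability in this setup
for _ in range(20_000):
    beta -= lr * X.T @ (X @ beta - y)          # gradient step; update lives in the row space of X

beta_pinv = np.linalg.pinv(X) @ y
print(np.max(np.abs(beta - beta_pinv)))        # ~0: same minimum-norm interpolator
```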

Bias-variance still adds up

For any p, in any regime:

  expected test MSE = σ² (irreducible error) + bias² + variance

The decomposition is exact. What changes past the interpolation point:

  • Variance shoots up at p ≈ n (the optimization is ill-conditioned).
  • Past p ≈ n, variance comes back down because the implicit norm-minimization is itself a variance-control device.
  • Bias² stays low (or grows slowly, on a log scale, per L04-statlearn-3).
  • Their sum traces the double-descent curve.

“If you compute all of the things from the bias and variance trade off it all always adds up that the bias and the variance are always trading off of each other right because you can always just go up in one or down in the other. … It’s just the model at this complexity is where it blows up. But, you know, you just keep going.” - L26-nnet-3
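
A rough empirical check of “it all always adds up,” under an assumed toy setup (OLS polynomial fit, sine truth, one test point; not the prof’s square-pulse example): refit on many fresh training sets, estimate bias² and variance of the prediction, and compare σ² + bias² + variance to a Monte Carlo estimate of the test MSE.

```python
# Empirical bias-variance bookkeeping at a single test point x0:
# expected test MSE = sigma^2 (irreducible) + bias^2 + variance.
import numpy as np

rng = np.random.default_rng(3)
sigma, n, deg, x0 = 0.3, 30, 5, 0.4
f = lambda x: np.sin(2 * np.pi * x)            # illustrative truth

preds = []
for _ in range(5000):                          # many independent training sets
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, deg)               # ordinary least-squares polynomial fit
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
test_mse = np.mean((f(x0) + rng.normal(0, sigma, preds.size) - preds) ** 2)
print(f"sigma^2 + bias^2 + var = {sigma**2 + bias2 + var:.4f}   test MSE = {test_mse:.4f}")
```

The two numbers should agree up to Monte Carlo error; the same bookkeeping applies unchanged to over-parameterized fits, which is the point of the quote above.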

Insights & mental models

Why the second descent happens: the prof’s preferred explanation

Past the interpolation point, the model class contains infinitely many zero-training-error solutions. The training loss alone can’t distinguish them. So the optimization picks based on a secondary criterion, and that criterion happens to be implicit L2 (smallest ‖β‖), via the pseudoinverse, mini-batch SGD, or any other norm-controlled fitting procedure.

The minimum-norm solution has low variance because it’s the “most regularized” choice among interpolators: small coefficients mean small wobble across resamples. So past the interpolation point, you’re effectively running ridge regression with an implicit λ chosen by the geometry of the optimization, not by you. That’s the “benign” in benign overfitting.

“We’ve changed the optimization that we’re doing… we’ve still… constructing it the same way but since we have so many parameters that … the model is so flexible it has an infinite number of ways of fitting it and therefore it doesn’t have to pick a model that fits the training data well, all of them do it’s finding the one that has the best variance.” - L26-nnet-3
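
A small sketch of that implicit-ridge reading (assumed toy linear setup, not course code): in the p > n regime, the explicit ridge solution approaches the minimum-norm interpolator as λ → 0.

```python
# Ridge with shrinking lambda converges to the pseudoinverse (min-norm) solution,
# which is the sense in which the interpolating fit is "implicitly ridge".
import numpy as np

rng = np.random.default_rng(4)
n, p = 8, 40
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_minnorm = np.linalg.pinv(X) @ y
for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda={lam:g}   ||ridge - min-norm|| = "
          f"{np.linalg.norm(beta_ridge - beta_minnorm):.2e}")
```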

What the over-parameterized fit actually looks like

L26-nnet-3 visualizes the contrived square-pulse + Legendre polynomial example:

  • Best classical model (the first-descent minimum, degree well below n): smooth, OK fit, residuals scattered.
  • Best second-descent model (degree far above n): spikes through every training point, then regresses to a local mean between data points.

This is the geometric picture of “minimum norm among interpolators.” The model can’t avoid going through each training point (that’s the constraint), but with so many parameters it has room to choose the gentlest interpolator, which here means shooting back toward the local mean as fast as possible after each data spike.

When the over-parameterized regime wins

L04-statlearn-3: only when the truth is not in the assumed function class. The prof’s illustration:

  • Truth = step function (NOT a polynomial) → fit polynomials → the high-degree regime beats the low-degree minimum.
  • Truth = a low-degree polynomial (IS in the assumed class) → fit polynomials → the low-degree minimum recovers the truth perfectly; the high-degree regime doesn’t beat it.

“If the underlying model, the f of x, exists as part of the functions that you’re assuming in your model, then even if you increasingly add more and more degrees of flexibility, you’re not going to improve over what you can get with a few parameters. … But in the real world, I don’t know how often you really can assume that you have the right model.” - L04-statlearn-3

In real ML, the truth is essentially never in the model class (the misspecified-model regime; see statistical-learning). So the second descent is a real, exploitable phenomenon.

Why this matters for the bias-variance “trade-off” critique

The prof rejects “trade-off” because it implies you must pay bias to reduce variance. Double descent shows you can have:

  • Low bias (model is flexible enough to fit anything),
  • Low variance (implicit norm-minimization controls the wobble),
  • Zero training error.

All three at once, past the interpolation point. The U-shape is just one possible profile of the bias-variance decomposition, not a law of nature.

“The reason I don’t really like the word tradeoff is that it doesn’t always have to be a tradeoff. In fact, you can reduce both.” - L04-statlearn-3

This is the canonical “two perspectives on the trade-off” answer for the bias-variance-tradeoff critique exam question.

Connection to bagging (the prof’s surprise)

“It’s actually using this bagging trick implicitly. So you end up with so many different parameters that actually the collection of parameters finds different models and different parts of the parameters and then essentially averages them together.” - L11-resample-2

The intuition: a hugely over-parameterized model implicitly contains many “sub-models” inside it; the optimization selects a configuration that’s effectively their average. Same variance-reduction mechanism as bagging trees, but inside a single fitted model rather than across many independently fitted ones.

Caveats from the slide deck

L26-nnet-3 verbatim:

  • Double descent is “achievable mostly in high signal-to-noise problems: natural image recognition, language translation.”
  • “Most statistical learning methods covered in this course do not exhibit double descent.” (Trees, GAMs, classical regression with explicit regularization don’t go past the interpolation point in the relevant sense.)
  • “Though double descent can sometimes occur in neural networks, we typically do not want to rely on this behavior.”

But the prof: “This is one of the main points of, or one of the main reasons machine learning works.”

Exam signals

“It’s overfitting, but it’s okay. … It doesn’t break any of the math. It doesn’t break any of the statistics.” - L04-statlearn-3

“I think this is one of the main points of, or one of the main reasons machine learning works. … For me, the right way of looking at it is actually it’s changing the optimization.” - L26-nnet-3

“Why am I critical of the word trade-off? Even though it’s not wrong.” - L03-statlearn-2. Double descent is one of the two answers to this prompt (the other being “regularization can flatten the variance curve”).

The prof restated the bias-variance exam guarantee in L26-nnet-3 immediately before the double-descent presentation: a strong signal that double descent is part of what he expects you to know about bias-variance.

Pitfalls

  • “Double descent breaks bias-variance”: FALSE. The decomposition is exact at every p. Double descent is just a non-U profile of test MSE = σ² + bias² + variance.
  • “Past the interpolation point, the variance is zero”: FALSE. Variance shrinks but stays positive. What matters is the balance against bias: in real cases bias² grows slowly while variance shrinks faster, giving a second descent.
  • “You should always over-parameterize”: FALSE. The prof’s slide warning: “we typically do not want to rely on this behavior.” Use it when you’re in the high-SNR regime with massive parameter budgets and good implicit/explicit regularization. Otherwise stay in the classical regime with explicit regularization.
  • “Double descent works because the model finds the truth”: FALSE. The truth is generally NOT in the model class (that’s the prof’s whole misspecified-model framing). Double descent works because among the infinite interpolators, the smallest-norm one happens to generalize well in high-SNR regimes.
  • “It’s the same as ridge regression”: partially true. Ridge in the p > n regime is the explicit version of what the pseudoinverse / mini-batch SGD does implicitly past the interpolation point. The prof: “Ridge regression is what’s going on here even though it’s not actually put in there explicitly.” But ridge with λ > 0 doesn’t interpolate; the implicit-ridge interpretation is specifically about the λ → 0 limit.
  • Don’t memorize the precise location of the interpolation peak: it’s around p ≈ n, but exactly where depends on regularization, optimizer, and model class. Just know it’s there.
  • Most things in the course don’t exhibit double descent. It’s specific to: (i) high-SNR data, (ii) over-parameterized models trained without strong explicit regularization, (iii) optimization that implicitly controls a norm. Trees, GAMs, lasso/ridge with CV-chosen λ: none of these go past the interpolation point.

Scope vs ISLP

  • In scope: the phenomenon (test error has a second descent past the interpolation point), the explanation (minimum-norm interpolator among infinitely many), the bias-variance reconciliation (decomposition still holds, just non-U), why it matters for modern ML, when it does and doesn’t help.
  • Look up in ISLP: §10.8 (the deep-learning chapter’s “What does double descent really mean?” section, with Figs 10.20–10.21).
  • Skip in ISLP (book-only, prof excluded): formal proofs of when minimum-norm solutions generalize (high-dimensional statistics literature); precise interpolation-point characterization for specific model classes. The prof’s L04 / L26 simulations are the in-scope working knowledge.

Exercise instances

None. Double descent is purely lecture material; no recommended-exercise problem touches it directly.

How it might appear on the exam

  • “Why is the prof critical of the word ‘trade-off’?”: two-perspective answer: (i) regularization can flatten the variance curve without paying the full bias cost; (ii) double descent in the over-parameterized regime: bias and variance both shrink past the interpolation point because the optimization picks the minimum-norm interpolator. Cite this lecture explicitly.
  • “Explain double descent / benign overfitting in your own words”: phenomenon (second descent past p ≈ n), mechanism (minimum-norm interpolator), reconciliation with bias-variance (decomposition still holds, the profile just isn’t U-shaped), when it works (high SNR, over-parameterized, weak/no explicit regularization).
  • T/F: “Double descent contradicts the bias-variance decomposition”: FALSE. The decomposition is always exact; double descent is just a non-U profile.
  • T/F: “A model that perfectly fits the training data cannot generalize well”: FALSE under double descent. The minimum-norm interpolator can generalize well in high-SNR regimes.
  • Conceptual question on why over-parameterized NNs work: the prof’s headline answer: “the optimization changes from ‘fit + penalty’ to ‘min penalty subject to fitting’, that’s the magic.”
  • Connection to ridge-regression: the pseudoinverse / SGD past interpolation = implicit ridge with λ → 0. Ridge in the p > n regime keeps the optimization unique; double descent extends this insight to the limit of huge p.
  • “When does the over-parameterized regime not help?”: when the truth is in the model class (then the low-degree minimum recovers it perfectly and double descent has nothing to add) or when the data is low-SNR (the second descent is shallow or absent).
  • bias-variance-tradeoff: the decomposition stays exact across the double-descent curve; this is the canonical “why I don’t like the word trade-off” answer
  • regularization: double descent is implicit regularization (norm minimization among interpolators); explicit ridge/L2/dropout/weight-decay are the same idea written down
  • ridge-regression: the explicit version of what the pseudoinverse / SGD does implicitly past the interpolation point
  • gradient-descent-and-sgd: mini-batch SGD’s implicit L2 is the operational mechanism for the second descent in NNs
  • bagging: the prof’s analogy: over-parameterized models implicitly average over many sub-models, same variance-reduction logic
  • high-dimensional-regression: module 6’s treatment; double descent is the L26 callback to this regime
  • flexibility-overfitting-underfitting: double descent extends the U-shape story past the interpolation point