Module 02: Statistical Learning — Book delta
ISLP chapter 2 sets up vocabulary (supervised vs unsupervised, regression vs classification, prediction vs inference, parametric vs nonparametric), states the reducible/irreducible split (eq. 2.3), states the bias-variance decomposition (eq. 2.7), and introduces the Bayes classifier and KNN. What the chapter does not contain — and what Benjamin built on the board across L02–L04 — is the derivation of the bias-variance decomposition, the full random-vector / multivariate-normal machinery that the rest of the course rides on, and the over-parameterized / double-descent digression with the minimum-norm interpolator. ISLP states the bias-variance result as a fact: “Though the mathematical proof is beyond the scope of this book, it is possible to show that …” (§2.2.2, eq. 2.7). The prof’s derivation is the lookup-able piece this file reproduces.
The module-2 atoms also pull in the multivariate-normal density, the covariance-of-linear-transformation identity $\mathrm{Cov}(AX) = A\,\Sigma\,A^\top$, the correlation-from-covariance formula, contrasts, and the pseudo-inverse / minimum-norm story for double descent. None of these appear in ISLP ch. 2 (some appear later — MVN density in §4.4.2, double descent in §10.8 — but those are outside the mapped chapter, so they are delta for this module).
1. Bias-variance decomposition — the full derivation
[L03, bias-variance-tradeoff, reducible-vs-irreducible-error; ISLP eq. 2.7 states the result, not the proof]
The prof flagged this derivation as exam-likely (“I most likely will put an exam question about bias variance”; “there will be a question on the bias-variance decomposition”) and reiterated the guarantee in L13, L26, and L27. CE1 problem 1b is exactly this derivation. ISLP §2.2.2 states the conclusion (eq. 2.7) and explicitly skips the proof; reproducing it here.
1.1 Setup
Let $x_0$ be an unseen test point with
$$y_0 = f(x_0) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \quad \mathrm{Var}(\varepsilon) = \sigma^2.$$
Let $\hat f$ be a model fit on a random training set $\mathcal{D}$. The expectation below is taken jointly over the training set $\mathcal{D}$ and the noise $\varepsilon$ in $y_0$.
1.2 Step 1 — Reducible / irreducible split
Substitute $y_0 = f(x_0) + \varepsilon$ inside the squared prediction error:
$$\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big] = \mathbb{E}\big[(f(x_0) - \hat f(x_0) + \varepsilon)^2\big]$$
Group as $(a + b)^2$ with $a = f(x_0) - \hat f(x_0)$, $b = \varepsilon$, and expand:
$$= \mathbb{E}\big[(f(x_0) - \hat f(x_0))^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f(x_0) - \hat f(x_0))\big] + \mathbb{E}[\varepsilon^2]$$
Cross term vanishes. $f(x_0) - \hat f(x_0)$ is a function of $x_0$ and the training set only; $\varepsilon$ is independent of both and has mean zero. So $\mathbb{E}\big[\varepsilon\,(f(x_0) - \hat f(x_0))\big] = \mathbb{E}[\varepsilon]\,\mathbb{E}\big[f(x_0) - \hat f(x_0)\big] = 0$. Note (prof, L03): the cross term vanishes because $\mathbb{E}[\varepsilon] = 0$, not merely because of “noise/fit independence”. Don’t conflate the two.
The $\mathbb{E}[\varepsilon^2]$ term. $\mathbb{E}[\varepsilon^2] = \mathrm{Var}(\varepsilon) + (\mathbb{E}[\varepsilon])^2 = \sigma^2 + 0 = \sigma^2$.
Result:
$$\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big] = \underbrace{\mathbb{E}\big[(f(x_0) - \hat f(x_0))^2\big]}_{\text{reducible}} + \underbrace{\sigma^2}_{\text{irreducible}}$$
This is the L03 board step that gives ISLP eq. 2.3 as the first algebraic move.
1.3 Step 2 — Decompose the reducible part with an add-and-subtract trick
Insert $-\,\mathbb{E}[\hat f(x_0)] + \mathbb{E}[\hat f(x_0)]$ inside the squared bracket:
$$\mathbb{E}\big[(f(x_0) - \hat f(x_0))^2\big] = \mathbb{E}\Big[\big(\underbrace{f(x_0) - \mathbb{E}[\hat f(x_0)]}_{a} + \underbrace{\mathbb{E}[\hat f(x_0)] - \hat f(x_0)}_{b}\big)^2\Big]$$
Call these two pieces $a$ (deterministic — depends only on $x_0$) and $b$ (random — depends on the training set $\mathcal{D}$). Square the sum: $(a + b)^2 = a^2 + 2ab + b^2$. Take expectation over the training set:
- $\mathbb{E}[a^2] = \big(f(x_0) - \mathbb{E}[\hat f(x_0)]\big)^2$ — deterministic, comes out of the expectation. This is $\mathrm{Bias}\big(\hat f(x_0)\big)^2$.
- $\mathbb{E}[2ab] = 2a\,\mathbb{E}[b] = 0$. Cross term vanishes by the law of iterated expectations ($\mathbb{E}[b] = \mathbb{E}[\hat f(x_0)] - \mathbb{E}[\hat f(x_0)] = 0$).
- $\mathbb{E}[b^2] = \mathbb{E}\big[(\hat f(x_0) - \mathbb{E}[\hat f(x_0)])^2\big] = \mathrm{Var}\big(\hat f(x_0)\big)$ — the definition of variance.
Substitute back:
$$\mathbb{E}\big[(f(x_0) - \hat f(x_0))^2\big] = \big(f(x_0) - \mathbb{E}[\hat f(x_0)]\big)^2 + \mathrm{Var}\big(\hat f(x_0)\big)$$
1.4 Putting it together
Combining steps 1 and 2:
$$\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big] = \underbrace{\big(f(x_0) - \mathbb{E}[\hat f(x_0)]\big)^2}_{\text{squared bias}} + \underbrace{\mathrm{Var}\big(\hat f(x_0)\big)}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible}}$$
This is ISLP eq. 2.7, derived. Two cross-term cancellations: (i) $\mathbb{E}[\varepsilon] = 0$ kills the noise/fit cross term, (ii) $\mathbb{E}\big[\hat f(x_0) - \mathbb{E}[\hat f(x_0)]\big] = 0$ kills the bias/variance cross term.
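A minimal Monte Carlo sketch of the decomposition (not from the lecture; the toy truth $f(x) = \sin x$, the degree-3 fit, and all constants are assumptions). The three estimated terms should sum to the simulated test MSE at a fixed $x_0$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # assumed toy truth, not the prof's example
sigma = 0.5                     # noise standard deviation
n, degree, x0 = 30, 3, 1.7      # training size, polynomial degree, test point
R = 5000                        # number of resampled training sets

preds = np.empty(R)
for r in range(R):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, degree)      # least-squares polynomial fit
    preds[r] = np.polyval(coef, x0)      # \hat f(x0) for this training set

bias2 = (f(x0) - preds.mean()) ** 2      # (f(x0) - E[\hat f(x0)])^2
var = preds.var()                        # Var(\hat f(x0)) over resamples
y0 = f(x0) + rng.normal(0, sigma, R)     # fresh test responses at x0
mse = ((y0 - preds) ** 2).mean()         # estimate of E[(y0 - \hat f(x0))^2]

print(f"bias^2 + var + sigma^2 = {bias2 + var + sigma**2:.4f}")
print(f"Monte Carlo test MSE   = {mse:.4f}")   # the two should agree closely
```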
1.5 What each term means (prof’s framing)
- Irreducible error $\sigma^2$ — pure noise; cannot be reduced by any choice of $\hat f$. Only better data (lower-noise sensors, measuring the missing covariates) lowers it.
- Squared bias $\big(f(x_0) - \mathbb{E}[\hat f(x_0)]\big)^2$ — the gap between the truth and the expected fit across resampled training sets. Captures both “wrong model class” and “wrong sample” errors (L04 explicit). Bias is truth-relative, not fit-relative.
- Variance $\mathrm{Var}\big(\hat f(x_0)\big)$ — how much the prediction wobbles across resamples of the training data. Variance is taken over the training distribution, not over the test noise $\varepsilon$. These two are separate sources.
1.6 Why the prof rejects the word “trade-off”
The decomposition is mathematically exact at every $x_0$ and every model. “Trade-off” implies that reducing one term forces the other up. That isn’t always true:
- Better model class flattens the variance curve at no bias cost. A regularized flexible class (ridge / lasso / a polynomial-degree-20 fit with shrinkage on the high-order coefficients) can hold bias low while suppressing variance — both terms drop together.
- Double descent / benign overfitting (see §6) — past the interpolation threshold, bias and variance can both shrink as the parameter count grows; the U-shape isn’t a law of nature.
- Local geometry of the U-shape. Bias is squared in the decomposition, so a small absolute increase in bias contributes very little to MSE; variance can drop by orders of magnitude in exchange. This is the operating principle of every regularizer in the course (ridge, lasso, smoothing splines, dropout, mini-batch SGD’s implicit L2, bagging, pruning).
Verbatim, L03: “What you can do is you can change how your model behaves and how much variance it has to give up in order to reduce the bias … ultimately, the goal is to minimize both of these terms. And one way to do that is actually change what model you’re fitting to your data.” L13: “if you increase the bias a little bit, you can reduce the variance a lot. Because you have the squared term there.”
1.7 Classification analogue
The same decomposition idea applies in classification, with the Bayes error rate playing the role of the irreducible error $\sigma^2$. The Bayes classifier is the optimal classifier; its error rate is the floor any classifier must respect. ISLP states this (eq. 2.11) but does not derive a classification analogue of the bias-variance split.
2. Random vectors — expectation and covariance algebra
[L04, random-vector-and-covariance; slides modules/2StatLearn/2StatLearn.2.md; NOT in ISLP ch. 2]
ISLP ch. 2 has no formal treatment of random vectors, covariance matrices, or expectation rules for matrix products. The prof developed all of this on the board in L04 because modules 3 (sampling distribution of $\hat\beta$) and 4 (LDA/QDA, Naive Bayes) require it.
2.1 Random vector and mean vector
A random vector $X$ is a $p$-dimensional vector of random variables:
$$X = (X_1, X_2, \dots, X_p)^\top, \qquad \mathbb{E}[X] = \mu = \big(\mathbb{E}[X_1], \dots, \mathbb{E}[X_p]\big)^\top$$
$\mathbb{E}[X_j]$ is computed from the marginal distribution of $X_j$ and carries no information about dependencies with $X_k$ for $k \neq j$.
The joint distribution governs the whole vector. The marginal of one coordinate is obtained by integrating out the others:
$$f_{X_j}(x_j) = \int \!\cdots\! \int f_X(x_1, \dots, x_p)\, \prod_{k \neq j} dx_k$$
2.2 Rule I — expectation is linear and matrix-additive
For random matrices $X$ and $Y$ of the same dimensions:
$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$
2.3 Rule II — constants pull out of expectation (with a board proof)
For a random matrix $X$ and conformable constant matrices $A$ and $B$:
$$\mathbb{E}[AXB] = A\,\mathbb{E}[X]\,B$$
Element-wise proof (L04 on the board). The $(i,j)$ element of $AXB$ is
$$(AXB)_{ij} = \sum_k \sum_l A_{ik}\, X_{kl}\, B_{lj}.$$
The $A_{ik}$ and $B_{lj}$ are constants; only $X_{kl}$ is random. So
$$\mathbb{E}\Big[\sum_k \sum_l A_{ik}\, X_{kl}\, B_{lj}\Big] = \sum_k \sum_l A_{ik}\, \mathbb{E}[X_{kl}]\, B_{lj},$$
which is exactly the $(i,j)$ element of $A\,\mathbb{E}[X]\,B$. ∎
Univariate analogue: $\mathbb{E}[aXb] = a\,\mathbb{E}[X]\,b$ for scalar constants $a$ and $b$.
2.4 Covariance matrix
For each pair $(j, k)$:
$$\mathrm{Cov}(X_j, X_k) = \mathbb{E}\big[(X_j - \mu_j)(X_k - \mu_k)\big]$$
When $j = k$: $\mathrm{Cov}(X_j, X_j) = \mathrm{Var}(X_j)$ (the prof flagged this as a quiz-style fact: “what is $\mathrm{Cov}(X, X)$?” → the variance).
Sign reading: high positive covariance → variables vary together; negative → they vary opposite; near zero → no linear co-variation (this is the prof’s most-emphasized framing).
Stack into the covariance matrix:
$$\Sigma = \mathrm{Cov}(X) = \mathbb{E}\big[(X - \mu)(X - \mu)^\top\big], \qquad \Sigma_{jk} = \mathrm{Cov}(X_j, X_k)$$
Shortcut identity (slides + L04):
$$\Sigma = \mathbb{E}[X X^\top] - \mu\,\mu^\top$$
$\Sigma$ is symmetric (by construction) and positive semi-definite: for any constant vector $a$,
$$a^\top \Sigma\, a = \mathrm{Var}(a^\top X) \ge 0.$$
A variance can never be negative, so $\Sigma$ is PSD by the variance interpretation. If $\det \Sigma = 0$ then $\Sigma$ is singular, which means some linear combination $a^\top X$ has zero variance, i.e. is deterministic given the others. In that case the multivariate normal density is not defined (division by $|\Sigma|^{1/2} = 0$).
Key conceptual point (L04, verbatim flavor): “Covariance is really getting at this notion of a slope … we’re sort of assuming a linear line.” Zero covariance does not imply independence in general — it only means no linear co-variation. The exception is joint normality, where zero covariance does imply independence (see §3).
2.5 Correlation matrix
Rescale by the standard-deviation diagonal:
$$\rho_{jk} = \mathrm{Corr}(X_j, X_k) = \frac{\mathrm{Cov}(X_j, X_k)}{\sqrt{\mathrm{Var}(X_j)\,\mathrm{Var}(X_k)}}$$
In matrix form, with $D = \mathrm{diag}\big(\sqrt{\Sigma_{11}}, \dots, \sqrt{\Sigma_{pp}}\big)$ the diagonal matrix of standard deviations:
$$R = D^{-1}\, \Sigma\, D^{-1}$$
The correlation matrix has all 1’s on the diagonal and Pearson correlations off-diagonal. The hand-calculation trap (CE1 1f): given $\Sigma$, the correlation is $\rho_{jk} = \Sigma_{jk} / \sqrt{\Sigma_{jj}\,\Sigma_{kk}}$. Distractors: $\Sigma_{jk} / (\Sigma_{jj}\,\Sigma_{kk})$ (forgot the square root), $\Sigma_{jk} / \Sigma_{jj}$ (used only $\mathrm{Var}(X_j)$), $\Sigma_{jk} / \Sigma_{kk}$ (used only $\mathrm{Var}(X_k)$). Take the square root of the product of variances.
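A quick numeric sketch (the covariance matrix is invented for illustration) of $R = D^{-1}\Sigma D^{-1}$, cross-checked against the direct pairwise formula — note the square root of the *product* of variances:

```python
import numpy as np

# an assumed 3x3 covariance matrix, symmetric and positive definite
Sigma = np.array([[ 4.0, 1.2, -0.8],
                  [ 1.2, 9.0,  0.6],
                  [-0.8, 0.6,  1.0]])

D_inv = np.diag(1 / np.sqrt(np.diag(Sigma)))   # inverse std-dev diagonal
R = D_inv @ Sigma @ D_inv                      # correlation matrix

# pairwise check for (j, k) = (0, 1): divide by sqrt of the PRODUCT of variances
rho_01 = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(R[0, 1], rho_01)     # both 1.2 / sqrt(4 * 9) = 0.2
print(np.diag(R))          # all ones on the diagonal
```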
2.6 Linear-transformation identities
Let $Y = AX$ where $A$ is a constant matrix. Then:
$$\mathbb{E}[Y] = A\,\mathbb{E}[X] = A\mu, \qquad \mathrm{Cov}(Y) = \mathrm{Cov}(AX) = A\,\Sigma\,A^\top$$
These two identities are the working engine of the rest of the course. They give:
- The OLS sampling distribution: $\mathrm{Cov}(\hat\beta) = \sigma^2 (X^\top X)^{-1}$ (module 3).
- The LDA discriminant linearity and the QDA quadratic term (module 4).
- The variance of any contrast / hypothesis test on $\hat\beta$ (module 3).
- PCA loadings as linear combinations whose variance you can read off (module 10).
Pitfall: order matters. It is $A\,\Sigma\,A^\top$, not $A^\top \Sigma\, A$. The transpose goes on the right.
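A Monte Carlo sanity check of the two identities (the matrix $A$, the mean, and the covariance are placeholders, not from the slides); note where the transpose lands:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, -1.0, 0.0],        # an arbitrary 2x3 constant matrix
              [0.5,  0.5, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of X
Y = X @ A.T                                            # each row is A x

print(np.allclose(Y.mean(axis=0), A @ mu, atol=0.02))                    # E[AX]  ~ A mu
print(np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T, atol=0.02))  # Cov(AX) ~ A Sigma A^T
```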
3. Contrasts (linear combinations)
[L04, contrasts; NOT in ISLP ch. 2 — ISLP discusses contrasts only in the dummy-coding sense in §3.3.1]
A contrast is any linear combination $a^\top X$ of the components of $X$. The prof uses “contrast” loosely (not in the strict statistical sense of “coefficients sum to zero” — e.g. a plain sum of components counts as a contrast here).
Given $X$ with $\mathbb{E}[X] = \mu$ and $\mathrm{Cov}(X) = \Sigma$, and a contrast matrix $A$ (one contrast per row):
Apply §2.6:
$$\mathbb{E}[AX] = A\mu, \qquad \mathrm{Cov}(AX) = A\,\Sigma\,A^\top$$
3.1 The cork worked example (L04 / slides)
Running dataset: $X = (X_N, X_E, X_S, X_W)^\top$ — cork weight in 4 directions (N, E, S, W) on a tree, $n = 28$ trees, Rao (1948). The three contrasts of interest compare cork weight across directions; stacking their coefficient vectors row-wise gives the contrast matrix $A \in \mathbb{R}^{3 \times 4}$.
Then $\mathrm{Cov}(AX) = A\,\Sigma\,A^\top$ is $3 \times 3$, and the variance of the $i$-th contrast is the $(i, i)$ entry of $A\,\Sigma\,A^\top$.
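A sketch of the mechanics with a placeholder contrast matrix (the three rows — $N - S$, $E - W$, $N + S - E - W$ — are illustrative directional contrasts, not necessarily the ones on the slides), assuming a data array `cork` with columns ordered N, E, S, W:

```python
import numpy as np

# placeholder contrasts: N - S, E - W, N + S - E - W; columns ordered (N, E, S, W)
A = np.array([[1,  0, -1,  0],
              [0,  1,  0, -1],
              [1, -1,  1, -1]], dtype=float)

def contrast_summary(cork: np.ndarray):
    """cork: (n_trees, 4) array of cork weights in the order N, E, S, W."""
    mu_hat = cork.mean(axis=0)               # sample mean vector
    Sigma_hat = np.cov(cork, rowvar=False)   # sample covariance (4 x 4)
    return A @ mu_hat, A @ Sigma_hat @ A.T   # plug-in E[AX] and Cov(AX)

# the variances of the individual contrasts sit on the diagonal of A Sigma A^T:
# mean_c, cov_c = contrast_summary(cork); np.diag(cov_c)
```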
3.2 Why this matters downstream
Each item below is exam-relevant; the covariance-of-linear-transformation identity is the bridge:
- Hypothesis tests on regression coefficients (M3): testing $\beta_j = 0$ or any linear combination $c^\top \beta = 0$ uses $\mathrm{Var}(c^\top \hat\beta) = c^\top\, \mathrm{Cov}(\hat\beta)\, c$.
- LDA / QDA discriminants (M4): the discriminant is linear in $x$ for LDA precisely because it is a contrast (linear combination) of $x$.
- PCA loadings (M10): each principal component is a contrast on the standardized predictors; the loadings are the contrast coefficients.
- Categorical encoding with $K$ levels (M3): the $K - 1$ dummy variables define contrasts against the reference level.
4. Multivariate normal distribution
[L04 / L05, multivariate-normal; NOT in ISLP ch. 2 — ISLP introduces MVN only in §4.4.2 inside the LDA discussion]
The prof introduced the MVN density in L04 as the generalization of the univariate normal, with the explicit motivation that minimizing the negative log-likelihood of a normal model with mean parameterized by $x^\top\beta$ is linear regression — and the multivariate case is the route into multiple regression. Three downstream uses: joint distribution of a random vector (M2), sampling distribution of $\hat\beta$ (M3), class-conditional density in LDA / QDA / Naive Bayes (M4).
4.1 The density
For $X \in \mathbb{R}^p$ with mean vector $\mu$ and positive-definite covariance matrix $\Sigma$:
$$f(x) = \frac{1}{(2\pi)^{p/2}\, |\Sigma|^{1/2}} \exp\!\Big(-\tfrac{1}{2}\,(x - \mu)^\top \Sigma^{-1} (x - \mu)\Big)$$
Notation: $X \sim \mathcal{N}_p(\mu, \Sigma)$.
Mapping from the univariate case $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\big(-\frac{(x - \mu)^2}{2\sigma^2}\big)$:

| Univariate piece | Multivariate piece |
|---|---|
| $\mu$ (scalar) | $\mu$ (vector) |
| $(x - \mu)^2 / \sigma^2$ | $(x - \mu)^\top \Sigma^{-1} (x - \mu)$ (Mahalanobis distance) |
| $\sigma$ in the normalizer | $\lvert\Sigma\rvert^{1/2}$ in the normalizer |
| $1/\sigma^2$ in the exponent | $\Sigma^{-1}$ (the precision matrix) |

When $p = 1$ the density reduces to the univariate Gaussian.
The exponent contains $d^2(x, \mu) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$, the squared Mahalanobis distance — Euclidean distance after rotating and rescaling by $\Sigma^{-1/2}$ (whitening). The natural distance under joint normality.
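A sketch checking the density formula by hand against `scipy.stats.multivariate_normal` (the numbers are made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([1.0, 0.5])

p = len(mu)
diff = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff           # squared Mahalanobis distance
dens = np.exp(-0.5 * maha2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

print(dens)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```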
4.2 Singular Σ
If $|\Sigma| = 0$ (i.e. $\Sigma$ is singular), the density is not defined (you would divide by zero). This is the same pathology as collinearity in OLS ($X^\top X$ singular) and as the failure mode of LDA when class-covariance estimates are singular. Practical fix: drop a redundant variable, regularize, or reduce dimensions via PCA.
4.3 Four useful properties (slides + L04)
Let $X \sim \mathcal{N}_p(\mu, \Sigma)$.
- Contours are ellipsoids. Level sets are ellipsoids centered at $\mu$, oriented and stretched by the eigenvectors and eigenvalues of $\Sigma$.
- Linear combinations are normal. $AX + b \sim \mathcal{N}\big(A\mu + b,\; A\,\Sigma\,A^\top\big)$ for any constant matrix $A$ and vector $b$. (Bridge to contrasts.)
- Marginals are normal. Any subset of components is multivariate normal in its own right.
- Zero covariance implies independence — under joint normality only. $\mathrm{Cov}(X_j, X_k) = 0 \Rightarrow X_j \perp X_k$ when $X$ is jointly normal. Crucially false in general for non-Gaussian distributions.
The prof emphasizes property 4 as the headline reason MVN is exam-bait — it’s the one place where the zero-cov-→-independence shortcut is valid.
4.4 Quantile / probability statement from the ellipsoid
If $X \sim \mathcal{N}_p(\mu, \Sigma)$, then $(X - \mu)^\top \Sigma^{-1} (X - \mu) \sim \chi^2_p$. So the ellipsoid
$$\big\{\, x : (x - \mu)^\top \Sigma^{-1} (x - \mu) \le \chi^2_{p,\,1-\alpha} \,\big\}$$
(with $\chi^2_{p,\,1-\alpha}$ the $(1-\alpha)$-quantile of the $\chi^2_p$ distribution) has probability $1 - \alpha$ — useful for confidence ellipsoids and the LDA contour pictures in M4.
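A Monte Carlo check (illustrative parameters) that the ellipsoid covers roughly $1 - \alpha$ of the probability mass:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0, 0.0])
Sigma = np.array([[1.0, 0.4, 0.0],
                  [0.4, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
p, alpha = len(mu), 0.05

X = rng.multivariate_normal(mu, Sigma, size=100_000)
diff = X - mu
# squared Mahalanobis distance for every draw
maha2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)

cutoff = chi2.ppf(1 - alpha, df=p)          # chi-square quantile with p degrees of freedom
print((maha2 <= cutoff).mean())             # ~ 0.95
```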
4.5 Constructing independent standard normals from $X$
Slide quiz answer (5C):
$$Z = \Sigma^{-1/2}\,(X - \mu) \sim \mathcal{N}_p(0, I_p)$$
This is the whitening transform — subtract the mean, rescale by the inverse-square-root of the covariance, get back to standard MVN.
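A sketch of the whitening transform, with $\Sigma^{-1/2}$ built from the eigendecomposition (one of several valid matrix square roots; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, -1.0])
Sigma = np.array([[3.0, 1.1],
                  [1.1, 2.0]])

# symmetric inverse square root via the eigendecomposition of Sigma
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

X = rng.multivariate_normal(mu, Sigma, size=100_000)
Z = (X - mu) @ Sigma_inv_half.T              # Z = Sigma^{-1/2} (X - mu), applied row-wise

print(np.round(Z.mean(axis=0), 3))           # ~ (0, 0)
print(np.round(np.cov(Z, rowvar=False), 3))  # ~ identity
```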
4.6 Reading 2D contour plots — Σ-to-shape map
For $X \sim \mathcal{N}_2(\mu, \Sigma)$ with $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$:

| Σ pattern | Contour appearance |
|---|---|
| $\sigma_1^2 = \sigma_2^2$, $\sigma_{12} = 0$ | Circle, centered at $\mu$ |
| $\sigma_1^2 \neq \sigma_2^2$, $\sigma_{12} = 0$ | Axis-aligned ellipse, long axis along the larger-variance direction |
| $\sigma_{12} > 0$ | Ellipse tilted upward-right (positive diagonal) |
| $\sigma_{12} < 0$ | Ellipse tilted upward-left (negative diagonal) |
| $\sigma_1^2 \neq \sigma_2^2$, $\sigma_{12} \neq 0$ | Tilted ellipse, long axis closer to whichever variance is larger |
This is the CE1 1g question template.
4.7 The connection to regression (L04 verbatim setup)
Why this density is module 2’s destination: minimizing the negative log-likelihood of $y$ under a Gaussian model with mean $x^\top\beta$ is least-squares regression. The multivariate normal is the joint-distribution view that gives the conditional view of regression in module 3.
If $(X, Y)$ is jointly MVN, then the conditional $Y \mid X = x$ is normal with mean linear in $x$ and constant variance — the linear regression model is the exact conditional structure of a joint MVN. This is the bridge into M3.
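A sketch of that bridge (all numbers illustrative): the conditional-mean coefficients $\Sigma_{XX}^{-1}\Sigma_{XY}$ of the joint MVN should match what OLS recovers on data simulated from the joint:

```python
import numpy as np

rng = np.random.default_rng(4)
# joint distribution over (X1, X2, Y); the last row/column belongs to Y
mu = np.array([0.0, 0.0, 1.0])
Sigma = np.array([[1.0, 0.3, 0.6],
                  [0.3, 1.0, 0.4],
                  [0.6, 0.4, 1.5]])

# population conditional: E[Y | X=x] = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
Sigma_XX, Sigma_XY = Sigma[:2, :2], Sigma[:2, 2]
beta_pop = np.linalg.solve(Sigma_XX, Sigma_XY)       # slope vector
alpha_pop = mu[2] - beta_pop @ mu[:2]                # intercept

# OLS on a large sample from the joint should recover the same coefficients
data = rng.multivariate_normal(mu, Sigma, size=200_000)
X, y = data[:, :2], data[:, 2]
design = np.column_stack([np.ones(len(y)), X])
coef_ols, *_ = np.linalg.lstsq(design, y, rcond=None)

print(alpha_pop, beta_pop)
print(coef_ols)        # intercept and slopes, close to the population values
```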
5. KNN regression — the averaging formula
[L03 (introduced) / L10 (used), knn-regression; ISLP ch. 2 develops KNN for classification only; KNN regression’s formula appears in ISLP §3.5]
ISLP §2.2.3 introduces KNN for classification as the running example. The regression variant is in §3.5, not §2 — so the formula is delta for the mapped chapter:
$$\hat f(x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_K(x_0)} y_i$$
where $\mathcal{N}_K(x_0)$ is the index set of the $K$ training points closest to $x_0$ in Euclidean distance.
Same flexibility-knob story as classification: small $K$ → wiggly fit, low bias, high variance; $K = n$ → constant horizontal line at $\bar y$ (pure underfit). Optimal $K$ is intermediate, chosen by cross-validation.
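A minimal sketch implementing the averaging formula directly, cross-checked against `sklearn.neighbors.KNeighborsRegressor` (toy data, assumed):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
x_train = rng.uniform(0, 10, size=(50, 1))
y_train = np.sin(x_train[:, 0]) + rng.normal(0, 0.3, 50)
x0, K = np.array([[4.2]]), 5

# the formula: average the y's of the K nearest training points in Euclidean distance
dist = np.abs(x_train[:, 0] - x0[0, 0])
neighbors = np.argsort(dist)[:K]
print(y_train[neighbors].mean())

# same thing via scikit-learn
knn = KNeighborsRegressor(n_neighbors=K).fit(x_train, y_train)
print(knn.predict(x0))       # matches the hand-computed average
```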
6. Over-parameterized regime — pseudoinverse, double descent, benign overfitting
[L04, double-descent; NOT in ISLP ch. 2 — ISLP discusses double descent only in §10.8 (deep-learning chapter)]
Benjamin ran this L04 digression with his own simulations because the textbook example only goes up to degree 8 or 9 and he wanted to enter “this ridiculous region of like a degree 50,000 or 100,000.” None of this is in ISLP ch. 2.
6.1 The setup and the phenomenon
Truth: a step function (deliberately a poor fit for any polynomial). Sample $n$ noisy points; fit polynomials of increasing degree $d$ by minimum-norm least squares (pseudoinverse). The test MSE vs. degree curve has three regimes:
- Classical regime ($d < n$): test MSE traces the familiar U-shape, with its minimum at a moderate degree.
- Interpolation peak at $d \approx n$: test MSE explodes. Same number of parameters as data points; noise blows up.
- Second descent ($d \gg n$): test MSE drops again, often below the classical minimum.
This double-descent curve is the prof’s hobbyhorse and returns in L11, L13, L24, L26. The decomposition still adds up: bias² + variance + $\sigma^2$ = expected test MSE at every degree. “It doesn’t break any of the math. It doesn’t break any of the statistics.”
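A scaled-down sketch of the degree sweep (not the prof’s simulation: Legendre features for numerical stability, degrees up to 100 rather than tens of thousands, and an assumed step-function truth), fit by minimum-norm least squares via the pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.where(x < 0, -1.0, 1.0)      # assumed step-function truth
n, sigma = 20, 0.3

x_tr = rng.uniform(-1, 1, n)
y_tr = f(x_tr) + rng.normal(0, sigma, n)
x_te = rng.uniform(-1, 1, 2000)
y_te = f(x_te) + rng.normal(0, sigma, 2000)

for d in [2, 5, 10, 19, 21, 40, 100]:         # below, at, and past the interpolation point
    B_tr = np.polynomial.legendre.legvander(x_tr, d)   # n x (d+1) feature matrix
    beta = np.linalg.pinv(B_tr) @ y_tr                 # minimum-norm least squares
    B_te = np.polynomial.legendre.legvander(x_te, d)
    train_mse = np.mean((y_tr - B_tr @ beta) ** 2)
    test_mse = np.mean((y_te - B_te @ beta) ** 2)
    print(f"degree {d:>3}: train {train_mse:.3f}  test {test_mse:.3f}  "
          f"||beta|| {np.linalg.norm(beta):.1f}")
```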
6.2 The optimization changes character across the interpolation point
Classical regime ($d < n$; regularized version): minimize a fit+penalty objective
$$\min_{\beta}\; \|\mathbf y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
The fit term is non-zero at the optimum.
Over-parameterized regime ($d > n$, post-interpolation): minimize the L2 norm subject to exactly fitting every training point
$$\min_{\beta}\; \|\beta\|_2^2 \quad \text{subject to} \quad X\beta = \mathbf y$$
The data-fit term has become a hard constraint; the L2 penalty is the only objective. This is the minimum-norm interpolator — among the infinitely many $\beta$’s that achieve zero training error, the one with the smallest $\|\beta\|_2$. The Moore–Penrose pseudoinverse and mini-batch SGD both converge to this solution.
6.3 The minimum-norm interpolator via the pseudoinverse
When $p > n$ and $X$ has full row rank, the unique solution to the constrained problem above is
$$\hat{\boldsymbol\beta} = X^{+}\, \mathbf y = X^\top (X X^\top)^{-1}\, \mathbf y.$$
Here $X^{+}$ is the Moore–Penrose pseudoinverse of the design matrix in the wide case ($p > n$). (In the classical case $p < n$, the pseudoinverse coincides with $(X^\top X)^{-1} X^\top$.) The pseudoinverse mechanics themselves are out of scope for the course; the only fact the prof needs you to know is that it selects the minimum-norm zero-training-error solution.
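A small check (random data, assumed) that the pseudoinverse solution interpolates the training points and has the smallest norm among interpolators:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 10, 50                                  # wide case: more parameters than points
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_min = np.linalg.pinv(X) @ y               # Moore-Penrose solution
beta_alt = X.T @ np.linalg.inv(X @ X.T) @ y    # explicit formula for full row rank
print(np.allclose(beta_min, beta_alt))         # same solution
print(np.allclose(X @ beta_min, y))            # zero training error (interpolates)

# any other interpolator = beta_min + (a vector in the null space of X); its norm is larger
z = rng.normal(size=p)
beta_other = beta_min + z - np.linalg.pinv(X) @ (X @ z)   # add the projection of z onto null(X)
print(np.allclose(X @ beta_other, y),
      np.linalg.norm(beta_other) >= np.linalg.norm(beta_min))
```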
6.4 Why the second descent happens — the prof’s explanation
Past the interpolation point, the model class contains infinitely many zero-training-error solutions; the training loss can’t distinguish them. The optimization picks among them via a secondary criterion — implicit L2-norm minimization — whether via the pseudoinverse or via mini-batch SGD. The minimum-norm solution has low variance because small coefficients mean small wobble under resampling. So past the interpolation point, you’re effectively running ridge regression with an implicit $\lambda$ chosen by the geometry of the optimization.
This is the “benign” in benign overfitting: zero training error and good generalization simultaneously, because the implicit regularization controls variance.
6.5 When over-parameterization wins (L04 explicit)
The over-parameterized win requires the truth not to be in the assumed function class. Prof’s two illustrations:
- Truth a step function (NOT a polynomial), fit polynomials → the high-degree regime beats the low-degree minimum.
- Truth a degree-2 polynomial (IS a polynomial), fit polynomials → the degree-2 fit recovers the truth perfectly; the high-degree regime cannot improve on it.
“If the underlying model … exists as part of the functions that you’re assuming in your model, then even if you increasingly add more and more degrees of flexibility, you’re not going to improve over what you can get with a few parameters … But in the real world, I don’t know how often you really can assume that you have the right model.” — L04
6.6 What it implies for the bias-variance picture
The decomposition still holds exactly:
$$\text{expected test MSE at } x_0 \;=\; \mathrm{Bias}\big(\hat f(x_0)\big)^2 + \mathrm{Var}\big(\hat f(x_0)\big) + \sigma^2$$
What changes is the profile of each term as the degree grows:
- Variance shoots up at $d \approx n$ (the design matrix is ill-conditioned there).
- Past $d \approx n$, variance drops back down because the implicit norm-minimization is a variance-control device.
- Bias² stays low (or grows slowly, visible only on a log scale per L04).
- Sum traces the double-descent shape.
This is the canonical answer to “why is the prof critical of the word ‘trade-off’?”: (i) regularization can flatten the variance curve without paying bias cost, (ii) double descent shows you can have low bias and low variance and zero training error simultaneously.
6.7 Slide-deck caveats
- Double descent is achievable mostly in high signal-to-noise problems (image classification, language modelling).
- Most statistical learning methods covered in this course do not exhibit double descent — trees, GAMs, explicit-regularization regression don’t go past the interpolation point in the relevant sense.
- “Though double descent can sometimes occur in neural networks, we typically do not want to rely on this behavior” — slide, qualified by the prof: “depends on what you’re trying to do.”
Notation and naming differences
These are points where the prof’s wording or framing differs meaningfully from ISLP ch. 2. None of them is a new artifact — they are just naming conventions to be aware of.
- “Decomposition” vs “trade-off.” Benjamin calls eq. 2.7 the bias-variance decomposition and is explicit that he dislikes “trade-off” because it implies a forced exchange. ISLP §2.2.2 uses “trade-off” throughout. Treat them as the same equation; the framing critique is the prof’s contribution.
- “Fit data” for “training data.” Benjamin uses these interchangeably; ISLP uses only “training data.”
- Data-matrix convention. Benjamin’s lecture board uses columns = individuals/samples, rows = variables. He flags that ISLP uses the opposite convention (rows = observations, columns = variables) and says it doesn’t matter for the math as long as you’re consistent.
- “Independent variables” — avoided. Benjamin specifically warns against the term “independent variables” for predictors (they’re rarely independent of each other). Preferred terms: predictors, regressors, covariates, features, variables. ISLP uses “predictors / independent variables” interchangeably (§2.1).
- $K$ in KNN vs. $K$ in classification. The $K$ in $K$-nearest neighbors (number of neighbors) is not the same as the $K$ in “$K$ classes” (number of categories). Both ISLP and the prof use $K$ for both; Benjamin flags the collision explicitly in L07.
- “Contrast” — loose vs strict. Benjamin uses “contrast” for any linear combination of components (so a plain sum counts). The strict statistical definition requires the coefficients to sum to zero. ISLP ch. 2 doesn’t use the term; ISLP §3.3.1 uses it in the strict dummy-coding sense.
- Bias term in linear regression. Benjamin calls the intercept $\beta_0$ the “bias term / intercept” — note that “bias” here is the ML usage (the constant offset of a linear unit) and is unrelated to the statistical bias of an estimator that appears in the bias-variance decomposition. Same word, two meanings.
- “Sigma” overload. $\sigma^2$ is the noise variance in the regression model and the variance of a single component in the random-vector context. Context disambiguates. $\Sigma$ (capital) is always the covariance matrix.