Module 06: Model Selection and Regularization — Book delta
ISLP chapter 6 covers the same skeleton the prof teaches (subset selection, forward/backward stepwise, ridge, lasso, PCR, PLS, high-dim), but a handful of concrete, lookup-able artifacts that Benjamin used at the board or that he made central to his framing are absent from chapter 6. The book’s ridge derivation is verbal — it does not give the matrix closed form; the PCA-as-eigendecomposition machinery ISLP defers to ch. 12; and the per-PC ridge-shrinkage factor that the prof read off the slide is footnoted to ESL §3.5 rather than reproduced. This file collects everything in that gap, plus the prof’s matrix-form / closed-form / variance machinery that the book leaves implicit.
Anything only in the book and not reached by lectures/slides/exercises is out per scope and is not in this file. Anything covered by ISLP ch. 6 in clean form (e.g. eqs 6.5, 6.7, 6.14, 6.15, lasso constraint geometry, the bias-variance figure, soft-thresholding for the orthonormal case) is not reproduced here — go look it up.
1. Ridge regression: matrix closed form and what ISLP doesn’t give
[L12, L13, concept: ridge-regression; ISLP §6.2.1]
ISLP §6.2.1 presents ridge as the minimizer of (6.5), $\mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$, then immediately discusses the bias-variance trade-off (Fig 6.5) and standardization (eq 6.6). It does not give a matrix-form solution. The prof gestured at it but didn’t write it on the board (“the math is just as easy as before: take derivatives, get a closed form” — L12, L13). Here it is in full, since this is the artifact most cited in the wiki and not lookup-able in ch. 6.
1.1 Setup (centered/standardized form)
Assume $X$ has been centered (column means subtracted) and standardized (eq 6.6: divide each column by its sample standard deviation $s_j$). Then we can drop the intercept from the penalty: $\hat{\beta}_0 = \bar{y}$ regardless of $\lambda$ (ISLP states this in the paragraph after eq 6.5; the centered form is its consequence).
In matrix form (with $y$ also centered, so the intercept drops out entirely), the ridge objective is
$$\min_{\beta}\; (y - X\beta)^\top (y - X\beta) + \lambda\,\beta^\top \beta.$$
1.2 Derivation
Differentiate w.r.t. $\beta$ and set to zero:
$$-2X^\top(y - X\beta) + 2\lambda\beta = 0.$$
Hence
$$\hat{\beta}^{R}_{\lambda} = (X^\top X + \lambda I)^{-1} X^\top y.$$
The intercept is recovered as $\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{p}\hat{\beta}_j\bar{x}_j$ (or simply $\hat{\beta}_0 = \bar{y}$ if $X$ was centered).
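A quick sanity check (not from the slides; just the $p = 1$ special case of the closed form above, with a single centered predictor $x$):
$$\hat{\beta}^{R}_{\lambda} = \frac{x^\top y}{x^\top x + \lambda} = \frac{x^\top x}{x^\top x + \lambda}\,\hat{\beta}^{\mathrm{OLS}},$$
so ridge multiplies the OLS slope by a factor in $(0, 1)$ that shrinks toward $0$ as $\lambda$ grows, the same shape as the per-PC shrinkage factor in §3.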
1.3 Why this matters (and why ISLP avoids it)
- Unique solution for any $\lambda > 0$: $X^\top X$ is singular when $p > n$, but $X^\top X + \lambda I$ has eigenvalues bounded below by $\lambda$ and is invertible for any $\lambda > 0$. This is exactly the prof’s “ridge stays unique” claim from L13: “If you have too many parameters, you can add a regularizer to keep the model finding a unique solution. It maintains the model as convex, meaning it will still have a unique solution for a given value of $\lambda$.” ISLP §6.2.1 states this verbally (last paragraph p. 256 in the printed edition) but never writes the matrix $X^\top X + \lambda I$ that makes it manifest.
- Recovering OLS: as $\lambda \to 0$, $\hat{\beta}^{R}_{\lambda} \to (X^\top X)^{-1}X^\top y = \hat{\beta}^{\mathrm{OLS}}$ when $X^\top X$ is non-singular.
- Recovering the null model: as $\lambda \to \infty$, $\hat{\beta}^{R}_{\lambda} \to 0$.
1.4 Sampling distribution under the linear model
Assuming the standard linear-model setup $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$:
Let $A_\lambda = (X^\top X + \lambda I)^{-1}$, so $\hat{\beta}^{R}_{\lambda} = A_\lambda X^\top y$.
- Expectation (bias).
$$\mathrm{E}\big[\hat{\beta}^{R}_{\lambda}\big] = A_\lambda X^\top X\,\beta.$$
This is not $\beta$ unless $\lambda = 0$, so ridge is biased for $\lambda > 0$. The bias is
$$\mathrm{E}\big[\hat{\beta}^{R}_{\lambda}\big] - \beta = (A_\lambda X^\top X - I)\beta = -\lambda A_\lambda\beta,$$
i.e. ridge shrinks $\beta$ toward zero in the metric defined by $A_\lambda = (X^\top X + \lambda I)^{-1}$.
- Variance.
$$\mathrm{Var}\big(\hat{\beta}^{R}_{\lambda}\big) = \sigma^2 A_\lambda X^\top X A_\lambda.$$
In the limit $\lambda \to 0$ this reduces to the OLS variance $\sigma^2(X^\top X)^{-1}$; for $\lambda > 0$ the variance is strictly smaller in the Loewner order. This is the algebraic instance of the prof’s “variance can shrink to about half of the OLS variance for the cost of a small bias bump” claim from L13.
- Distribution. Linear combination of Gaussians ⇒
$$\hat{\beta}^{R}_{\lambda} \sim N\big(A_\lambda X^\top X\,\beta,\;\; \sigma^2 A_\lambda X^\top X A_\lambda\big).$$
ISLP §6.2.1 plots bias and variance (Fig 6.5) but does not give these closed-form expressions. They are useful at the exam table for any “show that ridge is biased” / “write down the ridge variance” question.
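A consequence worth having in one place (not on the slide; uses the SVD notation of §3.2, $X = UDV^\top$ with rotated coefficients $\beta' = V^\top\beta$): the total coefficient mean squared error $\mathrm{E}\|\hat{\beta}^{R}_{\lambda} - \beta\|^2$ splits per principal direction into squared bias plus variance,
$$\mathrm{E}\big\|\hat{\beta}^{R}_{\lambda} - \beta\big\|^2 = \sum_{j=1}^{p}\frac{\lambda^2\,\beta_j'^{\,2}}{(d_j^2 + \lambda)^2} \;+\; \sum_{j=1}^{p}\frac{\sigma^2 d_j^2}{(d_j^2 + \lambda)^2}.$$
At $\lambda = 0$ the bias term vanishes and the variance term is $\sum_j \sigma^2/d_j^2$, which explodes when some $d_j$ are tiny; any $\lambda > 0$ caps those terms, which is the algebra behind the “small bias bump buys a big variance drop” claim in §6.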
2. The ridge hat matrix and effective degrees of freedom
[L13, L15, concepts: ridge-regression, principal-component-regression]
ISLP §6.4 (Fig 6.24) uses the phrase “degrees of freedom” for the lasso (number of non-zero coefficients) but never defines ridge’s effective degrees of freedom. The slide deck and the prof’s “ridge shrinks the small-eigenvalue directions” framing rely on this object, so reproduce it here.
2.1 Ridge hat matrix
Predictions are linear in $y$:
$$\hat{y} = X\hat{\beta}^{R}_{\lambda} = X(X^\top X + \lambda I)^{-1}X^\top y = H_\lambda\,y,$$
where
$$H_\lambda = X(X^\top X + \lambda I)^{-1}X^\top.$$
$H_\lambda$ is symmetric, and idempotent only at $\lambda = 0$ (the OLS hat matrix $H = X(X^\top X)^{-1}X^\top$); for $\lambda > 0$ it is symmetric but not idempotent — it is a smoother, not a projection.
2.2 Effective degrees of freedom
The effective degrees of freedom of a linear smoother $\hat{y} = Sy$ are $\mathrm{df} = \mathrm{tr}(S)$. For ridge:
$$\mathrm{df}(\lambda) = \mathrm{tr}(H_\lambda) = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda},$$
where $d_1, \dots, d_p$ are the singular values of $X$ (so $d_j^2$ are the eigenvalues of $X^\top X$).
Two limits:
- $\lambda = 0$ (or $\lambda \to 0$): $\mathrm{df}(0) = p$ — plain OLS spends one effective parameter per predictor.
- $\lambda \to \infty$: $\mathrm{df}(\lambda) \to 0$ — full shrinkage, zero effective parameters.
So $\mathrm{df}(\lambda)$ slides continuously from $p$ to $0$, the prof’s “reduce the number of effective parameters” framing on the slide (and in L12: “it tries to reduce the number of parameters effectively — effective parameters — because if the beta is zero, then essentially that parameter is not in there.”).
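An illustrative number (invented for concreteness, not from the slides): take $p = 10$ standardized predictors whose squared singular values all equal $d_j^2 = 1$, and $\lambda = 1$. Then
$$\mathrm{df}(1) = \sum_{j=1}^{10}\frac{1}{1 + 1} = 5,$$
so the ridge fit spends roughly half an effective parameter per predictor even though all ten coefficients are non-zero.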
3. Per-principal-component ridge shrinkage factor
[L15, slide deck “Example: Shrinkage Factor”, concept: ridge-regression]
The slide deck (selection_regularization_presentation_lecture2.md, “PCR can be seen as discretized version of Ridge regression”) explicitly gives the per-PC ridge shrinkage factor as
$$\frac{\lambda_j}{\lambda_j + \lambda},$$
where $\lambda_j$ are the eigenvalues of the (standardized) $X^\top X$ matrix. Note: the slide uses $\lambda_j$ for eigenvalues and $\lambda$ for the ridge tuning parameter, an unfortunate clash. With singular values $d_j$ of $X$, $\lambda_j = d_j^2$, and the slide formula is exactly $d_j^2/(d_j^2 + \lambda)$ — matching §2.2 above term-by-term.
3.1 Why this artifact matters
In the SVD basis (where $X = UDV^\top$), ridge acts on each principal direction independently, with shrinkage factor $d_j^2/(d_j^2 + \lambda)$ on the $j$-th PC direction:
- Large $d_j^2$ (high-variance PC direction) ⇒ factor $\approx 1$ (almost no shrinkage).
- Small $d_j^2$ (low-variance PC direction) ⇒ factor $\approx 0$ (heavy shrinkage).
This is the algebraic content of the prof’s “higher pressure on less important PCs” slide bullet and his “PCR can be seen as a discretized version of ridge regression” claim from L15:
- PCR: discards the smallest-eigenvalue PC directions outright (hard threshold).
- Ridge: shrinks each direction by the continuous factor $d_j^2/(d_j^2 + \lambda)$ — heavier on the smaller-eigenvalue ones (soft threshold).
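An illustrative calculation (numbers invented for concreteness): with eigenvalues $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 0.1$ and ridge tuning parameter $\lambda = 0.5$, the shrinkage factors are
$$\frac{4}{4.5} \approx 0.89, \qquad \frac{1}{1.5} \approx 0.67, \qquad \frac{0.1}{0.6} \approx 0.17,$$
so the leading direction is left nearly intact while the trailing one is crushed. PCR with $M = 2$ would instead apply the hard factors $1, 1, 0$.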
ISLP §6.3.1 says the connection between ridge and PCR exists (“one can even think of ridge regression as a continuous version of PCR!”) and points the reader to ESL §3.5 in footnote 8 — that is, ISLP ch. 6 deliberately does not derive this. The prof’s slide shows the formula directly; this file reproduces it for exam lookup.
3.2 SVD form (the cleanest derivation, light)
Write $X = UDV^\top$ with $U$ ($n \times p$, orthonormal columns), $V$ ($p \times p$, orthogonal), and $D = \mathrm{diag}(d_1, \dots, d_p)$. Then
$$X^\top X + \lambda I = V(D^2 + \lambda I)V^\top,$$
and the fitted values are
$$\hat{y} = X\hat{\beta}^{R}_{\lambda} = UD(D^2 + \lambda I)^{-1}DU^\top y = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^\top y,$$
where $u_j$ is the $j$-th column of $U$. The coefficient on $u_j^\top y$ is exactly the shrinkage factor $d_j^2/(d_j^2 + \lambda)$. (Prof flagged spectral decomposition as out of scope per L04, so this is the lightest form — present for completeness.)
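For contrast (a standard identity, not reproduced on the slide; same SVD notation and centered $y$): PCR keeps or kills each direction outright, which makes the “discretized ridge” slogan literal:
$$\hat{y}_{\mathrm{PCR}(M)} = \sum_{j=1}^{M} u_j\,u_j^\top y \qquad\text{vs.}\qquad \hat{y}_{\mathrm{ridge}(\lambda)} = \sum_{j=1}^{p}\frac{d_j^2}{d_j^2 + \lambda}\,u_j\,u_j^\top y.$$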
4. PCA fraction of variance via eigenvalues
[L14, L15, slide deck, concept: principal-component-regression]
The slide deck contains the formula
$$\text{fraction of variance explained by the first } M \text{ PCs} \;=\; \frac{\sum_{j=1}^{M}\lambda_j}{\sum_{j=1}^{p}\lambda_j},$$
i.e. the fraction of $X$-variance captured by the first $M$ principal components is the cumulative sum of the top-$M$ eigenvalues of the (standardized) $X$-covariance matrix divided by the total sum of eigenvalues.
The prof made this explicit in L15 after a student asked about it the prior day:
“The eigenvalues of the covariance matrix of $X$ are equal to the variances of the corresponding principal components. So fraction of variance explained by the first $M$ PCs $= \sum_{j \le M}\lambda_j \big/ \sum_{j \le p}\lambda_j$. That’s what the explained-variance plot is showing: running normalized cumulative sum of eigenvalues.”
4.1 Why this is a chapter-6 delta
ISLP ch. 6 does not contain this formula. The §6.3.1 treatment of PCA introduces PCs verbally (the green line in Fig 6.14, the loadings in eq 6.19) and refers the eigendecomposition treatment to ch. 12 (unsupervised learning). The “fraction of variance explained” object is defined in ISLP §12.2.3 / eq 12.10 (a Pearson correlation-based formula), not in ch. 6. At the exam table, a question about scree plots / PVE on a module-6 context will land you flipping between ch. 6 and ch. 12; this delta puts the eigenvalue formula in one place.
4.2 The eigenvalue = PC variance identity
Let $S = \tfrac{1}{n}X^\top X$ (after standardization, so each column of $X$ has mean $0$ and variance $1$). The spectral decomposition $S = V\Lambda V^\top$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$ ($\lambda_1 \ge \dots \ge \lambda_p \ge 0$) and $V$ orthogonal gives the loadings: the $j$-th PC is $z_j = Xv_j$ with $v_j$ the $j$-th eigenvector. Then
$$\mathrm{Var}(z_j) = \tfrac{1}{n}\,v_j^\top X^\top X\,v_j = v_j^\top S\,v_j = \lambda_j.$$
So $\lambda_j$ is literally the variance of the $j$-th PC. The total variance in the standardized data is $p$ (each standardized column has variance $1$), and equivalently $\sum_{j=1}^{p}\lambda_j = \mathrm{tr}(S) = p$. The slide formula is the cumulative fraction $\sum_{j \le M}\lambda_j / p$. The prof flagged the chain “eigenvalue → PC variance → cumulative fraction” as the right way to read a scree plot in L15; the algebra above is the one-line reason.
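A worked instance (numbers invented for illustration): with $p = 4$ standardized predictors and eigenvalues $\lambda_1 = 2.2$, $\lambda_2 = 1.0$, $\lambda_3 = 0.5$, $\lambda_4 = 0.3$ (summing to $p = 4$, as they must), the first two PCs explain
$$\frac{2.2 + 1.0}{4} = 0.8,$$
i.e. 80% of the standardized $X$-variance — exactly the number the explained-variance plot would show at $M = 2$.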
(Per scope, spectral-decomposition derivations are deferred to Linear Statistical Models, but the identity eigenvalue = PC variance is in scope as the slide gives the formula and the prof read it aloud.)
5. Elastic net penalized objective
[L13, concept: elastic-net; ISLP §6.2.2 brief mention only]
ISLP §6.2.2 (last paragraphs before §6.2.3) mentions elastic net by name as a hybrid between ridge and lasso but does not write the objective. The prof wrote it on the board in L13:
$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \;+\; \lambda_2\sum_{j=1}^{p}\beta_j^2 \;+\; \lambda_1\sum_{j=1}^{p}|\beta_j|.$$
Two regularization parameters: $\lambda_2$ (L2 strength) and $\lambda_1$ (L1 strength), both tuned by cross-validation over a 2D grid.
5.1 Library reparameterization
The standard library form (glmnet, sklearn) uses a single overall strength $\lambda$ and a mixing parameter $\alpha \in [0, 1]$:
$$\min_{\beta}\;\mathrm{RSS} \;+\; \lambda\Big[(1 - \alpha)\sum_{j=1}^{p}\beta_j^2 \;+\; \alpha\sum_{j=1}^{p}|\beta_j|\Big].$$
Match-up: $\lambda_2 = (1 - \alpha)\lambda$, $\lambda_1 = \alpha\lambda$ (the inverse map is given right after the edge cases). Edge cases:
- $\alpha = 1$ ⇒ pure lasso.
- $\alpha = 0$ ⇒ pure ridge.
- $0 < \alpha < 1$ ⇒ elastic net.
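Going the other way (simple algebra on the match-up above): given the board form’s $(\lambda_1, \lambda_2)$, the library pair is
$$\lambda = \lambda_1 + \lambda_2, \qquad \alpha = \frac{\lambda_1}{\lambda_1 + \lambda_2},$$
which makes the edge cases immediate: $\lambda_1 = 0$ gives $\alpha = 0$ (pure ridge) and $\lambda_2 = 0$ gives $\alpha = 1$ (pure lasso).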
The prof noted “often parameterized slightly differently in libraries, but the idea is the same” in L13. Both forms are reproduced here so neither comes as a surprise at the exam table.
5.2 Constraint-region geometry
The elastic-net constraint region is the rounded diamond $\{\beta : \alpha\sum_j|\beta_j| + (1 - \alpha)\sum_j\beta_j^2 \le s\}$: corners on the axes (sparsity from L1) plus rounded edges (averaging from L2). See ridge-vs-lasso-geometry for the underlying logic; this object is not drawn in ISLP Fig 6.7.
6. Bias-variance decomposition under the linear model (slide-flagged scaling)
[L13, slide deck; supplemental to ISLP §6.2.1 Fig 6.5]
The bias-variance trade-off is discussed in ISLP §2.2.2 (introduction) and visualized for ridge in §6.2.1 Fig 6.5. What ISLP does not explicitly carry into chapter 6 is the prof’s restated algebraic claim that motivates the entire module:
“If you increase the bias a little bit — you can reduce the variance a lot. Because you have the squared term there.” L13
Spelled out: at a fixed prediction point $x_0$, the expected squared prediction error decomposes as
$$\mathrm{E}\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Bias}\big(\hat{f}(x_0)\big)^2 + \mathrm{Var}\big(\hat{f}(x_0)\big) + \sigma^2.$$
The prof’s stylised observation: the bias enters the error only through its square, so bumping the bias from (roughly) zero up to a small $\epsilon$ adds only $\epsilon^2$ to the error. So a small increase in bias buys a comparatively large decrease in variance, provided the variance term was larger to begin with — which it is when $p$ is comparable to $n$.
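Illustrative numbers (invented): suppose some $\lambda$ introduces a bias of $0.1$ while cutting the variance from $0.60$ to $0.30$ (same units). The change in expected squared error is
$$\Delta = \underbrace{0.1^2}_{+0.01} + \underbrace{(0.30 - 0.60)}_{-0.30} = -0.29,$$
a net reduction: the squared-bias penalty is an order of magnitude smaller than the variance saving.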
This is in-scope per the prof’s explicit exam flag (L13):
“I mentioned that this is definitely going to be on the exam. I mean, just this concept. … It’s like an onion with layers.”
ISLP ch. 6 visualizes this with Fig 6.5 but doesn’t repeat the §2.2.2 derivation; the prof rederived it on the board in L13. Treat the derivation as needing the algebra from ISLP §2.2.2 (eq 2.7) — but the link to ridge/lasso shrinkage as the active lever is the module-6 framing the book leaves out.
7. Algorithm counts (for fill-in / true-false)
[L12, slide deck, concept: subset-selection; ISLP §6.1.1, §6.1.2 prose]
ISLP states the counts in prose (“there are $2^p$ models”, “forward stepwise requires fitting only 211 models for $p = 20$”). The prof drilled the formulas on the board. Reproduce them in clean lookup form:
Best subset: $2^p$ models. Stepwise (forward, backward, hybrid): $1 + \sum_{k=0}^{p-1}(p - k) = 1 + \frac{p(p+1)}{2}$ models. For $p = 20$: $2^{20} = 1{,}048{,}576$ vs $1 + \frac{20 \cdot 21}{2} = 211$.
The formula is the same for all three stepwise variants because each performs $p$ rounds whose candidate counts are $p, p-1, \dots, 1$, plus the single fit of the starting model (null for forward, full for backward). ISLP states the result “$1 + p(p+1)/2$” once mid-paragraph (§6.1.2); the summation derivation that the slide shows is what makes this a clean fill-in. Counted as delta because the prof gave both forms on the slide and the summation form is not in ISLP.
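The arithmetic behind the stepwise count, written out (a minimal expansion of the slide’s summation form):
$$1 + \sum_{k=0}^{p-1}(p - k) = 1 + \big(p + (p-1) + \cdots + 1\big) = 1 + \frac{p(p+1)}{2} \;\overset{p=20}{=}\; 1 + 210 = 211.$$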
Hard requirement (slide-flagged, L12):
“Backwards selection requires that the number of samples is larger than the number of parameters.”
Because step 1 fits the full OLS model, which needs $X^\top X$ invertible, which needs $n > p$. Forward stepwise has no such requirement and can be run when $p > n$ by capping the algorithm at the submodels $M_0, \dots, M_{n-1}$ (slide-flagged).
ISLP §6.1.2 states this in prose; the prof drilled it as an exam-pattern fill-in. The combination “forward survives $p > n$, backward doesn’t” is the canonical MC.
8. PLS algorithm (clean form)
[L15, slide deck “Partial Least Squares (Algorithm)”, concept: partial-least-squares]
ISLP §6.3.2 (pp. 286–288) describes PLS in two paragraphs and writes only “$\phi_{j1}$ equal to the coefficient from the simple linear regression of $Y$ onto $X_j$” and the deflation step. The prof gave the same content but the slide labels it as a multi-step algorithm; reproduce it as a numbered algorithm for exam lookup.
After standardizing each $X_j$ and centering $y$:
For the first direction ($m = 1$):
- For each $j = 1, \dots, p$, regress $y$ on $X_j$ alone (simple linear regression). The coefficient is $\phi_{j1} = x_j^\top y / x_j^\top x_j$, which for standardized predictors is proportional to $x_j^\top y$.
- Set $Z_1 = \sum_{j=1}^{p}\phi_{j1}X_j$, so $Z_1$ puts the most weight on the predictors most strongly related to $y$.
For the later directions ($m = 2, \dots, M$):
- Orthogonalize: for each $j$, regress the current predictor on $Z_{m-1}$ and take residuals. Call the residualized predictors $X_j^{(m)}$.
- Compute $Z_m$ from the $X_j^{(m)}$ using exactly the procedure of steps 1–2 (regress $y$ on each $X_j^{(m)}$, use those coefficients as $\phi_{jm}$).
Final fit:
- Least-squares regression of $y$ on $Z_1, \dots, Z_M$. $M$ chosen by cross-validation.
The key contrast with PCR (the only conceptual delta worth memorizing)
| | PCR | PLS |
|---|---|---|
| Objective for $Z_1$ | $\max_{\phi}\ \mathrm{Var}(X\phi)$ s.t. $\phi^\top\phi = 1$ | $\max_{\phi}\ \mathrm{Cov}(X\phi, y)$ s.t. $\phi^\top\phi = 1$ |
| Supervision | Unsupervised (uses $X$ only) | Supervised (uses $X$ and $y$) |
| First-component construction | Top eigenvector of $X^\top X$ | Weights $\phi_{j1} \propto x_j^\top y$ (univariate regression coefficients) |
ISLP §6.3.2 states the contrast; this table reproduces it as a side-by-side lookup with the explicit objective functions, which the book gives only in prose.
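In matrix form (a compact restatement of the table, using the centered/standardized $X$ and $y$ of §8), the two first directions are
$$z_1^{\mathrm{PCR}} = Xv_1 \;\;(\text{with } v_1 \text{ the top eigenvector of } X^\top X), \qquad z_1^{\mathrm{PLS}} \;\propto\; XX^\top y,$$
i.e. PCR’s direction depends only on $X$, while PLS’s weights $\phi_{j1} \propto x_j^\top y$ pull the direction toward predictors correlated with the response.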
9. Slide-flagged scope items the book uses different definitions for
[L15, slide deck; supplemental to ISLP §6.4]
Two artifacts the prof endorsed verbatim from the slide and that the book words differently:
9.1 Multicollinearity, extreme version
The prof read this aloud and endorsed it L15:
Multicollinearity in high dim: “any variable in the model can be written as a linear combination of all of the other variables in the model.” — slide deck
ISLP §6.4.4 states this in prose (p. 293 in the printed edition) — but the prof’s wording is the version that appeared on the slide and is the version Anders should recognize on an exam stem.
9.2 What can and cannot be recovered
Verbatim slide endorsed by the prof:
“We can never know exactly which variables (if any) truly are predictive of the outcome. We can never identify the best coefficients for use in the regression. At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome. We will find one of possibly many suitable predictive models.”
ISLP §6.4.4 says the same thing in different words (“there are likely to be many sets of 17 SNPs that would predict blood pressure just as well as the selected model”). The slide framing is the one the prof committed to.
10. Notation / terminology drift between prof and ISLP
Minor, mostly cosmetic. Listed for safety at the exam table.
- Ridge regression. Prof switches freely between “ridge”, “L2”, and (occasionally) “Tikhonov” (L13: “It’s entirely possible that I will just start calling it L2 one day.”). ISLP uses “ridge regression” exclusively.
- Lasso. Prof: “lasso” or “L1”. ISLP: “lasso” or “the lasso”. Both use $\ell_1$ / $\ell_2$ norm notation interchangeably with “L1” / “L2”.
- Tuning parameter $\lambda$. Same symbol in both, with one internal clash that the slide deck does not resolve: in the PCR/ridge shrinkage-factor formula $\lambda_j/(\lambda_j + \lambda)$, the $\lambda_j$ are eigenvalues of $X^\top X$ and the $\lambda$ is the ridge tuning parameter. The prof flags this orally in L15 (“this is a confusing figure, I’ll try to make another one for next time”) but the formula stayed as-is on the slide. ISLP avoids the clash entirely by deferring to ESL.
- Effective parameters / effective degrees of freedom. Prof says “effective parameters” verbally (L12: “reduce the number of parameters effectively”); ISLP uses “degrees of freedom” only for lasso (number of non-zero coefficients) and never assigns a numerical df to ridge in ch. 6.
- $M$ vs $p$ vs $k$. Number of components / predictors / subset size: ISLP uses $k$ for subset selection (algorithms 6.1–6.3) and $M$ for dimension reduction (eq 6.16). Prof uses $k$ for subset size in L12 then $M$ for PCR/PLS. No source uses these symbols consistently across all three families.
- Standardization formula. ISLP eq 6.6 uses $\tilde{x}_{ij} = x_{ij}\Big/\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$ (centering implicit in $\bar{x}_j$). Slide deck reproduces this verbatim. Both also accept “subtract mean, divide by SD” interchangeably.
- “Capitalist vs socialist.” Prof’s only-in-lecture framing (L13, L14) for lasso vs ridge. Not in ISLP. Useful for explanations; don’t write it on the exam unironically.
Coverage map (what’s reproduced here vs. what’s lookup-only in ISLP ch. 6)
| Item | Source | This file? | ISLP location |
|---|---|---|---|
| Best subset model count | §6.1.1 | Yes (§7) | eq before 6.1.1, prose |
| Stepwise model count | §6.1.2 | Yes (§7) | footnote 2, prose |
| Backward needs $n > p$ | L12 / slide | Yes (§7) | §6.1.2 p. 247 prose |
| $C_p$ / AIC / BIC / adj-$R^2$ formulas | §6.1.3 eqs 6.2–6.4 | No (out of scope) | §6.1.3 |
| Ridge penalized objective | eq 6.5 | Lookup | §6.2.1 eq 6.5 |
| Ridge closed form | — | Yes (§1) | not in ch. 6 |
| Ridge sampling distribution | — | Yes (§1.4) | not in ch. 6 |
| Ridge hat matrix $H_\lambda$, $\mathrm{df}(\lambda)$ | — | Yes (§2) | not in ch. 6 |
| Per-PC ridge shrinkage $d_j^2/(d_j^2 + \lambda)$ | slide / L15 | Yes (§3) | footnote 8 → ESL §3.5 |
| Ridge orthonormal-case shrinkage $\hat{\beta}^{\mathrm{OLS}}_j/(1 + \lambda)$ | eq 6.14 | Lookup | §6.2.2 |
| Standardization formula | eq 6.6 | Lookup | §6.2.1 |
| Lasso penalized objective | eq 6.7 | Lookup | §6.2.2 |
| Lasso constraint form $\sum_j \lvert\beta_j\rvert \leq s$ | eq 6.8 | Lookup | §6.2.2 |
| Ridge/lasso constraint geometry (Fig 6.7) | Fig 6.7 | Lookup | §6.2.2 |
| Soft-thresholding (orthonormal) | eq 6.15 | Lookup | §6.2.2 |
| Bayesian Gaussian/Laplace priors | §6.2.2 | No (out of scope) | §6.2.2 |
| Elastic net penalized objective | L13 / slide | Yes (§5) | mentioned only |
| Bias-variance decomposition (for ridge) | Fig 6.5 | Pointer (§6) | §2.2.2 + §6.2.1 |
| Dimension reduction setup | eq 6.16 | Lookup | §6.3 |
| Back-transform | eq 6.18 | Lookup | §6.3 |
| PCA fraction of variance via eigenvalues | slide / L15 | Yes (§4) | not in ch. 6 (in §12.2.3) |
| Eigenvalue = variance of corresponding PC | L15 | Yes (§4.2) | not in ch. 6 |
| PCR algorithm | §6.3.1 | Lookup | §6.3.1 |
| PLS algorithm (numbered) | slide | Yes (§8) | §6.3.2 prose |
| High-dim breakdown of OLS | §6.4.2 | Lookup | §6.4.2 |
| Slide-flagged multicollinearity statement | slide / L15 | Yes (§9) | §6.4.4 in prose |
What’s deliberately not in this file
Per scope:
- $C_p$ / AIC / BIC / adjusted $R^2$ algebra (out of scope, conceptual claim only — see aic-bic-conceptual).
- Bayesian interpretation of ridge / lasso (Gaussian / Laplace priors) — out of scope per L14.
- L0 norm / “Optimal Brain Damage” — out of scope per L14.
- Full spectral / eigendecomposition derivations beyond stating the identity — out of scope per L04.
- Detailed PLS history / chemometrics-specific tuning — out of scope per L14 / L15.
- Elastic net detailed tuning (the form in §5 is in scope; the L1/L2-mixing optimization details are not).
- Moore-Penrose pseudoinverse details — out of scope (prof flagged in L08).
- R / Python code, package names, function syntax — out of scope per scope §“Programming policy”.