Module 03: Linear Regression — Book delta
Module 03 is the heaviest-delta module of the course. ISLP ch. 3 covers simple-LR algebra in detail (eq. 3.4), the SE / CI / t-test machinery for the simple case (eq. 3.7–3.10), the $F$-statistic formula (eq. 3.23–3.24), $R^2$ (eq. 3.17) and adjusted $R^2$ (in passing), categorical encoding, interactions, polynomial regression, the “potential problems” list, and the simple-LR leverage formula (eq. 3.37). But the matrix-form theory that Benjamin built in L06 is largely absent: the book explicitly says of multiple regression “the coefficient estimates have somewhat complicated forms that are most easily represented using matrix algebra. For this reason, we do not provide them here” (§3.2.1). Everything downstream of that statement — the closed-form derivation of $\hat\beta$, the hat matrix $H$ and its properties, the multivariate-normal sampling distribution of $\hat\beta$, the residual covariance $\mathrm{Cov}(e) = \sigma^2(I - H)$, the MLE-equals-LS proof, the matrix-form leverage via $H$, the matrix CI / PI formulas — is delta and is reproduced here in full.
Out-of-scope material per docs/scope.md (F-test mechanics beyond stating the null, VIF, Moore–Penrose details, formal normality tests, spectral theory of $H$) is excluded.
1. The matrix-form linear model and design matrix
[L06, linear-regression, design-matrix-and-hat-matrix]
ISLP §3.2 presents the multiple-LR model in scalar form (eq. 3.19) but never writes the matrix form or defines the design matrix as an object. Benjamin’s matrix-form setup is the foundation for everything that follows.
Model

$$Y = X\beta + \varepsilon$$
Dimensions:
- $Y \in \mathbb{R}^n$ — response vector.
- $X \in \mathbb{R}^{n \times (p+1)}$ — design matrix (Benjamin’s grudging name: “never understood why. It’s not really a design of any kind. But it’s what people call it” L06).
- $\beta \in \mathbb{R}^{p+1}$ — parameter vector, intercept plus $p$ slopes.
- $\varepsilon \in \mathbb{R}^n$ — error vector.
Explicit design matrix

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$$
The leading column of ones absorbs the intercept so the bias term disappears from the matrix equation. “Behind this beta is actually an X. It’s just all the values of X are one. So you don’t need to write it” L06.
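A minimal numpy sketch of the same construction, on made-up data (the sizes and coefficients are arbitrary): prepending the column of ones makes the intercept just another column’s coefficient, so the whole model is one matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                      # n observations, p slopes (hypothetical sizes)
X_raw = rng.normal(size=(n, p))   # predictor values only

# Prepend the column of ones: beta_0 becomes the coefficient of that column.
X = np.column_stack([np.ones(n), X_raw])      # shape (n, p+1)

beta = np.array([2.0, 0.5, -1.0, 3.0])        # [beta_0, beta_1, ..., beta_p]
eps = rng.normal(scale=0.3, size=n)           # spherical Gaussian noise
y = X @ beta + eps                            # the matrix-form model Y = X beta + eps
print(X.shape, y.shape)                       # (50, 4) (50,)
```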
Multivariate-normal form of the assumptions
ISLP §3.1.2 gives the scalar assumptions ($\varepsilon_i$ uncorrelated, common $\sigma^2$). Benjamin restates them as one $n$-dimensional multivariate-normal statement:

$$\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma^2 I_n\right)$$
The covariance matrix $\sigma^2 I_n$ has $\sigma^2$ on the diagonal, zeros off — so the off-diagonals encode independence (assumption 5 in L05’s list), the diagonal equality encodes homoscedasticity (assumption 3). Geometrically the error vector is a spherical $n$-dimensional Gaussian — “no matter which direction you look in, $\varepsilon$ has the same variance, and it’s just kind of a big n-dimensional bell” L06.
Notation gotcha (flagged by the prof)
“You can define $p$ as including or not the intercept or bias term. This is just a note for those who are taking both classes that the notation is different in the books” L06. In this delta file, $p$ = number of slopes, so $X$ has $p+1$ columns and df-for-noise is $n - p - 1$. ISLP uses the same convention (eq. 3.25).
Classical regime
“$n > p$: more data points than parameters.” L06 When $n < p + 1$, $X^\top X$ is necessarily singular (rank at most $n$) and OLS has no unique solution. This sets up module 6.
2. The OLS derivation in matrix form
ISLP states the simple-LR closed form (eq. 3.4) and explicitly declines to derive the multiple-LR matrix form (§3.2.1). Benjamin did it on the board in three lines and flagged it as exam-template material via Exercise 6.1a / L12. Here is the full derivation.
Step 1 — write RSS in matrix form

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2 = (Y - X\beta)^\top (Y - X\beta)$$
Step 2 — expand

$$\mathrm{RSS}(\beta) = Y^\top Y - Y^\top X\beta - \beta^\top X^\top Y + \beta^\top X^\top X\beta = Y^\top Y - 2\,\beta^\top X^\top Y + \beta^\top X^\top X\beta$$

The two cross terms $Y^\top X\beta$ and $\beta^\top X^\top Y$ combine because each is a scalar and a scalar equals its transpose.
Step 3 — differentiate w.r.t. $\beta$, set to zero

Using $\frac{\partial}{\partial\beta}\,(a^\top\beta) = a$ and $\frac{\partial}{\partial\beta}\,(\beta^\top A\beta) = 2A\beta$ for symmetric $A$:

$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2\,X^\top Y + 2\,X^\top X\beta = 0$$
Step 4 — normal equations

$$X^\top X\,\beta = X^\top Y$$
Step 5 — invert when full rank
If $X^\top X$ is invertible (i.e. $X$ has full column rank $p+1$, which requires $n \ge p+1$ and no collinearity):

$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
Uniqueness
“I’m not showing, but you can prove if you want to take more derivatives, that this problem has a unique solution. There’s only one solution” L06. The second derivative (Hessian) is $2\,X^\top X$, positive definite when $X$ has full column rank — so the stationary point is a strict minimum, and it is the unique one.
Why this matters at the meta-level
“Most of the time, you actually have to go iteratively… most of the time when we’re trying to find this peak, we have to like climb up and go there. In fact, I would argue most of machine learning is finding good tricks to get to that peak… But in the case of linear regression with full-rank X, just right to the top. Very convenient” L06. This is why OLS is the canonical model for everything downstream (CIs, t-tests, exact distributions): we have the estimator in closed form, so every other quantity is also exact.
Reduction to simple-LR
In the $p = 1$ case,

$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},$$

and direct calculation of $(X^\top X)^{-1} X^\top Y$ reproduces ISLP eq. 3.4:

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
This consistency check is the recommended exercise L06.
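A quick numerical version of that consistency check on synthetic data (the coefficients and noise level are arbitrary): the matrix-form $(X^\top X)^{-1}X^\top Y$ and the ISLP eq. 3.4 formulas agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Matrix form: beta_hat = (X^T X)^{-1} X^T y  (solve() is preferred to an explicit inverse)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ISLP eq. 3.4 simple-LR formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(beta_hat)        # matrix-form estimate [beta_0_hat, beta_1_hat]
print(b0, b1)          # identical up to floating-point error
assert np.allclose(beta_hat, [b0, b1])
```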
3. The MLE ⇔ least-squares equivalence
[L05, L06, L27, least-squares-and-mle]
ISLP never proves this. It is the prof’s flagged theory-question template for the exam: “I do generally like to keep one theory question. … assume an additive Gaussian error model … Show that maximum likelihood and least squares are equivalent in $\hat\beta$” L27. Reproduced here in full.
Setup
Assume $y_i = x_i^\top\beta + \varepsilon_i$ with $\varepsilon_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, so $y_i \mid x_i \sim \mathcal{N}(x_i^\top\beta,\ \sigma^2)$.
Likelihood
The joint density of $y_1, \dots, y_n$ given $X$, $\beta$, $\sigma^2$:

$$L(\beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right)$$
Log-likelihood

$$\ell(\beta, \sigma^2) = \log L(\beta, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2$$
The argument
The first term does not depend on $\beta$. The factor $\frac{1}{2\sigma^2}$ in front of the sum is a positive constant that doesn’t change the location of the maximum. So

$$\arg\max_\beta\ \ell(\beta, \sigma^2) = \arg\min_\beta\ \sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2 = \arg\min_\beta\ \mathrm{RSS}(\beta).$$

So $\hat\beta_{\mathrm{MLE}} = \hat\beta_{\mathrm{LS}} = (X^\top X)^{-1} X^\top Y$.
MLE for $\sigma^2$ (as a side-effect)
Differentiating the log-likelihood w.r.t. $\sigma^2$ and solving gives $\hat\sigma^2_{\mathrm{MLE}} = \mathrm{RSS}/n$. The unbiased estimator (which is what is used everywhere else in the course, including residual standard error) is $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$, with the $p+1$ accounting for the parameters consumed by $\hat\beta$. Benjamin says of the difference: “if [$n$ is large] it barely matters” L05.
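A sketch of the equivalence on synthetic data, assuming scipy is available: numerically minimizing the Gaussian negative log-likelihood over $\beta$ lands on the closed-form least-squares solution, and the MLE and unbiased estimates of $\sigma^2$ differ only by the divisor.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

# Closed-form least squares
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Negative log-likelihood over beta under the Gaussian error model:
# up to additive constants, -loglik = RSS / (2 sigma^2), and the 1/(2 sigma^2)
# factor does not move the argmax over beta, so it is dropped here.
def neg_log_lik(beta):
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p + 1)).x
print(np.allclose(beta_ls, beta_mle, atol=1e-4))   # True: MLE == least squares

# MLE (divide by n) vs unbiased (divide by n - p - 1) estimate of sigma^2
rss = np.sum((y - X @ beta_ls) ** 2)
print(rss / n, rss / (n - p - 1))                  # close when n >> p
```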
What other loss → what other distribution
A useful by-product the prof drew on the board: a different choice of penalty implies a different error distribution.
| Loss | MLE-equivalent error distribution | Property |
|---|---|---|
| Squared error ($\ell_2$) | Gaussian | quadratic cost; outlier-sensitive |
| Absolute error ($\ell_1$) | Laplace (double-exponential) | linear cost; robust to outliers |
| Higher powers | exotic, never used | “would really, really penalize anything far away” L05 |
“If we had our data was like this and then there was a point here, that point would have a stronger effect when fitting the model with a least squares fit, whereas a Laplace fit it wouldn’t be pulling it as strongly” L05.
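A rough illustration of that board example, with invented data and one injected outlier: the squared-error (Gaussian-MLE) fit is noticeably pulled toward the outlier, while the absolute-error (Laplace-MLE) fit is not. The optimizer choice and data here are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 40
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[-1] += 30.0                        # one gross outlier

X = np.column_stack([np.ones(n), x])

def fit(loss):
    # Minimize the total penalty over (beta_0, beta_1); Nelder-Mead handles the
    # non-smooth absolute-value objective as well as the smooth squared one.
    return minimize(lambda b: np.sum(loss(y - X @ b)),
                    x0=np.zeros(2), method="Nelder-Mead").x

beta_l2 = fit(lambda r: r ** 2)      # Gaussian MLE / least squares
beta_l1 = fit(lambda r: np.abs(r))   # Laplace MLE / least absolute deviations

print(beta_l2)   # dragged toward the outlier
print(beta_l1)   # much closer to the true (1, 2)
```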
Pitfall the prof himself stumbled on
“A student caught the prof on a sign during the L27 walkthrough” least-squares-and-mle. On the exam, state explicitly that maximizing $\ell$ is the same as minimizing $-\ell$, which is the same as minimizing RSS up to constants. The sign-flip is the standard place to lose a point.
4. The hat matrix
[L06, L08, design-matrix-and-hat-matrix]
ISLP §3.3.3 mentions $h_i$ (the diagonal element only) in eq. 3.37 for simple LR and notes “there is a simple extension of $h_i$ to the case of multiple predictors, though we do not provide the formula here.” Everything else about $H$ is delta. This is the matrix the prof said “has all the shit you need to get your hats for your parameters. So it’s called the hat matrix” L06.
Definition

$$H = X (X^\top X)^{-1} X^\top$$
Why “hat matrix”
Predictions $\hat y$ are obtained from $y$ by applying $H$:

$$\hat y = X\hat\beta = X (X^\top X)^{-1} X^\top y = Hy$$
“In math we call it a hat. It’s a pointy hat. But it’s a hat. And so this matrix H has all the shit you need to get your hats for your parameters” L06.
Properties (provable in two lines each)
These are the structural facts Benjamin emphasizes and that ISLP never lists.
(P1) Symmetric.
$H^\top = \big(X (X^\top X)^{-1} X^\top\big)^\top = X \big((X^\top X)^{-1}\big)^\top X^\top = X (X^\top X)^{-1} X^\top = H$, using that $X^\top X$ is symmetric, so its inverse is symmetric.
(P2) Idempotent. $H^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = H$.
(P3) Orthogonal projection. $Hy$ is the orthogonal projection of $y$ onto the column space of $X$. Combined: a symmetric idempotent matrix is exactly an orthogonal projector. Geometrically: $\hat y = Hy$ is the closest point to $y$ in the column space of $X$ (which is exactly what least squares means).
(P4) Residual projector. $I - H$ is also symmetric and idempotent, and projects onto the orthogonal complement of the column space (the “residual space”):

$$e = y - \hat y = (I - H)\,y$$
(P5) Orthogonality of fitted values and residuals.

$$\hat y^\top e = (Hy)^\top (I - H)\,y = y^\top \big(H - H^2\big)\,y = 0$$

The fitted values and the residuals are orthogonal vectors in $\mathbb{R}^n$.
(P6) Trace = rank = $p + 1$.

$$\mathrm{tr}(H) = \mathrm{tr}\!\big(X (X^\top X)^{-1} X^\top\big) = \mathrm{tr}\!\big((X^\top X)^{-1} X^\top X\big) = \mathrm{tr}\!\left(I_{p+1}\right) = p + 1$$

using the cyclic property of trace. Equivalently $\sum_{i=1}^n h_i = p + 1$. Average leverage is $(p+1)/n$ (ISLP §3.3.3 mentions this fact without deriving it).
(P7) Each diagonal entry satisfies $1/n \le h_i \le 1$. Lower bound follows from including the intercept column; upper bound from idempotency.
(P8) Leverage in multiple LR (the formula ISLP declined to give).

$$h_i = x_i^\top (X^\top X)^{-1} x_i$$

where $x_i$ is the $i$-th row of $X$ written as a column vector (including the leading 1).
(P9) Leverage depends only on $X$, not on $y$. So a high-leverage point can be flagged from the design alone, before any response data is observed.
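The structural properties above are easy to sanity-check numerically on a random design matrix (a minimal sketch; the sizes are arbitrary, and the explicit inverse is fine at this scale):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_i

print(np.allclose(H, H.T))               # P1: symmetric
print(np.allclose(H @ H, H))             # P2: idempotent
print(np.isclose(np.trace(H), p + 1))    # P6: trace = p + 1
print(np.all(h >= 1 / n - 1e-12), np.all(h <= 1 + 1e-12))   # P7: 1/n <= h_i <= 1

y = rng.normal(size=n)
y_hat, e = H @ y, (np.eye(n) - H) @ y
print(np.isclose(y_hat @ e, 0.0))        # P5: fitted values orthogonal to residuals
```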
Residual covariance
The fact ISLP never states. Starting from $e = (I - H)\,y$ and $\mathrm{Cov}(y) = \sigma^2 I$:

$$\mathrm{Cov}(e) = (I - H)\,\sigma^2 I\,(I - H)^\top = \sigma^2 (I - H)$$
Implications:
- $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$ — raw residuals have unequal variances. High-leverage points have smaller residual variance (they pull the fit toward themselves, so their residual is small).
- $\mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij} \neq 0$ in general — raw residuals are correlated, even when the true errors are independent.
This motivates standardized residuals:

$$r_i = \frac{e_i}{\hat\sigma\sqrt{1 - h_i}},$$
which have approximately unit variance and let the QQ plot / residuals-vs-fitted plot be read with the assumed Gaussian behaviour. “Your betas stay the same. It’s just a way to say, is my model any good?” L08.
Studentized residuals swap in $\hat\sigma_{(i)}$ (the residual SE computed from the data with point $i$ deleted) to remove the circular use of $y_i$ in fitting and evaluating point $i$. For $n \gg p$, standardized and studentized residuals are essentially indistinguishable L08.
Leverage in simple LR (the prof flagged this as the exercise question)
Direct algebra on $h_i = x_i^\top (X^\top X)^{-1} x_i$ for $X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$ gives

$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}.$$

ISLP states this (eq. 3.37) but the derivation is delta. The point: $h_i$ grows with distance from $\bar x$ — extreme $x$-values are high-leverage.
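A two-line numerical check of eq. 3.37 against the diagonal of $H$, on arbitrary synthetic $x$ values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h_from_H = np.diag(H)

# ISLP eq. 3.37: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h_from_H, h_formula))   # True
```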
LOOCV shortcut for OLS
For OLS fits only, leave-one-out cross-validation can be computed from one full-data fit using the hat matrix:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_i}\right)^2$$
ISLP §5.1.2 states this for OLS without much justification. The reason it works is exactly P9 — leverage depends only on $X$, not on $y$ — so leaving out the $i$-th response leaves $h_i$ unchanged and the leave-one-out fitted value can be recovered analytically. Owned by leave-one-out-cv in module 5.
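A sketch verifying the shortcut on synthetic data: one hat-matrix pass gives the same LOOCV error as $n$ explicit refits (for OLS only).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.8, size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
y_hat = H @ y

# Hat-matrix shortcut: one full-data fit, residuals rescaled by (1 - h_i)
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute-force leave-one-out: n separate fits
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(np.isclose(cv_shortcut, cv_brute))   # True
```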
5. Sampling distribution of $\hat\beta$
[L05, L06, sampling-distribution-of-beta]
ISLP gives diagonal SE formulas for the simple-LR case in eq. 3.8 and waves vaguely at multiple LR. The clean multivariate theorem and its derivation are delta and are the load-bearing fact for all of regression inference.
Theorem
Under the Gaussian linear model with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$:

$$\hat\beta \sim \mathcal{N}\!\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$$
Derivation (three lines)
Write $A = (X^\top X)^{-1} X^\top$, so $\hat\beta = AY$. Use $Y \sim \mathcal{N}(X\beta,\ \sigma^2 I)$ and the linear-transformation property of the multivariate normal: $\hat\beta = AY$ is multivariate normal with

$$\mathbb{E}[\hat\beta] = A X\beta = \beta, \qquad \mathrm{Cov}(\hat\beta) = A\,(\sigma^2 I)\,A^\top = \sigma^2 (X^\top X)^{-1}.$$
Consequences
(C1) Unbiasedness. $\mathbb{E}[\hat\beta] = \beta$. “That’s what we want. If it was biased then we’d be upset because then our model is not going to give us the right shit” L06.
(C2) Per-coefficient variance. $\mathrm{Var}(\hat\beta_j) = \sigma^2 \big[(X^\top X)^{-1}\big]_{jj}$ — the $j$-th diagonal of $\sigma^2 (X^\top X)^{-1}$. The SE estimator is $\widehat{\mathrm{SE}}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^\top X)^{-1}\big]_{jj}}$.
(C3) Coefficients are correlated. The off-diagonals of $\sigma^2 (X^\top X)^{-1}$ are generally nonzero. In simple LR, $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\sigma^2 \bar x \big/ \sum_i (x_i - \bar x)^2$; zero iff $\bar x = 0$ (center the data and the intercept becomes uncorrelated from the slope). ISLP nowhere states this.
(C4) Collinearity blow-up. As columns of $X$ become near-linearly-dependent, $X^\top X$ becomes near-singular, its inverse’s entries blow up, and individual $\mathrm{SE}(\hat\beta_j)$ explode. The prof’s load-bearing observation: “This factor X transpose X comes into play in particular when two variables are basically the same, because then they can trade off each other and then this variance explodes” L06. ISLP discusses collinearity qualitatively in §3.3.3 but never connects it to the matrix algebra explicitly.
(C5) Why centering helps numerics. Centering $x$ around its mean kills the $\bar x$ term in the off-diagonal of $(X^\top X)^{-1}$ for simple LR (and reduces correlations between intercept and slopes in MLR), giving a better-conditioned inversion.
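A small demonstration of (C4), using a second predictor built as an increasingly exact copy of the first (the noise levels and $\sigma$ are arbitrary): as the two columns approach linear dependence, the slope SEs read off $\sigma^2 (X^\top X)^{-1}$ blow up.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
sigma = 1.0

for noise in [1.0, 0.1, 0.01]:           # x2 becomes progressively closer to x1
    x2 = x1 + noise * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    cov_beta = sigma ** 2 * np.linalg.inv(X.T @ X)   # sigma^2 (X^T X)^{-1}
    se = np.sqrt(np.diag(cov_beta))
    print(f"noise={noise:5.2f}  SE(beta_1)={se[1]:8.3f}  SE(beta_2)={se[2]:8.3f}")
# The SEs on the two near-collinear slopes explode as the columns approach
# linear dependence, exactly the "trade off each other" effect.
```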
Residual standard error (matrix-form derivation)
The unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - p - 1} = \frac{e^\top e}{n - p - 1}, \qquad \mathrm{RSE} = \hat\sigma.$$

Unbiasedness: $\mathbb{E}\big[e^\top e\big] = \mathbb{E}\big[\varepsilon^\top (I - H)\,\varepsilon\big] = \sigma^2\,\mathrm{tr}(I - H) = \sigma^2 (n - p - 1)$ using property P6 of $H$. Dividing by $n - p - 1$ gives an unbiased estimator. ISLP states the divisor $n - p - 1$ (eq. 3.25) without this derivation; the $n - p - 1$ is the rank of the residual projector $I - H$.
In simple LR this collapses to $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - 2)}$ (ISLP eq. 3.15). “Two degrees of freedom are eaten by $\hat\beta_0$ and $\hat\beta_1$” L05. In general, “$p + 1$ degrees of freedom are eaten by the $p + 1$ entries of $\hat\beta$.”
Independence of $\hat\beta$ and $\hat\sigma^2$
A classical result Benjamin invokes implicitly when justifying t-tests with df $n - p - 1$: under the Gaussian linear model, $\hat\beta$ and $\hat\sigma^2$ are independent random variables. Sketch: $\hat\beta$ depends on $Hy$; $\hat\sigma^2$ depends on $(I - H)\,y$. The two random vectors $Hy$ and $(I - H)\,y$ are jointly Gaussian and uncorrelated ($H(I - H) = 0$), hence independent. This independence is what licenses the t-statistic to have an exact $t_{n-p-1}$ distribution under $H_0$, rather than just approximately Gaussian.
Walpole is the prof’s recommended classical reference; the result is needed to get the df exactly right sampling-distribution-of-beta.
6. The $p$-aware t-statistic and matrix-form CI
[L06, t-test-and-significance, confidence-and-prediction-intervals]
ISLP gives the simple-LR t-statistic (eq. 3.14) with $n - 2$ df. The matrix-form version with $n - p - 1$ df and $\widehat{\mathrm{SE}}(\hat\beta_j)$ — and the explicit pointer to which diagonal of $(X^\top X)^{-1}$ — is delta.
Per-coefficient t-test
$$t = \frac{\hat\beta_j - 0}{\widehat{\mathrm{SE}}(\hat\beta_j)} \ \sim\ t_{n-p-1} \quad \text{under } H_0: \beta_j = 0,$$

where $\widehat{\mathrm{SE}}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^\top X)^{-1}\big]_{jj}}$.
Per-coefficient CI

$$\hat\beta_j \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \widehat{\mathrm{SE}}(\hat\beta_j)$$
CI for the mean response at $x_0$
ISLP §3.2.2 mentions CIs for the mean response verbally but does not give the matrix-form formula. Delta:

$$x_0^\top\hat\beta \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \hat\sigma \sqrt{x_0^\top (X^\top X)^{-1} x_0}$$
Derivation: $x_0^\top\hat\beta$ is a linear function of the multivariate-normal $\hat\beta$, so

$$x_0^\top\hat\beta \ \sim\ \mathcal{N}\!\big(x_0^\top\beta,\ \sigma^2\, x_0^\top (X^\top X)^{-1} x_0\big).$$
PI for a future observation at $x_0$

$$x_0^\top\hat\beta \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \hat\sigma \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$$
Derivation: a future observation $y_0 = x_0^\top\beta + \varepsilon_0$ has $\varepsilon_0 \sim \mathcal{N}(0, \sigma^2)$ independent of the past data, so

$$\mathrm{Var}\big(y_0 - x_0^\top\hat\beta\big) = \sigma^2 + \sigma^2\, x_0^\top (X^\top X)^{-1} x_0 = \sigma^2\big(1 + x_0^\top (X^\top X)^{-1} x_0\big).$$
The +1 under the square root is the irreducible noise $\sigma^2$ — the source of “PI always wider than CI” L06.
Band shape
Both bands are narrowest where $x_0$ is near the centroid of the data (because $x_0^\top (X^\top X)^{-1} x_0$ is small there) and fan out at the extremes. The CI band hugs the line; the PI band is wider by a constant-σ² floor.
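A sketch of the band shapes on synthetic simple-LR data, assuming scipy is available for the t quantile: CI and PI half-widths are computed at the centroid and at the extremes of the observed $x$ range.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 60
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.2 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma_hat = np.sqrt(rss / (n - 2))              # p = 1 slope, so df = n - 2
t_crit = stats.t.ppf(0.975, df=n - 2)

for x0_val in [x.mean(), x.min(), x.max()]:     # centroid vs. the extremes
    x0 = np.array([1.0, x0_val])
    quad = x0 @ XtX_inv @ x0                    # x0^T (X^T X)^{-1} x0
    fit = x0 @ beta_hat
    ci_half = t_crit * sigma_hat * np.sqrt(quad)        # CI for the mean response
    pi_half = t_crit * sigma_hat * np.sqrt(1 + quad)    # PI for a new observation
    print(f"x0={x0_val:5.2f}  fit={fit:6.2f}  CI ±{ci_half:5.2f}  PI ±{pi_half:5.2f}")
# Both half-widths are smallest near the centroid; the PI is always wider (the +1 term).
```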
7. The F-statistic and the partial F-statistic (statement only)
Scope flag. Benjamin was emphatic: “I’m going to say right now I probably won’t ask any questions about an F-test. … I’m not going to make you compute it because I honestly don’t care” L06. The mechanics are out of scope; the null hypothesis and the why-you’d-use-it reasoning are in. ISLP §3.2.2 gives both eq. 3.23 and eq. 3.24 in full. No delta on F-test math.
The one structural point that is delta and worth stating: $F = t^2$ for a single-coefficient test (i.e. $F_{1,\,n-p-1} = t_{n-p-1}^2$), so the F-test on a single coefficient is equivalent to the squared t-test f-test. ISLP §3.2.2 mentions this in a footnote (footnote 7) but does not derive it.
8. Notation and naming differences
“Bias” vs “intercept”
ISLP uses “intercept” throughout. Benjamin prefers “bias” for $\beta_0$ and uses both interchangeably. “The bias is $\beta_0$” / “the intercept is $\beta_0$” — same object. L05
$p$ counting
ISLP and Benjamin both use $p$ to mean number of slopes, with the design matrix having $p + 1$ columns and df $n - p - 1$. The convention is consistent across both sources, but Benjamin flagged that some books count the intercept inside $p$. L06
“Residuals are predictions of errors”
Benjamin draws a sharp distinction that ISLP does not:
“The error terms are random variables and cannot be estimated. They can be predicted.” L05
So $\varepsilon_i$ is an unobservable random variable, and the residual $e_i$ is a prediction of $\varepsilon_i$, not an estimate. ISLP uses “estimate” loosely. Stating this distinction may earn marks on a careful T/F.
“Design matrix”
ISLP uses “design matrix” with no commentary. Benjamin keeps the name but mocks it: “It’s often called the design matrix. The data. Never understood why. It’s not really a design of any kind. But it’s what people call it” L06.
Independence as the load-bearing assumption
ISLP §3.3.3 lists “correlation of error terms” as item 2 of six “potential problems” with no ranking. Benjamin ranks the assumptions, with independence (4 and 5 in his list) as the dangerous ones and Gaussian / zero-mean / homoscedastic as relatively benign: “violations [of independence] ruin everything” L05. This is not a formula difference, but it is a framing the exam might test (e.g. “which assumption violation most invalidates the SE estimates?”).
”Main-effects rule” vs “hierarchical principle”
Same thing. ISLP §3.3.2 calls it the hierarchical principle. Benjamin calls it the main-effects rule L06 — verbatim: “whenever you include an interaction, you want to include what is referred to as the main effects.”
“Statistical vs practical significance”
Not in ISLP. Benjamin’s organizing framing for the t-test discussion: large $n$ makes everything statistically significant; the slope size is what tells you whether it actually matters. “Significance is just sample size” L05. This is a framing the exam will likely test via T/F.
Five-item assumption list (vs ISLP’s six-item problem list)
L05’s positive assumption list:
- Normally distributed errors $\varepsilon_i$.
- Mean zero: $\mathbb{E}[\varepsilon_i] = 0$.
- Common variance $\sigma^2$.
- Independent of any other variable.
- Independent of each other.
ISLP §3.3.3 gives the violations:
- Non-linearity.
- Correlation of error terms.
- Non-constant variance.
- Outliers.
- High-leverage points.
- Collinearity.
These are inverse views. Benjamin’s framing is “what you assumed”; ISLP’s is “what can go wrong.” Mapping: (1)→(non-Gaussian residuals shown on QQ), (3)→(non-constant variance), (4)+(5)→(error correlation, e.g. time series tracking).