Design matrix, normal equations, and the hat matrix
The algebra under multiple regression: $X$ is $n \times (p+1)$ with a column of ones (the intercept hides in there), and $\hat\beta = (X^\top X)^{-1} X^\top y$. $H = X(X^\top X)^{-1} X^\top$ is the hat matrix: it puts the hats on $y$ (so $\hat y = Hy$), its diagonal is leverage, it appears in the LOOCV shortcut, and the prof’s annoyance with “design matrix” terminology is part of the lecture’s flavor.
Definition (prof’s framing)
“It’s often called the design matrix. The data. Never understood why. It’s not really a design of any kind. But it’s what people call it.” - L06-linreg-2
Design matrix: $X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$, an $n \times (p+1)$ matrix.
The leading column of ones lets you absorb the intercept $\beta_0$ into the same coefficient vector $\beta$. “Behind this beta is actually an X. It’s just all the values of X are one. So you don’t need to write it.” - L06-linreg-2
Hat matrix: $H = X(X^\top X)^{-1} X^\top$, so that $\hat y = Hy$.
“This matrix H is also known as the hat matrix. The reason we call it a hat matrix is that it’s where all the hats come from.” - L08-classif-2
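A minimal numpy sketch (toy data and variable names of my own, not lecture code) of both definitions: build $X$ with its leading column of ones, form $H$, and confirm that $Hy$ reproduces the fitted values from an ordinary least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                                                 # toy sizes: 20 samples, 3 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix, n x (p + 1)
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# "It's where all the hats come from": H y equals the fitted values X beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(H @ y, X @ beta_hat)
```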
Notation & setup
- $n$ samples, $p$ predictors, $X$ is $n \times (p+1)$.
- Some books include the intercept column when counting $p$, a notational gotcha he flagged.
- $H = X(X^\top X)^{-1} X^\top$ is $n \times n$, symmetric ($H = H^\top$) and idempotent ($H^2 = H$). It’s the orthogonal projection onto the column space of $X$.
- $I - H$ is the orthogonal projection onto the residual space.
- Leverage of observation $i$: $h_i = H_{ii}$, the $i$-th diagonal entry of $H$; it depends only on $X$, not on $y$.
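These properties can be checked numerically; a hedged sketch (standalone, same kind of toy setup as above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(H, H.T)                                # symmetric
assert np.allclose(H @ H, H)                              # idempotent
assert np.allclose(H @ X, X)                              # projects onto col(X): leaves X alone
assert np.allclose((np.eye(n) - H) @ X, 0, atol=1e-10)    # I - H kills col(X)

e = (np.eye(n) - H) @ y                                   # residuals live in the residual space,
assert np.allclose(X.T @ e, 0, atol=1e-8)                 # orthogonal to every column of X

h = np.diag(H)                                            # leverages: computed from X alone, y never used
assert np.isclose(h.sum(), p + 1)                         # sum-to-trace identity, tr(H) = p + 1
```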
Formula(s) to know cold
Normal equations (from differentiating RSS): $X^\top X \hat\beta = X^\top y$, hence $\hat\beta = (X^\top X)^{-1} X^\top y$.
Predictions: $\hat y = X\hat\beta = X(X^\top X)^{-1} X^\top y = Hy$.
Leverage in simple linear regression (closed form, also asked in the exercise class): $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$.
Sum-to-trace identity: $\sum_{i=1}^{n} h_i = \operatorname{tr}(H) = p + 1$.
Residual covariance: $\operatorname{Var}(e) = \sigma^2 (I - H)$, where $e = (I - H)y$.
LOOCV shortcut for OLS (only one fit needed, because $H$ already encodes how each point is fitted): $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$.
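To see the shortcut earn its keep, a sketch (toy data, the helper name `fit` is mine) comparing the one-fit formula against brute-force leave-one-out refits; for OLS the two agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

def fit(X, y):
    """OLS coefficients via the normal equations (fine for a toy example)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Shortcut: one fit, rescale each residual by 1 / (1 - h_i)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
resid = y - X @ fit(X, y)
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: n separate fits, each leaving one observation out
errs = [(y[i] - X[i] @ fit(X[np.arange(n) != i], y[np.arange(n) != i])) ** 2
        for i in range(n)]
cv_brute = np.mean(errs)

assert np.isclose(cv_shortcut, cv_brute)
```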
Insights & mental models
Three roles of $H$ in this course
The hat matrix is the binding object across modules:
- Predictor: $\hat y = Hy$ turns observed $y$ into fitted values.
- Leverage: the diagonal $h_i$ measures how much each observation pulls the fit toward itself. A point with high $h_i$ AND a large residual is the dangerous combination; see residual-diagnostics and the “fat kid on a seesaw” image from L08.
- LOOCV shortcut for OLS: the hat matrix lets you compute leave-one-out CV without re-fitting $n$ times. The prof showed this in L10-resample-1.
Tug-of-war with $(X^\top X)^{-1}$
The factor $(X^\top X)^{-1}$ is the load-bearing element:
“I think this is really why statisticians love these distributions, because you can read out what’s going to happen when you look at them. You can be like, ah, that X transpose X is going to screw us later.” - L06-linreg-2
When two predictors are nearly identical, $X^\top X$ is near-singular → its inverse blows up → the variance of $\hat\beta$ explodes (the collinearity story). See also sampling-distribution-of-beta for the covariance of $\hat\beta$.
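A tiny illustration (made-up numbers) of that warning: duplicate a predictor almost exactly and watch the conditioning of $X^\top X$, and with it the coefficient variances, blow up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)          # nearly identical second predictor
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.cond(XtX))                        # huge condition number: near-singular

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; even with sigma = 1 the diagonal explodes
print(np.diag(np.linalg.inv(XtX)))                # the two collinear coefficients get enormous variances
```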
Reduction to univariate
Recommended exercise: show that the matrix formula $\hat\beta = (X^\top X)^{-1} X^\top y$ reduces to the simple-LR formulas for $\hat\beta_0$ and $\hat\beta_1$ when $p = 1$. Same RSS, same derivative, same answer, just less notation. (A quick numerical sanity check follows below.)
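Not a substitute for the algebra, but a quick numerical sanity check (toy data) that the matrix route and the simple-LR formulas agree when $p = 1$, for the coefficients and for the leverages:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

# Matrix route: beta_hat = (X^T X)^{-1} X^T y with X = [1, x]
X = np.column_stack([np.ones(n), x])
beta_mat = np.linalg.solve(X.T @ X, X.T @ y)

# Simple-LR route: beta1 = Sxy / Sxx, beta0 = ybar - beta1 * xbar
xbar, ybar = x.mean(), y.mean()
beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0 = ybar - beta1 * xbar
assert np.allclose(beta_mat, [beta0, beta1])

# Leverage: diag(H) matches the closed form 1/n + (x_i - xbar)^2 / Sxx
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(np.diag(H), 1 / n + (x - xbar) ** 2 / np.sum((x - xbar) ** 2))
```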
Exam signals
“This matrix H is also known as the hat matrix. The reason we call it a hat matrix is that it’s where all the hats come from.” - L08-classif-2
“One exercise you can do for your exercise class is figuring out / verify this formula [for $h_i$ in simple regression] comes from linear regression.” - L08-classif-2
“I think this is really why statisticians love these distributions, because you can read out what’s going to happen when you look at them. You can be like, ah, that X transpose X is going to screw us later.” - L06-linreg-2
Pitfalls
- Forgetting the intercept column. Without the leading column of ones, $\beta_0$ is fixed at zero. If you write $X$ without the ones column, you’re forcing the fit through the origin.
- Identifiability with categoricals. Using $K$ dummies for $K$ levels makes the columns linearly dependent (they sum to the intercept column), so $X^\top X$ is singular and has no inverse. Use $K - 1$ dummies plus a reference level (see the rank-deficiency sketch after this list). See categorical-encoding-and-interactions.
- High-leverage point. A point with $h_i$ near 1 effectively determines its own fitted value ($\hat y_i \approx y_i$); the rest of the data has little say. Doesn’t break the algebra but distorts conclusions.
- $h_i$ depends only on $X$. Useful: leverage is a property of the design, not of the response. Detect high-leverage points before you even look at $y$.
- The inverse may not exist. Rank deficiency (perfect collinearity, or more columns than rows, $p + 1 > n$) breaks the closed form. Without “tricks” (regularization, pseudoinverse), OLS literally has no unique solution.
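The dummy-variable trap from the identifiability bullet, made concrete (toy factor with $K = 3$ levels; the names are mine): with the intercept plus all $K$ dummies, $X$ loses a rank and the closed form has no unique solution.

```python
import numpy as np

levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])       # toy categorical with K = 3 levels
n, K = len(levels), 3
dummies = np.eye(K)[levels]                       # one-hot encoding: K columns

# Trap: intercept + all K dummies -> the dummy columns sum to the ones column
X_bad = np.column_stack([np.ones(n), dummies])
print(np.linalg.matrix_rank(X_bad))               # 3, not 4: rank deficient
print(np.linalg.cond(X_bad.T @ X_bad))            # effectively infinite: no usable inverse

# Fix: K - 1 dummies plus a reference level restores full column rank
X_ok = np.column_stack([np.ones(n), dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))                # 3 = full column rank
```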
Scope vs ISLP
- In scope: the design-matrix structure, the normal-equations derivation, the definition of the hat matrix, leverage as $h_i = H_{ii}$, and the LOOCV shortcut formula.
- Look up in ISLP: §3.2.1 (estimating coefficients); the matrix formulation is mostly in equation form, light on derivation. §3.3.3 covers leverage briefly (pp. 97–98). §5.1.2 covers LOOCV with the shortcut formula.
- Skip in ISLP (book-only / prof excluded): Moore–Penrose pseudoinverse details (L08-classif-2: “explicitly bracketed off”). Spectral / eigen-decomposition theory of these matrices (L04-statlearn-3: deferred to Linear Statistical Models).
Exercise instances
- Exercise6.1a: derive $\hat\beta = (X^\top X)^{-1} X^\top y$ from RSS. (Listed under module 6 because it’s the LS estimator derivation; conceptually it’s module 3 material, and per L12-modelsel-1 the prof considered this an already-done module-3 exercise.)
How it might appear on the exam
- Derive the normal equations. Take RSS, expand, differentiate, solve. The 6-line derivation in L06-linreg-2 is the template (a compact worked version follows this list).
- Verify the simple-LR leverage formula. $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$ is the exercise the prof flagged in L08.
- Why does the LOOCV shortcut work for OLS? Because $H$ is independent of $y$, you can compute each $h_i$ from the full-data fit, then “remove” point $i$’s effect via the $1/(1 - h_i)$ correction.
- Read leverage off a plot. Given a residuals-vs-leverage plot (residual-diagnostics), identify the dangerous corner: high $h_i$ AND large residual.
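For reference, a compact version of the normal-equations derivation (standard matrix calculus; the lecture’s own six lines may be laid out differently):

$$
\begin{aligned}
\mathrm{RSS}(\beta) &= (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta,\\
\frac{\partial\,\mathrm{RSS}}{\partial \beta} &= -2 X^\top y + 2 X^\top X \beta \overset{!}{=} 0
\;\Rightarrow\; X^\top X \hat\beta = X^\top y
\;\Rightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y \quad (\text{if } X^\top X \text{ is invertible}).
\end{aligned}
$$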
Related
- linear-regression: the model whose algebra this is
- least-squares-and-mle: the derivation of $\hat\beta$
- sampling-distribution-of-beta: uses $(X^\top X)^{-1}$ in the covariance
- residual-diagnostics: leverage and the leverage-vs-residual plot
- collinearity: what happens when $X^\top X$ is near-singular
- leave-one-out-cv: the shortcut lives here