Design matrix, normal equations, and the hat matrix
The algebra under multiple regression: $X$ is $n \times (p+1)$ with a column of ones (the intercept hides in there), and $\hat\beta = (X^\top X)^{-1} X^\top y$. $H = X(X^\top X)^{-1} X^\top$ is the hat matrix: it puts the hats on $y$ (so $\hat y = Hy$), its diagonal is leverage, it appears in the LOOCV shortcut, and the prof’s annoyance with “design matrix” terminology is part of the lecture’s flavor.
Definition (prof’s framing)
“It’s often called the design matrix. The data. Never understood why. It’s not really a design of any kind. But it’s what people call it.” - L06-linreg-2
Design matrix: $X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$, an $n \times (p+1)$ matrix.
The leading column of ones lets you absorb the intercept $\beta_0$ into the same coefficient vector $\beta$. “Behind this beta is actually an X. It’s just all the values of X are one. So you don’t need to write it.” - L06-linreg-2
Hat matrix: $H = X(X^\top X)^{-1} X^\top$, so that $\hat y = Hy$.
“This matrix H is also known as the hat matrix. The reason we call it a hat matrix is that it’s where all the hats come from.” - L08-classif-2
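A minimal numpy sketch (toy data and variable names of my own, not lecture code) of both definitions: build $X$ with its leading column of ones, form $H$, and confirm that $Hy$ reproduces the fitted values from an ordinary least-squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3                                                 # toy sizes: 20 samples, 3 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix, n x (p + 1)
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T

# "It's where all the hats come from": H y equals the fitted values X beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(H @ y, X @ beta_hat)
```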
Notation & setup
- $n$ samples, $p$ predictors, $X$ is $n \times (p+1)$.
- Some books include the intercept column when counting $p$, a notational gotcha he flagged.
- $H = X(X^\top X)^{-1} X^\top$ is $n \times n$, symmetric ($H = H^\top$) and idempotent ($H^2 = H$). It’s the orthogonal projection onto the column space of $X$.
- $I - H$ is the orthogonal projection onto the residual space.
- Leverage of observation $i$: $h_i = H_{ii}$, the $i$-th diagonal entry of $H$; it depends only on $X$, not on $y$.
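These properties can be checked numerically; a hedged sketch (standalone, same kind of toy setup as above):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
y = rng.normal(size=n)
H = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(H, H.T)                                # symmetric
assert np.allclose(H @ H, H)                              # idempotent
assert np.allclose(H @ X, X)                              # projects onto col(X): leaves X alone
assert np.allclose((np.eye(n) - H) @ X, 0, atol=1e-10)    # I - H kills col(X)

e = (np.eye(n) - H) @ y                                   # residuals live in the residual space,
assert np.allclose(X.T @ e, 0, atol=1e-8)                 # orthogonal to every column of X

h = np.diag(H)                                            # leverages: computed from X alone, y never used
assert np.isclose(h.sum(), p + 1)                         # sum-to-trace identity, tr(H) = p + 1
```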
Formula(s) to know cold
Normal equations (from differentiating RSS): $X^\top X \hat\beta = X^\top y$, hence $\hat\beta = (X^\top X)^{-1} X^\top y$.
Predictions: $\hat y = X\hat\beta = X(X^\top X)^{-1} X^\top y = Hy$.
Leverage in simple linear regression (closed form, also asked in the exercise class): $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$.
Sum-to-trace identity: $\sum_{i=1}^{n} h_i = \operatorname{tr}(H) = p + 1$.
Residual covariance: $\operatorname{Var}(e) = \sigma^2 (I - H)$, where $e = (I - H)y$.
LOOCV shortcut for OLS (only one fit needed, because $H$ already encodes how each point is fitted): $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2$.
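To see the shortcut earn its keep, a sketch (toy data, the helper name `fit` is mine) comparing the one-fit formula against brute-force leave-one-out refits; for OLS the two agree to machine precision.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

def fit(X, y):
    """OLS coefficients via the normal equations (fine for a toy example)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Shortcut: one fit, rescale each residual by 1 / (1 - h_i)
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
resid = y - X @ fit(X, y)
cv_shortcut = np.mean((resid / (1 - h)) ** 2)

# Brute force: n separate fits, each leaving one observation out
errs = [(y[i] - X[i] @ fit(X[np.arange(n) != i], y[np.arange(n) != i])) ** 2
        for i in range(n)]
cv_brute = np.mean(errs)

assert np.isclose(cv_shortcut, cv_brute)
```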
Insights & mental models
Three roles of $H$ in this course
The hat matrix is the binding object across modules:
- Predictor: $\hat y = Hy$ turns observed $y$ into fitted values.
- Leverage: the diagonal $h_i$ measures how much each observation pulls the fit toward itself. A point with high $h_i$ AND a large residual is the dangerous combination; see residual-diagnostics and the “fat kid on a seesaw” image from L08.
- LOOCV shortcut for OLS: the hat matrix lets you compute leave-one-out CV without re-fitting $n$ times. The prof showed this in L10-resample-1.
Tug-of-war with $(X^\top X)^{-1}$
The factor $(X^\top X)^{-1}$ is the load-bearing element:
“I think this is really why statisticians love these distributions, because you can read out what’s going to happen when you look at them. You can be like, ah, that X transpose X is going to screw us later.” - L06-linreg-2
When two predictors are nearly identical, $X^\top X$ is near-singular → its inverse blows up → the variance of $\hat\beta$ explodes (the collinearity story). See also sampling-distribution-of-beta for the covariance of $\hat\beta$.
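A tiny illustration (made-up numbers) of that warning: duplicate a predictor almost exactly and watch the conditioning of $X^\top X$, and with it the coefficient variances, blow up.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)          # nearly identical second predictor
X = np.column_stack([np.ones(n), x1, x2])

XtX = X.T @ X
print(np.linalg.cond(XtX))                        # huge condition number: near-singular

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; even with sigma = 1 the diagonal explodes
print(np.diag(np.linalg.inv(XtX)))                # the two collinear coefficients get enormous variances
```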
Reduction to univariate
Recommended exercise: show that the matrix formula $\hat\beta = (X^\top X)^{-1} X^\top y$ reduces to the simple-LR formulas for $\hat\beta_0$ and $\hat\beta_1$ when $p = 1$. Same RSS, same derivative, same answer, just less notation. (A quick numerical sanity check follows below.)
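Not a substitute for the algebra, but a quick numerical sanity check (toy data) that the matrix route and the simple-LR formulas agree when $p = 1$, for the coefficients and for the leverages:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 25
x = rng.normal(size=n)
y = 3.0 + 1.5 * x + rng.normal(size=n)

# Matrix route: beta_hat = (X^T X)^{-1} X^T y with X = [1, x]
X = np.column_stack([np.ones(n), x])
beta_mat = np.linalg.solve(X.T @ X, X.T @ y)

# Simple-LR route: beta1 = Sxy / Sxx, beta0 = ybar - beta1 * xbar
xbar, ybar = x.mean(), y.mean()
beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0 = ybar - beta1 * xbar
assert np.allclose(beta_mat, [beta0, beta1])

# Leverage: diag(H) matches the closed form 1/n + (x_i - xbar)^2 / Sxx
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(np.diag(H), 1 / n + (x - xbar) ** 2 / np.sum((x - xbar) ** 2))
```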
Exam signals
“This matrix H is also known as the hat matrix. The reason we call it a hat matrix is that it’s where all the hats come from.” - L08-classif-2
“One exercise you can do for your exercise class is figuring out / verify this formula [for $h_i$ in simple regression] comes from linear regression.” - L08-classif-2
“I think this is really why statisticians love these distributions, because you can read out what’s going to happen when you look at them. You can be like, ah, that X transpose X is going to screw us later.” - L06-linreg-2
Pitfalls
- Forgetting the intercept column. Without the leading column of ones, $\beta_0$ is fixed at zero. If you write $X$ without the ones column, you’re forcing the fit through the origin.
- Identifiability with categoricals. Using $K$ dummies for $K$ levels makes the columns linearly dependent (they sum to the intercept column), so $X^\top X$ is singular and has no inverse. Use $K - 1$ dummies plus a reference level (see the rank-deficiency sketch after this list). See categorical-encoding-and-interactions.
- High-leverage point. A point with $h_i$ near 1 effectively determines its own fitted value ($\hat y_i \approx y_i$); the rest of the data has little say. Doesn’t break the algebra but distorts conclusions.
- $h_i$ depends only on $X$. Useful: leverage is a property of the design, not of the response. Detect high-leverage points before you even look at $y$.
- The inverse may not exist. Rank deficiency (perfect collinearity, or more columns than rows, $p + 1 > n$) breaks the closed form. Without “tricks” (regularization, pseudoinverse), OLS literally has no unique solution.
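The dummy-variable trap from the identifiability bullet, made concrete (toy factor with $K = 3$ levels; the names are mine): with the intercept plus all $K$ dummies, $X$ loses a rank and the closed form has no unique solution.

```python
import numpy as np

levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])       # toy categorical with K = 3 levels
n, K = len(levels), 3
dummies = np.eye(K)[levels]                       # one-hot encoding: K columns

# Trap: intercept + all K dummies -> the dummy columns sum to the ones column
X_bad = np.column_stack([np.ones(n), dummies])
print(np.linalg.matrix_rank(X_bad))               # 3, not 4: rank deficient
print(np.linalg.cond(X_bad.T @ X_bad))            # effectively infinite: no usable inverse

# Fix: K - 1 dummies plus a reference level restores full column rank
X_ok = np.column_stack([np.ones(n), dummies[:, 1:]])
print(np.linalg.matrix_rank(X_ok))                # 3 = full column rank
```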
Scope vs ISLP
- In scope: the design-matrix structure, the normal-equations derivation, the definition of the hat matrix, leverage as $h_i = H_{ii}$, and the LOOCV shortcut formula.
- Look up in ISLP: §3.2.1 (estimating coefficients); the matrix formulation is mostly in equation form, light on derivation. §3.3.3 covers leverage briefly (pp. 97–98). §5.1.2 covers LOOCV with the shortcut formula.
- Skip in ISLP (book-only / prof excluded): Moore–Penrose pseudoinverse details (L08-classif-2: “explicitly bracketed off”). Spectral / eigen-decomposition theory of these matrices (L04-statlearn-3: deferred to Linear Statistical Models).
Exercise instances
- Exercise6.1a: derive $\hat\beta = (X^\top X)^{-1} X^\top y$ from RSS. (Listed under module 6 because it’s the LS estimator derivation; conceptually it’s module 3 material, and per L12-modelsel-1 the prof considered this an already-done module-3 exercise.)
How it might appear on the exam
- Derive the normal equations. Take RSS, expand, differentiate, solve. The 6-line derivation in L06-linreg-2 is the template (a compact worked version follows this list).
- Verify the simple-LR leverage formula. $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2}$ is the exercise the prof flagged in L08.
- Why does the LOOCV shortcut work for OLS? Because $H$ is independent of $y$, you can compute each $h_i$ from the full-data fit, then “remove” point $i$’s effect via the $1/(1 - h_i)$ correction.
- Read leverage off a plot. Given a residuals-vs-leverage plot (residual-diagnostics), identify the dangerous corner: high $h_i$ AND large residual.
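For reference, a compact version of the normal-equations derivation (standard matrix calculus; the lecture’s own six lines may be laid out differently):

$$
\begin{aligned}
\mathrm{RSS}(\beta) &= (y - X\beta)^\top (y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta,\\
\frac{\partial\,\mathrm{RSS}}{\partial \beta} &= -2 X^\top y + 2 X^\top X \beta \overset{!}{=} 0
\;\Rightarrow\; X^\top X \hat\beta = X^\top y
\;\Rightarrow\; \hat\beta = (X^\top X)^{-1} X^\top y \quad (\text{if } X^\top X \text{ is invertible}).
\end{aligned}
$$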
Related
- linear-regression: the model whose algebra this is
- least-squares-and-mle: the derivation of $\hat\beta$
- sampling-distribution-of-beta: uses $(X^\top X)^{-1}$ in the covariance
- residual-diagnostics: leverage and the leverage-vs-residual plot
- collinearity: what happens when $X^\top X$ is near-singular
- leave-one-out-cv: the shortcut lives here