Sampling distribution of β̂

Under the classical Gaussian linear model, β̂ is exactly multivariate normal, centered on the true β (unbiased), with covariance σ²(XᵀX)⁻¹. This is the source of all subsequent inference: t-tests, F-tests, CIs, PIs.
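A quick way to see the claim is to simulate it. A minimal numpy sketch (not from the lecture; the true β, σ, and design below are made up): fix X, redraw the noise many times, refit, and compare the empirical mean and covariance of β̂ to β and σ²(XᵀX)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0
beta = np.array([1.0, 3.0])                               # true (intercept, slope); assumed
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # fixed design

fits = []
for _ in range(20_000):
    y = X @ beta + rng.normal(0, sigma, n)        # fresh Gaussian noise each replicate
    fits.append(np.linalg.lstsq(X, y, rcond=None)[0])
fits = np.array(fits)

print("empirical mean:", fits.mean(axis=0))       # ≈ beta (unbiased)
print("empirical cov:\n", np.cov(fits.T))         # ≈ sigma^2 (X^T X)^{-1}
print("theory:\n", sigma**2 * np.linalg.inv(X.T @ X))
```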

Definition (prof’s framing)

“If you have a billion parameters, what’s the uncertainty of them, and they’re all working against each other? It becomes very confusing. But in this case, you can do it very well.” - L05-linreg-1

Multiple regression result (proved in Exercise 3.2a, derived in L06-linreg-2): β̂ ~ N(β, σ²(XᵀX)⁻¹).

Simple regression special case: each component β̂ⱼ is univariate Gaussian, centered on the true value βⱼ, with variance σ²[(XᵀX)⁻¹]ⱼⱼ read off the diagonal of the covariance matrix.

“That’s what we want. If it was biased then we’d be upset because then our model is not going to give us the right shit.” - L06-linreg-2

Notation & setup

  • X: the n×(p+1) design matrix; see design-matrix-and-hat-matrix.
  • True β unknown; we estimate β̂ = (XᵀX)⁻¹Xᵀy.
  • σ² unknown; estimated by σ̂² = RSS/(n−p−1).
  • SE(β̂ⱼ) = σ̂ √[(XᵀX)⁻¹]ⱼⱼ, where [(XᵀX)⁻¹]ⱼⱼ is the j-th diagonal of (XᵀX)⁻¹ (single-fit recipe sketched after this list).
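The quantities above can all be computed directly from the matrix formulas in one fit. A minimal sketch on made-up data (nothing here is from the course):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = rng.uniform(0, 5, n)
y = 2.0 + 0.7 * x + rng.normal(0, 1.0, n)   # hypothetical data

X = np.column_stack([np.ones(n), x])        # design matrix with intercept column
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
p = X.shape[1] - 1                          # number of slopes (excluding intercept)
sigma2_hat = resid @ resid / (n - p - 1)    # RSS / (n - p - 1)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv)) # SE(beta_hat_j) from the diagonal
print(beta_hat, np.sqrt(sigma2_hat), se)
```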

Formula(s) to know cold

Multiple regression: β̂ ~ N(β, σ²(XᵀX)⁻¹)

Per-coefficient variance: Var(β̂ⱼ) = σ²[(XᵀX)⁻¹]ⱼⱼ, so SE(β̂ⱼ) = σ√[(XᵀX)⁻¹]ⱼⱼ.

Simple regression closed forms (the only ones easy to write without matrix inversion):

SE(β̂₀)² = σ²[1/n + x̄²/Σᵢ(xᵢ − x̄)²],   SE(β̂₁)² = σ²/Σᵢ(xᵢ − x̄)²

Residual standard error (estimator of σ): RSE = σ̂ = √(RSS/df), where RSS = Σᵢ(yᵢ − ŷᵢ)².

For simple regression, df = n − 2 in the denominator (two df eaten by β̂₀ and β̂₁). With multiple regression, df = n − p − 1.
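The closed forms are just the diagonal of σ²(XᵀX)⁻¹ worked out by hand. A quick numeric sanity check (my own sketch, arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 30, 1.5
x = rng.normal(3, 2, n)
X = np.column_stack([np.ones(n), x])

cov = sigma**2 * np.linalg.inv(X.T @ X)           # matrix route
Sxx = np.sum((x - x.mean())**2)
var_b1 = sigma**2 / Sxx                           # slope closed form
var_b0 = sigma**2 * (1/n + x.mean()**2 / Sxx)     # intercept closed form
print(np.allclose(cov[1, 1], var_b1), np.allclose(cov[0, 0], var_b0))  # True True
```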

Insights & mental models

Derivation in three lines

The proof (Exercise 3.2a, sketched in L06-linreg-2):

Write y = Xβ + ε with ε ~ N(0, σ²Iₙ). Use β̂ = (XᵀX)⁻¹Xᵀy. Then by the linear-transformation property of the multivariate normal:

  • E[β̂] = (XᵀX)⁻¹Xᵀ E[y] = (XᵀX)⁻¹XᵀXβ = β.
  • Cov(β̂) = (XᵀX)⁻¹Xᵀ (σ²Iₙ) X (XᵀX)⁻¹ = σ²(XᵀX)⁻¹.
  • Linear function of a multivariate normal → multivariate normal.

Conclusion: β̂ ~ N(β, σ²(XᵀX)⁻¹). ∎
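The covariance line is the step people fumble; it is just Cov(Ay) = A Cov(y) Aᵀ with A = (XᵀX)⁻¹Xᵀ. A numeric check of that identity (a sketch, random X assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 15, 2.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])

A = np.linalg.inv(X.T @ X) @ X.T                 # beta_hat = A y
lhs = A @ (sigma2 * np.eye(n)) @ A.T             # A Cov(y) A^T
rhs = sigma2 * np.linalg.inv(X.T @ X)            # claimed covariance
print(np.allclose(lhs, rhs))                     # True: the middle line of the proof
```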

Experiment design from the variance formula

The simple-regression form

Var(β̂₁) = σ²/Σᵢ(xᵢ − x̄)²

tells you how to design experiments. We can’t shrink σ² (it’s a property of the noise), but we can:

  • Increase n: more samples.
  • Spread the xᵢ wider: sample further apart in x. (Both effects are quantified in the sketch below.)
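Both knobs act through Σᵢ(xᵢ − x̄)² in the denominator. A minimal sketch (toy designs, not course data):

```python
import numpy as np

sigma = 1.0
def se_slope(x):
    """SE(beta1_hat) = sigma / sqrt(sum of squared deviations of x)."""
    return sigma / np.sqrt(np.sum((x - x.mean())**2))

print(se_slope(np.linspace(0, 1, 20)))   # baseline
print(se_slope(np.linspace(0, 1, 40)))   # more samples -> smaller SE
print(se_slope(np.linspace(0, 2, 20)))   # wider spread -> smaller still
```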

“It is kind of weird to think that you can look at these equations and then from that gain an intuition of how you can do your experiment better. But you do.” - L05-linreg-1

Significance is just sample size

“If n is infinity… your standard [error] is going to be small as shit, which means it’s going to look significant even if it isn’t.” - L05-linreg-1

The variance shrinks like 1/n, so any non-zero effect eventually becomes statistically significant for big enough n. See t-test-and-significance.
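To see this concretely: with a fixed tiny true slope, the t-statistic grows roughly like √n. A sketch under assumed toy values (β₁ = 0.05, unit noise):

```python
import numpy as np

rng = np.random.default_rng(3)
beta1 = 0.05                              # small but non-zero effect (assumed)
for n in [100, 10_000, 1_000_000]:
    x = rng.normal(0, 1, n)
    y = beta1 * x + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    se = np.sqrt(resid @ resid / (n - 2) * np.linalg.inv(X.T @ X)[1, 1])
    print(n, b[1] / se)                   # t-statistic grows ~ sqrt(n)
```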

Why (XᵀX)⁻¹ matters: collinearity

The variance has the constant σ² baked in, plus the data-dependent factor (XᵀX)⁻¹. When two predictors are nearly the same, XᵀX is near-singular, its inverse blows up, the diagonal entries grow without bound, and the variances explode. See collinearity.

“This factor X transpose X comes into play in particular when two variables are basically the same, because then they can trade off each other and then this variance explodes.” - L06-linreg-2
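The blow-up is easy to provoke. A sketch (my own toy example): make x2 an increasingly exact copy of x1 and watch the slope entries on the diagonal of (XᵀX)⁻¹ explode.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(0, 1, n)
for eps in [1.0, 0.1, 0.01, 0.001]:
    x2 = x1 + eps * rng.normal(0, 1, n)          # nearly the same predictor
    X = np.column_stack([np.ones(n), x1, x2])
    d = np.diag(np.linalg.inv(X.T @ X))
    print(eps, d[1], d[2])                       # slope variances blow up as eps -> 0
```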

Estimated SE vs true SE

Strictly, the SE you can compute uses σ̂ in place of the unknown σ. So tests use the t distribution (a heavier-tailed relative of the Gaussian) with n − p − 1 df, not the standard normal. For large n the two are indistinguishable (quantiles compared in the sketch below).
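The t vs normal gap is easy to eyeball from the 97.5% quantiles (sketch; assumes scipy is available):

```python
from scipy import stats

for df in [3, 10, 30, 100]:
    print(df, stats.t.ppf(0.975, df))   # heavier tails -> wider than 1.96 at low df
print("z:", stats.norm.ppf(0.975))      # 1.959..., the n -> infinity limit
```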

Exam signals

“A lot of the reasons we ask those questions is so we can make tests on them.” - L06-linreg-2

“Result (proving this is problem 2 of the recommended exercises)” - L06-linreg-2

“I think this is really why statisticians love these distributions, because you can read out what’s going to happen when you look at them.” - L06-linreg-2

Pitfalls

  • Wrong df. Simple LR uses n − 2; multiple LR uses n − p − 1 (where p is the number of slopes, not counting the intercept). Conventions differ across books; say which you’re using.
  • Estimated vs known σ. With unknown σ, use t with n − p − 1 df, not the standard normal. The hat in σ̂ is implicit in how R reports SE.
  • The multivariate covariance has off-diagonal entries. β̂₀ and β̂₁ are not generally independent in simple LR. Their covariance becomes zero only if x̄ = 0 (demonstrated in the sketch after this list).
  • Bias is a function of the model, not the estimator. β̂ is unbiased for the true β in the assumed model. If the true model is non-linear, the LS slope is unbiased for the best linear approximation, not for the curve.
  • Inflation under collinearity. A coefficient estimate may be near zero with a huge SE and look “insignificant”, yet the joint F-test over the correlated set may still be highly significant. See t-test-and-significance and f-test.
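For the off-diagonal pitfall above, a minimal sketch: the (0,1) entry of (XᵀX)⁻¹ is nonzero for raw x with x̄ ≠ 0 and vanishes once x is centered.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(2, 6, 50)                 # xbar well away from 0
for xs in [x, x - x.mean()]:              # raw vs centered predictor
    X = np.column_stack([np.ones_like(xs), xs])
    print(np.linalg.inv(X.T @ X)[0, 1])   # ∝ cov(b0_hat, b1_hat); ~0 when centered
```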

Scope vs ISLP

  • In scope: the multivariate normal sampling distribution, derivation of mean and covariance, the simple-regression SE formulas, residual standard error.
  • Look up in ISLP: §3.1.2 (pp. 63–66, simple LR SE), §3.2.1 (matrix-form result, lighter derivation).
  • Skip in ISLP: specifics of the t- and F-distributions are referenced but not derived; ISLP is light here. Walpole is the prof’s recommended classical reference for the distribution of β̂.

Exercise instances

  • Exercise 3.2a: full derivation: show β̂ ~ N(β, σ²(XᵀX)⁻¹); what assumptions are needed; what does this imply for each β̂ⱼ; how to compute SE(β̂ⱼ).

How it might appear on the exam

  • Write the distribution of β̂ (and the assumptions under which it holds): could be a true/false or short-derivation question.
  • Derive the per-coefficient variance in simple LR. Standard “show your work” question; hand-derive Var(β̂₁) = σ²/Σᵢ(xᵢ − x̄)².
  • What happens to the SE when…? Add more data (down, like 1/√n); spread x wider (down); two predictors become highly correlated (up; collinearity).
  • Read SE from regression output. 2025 Q6a-style: given the table, what is the estimate, what is the 95% CI? Use β̂ⱼ ± t·SE(β̂ⱼ) with the 97.5% t-quantile, ≈ β̂ⱼ ± 2·SE(β̂ⱼ) (computed in the sketch below).
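The CI computation in numbers (table values below are hypothetical, not from the 2025 exam):

```python
from scipy import stats

est, se, df = 0.42, 0.10, 97            # hypothetical regression-table values
t_crit = stats.t.ppf(0.975, df)         # ≈ 1.98, hence the "± 2 SE" rule of thumb
print(est - t_crit * se, est + t_crit * se)
```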