Module 03: Linear Regression — Book delta
Module 03 is the heaviest-delta module of the course. ISLP ch. 3 covers simple-LR algebra in detail (eq. 3.4), the SE / CI / t-test machinery for the simple case (eq. 3.7–3.10), the $F$-statistic formula (eq. 3.23–3.24), $R^2$ (eq. 3.17) and adjusted $R^2$ (in passing), categorical encoding, interactions, polynomial regression, the “potential problems” list, and the simple-LR leverage formula (eq. 3.37). But the matrix-form theory that Benjamin built in L06 is largely absent: the book explicitly says of multiple regression “the coefficient estimates have somewhat complicated forms that are most easily represented using matrix algebra. For this reason, we do not provide them here” (§3.2.1). Everything downstream of that statement — the closed-form derivation of $\hat\beta$, the hat matrix $H$ and its properties, the multivariate-normal sampling distribution of $\hat\beta$, the residual covariance $\mathrm{Cov}(e) = \sigma^2(I - H)$, the MLE-equals-LS proof, the matrix-form leverage via $H$, the matrix CI / PI formulas — is delta and is reproduced here in full.
Out-of-scope material per docs/scope.md (F-test mechanics beyond stating the null, VIF, Moore–Penrose details, formal normality tests, spectral theory of $H$) is excluded.
1. The matrix-form linear model and design matrix
[L06, linear-regression, design-matrix-and-hat-matrix]
ISLP §3.2 presents the multiple-LR model in scalar form (eq. 3.19) but never writes the matrix form or defines the design matrix as an object. Benjamin’s matrix-form setup is the foundation for everything that follows.
Model

$$Y = X\beta + \varepsilon$$
Dimensions:
- $Y \in \mathbb{R}^n$ — response vector.
- $X \in \mathbb{R}^{n \times (p+1)}$ — design matrix (Benjamin’s grudging name: “never understood why. It’s not really a design of any kind. But it’s what people call it” L06).
- $\beta \in \mathbb{R}^{p+1}$ — parameter vector, intercept plus $p$ slopes.
- $\varepsilon \in \mathbb{R}^n$ — error vector.
Explicit design matrix

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$$
The leading column of ones absorbs the intercept so the bias term disappears from the matrix equation. “Behind this beta is actually an X. It’s just all the values of X are one. So you don’t need to write it” L06.
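A minimal numpy sketch of the same construction, on made-up data (the sizes and coefficients are arbitrary): prepending the column of ones makes the intercept just another column’s coefficient, so the whole model is one matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                      # n observations, p slopes (hypothetical sizes)
X_raw = rng.normal(size=(n, p))   # predictor values only

# Prepend the column of ones: beta_0 becomes the coefficient of that column.
X = np.column_stack([np.ones(n), X_raw])      # shape (n, p+1)

beta = np.array([2.0, 0.5, -1.0, 3.0])        # [beta_0, beta_1, ..., beta_p]
eps = rng.normal(scale=0.3, size=n)           # spherical Gaussian noise
y = X @ beta + eps                            # the matrix-form model Y = X beta + eps
print(X.shape, y.shape)                       # (50, 4) (50,)
```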
Multivariate-normal form of the assumptions
ISLP §3.1.2 gives the scalar assumptions ($\varepsilon_i$ uncorrelated, common $\sigma^2$). Benjamin restates them as one $n$-dimensional multivariate-normal statement:

$$\varepsilon \sim \mathcal{N}\!\left(0,\ \sigma^2 I_n\right)$$
The covariance matrix $\sigma^2 I_n$ has $\sigma^2$ on the diagonal, zeros off — so the off-diagonals encode independence (assumption 5 in L05’s list), the diagonal equality encodes homoscedasticity (assumption 3). Geometrically the error vector is a spherical $n$-dimensional Gaussian — “no matter which direction you look in, $\varepsilon$ has the same variance, and it’s just kind of a big n-dimensional bell” L06.
Notation gotcha (flagged by the prof)
“You can define $p$ as including or not the intercept or bias term. This is just a note for those who are taking both classes that the notation is different in the books” L06. In this delta file, $p$ = number of slopes, so $X$ has $p+1$ columns and df-for-noise is $n - p - 1$. ISLP uses the same convention (eq. 3.25).
Classical regime
“$n > p$: more data points than parameters.” L06 When $n < p + 1$, $X^\top X$ is necessarily singular (rank at most $n$) and OLS has no unique solution. This sets up module 6.
2. The OLS derivation in matrix form
ISLP states the simple-LR closed form (eq. 3.4) and explicitly declines to derive the multiple-LR matrix form (§3.2.1). Benjamin did it on the board in three lines and flagged it as exam-template material via Exercise 6.1a / L12. Here is the full derivation.
Step 1 — write RSS in matrix form

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2 = (Y - X\beta)^\top (Y - X\beta)$$
Step 2 — expand

$$\mathrm{RSS}(\beta) = Y^\top Y - Y^\top X\beta - \beta^\top X^\top Y + \beta^\top X^\top X\beta = Y^\top Y - 2\,\beta^\top X^\top Y + \beta^\top X^\top X\beta$$

The two cross terms $Y^\top X\beta$ and $\beta^\top X^\top Y$ combine because each is a scalar and a scalar equals its transpose.
Step 3 — differentiate w.r.t. $\beta$, set to zero

Using $\frac{\partial}{\partial\beta}\,(a^\top\beta) = a$ and $\frac{\partial}{\partial\beta}\,(\beta^\top A\beta) = 2A\beta$ for symmetric $A$:

$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2\,X^\top Y + 2\,X^\top X\beta = 0$$
Step 4 — normal equations

$$X^\top X\,\beta = X^\top Y$$
Step 5 — invert when full rank
If $X^\top X$ is invertible (i.e. $X$ has full column rank $p+1$, which requires $n \ge p+1$ and no collinearity):

$$\hat\beta = (X^\top X)^{-1} X^\top Y$$
Uniqueness
“I’m not showing, but you can prove if you want to take more derivatives, that this problem has a unique solution. There’s only one solution” L06. The second derivative (Hessian) is $2\,X^\top X$, positive definite when $X$ has full column rank — so the stationary point is a strict minimum, and it is the unique one.
Why this matters at the meta-level
“Most of the time, you actually have to go iteratively… most of the time when we’re trying to find this peak, we have to like climb up and go there. In fact, I would argue most of machine learning is finding good tricks to get to that peak… But in the case of linear regression with full-rank X, just right to the top. Very convenient” L06. This is why OLS is the canonical model for everything downstream (CIs, t-tests, exact distributions): we have the estimator in closed form, so every other quantity is also exact.
Reduction to simple-LR
In the $p = 1$ case,

$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},$$

and direct calculation of $(X^\top X)^{-1} X^\top Y$ reproduces ISLP eq. 3.4:

$$\hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x.$$
This consistency check is the recommended exercise L06.
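A quick numerical version of that consistency check on synthetic data (the coefficients and noise level are arbitrary): the matrix-form $(X^\top X)^{-1}X^\top Y$ and the ISLP eq. 3.4 formulas agree to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.5 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Matrix form: beta_hat = (X^T X)^{-1} X^T y  (solve() is preferred to an explicit inverse)
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ISLP eq. 3.4 simple-LR formulas
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(beta_hat)        # matrix-form estimate [beta_0_hat, beta_1_hat]
print(b0, b1)          # identical up to floating-point error
assert np.allclose(beta_hat, [b0, b1])
```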
3. The MLE ⇔ least-squares equivalence
[L05, L06, L27, least-squares-and-mle]
ISLP never proves this. It is the prof’s flagged theory-question template for the exam: “I do generally like to keep one theory question. … assume an additive Gaussian error model … Show that maximum likelihood and least squares are equivalent in $\hat\beta$” L27. Reproduced here in full.
Setup
Assume $y_i = x_i^\top\beta + \varepsilon_i$ with $\varepsilon_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2)$, so $y_i \mid x_i \sim \mathcal{N}(x_i^\top\beta,\ \sigma^2)$.
Likelihood
The joint density of $y_1, \dots, y_n$ given $X$, $\beta$, $\sigma^2$:

$$L(\beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right)$$
Log-likelihood

$$\ell(\beta, \sigma^2) = \log L(\beta, \sigma^2) = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2$$
The argument
The first term does not depend on $\beta$. The factor $\frac{1}{2\sigma^2}$ in front of the sum is a positive constant that doesn’t change the location of the maximum. So

$$\arg\max_\beta\ \ell(\beta, \sigma^2) = \arg\min_\beta\ \sum_{i=1}^n \left(y_i - x_i^\top\beta\right)^2 = \arg\min_\beta\ \mathrm{RSS}(\beta).$$

So $\hat\beta_{\mathrm{MLE}} = \hat\beta_{\mathrm{LS}} = (X^\top X)^{-1} X^\top Y$.
MLE for $\sigma^2$ (as a side-effect)
Differentiating the log-likelihood w.r.t. $\sigma^2$ and solving gives $\hat\sigma^2_{\mathrm{MLE}} = \mathrm{RSS}/n$. The unbiased estimator (which is what is used everywhere else in the course, including residual standard error) is $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$, with the $p+1$ accounting for the parameters consumed by $\hat\beta$. Benjamin says of the difference: “if [$n$ is large] it barely matters” L05.
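A sketch of the equivalence on synthetic data, assuming scipy is available: numerically minimizing the Gaussian negative log-likelihood over $\beta$ lands on the closed-form least-squares solution, and the MLE and unbiased estimates of $\sigma^2$ differ only by the divisor.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.7, size=n)

# Closed-form least squares
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Negative log-likelihood over beta under the Gaussian error model:
# up to additive constants, -loglik = RSS / (2 sigma^2), and the 1/(2 sigma^2)
# factor does not move the argmax over beta, so it is dropped here.
def neg_log_lik(beta):
    resid = y - X @ beta
    return 0.5 * np.sum(resid ** 2)

beta_mle = minimize(neg_log_lik, x0=np.zeros(p + 1)).x
print(np.allclose(beta_ls, beta_mle, atol=1e-4))   # True: MLE == least squares

# MLE (divide by n) vs unbiased (divide by n - p - 1) estimate of sigma^2
rss = np.sum((y - X @ beta_ls) ** 2)
print(rss / n, rss / (n - p - 1))                  # close when n >> p
```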
What other loss → what other distribution
A useful by-product the prof drew on the board: a different choice of penalty implies a different error distribution.
| Loss | MLE-equivalent error distribution | Property |
|---|---|---|
| Squared error ($\ell_2$) | Gaussian | quadratic cost; outlier-sensitive |
| Absolute error ($\ell_1$) | Laplace (double-exponential) | linear cost; robust to outliers |
| Higher powers | exotic, never used | “would really, really penalize anything far away” L05 |
“If we had our data was like this and then there was a point here, that point would have a stronger effect when fitting the model with a least squares fit, whereas a Laplace fit it wouldn’t be pulling it as strongly” L05.
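A rough illustration of that board example, with invented data and one injected outlier: the squared-error (Gaussian-MLE) fit is noticeably pulled toward the outlier, while the absolute-error (Laplace-MLE) fit is not. The optimizer choice and data here are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 40
x = np.linspace(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
y[-1] += 30.0                        # one gross outlier

X = np.column_stack([np.ones(n), x])

def fit(loss):
    # Minimize the total penalty over (beta_0, beta_1); Nelder-Mead handles the
    # non-smooth absolute-value objective as well as the smooth squared one.
    return minimize(lambda b: np.sum(loss(y - X @ b)),
                    x0=np.zeros(2), method="Nelder-Mead").x

beta_l2 = fit(lambda r: r ** 2)      # Gaussian MLE / least squares
beta_l1 = fit(lambda r: np.abs(r))   # Laplace MLE / least absolute deviations

print(beta_l2)   # dragged toward the outlier
print(beta_l1)   # much closer to the true (1, 2)
```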
Pitfall the prof himself stumbled on
“A student caught the prof on a sign during the L27 walkthrough” least-squares-and-mle. On the exam, state explicitly that maximizing $\ell$ is the same as minimizing $-\ell$, which is the same as minimizing RSS up to constants. The sign-flip is the standard place to lose a point.
4. The hat matrix
[L06, L08, design-matrix-and-hat-matrix]
ISLP §3.3.3 mentions $h_i$ (the diagonal element only) in eq. 3.37 for simple LR and notes “there is a simple extension of $h_i$ to the case of multiple predictors, though we do not provide the formula here.” Everything else about $H$ is delta. This is the matrix the prof said “has all the shit you need to get your hats for your parameters. So it’s called the hat matrix” L06.
Definition

$$H = X (X^\top X)^{-1} X^\top$$
Why “hat matrix”
Predictions $\hat y$ are obtained from $y$ by applying $H$:

$$\hat y = X\hat\beta = X (X^\top X)^{-1} X^\top y = Hy$$
“In math we call it a hat. It’s a pointy hat. But it’s a hat. And so this matrix H has all the shit you need to get your hats for your parameters” L06.
Properties (provable in two lines each)
These are the structural facts Benjamin emphasizes and that ISLP never lists.
(P1) Symmetric.
$H^\top = \big(X (X^\top X)^{-1} X^\top\big)^\top = X \big((X^\top X)^{-1}\big)^\top X^\top = X (X^\top X)^{-1} X^\top = H$, using that $X^\top X$ is symmetric, so its inverse is symmetric.
(P2) Idempotent. $H^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = H$.
(P3) Orthogonal projection. $Hy$ is the orthogonal projection of $y$ onto the column space of $X$. Combined: a symmetric idempotent matrix is exactly an orthogonal projector. Geometrically: $\hat y = Hy$ is the closest point to $y$ in the column space of $X$ (which is exactly what least squares means).
(P4) Residual projector. $I - H$ is also symmetric and idempotent, and projects onto the orthogonal complement of the column space (the “residual space”):

$$e = y - \hat y = (I - H)\,y$$
(P5) Orthogonality of fitted values and residuals.

$$\hat y^\top e = (Hy)^\top (I - H)\,y = y^\top \big(H - H^2\big)\,y = 0$$

The fitted values and the residuals are orthogonal vectors in $\mathbb{R}^n$.
(P6) Trace = rank = $p + 1$.

$$\mathrm{tr}(H) = \mathrm{tr}\!\big(X (X^\top X)^{-1} X^\top\big) = \mathrm{tr}\!\big((X^\top X)^{-1} X^\top X\big) = \mathrm{tr}\!\left(I_{p+1}\right) = p + 1$$

using the cyclic property of trace. Equivalently $\sum_{i=1}^n h_i = p + 1$. Average leverage is $(p+1)/n$ (ISLP §3.3.3 mentions this fact without deriving it).
(P7) Each diagonal entry satisfies $1/n \le h_i \le 1$. Lower bound follows from including the intercept column; upper bound from idempotency.
(P8) Leverage in multiple LR (the formula ISLP declined to give).

$$h_i = x_i^\top (X^\top X)^{-1} x_i$$

where $x_i$ is the $i$-th row of $X$ written as a column vector (including the leading 1).
(P9) Leverage depends only on $X$, not on $y$. So a high-leverage point can be flagged from the design alone, before any response data is observed.
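The structural properties above are easy to sanity-check numerically on a random design matrix (a minimal sketch; the sizes are arbitrary, and the explicit inverse is fine at this scale):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
h = np.diag(H)                           # leverages h_i

print(np.allclose(H, H.T))               # P1: symmetric
print(np.allclose(H @ H, H))             # P2: idempotent
print(np.isclose(np.trace(H), p + 1))    # P6: trace = p + 1
print(np.all(h >= 1 / n - 1e-12), np.all(h <= 1 + 1e-12))   # P7: 1/n <= h_i <= 1

y = rng.normal(size=n)
y_hat, e = H @ y, (np.eye(n) - H) @ y
print(np.isclose(y_hat @ e, 0.0))        # P5: fitted values orthogonal to residuals
```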
Residual covariance
The fact ISLP never states. Starting from $e = (I - H)\,y$ and $\mathrm{Cov}(y) = \sigma^2 I$:

$$\mathrm{Cov}(e) = (I - H)\,\sigma^2 I\,(I - H)^\top = \sigma^2 (I - H)$$
Implications:
- $\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$ — raw residuals have unequal variances. High-leverage points have smaller residual variance (they pull the fit toward themselves, so their residual is small).
- $\mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij} \neq 0$ in general — raw residuals are correlated, even when the true errors are independent.
This motivates standardized residuals:

$$r_i = \frac{e_i}{\hat\sigma\sqrt{1 - h_i}},$$
which have approximately unit variance and let the QQ plot / residuals-vs-fitted plot be read with the assumed Gaussian behaviour. “Your betas stay the same. It’s just a way to say, is my model any good?” L08.
Studentized residuals swap in $\hat\sigma_{(i)}$ (the residual SE computed from the data with point $i$ deleted) to remove the circular use of $y_i$ in fitting and evaluating point $i$. For $n \gg p$, standardized and studentized residuals are essentially indistinguishable L08.
Leverage in simple LR (the prof flagged this as the exercise question)
Direct algebra on $h_i = x_i^\top (X^\top X)^{-1} x_i$ for $X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$ gives

$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2}.$$

ISLP states this (eq. 3.37) but the derivation is delta. The point: $h_i$ grows with distance from $\bar x$ — extreme $x$-values are high-leverage.
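A two-line numerical check of eq. 3.37 against the diagonal of $H$, on arbitrary synthetic $x$ values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h_from_H = np.diag(H)

# ISLP eq. 3.37: h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

print(np.allclose(h_from_H, h_formula))   # True
```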
LOOCV shortcut for OLS
For OLS fits only, leave-one-out cross-validation can be computed from one full-data fit using the hat matrix:

$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - h_i}\right)^2$$
ISLP §5.1.2 states this for OLS without much justification. The reason it works is exactly P9 — leverage depends only on $X$, not on $y$ — so leaving out the $i$-th response leaves $h_i$ unchanged and the leave-one-out fitted value can be recovered analytically. Owned by leave-one-out-cv in module 5.
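A sketch verifying the shortcut on synthetic data: one hat-matrix pass gives the same LOOCV error as $n$ explicit refits (for OLS only).

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.8, size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
y_hat = H @ y

# Hat-matrix shortcut: one full-data fit, residuals rescaled by (1 - h_i)
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute-force leave-one-out: n separate fits
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.solve(X[mask].T @ X[mask], X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)

print(np.isclose(cv_shortcut, cv_brute))   # True
```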
5. Sampling distribution of $\hat\beta$
[L05, L06, sampling-distribution-of-beta]
ISLP gives diagonal SE formulas for the simple-LR case in eq. 3.8 and waves vaguely at multiple LR. The clean multivariate theorem and its derivation are delta and are the load-bearing fact for all of regression inference.
Theorem
Under the Gaussian linear model with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$:

$$\hat\beta \sim \mathcal{N}\!\big(\beta,\ \sigma^2 (X^\top X)^{-1}\big)$$
Derivation (three lines)
Write $A = (X^\top X)^{-1} X^\top$, so $\hat\beta = AY$. Use $Y \sim \mathcal{N}(X\beta,\ \sigma^2 I)$ and the linear-transformation property of the multivariate normal: $\hat\beta = AY$ is multivariate normal with

$$\mathbb{E}[\hat\beta] = A X\beta = \beta, \qquad \mathrm{Cov}(\hat\beta) = A\,(\sigma^2 I)\,A^\top = \sigma^2 (X^\top X)^{-1}.$$
Consequences
(C1) Unbiasedness. $\mathbb{E}[\hat\beta] = \beta$. “That’s what we want. If it was biased then we’d be upset because then our model is not going to give us the right shit” L06.
(C2) Per-coefficient variance. $\mathrm{Var}(\hat\beta_j) = \sigma^2 \big[(X^\top X)^{-1}\big]_{jj}$ — the $j$-th diagonal of $\sigma^2 (X^\top X)^{-1}$. The SE estimator is $\widehat{\mathrm{SE}}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^\top X)^{-1}\big]_{jj}}$.
(C3) Coefficients are correlated. The off-diagonals of $\sigma^2 (X^\top X)^{-1}$ are generally nonzero. In simple LR, $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\sigma^2 \bar x \big/ \sum_i (x_i - \bar x)^2$; zero iff $\bar x = 0$ (center the data and the intercept becomes uncorrelated from the slope). ISLP nowhere states this.
(C4) Collinearity blow-up. As columns of $X$ become near-linearly-dependent, $X^\top X$ becomes near-singular, its inverse’s entries blow up, and individual $\mathrm{SE}(\hat\beta_j)$ explode. The prof’s load-bearing observation: “This factor X transpose X comes into play in particular when two variables are basically the same, because then they can trade off each other and then this variance explodes” L06. ISLP discusses collinearity qualitatively in §3.3.3 but never connects it to the matrix algebra explicitly.
(C5) Why centering helps numerics. Centering $x$ around its mean kills the $\bar x$ term in the off-diagonal of $(X^\top X)^{-1}$ for simple LR (and reduces correlations between intercept and slopes in MLR), giving a better-conditioned inversion.
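A small demonstration of (C4), using a second predictor built as an increasingly exact copy of the first (the noise levels and $\sigma$ are arbitrary): as the two columns approach linear dependence, the slope SEs read off $\sigma^2 (X^\top X)^{-1}$ blow up.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
sigma = 1.0

for noise in [1.0, 0.1, 0.01]:           # x2 becomes progressively closer to x1
    x2 = x1 + noise * rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    cov_beta = sigma ** 2 * np.linalg.inv(X.T @ X)   # sigma^2 (X^T X)^{-1}
    se = np.sqrt(np.diag(cov_beta))
    print(f"noise={noise:5.2f}  SE(beta_1)={se[1]:8.3f}  SE(beta_2)={se[2]:8.3f}")
# The SEs on the two near-collinear slopes explode as the columns approach
# linear dependence, exactly the "trade off each other" effect.
```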
Residual standard error (matrix-form derivation)
The unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - p - 1} = \frac{e^\top e}{n - p - 1}, \qquad \mathrm{RSE} = \hat\sigma.$$

Unbiasedness: $\mathbb{E}\big[e^\top e\big] = \mathbb{E}\big[\varepsilon^\top (I - H)\,\varepsilon\big] = \sigma^2\,\mathrm{tr}(I - H) = \sigma^2 (n - p - 1)$ using property P6 of $H$. Dividing by $n - p - 1$ gives an unbiased estimator. ISLP states the divisor $n - p - 1$ (eq. 3.25) without this derivation; the $n - p - 1$ is the rank of the residual projector $I - H$.
In simple LR this collapses to $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n - 2)}$ (ISLP eq. 3.15). “Two degrees of freedom are eaten by $\hat\beta_0$ and $\hat\beta_1$” L05. In general, “$p + 1$ degrees of freedom are eaten by the $p + 1$ entries of $\hat\beta$.”
Independence of $\hat\beta$ and $\hat\sigma^2$
A classical result Benjamin invokes implicitly when justifying t-tests with df $n - p - 1$: under the Gaussian linear model, $\hat\beta$ and $\hat\sigma^2$ are independent random variables. Sketch: $\hat\beta$ depends on $Hy$; $\hat\sigma^2$ depends on $(I - H)\,y$. The two random vectors $Hy$ and $(I - H)\,y$ are jointly Gaussian and uncorrelated ($H(I - H) = 0$), hence independent. This independence is what licenses the t-statistic to have an exact $t_{n-p-1}$ distribution under $H_0$, rather than just approximately Gaussian.
Walpole is the prof’s recommended classical reference; the result is needed to get the df exactly right sampling-distribution-of-beta.
6. The $p$-aware t-statistic and matrix-form CI
[L06, t-test-and-significance, confidence-and-prediction-intervals]
ISLP gives the simple-LR t-statistic (eq. 3.14) with $n - 2$ df. The matrix-form version with $n - p - 1$ df and $\widehat{\mathrm{SE}}(\hat\beta_j)$ — and the explicit pointer to which diagonal of $(X^\top X)^{-1}$ — is delta.
Per-coefficient t-test
$$t = \frac{\hat\beta_j - 0}{\widehat{\mathrm{SE}}(\hat\beta_j)} \ \sim\ t_{n-p-1} \quad \text{under } H_0: \beta_j = 0,$$

where $\widehat{\mathrm{SE}}(\hat\beta_j) = \hat\sigma \sqrt{\big[(X^\top X)^{-1}\big]_{jj}}$.
Per-coefficient CI

$$\hat\beta_j \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \widehat{\mathrm{SE}}(\hat\beta_j)$$
CI for the mean response at $x_0$
ISLP §3.2.2 mentions CIs for the mean response verbally but does not give the matrix-form formula. Delta:

$$x_0^\top\hat\beta \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \hat\sigma \sqrt{x_0^\top (X^\top X)^{-1} x_0}$$
Derivation: $x_0^\top\hat\beta$ is a linear function of the multivariate-normal $\hat\beta$, so

$$x_0^\top\hat\beta \ \sim\ \mathcal{N}\!\big(x_0^\top\beta,\ \sigma^2\, x_0^\top (X^\top X)^{-1} x_0\big).$$
PI for a future observation at $x_0$

$$x_0^\top\hat\beta \ \pm\ t_{n-p-1,\,1-\alpha/2}\ \hat\sigma \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}$$
Derivation: a future observation $y_0 = x_0^\top\beta + \varepsilon_0$ has $\varepsilon_0 \sim \mathcal{N}(0, \sigma^2)$ independent of the past data, so

$$\mathrm{Var}\big(y_0 - x_0^\top\hat\beta\big) = \sigma^2 + \sigma^2\, x_0^\top (X^\top X)^{-1} x_0 = \sigma^2\big(1 + x_0^\top (X^\top X)^{-1} x_0\big).$$
The +1 under the square root is the irreducible noise $\sigma^2$ — the source of “PI always wider than CI” L06.
Band shape
Both bands are narrowest where $x_0$ is near the centroid of the data (because $x_0^\top (X^\top X)^{-1} x_0$ is small there) and fan out at the extremes. The CI band hugs the line; the PI band is wider by a constant-σ² floor.
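A sketch of the band shapes on synthetic simple-LR data, assuming scipy is available for the t quantile: CI and PI half-widths are computed at the centroid and at the extremes of the observed $x$ range.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 60
x = rng.uniform(0, 10, size=n)
y = 3.0 + 1.2 * x + rng.normal(scale=1.0, size=n)
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma_hat = np.sqrt(rss / (n - 2))              # p = 1 slope, so df = n - 2
t_crit = stats.t.ppf(0.975, df=n - 2)

for x0_val in [x.mean(), x.min(), x.max()]:     # centroid vs. the extremes
    x0 = np.array([1.0, x0_val])
    quad = x0 @ XtX_inv @ x0                    # x0^T (X^T X)^{-1} x0
    fit = x0 @ beta_hat
    ci_half = t_crit * sigma_hat * np.sqrt(quad)        # CI for the mean response
    pi_half = t_crit * sigma_hat * np.sqrt(1 + quad)    # PI for a new observation
    print(f"x0={x0_val:5.2f}  fit={fit:6.2f}  CI ±{ci_half:5.2f}  PI ±{pi_half:5.2f}")
# Both half-widths are smallest near the centroid; the PI is always wider (the +1 term).
```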
7. The F-statistic and the partial F-statistic (statement only)
Scope flag. Benjamin was emphatic: “I’m going to say right now I probably won’t ask any questions about an F-test. … I’m not going to make you compute it because I honestly don’t care” L06. The mechanics are out of scope; the null hypothesis and the why-you’d-use-it reasoning are in. ISLP §3.2.2 gives both eq. 3.23 and eq. 3.24 in full. No delta on F-test math.
The one structural point that is delta and worth stating: $F = t^2$ for a single-coefficient test (i.e. $F_{1,\,n-p-1} = t_{n-p-1}^2$), so the F-test on a single coefficient is equivalent to the squared t-test f-test. ISLP §3.2.2 mentions this in a footnote (footnote 7) but does not derive it.
8. Notation and naming differences
“Bias” vs “intercept”
ISLP uses “intercept” throughout. Benjamin prefers “bias” for $\beta_0$ and uses both interchangeably. “The bias is $\beta_0$” / “the intercept is $\beta_0$” — same object. L05
$p$ counting
ISLP and Benjamin both use $p$ to mean number of slopes, with the design matrix having $p + 1$ columns and df $n - p - 1$. The convention is consistent across both sources, but Benjamin flagged that some books count the intercept inside $p$. L06
“Residuals are predictions of errors”
Benjamin draws a sharp distinction that ISLP does not:
“The error terms are random variables and cannot be estimated. They can be predicted.” L05
So $\varepsilon_i$ is an unobservable random variable, and the residual $e_i$ is a prediction of $\varepsilon_i$, not an estimate. ISLP uses “estimate” loosely. Stating this distinction may earn marks on a careful T/F.
“Design matrix”
ISLP uses “design matrix” with no commentary. Benjamin keeps the name but mocks it: “It’s often called the design matrix. The data. Never understood why. It’s not really a design of any kind. But it’s what people call it” L06.
Independence as the load-bearing assumption
ISLP §3.3.3 lists “correlation of error terms” as item 2 of six “potential problems” with no ranking. Benjamin ranks the assumptions, with independence (4 and 5 in his list) as the dangerous ones and Gaussian / zero-mean / homoscedastic as relatively benign: “violations [of independence] ruin everything” L05. This is not a formula difference, but it is a framing the exam might test (e.g. “which assumption violation most invalidates the SE estimates?”).
”Main-effects rule” vs “hierarchical principle”
Same thing. ISLP §3.3.2 calls it the hierarchical principle. Benjamin calls it the main-effects rule L06 — verbatim: “whenever you include an interaction, you want to include what is referred to as the main effects.”
“Statistical vs practical significance”
Not in ISLP. Benjamin’s organizing framing for the t-test discussion: large $n$ makes everything statistically significant; the slope size is what tells you whether it actually matters. “Significance is just sample size” L05. This is a framing the exam will likely test via T/F.
Five-item assumption list (vs ISLP’s six-item problem list)
L05’s positive assumption list:
- Normally distributed errors $\varepsilon_i$.
- Mean zero: $\mathbb{E}[\varepsilon_i] = 0$.
- Common variance $\sigma^2$.
- Independent of any other variable.
- Independent of each other.
ISLP §3.3.3 gives the violations:
- Non-linearity.
- Correlation of error terms.
- Non-constant variance.
- Outliers.
- High-leverage points.
- Collinearity.
These are inverse views. Benjamin’s framing is “what you assumed”; ISLP’s is “what can go wrong.” Mapping: (1)→(non-Gaussian residuals shown on QQ), (3)→(non-constant variance), (4)+(5)→(error correlation, e.g. time series tracking).