Linear regression
The prof’s frame: regression is Y given X, not the joint distribution. The simplest parametric model with the deepest theoretical plumbing, closed-form estimator, exact sampling distribution of β̂, and a workshop for every concept that comes later in the course (bias–variance, regularization, basis expansion).
Definition (prof’s framing)
“Today we will discuss Y given X. So not the joint distribution of them, but Y given X. So we’re trying to essentially make a model of it… we will look at how things co-vary, but in the sense of how Y varies as a function of X.” - L05-linreg-1
Simple linear regression: y = β₀ + β₁x + ε
Multiple linear regression (matrix form, the version actually used): y = Xβ + ε
The bias term is hidden in the leading column of ones in X, see design-matrix-and-hat-matrix. Prof prefers “bias” to “intercept” for β₀.
Notation & setup
- n samples, p predictors. X is n × (p+1) (the +1 is the bias column), y and ε are n × 1, and β is (p+1) × 1.
- “Independent / explanatory / regressor” all refer to X; “dependent / outcome / response” all refer to y; prof uses them interchangeably.
- Notation gotcha he flagged: some books count the intercept in p, others don’t.
- Body-fat / BMI is the running worked example. Slope ≈ 1.8 → one unit of BMI ↔ ~1.8 percentage points of body fat.
Formula(s) to know cold
Simple least squares (closed-form, easily memorized): β̂₁ = Σᵢ(xᵢ − x̄)(yᵢ − ȳ) / Σᵢ(xᵢ − x̄)², β̂₀ = ȳ − β̂₁x̄
Multiple regression closed form (the headline equation): β̂ = (XᵀX)⁻¹Xᵀy
Predictions ŷ = Xβ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix.
Residuals eᵢ = yᵢ − ŷᵢ. Prof’s distinction: residuals are predictions of the underlying errors εᵢ, not estimates; error terms are random variables.
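The closed form and the hat-matrix identity can be sanity-checked in a few lines. A minimal numpy sketch on synthetic data (made-up coefficients, not the course’s body-fat example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
# Leading column of ones = the bias column of the design matrix X
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Headline equation: beta_hat = (X^T X)^{-1} X^T y
# (solve the normal equations rather than forming an explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T; predictions y_hat = H y = X beta_hat
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# Residuals are orthogonal to every column of X by construction
residuals = y - y_hat
```

Using `np.linalg.solve` on the normal equations (instead of `np.linalg.inv`) is the standard numerically safer way to evaluate the closed form.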
Insights & mental models
Why study a model this simple
“Is linear regression too simple? Yeah, for some things… but for other things it can be quite useful. And the nice thing is that, like I said, it’s one you can understand.” - L05-linreg-1
Three reasons the prof keeps coming back to it: (i) parametric, you can construct, modify, reason about the parameters; (ii) phenomena from much fancier models (e.g. the second descent past interpolation, see double-descent) can be derived and understood from the regression lens; (iii) interpretable. He invokes “always aim to minimize the length of my method sections.”
He also pointed out that next-word language modeling is structurally not so different, Y is the next token, X is the prior text, just much more complicated.
“Linear” means linear in the parameters
The most-quoted point in the lecture pair, repeated with emphasis:
“It’s still called linear regression even though you’re fitting y = β₀ + β₁x + β₂x². It’s a quadratic function but it’s still a linear model. And it’s linear because it’s linear in the coefficients. Linear in the parameters.” - L06-linreg-2
You can throw in x², √x, log x, dummies, interactions; it’s still linear regression. See polynomial-regression and categorical-encoding-and-interactions.
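A quick numpy illustration of “linear in the parameters”: a quadratic in x is fit by ordinary least squares on the design matrix [1, x, x²], exactly as in the plain linear case (synthetic data, made-up coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
y = 1.0 + 0.5 * x - 1.5 * x**2 + 0.1 * rng.normal(size=x.size)

# Quadratic in x, but linear in the coefficients beta:
# each column is a fixed transform of x, so OLS applies unchanged
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
```

The same pattern covers log x, √x, dummies, and interaction columns: only the columns of X change, never the estimator.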
Closed form is special
Most ML problems can’t be solved in one shot; you iterate up the likelihood. Linear regression with full-rank X has the answer in a single matrix computation. The prof’s framing:
“Most of machine learning is finding good tricks to get to that peak… But in the case of linear regression with full-rank X, just right to the top. Very convenient. Which is also why we’re studying regression even though we want to study something more complicated.” - L06-linreg-2
“All models are wrong, some are useful”
He cites George Box (via David Hand) to organize the rest: two purposes, understand relationships (interpretive) or predict (decisional). “In this course, I think the focus is more about prediction.” - L06-linreg-2
What you can’t ask
“Is this relationship causal, or better explained by something else?” - L05-linreg-1
Regression results are “fancy correlations.” Don’t read causation into the slope.
Exam signals
“I might say in the multivariate linear regression case, how would I test if at least one of the predictors is useful in predicting the response, or I might ask: why would I want to know this, what’s the point? I like to ask kind of these questions… where you have to just sort of reason through why you’re doing something.” - L06-linreg-2
“It’s still called linear regression even though you’re fitting y = β₀ + β₁x + β₂x². … It’s linear in the parameters.” - L06-linreg-2
The four canonical “important questions” in multiple regression are a slide structure he flagged as likely exam material; see “How it might appear” below.
Pitfalls
- Causation. Regression is a fancy correlation. Don’t claim causation from a slope.
- “The eye line.” Drawing the best-looking line through a scatterplot tends toward the major axis (joint-distribution view), not the conditional regression line, which minimizes vertical residuals only.
- Categorical with K > 2 levels coded as 0/1/2. Imposes ordering; use dummies + reference category. See categorical-encoding-and-interactions.
- Confusing residual with error. Errors are random and unobservable; residuals are observed predictions of them.
- n vs p. OLS needs n ≥ p + 1 (full-rank X); p ≥ n → no unique solution; need regularization (module 6).
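The K > 2 coding pitfall from the list above, as a minimal numpy sketch (the level labels and group means are made up; note the means are non-monotone, so a single 0/1/2 column would impose an ordering the data doesn’t have):

```python
import numpy as np

levels = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])  # hypothetical 3-level factor
group_means = {0: 10.0, 1: 12.0, 2: 11.0}       # non-monotone across levels
y = np.array([group_means[k] for k in levels])

# Dummy coding with level 0 as the reference category:
# two separate 0/1 columns, NOT one 0/1/2 column
X = np.column_stack([np.ones_like(y),
                     (levels == 1).astype(float),
                     (levels == 2).astype(float)])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
# beta_hat = [mean(level 0), mean(1) - mean(0), mean(2) - mean(0)]
```

Each dummy coefficient is a free group offset against the reference level, which is exactly what a 0/1/2 column cannot express.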
Scope vs ISLP
- In scope: model statement, both estimators (LS / matrix), Gaussian-error assumptions, sampling distribution of β̂, CI/PI, t/F tests, R², categorical encoding, interactions, polynomial / nonlinear transforms, residual diagnostics, collinearity (qualitative).
- Look up in ISLP: §3.1, §3.2, §3.3 (pp. 59–104), especially §3.2.1 for the matrix derivation and §3.3 for the extensions.
- Skip in ISLP (book-only / prof excluded): F-test mechanics (L06-linreg-2: “won’t ask any questions about an F-test”), VIF formulas (L08-classif-2: self-study), Moore-Penrose pseudoinverse details, formal Shapiro–Wilk normality tests (L08-classif-2).
Exercise instances
- Exercise 3.1c, multiple lm of `mpg` on all `Auto` predictors; interpret summary (significant predictors, weight, year)
- Exercise 3.1d, F-test / ANOVA logic for the 3-level factor `origin`
- Exercise 3.1g, interaction `year × origin`; interpret slope differences
- Exercise 3.1h, try x², √x, log x to fix residual issues
- CE1 problem 2b, decide on a transformation for the worm data so linearity is reasonable
- CE1 problem 2c, fit lm with `MAGENUMF` (continuous, transformed) + `Gattung` (3-level factor); write three group equations
- CE1 problem 2d, test whether the `MAGENUMF × Gattung` interaction matters
How it might appear on the exam
- Output interpretation. Prof said explicitly: he won’t make you write `lm(...)` code; he’ll give you the regression output table (estimate, SE, t-value, p-value) and ask interpretive questions. The 2025 Q6a Boston-housing reformulation is the template - L27-summary.
- Q1 of the four canonical questions (“is at least one predictor useful?”): knowing why you’d use an F-test (without computing it) is fair game.
- Three group equations when a factor predictor enters, just like CE1 2c, write the model out explicitly for each level.
- Interaction interpretation: with an x₁x₂ term in the model, the main-effect coefficient on x₁ is the effect only at x₂ = 0; this is a flagged exam trap (see Q2 of the 2025 walk-through, L27-summary). Always state the interaction-aware interpretation.
- Linear-in-parameters T/F. “Adding x² makes the regression nonlinear”, false; it’s linear in β.
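The “three group equations” pattern in symbols, a generic sketch with hypothetical coefficients (not the CE1 fit): continuous x, dummies d₁, d₂ for a 3-level factor, plus interactions:

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x + \beta_2 d_1 + \beta_3 d_2
     + \beta_4 x\,d_1 + \beta_5 x\,d_2 + \varepsilon \\
\text{reference level } (d_1 = d_2 = 0):&\quad
  \hat y = \hat\beta_0 + \hat\beta_1 x \\
\text{level 2 } (d_1 = 1):&\quad
  \hat y = (\hat\beta_0 + \hat\beta_2) + (\hat\beta_1 + \hat\beta_4)\,x \\
\text{level 3 } (d_2 = 1):&\quad
  \hat y = (\hat\beta_0 + \hat\beta_3) + (\hat\beta_1 + \hat\beta_5)\,x
\end{aligned}
```

Read off the trap directly: β̂₂ alone is the group difference only at x = 0, while the slope differences β̂₄, β̂₅ are what the interaction test targets.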
Related
- least-squares-and-mle: the estimator and its MLE equivalence under Gaussian errors
- gaussian-error-assumptions: what’s assumed and what breaks
- design-matrix-and-hat-matrix: the algebra of the closed form
- sampling-distribution-of-beta: distribution of β̂ for inference
- confidence-and-prediction-intervals: uncertainty around mean response vs. future observation
- t-test-and-significance: per-coefficient inference
- r-squared: goodness-of-fit measure (prof distrusts)
- residual-diagnostics: checking the assumptions
- collinearity: what blows up
- f-test: Q1 of the four important questions
- categorical-encoding-and-interactions: extension to factor predictors
- polynomial-regression: extension via basis expansion
- multivariate-normal: the foundation for the sampling distribution
- bias-variance-tradeoff: the lens for model selection on top of OLS