Module 03 — Linear Regression
29 questions · 100 points · ~45 min
Question 1
A model is fit as $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ by ordinary least squares. Which statement best describes this model?
- A It is nonlinear regression because the response curves with $x$.
- B It is a generalised additive model because two basis terms enter additively.
- C It can only be fit by gradient descent because of the quadratic term.
- D It is linear regression because it is linear in the parameters $\boldsymbol\beta$.
Correct answer: D
"Linear" refers to the parameters, not the predictor. Treat $(1, x, x^2)$ as columns of $\mathbf{X}$ and run ordinary OLS — closed-form $\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ still applies.
A confuses curvature in $x$ with non-linearity in $\boldsymbol\beta$ (the canonical L06 trap). B is wrong because GAMs use *smoothed* univariate functions, not fixed polynomial bases. C is wrong: OLS has a one-shot closed form whenever $\mathbf{X}^\top\mathbf{X}$ is invertible.
Atoms: polynomial-regression, linear-regression. Lecture: L06-linreg-2.
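A quick numpy sketch of the point, on simulated data (the coefficients and seed here are ours, not from the course): the response curves in $x$, yet the one-line closed form fits it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, 100)  # truth: quadratic in x

# Linear in beta: stack (1, x, x^2) as columns of the design matrix.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary OLS closed form; no gradient descent needed.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to (1.0, 2.0, -0.5)
```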
Question 2
5 points
ISLP §3 Q3
Starting salary (in $\$1{,}000$) is modelled as
$$\hat y = 50 + 20\,\text{GPA} + 0.07\,\text{IQ} + 35\,\text{Level} + 0.01\,(\text{GPA}\cdot\text{IQ}) - 10\,(\text{GPA}\cdot\text{Level})$$
with $\text{Level} = 1$ for college and $0$ for high school. Which statement is correct, holding $\text{IQ}$ and $\text{GPA}$ fixed?
- A College graduates earn more on average than high-school graduates regardless of GPA.
- B High-school graduates earn more on average than college graduates regardless of GPA.
- C College graduates earn more on average, provided that GPA is high enough.
- D High-school graduates earn more on average, provided that GPA is high enough.
Correct answer: D
The college-vs-high-school gap is $35 - 10\cdot\text{GPA}$. It is positive only when $\text{GPA} < 3.5$ and flips negative for $\text{GPA} > 3.5$. So at sufficiently high GPA, high-school graduates earn more.
A reads only the main effect $35$ and ignores the interaction (the canonical "main effect under interaction" trap). B ignores the $+35$ main effect entirely. C gets the crossover backwards: the negative interaction means college wins at *low* GPA, not high.
Atoms: categorical-encoding-and-interactions, linear-regression. Lecture: L27-summary.
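The crossover is a one-liner to check, with the values taken straight from the fitted equation:

```python
import numpy as np

# College-vs-high-school gap at fixed IQ and GPA: 35 - 10 * GPA.
gpa = np.array([3.0, 3.5, 4.0])
gap = 35 - 10 * gpa
print(gap)  # [ 5.  0. -5.]  positive below GPA 3.5, negative above
```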
Question 3
4 points
Exam 2025 P2a
Adult weight (kg) is modelled on age (years) and sex (male = 1, female = 0) with an $\text{age}\times\text{sex}$ interaction. The fitted equation is
$$\hat{\text{weight}} = 60.2 + 0.11\,\text{age} + 5.03\,\text{sex} + 0.68\,(\text{age}\cdot\text{sex}).$$
Mark each statement as true or false.
- False — for females ($\text{sex}=0$) the age slope is $0.11$, not $0.79$. The value $0.79 = 0.11 + 0.68$ is the *male* age slope; this is the canonical "read the slope through the interaction" trap.
- False — the $5.03$ coefficient is the male-vs-female gap *only when age $= 0$*. Since the interaction is $0.68$, the gap at age $a$ is $5.03 + 0.68\,a$, very different from $5.03$ across the data range. This is the prof's flagged 2025 Q2 interaction trap.
- True — male age slope = $0.11 + 0.68 = 0.79$ kg/year.
- True — the $\text{age}\cdot\text{sex}$ term is precisely what allows different age slopes per sex.
Atoms: categorical-encoding-and-interactions. Lecture: L27-summary.
Question 4
4 points
ISLP §3 Q4
With $n=100$ observations of a single predictor $X$ and quantitative response $Y$, you fit
Model A: $\hat y = \hat\beta_0 + \hat\beta_1 x$, and Model B: $\hat y = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \hat\beta_3 x^3$.
Suppose the *true* relationship is linear. What can you say about the **training** RSS of the two models?
- A Training RSS is the same, since the truth is linear and OLS is unbiased.
- B Training RSS is lower for Model B, because adding parameters cannot raise training RSS.
- C Training RSS is lower for Model A, because the cubic terms add irreducible noise to the residuals.
- D Cannot be determined without seeing the test set, since RSS depends on the held-out split.
Correct answer: B
Model A is nested inside Model B (set $\beta_2 = \beta_3 = 0$). On the training data, OLS picks the coefficients that *minimise* RSS over the larger parameter space, so Model B's training RSS $\le$ Model A's. The prof's flagged keyword is training — for test RSS it would flip.
A misses that adding parameters lets the optimiser fit noise — the truth being linear matters for population-level bias, not for finite-sample training fit. C confuses irreducible noise with what the model *fits to*; cubic terms don't add noise, they absorb residuals. D conflates training and test — training RSS is computed only on the training split and is fully determined by the fit.
Atoms: polynomial-regression, r-squared, bias-variance-tradeoff. Lecture: L27-summary.
Question 5
3 points
ISLP §3 Q4
Same setup as Q4 (true relationship is linear, $n=100$). What do you expect for the **test** RSS of Model A versus Model B?
- A Lower for Model A on average; Model B overfits noise.
- B Lower for Model B on average; more flexibility always helps prediction.
- C Identical, because both models contain the truth.
- D The cubic and linear models give exactly the same predictions, so test RSS is identical.
Correct answer: A
The cubic model has the same bias as the linear model (both contain the truth) but extra variance from estimating two unnecessary coefficients, so on average test RSS is lower for the simpler Model A.
B inverts the bias-variance message — when the truth is in the simpler model, the simpler model wins. C ignores the variance term — Model B's noisier coefficient estimates inflate prediction MSE. D is wrong: with finite samples $\hat\beta_2, \hat\beta_3$ are nonzero point estimates, so Model B's predictions differ from Model A's.
Atoms: polynomial-regression, bias-variance-tradeoff.
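A simulation sketch covering both Q4 and Q5 (data, noise level, and seed all invented for illustration): the truth is linear, and we average training and test RSS for the linear and cubic fits over many replicates.

```python
import numpy as np

rng = np.random.default_rng(1)

def design(x, degree):
    # Columns (1, x, ..., x^degree).
    return np.column_stack([x**d for d in range(degree + 1)])

def rss(X, y, beta):
    r = y - X @ beta
    return r @ r

n, reps = 100, 2000
train_rss = {1: 0.0, 3: 0.0}
test_rss = {1: 0.0, 3: 0.0}
for _ in range(reps):
    x_tr, x_te = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    y_tr = 1 + 2 * x_tr + rng.normal(0, 1, n)  # truth is linear
    y_te = 1 + 2 * x_te + rng.normal(0, 1, n)
    for deg in (1, 3):
        Xtr = design(x_tr, deg)
        beta = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ y_tr)
        train_rss[deg] += rss(Xtr, y_tr, beta) / reps
        test_rss[deg] += rss(design(x_te, deg), y_te, beta) / reps

print(train_rss)  # cubic strictly lower on the training split
print(test_rss)   # cubic higher on average: extra variance, no bias payoff
```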
Question 6
5 points
Exam 2025 P4a
A Boston-housing-style regression of median home value (medv, in $\$1000$) on nine predictors plus an $\text{rm}^2$ quadratic term gives the following partial output for the $\text{chas}$ indicator (1 if tract bounds the Charles River, 0 otherwise):
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|------|------|------|------|------|
| chas | 3.36 | 0.86 | 3.91 | 0.0001 |
Using the working approximation $t_{0.975, n-p-1}\approx 2$, the approximate 95% confidence interval for the chas effect is closest to:
- A $[3.36 \pm 0.86]$, i.e. $[2.50,\ 4.22]$ thousand dollars.
- B $[3.36 \pm 1.72]$, i.e. $[1.64,\ 5.08]$ thousand dollars.
- C $[3.36 \pm 7.82]$, i.e. $[-4.46,\ 11.18]$ thousand dollars.
- D $[0.0001,\ 3.36]$ thousand dollars.
Correct answer: B
$\hat\beta \pm t_{0.975}\cdot\mathrm{SE} \approx 3.36 \pm 2\cdot 0.86 = 3.36 \pm 1.72 \to [1.64, 5.08]$. Houses bounding the river fetch on average $\$1{,}640$–$\$5{,}080$ more, holding other predictors constant.
A drops the $t$ multiplier, using one SE rather than two. C plugs the t-value $3.91$ in where the SE belongs ($2 \cdot 3.91 = 7.82$), a confusion of output columns. D treats the p-value as a confidence bound — p-values and CIs are different objects.
Atoms: confidence-and-prediction-intervals, sampling-distribution-of-beta, t-test-and-significance. Lecture: L27-summary.
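A sanity check of the arithmetic, plus the exact quantile for comparison (the df of 494 is a hypothetical stand-in, since the output above does not report $n$):

```python
import numpy as np
from scipy import stats

est, se = 3.36, 0.86
print(est + np.array([-1, 1]) * 2 * se)   # [1.64, 5.08] with the t ~= 2 shortcut

# Exact quantile for a hypothetical df of 494; it barely moves the interval.
t_crit = stats.t.ppf(0.975, 494)          # ~1.965
print(est + np.array([-1, 1]) * t_crit * se)
```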
Question 7
3 points
Exam 2025 P4a
A linear model is fit with an intercept, 7 continuous predictors, a 9-level factor (entered as dummies with one reference level), and one quadratic term I(rm^2). How many regression parameters does the fitted model consume (counting the intercept)?
- A $7 + 9 + 1 = 17$.
- B $1 + 7 + 8 + 1 = 17$.
- C $1 + 7 + 9 + 1 = 18$.
- D $1 + 7 + 8 + 2 = 18$.
Correct answer: B
1 intercept + 7 continuous slopes + $K-1 = 8$ dummies for the 9-level factor + 1 for $\text{rm}^2$ = 17.
A forgets the intercept entirely. C uses $K=9$ dummies instead of $K-1$ — the reference level is absorbed into the intercept (otherwise $\mathbf{X}^\top\mathbf{X}$ is singular). D double-counts the quadratic term as if both $\text{rm}$ and $\text{rm}^2$ were *new*; each consumes one slot, and $\text{rm}$ is already among the 7 continuous predictors.
Atoms: categorical-encoding-and-interactions, design-matrix-and-hat-matrix.
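A minimal pandas check of the dummy count (the factor itself is made up):

```python
import pandas as pd

g = pd.Series(list("ABCDEFGHI") * 6)          # a 9-level factor, all levels present
dummies = pd.get_dummies(g, drop_first=True)  # reference level absorbed: K - 1 = 8
print(dummies.shape[1])                       # 8
print(1 + 7 + dummies.shape[1] + 1)           # 17 parameters in total
```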
Question 8
A multiple regression is fit with $n = 200$ observations and $p = 9$ slope parameters. For one predictor, the output gives $\hat\beta = 0.40$ and $\mathrm{SE}(\hat\beta) = 0.20$. Using the working cutoff $t_{0.975, 190}\approx 1.97$, is this coefficient significant at the 5% level, and what is its t-statistic?
- A $t = 2.0$; significant at the 5% level.
- B $t = 0.5$; not significant at the 5% level.
- C $t = 0.08$; not significant at the 5% level.
- D $t = 2.0$; not significant because $|t|$ has to exceed $t_{0.99}\approx 2.6$.
Correct answer: A
$t = \hat\beta / \mathrm{SE}(\hat\beta) = 0.40 / 0.20 = 2.0 > 1.97$, so reject $H_0:\beta=0$ at $\alpha=0.05$.
B inverts numerator and denominator (computes $\mathrm{SE}/\hat\beta = 0.5$). C multiplies instead of dividing ($\hat\beta\cdot\mathrm{SE} = 0.08$). D applies the wrong cutoff (the two-sided 1% quantile instead of 5%).
Atoms: t-test-and-significance, sampling-distribution-of-beta.
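To verify both the statistic and the cutoff (scipy gives the exact quantile behind the $\approx 1.97$):

```python
from scipy import stats

beta_hat, se = 0.40, 0.20
t = beta_hat / se                     # 2.0
t_crit = stats.t.ppf(0.975, 190)      # ~1.9726, the cutoff in the question
print(abs(t) > t_crit)                # True: significant at the 5% level
print(2 * stats.t.sf(abs(t), 190))    # two-sided p-value, just under 0.05
```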
Question 9
4 points
CE1 P2g
For a t-test with null $H_0:\beta_j = 0$ and observed two-sided p-value $p$, mark each statement as true or false.
- False — the p-value is computed *conditional on* $H_0$; it is not a posterior probability of $H_0$.
- False — failing to reject $H_0$ is "insufficient evidence", not "$H_1$ is false".
- True — this is the canonical correct definition.
- False — under $H_0$ "everything is random chance" by assumption; the p-value is a tail probability under that assumption, not a probability of "results being chance".
Atoms: t-test-and-significance.
Question 10
A regression on $n = 50{,}000$ observations gives $\hat\beta = 0.003$ kg/year for a body-fat predictor, with $p < 10^{-6}$. The investigator concludes that the effect is large because the p-value is tiny. The best critique is:
- A A p-value below $10^{-6}$ on a sample this large proves that the linear model is correctly specified, so the conclusion about effect size is logically sound.
- B The effect must be large because the p-value is so small; at sample sizes in the tens of thousands, effect size and statistical significance carry essentially the same information.
- C The p-value is invalid because it falls below the conventional reporting threshold of $0.001$ and should be re-computed with a more conservative correction.
- D A small p-value is evidence the effect is real, but the effect size is the slope, not the p-value; $0.003$ kg/year is practically negligible.
Correct answer: D
The prof's "significance is just sample size" sermon: with huge $n$ the SE shrinks like $1/\sqrt n$, so any non-zero effect becomes statistically significant. Practical (effect-size) and statistical (p-value) significance are different — you want both, ideally.
A confuses significance with model adequacy — diagnostics, not p-values, check specification. B is the canonical conflation the prof warns against. C invents a non-existent reporting rule.
Atoms: t-test-and-significance, sampling-distribution-of-beta. Lecture: L05-linreg-1.
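A simulation sketch of the sermon (slope, noise level, and seed all invented): the same tiny slope goes from unremarkable to wildly significant as $n$ grows, because the SE shrinks like $1/\sqrt n$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
beta = 0.003                              # tiny, practically negligible slope
for n in (500, 50_000):
    x = rng.normal(0, 10, n)
    y = beta * x + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t = b[1] / se
    # Slope estimate stays tiny; SE shrinks ~ 1/sqrt(n); p typically collapses.
    print(n, b[1], se, 2 * stats.t.sf(abs(t), n - 2))
```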
Question 11
Recall the simple-LR formula $\mathrm{SE}(\hat\beta_1)^2 = \sigma^2 / \sum_i (x_i - \bar x)^2$. Mark each statement about $\mathrm{SE}(\hat\beta_1)$ as true or false.
- False — variance scales like $1/n$, so SE scales like $1/\sqrt n$. Doubling $n$ divides SE by $\sqrt 2$, not by $2$ — this is the canonical "$\sqrt n$ rate" trap.
- True — the denominator $\sum(x_i - \bar x)^2$ grows, shrinking SE; this is the prof's "design wider experiments" lever.
- False — collinearity *inflates* SE: $\mathbf{X}^\top\mathbf{X}$ becomes near-singular, $(\mathbf{X}^\top\mathbf{X})^{-1}$ blows up.
- True — SE is proportional to $\sigma$.
Atoms: sampling-distribution-of-beta, collinearity. Lecture: L05-linreg-1.
Question 12
At a fixed test point $\mathbf{x}_0$, you can construct (a) a 95% confidence interval for $\mathbf{x}_0^\top\boldsymbol\beta$ and (b) a 95% prediction interval for a future $Y$ at $\mathbf{x}_0$. Which of the following correctly contrasts them?
- A The CI captures uncertainty in $\hat{\boldsymbol\beta}$ only; the PI adds the irreducible noise variance $\sigma^2$, so PI is always wider.
- B The CI captures the true individual response; the PI captures the population mean, so PI is wider only when $\mathbf{x}_0$ is far from the data mean.
- C Both intervals carry the same uncertainty; the only difference is the nominal coverage level.
- D The PI is always narrower because it conditions on the observed $\mathbf{x}_0$ rather than averaging over data.
Correct answer: A
The CI for the *mean response* uses $\hat\sigma\sqrt{\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}$; the PI for a *future observation* uses $\hat\sigma\sqrt{1 + \mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}$ — that "+1" is the $\sigma^2$ contribution from $\varepsilon_{\text{new}}$.
B swaps the targets: the CI is for the mean, the PI is for an individual. C ignores the structurally different objects each interval covers. D inverts the inequality — PI is *always* wider, never narrower.
Atoms: confidence-and-prediction-intervals, sampling-distribution-of-beta.
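A numpy sketch of both intervals at one test point, on simulated data and with the $t \approx 2$ shortcut; the only difference between the two half-widths is the "+1":

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma_hat = np.sqrt(resid @ resid / (n - 2))

x0 = np.array([1.0, 5.0])                 # test point (intercept, x = 5)
q = x0 @ np.linalg.inv(X.T @ X) @ x0      # the quadratic form in both formulas
half_ci = 2 * sigma_hat * np.sqrt(q)      # CI for the mean response
half_pi = 2 * sigma_hat * np.sqrt(1 + q)  # PI: the "+1" is the new epsilon
print(half_ci, half_pi)                   # the PI is always wider
```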
Question 13
Mark each statement about $R^2$ and $R^2_{\text{adj}}$ in OLS as true or false.
- True — the larger model contains the smaller as a special case ($\beta_{\text{new}} = 0$), so OLS minimisation cannot raise RSS.
- False — that's the property of ordinary $R^2$. Adjusted $R^2$ has the $(n-1)/(n-p-1)$ penalty, so it *can* fall when added parameters do not pull their weight; that's exactly what makes it a model-selection signal rather than just a goodness-of-fit number.
- True — a standard simple-LR identity.
- False — $R^2$ is the *fraction of variance* in $Y$ explained, not a classification accuracy.
Atoms: r-squared, linear-regression.
Question 14
A linear model is fit to $n = 8$ observations with total sum of squares $\mathrm{TSS} = 800$. The fitted residuals give $\mathrm{RSS} = 200$. What is $R^2$?
- A $0.20$.
- B $0.25$.
- C $0.75$.
- D $4.00$.
Correct answer: C
$R^2 = 1 - \mathrm{RSS}/\mathrm{TSS} = 1 - 200/800 = 0.75$.
A divides by $\mathrm{TSS} + \mathrm{RSS} = 1000$ instead of $\mathrm{TSS}$, giving $200/1000 = 0.20$. B reports $\mathrm{RSS}/\mathrm{TSS}$ instead of $1 - \mathrm{RSS}/\mathrm{TSS}$ (forgets the $1-$). D inverts the ratio to $\mathrm{TSS}/\mathrm{RSS}$.
Atoms: r-squared.
Question 15
After fitting an OLS regression you produce a QQ plot of standardised residuals against theoretical normal quantiles. The middle of the plot lies on the reference line, but both tails curl strongly upward at the high end and downward at the low end (heavy "S-shape"). Which assumption is most directly called into question?
- A Independence of the errors $\varepsilon_i$ across observations in time or space.
- B Normality of the errors — the residuals look heavy-tailed relative to a Gaussian.
- C Linearity of the conditional mean $E[Y\mid X]$ as a function of the predictors.
- D Constancy (homoscedasticity) of $\sigma^2$ across the full range of fitted values.
Correct answer: B
An S-curve with extreme tails pulling away from the reference line is the canonical "heavy tails / non-Gaussian errors" pattern on a QQ plot. The bulk being aligned says the centre of the distribution looks roughly normal; the deviations are at the extremes.
A would show as structure in the residuals-vs-fitted or order plots, not in the QQ plot. C shows up as curvature in residuals-vs-fitted. D shows up as a fan in residuals-vs-fitted or scale-location plots, not as an S in the QQ plot.
Atoms: residual-diagnostics, gaussian-error-assumptions. Lecture: L06-linreg-2.
Question 16
In simple linear regression with $n = 5$ predictor values $\mathbf{x} = (1, 2, 3, 4, 10)$, what is the leverage $h_{55}$ of the fifth observation ($x_5 = 10$)? Recall $h_{ii} = \tfrac{1}{n} + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$.
- A $0.20$.
- B $0.72$.
- C $0.92$.
- D $1.20$.
Correct answer: C
$\bar x = (1+2+3+4+10)/5 = 4$. Deviations $(x_i - \bar x) = (-3, -2, -1, 0, 6)$, so $\sum(x_j-\bar x)^2 = 9 + 4 + 1 + 0 + 36 = 50$. Then $h_{55} = 1/5 + 36/50 = 0.20 + 0.72 = 0.92$. The fifth observation has very high leverage — it sits at the far end of $x$.
A reports only the baseline $1/n$ term and forgets the $(x_i - \bar x)^2$ contribution. B reports just the second term $36/50 = 0.72$ and forgets the $1/n$ baseline. D adds the prediction-interval "+1" to the $1/n$ baseline ($0.20 + 1 = 1.20$) — and is impossible on its face, since leverage can never exceed $1$.
Atoms: residual-diagnostics, design-matrix-and-hat-matrix.
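The computation in numpy, using the exact $x$ values from the question:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 10])
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
print(np.diag(H))    # last entry: h_55 = 1/5 + 36/50 = 0.92
print(np.trace(H))   # 2.0 = p + 1 (intercept + one slope), never n
```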
Question 17
On a residuals-vs-leverage plot, four points are highlighted. Which one is *most* dangerous for the OLS fit (i.e. exerts the largest influence on $\hat{\boldsymbol\beta}$)?
- A A point with low leverage and a large residual.
- B A point with high leverage and a residual close to zero.
- C A point with high leverage and a large residual.
- D A point with low leverage and a residual close to zero.
Correct answer: C
Cook's distance combines both: $D_i \propto \tilde r_i^2 \cdot h_{ii}/(1-h_{ii})$. The dangerous corner is high leverage *and* large residual — the "fat kid far from centre on the seesaw" image.
A has no lever arm — the point can have a large residual but limited pull on the line. B is "harmless high leverage" — the point sits on the fitted line, so it doesn't bend it. D has neither lever nor residual; the safest case.
Atoms: residual-diagnostics. Lecture: L08-classif-2.
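A deterministic sketch: take the $x$ values from the previous question, put the truth exactly on a line, then perturb only the high-leverage point and compute Cook's distance from the formula above.

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 10])
y = 2 + 0.5 * x
y[-1] += 3.0                              # perturb the high-leverage point

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = (np.eye(5) - H) @ y               # OLS residuals
p1 = X.shape[1]                           # p + 1 = 2
sigma2 = resid @ resid / (len(x) - p1)
r = resid / np.sqrt(sigma2 * (1 - h))     # standardised residuals
D = r**2 / p1 * h / (1 - h)               # Cook's distance
print(D)   # the x = 10 point dominates: ~17 vs < 0.5 for the rest
```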
Question 18
Two predictors $x_1$ and $x_2$ are highly correlated (sample correlation $\approx 0.99$). Mark each statement about the OLS fit as true or false.
- True — collinearity makes $\mathbf{X}^\top\mathbf{X}$ near-singular, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large diagonal entries → SE explodes.
- False — under the classical assumptions, OLS is unbiased even under collinearity; only the variance is inflated.
- True — the canonical reason to do the joint F-test before drilling into per-coefficient t's.
- False — predictions are typically *stable*, not unreliable: even when individual coefficients have huge SEs, the *sum* $\beta_1 x_1 + \beta_2 x_2$ is well-determined, so $\hat y$ on data similar to the training set is fine. Collinearity hurts inference about individual $\beta_j$, not in-sample prediction.
Atoms: collinearity, sampling-distribution-of-beta, multivariate-normal (the joint $N_{p+1}(\boldsymbol\beta, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$ distribution of $\hat{\boldsymbol\beta}$ is the foundation for inflated SEs under collinearity).
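A simulation sketch of the last two points (correlation level, coefficients, and seed are ours): per-coefficient SEs blow up, yet the sum of the two slopes, and hence the predictions, stays pinned down.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)   # corr(x1, x2) ~ 0.995
y = 1 + x1 + x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
sigma2 = resid @ resid / (n - 3)
se = np.sqrt(sigma2 * np.diag(XtX_inv))

print(beta[1:], se[1:])   # individual slopes noisy, SEs inflated
print(beta[1] + beta[2])  # the sum stays near 2, so predictions are stable
```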
Question 19
You want to test whether *any* of the predictors in a multiple regression with $p = 5$ slopes contributes to predicting $Y$. Which is the appropriate procedure?
- A Pick the smallest individual p-value and report it as the overall test.
- B Compute the residual standard error and reject if it is below $\sigma$.
- C Run an F-test on $H_0:\beta_1 = \cdots = \beta_5 = 0$ before drilling into individual t-tests.
- D Compare $R^2$ between the full model and the intercept-only model and reject if $R^2 > 0.5$.
Correct answer: C
The F-test on the joint null is exactly the question being asked. It is also robust to collinearity: individual t's can both fail while the joint F is highly significant. Note: you are not expected to compute the F-statistic on this exam — only to know what null it tests and why you'd reach for it.
A inflates the family-wise false-positive rate (the multiple-testing trap), and ignores collinearity-masked signals. B is meaningless — RSE estimates $\sigma$ but doesn't test "any predictor matters". D invents a non-existent rejection rule.
Atoms: f-test, t-test-and-significance, collinearity. Lecture: L06-linreg-2.
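Not examinable per the note above, but for completeness, the F-statistic from its RSS/TSS form on simulated data (seed and coefficients invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = 1 + 0.5 * X[:, 1] + rng.normal(0, 1, n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta_1 = ... = beta_5 = 0.
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(F, stats.f.sf(F, p, n - p - 1))
```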
Question 20
4 points
CE1 P2c
Earthworm weight $Y$ is modelled on stomach circumference $X$ (continuous) and genus $G \in \{L, N, Oc\}$ (factor; L as reference) via $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 \mathbf{1}_{G=N} + \hat\beta_3 \mathbf{1}_{G=Oc}$. The fitted equation for genus Oc is:
- A $\hat Y = \hat\beta_0 + \hat\beta_1 X$.
- B $\hat Y = (\hat\beta_0 + \hat\beta_3) + \hat\beta_1 X$.
- C $\hat Y = (\hat\beta_0 + \hat\beta_2) + \hat\beta_1 X$.
- D $\hat Y = (\hat\beta_0 + \hat\beta_2 + \hat\beta_3) + \hat\beta_1 X$.
Correct answer: B
For genus Oc, $\mathbf{1}_{G=N} = 0$ and $\mathbf{1}_{G=Oc} = 1$, so the dummy contribution is $\hat\beta_3$ added to the intercept. Slope $\hat\beta_1$ is the same across genera (no interaction).
A is the reference-genus equation (L). C is genus N. D adds both dummies as if they could be active simultaneously — but each observation belongs to exactly one genus.
Atoms: categorical-encoding-and-interactions.
Question 21
A categorical predictor takes $K = 4$ levels. Why does R encode it as $3$ dummy columns rather than $4$?
- A Because the column of ones plus $4$ dummies is linearly dependent, so $\mathbf{X}^\top\mathbf{X}$ is singular and the fit is unidentifiable.
- B Because using all $4$ dummies would force the four levels onto an artificial $0$-$1$-$2$-$3$ ordering, biasing the slope estimates of the unordered categories.
- C Because R automatically reserves one design-matrix column for the residual-variance estimate $\hat\sigma^2$ used in standard-error calculations.
- D Because the F-test for the joint significance of this factor uses exactly $K-1$ degrees of freedom by definition, so the design matrix must match.
Correct answer: A
With $K$ dummies plus an intercept, the column of ones equals the sum of the four dummies — perfect collinearity, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ doesn't exist. Drop one dummy to absorb the reference level into the intercept.
B confuses the $K$-dummy identifiability issue with the separate mistake of coding an unordered factor as a single numeric $0/1/2/3$ column (that trap imposes a false ordering, but it is a different error). C invents a non-existent design. D inverts cause and effect: the $K-1$ degrees of freedom are a *consequence* of identifiability, not the reason for it.
Atoms: categorical-encoding-and-interactions, design-matrix-and-hat-matrix, collinearity.
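A rank check in numpy (a made-up 4-level factor): with all $K$ dummies next to the intercept the design loses a rank; drop one and it is full rank again.

```python
import numpy as np

levels = np.repeat(np.arange(4), 10)   # 4-level factor, 40 observations
all_dummies = np.eye(4)[levels]        # one column per level (K = 4)
X_bad = np.column_stack([np.ones(40), all_dummies])
X_ok = np.column_stack([np.ones(40), all_dummies[:, 1:]])  # drop the reference

print(np.linalg.matrix_rank(X_bad), X_bad.shape[1])  # 4 < 5: X'X is singular
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])    # 4 = 4: identifiable
```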
Question 22
Mark each statement about interaction terms in linear regression as true or false.
- True — the prof's flagged "non-negotiable" main-effects rule. Including only $X\cdot Z$ is almost always wrong.
- True — the canonical interaction trap. Don't read it as a global average.
- False — interactions are what allow different slopes. The "different intercepts only" picture is the no-interaction model.
- False — the interaction t-test is on the *interaction coefficient*, which measures the slope *difference* between groups, not the main effect of $X$. A small p-value on $X\cdot Z$ tells you the slopes differ, not that the average slope is non-zero.
Atoms: categorical-encoding-and-interactions. Lecture: L06-linreg-2.
Question 23
Hourly temperature is regressed on time of day. Sampling resolution is then refined from hourly to every $5$ minutes (so $n$ grows by $\sim 12\times$). Reported t-statistics for the time slope shoot up; the p-value drops to $<10^{-9}$. The most accurate critique is:
- A Once $n$ exceeds the number of distinct hours in the original data, the OLS point estimate $\hat\beta$ becomes biased and the slope estimate cannot be trusted.
- B An $R^2$ that exceeds $0.95$ on this many points indicates overfitting; the appropriate fix is to reduce model flexibility (drop predictors or use ridge).
- C The Gaussian-error assumption is broken because the response (temperature in Celsius) is bounded both below and above, so the t-distribution does not apply.
- D Adjacent time bins are highly correlated; effective sample size is much smaller than $n$, so the SEs are underestimated and the reported significance is unreliable.
Correct answer: D
Independence (assumption 5) is violated by serial correlation. The OLS algebra still computes $\hat\beta$, but the reported uncertainty is wrong — the prof's "horseshit standard errors" critique. Effective $n$ is much less than apparent $n$.
A confuses bias of $\hat\beta$ with bias of inference; under correlated errors point estimates are still unbiased. B invokes a different problem (overfitting), unrelated to dependence. C is a stretch — the Gaussian assumption is robust to mild boundedness, and boundedness is not what changed when the sampling rate did.
Atoms: gaussian-error-assumptions, sampling-distribution-of-beta. Lecture: L05-linreg-1.
Question 24
Mark each statement about the residuals-vs-fitted plot as true or false.
- True — fanning means $\mathrm{Var}(\varepsilon_i)$ scales with the fitted value, breaking common-variance.
- True — systematic shape in residuals against fitted values says the conditional mean is mis-specified; add a transformation or polynomial.
- False — a flat band is *consistent with* the assumptions, not proof of them. Residuals-vs-fitted only checks linearity and constant variance; it can't see independence (look at order plots) and can't see normality (look at the QQ plot). "Consistent with" ≠ "proves".
Atoms: residual-diagnostics, gaussian-error-assumptions.
Question 25
Which statement best captures the difference between an *error* $\varepsilon_i$ and a *residual* $e_i = y_i - \hat y_i$ in the linear-regression model?
- A Errors are computed from the data after the fit, while residuals are the parameters that define the noise model.
- B Errors and residuals coincide whenever the fitted model contains an intercept term, so the distinction is bookkeeping only.
- C Residuals are unbiased for $\sigma^2$ but errors are biased, which is why the textbook variance estimator divides by $n-p-1$ rather than $n$.
- D Errors are unobservable random variables; residuals are observed predictions of those errors.
Correct answer: D
The prof's distinction: $\varepsilon_i$ is the *random* deviation in the population model, never directly observed. $e_i$ is the *observed* prediction of that error after fitting. Raw residuals additionally have $\mathrm{Cov}(\mathbf{e}) = \sigma^2(\mathbf{I}-\mathbf{H})$, which is why we standardise them for diagnostics.
A inverts the two roles. B is wrong even with an intercept — residuals are predictions of unobserved errors, not equal to them. C mixes up bias of an estimator (variance) with the variables themselves.
Atoms: gaussian-error-assumptions, residual-diagnostics.
Question 26
4 points
Exam 2024 P3
Under the linear model $y_i = \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}\sim N(0, \sigma^2)$, what is the *correct* short justification that the least-squares estimator equals the maximum-likelihood estimator?
- A Because $\hat{\boldsymbol\beta}$ is unbiased under the Gaussian model, and any unbiased estimator of a parameter automatically maximises the joint likelihood.
- B Because $\log L \propto -\sum_i (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2$ in $\boldsymbol\beta$, so maximising $\log L$ is equivalent to minimising the SSE — the LS objective.
- C Because the OLS closed form $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ exists in one matrix multiplication, and closed-form estimators coincide with MLEs by definition.
- D Because the Gaussian density is symmetric, and any symmetric-loss minimiser is automatically an MLE under the symmetric likelihood.
Correct answer: B
The prof's flagged mathy template (verbatim from L27 / 2024 exam P3): write the log-likelihood, drop terms not depending on $\boldsymbol\beta$, identify the SSE, conclude. About 6–10 lines on the exam.
A confuses unbiasedness with the MLE property — they're different. C confuses computational convenience with the optimisation objective — closed form is a property of the algebra, not the principle. D is nonsense — symmetry is not a sufficient condition for being an MLE.
Atoms: least-squares-and-mle, gaussian-error-assumptions. Lecture: L27-summary.
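The template itself, as a LaTeX sketch of the chain the answer describes:

```latex
\begin{align*}
L(\boldsymbol\beta, \sigma^2)
  &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Big(-\frac{(y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2}{2\sigma^2}\Big) \\
\log L
  &= -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2
\intertext{The first term does not involve $\boldsymbol\beta$, so}
\arg\max_{\boldsymbol\beta} \log L
  &= \arg\min_{\boldsymbol\beta}
     \sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2
   = \hat{\boldsymbol\beta}_{\text{LS}}.
\end{align*}
```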
Question 27
Differentiating the matrix-form residual sum of squares $\mathrm{RSS}(\boldsymbol\beta) = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$ with respect to $\boldsymbol\beta$ and setting the result to zero gives:
- A $\mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{X}^\top\mathbf{y}$ (the normal equations).
- B $\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{y}$, i.e. the fitted values reproduce the observed response exactly with no residual.
- C $\hat{\boldsymbol\beta} = \mathbf{y}^\top\mathbf{X} / (\mathbf{X}^\top\mathbf{X})$, scalar division of the cross-product by the Gram matrix.
- D $\hat{\boldsymbol\beta} = \mathbf{X}\,(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{y}$, with the design matrix premultiplying the inverse.
Correct answer: A
$\partial\mathrm{RSS}/\partial\boldsymbol\beta = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol\beta = \mathbf{0}$ gives the normal equations. Solving (when $\mathbf{X}^\top\mathbf{X}$ is invertible) yields $\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.
B is the residual-equals-zero condition, which only holds for $n \le p+1$ (degenerate). C reorders the matrix product and uses scalar division for matrices — algebraically illegal. D mis-orders the inverse and projection: $(\mathbf{X}^\top\mathbf{X})^{-1}$ is followed by $\mathbf{X}^\top$, not premultiplied by $\mathbf{X}$.
Atoms: least-squares-and-mle, design-matrix-and-hat-matrix.
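A numpy check that solving the normal equations agrees with the library least-squares routine (simulated data, coefficients ours):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)

# Solve the normal equations X'X beta = X'y directly ...
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# ... and compare with the numerically preferred least-squares routine.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ne, beta_ls))  # True
```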
Question 28
Let $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ be the hat matrix in OLS. Mark each statement as true or false.
- True — useful: leverage can be computed from the design alone, before $\mathbf{y}$ is observed.
- False — $\sum h_{ii} = \mathrm{tr}(\mathbf{H}) = p + 1$, the number of fitted parameters (counting the intercept), *not* $n$. The trace of a projection matrix equals the dimension of the space it projects onto, and that's $p+1$ here.
- True — $\mathbf{H}$ is the orthogonal projection onto the column space of $\mathbf{X}$.
Atoms: design-matrix-and-hat-matrix, residual-diagnostics.
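A numeric spot-check of the trace and projection claims with a random design (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.isclose(np.trace(H), p + 1))  # trace = number of parameters, not n
print(np.allclose(H @ H, H))           # idempotent: H is a projection
print(np.allclose(H, H.T))             # symmetric: an *orthogonal* projection
```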
Question 29
A regression of life expectancy on years of education estimates $\hat\beta = 0.6$ years of life per year of schooling, with a small p-value. Which of the following claims is best supported by this OLS fit alone?
- A Holding all other factors fixed, increasing schooling by one year causes the average person's life expectancy to rise by $0.6$ years.
- B Among the predictors that influence life expectancy in this population, years of education is the single dominant driver.
- C In the data, more-educated individuals have on average $0.6$ more years of life expectancy per year of schooling, controlling for whatever else is in the model.
- D Years of schooling and life expectancy are statistically independent at the population level, despite the fitted slope.
Correct answer: C
Regression measures conditional association, not causation — the prof's "fancy correlations" line. The slope is the average change in $\hat Y$ associated with a one-unit change in the predictor, conditional on the other predictors in the model. Lurking variables, reverse causation, and selection effects can all generate a non-zero slope.
A reads the slope as causal — the canonical exam-flagged mistake. B overstates: a significant non-zero slope is consistent with many populations, including ones where education is a minor driver. D contradicts the stated significant fit.
Atoms: linear-regression, t-test-and-significance. Lecture: L05-linreg-1.