Module 03 — Linear Regression
29 questions · 100 points · ~45 min
Question 1
A model is fit as $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ by ordinary least squares. Which statement best describes this model?
- A It is nonlinear regression because the response curves with $x$.
- B It is a generalised additive model because two basis terms enter additively.
- C It can only be fit by gradient descent because of the quadratic term.
- D It is linear regression because it is linear in the parameters $\boldsymbol\beta$.
Correct answer: D
"Linear" refers to the parameters, not the predictor. Treat $(1, x, x^2)$ as columns of $\mathbf{X}$ and run ordinary OLS — closed-form $\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ still applies.
A confuses curvature in $x$ with non-linearity in $\boldsymbol\beta$ (the canonical L06 trap). B is wrong because GAMs use *smoothed* univariate functions, not fixed polynomial bases. C is wrong: OLS has a one-shot closed form whenever $\mathbf{X}^\top\mathbf{X}$ is invertible.
Atoms: polynomial-regression, linear-regression. Lecture: L06-linreg-2.
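A quick numpy sketch of the point, on simulated data (the coefficients and seed here are ours, not from the course): the response curves in $x$, yet the one-line closed form fits it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, 100)  # truth: quadratic in x

# Linear in beta: stack (1, x, x^2) as columns of the design matrix.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary OLS closed form; no gradient descent needed.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to (1.0, 2.0, -0.5)
```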
Question 2
5 points
ISLP §3 Q3
Starting salary (in $\$1{,}000$) is modelled as
$$\hat y = 50 + 20\,\text{GPA} + 0.07\,\text{IQ} + 35\,\text{Level} + 0.01\,(\text{GPA}\cdot\text{IQ}) - 10\,(\text{GPA}\cdot\text{Level})$$
with $\text{Level} = 1$ for college and $0$ for high school. Which statement is correct, holding $\text{IQ}$ and $\text{GPA}$ fixed?
- A College graduates earn more on average than high-school graduates regardless of GPA.
- B High-school graduates earn more on average than college graduates regardless of GPA.
- C College graduates earn more on average, provided that GPA is high enough.
- D High-school graduates earn more on average, provided that GPA is high enough.
Correct answer: D
The college-vs-high-school gap is $35 - 10\cdot\text{GPA}$. It is positive only when $\text{GPA} < 3.5$ and flips negative for $\text{GPA} > 3.5$. So at sufficiently high GPA, high-school graduates earn more.
A reads only the main effect $35$ and ignores the interaction (the canonical "main effect under interaction" trap). B ignores the $+35$ main effect entirely. C gets the crossover backwards: the negative interaction means college wins at *low* GPA, not high.
Atoms: categorical-encoding-and-interactions, linear-regression. Lecture: L27-summary.
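The crossover is a one-liner to check, with the values taken straight from the fitted equation:

```python
import numpy as np

# College-vs-high-school gap at fixed IQ and GPA: 35 - 10 * GPA.
gpa = np.array([3.0, 3.5, 4.0])
gap = 35 - 10 * gpa
print(gap)  # [ 5.  0. -5.]  positive below GPA 3.5, negative above
```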
Question 3
4 points
Exam 2025 P2a
Adult weight (kg) is modelled on age (years) and sex (male = 1, female = 0) with an $\text{age}\times\text{sex}$ interaction. The fitted equation is
$$\hat{\text{weight}} = 60.2 + 0.11\,\text{age} + 5.03\,\text{sex} + 0.68\,(\text{age}\cdot\text{sex}).$$
Mark each statement as true or false.
- False — for females ($\text{sex}=0$) the age slope is $0.11$, not $0.79$. The value $0.79 = 0.11 + 0.68$ is the *male* age slope; this is the canonical "read the slope through the interaction" trap.
- False — the $5.03$ coefficient is the male-vs-female gap *only when age $= 0$*. Since the interaction is $0.68$, the gap at age $a$ is $5.03 + 0.68\,a$, very different from $5.03$ across the data range. This is the prof's flagged 2025 Q2 interaction trap.
- True — male age slope = $0.11 + 0.68 = 0.79$ kg/year.
- True — the $\text{age}\cdot\text{sex}$ term is precisely what allows different age slopes per sex.
Atoms: categorical-encoding-and-interactions. Lecture: L27-summary.
Question 4
4 points
ISLP §3 Q4
With $n=100$ observations of a single predictor $X$ and quantitative response $Y$, you fit
Model A: $\hat y = \hat\beta_0 + \hat\beta_1 x$, and Model B: $\hat y = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2 + \hat\beta_3 x^3$.
Suppose the *true* relationship is linear. What can you say about the **training** RSS of the two models?
- A Training RSS is the same, since the truth is linear and OLS is unbiased.
- B Training RSS is lower for Model B, because adding parameters cannot raise training RSS.
- C Training RSS is lower for Model A, because the cubic terms add irreducible noise to the residuals.
- D Cannot be determined without seeing the test set, since RSS depends on the held-out split.
Correct answer: B
Model A is nested inside Model B (set $\beta_2 = \beta_3 = 0$). On the training data, OLS picks the coefficients that *minimise* RSS over the larger parameter space, so Model B's training RSS $\le$ Model A's. The prof's flagged keyword is training — for test RSS it would flip.
A misses that adding parameters lets the optimiser fit noise — the truth being linear matters for population-level bias, not for finite-sample training fit. C confuses irreducible noise with what the model *fits to*; cubic terms don't add noise, they absorb residuals. D conflates training and test — training RSS is computed only on the training split and is fully determined by the fit.
Atoms: polynomial-regression, r-squared, bias-variance-tradeoff. Lecture: L27-summary.
Question 5
3 points
ISLP §3 Q4
Same setup as Q4 (true relationship is linear, $n=100$). What do you expect for the **test** RSS of Model A versus Model B?
- A Lower for Model A on average; Model B overfits noise.
- B Lower for Model B on average; more flexibility always helps prediction.
- C Identical, because both models contain the truth.
- D The cubic and linear models give exactly the same predictions, so test RSS is identical.
Correct answer: A
The cubic model has the same bias as the linear model (both contain the truth) but extra variance from estimating two unnecessary coefficients, so on average test RSS is lower for the simpler Model A.
B inverts the bias-variance message — when the truth is in the simpler model, the simpler model wins. C ignores the variance term — Model B's noisier coefficient estimates inflate prediction MSE. D is wrong: with finite samples $\hat\beta_2, \hat\beta_3$ are nonzero point estimates, so Model B's predictions differ from Model A's.
Atoms: polynomial-regression, bias-variance-tradeoff.
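A simulation sketch covering both Q4 and Q5 (data, noise level, and seed all invented for illustration): the truth is linear, and we average training and test RSS for the linear and cubic fits over many replicates.

```python
import numpy as np

rng = np.random.default_rng(1)

def design(x, degree):
    # Columns (1, x, ..., x^degree).
    return np.column_stack([x**d for d in range(degree + 1)])

def rss(X, y, beta):
    r = y - X @ beta
    return r @ r

n, reps = 100, 2000
train_rss = {1: 0.0, 3: 0.0}
test_rss = {1: 0.0, 3: 0.0}
for _ in range(reps):
    x_tr, x_te = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    y_tr = 1 + 2 * x_tr + rng.normal(0, 1, n)  # truth is linear
    y_te = 1 + 2 * x_te + rng.normal(0, 1, n)
    for deg in (1, 3):
        Xtr = design(x_tr, deg)
        beta = np.linalg.solve(Xtr.T @ Xtr, Xtr.T @ y_tr)
        train_rss[deg] += rss(Xtr, y_tr, beta) / reps
        test_rss[deg] += rss(design(x_te, deg), y_te, beta) / reps

print(train_rss)  # cubic strictly lower on the training split
print(test_rss)   # cubic higher on average: extra variance, no bias payoff
```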
Question 6
5 points
Exam 2025 P4a
A Boston-housing-style regression of median home value (medv, in $\$1000$) on nine predictors plus an $\text{rm}^2$ quadratic term gives the following partial output for the $\text{chas}$ indicator (1 if tract bounds the Charles River, 0 otherwise):
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|------|------|------|------|------|
| chas | 3.36 | 0.86 | 3.91 | 0.0001 |
Using the working approximation $t_{0.975, n-p-1}\approx 2$, the approximate 95% confidence interval for the chas effect is closest to:
- A $[3.36 \pm 0.86]$, i.e. $[2.50,\ 4.22]$ thousand dollars.
- B $[3.36 \pm 1.72]$, i.e. $[1.64,\ 5.08]$ thousand dollars.
- C $[3.36 \pm 7.82]$, i.e. $[-4.46,\ 11.18]$ thousand dollars.
- D $[0.0001,\ 3.36]$ thousand dollars.
Correct answer: B
$\hat\beta \pm t_{0.975}\cdot\mathrm{SE} \approx 3.36 \pm 2\cdot 0.86 = 3.36 \pm 1.72 \to [1.64, 5.08]$. Houses bounding the river fetch on average $\$1{,}640$–$\$5{,}080$ more, holding other predictors constant.
A drops the $t$ multiplier, using one SE rather than two. C plugs the t-value $3.91$ in where the SE belongs ($2 \cdot 3.91 = 7.82$), a confusion of output columns. D treats the p-value as a confidence bound — p-values and CIs are different objects.
Atoms: confidence-and-prediction-intervals, sampling-distribution-of-beta, t-test-and-significance. Lecture: L27-summary.
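A sanity check of the arithmetic, plus the exact quantile for comparison (the df of 494 is a hypothetical stand-in, since the output above does not report $n$):

```python
import numpy as np
from scipy import stats

est, se = 3.36, 0.86
print(est + np.array([-1, 1]) * 2 * se)   # [1.64, 5.08] with the t ~= 2 shortcut

# Exact quantile for a hypothetical df of 494; it barely moves the interval.
t_crit = stats.t.ppf(0.975, 494)          # ~1.965
print(est + np.array([-1, 1]) * t_crit * se)
```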
Question 7
3 points
Exam 2025 P4a
A linear model is fit with an intercept, 7 continuous predictors, a 9-level factor (entered as dummies with one reference level), and one quadratic term I(rm^2). How many regression parameters does the fitted model consume (counting the intercept)?
- A $7 + 9 + 1 = 17$.
- B $1 + 7 + 8 + 1 = 17$.
- C $1 + 7 + 9 + 1 = 18$.
- D $1 + 7 + 8 + 2 = 18$.
Correct answer: B
1 intercept + 7 continuous slopes + $K-1 = 8$ dummies for the 9-level factor + 1 for $\text{rm}^2$ = 17.
A forgets the intercept entirely. C uses $K=9$ dummies instead of $K-1$ — the reference level is absorbed into the intercept (otherwise $\mathbf{X}^\top\mathbf{X}$ is singular). D double-counts the quadratic term as if both $\text{rm}$ and $\text{rm}^2$ were *new*; each consumes one slot, and $\text{rm}$ is already among the 7 continuous predictors.
Atoms: categorical-encoding-and-interactions, design-matrix-and-hat-matrix.
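A minimal pandas check of the dummy count (the factor itself is made up):

```python
import pandas as pd

g = pd.Series(list("ABCDEFGHI") * 6)          # a 9-level factor, all levels present
dummies = pd.get_dummies(g, drop_first=True)  # reference level absorbed: K - 1 = 8
print(dummies.shape[1])                       # 8
print(1 + 7 + dummies.shape[1] + 1)           # 17 parameters in total
```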
Question 8
A multiple regression is fit with $n = 200$ observations and $p = 9$ slope parameters. For one predictor, the output gives $\hat\beta = 0.40$ and $\mathrm{SE}(\hat\beta) = 0.20$. Using the working cutoff $t_{0.975, 190}\approx 1.97$, is this coefficient significant at the 5% level, and what is its t-statistic?
- A $t = 2.0$; significant at the 5% level.
- B $t = 0.5$; not significant at the 5% level.
- C $t = 0.08$; not significant at the 5% level.
- D $t = 2.0$; not significant because $|t|$ has to exceed $t_{0.99}\approx 2.6$.
Correct answer: A
$t = \hat\beta / \mathrm{SE}(\hat\beta) = 0.40 / 0.20 = 2.0 > 1.97$, so reject $H_0:\beta=0$ at $\alpha=0.05$.
B inverts numerator and denominator (computes $\mathrm{SE}/\hat\beta = 0.5$). C multiplies instead of dividing ($\hat\beta\cdot\mathrm{SE} = 0.08$). D applies the wrong cutoff (the two-sided 1% quantile instead of 5%).
Atoms: t-test-and-significance, sampling-distribution-of-beta.
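To verify both the statistic and the cutoff (scipy gives the exact quantile behind the $\approx 1.97$):

```python
from scipy import stats

beta_hat, se = 0.40, 0.20
t = beta_hat / se                     # 2.0
t_crit = stats.t.ppf(0.975, 190)      # ~1.9726, the cutoff in the question
print(abs(t) > t_crit)                # True: significant at the 5% level
print(2 * stats.t.sf(abs(t), 190))    # two-sided p-value, just under 0.05
```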
Question 9
4 points
CE1 P2g
For a t-test with null $H_0:\beta_j = 0$ and observed two-sided p-value $p$, mark each statement as true or false.
- False — the p-value is computed *conditional on* $H_0$; it is not a posterior probability of $H_0$.
- False — failing to reject $H_0$ is "insufficient evidence", not "$H_1$ is false".
- True — this is the canonical correct definition.
- False — under $H_0$ "everything is random chance" by assumption; the p-value is a tail probability under that assumption, not a probability of "results being chance".
Atoms: t-test-and-significance.
Question 10
A regression on $n = 50{,}000$ observations gives $\hat\beta = 0.003$ kg/year for a body-fat predictor, with $p < 10^{-6}$. The investigator concludes that the effect is large because the p-value is tiny. The best critique is:
- A A p-value below $10^{-6}$ on a sample this large proves that the linear model is correctly specified, so the conclusion about effect size is logically sound.
- B The effect must be large because the p-value is so small; at sample sizes in the tens of thousands, effect size and statistical significance carry essentially the same information.
- C The p-value is invalid because it falls below the conventional reporting threshold of $0.001$ and should be re-computed with a more conservative correction.
- D A small p-value is evidence the effect is real, but the effect size is the slope, not the p-value; $0.003$ kg/year is practically negligible.
Correct answer: D
The prof's "significance is just sample size" sermon: with huge $n$ the SE shrinks like $1/\sqrt n$, so any non-zero effect becomes statistically significant. Practical (effect-size) and statistical (p-value) significance are different — you want both, ideally.
A confuses significance with model adequacy — diagnostics, not p-values, check specification. B is the canonical conflation the prof warns against. C invents a non-existent reporting rule.
Atoms: t-test-and-significance, sampling-distribution-of-beta. Lecture: L05-linreg-1.
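A simulation sketch of the sermon (slope, noise level, and seed all invented): the same tiny slope goes from unremarkable to wildly significant as $n$ grows, because the SE shrinks like $1/\sqrt n$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
beta = 0.003                              # tiny, practically negligible slope
for n in (500, 50_000):
    x = rng.normal(0, 10, n)
    y = beta * x + rng.normal(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    t = b[1] / se
    # Slope estimate stays tiny; SE shrinks ~ 1/sqrt(n); p typically collapses.
    print(n, b[1], se, 2 * stats.t.sf(abs(t), n - 2))
```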
Question 11
Recall the simple-LR formula $\mathrm{SE}(\hat\beta_1)^2 = \sigma^2 / \sum_i (x_i - \bar x)^2$. Mark each statement about $\mathrm{SE}(\hat\beta_1)$ as true or false.
- False — variance scales like $1/n$, so SE scales like $1/\sqrt n$. Doubling $n$ divides SE by $\sqrt 2$, not by $2$ — this is the canonical "$\sqrt n$ rate" trap.
- True — the denominator $\sum(x_i - \bar x)^2$ grows, shrinking SE; this is the prof's "design wider experiments" lever.
- False — collinearity *inflates* SE: $\mathbf{X}^\top\mathbf{X}$ becomes near-singular, $(\mathbf{X}^\top\mathbf{X})^{-1}$ blows up.
- True — SE is proportional to $\sigma$.
Atoms: sampling-distribution-of-beta, collinearity. Lecture: L05-linreg-1.
Question 12
At a fixed test point $\mathbf{x}_0$, you can construct (a) a 95% confidence interval for $\mathbf{x}_0^\top\boldsymbol\beta$ and (b) a 95% prediction interval for a future $Y$ at $\mathbf{x}_0$. Which of the following correctly contrasts them?
- A The CI captures uncertainty in $\hat{\boldsymbol\beta}$ only; the PI adds the irreducible noise variance $\sigma^2$, so PI is always wider.
- B The CI captures the true individual response; the PI captures the population mean, so PI is wider only when $\mathbf{x}_0$ is far from the data mean.
- C Both intervals carry the same uncertainty; the only difference is the nominal coverage level.
- D The PI is always narrower because it conditions on the observed $\mathbf{x}_0$ rather than averaging over data.
Correct answer: A
The CI for the *mean response* uses $\hat\sigma\sqrt{\mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}$; the PI for a *future observation* uses $\hat\sigma\sqrt{1 + \mathbf{x}_0^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_0}$ — that "+1" is the $\sigma^2$ contribution from $\varepsilon_{\text{new}}$.
B swaps the targets: the CI is for the mean, the PI is for an individual. C ignores the structurally different objects each interval covers. D inverts the inequality — PI is *always* wider, never narrower.
Atoms: confidence-and-prediction-intervals, sampling-distribution-of-beta.
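A numpy sketch of both intervals at one test point, on simulated data and with the $t \approx 2$ shortcut; the only difference between the two half-widths is the "+1":

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma_hat = np.sqrt(resid @ resid / (n - 2))

x0 = np.array([1.0, 5.0])                 # test point (intercept, x = 5)
q = x0 @ np.linalg.inv(X.T @ X) @ x0      # the quadratic form in both formulas
half_ci = 2 * sigma_hat * np.sqrt(q)      # CI for the mean response
half_pi = 2 * sigma_hat * np.sqrt(1 + q)  # PI: the "+1" is the new epsilon
print(half_ci, half_pi)                   # the PI is always wider
```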
Question 13
Mark each statement about $R^2$ and $R^2_{\text{adj}}$ in OLS as true or false.
- True — the larger model contains the smaller as a special case ($\beta_{\text{new}} = 0$), so OLS minimisation cannot raise RSS.
- False — that's the property of ordinary $R^2$. Adjusted $R^2$ has the $(n-1)/(n-p-1)$ penalty, so it *can* fall when added parameters do not pull their weight; that's exactly what makes it a model-selection signal rather than just a goodness-of-fit number.
- True — a standard simple-LR identity.
- False — $R^2$ is the *fraction of variance* in $Y$ explained, not a classification accuracy.
Atoms: r-squared, linear-regression.
Question 14
A linear model is fit to $n = 8$ observations with total sum of squares $\mathrm{TSS} = 800$. The fitted residuals give $\mathrm{RSS} = 200$. What is $R^2$?
- A $0.20$.
- B $0.25$.
- C $0.75$.
- D $4.00$.
Correct answer: C
$R^2 = 1 - \mathrm{RSS}/\mathrm{TSS} = 1 - 200/800 = 0.75$.
A divides by $\mathrm{TSS} + \mathrm{RSS} = 1000$ instead of $\mathrm{TSS}$, giving $200/1000 = 0.20$. B reports $\mathrm{RSS}/\mathrm{TSS}$ instead of $1 - \mathrm{RSS}/\mathrm{TSS}$ (forgets the $1-$). D inverts the ratio to $\mathrm{TSS}/\mathrm{RSS}$.
Atoms: r-squared.
Question 15
After fitting an OLS regression you produce a QQ plot of standardised residuals against theoretical normal quantiles. The middle of the plot lies on the reference line, but both tails curl strongly upward at the high end and downward at the low end (heavy "S-shape"). Which assumption is most directly called into question?
- A Independence of the errors $\varepsilon_i$ across observations in time or space.
- B Normality of the errors — the residuals look heavy-tailed relative to a Gaussian.
- C Linearity of the conditional mean $E[Y\mid X]$ as a function of the predictors.
- D Constancy (homoscedasticity) of $\sigma^2$ across the full range of fitted values.
Correct answer: B
An S-curve with extreme tails pulling away from the reference line is the canonical "heavy tails / non-Gaussian errors" pattern on a QQ plot. The bulk being aligned says the centre of the distribution looks roughly normal; the deviations are at the extremes.
A would show as structure in the residuals-vs-fitted or order plots, not in the QQ plot. C shows up as curvature in residuals-vs-fitted. D shows up as a fan in residuals-vs-fitted or scale-location plots, not as an S in the QQ plot.
Atoms: residual-diagnostics, gaussian-error-assumptions. Lecture: L06-linreg-2.
Question 16
In simple linear regression with $n = 5$ predictor values $\mathbf{x} = (1, 2, 3, 4, 10)$, what is the leverage $h_{55}$ of the fifth observation ($x_5 = 10$)? Recall $h_{ii} = \tfrac{1}{n} + (x_i - \bar x)^2 / \sum_j (x_j - \bar x)^2$.
- A $0.20$.
- B $0.72$.
- C $0.92$.
- D $1.20$.
Correct answer: C
$\bar x = (1+2+3+4+10)/5 = 4$. Deviations $(x_i - \bar x) = (-3, -2, -1, 0, 6)$, so $\sum(x_j-\bar x)^2 = 9 + 4 + 1 + 0 + 36 = 50$. Then $h_{55} = 1/5 + 36/50 = 0.20 + 0.72 = 0.92$. The fifth observation has very high leverage — it sits at the far end of $x$.
A reports only the baseline $1/n$ term and forgets the $(x_i - \bar x)^2$ contribution. B reports just the second term $36/50 = 0.72$ and forgets the $1/n$ baseline. D adds the prediction-interval "+1" to the $1/n$ baseline ($0.20 + 1 = 1.20$) — and is impossible on its face, since leverage can never exceed $1$.
Atoms: residual-diagnostics, design-matrix-and-hat-matrix.
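The computation in numpy, using the exact $x$ values from the question:

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 10])
X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
print(np.diag(H))    # last entry: h_55 = 1/5 + 36/50 = 0.92
print(np.trace(H))   # 2.0 = p + 1 (intercept + one slope), never n
```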
Question 17
On a residuals-vs-leverage plot, four points are highlighted. Which one is *most* dangerous for the OLS fit (i.e. exerts the largest influence on $\hat{\boldsymbol\beta}$)?
- A A point with low leverage and a large residual.
- B A point with high leverage and a residual close to zero.
- C A point with high leverage and a large residual.
- D A point with low leverage and a residual close to zero.
Correct answer: C
Cook's distance combines both: $D_i \propto \tilde r_i^2 \cdot h_{ii}/(1-h_{ii})$. The dangerous corner is high leverage *and* large residual — the "fat kid far from centre on the seesaw" image.
A has no lever arm — the point can have a large residual but limited pull on the line. B is "harmless high leverage" — the point sits on the fitted line, so it doesn't bend it. D has neither lever nor residual; the safest case.
Atoms: residual-diagnostics. Lecture: L08-classif-2.
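A deterministic sketch: take the $x$ values from the previous question, put the truth exactly on a line, then perturb only the high-leverage point and compute Cook's distance from the formula above.

```python
import numpy as np

x = np.array([1.0, 2, 3, 4, 10])
y = 2 + 0.5 * x
y[-1] += 3.0                              # perturb the high-leverage point

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
resid = (np.eye(5) - H) @ y               # OLS residuals
p1 = X.shape[1]                           # p + 1 = 2
sigma2 = resid @ resid / (len(x) - p1)
r = resid / np.sqrt(sigma2 * (1 - h))     # standardised residuals
D = r**2 / p1 * h / (1 - h)               # Cook's distance
print(D)   # the x = 10 point dominates: ~17 vs < 0.5 for the rest
```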
Question 18
Two predictors $x_1$ and $x_2$ are highly correlated (sample correlation $\approx 0.99$). Mark each statement about the OLS fit as true or false.
- True — collinearity makes $\mathbf{X}^\top\mathbf{X}$ near-singular, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ has large diagonal entries → SE explodes.
- False — under the classical assumptions, OLS is unbiased even under collinearity; only the variance is inflated.
- True — the canonical reason to do the joint F-test before drilling into per-coefficient t's.
- False — predictions are typically *stable*, not unreliable: even when individual coefficients have huge SEs, the *sum* $\beta_1 x_1 + \beta_2 x_2$ is well-determined, so $\hat y$ on data similar to the training set is fine. Collinearity hurts inference about individual $\beta_j$, not in-sample prediction.
Atoms: collinearity, sampling-distribution-of-beta, multivariate-normal (the joint $N_{p+1}(\boldsymbol\beta, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$ distribution of $\hat{\boldsymbol\beta}$ is the foundation for inflated SEs under collinearity).
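A simulation sketch of the last two points (correlation level, coefficients, and seed are ours): per-coefficient SEs blow up, yet the sum of the two slopes, and hence the predictions, stays pinned down.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(0, 1, n)
x2 = x1 + rng.normal(0, 0.1, n)   # corr(x1, x2) ~ 0.995
y = 1 + x1 + x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
sigma2 = resid @ resid / (n - 3)
se = np.sqrt(sigma2 * np.diag(XtX_inv))

print(beta[1:], se[1:])   # individual slopes noisy, SEs inflated
print(beta[1] + beta[2])  # the sum stays near 2, so predictions are stable
```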
Question 19
You want to test whether *any* of the predictors in a multiple regression with $p = 5$ slopes contributes to predicting $Y$. Which is the appropriate procedure?
- A Pick the smallest individual p-value and report it as the overall test.
- B Compute the residual standard error and reject if it is below $\sigma$.
- C Run an F-test on $H_0:\beta_1 = \cdots = \beta_5 = 0$ before drilling into individual t-tests.
- D Compare $R^2$ between the full model and the intercept-only model and reject if $R^2 > 0.5$.
Correct answer: C
The F-test on the joint null is exactly the question being asked. It is also robust to collinearity: individual t's can both fail while the joint F is highly significant. Note: you are not expected to compute the F-statistic on this exam — only to know what null it tests and why you'd reach for it.
A inflates the family-wise false-positive rate (the multiple-testing trap), and ignores collinearity-masked signals. B is meaningless — RSE estimates $\sigma$ but doesn't test "any predictor matters". D invents a non-existent rejection rule.
Atoms: f-test, t-test-and-significance, collinearity. Lecture: L06-linreg-2.
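Not examinable per the note above, but for completeness, the F-statistic from its RSS/TSS form on simulated data (seed and coefficients invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = 1 + 0.5 * X[:, 1] + rng.normal(0, 1, n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
rss = np.sum((y - X @ beta) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta_1 = ... = beta_5 = 0.
F = ((tss - rss) / p) / (rss / (n - p - 1))
print(F, stats.f.sf(F, p, n - p - 1))
```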
Question 20
4 points
CE1 P2c
Earthworm weight $Y$ is modelled on stomach circumference $X$ (continuous) and genus $G \in \{L, N, Oc\}$ (factor; L as reference) via $\hat Y = \hat\beta_0 + \hat\beta_1 X + \hat\beta_2 \mathbf{1}_{G=N} + \hat\beta_3 \mathbf{1}_{G=Oc}$. The fitted equation for genus Oc is:
- A $\hat Y = \hat\beta_0 + \hat\beta_1 X$.
- B $\hat Y = (\hat\beta_0 + \hat\beta_3) + \hat\beta_1 X$.
- C $\hat Y = (\hat\beta_0 + \hat\beta_2) + \hat\beta_1 X$.
- D $\hat Y = (\hat\beta_0 + \hat\beta_2 + \hat\beta_3) + \hat\beta_1 X$.
Correct answer: B
For genus Oc, $\mathbf{1}_{G=N} = 0$ and $\mathbf{1}_{G=Oc} = 1$, so the dummy contribution is $\hat\beta_3$ added to the intercept. Slope $\hat\beta_1$ is the same across genera (no interaction).
A is the reference-genus equation (L). C is genus N. D adds both dummies as if they could be active simultaneously — but each observation belongs to exactly one genus.
Atoms: categorical-encoding-and-interactions.
Question 21
A categorical predictor takes $K = 4$ levels. Why does R encode it as $3$ dummy columns rather than $4$?
- A Because the column of ones plus $4$ dummies is linearly dependent, so $\mathbf{X}^\top\mathbf{X}$ is singular and the fit is unidentifiable.
- B Because using all $4$ dummies would force the four levels onto an artificial $0$-$1$-$2$-$3$ ordering, biasing the slope estimates of the unordered categories.
- C Because R automatically reserves one design-matrix column for the residual-variance estimate $\hat\sigma^2$ used in standard-error calculations.
- D Because the F-test for the joint significance of this factor uses exactly $K-1$ degrees of freedom by definition, so the design matrix must match.
Correct answer: A
With $K$ dummies plus an intercept, the column of ones equals the sum of the four dummies — perfect collinearity, so $(\mathbf{X}^\top\mathbf{X})^{-1}$ doesn't exist. Drop one dummy to absorb the reference level into the intercept.
B confuses the $K$-dummy identifiability issue with the separate mistake of coding an unordered factor as a single numeric $0/1/2/3$ column (that trap imposes a false ordering, but it is a different error). C invents a non-existent design. D inverts cause and effect: the $K-1$ degrees of freedom are a *consequence* of identifiability, not the reason for it.
Atoms: categorical-encoding-and-interactions, design-matrix-and-hat-matrix, collinearity.
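A rank check in numpy (a made-up 4-level factor): with all $K$ dummies next to the intercept the design loses a rank; drop one and it is full rank again.

```python
import numpy as np

levels = np.repeat(np.arange(4), 10)   # 4-level factor, 40 observations
all_dummies = np.eye(4)[levels]        # one column per level (K = 4)
X_bad = np.column_stack([np.ones(40), all_dummies])
X_ok = np.column_stack([np.ones(40), all_dummies[:, 1:]])  # drop the reference

print(np.linalg.matrix_rank(X_bad), X_bad.shape[1])  # 4 < 5: X'X is singular
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])    # 4 = 4: identifiable
```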
Question 22
Mark each statement about interaction terms in linear regression as true or false.
- True — the prof's flagged "non-negotiable" main-effects rule. Including only $X\cdot Z$ is almost always wrong.
- True — the canonical interaction trap. Don't read it as a global average.
- False — interactions are what allow different slopes. The "different intercepts only" picture is the no-interaction model.
- False — the interaction t-test is on the *interaction coefficient*, which measures the slope *difference* between groups, not the main effect of $X$. A small p-value on $X\cdot Z$ tells you the slopes differ, not that the average slope is non-zero.
Atoms: categorical-encoding-and-interactions. Lecture: L06-linreg-2.
Question 23
Hourly temperature is regressed on time of day. Sampling resolution is then refined from hourly to every $5$ minutes (so $n$ grows by $\sim 12\times$). Reported t-statistics for the time slope shoot up; the p-value drops to $<10^{-9}$. The most accurate critique is:
- A Once $n$ exceeds the number of distinct hours in the original data, the OLS point estimate $\hat\beta$ becomes biased and the slope estimate cannot be trusted.
- B An $R^2$ that exceeds $0.95$ on this many points indicates overfitting; the appropriate fix is to reduce model flexibility (drop predictors or use ridge).
- C The Gaussian-error assumption is broken because the response (temperature in Celsius) is bounded both below and above, so the t-distribution does not apply.
- D Adjacent time bins are highly correlated; effective sample size is much smaller than $n$, so the SEs are underestimated and the reported significance is unreliable.
Correct answer: D
Independence (assumption 5) is violated by serial correlation. The OLS algebra still computes $\hat\beta$, but the reported uncertainty is wrong — the prof's "horseshit standard errors" critique. Effective $n$ is much less than apparent $n$.
A confuses bias of $\hat\beta$ with bias of inference; under correlated errors point estimates are still unbiased. B invokes a different problem (overfitting), unrelated to dependence. C is a stretch — the Gaussian assumption is robust to mild boundedness, and boundedness is not what changed when the sampling rate did.
Atoms: gaussian-error-assumptions, sampling-distribution-of-beta. Lecture: L05-linreg-1.
Question 24
Mark each statement about the residuals-vs-fitted plot as true or false.
- True — fanning means $\mathrm{Var}(\varepsilon_i)$ scales with the fitted value, breaking common-variance.
- True — systematic shape in residuals against fitted values says the conditional mean is mis-specified; add a transformation or polynomial.
- False — a flat band is *consistent with* the assumptions, not proof of them. Residuals-vs-fitted only checks linearity and constant variance; it can't see independence (look at order plots) and can't see normality (look at the QQ plot). "Consistent with" ≠ "proves".
Atoms: residual-diagnostics, gaussian-error-assumptions.
Question 25
Which statement best captures the difference between an *error* $\varepsilon_i$ and a *residual* $e_i = y_i - \hat y_i$ in the linear-regression model?
- A Errors are computed from the data after the fit, while residuals are the parameters that define the noise model.
- B Errors and residuals coincide whenever the fitted model contains an intercept term, so the distinction is bookkeeping only.
- C Residuals are unbiased for $\sigma^2$ but errors are biased, which is why the textbook variance estimator divides by $n-p-1$ rather than $n$.
- D Errors are unobservable random variables; residuals are observed predictions of those errors.
Correct answer: D
The prof's distinction: $\varepsilon_i$ is the *random* deviation in the population model, never directly observed. $e_i$ is the *observed* prediction of that error after fitting. Raw residuals additionally have $\mathrm{Cov}(\mathbf{e}) = \sigma^2(\mathbf{I}-\mathbf{H})$, which is why we standardise them for diagnostics.
A inverts the two roles. B is wrong even with an intercept — residuals are predictions of unobserved errors, not equal to them. C mixes up bias of an estimator (variance) with the variables themselves.
Atoms: gaussian-error-assumptions, residual-diagnostics.
Question 26
4 points
Exam 2024 P3
Under the linear model $y_i = \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i$ with $\varepsilon_i \overset{\text{iid}}\sim N(0, \sigma^2)$, what is the *correct* short justification that the least-squares estimator equals the maximum-likelihood estimator?
- A Because $\hat{\boldsymbol\beta}$ is unbiased under the Gaussian model, and any unbiased estimator of a parameter automatically maximises the joint likelihood.
- B Because $\log L \propto -\sum_i (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2$ in $\boldsymbol\beta$, so maximising $\log L$ is equivalent to minimising the SSE — the LS objective.
- C Because the OLS closed form $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ exists in one matrix multiplication, and closed-form estimators coincide with MLEs by definition.
- D Because the Gaussian density is symmetric, and any symmetric-loss minimiser is automatically an MLE under the symmetric likelihood.
Correct answer: B
The prof's flagged mathy template (verbatim from L27 / 2024 exam P3): write the log-likelihood, drop terms not depending on $\boldsymbol\beta$, identify the SSE, conclude. About 6–10 lines on the exam.
A confuses unbiasedness with the MLE property — they're different. C confuses computational convenience with the optimisation objective — closed form is a property of the algebra, not the principle. D is nonsense — symmetry is not a sufficient condition for being an MLE.
Atoms: least-squares-and-mle, gaussian-error-assumptions. Lecture: L27-summary.
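The template itself, as a LaTeX sketch of the chain the answer describes:

```latex
\begin{align*}
L(\boldsymbol\beta, \sigma^2)
  &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Big(-\frac{(y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2}{2\sigma^2}\Big) \\
\log L
  &= -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2
\intertext{The first term does not involve $\boldsymbol\beta$, so}
\arg\max_{\boldsymbol\beta} \log L
  &= \arg\min_{\boldsymbol\beta}
     \sum_{i=1}^n (y_i - \mathbf{x}_i^\top\boldsymbol\beta)^2
   = \hat{\boldsymbol\beta}_{\text{LS}}.
\end{align*}
```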
Question 27
Differentiating the matrix-form residual sum of squares $\mathrm{RSS}(\boldsymbol\beta) = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)$ with respect to $\boldsymbol\beta$ and setting the result to zero gives:
- A $\mathbf{X}^\top\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{X}^\top\mathbf{y}$ (the normal equations).
- B $\mathbf{X}\,\hat{\boldsymbol\beta} = \mathbf{y}$, i.e. the fitted values reproduce the observed response exactly with no residual.
- C $\hat{\boldsymbol\beta} = \mathbf{y}^\top\mathbf{X} / (\mathbf{X}^\top\mathbf{X})$, scalar division of the cross-product by the Gram matrix.
- D $\hat{\boldsymbol\beta} = \mathbf{X}\,(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{y}$, with the design matrix premultiplying the inverse.
Correct answer: A
$\partial\mathrm{RSS}/\partial\boldsymbol\beta = -2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\boldsymbol\beta = \mathbf{0}$ gives the normal equations. Solving (when $\mathbf{X}^\top\mathbf{X}$ is invertible) yields $\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.
B is the residual-equals-zero condition, which only holds for $n \le p+1$ (degenerate). C reorders the matrix product and uses scalar division for matrices — algebraically illegal. D mis-orders the inverse and projection: $(\mathbf{X}^\top\mathbf{X})^{-1}$ is followed by $\mathbf{X}^\top$, not premultiplied by $\mathbf{X}$.
Atoms: least-squares-and-mle, design-matrix-and-hat-matrix.
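A numpy check that solving the normal equations agrees with the library least-squares routine (simulated data, coefficients ours):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)

# Solve the normal equations X'X beta = X'y directly ...
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# ... and compare with the numerically preferred least-squares routine.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_ne, beta_ls))  # True
```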
Question 28
Let $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ be the hat matrix in OLS. Mark each statement as true or false.
- True — useful: leverage can be computed from the design alone, before $\mathbf{y}$ is observed.
- False — $\sum h_{ii} = \mathrm{tr}(\mathbf{H}) = p + 1$, the number of fitted parameters (counting the intercept), *not* $n$. The trace of a projection matrix equals the dimension of the space it projects onto, and that's $p+1$ here.
- True — $\mathbf{H}$ is the orthogonal projection onto the column space of $\mathbf{X}$.
Atoms: design-matrix-and-hat-matrix, residual-diagnostics.
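A numeric spot-check of the trace and projection claims with a random design (dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.isclose(np.trace(H), p + 1))  # trace = number of parameters, not n
print(np.allclose(H @ H, H))           # idempotent: H is a projection
print(np.allclose(H, H.T))             # symmetric: an *orthogonal* projection
```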
Question 29
A regression of life expectancy on years of education estimates $\hat\beta = 0.6$ years of life per year of schooling, with a small p-value. Which of the following claims is best supported by this OLS fit alone?
- A Holding all other factors fixed, increasing schooling by one year causes the average person's life expectancy to rise by $0.6$ years.
- B Among the predictors that influence life expectancy in this population, years of education is the single dominant driver.
- C In the data, more-educated individuals have on average $0.6$ more years of life expectancy per year of schooling, controlling for whatever else is in the model.
- D Years of schooling and life expectancy are statistically independent at the population level, despite the fitted slope.
Correct answer: C
Regression measures conditional association, not causation — the prof's "fancy correlations" line. The slope is the average change in $\hat Y$ associated with a one-unit change in the predictor, conditional on the other predictors in the model. Lurking variables, reverse causation, and selection effects can all generate a non-zero slope.
A reads the slope as causal — the canonical exam-flagged mistake. B overstates: a significant non-zero slope is consistent with many populations, including ones where education is a minor driver. D contradicts the stated significant fit.
Atoms: linear-regression, t-test-and-significance. Lecture: L05-linreg-1.