Module 07 — Moving Beyond Linearity
22 questions · 100 points · ~35 min
Why do polynomial regression, step functions, and cubic regression splines
all qualify as "linear regression" in this course, even though their
fitted curves are not straight lines?
- A The fitted curve $\hat f(x)$ is itself a linear function of $x$ once you average over the basis terms.
- B The model is linear in the parameters $\boldsymbol\beta$, so OLS on the basis-expanded design matrix still applies.
- C The least-squares fit is unique only when the relationship between $X$ and $Y$ is linear; the basis is just a numerical convenience.
- D The residuals are forced to be Gaussian, which is the defining property of linear regression.
Show answer
Correct answer: B
The whole module-7 trick: replace $X$ with $b_j(X)$ to get $y = \beta_0 + \sum_j \beta_j b_j(x) + \varepsilon$. The model is still linear in $\boldsymbol\beta$, so the closed form $\hat\beta = (X^TX)^{-1}X^Ty$, sampling distribution, CIs and t-tests all carry over unchanged.
A confuses "linear in $\beta$" with "linear in $x$" — the prof explicitly flagged this trap. C invents a uniqueness condition that has nothing to do with basis expansion. D names a noise assumption (used for inference), not the reason fitting is OLS.
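A minimal numpy sketch of the point (toy data and coefficients are made up for illustration): the cubic-polynomial basis makes the fitted curve nonlinear in $x$, yet the coefficients still come straight out of the ordinary least-squares closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=200)   # toy data

# Basis expansion: replace x with b_j(x) = x^j, then run plain OLS.
# The curve is nonlinear in x, but the model is linear in beta.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]      # numerically stable OLS
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y from the card
print(np.allclose(beta_lstsq, beta_closed))            # True: same fit either way
```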
Atoms: basis-functions, polynomial-regression. Lecture: L16-beyondlinear-1.
Question 2
5 points
Exam 2025 P4e(i)
A regression model already includes an intercept. A cubic regression spline
on age is added with knots placed at the four quantiles
$\{0.2, 0.4, 0.6, 0.8\}$. Using the truncated-power basis
$x, x^2, x^3, (x - c_j)^3_+$, how many degrees of freedom does this spline
term consume on top of the intercept?
Show answer
Correct answer: D
For a cubic spline with $K$ knots: $K + d + 1 = K + 4$ parameters total (degree $d=3$ plus intercept). With $K = 4$ knots that is $4 + 4 = 8$. Subtract one because the intercept is already in the model: $8 - 1 = 7$.
A counts only the knots (forgets the three polynomial columns $x, x^2, x^3$). B is the count including the intercept — the trap when you don't notice the model already has one. C triple-counts (one polynomial term per knot).
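A quick numpy check of the count (toy ages, illustrative only): building the truncated-power columns for four knots at the stated quantiles gives seven columns on top of the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(18, 80, size=500)                    # toy ages
knots = np.quantile(age, [0.2, 0.4, 0.6, 0.8])         # four interior knots

# Truncated-power basis: x, x^2, x^3, plus one (x - c_j)^3_+ column per knot.
cols = [age, age**2, age**3] + [np.clip(age - c, 0, None) ** 3 for c in knots]
B = np.column_stack(cols)
print(B.shape[1])   # 3 + 4 = 7 degrees of freedom beyond the intercept
```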
Atoms: regression-splines, basis-functions.
Question 3
6 points
Exam 2025 P4d
In smoothing splines, increasing the smoothing parameter $\lambda$ will:
Show answer
- False — $\lambda \uparrow$ makes the fit smoother, not wigglier. At $\lambda \to \infty$ the curvature penalty dominates and you collapse to the OLS straight line. The opposite direction from polynomial degree (the canonical T/F trap).
- True — pushing $\lambda$ too high oversmooths and underfits; the limit is a straight line.
- True — that is what $\lambda$ multiplies in the objective $\sum (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$.
- True — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ moves in the same direction as flexibility, so $\lambda \uparrow \Rightarrow$ df $\downarrow$ (toward 2).
Sub-statements scored independently, $6/4 = 1.5$ points each. The (a) statement is the prof's flagged direction-flip trap from L27 Q3d.
Atoms: smoothing-splines, regularization. Lecture: L27-summary.
Question 4
4 points
Exam 2023 P3d
Which integral correctly expresses the smoothness penalty in the
smoothing-spline objective the lecturer wrote on the board?
- A $\lambda \int g'(t)^2 \, dt$
- B $\lambda \int |g(t)|\, dt$
- C $\lambda \int g(t)^2\, dt$
- D $\lambda \int g''(t)^2 \, dt$
Show answer
Correct answer: D
The penalty is integrated squared second derivative: $g''(t)$ measures how fast the slope changes (curvature), and $\int g''(t)^2\, dt$ aggregates curvature across the range. Memorise: second derivative.
A penalises slope, not curvature — a constant non-zero slope would already get penalised, which is wrong (a straight line should have zero penalty). B is an L1 functional norm, not the smoothing-spline objective. C penalises the function's magnitude, which would shrink toward zero, not toward a straight line.
Atoms: smoothing-splines.
Question 5
4 points
Exam 2024 P2c
A covariate is included in a regression model as a natural cubic
spline with three interior knots. The model already has its own intercept
column. How many degrees of freedom does this spline term consume?
Show answer
Correct answer: C
A natural cubic spline forces linearity past the boundary knots, costing two constraints relative to a plain cubic spline. Param count: plain cubic with $K$ knots = $K+4$; natural cubic = $K+4-2 = K+2$ total, or $K+1$ when an intercept already lives in the model. With $K = 3$ that is $3 + 1 = 4$.
A counts only the knots ($K = 3$ alone) and forgets the linear column $x$ that the natural-spline basis still adds on top of the intercept. B is the plain-cubic-spline count on top of the intercept ($K + 3$; confuses natural and plain cubic). D is the plain cubic including the intercept ($K+4$): both errors at once.
Atoms: regression-splines.
Mark each statement about an additive model
$y = \beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + \varepsilon$
as true or false.
Show answer
- True — the additive form combines marginal contributions; the cross-product $g(x_1)\cdot x_2$ is genuinely not in the function class. You'd need an explicit $f_{12}(x_1, x_2)$ term, which the prof flagged as the GAM-to-trees motivation.
- False — Exercise 7.5 mixes five different $f_j$ types in one model: cubic spline on displacement, polynomial on horsepower, linear on weight, smoothing spline on acceleration, factor on origin. The point of GAMs is per-predictor flexibility.
- True — verbatim from L17: "they will actually center all of the data and then when you predict for each value you're essentially seeing the other variables set to their mean values."
- True — that's exactly the construction in Exercise 7.4: $\mathbf X = (\mathbf 1, \mathbf X_1, \mathbf X_2, \mathbf X_3)$ then $\hat\beta = (X^TX)^{-1}X^Ty$. Backfitting is only needed when an s() or lo() component is included.
Atoms: generalized-additive-models. Lecture: L17-trees-1.
Suppose age is binned with cutpoints $c_1 = 30, c_2 = 50, c_3 = 65$
and a step-function regression is fitted. Counting the intercept, how many
parameters does the model have, and what is the shape of the fitted
function?
- A 3 parameters; the fit is piecewise linear and continuous at every cutpoint.
- B 4 parameters; the fit is piecewise constant and may jump at every cutpoint.
- C 3 parameters; the fit is piecewise constant with one indicator per cutpoint and no separate intercept column.
- D 6 parameters; one slope and one intercept inside each of the three bins.
Show answer
Correct answer: B
$K$ cutpoints give $K+1$ bins; $K$ indicator dummies plus the intercept = $K+1$ parameters. Here $K=3$, so 4 parameters. The fit is one constant per bin and is not connected — "piecewise constant, but it's not connected, it can jump."
A confuses step functions with linear splines (no slopes inside bins, no continuity). C miscounts: with $K$ cutpoints you need $K$ indicators plus the intercept column, so 4 parameters, not 3 (with only 3 indicators and no intercept, the baseline bin's mean is forced to zero rather than estimated). D is a piecewise-linear regression with separate slopes per bin (a different model entirely; it would have 6 parameters, not 4).
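A sketch of the design matrix with made-up ages: three cutpoints produce the intercept plus three bin indicators, four parameters in total, and the fit is one constant per bin.

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(18, 90, size=400)          # toy ages
cuts = [30, 50, 65]                          # K = 3 cutpoints -> 4 bins
edges = cuts + [np.inf]

# Intercept (baseline bin: age < 30) plus one indicator per remaining bin.
X = np.column_stack(
    [np.ones_like(age)]
    + [((lo <= age) & (age < hi)).astype(float) for lo, hi in zip(cuts, edges[1:])]
)
print(X.shape[1])   # 4 parameters; the fit is one constant per bin, jumps allowed
```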
Atoms: step-functions, basis-functions.
You fit polynomials of degrees $1$ through $10$ to predict mpg
from horsepower on a held-out test split. The training
MSE drops monotonically with degree. The test MSE has a clear
U-shape, bottoming out around degree 2. What is the most defensible
single-sentence interpretation?
- A Training MSE falls monotonically with model flexibility; test MSE is U-shaped because variance eventually dominates the bias gain, so pick degree 2 from the bottom of the U.
- B Training and test MSE always agree in shape; the U you see in test MSE is purely sampling noise from this particular train/test split.
- C Training MSE drops because higher polynomial degree increases the irreducible error term in the bias-variance decomposition.
- D Training MSE drops because higher polynomial degree lowers the variance term in the bias-variance decomposition.
Show answer
Correct answer: A
Adding parameters can never raise training MSE (the OLS solution at degree $d-1$ is feasible at degree $d$ with a zero coefficient). Test MSE follows the bias-variance U: bias falls with flexibility, variance rises, and irreducible error is constant. The bottom of the U is the bias-variance optimum.
B confuses train and test — the prof's flagged "keyword trap." C misuses irreducible error, which is a property of the noise, not the model. D inverts the bias-variance roles: variance rises, not falls, with model flexibility (bias is what falls).
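A self-contained toy version of the experiment (simulated data standing in for mpg vs horsepower, with a quadratic truth): training MSE can only fall with degree, while test MSE typically bottoms out near the true degree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-2, 2, size=n)                          # stand-in for horsepower (scaled)
y = 1.5 - x + 0.7 * x**2 + rng.normal(0, 0.5, n)        # stand-in for mpg, truth quadratic

idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

for d in range(1, 11):
    coef = np.polyfit(x[train], y[train], deg=d)        # OLS polynomial of degree d
    mse_tr = np.mean((y[train] - np.polyval(coef, x[train])) ** 2)
    mse_te = np.mean((y[test] - np.polyval(coef, x[test])) ** 2)
    print(d, round(mse_tr, 3), round(mse_te, 3))
# Training MSE is (weakly) monotone decreasing in d; test MSE is U-shaped.
```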
Atoms: polynomial-regression, bias-variance-tradeoff, cross-validation.
For a smoothing spline minimising
$\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$
on data with at least three distinct $x_i$, what is the effective
degrees of freedom $\mathrm{tr}(\mathbf S_\lambda)$ in the limit
$\lambda \to \infty$?
Show answer
Correct answer: C
As $\lambda \to \infty$ the curvature penalty forces $g''(t) \equiv 0$, so $g$ collapses to a straight line — the OLS line, with two parameters (intercept + slope). Effective df therefore tends to 2.
A would mean the fit is identically zero (penalising magnitude, not curvature). B is the constant-mean fit — a curvature penalty does not flatten slope, it flattens curvature. D is the opposite extreme ($\lambda = 0$), where $g$ interpolates every $y_i$.
Atoms: smoothing-splines.
Question 10
5 points
ISLP §7 Q3
We fit $Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \varepsilon$ with
$b_1(X) = X$, $b_2(X) = (X-1)^2 \, \mathbb{1}(X \ge 1)$ and obtain
$\hat\beta_0 = 1$, $\hat\beta_1 = 1$, $\hat\beta_2 = -2$. What does the
fitted curve look like over $X \in [-2, 2]$?
- A A single straight line of slope 1 through $(0, 1)$ over the whole interval.
- B Straight line $\hat y = 1 + X$ for $X < 1$; downward-opening parabola $\hat y = 1 + X - 2(X-1)^2$ for $X \ge 1$ (joined continuously at $X = 1$).
- C Straight line $\hat y = 1 + X$ for $X \ge 1$; downward parabola $\hat y = 1 + X - 2(X-1)^2$ for $X < 1$.
- D Constant 1 for $X < 1$; an upward-opening parabola for $X \ge 1$.
Show answer
Correct answer: B
$b_2(X) = 0$ when $X < 1$, so the fit is just $1 + X$ on $[-2, 1)$. For $X \ge 1$, $b_2(X) = (X-1)^2$ activates with coefficient $-2$, giving $\hat y = 1 + X - 2(X-1)^2$ — a downward-opening parabola because the leading coefficient $-2 < 0$. The two pieces match in value at $X=1$ since $b_2(1) = 0$.
A ignores $b_2$ entirely. C swaps which side the indicator is active on. D drops the linear $b_1$ piece on the left and gets the parabola direction wrong (a coefficient of $-2$ opens downward, not upward).
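To see the shape concretely, here is a small numpy evaluation of the fitted curve from the question over $[-2, 2]$: the line $1 + X$ until $X = 1$, then the downward parabola, joining continuously.

```python
import numpy as np

def y_hat(x):
    # b1(X) = X, b2(X) = (X - 1)^2 * 1(X >= 1); fitted coefficients 1, 1, -2
    b2 = np.where(x >= 1, (x - 1) ** 2, 0.0)
    return 1 + x - 2 * b2

xs = np.linspace(-2, 2, 9)
print(np.round(y_hat(xs), 3))                       # linear up to 1, bends down after
print(y_hat(np.array([0.999, 1.0, 1.001])))         # pieces agree at X = 1 (b2(1) = 0)
```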
Atoms: basis-functions, regression-splines.
Question 11
5 points
ISLP §7 Q5
Define
$\hat g_1 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(t)]^2\, dt \big)$
and
$\hat g_2 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(t)]^2\, dt \big)$
on the same data. Mark each statement true or false.
Show answer
- False — direction reversed. $\hat g_1$ penalises the third derivative, so as $\lambda \to \infty$ it is forced into the zero-penalty family "all quadratics" (degree $\le 2$). $\hat g_2$ penalises the fourth derivative, so it is forced into "all cubics" (degree $\le 3$), a strictly larger family; hence $\hat g_2$, not $\hat g_1$, achieves the smaller training RSS (never larger, and generically strictly smaller).
- True — at $\lambda = 0$ the penalty vanishes and the minimiser can interpolate every data point exactly, sending training RSS to 0 for both.
- True — forcing $g^{(m)} \equiv 0$ leaves polynomials of degree $< m$. So $g^{(3)} \equiv 0 \Rightarrow$ degree $\le 2$ (quadratic family); $g^{(4)} \equiv 0 \Rightarrow$ degree $\le 3$ (cubic family).
Atoms: smoothing-splines, regularization.
You compare two LOESS fits to the same wage-vs-age data: one with
span = 0.1, one with span = 0.9. Which pairing
of bias / variance behaviour is correct?
- A span = 0.1: low bias, high variance; span = 0.9: high bias, low variance.
- B span = 0.1: low bias, low variance; span = 0.9: high bias, low variance — local weighting eliminates the variance penalty for narrow spans.
- C span = 0.1: low bias, low variance; span = 0.9: high bias, high variance.
- D Both choices give roughly identical bias and variance because LOESS averages over local kernels.
Show answer
Correct answer: A
Span is the fraction of points entering each local fit. Narrow span (0.1) → only the closest few points → wiggly, low-bias / high-variance ("smoothed KNN with small $K$"). Wide span (0.9) → almost all points contribute to every local fit → near-linear, high-bias / low-variance.
B mixes one half right (wide span really does have low variance) with the unicorn "low-bias and low-variance simultaneously" for narrow span — flexibility knobs trade bias against variance, they do not eliminate one for free. C also invents the unicorn at the narrow-span end and inverts the wide-span variance direction. D ignores the span knob entirely.
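A bare-bones local-linear smoother (tricube weights over the nearest span-fraction of points) makes the span effect visible on simulated wage-vs-age data. This is a hand-rolled stand-in for lowess/loess, not any particular library's implementation, and the data are made up.

```python
import numpy as np

def local_linear(x, y, x0, span):
    """Weighted linear fit at x0 using the nearest ceil(span*n) points (tricube weights)."""
    k = max(2, int(np.ceil(span * len(x))))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3          # tricube kernel
    X = np.column_stack([np.ones(k), x[idx] - x0])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[idx]))
    return beta[0]                                        # local intercept = fit at x0

rng = np.random.default_rng(4)
age = np.sort(rng.uniform(20, 65, size=300))
wage = 60 + 25 * np.sin((age - 20) / 12) + rng.normal(0, 8, 300)   # toy wage curve

grid = np.linspace(25, 60, 8)
for span in (0.1, 0.9):
    print(span, np.round([local_linear(age, wage, g, span) for g in grid], 1))
# span = 0.1 tracks local wiggles (low bias, high variance);
# span = 0.9 is nearly a straight line (high bias, low variance).
```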
Atoms: local-regression, bias-variance-tradeoff.
The lecturer separates module-7 methods into "basis-function" methods
(fit by ordinary least squares on a transformed design matrix) and
"fit-a-function-directly" methods. Which list correctly identifies the
first family?
- A Polynomial regression, step functions, regression splines.
- B Polynomial regression, smoothing splines, local regression.
- C Smoothing splines, local regression, GAMs with s() components.
- D Step functions, smoothing splines, regression splines.
Show answer
Correct answer: A
The prof's split: basis-function methods = polynomial regression, step functions, regression splines (each builds a design matrix and runs OLS). Smoothing splines and LOESS drop that frame: smoothing splines minimise a curvature-penalised loss over all functions, LOESS does a fresh local weighted regression at every query point.
B mixes a basis-function method with two function-space methods. C is exactly the "function-space" family — the negation of what the question asks. D includes smoothing splines, which use a curvature penalty rather than OLS on a fixed basis.
Atoms: basis-functions, smoothing-splines, local-regression.
Question 14
5 points
Ex7.5
Consider the additive model on the Auto data
$$\texttt{mpg} = \beta_0 + f_1(\texttt{displace}) + f_2(\texttt{horsepower}) + \beta_3\,\texttt{weight} + f_4(\texttt{accel}) + f_5(\texttt{origin}) + \varepsilon,$$
where $f_1$ is a cubic spline with one knot at 290, $f_2$ is a degree-2 polynomial,
$f_4$ is a smoothing spline with effective df = 3, and
origin is a 3-level factor. How many model degrees of freedom
does this GAM consume on top of the intercept?
Show answer
Correct answer: C
Sum the per-component dof, all excluding the global intercept:
- cubic spline on displace with one knot $= K + 3 = 4$ (truncated-power columns $x, x^2, x^3, (x-290)^3_+$);
- degree-2 polynomial on horsepower $= 2$;
- linear weight $= 1$;
- smoothing spline on accel $= 3$ (given);
- factor origin with 3 levels $= 3 - 1 = 2$ dummies.
Total $= 4 + 2 + 1 + 3 + 2 = 12$.
A counts only one dof per component (forgets that splines and polynomials add multiple columns). B forgets the smoothing-spline df entirely. D double-counts the global intercept on top of the 12 dof already in the components.
Atoms: generalized-additive-models, regression-splines, smoothing-splines.
How does the lecturer recommend choosing the smoothing parameter
$\lambda$ for a smoothing spline, and why is the standard recommendation
cheap to evaluate?
- A By minimising training RSS — the smoothing penalty by itself is enough to keep the fit from overfitting.
- B By a global F-test against the null intercept-only model — the standard inference tool for nonparametric fits.
- C By Akaike's information criterion with $\mathrm{tr}(\mathbf S_\lambda)$ taking the role of the effective parameter count.
- D By leave-one-out CV, exploiting the closed-form shortcut so only one fit on the full data is needed.
Show answer
Correct answer: D
The book / lecturer recommends LOOCV because the leave-one-out residuals can be computed from a single fit: $\mathrm{RSS}_{\text{cv}}(\lambda) = \sum_i ((y_i - \hat y_i)/(1 - \{\mathbf S_\lambda\}_{ii}))^2$. Same shortcut shape as OLS LOOCV with the smoother-matrix diagonal in place of $h_{ii}$.
A picks the most flexible fit (df $\to n$) — training RSS is monotone in flexibility. B uses an F-test, which the prof explicitly said he won't ask about (and it doesn't apply to nonparametric fits anyway). C uses AIC — the prof "doesn't trust them" and prefers CV; AIC also relies on penalty-formula assumptions for nonparametric fits that are dubious here.
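The shortcut is easy to verify for any linear smoother. A numpy sketch (an OLS quadratic fit stands in for the smoother here, since its hat matrix plays the role of $\mathbf S_\lambda$): the one-fit formula reproduces brute-force leave-one-out exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x, x**2])       # any linear smoother works the same way

H = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix, the S_lambda analogue
y_hat = H @ y
rss_cv_shortcut = np.sum(((y - y_hat) / (1 - np.diag(H))) ** 2)   # one fit only

rss_cv_brute = 0.0
for i in range(n):                               # explicit leave-one-out refits
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    rss_cv_brute += (y[i] - X[i] @ b) ** 2

print(np.isclose(rss_cv_shortcut, rss_cv_brute))   # True
```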
Atoms: smoothing-splines, cross-validation.
Mark each statement about a plain cubic regression spline (truncated-power basis) as true or false.
Show answer
- True — that's the defining property of a cubic spline: continuity through second derivative.
- False — the third derivative is allowed to jump at knots; that's the only thing that distinguishes a piecewise cubic from a global cubic. Adding the truncated $(x-c_j)^3_+$ contributes a discontinuity in the third derivative only.
- True — the "$_+$" notation means $(x-c_j)^3_+ = 0$ for $x \le c_j$ and $(x-c_j)^3$ for $x > c_j$.
- False — different bases spanning the same column space give the same projection, hence the same $\hat y$. Exercise 7.4 verifies $\hat y_{\text{by-hand}} = \hat y_{\texttt{gam()}}$ exactly.
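A quick check of the continuity claims at a knot $c_j$: the truncated term $h(x) = (x-c_j)^3_+$ satisfies $h = h' = h'' = 0$ for $x \le c_j$, while for $x > c_j$ it has $h'(x) = 3(x-c_j)^2$, $h''(x) = 6(x-c_j)$, $h'''(x) = 6$. Approaching $c_j$ from the right, the value and the first two derivatives all tend to $0$ and so match the left side; only the third derivative jumps (from $0$ to $6$), exactly the one discontinuity a cubic spline permits.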
Atoms: regression-splines, basis-functions.
Question 17
5 points
Exam 2025 P4e
On the Boston Housing data the following test MSEs are reported: linear
regression $25.0$, lasso (CV-tuned $\lambda$) $24.7$, GAM with cubic-spline
terms on rm and age $19.4$, boosted trees $14.1$.
What is the most defensible interpretation, given what the prof said about
each method?
- A Boosted trees exploit both non-linear effects and interactions; the GAM beats linear / lasso by capturing non-linearity but cannot match boosting on interactions.
- B The GAM and the lasso are the same model class, so their test MSEs should be identical; the gap above is most likely a coding error.
- C Boosted trees are best because boosting always reduces test MSE relative to GAMs and linear models, regardless of dataset or interaction structure.
- D Linear regression wins on bias-variance grounds; the lower test MSEs of the GAM and boosted trees only reflect overfit to this particular split.
Show answer
Correct answer: A
The prof's exact framing in L27 Q6d: "boosting wins, GAM beats plain regression." GAMs add per-variable non-linearity (additive); boosted trees add both non-linearity and interactions, which is why the gap between the GAM and boosting is real signal, not noise.
B confuses GAM with a regularised linear model — a GAM is strictly more flexible. C overgeneralises ("always") — boosting can lose, and on smaller / cleaner data plain GAMs sometimes win. D confuses lower test MSE with overfit; overfit shows up as higher test MSE relative to a less-flexible reference.
Atoms: generalized-additive-models, regression-splines. Lecture: L27-summary.
Mark each statement comparing regression splines and smoothing splines as true or false.
Show answer
- True — exactly the structural distinction the prof drew. Smoothing splines sidestep knot placement.
- True — regression spline = OLS on a finite basis; smoothing spline = optimisation over $g$ with curvature penalty.
- False — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ is generally non-integer. The prof: "you can get non-integer values of degrees of freedom… here it's an effective degree of freedom."
- False — knots up = more flexible, but $\lambda$ up = less flexible (smoother). Different directions; this is the canonical T/F trap.
Atoms: smoothing-splines, regression-splines.
The prof framed smoothing splines as a function-space analogue of which
familiar finite-dimensional regularization machinery, and what does the
analogy say about the role of $\lambda$?
- A Lasso — $\lambda$ pushes individual coefficients to exactly zero, giving a sparse and interpretable fit.
- B Stepwise model selection — $\lambda$ is a $p$-value cutoff for adding or dropping a basis term.
- C Ridge regression — same loss-plus-L2-penalty structure, applied to the curvature $g''$ instead of $\beta$.
- D Cross-validation itself — $\lambda$ is the held-out fold size in the resampling scheme.
Show answer
Correct answer: C
Verbatim L16: "we had regularizers… $Y - \beta X$ squared plus sum of $\beta$ squared, that was our ridge regression… Really what you're doing is you're adding another objective to your optimization." Smoothing splines plug the L2-on-curvature term $\lambda \int g''(t)^2\, dt$ into the same loss-plus-penalty frame.
A picks the wrong regulariser — lasso's L1 corner-induced sparsity has no analogue in the curvature penalty. B confuses tuning by penalty with tuning by selection (and the prof distrusts $p$-value-driven model selection anyway). D conflates the regularisation parameter $\lambda$ with the cross-validation fold count.
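The analogy can be made literal on a grid. A minimal numpy sketch (toy data; a second-difference matrix stands in for the curvature integral $\int g''(t)^2\, dt$): the penalised fit has the same ridge-style closed form, with an explicit smoother matrix whose trace is the effective df.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

# Discretised objective: ||y - g||^2 + lam * ||D g||^2, D = second-difference operator.
# Same loss-plus-L2-penalty shape as ridge, applied to curvature instead of beta.
D = np.diff(np.eye(n), n=2, axis=0)
for lam in (0.1, 10.0, 1e6):
    S = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))   # smoother matrix
    g = S @ y                                                    # ridge-style closed form
    print(lam, round(np.trace(S), 2))   # effective df = tr(S) falls toward 2 as lam grows
```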
Atoms: smoothing-splines, regularization.
For the same number of interior knots, what is the practical consequence
of using a natural cubic spline instead of a plain cubic spline?
- A The natural spline is forced linear past the boundary knots, reducing tail variance at the cost of two parameters.
- B The natural spline gains two extra parameters relative to a plain cubic spline because it places additional knots at the boundaries.
- C The natural spline switches the interior basis from cubic to quadratic, roughly halving the total parameter count.
- D The natural spline imposes continuity of all derivatives at every interior knot, including the third derivative.
Show answer
Correct answer: A
Natural splines add two boundary constraints (second derivative zero at the boundary knots), forcing linear extrapolation past the outer knots. That's two fewer effective parameters than a plain cubic spline with the same interior knots, and is exactly the behaviour that fixes the "wild boundary tails" of plain cubics.
B inverts the parameter direction: the boundary knots are already implicit in the plain cubic spline (they bound the data range), and the natural-spline constraint removes two parameters there rather than adding them. C invents a quadratic basis — natural splines are still cubic in the interior. D would make the spline a global cubic polynomial, eliminating the piecewise structure entirely.
Atoms: regression-splines, bias-variance-tradeoff.
A GAM is fitted to the Wage data with $\texttt{wage} = \beta_0 + f_1(\texttt{age}) + f_2(\texttt{year}) + f_3(\texttt{education}) + \varepsilon$,
and the per-predictor panel for $f_1(\texttt{age})$ peaks around age 47.
Which interpretation of "the panel value at age 47" is correct?
- A The predicted wage of a 47-year-old at average year and average education, in dollars.
- B The age-47 contribution to wage, holding the other predictors at their means, on the wage scale.
- C The slope of fitted wage with respect to age, evaluated locally at age 47.
- D The probability that a randomly drawn 47-year-old earns more than the sample mean wage.
Show answer
Correct answer: B
Verbatim L17: "It's not that you look at this and like, oh well, he's 47, so he makes eight whatever units… this is just the contribution. You also have to consider all these other ones." Each panel plots one $f_j(x_j)$ with the others held at their means; this is exact partial dependence under additivity.
A reads the panel value as the prediction itself, which is the most common student error the prof flagged. C reads it as a derivative — but the panel is the function value, not its slope. D treats the regression GAM as a probability — wrong scale entirely (that's the logistic-GAM panel, on the log-odds scale).
Atoms: generalized-additive-models. Lecture: L17-trees-1.
Question 22
4 points
Ex7.3
You build a natural cubic spline design matrix $\mathbf X$ for a single
predictor year with one interior knot at 2006 (boundary knots
at the data's min and max). Excluding the intercept, how many columns
does $\mathbf X$ have?
Show answer
Correct answer: D
The textbook natural-cubic-spline basis is $b_1(x) = x$ plus $b_{k+2}(x) = d_k(x) - d_K(x)$ for $k = 0, \ldots, K-1$. With one interior knot ($K = 1$) you get $b_1(x) = x$ and one $d_0(x) - d_1(x)$ column — two columns total beyond the intercept. Equivalently: natural cubic spline has $K + 1 = 2$ columns excluding the intercept.
A drops the linear column $b_1(x) = x$. C is the count if you forget that the natural-spline boundary constraints kill two parameters. B is the plain cubic spline answer ($K + 3$), which doesn't apply here.
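A numpy sketch of the column count, using the standard textbook construction (knot indexing follows the ESL/ISLP convention and may differ cosmetically from the card; the year values are made up): with one interior knot the matrix has the linear column plus one $d$-difference column, two in total.

```python
import numpy as np

year = np.linspace(2003, 2009, 300)                    # toy year values
xi = np.array([year.min(), 2006.0, year.max()])        # boundary, interior, boundary knots
K = len(xi)                                            # 3 knots in total

def d(k, x):
    # d_k(x) = [(x - xi_k)^3_+ - (x - xi_last)^3_+] / (xi_last - xi_k)
    return (np.clip(x - xi[k], 0, None) ** 3
            - np.clip(x - xi[-1], 0, None) ** 3) / (xi[-1] - xi[k])

# Beyond the intercept: the linear column, then d_k - d_{second-to-last} per interior knot.
cols = [year] + [d(k, year) - d(K - 2, year) for k in range(K - 2)]
X = np.column_stack(cols)
print(X.shape[1])   # 2 columns excluding the intercept (K_interior + 1)
```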
Atoms: regression-splines, basis-functions.