
Module 07 — Moving Beyond Linearity

22 questions · 100 points · ~35 min


Question 1 4 points

Why do polynomial regression, step functions, and cubic regression splines all qualify as "linear regression" in this course, even though their fitted curves are not straight lines?

Correct answer: B

The whole module-7 trick: replace $X$ with $b_j(X)$ to get $y = \beta_0 + \sum_j \beta_j b_j(x) + \varepsilon$. The model is still linear in $\boldsymbol\beta$, so the closed form $\hat\beta = (X^TX)^{-1}X^Ty$, sampling distribution, CIs and t-tests all carry over unchanged.

A confuses "linear in $\beta$" with "linear in $x$" — the prof explicitly flagged this trap. C invents a uniqueness condition that has nothing to do with basis expansion. D names a noise assumption (used for inference), not the reason fitting is OLS.
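A minimal numpy sketch of the point (synthetic data, illustrative rather than course code): once $x$ is expanded into basis columns, the usual OLS closed form applies verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + x - 0.5 * x**2 + rng.normal(scale=0.3, size=50)

# Basis expansion: replace x with b_j(x); here a cubic polynomial basis.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# The model is linear in beta, so the OLS closed form carries over unchanged.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same coefficients as numpy's own polynomial fit (returned highest-degree first).
assert np.allclose(beta_hat, np.polyfit(x, y, 3)[::-1])
```

The same design-matrix-then-OLS recipe covers step functions and regression splines; only the $b_j$ columns change.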

Atoms: basis-functions, polynomial-regression. Lecture: L16-beyondlinear-1.

Question 2 5 points Exam 2025 P4e(i)

A regression model already includes an intercept. A cubic regression spline on age is added with knots placed at the four quantiles $\{0.2, 0.4, 0.6, 0.8\}$. Using the truncated-power basis $x, x^2, x^3, (x - c_j)^3_+$, how many degrees of freedom does this spline term consume on top of the intercept?

Correct answer: D

For a cubic spline with $K$ knots: $K + d + 1 = K + 4$ parameters total (degree $d=3$ plus intercept). With $K = 4$ knots that is $4 + 4 = 8$. Subtract one because the intercept is already in the model: $8 - 1 = 7$.

A counts only the knots (forgets the three polynomial columns $x, x^2, x^3$). B is the count including the intercept — the trap when you don't notice the model already has one. C triple-counts (one polynomial term per knot).
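The count can be checked by building the truncated-power design matrix directly; a numpy sketch with hypothetical age data:

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Columns x, x^2, ..., x^degree, then (x - c)^degree_+ per knot (no intercept)."""
    cols = [x**d for d in range(1, degree + 1)]
    cols += [np.clip(x - c, 0, None)**degree for c in knots]
    return np.column_stack(cols)

age = np.linspace(18, 80, 200)                       # hypothetical predictor
knots = np.quantile(age, [0.2, 0.4, 0.6, 0.8])       # the four quantile knots

X = truncated_power_basis(age, knots)
# degree (3) + number of knots (4) = 7 columns on top of the intercept
assert X.shape == (200, 7)
```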

Atoms: regression-splines, basis-functions.

Question 3 6 points Exam 2025 P4d

In smoothing splines, increasing the smoothing parameter $\lambda$ will:

  1. False — $\lambda \uparrow$ makes the fit smoother, not wigglier. At $\lambda \to \infty$ the curvature penalty dominates and you collapse to the OLS straight line. The opposite direction from polynomial degree (the canonical T/F trap).
  2. True — pushing $\lambda$ too high oversmooths and underfits; the limit is a straight line.
  3. True — that is what $\lambda$ multiplies in the objective $\sum (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$.
  4. True — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ moves in the same direction as flexibility, so $\lambda \uparrow \Rightarrow$ df $\downarrow$ (toward 2).

Sub-statements scored independently, $6/4 = 1.5$ points each. Statement 1 is the prof's flagged direction-flip trap from L27 Q3d.
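The df direction in statement 4 can be illustrated with a discrete stand-in for the spline smoother (a second-difference penalty, assumed here purely for illustration; it mirrors $\lambda \int g''(t)^2\, dt$ but is not the exact spline algebra):

```python
import numpy as np

# Discrete analogue of the smoothing spline: S_lam = (I + lam * D2'D2)^{-1},
# where D2 takes second differences (the discrete curvature penalty).
n = 40
D2 = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference matrix
w = np.linalg.eigvalsh(D2.T @ D2)        # penalty eigenvalues (two are ~0)

def eff_df(lam):
    # tr(S_lam) computed via the penalty's eigendecomposition
    return np.sum(1.0 / (1.0 + lam * w))

dfs = [eff_df(lam) for lam in (0.0, 1.0, 100.0, 1e10)]
assert np.isclose(dfs[0], n)                       # lam = 0: interpolate, df = n
assert all(a > b for a, b in zip(dfs, dfs[1:]))    # lam up  =>  df down
assert abs(dfs[-1] - 2) < 0.01                     # lam -> inf: df -> 2 (straight line)
```

The two eigenvalues near zero correspond to constant and linear sequences, exactly the straight-line family the penalty never touches.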

Atoms: smoothing-splines, regularization. Lecture: L27-summary.

Question 4 4 points Exam 2023 P3d

Which integral correctly expresses the smoothness penalty in the smoothing-spline objective the lecturer wrote on the board?

Correct answer: D

The penalty is integrated squared second derivative: $g''(t)$ measures how fast the slope changes (curvature), and $\int g''(t)^2\, dt$ aggregates curvature across the range. Memorise: second derivative.

A penalises slope, not curvature — a constant non-zero slope would already get penalised, which is wrong (a straight line should have zero penalty). B is an L1 functional norm, not the smoothing-spline objective. C penalises the function's magnitude, which would shrink toward zero, not toward a straight line.

Atoms: smoothing-splines.

Question 5 4 points Exam 2024 P2c

A covariate is included in a regression model as a natural cubic spline with three interior knots. The model already has its own intercept column. How many degrees of freedom does this spline term consume?

Correct answer: C

A natural cubic spline forces linearity past the boundary knots, costing two constraints relative to a plain cubic spline. Param count: plain cubic with $K$ knots = $K+4$; natural cubic = $K+4-2 = K+2$ total, or $K+1$ when an intercept already lives in the model. With $K = 3$ that is $3 + 1 = 4$.

A counts only the knots ($K = 3$), forgetting the linear column $x$ that the natural-spline basis still contributes. B is the plain-cubic-spline count ($K + 3$ columns beyond the intercept — confusing natural with plain cubic). D is plain cubic including the intercept ($K+4$) — both errors at once.
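The arithmetic here and in Question 2 fits in one small helper (an illustrative function following the course's counting conventions):

```python
def spline_df(interior_knots, natural=False, model_has_intercept=True):
    """Parameter count for a cubic spline term (truncated-power counting)."""
    total = interior_knots + 4          # K + degree + 1 for a plain cubic spline
    if natural:
        total -= 2                      # boundary constraints: linear beyond the ends
    return total - (1 if model_has_intercept else 0)

assert spline_df(4) == 7                # Question 2: plain cubic, K = 4
assert spline_df(3, natural=True) == 4  # this question: natural cubic, K = 3
```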

Atoms: regression-splines.

Question 6 6 points

Mark each statement about an additive model $y = \beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + \varepsilon$ as true or false.

  1. True — the additive form combines marginal contributions; the cross-product $g(x_1)\cdot x_2$ is genuinely not in the function class. You'd need an explicit $f_{12}(x_1, x_2)$ term, which the prof flagged as the GAM-to-trees motivation.
  2. False — Exercise 7.5 mixes five different $f_j$ types in one model: cubic spline on displacement, polynomial on horsepower, linear on weight, smoothing spline on acceleration, factor on origin. The point of GAMs is per-predictor flexibility.
  3. True — verbatim from L17: "they will actually center all of the data and then when you predict for each value you're essentially seeing the other variables set to their mean values."
  4. True — that's exactly the construction in Exercise 7.4: $\mathbf X = (\mathbf 1, \mathbf X_1, \mathbf X_2, \mathbf X_3)$ then $\hat\beta = (X^TX)^{-1}X^Ty$. Backfitting is only needed when an s() or lo() component is included.
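Statement 4's construction can be sketched in numpy (toy quadratic bases, assumed for illustration): the whole additive model is one OLS solve, and the resulting fit is exactly additive.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 + np.sin(x1) + x2**2 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

def poly(x):  # toy quadratic basis standing in for each f_j
    return np.column_stack([x, x**2])

# Exercise-7.4 style: stack X = (1, X1, X2, X3) and run one least-squares fit.
X = np.column_stack([np.ones(n), poly(x1), poly(x2), poly(x3)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(a, b, c):
    row = np.concatenate([[1.0], [a, a**2], [b, b**2], [c, c**2]])
    return row @ beta

# Additivity: changing x2 shifts the prediction by the same amount at any x1,
# i.e. there is no f_12(x1, x2) cross-term in the function class.
shift_at_0 = predict(0.0, 1.0, 0.0) - predict(0.0, -1.0, 0.0)
shift_at_2 = predict(2.0, 1.0, 0.0) - predict(2.0, -1.0, 0.0)
assert np.isclose(shift_at_0, shift_at_2)
```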

Atoms: generalized-additive-models. Lecture: L17-trees-1.

Question 7 4 points

Suppose age is binned with cutpoints $c_1 = 30, c_2 = 50, c_3 = 65$ and a step-function regression is fitted. Counting the intercept, how many parameters does the model have, and what is the shape of the fitted function?

Correct answer: B

$K$ cutpoints give $K+1$ bins; $K$ indicator dummies plus the intercept = $K+1$ parameters. Here $K=3$, so 4 parameters. The fit is one constant per bin and is not connected — "piecewise constant, but it's not connected, it can jump."

A confuses step functions with linear splines (no slopes inside bins, no continuity). C miscounts: with $K$ cutpoints you need $K$ indicators plus the intercept column, so 4 parameters, not 3 (omitting the intercept absorbs one bin into the global mean and breaks the basis-on-top-of-intercept framing). D is a piecewise-linear regression with separate slopes per bin (different model entirely; would have 6 parameters, not 4).
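A quick numpy sketch of the construction (synthetic data, the question's cutpoints): $K = 3$ indicator columns plus the intercept, and a fit that is constant within each bin.

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(18, 80, 500)
wage = 50 + 10 * (age > 30) + 5 * (age > 50) - 8 * (age > 65) + rng.normal(size=500)

cuts = [30, 50, 65]                       # K = 3 cutpoints -> K + 1 = 4 bins
# Intercept plus one indicator per bin above the first (the first bin is the baseline):
X = np.column_stack([np.ones_like(age)]
                    + [((age > lo) & (age <= hi)).astype(float)
                       for lo, hi in zip(cuts, cuts[1:] + [np.inf])])
assert X.shape[1] == 4                    # 4 parameters including the intercept

beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
pred = X @ beta
# Piecewise constant: every observation in a bin gets the identical fitted value.
in_bin = (age > 30) & (age <= 50)
assert np.allclose(pred[in_bin], pred[in_bin][0])
```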

Atoms: step-functions, basis-functions.

Question 8 5 points

You fit polynomials of degrees $1$ through $10$ to predict mpg from horsepower on a held-out test split. The training MSE drops monotonically with degree. The test MSE has a clear U-shape, bottoming out around degree 2. What is the most defensible single-sentence interpretation?

Correct answer: A

Adding parameters can never raise training MSE (the OLS solution at degree $d-1$ is feasible at degree $d$ with a zero coefficient). Test MSE follows the bias-variance U: bias falls with flexibility, variance rises, and irreducible error is constant. The bottom of the U is the bias-variance optimum.

B confuses train and test — the prof's flagged "keyword trap." C misuses irreducible error, which is a property of the noise, not the model. D inverts the bias-variance roles: variance rises, not falls, with model flexibility (bias is what falls).
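The training-side claim is easy to check on synthetic data (a sketch; the test-MSE U-shape depends on the particular split and noise, so only the training monotonicity is asserted here):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 100)                 # x scaled to [0, 1] for conditioning
y = 3 - 2 * x + 4 * x**2 + rng.normal(scale=0.5, size=100)

def train_mse(degree):
    coef = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coef, x) - y) ** 2)

mses = [train_mse(d) for d in range(1, 11)]
# Nested models: each higher degree contains the lower-degree fit (set the extra
# coefficient to zero), so training MSE can only go down, up to round-off.
assert all(b <= a + 1e-8 for a, b in zip(mses, mses[1:]))
```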

Atoms: polynomial-regression, bias-variance-tradeoff, cross-validation.

Question 9 4 points

For a smoothing spline minimising $\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$ on data with at least three distinct $x_i$, what is the effective degrees of freedom $\mathrm{tr}(\mathbf S_\lambda)$ in the limit $\lambda \to \infty$?

Correct answer: C

As $\lambda \to \infty$ the curvature penalty forces $g''(t) \equiv 0$, so $g$ collapses to a straight line — the OLS line, with two parameters (intercept + slope). Effective df therefore tends to 2.

A would mean the fit is identically zero (penalising magnitude, not curvature). B is the constant-mean fit — a curvature penalty does not flatten slope, it flattens curvature. D is the opposite extreme ($\lambda = 0$), where $g$ interpolates every $y_i$.

Atoms: smoothing-splines.

Question 10 5 points ISLP §7 Q3

We fit $Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \varepsilon$ with $b_1(X) = X$, $b_2(X) = (X-1)^2 \, \mathbb{1}(X \ge 1)$ and obtain $\hat\beta_0 = 1$, $\hat\beta_1 = 1$, $\hat\beta_2 = -2$. What does the fitted curve look like over $X \in [-2, 2]$?

Correct answer: B

$b_2(X) = 0$ when $X < 1$, so the fit is just $1 + X$ on $[-2, 1)$. For $X \ge 1$, $b_2(X) = (X-1)^2$ activates with coefficient $-2$, giving $\hat y = 1 + X - 2(X-1)^2$ — a downward-opening parabola because the leading coefficient $-2 < 0$. The two pieces match in value at $X=1$ since $b_2(1) = 0$.

A ignores $b_2$ entirely. C swaps which side the indicator is active on. D drops the linear $b_1$ piece on the left and gets the parabola direction wrong (a coefficient of $-2$ opens downward, not upward).
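Plugging the coefficients into the fitted formula confirms the shape directly:

```python
import numpy as np

def f_hat(x):
    """The fitted curve: 1 + 1*b1(x) - 2*b2(x) with the question's basis."""
    x = np.asarray(x, dtype=float)
    b2 = np.where(x >= 1, (x - 1) ** 2, 0.0)
    return 1 + 1 * x - 2 * b2

assert f_hat(-2) == -1          # left piece is just the line 1 + x
assert f_hat(1) == 2            # pieces meet in value at x = 1, since b2(1) = 0
assert f_hat(2) == 1            # 1 + 2 - 2*(1)^2: the parabola pulls the fit down
assert f_hat(2) < 1 + 2         # below the line: downward-opening, as claimed
```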

Atoms: basis-functions, regression-splines.

Question 11 5 points ISLP §7 Q5

Define $\hat g_1 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(t)]^2\, dt \big)$ and $\hat g_2 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(t)]^2\, dt \big)$ on the same data. Mark each statement true or false.

  1. False — direction reversed. As $\lambda \to \infty$, $\hat g_1$ is confined to the null space of its penalty ($g^{(3)} \equiv 0$: all quadratics, degree $\le 2$), while $\hat g_2$ is confined to all cubics (degree $\le 3$) — a strictly larger family, so $\hat g_2$ achieves the smaller training RSS, not $\hat g_1$.
  2. True — at $\lambda = 0$ the penalty vanishes and the minimiser can interpolate every data point exactly, sending training RSS to 0 for both.
  3. True — forcing $g^{(m)} \equiv 0$ leaves polynomials of degree $< m$. So $g^{(3)} \equiv 0 \Rightarrow$ degree $\le 2$ (quadratic family); $g^{(4)} \equiv 0 \Rightarrow$ degree $\le 3$ (cubic family).
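The null-space logic in statement 3 has a discrete analogue (an illustration, not the continuous functional): the $m$-th difference operator annihilates exactly the polynomials of degree below $m$.

```python
import numpy as np

n = 30
t = np.arange(n, dtype=float)

for m, max_deg in [(3, 2), (4, 3)]:       # penalise 3rd diff -> quadratics survive, etc.
    Dm = np.diff(np.eye(n), n=m, axis=0)  # m-th difference matrix
    # polynomials of degree < m are killed by the m-th difference ...
    for deg in range(max_deg + 1):
        assert np.allclose(Dm @ t**deg, 0)
    # ... while degree-m polynomials are not
    assert not np.allclose(Dm @ t**m, 0)
```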

Atoms: smoothing-splines, regularization.

Question 12 4 points

You compare two LOESS fits to the same wage-vs-age data: one with span = 0.1, one with span = 0.9. Which pairing of bias / variance behaviour is correct?

Correct answer: A

Span is the fraction of points entering each local fit. Narrow span (0.1) → only the closest few points → wiggly, low-bias / high-variance ("smoothed KNN with small $K$"). Wide span (0.9) → almost all points contribute to every local fit → near-linear, high-bias / low-variance.

B mixes one half right (wide span really does have low variance) with the unicorn "low-bias and low-variance simultaneously" for narrow span — flexibility knobs trade bias against variance, they do not eliminate one for free. C also invents the unicorn at the narrow-span end and inverts the wide-span variance direction. D ignores the span knob entirely.
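A hand-rolled local-linear smoother (a LOESS-style sketch with tricube weights, not the exact loess algorithm) makes the span effect visible: the narrow-span fit is measurably rougher.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 120))
y = np.sin(6 * x) + rng.normal(scale=0.4, size=120)

def local_linear(x0, span):
    """Weighted linear fit on the nearest span-fraction of the data."""
    k = max(int(span * len(x)), 3)
    idx = np.argsort(np.abs(x - x0))[:k]
    d = np.abs(x[idx] - x0)
    w = (1 - (d / (d.max() + 1e-12)) ** 3) ** 3      # tricube weights
    W = np.diag(w)
    X = np.column_stack([np.ones(k), x[idx] - x0])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
    return beta[0]                                   # fitted value at x0

grid = np.linspace(0.05, 0.95, 60)

def roughness(span):
    fit = np.array([local_linear(g, span) for g in grid])
    return np.mean(np.diff(fit, n=2) ** 2)           # mean squared 2nd difference

# Narrow span: wiggly (high variance). Wide span: near-linear (high bias).
assert roughness(0.1) > roughness(0.9)
```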

Atoms: local-regression, bias-variance-tradeoff.

Question 13 4 points

The lecturer separates module-7 methods into "basis-function" methods (fit by ordinary least squares on a transformed design matrix) and "fit-a-function-directly" methods. Which list correctly identifies the first family?

Correct answer: A

The prof's split: basis-function methods = polynomial regression, step functions, regression splines (each builds a design matrix and runs OLS). Smoothing splines and LOESS drop that frame: smoothing splines minimise a curvature-penalised loss over all functions, LOESS does a fresh local weighted regression at every query point.

B mixes a basis-function method with two function-space methods. C is exactly the "function-space" family — the negation of what the question asks. D includes smoothing splines, which use a curvature penalty rather than OLS on a fixed basis.

Atoms: basis-functions, smoothing-splines, local-regression.

Question 14 5 points Ex7.5

Consider the additive model on the Auto data $$\texttt{mpg} = \beta_0 + f_1(\texttt{displace}) + f_2(\texttt{horsepower}) + \beta_3\,\texttt{weight} + f_4(\texttt{accel}) + f_5(\texttt{origin}) + \varepsilon,$$ where $f_1$ is a cubic spline with one knot at 290, $f_2$ is a degree-2 polynomial, $f_4$ is a smoothing spline with effective df = 3, and origin is a 3-level factor. How many model degrees of freedom does this GAM consume on top of the intercept?

Correct answer: C

Sum the per-component dof, all excluding the global intercept: cubic spline on displace with one knot $= K + 3 = 4$ (truncated-power columns $x, x^2, x^3, (x-290)^3_+$); degree-2 polynomial on horsepower $= 2$; linear weight $= 1$; smoothing spline on accel $= 3$ (given); factor origin with 3 levels $= 3 - 1 = 2$ dummies. Total $= 4 + 2 + 1 + 3 + 2 = 12$.

A counts only one dof per component (forgets that splines and polynomials add multiple columns). B forgets the smoothing-spline df entirely. D double-counts the global intercept on top of the 12 dof already in the components.
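The tally from the explanation, written out as a checklist:

```python
# Per-component dof, all excluding the global intercept (values from the question):
components = {
    "displace: cubic spline, 1 knot": 1 + 3,   # K + 3 truncated-power columns
    "horsepower: degree-2 polynomial": 2,
    "weight: linear": 1,
    "accel: smoothing spline": 3,              # effective df given in the question
    "origin: 3-level factor": 3 - 1,           # dummy columns
}
assert sum(components.values()) == 12
```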

Atoms: generalized-additive-models, regression-splines, smoothing-splines.

Question 15 3 points

How does the lecturer recommend choosing the smoothing parameter $\lambda$ for a smoothing spline, and why is the standard recommendation cheap to evaluate?

Correct answer: D

The book / lecturer recommends LOOCV because the leave-one-out residuals can be computed from a single fit: $\mathrm{RSS}_{\text{cv}}(\lambda) = \sum_i ((y_i - \hat y_i)/(1 - \{\mathbf S_\lambda\}_{ii}))^2$. Same shortcut shape as OLS LOOCV with the smoother-matrix diagonal in place of $h_{ii}$.

A picks the most flexible fit (df $\to n$) — training RSS is monotone in flexibility. B uses an F-test, which the prof explicitly said he won't ask about (and it doesn't apply to nonparametric fits anyway). C uses AIC — the prof "doesn't trust them" and prefers CV; AIC also relies on penalty-formula assumptions for nonparametric fits that are dubious here.
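The shortcut is the same algebra as the OLS leave-one-out identity, which is easy to verify directly (shown here for OLS, with the hat matrix standing in for $\mathbf S_\lambda$):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix (S_lam's role)
resid = y - H @ y
shortcut = resid / (1 - np.diag(H))            # all LOO residuals from ONE fit

# Brute force: actually refit n times, leaving one observation out each time.
brute = np.empty(30)
for i in range(30):
    keep = np.arange(30) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute[i] = y[i] - X[i] @ b

assert np.allclose(shortcut, brute)
```

For the smoothing spline the same formula holds with $\{\mathbf S_\lambda\}_{ii}$ in place of $h_{ii}$, which is why LOOCV over a $\lambda$-grid costs one fit per $\lambda$, not $n$.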

Atoms: smoothing-splines, cross-validation.

Question 16 6 points

Mark each statement about a plain cubic regression spline (truncated-power basis) as true or false.

  1. True — that's the defining property of a cubic spline: continuity through second derivative.
  2. False — the third derivative is allowed to jump at knots; that's the only thing that distinguishes a piecewise cubic from a global cubic. Adding the truncated $(x-c_j)^3_+$ contributes a discontinuity in the third derivative only.
  3. True — the "$_+$" notation means $(x-c_j)^3_+ = 0$ for $x \le c_j$ and $(x-c_j)^3$ for $x > c_j$.
  4. False — different bases for the same column space yield the same column space, hence the same projection, hence the same $\hat y$. Exercise 7.4 verifies $\hat y_{\text{by-hand}} = \hat y_{\texttt{gam()}}$ exactly.
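Statements 1–3 can be checked numerically on a single truncated-power term (hypothetical knot $c = 1.5$, finite-difference derivatives):

```python
import numpy as np

c = 1.5                                    # a hypothetical knot
h = 1e-6

def b(x):                                  # the truncated-power term (x - c)^3_+
    return np.where(np.asarray(x) > c, (np.asarray(x) - c) ** 3, 0.0)

# Value, 1st and 2nd derivatives are continuous at the knot (both sides -> 0) ...
assert np.isclose(b(c + h) - b(c - h), 0, atol=1e-9)
assert np.isclose((b(c + h) - b(c)) / h, 0, atol=1e-3)                    # f'(c+) = 3h^2
assert np.isclose((b(c + 2*h) - 2*b(c + h) + b(c)) / h**2, 0, atol=1e-3)  # f''(c+) = 6h
# ... but the 3rd derivative jumps from 0 (left) to 6 (right) across the knot.
third_right = (b(c + 3*h) - 3*b(c + 2*h) + 3*b(c + h) - b(c)) / h**3
assert np.isclose(third_right, 6, atol=1e-2)
```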

Atoms: regression-splines, basis-functions.

Question 17 5 points Exam 2025 P4e

On the Boston Housing data the following test MSEs are reported: linear regression $25.0$, lasso (CV-tuned $\lambda$) $24.7$, GAM with cubic-spline terms on rm and age $19.4$, boosted trees $14.1$. What is the most defensible interpretation, given what the prof said about each method?

Correct answer: A

The prof's exact framing in L27 Q6d: "boosting wins, GAM beats plain regression." GAMs add per-variable non-linearity (additive); boosted trees add both non-linearity and interactions, which is why the gap between the GAM and boosting is real signal, not noise.

B confuses GAM with a regularised linear model — a GAM is strictly more flexible. C overgeneralises ("always") — boosting can lose, and on smaller / cleaner data plain GAMs sometimes win. D confuses lower test MSE with overfit; overfit shows up as higher test MSE relative to a less-flexible reference.

Atoms: generalized-additive-models, regression-splines. Lecture: L27-summary.

Question 18 6 points

Mark each statement comparing regression splines and smoothing splines as true or false.

  1. True — exactly the structural distinction the prof drew. Smoothing splines sidestep knot placement.
  2. True — regression spline = OLS on a finite basis; smoothing spline = optimisation over $g$ with curvature penalty.
  3. False — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ is generally non-integer. The prof: "you can get non-integer values of degrees of freedom… here it's an effective degree of freedom."
  4. False — knots up = more flexible, but $\lambda$ up = less flexible (smoother). Different directions; this is the canonical T/F trap.

Atoms: smoothing-splines, regression-splines.

Question 19 3 points

The prof framed smoothing splines as a function-space analogue of which familiar finite-dimensional regularization machinery, and what does the analogy say about the role of $\lambda$?

Correct answer: C

Verbatim L16: "we had regularizers… $Y - \beta X$ squared plus sum of $\beta$ squared, that was our ridge regression… Really what you're doing is you're adding another objective to your optimization." Smoothing splines plug the L2-on-curvature term $\lambda \int g''(t)^2\, dt$ into the same loss-plus-penalty frame.

A picks the wrong regulariser — lasso's L1 corner-induced sparsity has no analogue in the curvature penalty. B confuses tuning by penalty with tuning by selection (and the prof distrusts $p$-value-driven model selection anyway). D conflates the regularisation parameter $\lambda$ with the cross-validation fold count.

Atoms: smoothing-splines, regularization.

Question 20 4 points

For the same number of interior knots, what is the practical consequence of using a natural cubic spline instead of a plain cubic spline?

Correct answer: A

Natural splines add two boundary constraints (second derivative zero at the boundary knots), forcing linear extrapolation past the outer knots. That's two fewer effective parameters than a plain cubic spline with the same interior knots, and is exactly the behaviour that fixes the "wild boundary tails" of plain cubics.

B inverts the parameter direction: the boundary knots are already implicit in the plain cubic spline (they bound the data range), and the natural-spline constraint removes two parameters there rather than adding them. C invents a quadratic basis — natural splines are still cubic in the interior. D would make the spline a global cubic polynomial, eliminating the piecewise structure entirely.

Atoms: regression-splines, bias-variance-tradeoff.

Question 21 4 points

A GAM is fitted to the Wage data with $\texttt{wage} = \beta_0 + f_1(\texttt{age}) + f_2(\texttt{year}) + f_3(\texttt{education}) + \varepsilon$, and the per-predictor panel for $f_1(\texttt{age})$ peaks around age 47. Which interpretation of "the panel value at age 47" is correct?

Correct answer: B

Verbatim L17: "It's not that you look at this and like, oh well, he's 47, so he makes eight whatever units… this is just the contribution. You also have to consider all these other ones." Each panel plots one $f_j(x_j)$ with the others held at their means; this is exact partial dependence under additivity.

A reads the panel value as the prediction itself, which is the most common student error the prof flagged. C reads it as a derivative — but the panel is the function value, not its slope. D treats the regression GAM as a probability — wrong scale entirely (that's the logistic-GAM panel, on the log-odds scale).

Atoms: generalized-additive-models. Lecture: L17-trees-1.

Question 22 4 points Ex7.3

You build a natural cubic spline design matrix $\mathbf X$ for a single predictor year with one interior knot at 2006 (boundary knots at the data's min and max). Excluding the intercept, how many columns does $\mathbf X$ have?

Correct answer: D

Order all knots $\xi_1 < \dots < \xi_K$ with the boundary knots included, so $K = 3$ here (min, 2006, max). Beyond the intercept, the natural-cubic-spline basis is $b_1(x) = x$ together with $d_k(x) - d_{K-1}(x)$ for $k = 1, \ldots, K-2$, where $d_k(x) = \big[(x-\xi_k)^3_+ - (x-\xi_K)^3_+\big]/(\xi_K - \xi_k)$. With one interior knot that yields $b_1(x) = x$ plus the single column $d_1(x) - d_2(x)$ — two columns total excluding the intercept. Equivalently: interior knots $+\,1 = 2$ columns.

A drops the linear column $b_1(x) = x$. C is the count if you forget that the natural-spline boundary constraints kill two parameters. B is the plain cubic spline answer ($K + 3$), which doesn't apply here.
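One common construction of this basis (ESL-style indexing, an assumption about the exact convention) can be built and sanity-checked in numpy:

```python
import numpy as np

year = np.linspace(2003, 2009, 100)
knots = np.array([year.min(), 2006.0, year.max()])   # boundary, interior, boundary
K = len(knots)

def d(k, x):
    """Truncated-cubic difference d_k(x) = [(x-xi_k)^3_+ - (x-xi_K)^3_+]/(xi_K - xi_k)."""
    return ((np.clip(x - knots[k], 0, None) ** 3
             - np.clip(x - knots[K - 1], 0, None) ** 3) / (knots[K - 1] - knots[k]))

# Columns beyond the intercept: x and the single d-difference column.
X = np.column_stack([year, d(0, year) - d(1, year)])
assert X.shape == (100, 2)

# Natural-spline hallmark: the basis is linear beyond the boundary knots, so
# second differences of the d-column vanish past the upper boundary.
tail = np.linspace(year.max() + 1, year.max() + 5, 50)
col = d(0, tail) - d(1, tail)
assert np.allclose(np.diff(col, n=2), 0)
```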

Atoms: regression-splines, basis-functions.