Module 07 — Moving Beyond Linearity
22 questions · 100 points · ~35 min
Why do polynomial regression, step functions, and cubic regression splines
all qualify as "linear regression" in this course, even though their
fitted curves are not straight lines?
- A The fitted curve $\hat f(x)$ is itself a linear function of $x$ once you average over the basis terms.
- B The model is linear in the parameters $\boldsymbol\beta$, so OLS on the basis-expanded design matrix still applies.
- C The least-squares fit is unique only when the relationship between $X$ and $Y$ is linear; the basis is just a numerical convenience.
- D The residuals are forced to be Gaussian, which is the defining property of linear regression.
Show answer
Correct answer: B
The whole module-7 trick: replace $X$ with $b_j(X)$ to get $y = \beta_0 + \sum_j \beta_j b_j(x) + \varepsilon$. The model is still linear in $\boldsymbol\beta$, so the closed form $\hat\beta = (X^TX)^{-1}X^Ty$, sampling distribution, CIs and t-tests all carry over unchanged.
A confuses "linear in $\beta$" with "linear in $x$" — the prof explicitly flagged this trap. C invents a uniqueness condition that has nothing to do with basis expansion. D names a noise assumption (used for inference), not the reason fitting is OLS.
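A minimal numpy sketch of the point (toy data and coefficients are made up for illustration): the cubic-polynomial basis makes the fitted curve nonlinear in $x$, yet the coefficients still come straight out of the ordinary least-squares closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=200)   # toy data

# Basis expansion: replace x with b_j(x) = x^j, then run plain OLS.
# The curve is nonlinear in x, but the model is linear in beta.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]      # numerically stable OLS
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y from the card
print(np.allclose(beta_lstsq, beta_closed))            # True: same fit either way
```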
Atoms: basis-functions, polynomial-regression. Lecture: L16-beyondlinear-1.
Question 2
5 points
Exam 2025 P4e(i)
A regression model already includes an intercept. A cubic regression spline
on age is added with knots placed at the four quantiles
$\{0.2, 0.4, 0.6, 0.8\}$. Using the truncated-power basis
$x, x^2, x^3, (x - c_j)^3_+$, how many degrees of freedom does this spline
term consume on top of the intercept?
Show answer
Correct answer: D
For a cubic spline with $K$ knots: $K + d + 1 = K + 4$ parameters total (degree $d=3$ plus intercept). With $K = 4$ knots that is $4 + 4 = 8$. Subtract one because the intercept is already in the model: $8 - 1 = 7$.
A counts only the knots (forgets the three polynomial columns $x, x^2, x^3$). B is the count including the intercept — the trap when you don't notice the model already has one. C triple-counts (one polynomial term per knot).
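A quick numpy check of the count (toy ages, illustrative only): building the truncated-power columns for four knots at the stated quantiles gives seven columns on top of the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(18, 80, size=500)                    # toy ages
knots = np.quantile(age, [0.2, 0.4, 0.6, 0.8])         # four interior knots

# Truncated-power basis: x, x^2, x^3, plus one (x - c_j)^3_+ column per knot.
cols = [age, age**2, age**3] + [np.clip(age - c, 0, None) ** 3 for c in knots]
B = np.column_stack(cols)
print(B.shape[1])   # 3 + 4 = 7 degrees of freedom beyond the intercept
```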
Atoms: regression-splines, basis-functions.
Question 3
6 points
Exam 2025 P4d
In smoothing splines, increasing the smoothing parameter $\lambda$ will:
Show answer
- False — $\lambda \uparrow$ makes the fit smoother, not wigglier. At $\lambda \to \infty$ the curvature penalty dominates and you collapse to the OLS straight line. The opposite direction from polynomial degree (the canonical T/F trap).
- True — pushing $\lambda$ too high oversmooths and underfits; the limit is a straight line.
- True — that is what $\lambda$ multiplies in the objective $\sum (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$.
- True — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ moves in the same direction as flexibility, so $\lambda \uparrow \Rightarrow$ df $\downarrow$ (toward 2).
Sub-statements scored independently, $6/4 = 1.5$ points each. The (a) statement is the prof's flagged direction-flip trap from L27 Q3d.
Atoms: smoothing-splines, regularization. Lecture: L27-summary.
Question 4
4 points
Exam 2023 P3d
Which integral correctly expresses the smoothness penalty in the
smoothing-spline objective the lecturer wrote on the board?
- A $\lambda \int g'(t)^2 \, dt$
- B $\lambda \int |g(t)|\, dt$
- C $\lambda \int g(t)^2\, dt$
- D $\lambda \int g''(t)^2 \, dt$
Show answer
Correct answer: D
The penalty is integrated squared second derivative: $g''(t)$ measures how fast the slope changes (curvature), and $\int g''(t)^2\, dt$ aggregates curvature across the range. Memorise: second derivative.
A penalises slope, not curvature — a constant non-zero slope would already get penalised, which is wrong (a straight line should have zero penalty). B is an L1 functional norm, not the smoothing-spline objective. C penalises the function's magnitude, which would shrink toward zero, not toward a straight line.
Atoms: smoothing-splines.
Question 5
4 points
Exam 2024 P2c
A covariate is included in a regression model as a natural cubic
spline with three interior knots. The model already has its own intercept
column. How many degrees of freedom does this spline term consume?
Show answer
Correct answer: C
A natural cubic spline forces linearity past the boundary knots, costing two constraints relative to a plain cubic spline. Param count: plain cubic with $K$ knots = $K+4$; natural cubic = $K+4-2 = K+2$ total, or $K+1$ when an intercept already lives in the model. With $K = 3$ that is $3 + 1 = 4$.
A counts only the knots ($K = 3$ alone) and forgets the linear column $x$ that the natural-spline basis still adds on top of the intercept. B is the plain-cubic-spline count on top of the intercept ($K + 3$; confuses natural and plain cubic). D is the plain cubic including the intercept ($K+4$): both errors at once.
Atoms: regression-splines.
Mark each statement about an additive model
$y = \beta_0 + f_1(x_1) + f_2(x_2) + f_3(x_3) + \varepsilon$
as true or false.
Show answer
- True — the additive form combines marginal contributions; the cross-product $g(x_1)\cdot x_2$ is genuinely not in the function class. You'd need an explicit $f_{12}(x_1, x_2)$ term, which the prof flagged as the GAM-to-trees motivation.
- False — Exercise 7.5 mixes five different $f_j$ types in one model: cubic spline on displacement, polynomial on horsepower, linear on weight, smoothing spline on acceleration, factor on origin. The point of GAMs is per-predictor flexibility.
- True — verbatim from L17: "they will actually center all of the data and then when you predict for each value you're essentially seeing the other variables set to their mean values."
- True — that's exactly the construction in Exercise 7.4: $\mathbf X = (\mathbf 1, \mathbf X_1, \mathbf X_2, \mathbf X_3)$ then $\hat\beta = (X^TX)^{-1}X^Ty$. Backfitting is only needed when an s() or lo() component is included.
Atoms: generalized-additive-models. Lecture: L17-trees-1.
Suppose age is binned with cutpoints $c_1 = 30, c_2 = 50, c_3 = 65$
and a step-function regression is fitted. Counting the intercept, how many
parameters does the model have, and what is the shape of the fitted
function?
- A 3 parameters; the fit is piecewise linear and continuous at every cutpoint.
- B 4 parameters; the fit is piecewise constant and may jump at every cutpoint.
- C 3 parameters; the fit is piecewise constant with one indicator per cutpoint and no separate intercept column.
- D 6 parameters; one slope and one intercept inside each of the three bins.
Show answer
Correct answer: B
$K$ cutpoints give $K+1$ bins; $K$ indicator dummies plus the intercept = $K+1$ parameters. Here $K=3$, so 4 parameters. The fit is one constant per bin and is not connected — "piecewise constant, but it's not connected, it can jump."
A confuses step functions with linear splines (no slopes inside bins, no continuity). C miscounts: with $K$ cutpoints you need $K$ indicators plus the intercept column, so 4 parameters, not 3 (with only 3 indicators and no intercept, the baseline bin's mean is forced to zero rather than estimated). D is a piecewise-linear regression with separate slopes per bin (a different model entirely; it would have 6 parameters, not 4).
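A sketch of the design matrix with made-up ages: three cutpoints produce the intercept plus three bin indicators, four parameters in total, and the fit is one constant per bin.

```python
import numpy as np

rng = np.random.default_rng(2)
age = rng.uniform(18, 90, size=400)          # toy ages
cuts = [30, 50, 65]                          # K = 3 cutpoints -> 4 bins
edges = cuts + [np.inf]

# Intercept (baseline bin: age < 30) plus one indicator per remaining bin.
X = np.column_stack(
    [np.ones_like(age)]
    + [((lo <= age) & (age < hi)).astype(float) for lo, hi in zip(cuts, edges[1:])]
)
print(X.shape[1])   # 4 parameters; the fit is one constant per bin, jumps allowed
```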
Atoms: step-functions, basis-functions.
You fit polynomials of degrees $1$ through $10$ to predict mpg
from horsepower on a held-out test split. The training
MSE drops monotonically with degree. The test MSE has a clear
U-shape, bottoming out around degree 2. What is the most defensible
single-sentence interpretation?
- A Training MSE falls monotonically with model flexibility; test MSE is U-shaped because variance eventually dominates the bias gain, so pick degree 2 from the bottom of the U.
- B Training and test MSE always agree in shape; the U you see in test MSE is purely sampling noise from this particular train/test split.
- C Training MSE drops because higher polynomial degree increases the irreducible error term in the bias-variance decomposition.
- D Training MSE drops because higher polynomial degree lowers the variance term in the bias-variance decomposition.
Show answer
Correct answer: A
Adding parameters can never raise training MSE (the OLS solution at degree $d-1$ is feasible at degree $d$ with a zero coefficient). Test MSE follows the bias-variance U: bias falls with flexibility, variance rises, and irreducible error is constant. The bottom of the U is the bias-variance optimum.
B confuses train and test — the prof's flagged "keyword trap." C misuses irreducible error, which is a property of the noise, not the model. D inverts the bias-variance roles: variance rises, not falls, with model flexibility (bias is what falls).
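A self-contained toy version of the experiment (simulated data standing in for mpg vs horsepower, with a quadratic truth): training MSE can only fall with degree, while test MSE typically bottoms out near the true degree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x = rng.uniform(-2, 2, size=n)                          # stand-in for horsepower (scaled)
y = 1.5 - x + 0.7 * x**2 + rng.normal(0, 0.5, n)        # stand-in for mpg, truth quadratic

idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2:]

for d in range(1, 11):
    coef = np.polyfit(x[train], y[train], deg=d)        # OLS polynomial of degree d
    mse_tr = np.mean((y[train] - np.polyval(coef, x[train])) ** 2)
    mse_te = np.mean((y[test] - np.polyval(coef, x[test])) ** 2)
    print(d, round(mse_tr, 3), round(mse_te, 3))
# Training MSE is (weakly) monotone decreasing in d; test MSE is U-shaped.
```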
Atoms: polynomial-regression, bias-variance-tradeoff, cross-validation.
For a smoothing spline minimising
$\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$
on data with at least three distinct $x_i$, what is the effective
degrees of freedom $\mathrm{tr}(\mathbf S_\lambda)$ in the limit
$\lambda \to \infty$?
Show answer
Correct answer: C
As $\lambda \to \infty$ the curvature penalty forces $g''(t) \equiv 0$, so $g$ collapses to a straight line — the OLS line, with two parameters (intercept + slope). Effective df therefore tends to 2.
A would mean the fit is identically zero (penalising magnitude, not curvature). B is the constant-mean fit — a curvature penalty does not flatten slope, it flattens curvature. D is the opposite extreme ($\lambda = 0$), where $g$ interpolates every $y_i$.
Atoms: smoothing-splines.
Question 10
5 points
ISLP §7 Q3
We fit $Y = \beta_0 + \beta_1 b_1(X) + \beta_2 b_2(X) + \varepsilon$ with
$b_1(X) = X$, $b_2(X) = (X-1)^2 \, \mathbb{1}(X \ge 1)$ and obtain
$\hat\beta_0 = 1$, $\hat\beta_1 = 1$, $\hat\beta_2 = -2$. What does the
fitted curve look like over $X \in [-2, 2]$?
- A A single straight line of slope 1 through $(0, 1)$ over the whole interval.
- B Straight line $\hat y = 1 + X$ for $X < 1$; downward-opening parabola $\hat y = 1 + X - 2(X-1)^2$ for $X \ge 1$ (joined continuously at $X = 1$).
- C Straight line $\hat y = 1 + X$ for $X \ge 1$; downward parabola $\hat y = 1 + X - 2(X-1)^2$ for $X < 1$.
- D Constant 1 for $X < 1$; an upward-opening parabola for $X \ge 1$.
Show answer
Correct answer: B
$b_2(X) = 0$ when $X < 1$, so the fit is just $1 + X$ on $[-2, 1)$. For $X \ge 1$, $b_2(X) = (X-1)^2$ activates with coefficient $-2$, giving $\hat y = 1 + X - 2(X-1)^2$ — a downward-opening parabola because the leading coefficient $-2 < 0$. The two pieces match in value at $X=1$ since $b_2(1) = 0$.
A ignores $b_2$ entirely. C swaps which side the indicator is active on. D drops the linear $b_1$ piece on the left and gets the parabola direction wrong (a coefficient of $-2$ opens downward, not upward).
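To see the shape concretely, here is a small numpy evaluation of the fitted curve from the question over $[-2, 2]$: the line $1 + X$ until $X = 1$, then the downward parabola, joining continuously.

```python
import numpy as np

def y_hat(x):
    # b1(X) = X, b2(X) = (X - 1)^2 * 1(X >= 1); fitted coefficients 1, 1, -2
    b2 = np.where(x >= 1, (x - 1) ** 2, 0.0)
    return 1 + x - 2 * b2

xs = np.linspace(-2, 2, 9)
print(np.round(y_hat(xs), 3))                       # linear up to 1, bends down after
print(y_hat(np.array([0.999, 1.0, 1.001])))         # pieces agree at X = 1 (b2(1) = 0)
```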
Atoms: basis-functions, regression-splines.
Question 11
5 points
ISLP §7 Q5
Define
$\hat g_1 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(3)}(t)]^2\, dt \big)$
and
$\hat g_2 = \arg\min_g \big( \sum (y_i - g(x_i))^2 + \lambda \int [g^{(4)}(t)]^2\, dt \big)$
on the same data. Mark each statement true or false.
Show answer
- False — direction reversed. $\hat g_1$ penalises the third derivative, so as $\lambda \to \infty$ it is forced into the zero-penalty family "all quadratics" (degree $\le 2$). $\hat g_2$ penalises the fourth derivative, so it is forced into "all cubics" (degree $\le 3$), a strictly larger family; hence $\hat g_2$, not $\hat g_1$, achieves the smaller training RSS (never larger, and generically strictly smaller).
- True — at $\lambda = 0$ the penalty vanishes and the minimiser can interpolate every data point exactly, sending training RSS to 0 for both.
- True — forcing $g^{(m)} \equiv 0$ leaves polynomials of degree $< m$. So $g^{(3)} \equiv 0 \Rightarrow$ degree $\le 2$ (quadratic family); $g^{(4)} \equiv 0 \Rightarrow$ degree $\le 3$ (cubic family).
Atoms: smoothing-splines, regularization.
You compare two LOESS fits to the same wage-vs-age data: one with
span = 0.1, one with span = 0.9. Which pairing
of bias / variance behaviour is correct?
- A span = 0.1: low bias, high variance; span = 0.9: high bias, low variance.
- B span = 0.1: low bias, low variance; span = 0.9: high bias, low variance — local weighting eliminates the variance penalty for narrow spans.
- C span = 0.1: low bias, low variance; span = 0.9: high bias, high variance.
- D Both choices give roughly identical bias and variance because LOESS averages over local kernels.
Show answer
Correct answer: A
Span is the fraction of points entering each local fit. Narrow span (0.1) → only the closest few points → wiggly, low-bias / high-variance ("smoothed KNN with small $K$"). Wide span (0.9) → almost all points contribute to every local fit → near-linear, high-bias / low-variance.
B mixes one half right (wide span really does have low variance) with the unicorn "low-bias and low-variance simultaneously" for narrow span — flexibility knobs trade bias against variance, they do not eliminate one for free. C also invents the unicorn at the narrow-span end and inverts the wide-span variance direction. D ignores the span knob entirely.
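A bare-bones local-linear smoother (tricube weights over the nearest span-fraction of points) makes the span effect visible on simulated wage-vs-age data. This is a hand-rolled stand-in for lowess/loess, not any particular library's implementation, and the data are made up.

```python
import numpy as np

def local_linear(x, y, x0, span):
    """Weighted linear fit at x0 using the nearest ceil(span*n) points (tricube weights)."""
    k = max(2, int(np.ceil(span * len(x))))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3          # tricube kernel
    X = np.column_stack([np.ones(k), x[idx] - x0])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[idx]))
    return beta[0]                                        # local intercept = fit at x0

rng = np.random.default_rng(4)
age = np.sort(rng.uniform(20, 65, size=300))
wage = 60 + 25 * np.sin((age - 20) / 12) + rng.normal(0, 8, 300)   # toy wage curve

grid = np.linspace(25, 60, 8)
for span in (0.1, 0.9):
    print(span, np.round([local_linear(age, wage, g, span) for g in grid], 1))
# span = 0.1 tracks local wiggles (low bias, high variance);
# span = 0.9 is nearly a straight line (high bias, low variance).
```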
Atoms: local-regression, bias-variance-tradeoff.
The lecturer separates module-7 methods into "basis-function" methods
(fit by ordinary least squares on a transformed design matrix) and
"fit-a-function-directly" methods. Which list correctly identifies the
first family?
- A Polynomial regression, step functions, regression splines.
- B Polynomial regression, smoothing splines, local regression.
- C Smoothing splines, local regression, GAMs with s() components.
- D Step functions, smoothing splines, regression splines.
Show answer
Correct answer: A
The prof's split: basis-function methods = polynomial regression, step functions, regression splines (each builds a design matrix and runs OLS). Smoothing splines and LOESS drop that frame: smoothing splines minimise a curvature-penalised loss over all functions, LOESS does a fresh local weighted regression at every query point.
B mixes a basis-function method with two function-space methods. C is exactly the "function-space" family — the negation of what the question asks. D includes smoothing splines, which use a curvature penalty rather than OLS on a fixed basis.
Atoms: basis-functions, smoothing-splines, local-regression.
Question 14
5 points
Ex7.5
Consider the additive model on the Auto data
$$\texttt{mpg} = \beta_0 + f_1(\texttt{displace}) + f_2(\texttt{horsepower}) + \beta_3\,\texttt{weight} + f_4(\texttt{accel}) + f_5(\texttt{origin}) + \varepsilon,$$
where $f_1$ is a cubic spline with one knot at 290, $f_2$ is a degree-2 polynomial,
$f_4$ is a smoothing spline with effective df = 3, and
origin is a 3-level factor. How many model degrees of freedom
does this GAM consume on top of the intercept?
Show answer
Correct answer: C
Sum the per-component dof, all excluding the global intercept:
- cubic spline on displace with one knot $= K + 3 = 4$ (truncated-power columns $x, x^2, x^3, (x-290)^3_+$);
- degree-2 polynomial on horsepower $= 2$;
- linear weight $= 1$;
- smoothing spline on accel $= 3$ (given);
- factor origin with 3 levels $= 3 - 1 = 2$ dummies.
Total $= 4 + 2 + 1 + 3 + 2 = 12$.
A counts only one dof per component (forgets that splines and polynomials add multiple columns). B forgets the smoothing-spline df entirely. D double-counts the global intercept on top of the 12 dof already in the components.
Atoms: generalized-additive-models, regression-splines, smoothing-splines.
How does the lecturer recommend choosing the smoothing parameter
$\lambda$ for a smoothing spline, and why is the standard recommendation
cheap to evaluate?
- A By minimising training RSS — the smoothing penalty by itself is enough to keep the fit from overfitting.
- B By a global F-test against the null intercept-only model — the standard inference tool for nonparametric fits.
- C By Akaike's information criterion with $\mathrm{tr}(\mathbf S_\lambda)$ taking the role of the effective parameter count.
- D By leave-one-out CV, exploiting the closed-form shortcut so only one fit on the full data is needed.
Show answer
Correct answer: D
The book / lecturer recommends LOOCV because the leave-one-out residuals can be computed from a single fit: $\mathrm{RSS}_{\text{cv}}(\lambda) = \sum_i ((y_i - \hat y_i)/(1 - \{\mathbf S_\lambda\}_{ii}))^2$. Same shortcut shape as OLS LOOCV with the smoother-matrix diagonal in place of $h_{ii}$.
A picks the most flexible fit (df $\to n$) — training RSS is monotone in flexibility. B uses an F-test, which the prof explicitly said he won't ask about (and it doesn't apply to nonparametric fits anyway). C uses AIC — the prof "doesn't trust them" and prefers CV; AIC also relies on penalty-formula assumptions for nonparametric fits that are dubious here.
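The shortcut is easy to verify for any linear smoother. A numpy sketch (an OLS quadratic fit stands in for the smoother here, since its hat matrix plays the role of $\mathbf S_\lambda$): the one-fit formula reproduces brute-force leave-one-out exactly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x, x**2])       # any linear smoother works the same way

H = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix, the S_lambda analogue
y_hat = H @ y
rss_cv_shortcut = np.sum(((y - y_hat) / (1 - np.diag(H))) ** 2)   # one fit only

rss_cv_brute = 0.0
for i in range(n):                               # explicit leave-one-out refits
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    rss_cv_brute += (y[i] - X[i] @ b) ** 2

print(np.isclose(rss_cv_shortcut, rss_cv_brute))   # True
```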
Atoms: smoothing-splines, cross-validation.
Mark each statement about a plain cubic regression spline (truncated-power basis) as true or false.
Show answer
- True — that's the defining property of a cubic spline: continuity through second derivative.
- False — the third derivative is allowed to jump at knots; that's the only thing that distinguishes a piecewise cubic from a global cubic. Adding the truncated $(x-c_j)^3_+$ contributes a discontinuity in the third derivative only.
- True — the "$_+$" notation means $(x-c_j)^3_+ = 0$ for $x \le c_j$ and $(x-c_j)^3$ for $x > c_j$.
- False — different bases spanning the same column space give the same projection, hence the same $\hat y$. Exercise 7.4 verifies $\hat y_{\text{by-hand}} = \hat y_{\texttt{gam()}}$ exactly.
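A quick check of the continuity claims at a knot $c_j$: the truncated term $h(x) = (x-c_j)^3_+$ satisfies $h = h' = h'' = 0$ for $x \le c_j$, while for $x > c_j$ it has $h'(x) = 3(x-c_j)^2$, $h''(x) = 6(x-c_j)$, $h'''(x) = 6$. Approaching $c_j$ from the right, the value and the first two derivatives all tend to $0$ and so match the left side; only the third derivative jumps (from $0$ to $6$), exactly the one discontinuity a cubic spline permits.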
Atoms: regression-splines, basis-functions.
Question 17
5 points
Exam 2025 P4e
On the Boston Housing data the following test MSEs are reported: linear
regression $25.0$, lasso (CV-tuned $\lambda$) $24.7$, GAM with cubic-spline
terms on rm and age $19.4$, boosted trees $14.1$.
What is the most defensible interpretation, given what the prof said about
each method?
- A Boosted trees exploit both non-linear effects and interactions; the GAM beats linear / lasso by capturing non-linearity but cannot match boosting on interactions.
- B The GAM and the lasso are the same model class, so their test MSEs should be identical; the gap above is most likely a coding error.
- C Boosted trees are best because boosting always reduces test MSE relative to GAMs and linear models, regardless of dataset or interaction structure.
- D Linear regression wins on bias-variance grounds; the lower test MSEs of the GAM and boosted trees only reflect overfit to this particular split.
Show answer
Correct answer: A
The prof's exact framing in L27 Q6d: "boosting wins, GAM beats plain regression." GAMs add per-variable non-linearity (additive); boosted trees add both non-linearity and interactions, which is why the gap between the GAM and boosting is real signal, not noise.
B confuses GAM with a regularised linear model — a GAM is strictly more flexible. C overgeneralises ("always") — boosting can lose, and on smaller / cleaner data plain GAMs sometimes win. D confuses lower test MSE with overfit; overfit shows up as higher test MSE relative to a less-flexible reference.
Atoms: generalized-additive-models, regression-splines. Lecture: L27-summary.
Mark each statement comparing regression splines and smoothing splines as true or false.
Show answer
- True — exactly the structural distinction the prof drew. Smoothing splines sidestep knot placement.
- True — regression spline = OLS on a finite basis; smoothing spline = optimisation over $g$ with curvature penalty.
- False — effective df $= \mathrm{tr}(\mathbf S_\lambda)$ is generally non-integer. The prof: "you can get non-integer values of degrees of freedom… here it's an effective degree of freedom."
- False — knots up = more flexible, but $\lambda$ up = less flexible (smoother). Different directions; this is the canonical T/F trap.
Atoms: smoothing-splines, regression-splines.
The prof framed smoothing splines as a function-space analogue of which
familiar finite-dimensional regularization machinery, and what does the
analogy say about the role of $\lambda$?
- A Lasso — $\lambda$ pushes individual coefficients to exactly zero, giving a sparse and interpretable fit.
- B Stepwise model selection — $\lambda$ is a $p$-value cutoff for adding or dropping a basis term.
- C Ridge regression — same loss-plus-L2-penalty structure, applied to the curvature $g''$ instead of $\beta$.
- D Cross-validation itself — $\lambda$ is the held-out fold size in the resampling scheme.
Show answer
Correct answer: C
Verbatim L16: "we had regularizers… $Y - \beta X$ squared plus sum of $\beta$ squared, that was our ridge regression… Really what you're doing is you're adding another objective to your optimization." Smoothing splines plug the L2-on-curvature term $\lambda \int g''(t)^2\, dt$ into the same loss-plus-penalty frame.
A picks the wrong regulariser — lasso's L1 corner-induced sparsity has no analogue in the curvature penalty. B confuses tuning by penalty with tuning by selection (and the prof distrusts $p$-value-driven model selection anyway). D conflates the regularisation parameter $\lambda$ with the cross-validation fold count.
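The analogy can be made literal on a grid. A minimal numpy sketch (toy data; a second-difference matrix stands in for the curvature integral $\int g''(t)^2\, dt$): the penalised fit has the same ridge-style closed form, with an explicit smoother matrix whose trace is the effective df.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = np.linspace(0, 1, n)
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.3, n)

# Discretised objective: ||y - g||^2 + lam * ||D g||^2, D = second-difference operator.
# Same loss-plus-L2-penalty shape as ridge, applied to curvature instead of beta.
D = np.diff(np.eye(n), n=2, axis=0)
for lam in (0.1, 10.0, 1e6):
    S = np.linalg.solve(np.eye(n) + lam * D.T @ D, np.eye(n))   # smoother matrix
    g = S @ y                                                    # ridge-style closed form
    print(lam, round(np.trace(S), 2))   # effective df = tr(S) falls toward 2 as lam grows
```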
Atoms: smoothing-splines, regularization.
For the same number of interior knots, what is the practical consequence
of using a natural cubic spline instead of a plain cubic spline?
- A The natural spline is forced linear past the boundary knots, reducing tail variance at the cost of two parameters.
- B The natural spline gains two extra parameters relative to a plain cubic spline because it places additional knots at the boundaries.
- C The natural spline switches the interior basis from cubic to quadratic, roughly halving the total parameter count.
- D The natural spline imposes continuity of all derivatives at every interior knot, including the third derivative.
Show answer
Correct answer: A
Natural splines add two boundary constraints (second derivative zero at the boundary knots), forcing linear extrapolation past the outer knots. That's two fewer effective parameters than a plain cubic spline with the same interior knots, and is exactly the behaviour that fixes the "wild boundary tails" of plain cubics.
B inverts the parameter direction: the boundary knots are already implicit in the plain cubic spline (they bound the data range), and the natural-spline constraint removes two parameters there rather than adding them. C invents a quadratic basis — natural splines are still cubic in the interior. D would make the spline a global cubic polynomial, eliminating the piecewise structure entirely.
Atoms: regression-splines, bias-variance-tradeoff.
A GAM is fitted to the Wage data with $\texttt{wage} = \beta_0 + f_1(\texttt{age}) + f_2(\texttt{year}) + f_3(\texttt{education}) + \varepsilon$,
and the per-predictor panel for $f_1(\texttt{age})$ peaks around age 47.
Which interpretation of "the panel value at age 47" is correct?
- A The predicted wage of a 47-year-old at average year and average education, in dollars.
- B The age-47 contribution to wage, holding the other predictors at their means, on the wage scale.
- C The slope of fitted wage with respect to age, evaluated locally at age 47.
- D The probability that a randomly drawn 47-year-old earns more than the sample mean wage.
Show answer
Correct answer: B
Verbatim L17: "It's not that you look at this and like, oh well, he's 47, so he makes eight whatever units… this is just the contribution. You also have to consider all these other ones." Each panel plots one $f_j(x_j)$ with the others held at their means; this is exact partial dependence under additivity.
A reads the panel value as the prediction itself, which is the most common student error the prof flagged. C reads it as a derivative — but the panel is the function value, not its slope. D treats the regression GAM as a probability — wrong scale entirely (that's the logistic-GAM panel, on the log-odds scale).
Atoms: generalized-additive-models. Lecture: L17-trees-1.
Question 22
4 points
Ex7.3
You build a natural cubic spline design matrix $\mathbf X$ for a single
predictor year with one interior knot at 2006 (boundary knots
at the data's min and max). Excluding the intercept, how many columns
does $\mathbf X$ have?
Show answer
Correct answer: D
The textbook natural-cubic-spline basis is $b_1(x) = x$ plus $b_{k+2}(x) = d_k(x) - d_K(x)$ for $k = 0, \ldots, K-1$. With one interior knot ($K = 1$) you get $b_1(x) = x$ and one $d_0(x) - d_1(x)$ column — two columns total beyond the intercept. Equivalently: natural cubic spline has $K + 1 = 2$ columns excluding the intercept.
A drops the linear column $b_1(x) = x$. C is the count if you forget that the natural-spline boundary constraints kill two parameters. B is the plain cubic spline answer ($K + 3$), which doesn't apply here.
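A numpy sketch of the column count, using the standard textbook construction (knot indexing follows the ESL/ISLP convention and may differ cosmetically from the card; the year values are made up): with one interior knot the matrix has the linear column plus one $d$-difference column, two in total.

```python
import numpy as np

year = np.linspace(2003, 2009, 300)                    # toy year values
xi = np.array([year.min(), 2006.0, year.max()])        # boundary, interior, boundary knots
K = len(xi)                                            # 3 knots in total

def d(k, x):
    # d_k(x) = [(x - xi_k)^3_+ - (x - xi_last)^3_+] / (xi_last - xi_k)
    return (np.clip(x - xi[k], 0, None) ** 3
            - np.clip(x - xi[-1], 0, None) ** 3) / (xi[-1] - xi[k])

# Beyond the intercept: the linear column, then d_k - d_{second-to-last} per interior knot.
cols = [year] + [d(k, year) - d(K - 2, year) for k in range(K - 2)]
X = np.column_stack(cols)
print(X.shape[1])   # 2 columns excluding the intercept (K_interior + 1)
```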
Atoms: regression-splines, basis-functions.