Mock for: May 18, 2026 (real exam date)
Grade boundaries (NTNU prosentvurderingsmetoden, advisory): A: 89–100 %; B: 77–88 %; C: 65–76 %; D: 53–64 %; E: 41–52 %; F: 0–40 %.
Read the passage and pick the best word or short phrase for each blank from the choices in parentheses. Each correct fill is worth \(1\) %.
A central theme of this course is (1) (interpolation / regularization / model averaging / the Bayes rule): any modification of a learning algorithm that aims to reduce (2) (training error / the intercept / generalization error / the residual variance) without necessarily reducing training error.
In the linear regression module we discussed what happens when two predictors are highly correlated: the matrix \(\mathbf{X}^\top\mathbf{X}\) becomes nearly singular, the variances of the estimated coefficients explode, and individual \(t\)-tests can become misleading. This phenomenon is called (3) (heteroscedasticity / leverage / collinearity / overfitting).
For training neural networks the standard optimizer is (4) (coordinate descent / Newton–Raphson / the EM algorithm / mini-batch stochastic gradient descent), and the gradients themselves are computed efficiently by an algorithm that reuses intermediates from the forward pass, namely (5) (boosting / backpropagation / bootstrapping / cross-validation).
A neural network with many more weights than observations needs regularization to generalize well. One classification-specific trick is (6) (batch normalization / Adam / label smoothing / the universal approximation theorem), in which the hard one-hot targets are replaced by slightly softened versions.
For tree ensembles, growing many independent trees in parallel and averaging them describes (7) (gradient boosting / AdaBoost / cost-complexity pruning / bagging and random forests), whereas growing many small trees sequentially, each one a small correction to the previous ensemble’s predictions, describes (8) (bagging / boosting / stacking / the bootstrap).
When estimating the standard error of a complicated statistic for which no clean closed-form sampling distribution is available, the standard nonparametric tool in this course is the (9) (F-test / validation-set approach / Mallows’ \(C_p\) / bootstrap).
Finally, when we want both to select a hyperparameter and to honestly assess the resulting model on the same data set, the recommended procedure is (10) (the validation-set approach / nested cross-validation / leave-one-out CV / the bootstrap).
For each subproblem, write True/False for each statement (or the requested numeric answer). For true/false subproblems you may add a one-sentence justification, but only if you think it helps; do not write essays.
Mark each statement true or false.
In the expected-squared-test-error decomposition \(\mathbb{E}[(y_0 - \hat f(x_0))^2] = \mathrm{Bias}^2 + \mathrm{Var} + \sigma^2\), the irreducible term \(\sigma^2\) can be lowered by choosing a more flexible model.
In a regression with a heavily noisy response (large \(\sigma^2\)), making the model class more flexible is generally counter-productive because the extra flexibility “chases” noise.
In the over-parameterized regime where \(p \gg n\) and many fits achieve zero training error, mini-batch SGD picks (approximately) the minimum-norm interpolator, which can generalize well — the so-called “benign overfitting” or double-descent phenomenon.
The lecturer’s preferred name for what students usually call “the bias–variance trade-off” is “avoid bad overfitting”, because in over-parameterized regimes one can drive both bias and variance down at once.
(1 %) True or false: “Validation-set approach with a \(50/50\) train/test split is the same as \(2\)-fold cross-validation.”
(1 %) True or false: “Compared to \(5\)- or \(10\)-fold CV, LOOCV has the lowest bias as an estimator of test error, but typically higher variance, because the \(n\) leave-one-out training sets overlap almost completely.”
(1 %) A genomicist has \(p = 5{,}000\) predictors and \(n = 100\) random labels (truly no signal, Bayes error is \(50\%\)). She first ranks all \(5{,}000\) predictors by their marginal correlation with \(y\), keeps the top \(25\), then runs \(10\)-fold CV on a logistic regression of \(y\) on those \(25\). She reports CV-misclassification near \(0\%\) and concludes she has built an excellent classifier. True or false: “her conclusion is justified.”
(1 %) True or false: “In nested cross-validation, the inner folds are used for model selection (hyperparameter tuning) and the outer folds for performance assessment.”
(1 %) True or false: “In mini-batch SGD, the mini-batch gradient is an unbiased estimator of the full-batch gradient: it has the same expectation, but larger variance.”
(1 %) True or false: “A common course-recommendation for the dropout rate is in the range \(0.2\)–\(0.5\), with \(0.2\) a frequent default; dropout is applied during training only, and is turned off (or absorbed by rescaling) at test time.”
(1 %) True or false: “Label smoothing replaces the hard one-hot target vector \((0,\dots,0,1,0,\dots,0)\) by a softened version such as \((\varepsilon/(C-1),\dots,1-\varepsilon,\dots)\) and is motivated, in part, by the possibility that the training labels themselves are imperfect.”
(1 %) True or false: “Early stopping is a regularization technique that monitors the training error during gradient descent and returns the network parameters from the epoch at which it first stops decreasing, motivated by the observation that overfitting begins precisely when the training loss flattens out.”
(1 %) True or false: “In gradient boosting with squared-error loss, fitting the next tree to the current residuals is equivalent to fitting a tree to the negative gradient of the loss with respect to the current ensemble’s predictions.”
(1 %) True or false: “Boosting reduces variance relative to a single deep tree because each of its constituent trees is fit on an independent bootstrap replicate of the training data and the resulting predictions are averaged, in the same fashion as in bagging and random forests.”
(1 %) True or false: “In gradient boosting, halving the shrinkage / learning rate \(\nu\) approximately doubles the number of trees \(M\) required for the ensemble to fit well, so \(M\) and \(\nu\) are tuned jointly.”
(1 %) True or false: “In AdaBoost, the classifier weight is \(\alpha_m = \log\!\bigl((1-\mathrm{err}_m)/\mathrm{err}_m\bigr)\), so a base classifier with weighted error above \(0.5\) would receive a negative \(\alpha_m\) in the final vote.”
A linear regression of \(y\) on \(p = 7\) continuous predictors is fit on \(n = 200\) observations. Two of the predictors, \(x_2\) and \(x_3\), are almost perfectly correlated (\(\mathrm{cor}(x_2, x_3) \approx 0.99\)); the other predictors are roughly uncorrelated.
Mark each statement true or false.
The coefficient estimates \(\hat\beta_2\) and \(\hat\beta_3\) are likely to have large standard errors and to be each individually insignificant by their \(t\)-tests, while their joint contribution to the model may still be highly significant.
The predicted values \(\hat y\) are likely to be much more unstable across resamples than the individual coefficients \(\hat\beta_2\) and \(\hat\beta_3\) are.
Standardizing \(x_2\) and \(x_3\) to mean \(0\) and variance \(1\) before fitting OLS makes the matrix \(\mathbf{X}^\top\mathbf{X}\) exactly invertible and fixes the collinearity problem.
Adding an L2 (ridge) penalty \(\lambda\sum_j\beta_j^2\) to the residual sum of squares makes \(\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I}\) invertible for any \(\lambda > 0\) and tends, in the limit of strong collinearity, to make the two collinear coefficients converge toward each other (sharing the effect roughly equally) rather than letting one of them blow up.
You have fit a logistic regression of default on
balance, income, and
student_status. From it you compute the predicted
probability of default \(\hat p(x_0)\)
for a specific new customer with covariates \(x_0 = (\texttt{balance}=2{,}000,\,
\texttt{income}=40{,}000,\, \texttt{student}=\mathrm{yes})\). The
closed-form \(\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\)
formula does not give you a direct standard error for \(\hat p(x_0)\) because the sigmoid step in
\(\hat p = \sigma(x_0^\top\hat\beta)\)
makes the propagation of variance nonlinear.
(2 %) Write pseudocode (or math) for a bootstrap procedure that returns an estimate \(\widehat{\mathrm{SE}}_{\text{boot}}(\hat p(x_0))\) and a \(95\%\) percentile confidence interval for \(\hat p(x_0)\). State your choice of the number of bootstrap resamples \(B\) and justify it in one short sentence.
(1 %) True or false: “If \(\hat p(x_0)\) is biased as an estimator of the true conditional probability \(\Pr(Y = 1 \mid X = x_0)\), the bootstrap standard error and the percentile CI computed in (i) automatically correct for that bias.”
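A minimal sketch of how the bootstrap procedure asked for above could look in Python. Everything concrete here is an assumption made for illustration and is not part of the exam: the synthetic data, the standardized covariates, the hand-rolled Newton-Raphson fitter `fit_logistic`, and the choice \(B = 1000\) (large enough that the \(2.5\%\) and \(97.5\%\) percentiles are reasonably stable).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson (IRLS) fit; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)
        # Hessian X^T W X, with a tiny ridge term purely for numerical stability
        hessian = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

def bootstrap_se_ci(X, y, x0, B=1000, level=0.95, seed=0):
    """Nonparametric bootstrap SE and percentile CI for p_hat(x0)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    p_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        p_star[b] = sigmoid(x0 @ fit_logistic(X[idx], y[idx]))
    se = p_star.std(ddof=1)                        # SD of the B bootstrap replicates
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(p_star, [tail, 1.0 - tail])
    return se, lo, hi

# Synthetic stand-in data (hypothetical): intercept plus two standardized covariates.
rng = np.random.default_rng(42)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, sigmoid(X @ np.array([-1.0, 1.0, 0.5]))).astype(float)
x0 = np.array([1.0, 0.5, -0.2])

p_hat = sigmoid(x0 @ fit_logistic(X, y))
se, lo, hi = bootstrap_se_ci(X, y, x0)
print(f"p_hat={p_hat:.3f}  SE_boot={se:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

The whole fit is repeated inside the loop, so the replicates reflect the variability of \(\hat\beta\) as well as the nonlinear sigmoid step.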
(1 %) A logistic regression of the binary outcome
spam on a continuous predictor n_links (the
number of links in an e-mail) has fitted coefficient \(\hat\beta_{\texttt{n\_links}} = 0.18\). By
what factor do the odds of spam change between two
otherwise identical messages whose n_links values differ by
\(5\)? (One numeric value, rounded to
two decimals.)
(1 %) For another e-mail the fitted linear predictor
equals \(\hat\eta = 1.2\). What is the
predicted probability of spam? (One numeric value, rounded
to three decimals.)
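A short sketch for checking the arithmetic in the two numeric questions above, assuming the usual logit link so that a change of \(\Delta\) in a predictor multiplies the odds by \(e^{\hat\beta\Delta}\). The variable names are illustrative only.

```python
import math

beta_n_links = 0.18      # fitted coefficient from the question
delta = 5                # difference in n_links between the two messages
odds_factor = math.exp(beta_n_links * delta)

eta = 1.2                # fitted linear predictor from the question
p_spam = 1.0 / (1.0 + math.exp(-eta))

print(round(odds_factor, 2), round(p_spam, 3))
```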
(1 %) True or false: “Adding \(\lambda \sum_j |\beta_j|\) as an L1 penalty to the logistic-regression log-likelihood, with \(\lambda\) chosen by cross-validation, can drive some \(\hat\beta_j\) exactly to zero, performing automatic variable selection on top of logistic regression.”
Mark each statement true or false. (\(0.5\) % per statement.)
In a random forest, the parameter \(\mathtt{mtry}\) (number of predictors sampled at each split) controls the correlation between trees: smaller \(\mathtt{mtry}\) tends to give less correlated trees and therefore more variance reduction from averaging.
The number of trees \(B\) in a random forest is a tuning parameter selected by cross-validation, because the test error of a random forest follows a U-shaped curve in \(B\) and rises again once \(B\) becomes too large.
You perform PCA on a dataset with four standardized variables \(X_1, X_2, X_3, X_4\) and obtain eigenvalues \(\lambda_1 = 1.80\), \(\lambda_2 = 1.10\), \(\lambda_3 = 0.70\), \(\lambda_4 = 0.40\). The first principal-component loading vector (entries rounded to two decimals) is \[\phi_1 = (\,0.60,\;\,0.50,\;\,0.40,\;\,0.50\,)^\top.\] A new observation has standardized values \(x^* = (\,1,\;\, -1,\;\, 0.5,\;\, 0\,)^\top.\)
(1 %) How many principal components must be retained to capture at least \(85\%\) of the total variance? Show the cumulative-PVE calculation.
(1 %) Compute the score \(z^*_1\) of the new observation on PC1.
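The two PCA calculations above can be checked with a few lines of NumPy. This is only a sketch: it assumes the new observation is already on the standardized scale (as stated), so no further centering is applied before projecting onto \(\phi_1\).

```python
import numpy as np

eigvals = np.array([1.80, 1.10, 0.70, 0.40])
pve = eigvals / eigvals.sum()           # proportion of variance explained per PC
cum_pve = np.cumsum(pve)                # cumulative PVE
n_components = int(np.argmax(cum_pve >= 0.85)) + 1   # first PC count reaching 85 %

phi1 = np.array([0.60, 0.50, 0.40, 0.50])
x_star = np.array([1.0, -1.0, 0.5, 0.0])
z1 = float(phi1 @ x_star)               # PC1 score of the new observation

print(cum_pve, n_components, z1)
```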
Let \(y_0 = f(x_0) + \varepsilon\) where \(\varepsilon\) is a zero-mean noise term with \(\mathrm{Var}(\varepsilon) = \sigma^2\), independent of the training set. Let \(\hat f\) be a fixed estimator trained on a random training set \(\mathcal D\), and let \(\hat f(x_0)\) denote its prediction at the fixed query point \(x_0\).
(2 %) State the two key independence / zero-mean assumptions you will need in the derivation below: one about \(\varepsilon\) relative to the training set \(\mathcal D\), and one about \(\mathbb{E}[\varepsilon]\). Define explicitly what randomness the outer expectation \(\mathbb{E}[\,\cdot\,]\) in \(\mathbb{E}[(y_0 - \hat f(x_0))^2]\) is taken over.
(4 %) Starting from \[\mathbb{E}\bigl[(y_0 - \hat f(x_0))^2\bigr] \;=\; \mathbb{E}\bigl[(f(x_0) + \varepsilon - \hat f(x_0))^2\bigr],\] derive the decomposition \[\boxed{\;\mathbb{E}\bigl[(y_0 - \hat f(x_0))^2\bigr] \;=\; \mathrm{Bias}^2\!\bigl[\hat f(x_0)\bigr] \;+\; \mathrm{Var}\!\bigl[\hat f(x_0)\bigr] \;+\; \sigma^2.\;}\] Show the algebraic steps clearly. In particular: (a) explain why a cross-term involving \(\varepsilon\) vanishes, (b) use the add-and-subtract trick \(\hat f(x_0) - f(x_0) = (\hat f(x_0) - \mathbb{E}[\hat f(x_0)]) + (\mathbb{E}[\hat f(x_0)] - f(x_0))\) to separate the bias and the variance, and (c) identify each of the three resulting terms.
(1 %) The decomposition above is an exact identity that holds for every estimator \(\hat f\). Briefly explain why it is therefore compatible both with the classical U-shaped test-MSE curve and with the double-descent / benign-overfitting curve seen in highly over-parameterized models.
(1 %) Briefly state one practical implication: explain in one sentence why, in this course, a method that introduces a small extra bias to obtain a large reduction in variance (e.g. ridge, lasso, dropout, or boosting at well-chosen \(M\)) can have lower test MSE than ordinary least squares.
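As a numerical sanity check on the decomposition, here is a small Monte Carlo sketch. The true function (`np.sin`), the cubic polynomial fit, the sample size, and the noise level are all hypothetical choices for illustration; the expectation is approximated by averaging over many independent training sets and fresh test noise draws.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                               # hypothetical true regression function
sigma, n, degree, x0, R = 0.5, 30, 3, 1.0, 5000

preds = np.empty(R)
y0 = np.empty(R)
for r in range(R):
    # draw a fresh training set D and fit the estimator
    x = rng.uniform(-2.0, 2.0, size=n)
    y = f(x) + rng.normal(0.0, sigma, size=n)
    coef = np.polyfit(x, y, degree)
    preds[r] = np.polyval(coef, x0)
    # draw an independent test response at x0
    y0[r] = f(x0) + rng.normal(0.0, sigma)

mse = np.mean((y0 - preds) ** 2)         # left-hand side, estimated directly
bias2 = (preds.mean() - f(x0)) ** 2      # squared bias of f_hat(x0)
var = preds.var()                        # variance of f_hat(x0) over training sets
print(f"MSE={mse:.4f}  Bias^2+Var+sigma^2={bias2 + var + sigma**2:.4f}")
```

The two printed numbers agree only up to Monte Carlo error, because the simulated noise has sample variance close to, but not exactly, \(\sigma^2\).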
(2 %) Write pseudocode for ordinary \(k\)-fold cross-validation that returns the CV-error of a specified model class \(\mathcal M(\theta)\) at a specified hyperparameter value \(\theta\), evaluated on a training set \(\{(x_i, y_i)\}_{i=1}^n\). Include: the random partition into folds, the inner fit-and-evaluate loop, the per-fold error definition, and the aggregation. Then state, in one or two sentences, how you would use this procedure to select a hyperparameter from a grid \(\{\theta_1, \dots, \theta_T\}\) and what you would do with the chosen hyperparameter afterwards.
(2 %) Now suppose you want both to select \(\theta\) and to obtain an honest estimate of the test error of the selected model, using only the training set. Sketch pseudocode for nested cross-validation that does this in a single run, clearly distinguishing the inner and outer loops. State, in one short sentence, why a naive single-layer CV approach in which one reports \(\min_\theta \mathrm{CV}_k(\theta)\) as the final test-error estimate is biased downward.
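One possible rendering of the two procedures above in Python, as a sketch rather than a model answer. It assumes ridge regression as the model class \(\mathcal M(\theta)\) with \(\theta = \lambda\), squared error as the per-fold loss, and synthetic data; the helper names `ridge_fit`, `kfold_cv`, and `nested_cv` are made up for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge; the intercept is handled by centering and is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return beta, y_mean - x_mean @ beta

def kfold_cv(X, y, lam, k=5, seed=0):
    """k-fold CV estimate of test MSE for ridge at penalty lam."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)  # random partition
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta, b0 = ridge_fit(X[train], y[train], lam)        # fit on the other k-1 folds
        resid = y[held_out] - (b0 + X[held_out] @ beta)
        fold_mse.append(np.mean(resid ** 2))                 # per-fold error
    return float(np.mean(fold_mse))                          # aggregate over folds

def nested_cv(X, y, grid, k_outer=5, k_inner=5, seed=0):
    """Outer loop assesses; inner loop selects lambda using outer-training data only."""
    n = len(y)
    outer = np.array_split(np.random.default_rng(seed).permutation(n), k_outer)
    outer_mse = []
    for test in outer:
        train = np.setdiff1d(np.arange(n), test)
        inner = [kfold_cv(X[train], y[train], lam, k_inner, seed) for lam in grid]
        best_lam = grid[int(np.argmin(inner))]               # model selection
        beta, b0 = ridge_fit(X[train], y[train], best_lam)   # refit on all outer-training data
        resid = y[test] - (b0 + X[test] @ beta)
        outer_mse.append(np.mean(resid ** 2))                # assessment on untouched fold
    return float(np.mean(outer_mse))

# Synthetic illustration data (hypothetical).
rng = np.random.default_rng(7)
X = rng.normal(size=(120, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=120)
grid = [0.01, 0.1, 1.0, 10.0, 100.0]
estimate = nested_cv(X, y, grid)
print(f"nested-CV estimate of test MSE: {estimate:.3f}")
```

Using the same seed inside `kfold_cv` for every grid value keeps the fold partition fixed across \(\lambda\), so the inner comparison is not confounded by different splits.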
A small training set of \(N = 5\) observations \((i, y_i, w_i^{(1)})\) is given, with labels coded \(\pm 1\) and initial weights \(w_i^{(1)} = 1/N\):
| \(i\) | \(y_i\) | \(w_i^{(1)}\) | Prediction \(G_1(x_i)\) |
|---|---|---|---|
| \(1\) | \(+1\) | \(0.20\) | \(+1\) (correct) |
| \(2\) | \(+1\) | \(0.20\) | \(-1\) (wrong) |
| \(3\) | \(-1\) | \(0.20\) | \(-1\) (correct) |
| \(4\) | \(-1\) | \(0.20\) | \(-1\) (correct) |
| \(5\) | \(+1\) | \(0.20\) | \(-1\) (wrong) |
A weak classifier \(G_1(x)\) has been fit to these weighted data and its per-observation predictions are shown in the rightmost column.
(1 %) Compute the weighted misclassification error of \(G_1\), \[\mathrm{err}_1 = \frac{\sum_i w_i^{(1)} \cdot \mathbb{1}[y_i \neq G_1(x_i)]}{\sum_i w_i^{(1)}}.\]
(1 %) Compute the classifier weight \(\alpha_1 = \log\!\bigl((1 - \mathrm{err}_1)/\mathrm{err}_1\bigr)\) (give the answer to three decimals; you may use \(\log = \ln\)).
(2 %) Apply the AdaBoost weight update \[w_i^{(2)} \;\propto\; w_i^{(1)} \cdot \exp\!\bigl(\alpha_1 \cdot \mathbb{1}[y_i \neq G_1(x_i)]\bigr),\] and report the renormalized weights \(w_i^{(2)}\) (with \(\sum_i w_i^{(2)} = 1\)), rounded to three decimals. In one sentence, comment on which observations have gained weight and why this is the entire point of AdaBoost.
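The first AdaBoost round on the table above can be traced with a few lines of NumPy. This sketch simply transcribes the three formulas as stated; the array names are illustrative.

```python
import numpy as np

y = np.array([+1, +1, -1, -1, +1])       # labels from the table
g1 = np.array([+1, -1, -1, -1, -1])      # predictions of G_1
w1 = np.full(5, 0.2)                     # initial weights 1/N

miss = (y != g1).astype(float)           # indicator of misclassification
err1 = np.sum(w1 * miss) / np.sum(w1)    # weighted error
alpha1 = np.log((1.0 - err1) / err1)     # classifier weight (natural log)

w2 = w1 * np.exp(alpha1 * miss)          # up-weight the misclassified points
w2 = w2 / w2.sum()                       # renormalise to sum to 1

print(round(err1, 3), round(alpha1, 3), np.round(w2, 3))
```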
A wine merchant has \(n = 1{,}200\)
red wines, each scored by sommeliers on a continuous
quality index (response, range \(3\)–\(8\)). The available predictors are the
eight chemical measurements
alcohol — alcohol content (% vol.);
volatile_acidity (g/L);
sulphates (g/L);
pH;
density (g/cm\(^3\));
residual_sugar (g/L);
free_so2 — free sulfur dioxide (mg/L);
total_so2 — total sulfur dioxide (mg/L).
Important pairwise correlation among the predictors (computed on the training set, after standardization): \[\mathrm{cor}(\texttt{free\_so2},\, \texttt{total\_so2}) \approx 0.99.\] All other absolute correlations among predictors are below \(0.55\).
The data are split \(800 / 400\) into training and test sets. All continuous predictors are standardized to mean \(0\), variance \(1\) before fitting any of the models below.
The course staff first fits, on the training set, the OLS model \[\texttt{quality} \,\sim\, \texttt{alcohol} + I(\texttt{alcohol}^2) + \texttt{volatile\_acidity} + \texttt{sulphates} + \texttt{pH} + \texttt{density} + \texttt{residual\_sugar} + \texttt{free\_so2} + \texttt{total\_so2} + \texttt{alcohol}\!:\!\texttt{volatile\_acidity}.\] The fitted output is:
| | Estimate | Std. Error | t-value | Pr(\(>\lvert t\rvert\)) |
|---|---|---|---|---|
| (Intercept) | \(5.62\) | \(0.030\) | \(187.3\) | \(<0.001\) |
| alcohol | \(0.420\) | \(0.045\) | \(9.33\) | \(<0.001\) |
| \(I(\texttt{alcohol}^2)\) | \(-0.110\) | \(0.030\) | \(-3.67\) | \(<0.001\) |
| volatile_acidity | \(-0.305\) | \(0.034\) | \(-8.97\) | \(<0.001\) |
| sulphates | \(0.180\) | \(0.031\) | \(5.81\) | \(<0.001\) |
| pH | \(-0.090\) | \(0.040\) | \(-2.25\) | \(0.025\) |
| density | \(-0.060\) | \(0.062\) | \(-0.97\) | \(0.334\) |
| residual_sugar | \(0.022\) | \(0.038\) | \(0.58\) | \(0.563\) |
| free_so2 | \(0.260\) | \(0.180\) | \(1.44\) | \(0.149\) |
| total_so2 | \(-0.330\) | \(0.190\) | \(-1.74\) | \(0.083\) |
| alcohol:volatile_acidity | \(-0.140\) | \(0.034\) | \(-4.12\) | \(<0.001\) |
Residual standard error: \(0.655\) on \(789\) degrees of freedom. Multiple \(R^2 = 0.371\), Adjusted \(R^2 = 0.363\). Training MSE \(= 0.425\), Test MSE \(= 0.453\).
(1 %) How many parameters (including the intercept) does this model estimate? Verify the count against the printed residual degrees of freedom.
(2 %) Two predictors, free_so2 and
total_so2, have large estimates (\(+0.26\) and \(-0.33\)) but each appears individually
insignificant (\(p = 0.149\)
and \(0.083\)). Their standard errors
(\(0.180\) and \(0.190\)) are about \(5\times\) larger than those of the other
predictors. Identify, in one or two sentences, what is going on, and
state precisely which two pieces of information together let
you diagnose this (no need to compute a VIF).
(2 %) For a wine in which both alcohol and
volatile_acidity are at their training-set means (so their
standardized values are both \(0\)), and all other predictors are also at
their means, compute the predicted quality. Then compute
the predicted quality for the same wine if its
standardized alcohol is increased from \(0\) to \(+1\) while standardized
volatile_acidity stays at \(0\). (Two numeric answers; remember to
include both the \(\texttt{alcohol}^2\)
term and the interaction term.)
(2 %) Now repeat the second calculation in (iii) but
with standardized volatile_acidity fixed at \(+1\) (a wine with notably more volatile
acidity than average) instead of at \(0\), while standardized
alcohol again moves from \(0\) to \(+1\). Report (a) the implied change in
predicted quality, and (b) one short sentence interpreting
the sign and magnitude of the interaction term in plain
English.
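To check the hand calculations in the two prediction questions above, the fitted equation can be written as a small function. This is a sketch that assumes all predictors other than alcohol and volatile_acidity sit at their standardized value \(0\), so their terms drop out; the function name is made up.

```python
def predicted_quality(alcohol, volatile_acidity):
    """Fitted OLS prediction with every other standardized predictor at 0."""
    return (5.62
            + 0.420 * alcohol
            - 0.110 * alcohol ** 2                  # quadratic term in alcohol
            - 0.305 * volatile_acidity
            - 0.140 * alcohol * volatile_acidity)   # interaction term

# Effect of raising standardized alcohol from 0 to +1 at two levels of volatile acidity
change_at_va0 = predicted_quality(1, 0) - predicted_quality(0, 0)
change_at_va1 = predicted_quality(1, 1) - predicted_quality(0, 1)

print(round(predicted_quality(0, 0), 3), round(predicted_quality(1, 0), 3))
print(round(change_at_va0, 3), round(change_at_va1, 3))
```

The difference between the two changes equals the interaction coefficient, which is a quick way to confirm that the interaction term was included.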
(1 %) A classmate writes: “density has
\(p = 0.334\), so density does not
affect wine quality, and we should drop it.” Give one short sentence
rebutting this reading.
Refit the OLS model after dropping total_so2
(keeping free_so2 and all other predictors). The new fitted
coefficient and SE for the one remaining SO\(_2\)-related row are:
| | Estimate | Std. Error | t-value | Pr(\(>\lvert t\rvert\)) |
|---|---|---|---|---|
| free_so2 (in reduced model) | \(-0.072\) | \(0.040\) | \(-1.80\) | \(0.072\) |
Other coefficients are essentially unchanged from part (a). Training MSE \(= 0.428\), Test MSE \(= 0.451\).
(2 %) The estimate of \(\hat\beta_{\texttt{free\_so2}}\) has swung
from \(+0.26\) (in the original model)
to \(-0.072\) (in the reduced model),
and its SE has shrunk from \(0.180\) to
\(0.040\). Briefly explain, in one or
two sentences, why this is the canonical fingerprint of strong
collinearity between free_so2 and total_so2 in
the original model, by reference to what happens to \((\mathbf{X}^\top\mathbf{X})^{-1}\) when two
columns of \(\mathbf{X}\) are nearly
proportional.
(1 %) The training and test MSE are essentially the same between the full model in (a) and the reduced model in (b). What does this tell you about whether the collinearity was hurting prediction (as opposed to interpretation)?
(1 %) The course staff lists three legitimate alternative fixes for collinearity: (1) drop a variable, (2) combine the collinear predictors (e.g. replace them by a single “\(\texttt{total\_so2}\)” channel), and (3) regularize with ridge / use principal components regression. In one sentence, briefly state when option (2) is preferable to option (1).
The full \(10\)-predictor design
(now including both free_so2 and total_so2,
plus the polynomial and the interaction) is fed into ridge regression.
The \(10\)-fold CV-MSE profile is:
Test MSEs on the same \(400\)-observation test set:
| Method | Test MSE | No. nonzero coefficients |
|---|---|---|
| OLS (full \(11\)-coefficient model from part (a)) | \(0.453\) | \(11\) |
| Ridge at \(\hat\lambda_{\min}\) | \(0.430\) | \(11\) |
| Ridge at \(\hat\lambda_{1\mathrm{SE}}\) | \(0.435\) | \(11\) |
| Lasso at \(\hat\lambda_{1\mathrm{SE}}^{\mathrm{lasso}}\) | \(0.439\) | \(6\) |
(1 %) Write down the ridge objective function explicitly, being careful about whether the intercept is penalised. Then state, in one sentence, why the predictors were standardised before this fit.
(2 %) State the one-standard-error rule precisely. Then give one bias–variance reason why ridge at \(\hat\lambda_{\min}\) improves on OLS by about \(0.023\) in test MSE on this dataset, even though OLS is unbiased.
(1 %) Among the four rows in the table above, which would you recommend to a colleague who values interpretability and is comfortable trading a small amount of predictive accuracy for it? Justify in one short sentence.
(1 %) Briefly explain why the lasso, with \(\lambda\) chosen by CV,
would plausibly zero out one of free_so2 or total_so2
in this dataset while ridge does not.
The course staff also fits a gradient-boosted tree model (interaction depth \(d = 3\), shrinkage \(\nu = 0.05\), \(M\) chosen by \(10\)-fold CV; \(M^\star \approx 1{,}400\)). On the same \(400\)-observation test set it achieves test MSE \(= 0.382\).
(1 %) Compare the test MSEs — boosting \(0.382\), ridge \(0.430\), OLS \(0.453\) — in one short sentence: what does
the gap between boosting and the linear methods suggest about the
structure of the relationship between predictors and
quality?
(1 %) A junior colleague proposes quadrupling \(M\) to \(5{,}600\) “just to be safe.” Briefly say why this can hurt boosting, even though it would not hurt a random forest.
(1 %) You also want some interpretability from the boosted model. Name one diagnostic plot and one summary statistic you can extract from it; for each, state in one short sentence what question it answers and what question it does not answer.
A telecom company has \(n =
5{,}000\) customers; the binary response churn
equals \(1\) if the customer cancelled
their subscription within the next six months. The available predictors
are:
tenure — months as a customer (continuous);
monthly_charges — current monthly bill, USD
(continuous);
senior — binary \(0/1\) (senior citizen);
contract — categorical, \(3\) levels: month-to-month
(reference), 1-year, 2-year;
tech_support — binary \(0/1\) (subscribed to tech support
add-on);
online_security — binary \(0/1\).
Data are split \(70/30\) into training (\(3{,}500\)) and test (\(1{,}500\)) sets. Among the \(1{,}500\) test customers, \(420\) truly churned and \(1{,}080\) did not.
A logistic regression model is fit on the training set with all six
predictors above, plus an interaction between tenure and
contract: \[\mathrm{logit}\bigl(\Pr(\texttt{churn} = 1 \mid
X)\bigr) \;=\; \beta_0 + \beta_1 \texttt{tenure} + \beta_2
\texttt{monthly\_charges} + \beta_3 \texttt{senior} +
\boldsymbol\beta_{\texttt{contract}}^\top \texttt{contract} +
\boldsymbol\beta_{\texttt{ten:contract}}^\top
(\texttt{tenure}\!:\!\texttt{contract}) + \beta_4 \texttt{tech\_support}
+ \beta_5 \texttt{online\_security}.\] The fitted output is:
| | Estimate | Std. Error | z-value | Pr(\(>\lvert z\rvert\)) |
|---|---|---|---|---|
| (Intercept) | \(-0.40\) | \(0.18\) | \(-2.22\) | \(0.026\) |
| tenure | \(-0.060\) | \(0.005\) | \(-12.0\) | \(<0.001\) |
| monthly_charges | \(0.018\) | \(0.003\) | \(6.00\) | \(<0.001\) |
| senior | \(0.40\) | \(0.12\) | \(3.33\) | \(<0.001\) |
| contract_1year | \(-1.10\) | \(0.20\) | \(-5.50\) | \(<0.001\) |
| contract_2year | \(-2.00\) | \(0.25\) | \(-8.00\) | \(<0.001\) |
| tenure:contract_1year | \(0.025\) | \(0.008\) | \(3.13\) | \(0.002\) |
| tenure:contract_2year | \(0.040\) | \(0.010\) | \(4.00\) | \(<0.001\) |
| tech_support | \(-0.55\) | \(0.13\) | \(-4.23\) | \(<0.001\) |
| online_security | \(-0.45\) | \(0.13\) | \(-3.46\) | \(<0.001\) |
(Reference levels: contract = month-to-month;
senior = 0; tech_support = 0;
online_security = 0.)
(2 %) For each additional month of tenure,
by what factor do the odds of churn change for a customer on (a) a
month-to-month contract, (b) a 1-year contract, and
(c) a 2-year contract? Give all three odds-multiplication
factors, rounded to three decimals, and state explicitly the encoding
assumption you have made.
(1 %) Briefly explain in one short sentence why \(\hat\beta_{\texttt{tenure:contract\_2year}} = +0.040\) has a positive sign even though \(\hat\beta_{\texttt{tenure}} = -0.060\) is negative. (Don’t just restate that “it’s an interaction.”)
(3 %) Consider a specific customer:
tenure \(=12\) months,
monthly_charges \(=85\)
USD, senior \(=0\),
contract \(=\) month-to-month,
tech_support \(=0\),
online_security \(=0\).
Compute the predicted probability \(\hat
p\) of churn for this customer. Show the linear predictor \(\hat\eta\) and the sigmoid step.
(2 %) Repeat the calculation in (iii) for an otherwise identical customer who is on a 2-year contract. Don’t forget the interaction. Then comment in one sentence on why the marketing team might use this comparison to argue for incentivising long contracts.
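The calculations in this part can be checked with a short script. The sketch below assumes standard dummy coding with month-to-month as the reference level, so that the tenure slope on a given contract is \(\hat\beta_{\texttt{tenure}}\) plus the matching interaction coefficient; the dictionary and function names are hypothetical.

```python
import math

# Fitted coefficients copied from the table above.
COEF = {
    "intercept": -0.40, "tenure": -0.060, "monthly_charges": 0.018, "senior": 0.40,
    "contract_1year": -1.10, "contract_2year": -2.00,
    "tenure:contract_1year": 0.025, "tenure:contract_2year": 0.040,
    "tech_support": -0.55, "online_security": -0.45,
}

def linear_predictor(tenure, monthly_charges, senior, contract,
                     tech_support, online_security):
    one_year = 1.0 if contract == "1-year" else 0.0    # dummy vs. month-to-month
    two_year = 1.0 if contract == "2-year" else 0.0
    return (COEF["intercept"]
            + COEF["tenure"] * tenure
            + COEF["monthly_charges"] * monthly_charges
            + COEF["senior"] * senior
            + COEF["contract_1year"] * one_year
            + COEF["contract_2year"] * two_year
            + COEF["tenure:contract_1year"] * tenure * one_year
            + COEF["tenure:contract_2year"] * tenure * two_year
            + COEF["tech_support"] * tech_support
            + COEF["online_security"] * online_security)

def churn_probability(**customer):
    return 1.0 / (1.0 + math.exp(-linear_predictor(**customer)))

customer = dict(tenure=12, monthly_charges=85, senior=0,
                contract="month-to-month", tech_support=0, online_security=0)
eta_m2m = linear_predictor(**customer)
p_m2m = churn_probability(**customer)
p_2yr = churn_probability(**{**customer, "contract": "2-year"})

# Per-month odds factor for tenure on each contract type
tenure_odds_factor = {
    "month-to-month": math.exp(COEF["tenure"]),
    "1-year": math.exp(COEF["tenure"] + COEF["tenure:contract_1year"]),
    "2-year": math.exp(COEF["tenure"] + COEF["tenure:contract_2year"]),
}

print(round(eta_m2m, 2), round(p_m2m, 3), round(p_2yr, 3))
print({k: round(v, 3) for k, v in tenure_odds_factor.items()})
```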
(2 %) A classmate writes: “senior has \(p < 0.001\), so we can conclude that
being a senior causes customers to churn more often.” Briefly
state one reason that conclusion is unwarranted from this fitted model
(the answer should appeal to the kind of model the prof flagged
repeatedly — “fancy correlations, not causal”).
LDA is also fit on the same training set, using the same six predictors. On the test set:
LDA confusion matrix
| | Predicted: churn | Predicted: no churn |
|---|---|---|
| Actual: churn | \(180\) | \(240\) |
| Actual: no churn | \(130\) | \(950\) |
For the logistic regression of part (a), at threshold \(\hat p = 0.5\), the corresponding test-set confusion matrix is:
| | Predicted: churn | Predicted: no churn |
|---|---|---|
| Actual: churn | \(210\) | \(210\) |
| Actual: no churn | \(135\) | \(945\) |
(2 %) Compute the sensitivity, specificity, and test error rate of the LDA classifier and of the logistic-regression classifier (both at their default thresholds). Six numeric answers, two decimals each.
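A small helper for checking the six numbers, sketched under the convention that "churn" is the positive class (so sensitivity is the fraction of actual churners that are flagged). The function name and dictionary keys are illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and error rate from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),     # true-positive rate among actual churners
        "specificity": tn / (tn + fp),     # true-negative rate among actual non-churners
        "error_rate": (fn + fp) / total,   # overall misclassification rate
    }

# Counts read row-wise from the two confusion matrices above.
lda = classification_metrics(tp=180, fn=240, fp=130, tn=950)
logistic = classification_metrics(tp=210, fn=210, fp=135, tn=945)

for name, m in [("LDA", lda), ("logistic", logistic)]:
    print(name, {k: f"{v:.2f}" for k, v in m.items()})
```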
(1 %) LDA and logistic regression are both linear
classifiers in this problem. State the one substantive
modelling difference between them; if a student were to add a
categorical variable with many levels like contract into
the input vector, which of the two methods’ assumptions is the more
obviously violated, and why?
(2 %) A clinician colleague hands you the same problem and proposes QDA instead of LDA. Mention one situation in which QDA would be expected to outperform LDA on a problem like this, and one situation in which it would be expected to do worse.
The course staff fits an AdaBoost ensemble of \(M = 500\) shallow trees (depth \(d = 2\)) on the training data, and separately a gradient-boosted classifier with \(M = 800\) trees, depth \(d = 3\), and shrinkage \(\nu = 0.05\), \(M\) chosen by \(10\)-fold CV. The gradient-boosting test-set confusion matrix at the default classification threshold is:
Gradient boosting confusion matrix (default threshold)
| | Predicted: churn | Predicted: no churn |
|---|---|---|
| Actual: churn | \(260\) | \(160\) |
| Actual: no churn | \(145\) | \(935\) |
(2 %) Briefly justify the choice of each of the gradient-boosting hyperparameters (\(M = 800\), \(d = 3\), \(\nu = 0.05\)) in one short sentence per parameter. “Sufficiently many” is not a sufficient justification for \(M\); tie each choice to the role the parameter plays in the bias–variance behaviour of the ensemble.
(2 %) Sketch the conceptual difference between AdaBoost and gradient boosting in two short bullet points, each at most one sentence: one bullet for the AdaBoost mechanism (sample re-weighting + \(\pm 1\) vote), one for the gradient-boosting mechanism (fit next tree to the negative gradient of the loss). Then state, in one extra sentence, in what sense AdaBoost can be viewed as a special case of gradient boosting.
(2 %) Compute the sensitivity, specificity, and test error rate of the gradient-boosting classifier (three numeric answers, two decimals). Then, in one short sentence, compare it with the logistic regression of part (a), whose test-set confusion matrix is given in part (b), and tell the marketing team which model you would recommend deploying if the cost of failing to identify a churner (a false negative) is substantially larger than the cost of unnecessarily targeting a loyal customer.
(2 %) The same boosted model is used to produce a
variable-importance plot; the top three predictors by
permutation importance are tenure, contract,
and monthly_charges. State (a) what a permutation
variable-importance plot is mechanically (one short sentence),
and (b) one specific question that this plot cannot answer
about tenure that a logistic-regression coefficient
can.
In the full population, only \(28\%\) of customers churn within the six-month window. A junior colleague suggests using accuracy (i.e. \(1 -\) test error rate) as the sole criterion for picking between the three classifiers above.
(1 %) A trivial classifier that always predicts “no churn” would have test error rate of approximately what value on this data set? Justify in one sentence.
(2 %) Comment in two or three sentences on whether accuracy is the right criterion here. In particular: (a) name one alternative metric or pair of metrics that you would privilege, and (b) explain in one sentence which decision that metric / pair of metrics would make easier than raw accuracy would (e.g. choosing between two classifiers that have nearly identical accuracy but very different false-negative rates).
End of exam. Total: \(10 + 28 + 16 + 20 + 26 = 100\) points.