
Module 04 — Classification

28 questions · 100 points · ~45 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

In the classification setup, what is the Bayes error rate?

Show answer
Correct answer: C

The Bayes classifier picks $\arg\max_k \Pr(Y=k\mid X)$ using the true posterior; its expected $0/1$ loss is the irreducible floor, the classification analogue of $\sigma^2$ in regression.

A confuses the abstract Bayes classifier with a fitted Bayes-network model. B is the empirical training error, not the population-level rate. D conflates the Bayes classifier with naive Bayes — same word, different objects (one is the optimal abstraction, the other a specific generative model).

Atoms: classification-setup, diagnostic-vs-sampling-paradigm. Lecture: L07-classif-1.

Question 2 4 points ISLP §4 Q9

On average, what fraction of credit-card holders with odds $0.37$ of defaulting will in fact default?

Show answer
Correct answer: D

Invert odds → probability: $p = \text{odds}/(1+\text{odds}) = 0.37/1.37 \approx 0.27$.

A reports the odds as if it were already a probability (the most common slip — odds and probability are different scales). B lands on $0.63$ by treating the odds as a probability and taking its complement, $1 - 0.37 = 0.63$ — two mistakes stacked on the same wrong scale. C confuses odds with log-odds, computing $e^{0.37}/(1+e^{0.37}) \approx 0.59$ — another scale mix-up.

Atoms: odds-and-log-odds. Lecture: L27-summary ("This is the kind of question I would ask. It's simple. You calculate it.").

Question 3 4 points Ex4.3

An individual has a $16\%$ chance of defaulting on her credit-card payment. What are the odds that she will default?

Show answer
Correct answer: C

Odds $= p/(1-p) = 0.16/0.84 \approx 0.19$.

A reports the probability itself as the odds (no conversion). B reports $1-p$. D inverts the ratio: $0.84/0.16 \approx 5.25$ — the odds against rather than the odds for the event.
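Both conversions in Questions 2–3 are one-liners; a quick sketch to check the arithmetic (the helper names are mine, not from the course):

```python
# Hypothetical helpers for the odds <-> probability conversions.
def odds_to_prob(odds):
    """p = odds / (1 + odds)"""
    return odds / (1 + odds)

def prob_to_odds(p):
    """odds = p / (1 - p)"""
    return p / (1 - p)

print(round(odds_to_prob(0.37), 2))  # Question 2: 0.37 / 1.37 -> 0.27
print(round(prob_to_odds(0.16), 2))  # Question 3: 0.16 / 0.84 -> 0.19
```

Note the two maps are inverses of each other, which is a handy sanity check: `odds_to_prob(prob_to_odds(p))` returns `p`.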

Atoms: odds-and-log-odds.

Question 4 4 points Ex4.4

A logistic model for $Y = I(\text{student gets an A})$ has $\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$ (study hours), $\hat\beta_2 = 1$ (GPA). For a student with $40$ hours and GPA $3.5$, what is $\hat p$?

Show answer
Correct answer: A

Linear predictor: $\hat\eta = -6 + 0.05\cdot 40 + 1\cdot 3.5 = -0.5$. Sigmoid: $\hat p = e^{-0.5}/(1+e^{-0.5}) = 1/(1+e^{0.5}) \approx 0.378$.

B reports $0.5 = \sigma(0)$ — i.e. treats the linear predictor as zero instead of computing $\hat\eta = -0.5$. C reports $\sigma(0.5) \approx 0.62$ — a sign error on $\hat\eta$. D reports $\sigma(1) \approx 0.73$, e.g. from using the GPA coefficient $\hat\beta_2 = 1$ alone as the linear predictor.

Atoms: logistic-regression.

Question 5 4 points Ex4.4

Same model as Question 4 ($\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$, $\hat\beta_2 = 1$). At GPA $3.5$, how many study hours give $\hat p = 0.5$?

Show answer
Correct answer: D

$\hat p = 0.5$ iff $\hat\eta = 0$. Solve $-6 + 0.05 x + 1\cdot 3.5 = 0$ → $0.05 x = 2.5$ → $x = 50$.

A is the value used in Q4 (it gave $\hat p \approx 0.38$, not $0.5$). B yields $x = 90$, i.e. $\hat\eta = 2$ and $\hat p = \sigma(2) \approx 0.88$ — the equation was set to the wrong target ($\hat\eta = 2$ instead of $\hat\eta = 0$). C drops the $1\cdot \text{GPA}$ contribution and solves $-6 + 0.05 x = 0$, giving $x = 120$.
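The Question 4–5 computations fit in a few lines (coefficients straight from the question; the function name is mine):

```python
import math

# Coefficients from Questions 4-5.
b0, b_hours, b_gpa = -6.0, 0.05, 1.0

def p_hat(hours, gpa):
    eta = b0 + b_hours * hours + b_gpa * gpa   # linear predictor
    return 1 / (1 + math.exp(-eta))            # sigmoid

# Question 4: 40 hours, GPA 3.5 -> eta = -0.5
print(round(p_hat(40, 3.5), 3))   # ~ 0.378

# Question 5: p = 0.5 iff eta = 0; solve for hours at GPA 3.5
hours_for_half = (0 - b0 - b_gpa * 3.5) / b_hours
print(hours_for_half)             # 50.0
```

The second step is just the $\hat\eta = 0$ condition rearranged for $x$, which is why no sigmoid appears in it.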

Atoms: logistic-regression, odds-and-log-odds.

Question 6 4 points Exam 2025 P5

A logistic model for default uses SEX (1 = male, 2 = female), PAY_0, and the interaction SEX:PAY_0. Output gives $\hat\beta_{\text{PAY}_0} = 0.80$ and $\hat\beta_{\text{SEX:PAY}_0} = 0.15$. Holding other covariates fixed, by what factor do the odds of default change for one additional month of delayed payment, separately for males and females?

Show answer
Correct answer: B

With SEX entered as a factor, the fitted model dummy-codes the second level, so the interaction coefficient attaches to $I(\texttt{SEX}=2)$ (females). One additional month of delayed payment then changes the log-odds by $\beta_{\text{PAY}_0}$ for males (the baseline level) and by $\beta_{\text{PAY}_0} + \beta_{\text{SEX:PAY}_0}$ for females. Exponentiating, the odds multiply by $e^{0.80} \approx 2.22$ for males and $e^{0.95} \approx 2.59$ for females; the exam keys give $2.22$ and $2.57$.

A ignores the interaction entirely — the canonical interaction trap the prof flagged in L27. C reports the interaction coefficient alone for females, forgetting that the main effect is still active. D applies the female factor to both groups.
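A sketch of the Question 6 arithmetic, assuming the dummy coding $I(\texttt{SEX}=2)$ so that males are the baseline level (the variable names are mine):

```python
import math

# Coefficients from the Question 6 output.
b_pay, b_inter = 0.80, 0.15

# Per-unit-PAY_0 odds factors under treatment (dummy) coding of SEX:
odds_factor_male   = math.exp(b_pay)            # baseline slope only
odds_factor_female = math.exp(b_pay + b_inter)  # main effect + interaction

print(round(odds_factor_male, 2))    # ~ 2.23
print(round(odds_factor_female, 2))  # ~ 2.59
```

The key point: exponentiate the *per-group slope*, not the interaction coefficient alone.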

Atoms: odds-and-log-odds, logistic-regression. Lecture: L27-summary ("we need to be able to do it for the men and the women").

Question 7 3 points

Mark each statement about logistic regression as true or false.

Show answer
  1. False — $\beta_j$ is the change in log-odds; the corresponding probability change is non-constant (depends on where on the sigmoid you are). Reporting it as an additive probability change is the canonical pitfall.
  2. False — the score equations are non-linear in $\boldsymbol\beta$; fit by Newton-Raphson / Fisher scoring (no closed form).
  3. True — the linear predictor sign flips with the encoding; every $\beta_j$ flips. State your encoding when interpreting output.
  4. True — the prof: "the collinearity problem we talked about a minute ago, that can happen here, and then this thing's no longer having a single maximum and it gets weird."

Atoms: logistic-regression, odds-and-log-odds. Lecture: L08-classif-2.

Question 8 4 points Exam 2025 P5

A KNN classifier with $K=35$ on credit-default data gives the test confusion matrix below (rows = true, columns = predicted):

$\begin{array}{c|cc} & \hat y = 0 & \hat y = 1 \\\hline y = 0 & 1820 & 60 \\ y = 1 & 380 & 240 \end{array}$

What are the sensitivity and specificity (positive class = default = 1)?

Show answer
Correct answer: A

Sensitivity $= TP/(TP+FN) = 240/(240+380) \approx 0.387$. Specificity $= TN/(TN+FP) = 1820/(1820+60) \approx 0.968$.

B swaps the two — the classic mix-up (sensitivity "sniffs out positives", specificity "spares the negatives"). C splits the overall accuracy ($\approx 0.82$) across both axes instead of using the per-class denominators. D uses the wrong denominator on specificity: $TN/(TN+FN)$ is the negative-predictive-value formula, not specificity.
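The Question 8 rates, computed directly from the confusion matrix (rows = true class, columns = predicted; positive class = default = 1):

```python
# Cell counts from the Question 8 confusion matrix.
TN, FP = 1820, 60    # true y = 0 row
FN, TP = 380, 240    # true y = 1 row

sensitivity = TP / (TP + FN)   # fraction of actual positives caught
specificity = TN / (TN + FP)   # fraction of actual negatives spared

print(round(sensitivity, 3))   # ~ 0.387
print(round(specificity, 3))   # ~ 0.968
```

Each rate uses only its own *row* of the matrix — that is what protects it from class imbalance, unlike accuracy, which mixes the rows.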

Atoms: confusion-matrix, sensitivity-specificity, knn-classification. Lecture: L27-summary.

Question 9 3 points

Mark each statement about sensitivity and specificity as true or false.

Show answer
  1. True — by definition. Mnemonic: sensitivity "Sniffs out positives".
  2. True — more positive predictions → more $TP$ (sens ↑) but also more $FP$ (spec ↓).
  3. False — that's precision (positive predictive value). Sensitivity uses $FN$ in the denominator, not $FP$.
  4. True — exactly the prof's class-imbalance vignette: high accuracy hides the model never catches a positive.

Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.

Question 10 4 points Exam 2025 P5

Which statement most accurately describes a ROC curve and AUC for a binary classifier producing scores $\hat p(x)$?

Show answer
Correct answer: B

The ROC plots TPR (= sensitivity) on the $y$-axis against FPR (= $1-$specificity) on the $x$-axis as the threshold sweeps from $1$ down to $0$. AUC is the probability $\Pr(\hat p(X^+) > \hat p(X^-))$.

A swaps the $x$-axis convention: with specificity on $x$ the diagonal would slope the wrong way (some software uses this; the prof's slide deck and ISLP Fig. 4.8 do not). C reports an accuracy-vs-threshold curve, which is a different object. D describes a precision-recall curve (information-retrieval style), and the AUC-near-zero claim is wrong (AUC $0.5$ is useless; AUC $0$ is perfectly inverted, hence informative).
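The pairwise-ranking reading of AUC, $\Pr(\hat p(X^+) > \hat p(X^-))$, can be checked on toy numbers (the scores below are invented for illustration, not from the question):

```python
# Hypothetical scores of true positives and true negatives.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.3]

# AUC = Pr(score of a random positive > score of a random negative),
# counting ties as 1/2.
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)  # 14 of 16 pairs correctly ordered -> 0.875
```

This is exactly the quantity the ROC curve's area summarises: no threshold is chosen anywhere in the computation.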

Atoms: roc-auc, sensitivity-specificity.

Question 11 3 points

Mark each statement about ROC curves and AUC as true or false.

Show answer
  1. True — the diagonal of the ROC plot.
  2. False — AUC $0.2$ means the scoring is informative but inverted; flipping all predictions gives AUC $0.8$. "Below the diagonal → invert your classifier."
  3. True — AUC summarizes the curve over all thresholds, you can pick an operating point afterwards.

Atoms: roc-auc.

Question 12 4 points

In LDA, why is the discriminant $\delta_k(x)$ linear in $x$, while in QDA it is quadratic?

Show answer
Correct answer: D

Take the log of $\pi_k f_k(x)$. The exponent contains $-\tfrac12 x^\top\Sigma_k^{-1}x$. With pooled $\Sigma$ this term has no $k$ — drop it. With class-specific $\Sigma_k$ the term depends on $k$ and stays, making $\delta_k(x)$ quadratic. The prof flagged "where does the quadratic come from?" as a typical exam question.

A is irrelevant — both methods take general priors. B misstates QDA: QDA also assumes Gaussian class-conditionals, just with class-specific $\Sigma_k$. C garbles the algebra (both methods use $\Sigma_k^{-1}$ in the discriminant; squaring the inverse is not the mechanism).

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, discriminant-score-and-decision-boundary. Lecture: L09-classif-3.

Question 13 4 points

Two-class LDA with $\mu_A = (1,1)^\top$, $\mu_B = (3,3)^\top$, shared covariance $\Sigma = 2I$, equal priors $\pi_A = \pi_B = 0.5$. The decision boundary is:

Show answer
Correct answer: B

$\Sigma^{-1} = \tfrac12 I$. Equating $\delta_A(x) = \delta_B(x)$: cross term coefficient $\Sigma^{-1}(\mu_A - \mu_B) = \tfrac12(-2,-2) = (-1,-1)$. Intercepts $-\tfrac12\mu_A^\top\Sigma^{-1}\mu_A = -\tfrac12$ and $+\tfrac12\mu_B^\top\Sigma^{-1}\mu_B = +\tfrac92$ sum to $4$. With equal priors, $\log\pi_A - \log\pi_B = 0$. So $-x_1 - x_2 + 4 = 0$, i.e. $x_1 + x_2 = 4$.

A turns the midpoint of the means, $(2,2)$, into the line $x_1 + x_2 = 2$ — but the midpoint's coordinates sum to $4$, so the correct boundary $x_1 + x_2 = 4$ already passes through it. C reverses the sign of $\mu_B - \mu_A$ inside $\Sigma^{-1}(\cdot)$, landing on the line through the means rather than their perpendicular bisector. D imagines a quadratic boundary, which would only arise under QDA.
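The boundary can be verified by evaluating both discriminant scores at points on the claimed line (a minimal sketch; with $\Sigma = 2I$, $\Sigma^{-1} = \tfrac12 I$, and the equal priors cancel):

```python
# delta_k(x) = x' S^{-1} mu_k - 0.5 mu_k' S^{-1} mu_k   (log-prior terms cancel)
def delta(x, mu):
    s_inv = 0.5  # Sigma = 2I, so Sigma^{-1} = 0.5 * I
    dot = s_inv * (x[0] * mu[0] + x[1] * mu[1])
    return dot - 0.5 * s_inv * (mu[0] ** 2 + mu[1] ** 2)

mu_A, mu_B = (1, 1), (3, 3)

# Any point with x1 + x2 = 4 should score equally for both classes:
for x in [(0, 4), (2, 2), (4, 0)]:
    print(delta(x, mu_A) == delta(x, mu_B))  # True for each
```

Equality holds exactly here because all the scores are multiples of $\tfrac12$, so no floating-point tolerance is needed.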

Atoms: discriminant-score-and-decision-boundary, linear-discriminant-analysis. Lecture: L09-classif-3 ("I didn't ask for a line. I just equated the two things and then I solved for them and then it became a line.").

Question 14 4 points

In 1D LDA with two classes, $\mu_1 = 2$, $\mu_2 = 8$, shared variance $\sigma^2 = 9$, and equal priors. The decision threshold $x^\star$ where $\delta_1(x) = \delta_2(x)$ is:

Show answer
Correct answer: A

Solving $x\mu_1/\sigma^2 - \mu_1^2/(2\sigma^2) = x\mu_2/\sigma^2 - \mu_2^2/(2\sigma^2)$ with equal priors gives $x^\star = (\mu_1 + \mu_2)/2 = 5$ — the midpoint, independent of $\sigma^2$.

B, C, D smuggle $\sigma^2$ into the threshold. B looks like the unequal-prior correction term ($\sigma^2/(\mu_2 - \mu_1) \cdot \log(\pi_1/\pi_2)$, added to the midpoint) but with the prior log-ratio replaced by $1$, so it shifts the midpoint for no reason. With equal priors and equal variances the variance terms cancel; only the means matter.
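A sketch of the 1D threshold, with and without equal priors (sign convention chosen so that raising $\pi_1$ pushes the boundary toward $\mu_2$, enlarging class 1's region — the Question 15 effect):

```python
import math

# Question 14 parameters.
mu1, mu2, var = 2.0, 8.0, 9.0

# Equal priors: the threshold is the midpoint, independent of the variance.
x_star = (mu1 + mu2) / 2
print(x_star)  # 5.0

# Unequal priors shift it by var/(mu2 - mu1) * log(pi1/pi2):
pi1, pi2 = 0.7, 0.3
x_star_shifted = (mu1 + mu2) / 2 + var / (mu2 - mu1) * math.log(pi1 / pi2)
print(round(x_star_shifted, 2))  # ~ 6.27, toward mu2: class 1's region grows
```

Note that $\sigma^2$ enters only through the prior-correction term, which vanishes when $\pi_1 = \pi_2$.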

Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.

Question 15 3 points

In LDA, you raise the prior $\pi_A$ from $0.5$ to $0.7$ (and lower $\pi_B$ to $0.3$) without changing the means or the shared covariance. What happens to the decision boundary between classes $A$ and $B$?

Show answer
Correct answer: A

The boundary equation contains $\log\pi_A - \log\pi_B$. Increasing $\pi_A$ raises $\delta_A$ everywhere; the equality $\delta_A = \delta_B$ now holds at a point closer to $\mu_B$, i.e. the boundary shifts away from $\mu_A$, expanding the region classified as $A$.

B gets the parallel-shift geometry right but botches the consequence: a parallel translation of the boundary in input space necessarily changes the volumes assigned to each class (the more-likely class claims more space). C ignores the $\log\pi_k$ term entirely. D mistakenly imagines a rotation; for fixed $\mu_k, \Sigma$, the boundary is just translated parallel to itself.

Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.

Question 16 3 points ISLP §4 Q5

Mark each statement comparing LDA and QDA as true or false.

Show answer
  1. False — LDA's pooling reduces variance; on small samples with a truly linear boundary, LDA usually wins on test error. The iris example: LDA test 0.17 vs QDA test 0.32.
  2. True — strictly more flexible models cannot increase training misclassification for the fitted maximum-likelihood solution.
  3. True — with more data, QDA can afford the extra $K\cdot p(p+1)/2$ parameters, and the lower-bias model wins.
  4. False — LDA pools across classes for one $\Sigma$ ($p(p+1)/2$ total). QDA keeps $K$ separate $\Sigma_k$'s, $K \cdot p(p+1)/2$ total.

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff. Lecture: L09-classif-3.

Question 17 4 points

For $p = 10$ predictors and $K = 3$ classes, how many free covariance parameters does QDA estimate (counting only the class covariance matrices, not means or priors)?

Show answer
Correct answer: D

Each $\Sigma_k$ is $p\times p$ symmetric → $p(p+1)/2 = 10 \cdot 11/2 = 55$ free parameters. With $K = 3$ class-specific covariances: $3 \cdot 55 = 165$.

A is one $\Sigma$ alone — that's the LDA pooled-covariance count. B forgets symmetry and reports $p^2 = 100$ for one matrix. C forgets symmetry and reports $K \cdot p^2 = 300$ — same mistake compounded.

Atoms: quadratic-discriminant-analysis.

Question 18 3 points

What is the defining modeling assumption of (Gaussian) naive Bayes?

Show answer
Correct answer: C

$f_k(x) = \prod_j f_{kj}(x_j)$ — predictors are independent within each class. With Gaussian marginals plus class-specific variances, this is exactly QDA with diagonal $\Sigma_k$. The win is the parameter count, $O(p)$ vs $O(p^2)$.

A drops the "given class" qualifier — that would be a much stronger (essentially never holding) assumption. B is a uniform-prior assumption used by some other classifiers; naive Bayes still estimates $\pi_k$ from frequencies. D conflates the Bayes classifier abstraction with naive Bayes (same prof-flagged confusion).

Atoms: naive-bayes, diagnostic-vs-sampling-paradigm.

Question 19 3 points

Mark each statement about the diagnostic vs sampling (generative) paradigm as true or false.

Show answer
  1. True — the canonical roster the prof drew up in L07/L09.
  2. True — the Bayes-flip $\Pr(Y=k\mid X) \propto \pi_k f_k(x)$ is the engine for LDA/QDA/naive Bayes.
  3. True — both yield a linear log-odds; LDA leans on a Gaussian $X$ assumption, logistic does not.
  4. False — naive Bayes is a generative (sampling) classifier: it models $\Pr(X \mid Y)$ via a product of marginals.

Atoms: diagnostic-vs-sampling-paradigm, logistic-regression, linear-discriminant-analysis, naive-bayes.

Question 20 4 points Ex4.1

A 7-point training set with two predictors and labels $\{A, A, A, B, B, B, B\}$. We want to predict the class at $x_0 = (1,2)$ using KNN. The prof's exercise asks why $K = 7$ is a poor choice. The best explanation is:

Show answer
Correct answer: A

$K = n$ collapses KNN to "predict the majority class everywhere" — pure underfit, not overfit. Here that majority is $B$ (4 of 7), regardless of where $x_0$ lives.

B inverts the bias-variance direction: large $K$ means more smoothing (high bias / low variance), not overfitting. C is a non-issue here ($4{:}3$ split, no tie). D blames sample size rather than the specific failure of $K = n$; the prof's question pinpoints the locality-collapse, not a generic small-$n$ complaint.
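The $K = n$ collapse is easy to demonstrate (the coordinates below are invented — the exercise only fixes the labels $\{A,A,A,B,B,B,B\}$):

```python
from collections import Counter

# Hypothetical 7-point training set; labels as in the question.
train = [((1, 1), "A"), ((2, 1), "A"), ((1, 3), "A"),
         ((4, 4), "B"), ((5, 2), "B"), ((4, 1), "B"), ((6, 5), "B")]

def knn_predict(x0, k):
    dist2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    neighbours = sorted(train, key=lambda t: dist2(t[0], x0))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# K = 7 uses every training point: the global majority B wins everywhere.
print([knn_predict(x0, 7) for x0 in [(1, 2), (0, 0), (10, 10)]])

# A local vote can still differ: near the A cluster, K = 1 predicts A.
print(knn_predict((1, 2), 1))
```

With $K = n$ the "neighbourhood" is the whole training set, so the classifier is constant — the locality that makes KNN useful is gone.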

Atoms: knn-classification, bias-variance-tradeoff.

Question 21 3 points

Mark each statement about KNN classification as true or false.

Show answer
  1. True — $K$ is the flexibility knob; small $K$ chases noise (overfit), large $K$ over-smooths (underfit).
  2. False — wiggly Bayes boundary needs a flexible KNN, i.e. small $K$ to track it. Exercise 4.1c.
  3. True — KNN is not scale-invariant; one feature in millimetres swamps another in years if you don't standardize.
  4. False — using the test set to tune $K$ leaks information; tune via CV on the training set, evaluate on a held-out test set.

Atoms: knn-classification, cross-validation, standardization, bias-variance-tradeoff.

Question 22 4 points ISLP §4 Q4

Observations are uniformly distributed on the unit hypercube $[0,1]^p$. To predict at a new point we use only training observations whose coordinates each lie within a $10\%$ window centred on the test point's coordinate. As $p$ grows from $1$ to $100$, the average fraction of training points actually used shrinks roughly as:

Show answer
Correct answer: B

Each coordinate independently keeps a fraction $0.1$, so the joint event has probability $0.1^p$. At $p=100$ that is $10^{-100}$ — no training points are "local" to a test point. This is the geometric core of the curse of dimensionality and the reason KNN breaks in high $p$.

A treats the dimensions as if they shared a window rather than each contributing a factor. C is a linear model of an exponential phenomenon. D inverts the direction — in high $p$, the bulk of the volume is near the boundary, not near the centre.
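The $0.1^p$ decay is worth seeing numerically:

```python
# Question 22: each of the p coordinates independently keeps a fraction 0.1
# of the data, so the "local" fraction is 0.1 ** p.
for p in [1, 2, 5, 10, 100]:
    print(f"p = {p:3d}   local fraction ~ {0.1 ** p:.1e}")
```

Already at $p = 10$ only one observation in ten billion is local; at $p = 100$ the fraction is $10^{-100}$, i.e. effectively zero for any finite sample.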

Atoms: curse-of-dimensionality, knn-classification.

Question 23 3 points

Mark each statement about high-dimensional classification as true or false.

Show answer
  1. True — distance-based methods like KNN take the biggest hit; parametric/generative methods make stronger assumptions and survive better.
  2. False — standardizing equalises feature scales (which KNN needs anyway), but does not change the geometric concentration of distances.
  3. True — diagonal $\Sigma_k$ replaces $K\cdot p(p+1)/2$ parameters with $\sim 2pK$, dodging the variance explosion.

Atoms: curse-of-dimensionality, knn-classification, naive-bayes.

Question 24 4 points Exam 2025 P5

In the credit-default data, only about $5\%$ of clients default. A trivial classifier that predicts "no default" for every client achieves around $95\%$ accuracy. The most useful conclusion to draw from this is:

Show answer
Correct answer: A

Predicting "no default" always gives $TP = 0, FN = $ all defaulters, $TN = $ all non-defaulters, $FP = 0$. So sensitivity $= 0/P = 0$ and specificity $= N/N = 1$. Accuracy is dominated by the majority class. The prof: "models that are really naive and only predict that it's going to be a zero are already going to do pretty well."

B treats accuracy as definitive — exactly the trap the prof flagged. C swaps sensitivity and specificity: sensitivity is $0$ (no positives caught) and specificity is $1$ (no false alarms). D conflates a constant classifier with a thresholded probabilistic one: the trivial rule has no score to threshold on, so changing a "threshold" cannot recover sensitivity here.

Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.

Question 25 4 points

For an iris-like 2-predictor / 3-class dataset, you observe the following error rates:

$\begin{array}{lcc} \text{Method} & \text{Train error} & \text{Test error} \\\hline \text{LDA} & 0.19 & 0.17 \\ \text{QDA} & 0.17 & 0.32 \end{array}$

Which method should you prefer, and why?

Show answer
Correct answer: D

Test error is the only honest generalisation signal. QDA's tighter training fit but worse test fit is the canonical bias-variance overfit pattern; pooling $\Sigma$ in LDA is paying off on this sample size. The prof's iris example, verbatim.

A treats training error as truth — the classic overfitting trap. B is hand-waving: regularising QDA to look like LDA just becomes LDA. C is wrong about the magnitude — nearly doubling test error is large; you don't need a formal test to pick the better generaliser.

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff, confusion-matrix. Lecture: L09-classif-3.

Question 26 4 points ISLP §4 Q7

A 1D LDA setup: companies that issued a dividend ($Y = 1$) had $\bar X = 10$, those that did not had $\bar X = 0$, with shared $\hat\sigma^2 = 36$. The prior is $\pi_1 = 0.8$. Assuming Gaussian class-conditionals, what is $\Pr(Y = 1 \mid X = 4)$?

Show answer
Correct answer: C

Bayes' theorem: $\Pr(Y=1\mid X=4) = \dfrac{\pi_1 f_1(4)}{\pi_1 f_1(4) + \pi_0 f_0(4)}$.
$f_1(4) \propto \exp(-(4-10)^2/(2\cdot 36)) = \exp(-0.5)$.
$f_0(4) \propto \exp(-(4-0)^2/(2\cdot 36)) = \exp(-8/36) = \exp(-2/9)$.
Numerator: $0.8 \exp(-0.5) \approx 0.4852$.
Denominator: $0.4852 + 0.2 \exp(-2/9) \approx 0.4852 + 0.1602 = 0.6454$.
Ratio: $\approx 0.752$.

A flips numerator and denominator (computes $\Pr(Y=0\mid X=4) \approx 0.25$). B drops the prior (weights the classes equally) and gets $e^{-0.5}/(e^{-0.5}+e^{-2/9}) \approx 0.43$. D forgets the $f_0$ density entirely (just the prior).
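The same computation as a sketch (the shared $1/\sqrt{2\pi\sigma^2}$ factor cancels in the ratio, but is included for completeness):

```python
import math

# Question 26: plug-in Bayes' theorem with Gaussian class-conditionals.
def gauss_density(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

pi1, pi0 = 0.8, 0.2
num = pi1 * gauss_density(4, 10, 36)          # pi_1 * f_1(4)
den = num + pi0 * gauss_density(4, 0, 36)     # + pi_0 * f_0(4)
print(round(num / den, 3))                    # ~ 0.752
```

Note that even though $X = 4$ is closer to $\mu_0 = 0$'s density peak, the $0.8$ prior drags the posterior firmly toward $Y = 1$.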

Atoms: linear-discriminant-analysis, diagnostic-vs-sampling-paradigm, multivariate-normal.

Question 27 3 points

Mark each statement about LDA's class-conditional density assumption as true or false.

Show answer
  1. True — exactly the LDA assumption set.
  2. True — the standard pooled estimator, an unbiased combination of within-class sample covariances.
  3. False — the boundary's normal direction is $\Sigma^{-1}(\mu_0 - \mu_1)$; the covariance rotates and rescales it. They coincide only when $\Sigma \propto I$.
  4. True — under the model assumptions LDA equals the Bayes classifier with plug-in MLEs; no other procedure does asymptotically better.

Atoms: linear-discriminant-analysis, multivariate-normal, discriminant-score-and-decision-boundary.

Question 28 3 points Ex4.5

Two competing classifiers $p(x)$ and $q(x)$ produce probabilities of disease for the same population. On independent validation sets, $p(x)$ has AUC $0.6$ and $q(x)$ has AUC $0.7$. Which method would you prefer, and why?

Show answer
Correct answer: B

Higher AUC = better discrimination across all thresholds. AUC $= \Pr(\hat p(X^+) > \hat p(X^-))$, so $q(x)$'s $0.7$ beats $p(x)$'s $0.6$.

A misreads AUC as a threshold-dependent metric — its whole point is that it summarises the ROC across all thresholds, so no operating point is needed to compare. C is too strong — comparing AUCs across validation sets is harder than on the same set, but the prof's exercise (Ex4.5c) does ask you to prefer the higher one. D leans on a real concern (PR vs ROC under heavy imbalance) but exaggerates it: AUC is well-defined regardless of prevalence, and the prof's exercise treats the comparison as direct.

Atoms: roc-auc.