
Module 04 — Classification

28 questions · 100 points · ~45 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

In the classification setup, what is the Bayes error rate?

Show answer
Correct answer: C

The Bayes classifier picks $\arg\max_k \Pr(Y=k\mid X)$ using the true posterior; its expected $0/1$ loss is the irreducible floor, the classification analogue of $\sigma^2$ in regression.

A confuses the abstract Bayes classifier with a fitted Bayes-network model. B is the empirical training error, not the population-level rate. D conflates the Bayes classifier with naive Bayes — same word, different objects (one is the optimal abstraction, the other a specific generative model).

Atoms: classification-setup, diagnostic-vs-sampling-paradigm. Lecture: L07-classif-1.

Question 2 4 points ISLP §4 Q9

On average, what fraction of credit-card holders with odds $0.37$ of defaulting will in fact default?

Show answer
Correct answer: D

Invert odds → probability: $p = \text{odds}/(1+\text{odds}) = 0.37/1.37 \approx 0.27$.

A reports the odds as if it were already a probability (the most common slip — odds and probability are different scales). B lands on $0.63$ by treating the odds as a probability and taking its complement, $1 - 0.37 = 0.63$ — two mistakes stacked on the same wrong scale. C confuses odds with log-odds, computing $e^{0.37}/(1+e^{0.37}) \approx 0.59$ — another scale mix-up.

Atoms: odds-and-log-odds. Lecture: L27-summary ("This is the kind of question I would ask. It's simple. You calculate it.").

Question 3 4 points Ex4.3

An individual has a $16\%$ chance of defaulting on her credit-card payment. What are the odds that she will default?

Show answer
Correct answer: C

Odds $= p/(1-p) = 0.16/0.84 \approx 0.19$.

A reports the probability itself as the odds (no conversion). B reports $1-p$. D inverts the ratio: $0.84/0.16 \approx 5.25$ — the odds against rather than the odds for the event.
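Both conversions in Questions 2–3 are one-liners; a quick sketch to check the arithmetic (the helper names are mine, not from the course):

```python
# Hypothetical helpers for the odds <-> probability conversions.
def odds_to_prob(odds):
    """p = odds / (1 + odds)"""
    return odds / (1 + odds)

def prob_to_odds(p):
    """odds = p / (1 - p)"""
    return p / (1 - p)

print(round(odds_to_prob(0.37), 2))  # Question 2: 0.37 / 1.37 -> 0.27
print(round(prob_to_odds(0.16), 2))  # Question 3: 0.16 / 0.84 -> 0.19
```

Note the two maps are inverses of each other, which is a handy sanity check: `odds_to_prob(prob_to_odds(p))` returns `p`.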

Atoms: odds-and-log-odds.

Question 4 4 points Ex4.4

A logistic model for $Y = I(\text{student gets an A})$ has $\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$ (study hours), $\hat\beta_2 = 1$ (GPA). For a student with $40$ hours and GPA $3.5$, what is $\hat p$?

Show answer
Correct answer: A

Linear predictor: $\hat\eta = -6 + 0.05\cdot 40 + 1\cdot 3.5 = -0.5$. Sigmoid: $\hat p = e^{-0.5}/(1+e^{-0.5}) = 1/(1+e^{0.5}) \approx 0.378$.

B reports $0.5 = \sigma(0)$ — i.e. treats the linear predictor as zero instead of computing $\hat\eta = -0.5$. C reports $\sigma(0.5) \approx 0.62$ — a sign error on $\hat\eta$. D reports $\sigma(1) \approx 0.73$, e.g. from using the GPA coefficient $\hat\beta_2 = 1$ alone as the linear predictor.

Atoms: logistic-regression.

Question 5 4 points Ex4.4

Same model as Question 4 ($\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$, $\hat\beta_2 = 1$). At GPA $3.5$, how many study hours give $\hat p = 0.5$?

Show answer
Correct answer: D

$\hat p = 0.5$ iff $\hat\eta = 0$. Solve $-6 + 0.05 x + 1\cdot 3.5 = 0$ → $0.05 x = 2.5$ → $x = 50$.

A is the value used in Q4 (it gave $\hat p \approx 0.38$, not $0.5$). B yields $x = 90$, i.e. $\hat\eta = 2$ and $\hat p = \sigma(2) \approx 0.88$ — the equation was set to the wrong target ($\hat\eta = 2$ instead of $\hat\eta = 0$). C drops the $1\cdot \text{GPA}$ contribution and solves $-6 + 0.05 x = 0$, giving $x = 120$.
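The Question 4–5 computations fit in a few lines (coefficients straight from the question; the function name is mine):

```python
import math

# Coefficients from Questions 4-5.
b0, b_hours, b_gpa = -6.0, 0.05, 1.0

def p_hat(hours, gpa):
    eta = b0 + b_hours * hours + b_gpa * gpa   # linear predictor
    return 1 / (1 + math.exp(-eta))            # sigmoid

# Question 4: 40 hours, GPA 3.5 -> eta = -0.5
print(round(p_hat(40, 3.5), 3))   # ~ 0.378

# Question 5: p = 0.5 iff eta = 0; solve for hours at GPA 3.5
hours_for_half = (0 - b0 - b_gpa * 3.5) / b_hours
print(hours_for_half)             # 50.0
```

The second step is just the $\hat\eta = 0$ condition rearranged for $x$, which is why no sigmoid appears in it.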

Atoms: logistic-regression, odds-and-log-odds.

Question 6 4 points Exam 2025 P5

A logistic model for default uses SEX (1 = male, 2 = female), PAY_0, and the interaction SEX:PAY_0. Output gives $\hat\beta_{\text{PAY}_0} = 0.80$ and $\hat\beta_{\text{SEX:PAY}_0} = 0.15$. Holding other covariates fixed, by what factor do the odds of default change for one additional month of delayed payment, separately for males and females?

Show answer
Correct answer: B

With SEX entered as a factor, the fitted model dummy-codes the second level, so the interaction coefficient attaches to $I(\texttt{SEX}=2)$ (females). One additional month of delayed payment then changes the log-odds by $\beta_{\text{PAY}_0}$ for males (the baseline level) and by $\beta_{\text{PAY}_0} + \beta_{\text{SEX:PAY}_0}$ for females. Exponentiating, the odds multiply by $e^{0.80} \approx 2.22$ for males and $e^{0.95} \approx 2.59$ for females; the exam keys give $2.22$ and $2.57$.

A ignores the interaction entirely — the canonical interaction trap the prof flagged in L27. C reports the interaction coefficient alone for females, forgetting that the main effect is still active. D applies the female factor to both groups.
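A sketch of the Question 6 arithmetic, assuming the dummy coding $I(\texttt{SEX}=2)$ so that males are the baseline level (the variable names are mine):

```python
import math

# Coefficients from the Question 6 output.
b_pay, b_inter = 0.80, 0.15

# Per-unit-PAY_0 odds factors under treatment (dummy) coding of SEX:
odds_factor_male   = math.exp(b_pay)            # baseline slope only
odds_factor_female = math.exp(b_pay + b_inter)  # main effect + interaction

print(round(odds_factor_male, 2))    # ~ 2.23
print(round(odds_factor_female, 2))  # ~ 2.59
```

The key point: exponentiate the *per-group slope*, not the interaction coefficient alone.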

Atoms: odds-and-log-odds, logistic-regression. Lecture: L27-summary ("we need to be able to do it for the men and the women").

Question 7 3 points

Mark each statement about logistic regression as true or false.

Show answer
  1. False — $\beta_j$ is the change in log-odds; the corresponding probability change is non-constant (depends on where on the sigmoid you are). Reporting it as an additive probability change is the canonical pitfall.
  2. False — the score equations are non-linear in $\boldsymbol\beta$; fit by Newton-Raphson / Fisher scoring (no closed form).
  3. True — the linear predictor sign flips with the encoding; every $\beta_j$ flips. State your encoding when interpreting output.
  4. True — the prof: "the collinearity problem we talked about a minute ago, that can happen here, and then this thing's no longer having a single maximum and it gets weird."

Atoms: logistic-regression, odds-and-log-odds. Lecture: L08-classif-2.

Question 8 4 points Exam 2025 P5

A KNN classifier with $K=35$ on credit-default data gives the test confusion matrix below (rows = true, columns = predicted):

$\begin{array}{c|cc} & \hat y = 0 & \hat y = 1 \\\hline y = 0 & 1820 & 60 \\ y = 1 & 380 & 240 \end{array}$

What are the sensitivity and specificity (positive class = default = 1)?

Show answer
Correct answer: A

Sensitivity $= TP/(TP+FN) = 240/(240+380) \approx 0.387$. Specificity $= TN/(TN+FP) = 1820/(1820+60) \approx 0.968$.

B swaps the two — the classic mix-up (sensitivity "sniffs out positives", specificity "spares the negatives"). C splits the overall accuracy ($\approx 0.82$) across both axes instead of using the per-class denominators. D uses the wrong denominator on specificity: $TN/(TN+FN)$ is the negative-predictive-value formula, not specificity.
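The Question 8 rates, computed directly from the confusion matrix (rows = true class, columns = predicted; positive class = default = 1):

```python
# Cell counts from the Question 8 confusion matrix.
TN, FP = 1820, 60    # true y = 0 row
FN, TP = 380, 240    # true y = 1 row

sensitivity = TP / (TP + FN)   # fraction of actual positives caught
specificity = TN / (TN + FP)   # fraction of actual negatives spared

print(round(sensitivity, 3))   # ~ 0.387
print(round(specificity, 3))   # ~ 0.968
```

Each rate uses only its own *row* of the matrix — that is what protects it from class imbalance, unlike accuracy, which mixes the rows.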

Atoms: confusion-matrix, sensitivity-specificity, knn-classification. Lecture: L27-summary.

Question 9 3 points

Mark each statement about sensitivity and specificity as true or false.

Show answer
  1. True — by definition. Mnemonic: sensitivity "Sniffs out positives".
  2. True — more positive predictions → more $TP$ (sens ↑) but also more $FP$ (spec ↓).
  3. False — that's precision (positive predictive value). Sensitivity uses $FN$ in the denominator, not $FP$.
  4. True — exactly the prof's class-imbalance vignette: high accuracy hides the model never catches a positive.

Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.

Question 10 4 points Exam 2025 P5

Which statement most accurately describes a ROC curve and AUC for a binary classifier producing scores $\hat p(x)$?

Show answer
Correct answer: B

The ROC plots TPR (= sensitivity) on the $y$-axis against FPR (= $1-$specificity) on the $x$-axis as the threshold sweeps from $1$ down to $0$. AUC is the probability $\Pr(\hat p(X^+) > \hat p(X^-))$.

A swaps the $x$-axis convention: with specificity on $x$ the diagonal would slope the wrong way (some software uses this; the prof's slide deck and ISLP Fig. 4.8 do not). C reports an accuracy-vs-threshold curve, which is a different object. D describes a precision-recall curve (information-retrieval style), and the AUC-near-zero claim is wrong (AUC $0.5$ is useless; AUC $0$ is perfectly inverted, hence informative).
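The pairwise-ranking reading of AUC, $\Pr(\hat p(X^+) > \hat p(X^-))$, can be checked on toy numbers (the scores below are invented for illustration, not from the question):

```python
# Hypothetical scores of true positives and true negatives.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.3]

# AUC = Pr(score of a random positive > score of a random negative),
# counting ties as 1/2.
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)  # 14 of 16 pairs correctly ordered -> 0.875
```

This is exactly the quantity the ROC curve's area summarises: no threshold is chosen anywhere in the computation.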

Atoms: roc-auc, sensitivity-specificity.

Question 11 3 points

Mark each statement about ROC curves and AUC as true or false.

Show answer
  1. True — the diagonal of the ROC plot.
  2. False — AUC $0.2$ means the scoring is informative but inverted; flipping all predictions gives AUC $0.8$. "Below the diagonal → invert your classifier."
  3. True — AUC summarizes the curve over all thresholds, you can pick an operating point afterwards.

Atoms: roc-auc.

Question 12 4 points

In LDA, why is the discriminant $\delta_k(x)$ linear in $x$, while in QDA it is quadratic?

Show answer
Correct answer: D

Take the log of $\pi_k f_k(x)$. The exponent contains $-\tfrac12 x^\top\Sigma_k^{-1}x$. With pooled $\Sigma$ this term has no $k$ — drop it. With class-specific $\Sigma_k$ the term depends on $k$ and stays, making $\delta_k(x)$ quadratic. The prof flagged "where does the quadratic come from?" as a typical exam question.

A is irrelevant — both methods take general priors. B misstates QDA: QDA also assumes Gaussian class-conditionals, just with class-specific $\Sigma_k$. C garbles the algebra (both methods use $\Sigma_k^{-1}$ in the discriminant; squaring the inverse is not the mechanism).

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, discriminant-score-and-decision-boundary. Lecture: L09-classif-3.

Question 13 4 points

Two-class LDA with $\mu_A = (1,1)^\top$, $\mu_B = (3,3)^\top$, shared covariance $\Sigma = 2I$, equal priors $\pi_A = \pi_B = 0.5$. The decision boundary is:

Show answer
Correct answer: B

$\Sigma^{-1} = \tfrac12 I$. Equating $\delta_A(x) = \delta_B(x)$: cross term coefficient $\Sigma^{-1}(\mu_A - \mu_B) = \tfrac12(-2,-2) = (-1,-1)$. Intercepts $-\tfrac12\mu_A^\top\Sigma^{-1}\mu_A = -\tfrac12$ and $+\tfrac12\mu_B^\top\Sigma^{-1}\mu_B = +\tfrac92$ sum to $4$. With equal priors, $\log\pi_A - \log\pi_B = 0$. So $-x_1 - x_2 + 4 = 0$, i.e. $x_1 + x_2 = 4$.

A turns the midpoint of the means, $(2,2)$, into the line $x_1 + x_2 = 2$ — but the midpoint's coordinates sum to $4$, so the correct boundary $x_1 + x_2 = 4$ already passes through it. C reverses the sign of $\mu_B - \mu_A$ inside $\Sigma^{-1}(\cdot)$, landing on the line through the means rather than their perpendicular bisector. D imagines a quadratic boundary, which would only arise under QDA.
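The boundary can be verified by evaluating both discriminant scores at points on the claimed line (a minimal sketch; with $\Sigma = 2I$, $\Sigma^{-1} = \tfrac12 I$, and the equal priors cancel):

```python
# delta_k(x) = x' S^{-1} mu_k - 0.5 mu_k' S^{-1} mu_k   (log-prior terms cancel)
def delta(x, mu):
    s_inv = 0.5  # Sigma = 2I, so Sigma^{-1} = 0.5 * I
    dot = s_inv * (x[0] * mu[0] + x[1] * mu[1])
    return dot - 0.5 * s_inv * (mu[0] ** 2 + mu[1] ** 2)

mu_A, mu_B = (1, 1), (3, 3)

# Any point with x1 + x2 = 4 should score equally for both classes:
for x in [(0, 4), (2, 2), (4, 0)]:
    print(delta(x, mu_A) == delta(x, mu_B))  # True for each
```

Equality holds exactly here because all the scores are multiples of $\tfrac12$, so no floating-point tolerance is needed.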

Atoms: discriminant-score-and-decision-boundary, linear-discriminant-analysis. Lecture: L09-classif-3 ("I didn't ask for a line. I just equated the two things and then I solved for them and then it became a line.").

Question 14 4 points

In 1D LDA with two classes, $\mu_1 = 2$, $\mu_2 = 8$, shared variance $\sigma^2 = 9$, and equal priors. The decision threshold $x^\star$ where $\delta_1(x) = \delta_2(x)$ is:

Show answer
Correct answer: A

Solving $x\mu_1/\sigma^2 - \mu_1^2/(2\sigma^2) = x\mu_2/\sigma^2 - \mu_2^2/(2\sigma^2)$ with equal priors gives $x^\star = (\mu_1 + \mu_2)/2 = 5$ — the midpoint, independent of $\sigma^2$.

B, C, D smuggle $\sigma^2$ into the threshold. B looks like the unequal-prior correction term ($\sigma^2/(\mu_2 - \mu_1) \cdot \log(\pi_1/\pi_2)$, added to the midpoint) but with the prior log-ratio replaced by $1$, so it shifts the midpoint for no reason. With equal priors and equal variances the variance terms cancel; only the means matter.
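A sketch of the 1D threshold, with and without equal priors (sign convention chosen so that raising $\pi_1$ pushes the boundary toward $\mu_2$, enlarging class 1's region — the Question 15 effect):

```python
import math

# Question 14 parameters.
mu1, mu2, var = 2.0, 8.0, 9.0

# Equal priors: the threshold is the midpoint, independent of the variance.
x_star = (mu1 + mu2) / 2
print(x_star)  # 5.0

# Unequal priors shift it by var/(mu2 - mu1) * log(pi1/pi2):
pi1, pi2 = 0.7, 0.3
x_star_shifted = (mu1 + mu2) / 2 + var / (mu2 - mu1) * math.log(pi1 / pi2)
print(round(x_star_shifted, 2))  # ~ 6.27, toward mu2: class 1's region grows
```

Note that $\sigma^2$ enters only through the prior-correction term, which vanishes when $\pi_1 = \pi_2$.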

Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.

Question 15 3 points

In LDA, you raise the prior $\pi_A$ from $0.5$ to $0.7$ (and lower $\pi_B$ to $0.3$) without changing the means or the shared covariance. What happens to the decision boundary between classes $A$ and $B$?

Show answer
Correct answer: A

The boundary equation contains $\log\pi_A - \log\pi_B$. Increasing $\pi_A$ raises $\delta_A$ everywhere; the equality $\delta_A = \delta_B$ now holds at a point closer to $\mu_B$, i.e. the boundary shifts away from $\mu_A$, expanding the region classified as $A$.

B gets the parallel-shift geometry right but botches the consequence: a parallel translation of the boundary in input space necessarily changes the volumes assigned to each class (the more-likely class claims more space). C ignores the $\log\pi_k$ term entirely. D mistakenly imagines a rotation; for fixed $\mu_k, \Sigma$, the boundary is just translated parallel to itself.

Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.

Question 16 3 points ISLP §4 Q5

Mark each statement comparing LDA and QDA as true or false.

Show answer
  1. False — LDA's pooling reduces variance; on small samples with a truly linear boundary, LDA usually wins on test error. The iris example: LDA test 0.17 vs QDA test 0.32.
  2. True — strictly more flexible models cannot increase training misclassification for the fitted maximum-likelihood solution.
  3. True — with more data, QDA can afford the extra $K\cdot p(p+1)/2$ parameters, and the lower-bias model wins.
  4. False — LDA pools across classes for one $\Sigma$ ($p(p+1)/2$ total). QDA keeps $K$ separate $\Sigma_k$'s, $K \cdot p(p+1)/2$ total.

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff. Lecture: L09-classif-3.

Question 17 4 points

For $p = 10$ predictors and $K = 3$ classes, how many free covariance parameters does QDA estimate (counting only the class covariance matrices, not means or priors)?

Show answer
Correct answer: D

Each $\Sigma_k$ is $p\times p$ symmetric → $p(p+1)/2 = 10 \cdot 11/2 = 55$ free parameters. With $K = 3$ class-specific covariances: $3 \cdot 55 = 165$.

A is one $\Sigma$ alone — that's the LDA pooled-covariance count. B forgets symmetry and reports $p^2 = 100$ for one matrix. C forgets symmetry and reports $K \cdot p^2 = 300$ — same mistake compounded.

Atoms: quadratic-discriminant-analysis.

Question 18 3 points

What is the defining modeling assumption of (Gaussian) naive Bayes?

Show answer
Correct answer: C

$f_k(x) = \prod_j f_{kj}(x_j)$ — predictors are independent within each class. With Gaussian marginals plus class-specific variances, this is exactly QDA with diagonal $\Sigma_k$. The win is the parameter count, $O(p)$ vs $O(p^2)$.

A drops the "given class" qualifier — that would be a much stronger (essentially never holding) assumption. B is a uniform-prior assumption used by some other classifiers; naive Bayes still estimates $\pi_k$ from frequencies. D conflates the Bayes classifier abstraction with naive Bayes (same prof-flagged confusion).

Atoms: naive-bayes, diagnostic-vs-sampling-paradigm.

Question 19 3 points

Mark each statement about the diagnostic vs sampling (generative) paradigm as true or false.

Show answer
  1. True — the canonical roster the prof drew up in L07/L09.
  2. True — the Bayes-flip $\Pr(Y=k\mid X) \propto \pi_k f_k(x)$ is the engine for LDA/QDA/naive Bayes.
  3. True — both yield a linear log-odds; LDA leans on a Gaussian $X$ assumption, logistic does not.
  4. False — naive Bayes is a generative (sampling) classifier: it models $\Pr(X \mid Y)$ via a product of marginals.

Atoms: diagnostic-vs-sampling-paradigm, logistic-regression, linear-discriminant-analysis, naive-bayes.

Question 20 4 points Ex4.1

A 7-point training set with two predictors and labels $\{A, A, A, B, B, B, B\}$. We want to predict the class at $x_0 = (1,2)$ using KNN. The prof's exercise asks why $K = 7$ is a poor choice. The best explanation is:

Show answer
Correct answer: A

$K = n$ collapses KNN to "predict the majority class everywhere" — pure underfit, not overfit. Here that majority is $B$ (4 of 7), regardless of where $x_0$ lives.

B inverts the bias-variance direction: large $K$ means more smoothing (high bias / low variance), not overfitting. C is a non-issue here ($4{:}3$ split, no tie). D blames sample size rather than the specific failure of $K = n$; the prof's question pinpoints the locality-collapse, not a generic small-$n$ complaint.
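The $K = n$ collapse is easy to demonstrate (the coordinates below are invented — the exercise only fixes the labels $\{A,A,A,B,B,B,B\}$):

```python
from collections import Counter

# Hypothetical 7-point training set; labels as in the question.
train = [((1, 1), "A"), ((2, 1), "A"), ((1, 3), "A"),
         ((4, 4), "B"), ((5, 2), "B"), ((4, 1), "B"), ((6, 5), "B")]

def knn_predict(x0, k):
    dist2 = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    neighbours = sorted(train, key=lambda t: dist2(t[0], x0))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# K = 7 uses every training point: the global majority B wins everywhere.
print([knn_predict(x0, 7) for x0 in [(1, 2), (0, 0), (10, 10)]])

# A local vote can still differ: near the A cluster, K = 1 predicts A.
print(knn_predict((1, 2), 1))
```

With $K = n$ the "neighbourhood" is the whole training set, so the classifier is constant — the locality that makes KNN useful is gone.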

Atoms: knn-classification, bias-variance-tradeoff.

Question 21 3 points

Mark each statement about KNN classification as true or false.

Show answer
  1. True — $K$ is the flexibility knob; small $K$ chases noise (overfit), large $K$ over-smooths (underfit).
  2. False — wiggly Bayes boundary needs a flexible KNN, i.e. small $K$ to track it. Exercise 4.1c.
  3. True — KNN is not scale-invariant; one feature in millimetres swamps another in years if you don't standardize.
  4. False — using the test set to tune $K$ leaks information; tune via CV on the training set, evaluate on a held-out test set.

Atoms: knn-classification, cross-validation, standardization, bias-variance-tradeoff.

Question 22 4 points ISLP §4 Q4

Observations are uniformly distributed on the unit hypercube $[0,1]^p$. To predict at a new point we use only training observations whose coordinates each lie within a $10\%$ window centred on the test point's coordinate. As $p$ grows from $1$ to $100$, the average fraction of training points actually used shrinks roughly as:

Show answer
Correct answer: B

Each coordinate independently keeps a fraction $0.1$, so the joint event has probability $0.1^p$. At $p=100$ that is $10^{-100}$ — no training points are "local" to a test point. This is the geometric core of the curse of dimensionality and the reason KNN breaks in high $p$.

A treats the dimensions as if they shared a window rather than each contributing a factor. C is a linear model of an exponential phenomenon. D inverts the direction — in high $p$, the bulk of the volume is near the boundary, not near the centre.
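The $0.1^p$ decay is worth seeing numerically:

```python
# Question 22: each of the p coordinates independently keeps a fraction 0.1
# of the data, so the "local" fraction is 0.1 ** p.
for p in [1, 2, 5, 10, 100]:
    print(f"p = {p:3d}   local fraction ~ {0.1 ** p:.1e}")
```

Already at $p = 10$ only one observation in ten billion is local; at $p = 100$ the fraction is $10^{-100}$, i.e. effectively zero for any finite sample.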

Atoms: curse-of-dimensionality, knn-classification.

Question 23 3 points

Mark each statement about high-dimensional classification as true or false.

Show answer
  1. True — distance-based methods like KNN take the biggest hit; parametric/generative methods make stronger assumptions and survive better.
  2. False — standardizing equalises feature scales (which KNN needs anyway), but does not change the geometric concentration of distances.
  3. True — diagonal $\Sigma_k$ replaces $K\cdot p(p+1)/2$ parameters with $\sim 2pK$, dodging the variance explosion.

Atoms: curse-of-dimensionality, knn-classification, naive-bayes.

Question 24 4 points Exam 2025 P5

In the credit-default data, only about $5\%$ of clients default. A trivial classifier that predicts "no default" for every client achieves around $95\%$ accuracy. The most useful conclusion to draw from this is:

Show answer
Correct answer: A

Predicting "no default" always gives $TP = 0, FN = $ all defaulters, $TN = $ all non-defaulters, $FP = 0$. So sensitivity $= 0/P = 0$ and specificity $= N/N = 1$. Accuracy is dominated by the majority class. The prof: "models that are really naive and only predict that it's going to be a zero are already going to do pretty well."

B treats accuracy as definitive — exactly the trap the prof flagged. C swaps sensitivity and specificity: sensitivity is $0$ (no positives caught) and specificity is $1$ (no false alarms). D conflates a constant classifier with a thresholded probabilistic one: the trivial rule has no score to threshold on, so changing a "threshold" cannot recover sensitivity here.

Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.

Question 25 4 points

For an iris-like 2-predictor / 3-class dataset, you observe the following error rates:

$\begin{array}{lcc} \text{Method} & \text{Train error} & \text{Test error} \\\hline \text{LDA} & 0.19 & 0.17 \\ \text{QDA} & 0.17 & 0.32 \end{array}$

Which method should you prefer, and why?

Show answer
Correct answer: D

Test error is the only honest generalisation signal. QDA's tighter training fit but worse test fit is the canonical bias-variance overfit pattern; pooling $\Sigma$ in LDA is paying off on this sample size. The prof's iris example, verbatim.

A treats training error as truth — the classic overfitting trap. B is hand-waving: regularising QDA to look like LDA just becomes LDA. C is wrong about the magnitude — nearly doubling test error is large; you don't need a formal test to pick the better generaliser.

Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff, confusion-matrix. Lecture: L09-classif-3.

Question 26 4 points ISLP §4 Q7

A 1D LDA setup: companies that issued a dividend ($Y = 1$) had $\bar X = 10$, those that did not had $\bar X = 0$, with shared $\hat\sigma^2 = 36$. The prior is $\pi_1 = 0.8$. Assuming Gaussian class-conditionals, what is $\Pr(Y = 1 \mid X = 4)$?

Show answer
Correct answer: C

Bayes' theorem: $\Pr(Y=1\mid X=4) = \dfrac{\pi_1 f_1(4)}{\pi_1 f_1(4) + \pi_0 f_0(4)}$.
$f_1(4) \propto \exp(-(4-10)^2/(2\cdot 36)) = \exp(-0.5)$.
$f_0(4) \propto \exp(-(4-0)^2/(2\cdot 36)) = \exp(-8/36) = \exp(-2/9)$.
Numerator: $0.8 \exp(-0.5) \approx 0.4852$.
Denominator: $0.4852 + 0.2 \exp(-2/9) \approx 0.4852 + 0.1602 = 0.6454$.
Ratio: $\approx 0.752$.

A flips numerator and denominator (computes $\Pr(Y=0\mid X=4) \approx 0.25$). B drops the prior (weights the classes equally) and gets $e^{-0.5}/(e^{-0.5}+e^{-2/9}) \approx 0.43$. D forgets the $f_0$ density entirely (just the prior).
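The same computation as a sketch (the shared $1/\sqrt{2\pi\sigma^2}$ factor cancels in the ratio, but is included for completeness):

```python
import math

# Question 26: plug-in Bayes' theorem with Gaussian class-conditionals.
def gauss_density(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

pi1, pi0 = 0.8, 0.2
num = pi1 * gauss_density(4, 10, 36)          # pi_1 * f_1(4)
den = num + pi0 * gauss_density(4, 0, 36)     # + pi_0 * f_0(4)
print(round(num / den, 3))                    # ~ 0.752
```

Note that even though $X = 4$ is closer to $\mu_0 = 0$'s density peak, the $0.8$ prior drags the posterior firmly toward $Y = 1$.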

Atoms: linear-discriminant-analysis, diagnostic-vs-sampling-paradigm, multivariate-normal.

Question 27 3 points

Mark each statement about LDA's class-conditional density assumption as true or false.

Show answer
  1. True — exactly the LDA assumption set.
  2. True — the standard pooled estimator, an unbiased combination of within-class sample covariances.
  3. False — the boundary's normal direction is $\Sigma^{-1}(\mu_0 - \mu_1)$; the covariance rotates and rescales it. They coincide only when $\Sigma \propto I$.
  4. True — under the model assumptions LDA equals the Bayes classifier with plug-in MLEs; no other procedure does asymptotically better.

Atoms: linear-discriminant-analysis, multivariate-normal, discriminant-score-and-decision-boundary.

Question 28 3 points Ex4.5

Two competing classifiers $p(x)$ and $q(x)$ produce probabilities of disease for the same population. On independent validation sets, $p(x)$ has AUC $0.6$ and $q(x)$ has AUC $0.7$. Which method would you prefer, and why?

Show answer
Correct answer: B

Higher AUC = better discrimination across all thresholds. AUC $= \Pr(\hat p(X^+) > \hat p(X^-))$, so $q(x)$'s $0.7$ beats $p(x)$'s $0.6$.

A misreads AUC as a threshold-dependent metric — its whole point is that it summarises the ROC across all thresholds, so no operating point is needed to compare. C is too strong — comparing AUCs across validation sets is harder than on the same set, but the prof's exercise (Ex4.5c) does ask you to prefer the higher one. D leans on a real concern (PR vs ROC under heavy imbalance) but exaggerates it: AUC is well-defined regardless of prevalence, and the prof's exercise treats the comparison as direct.

Atoms: roc-auc.