Module 04 — Classification
28 questions · 100 points · ~45 min
Question 1
In the classification setup, what is the Bayes error rate?
- A The training error of a fitted Bayes-network classifier on the data at hand, evaluated under the $0/1$ loss.
- B The classification analogue of the residual sum of squares: $\sum_i \mathbf 1(y_i \neq \hat y_i)$ counted on the training set.
- C The expected $0/1$ loss of the classifier $\hat y(x) = \arg\max_k \Pr(Y=k\mid X=x)$ that uses the true posterior, i.e. $1 - E[\max_k \Pr(Y=k\mid X)]$.
- D The misclassification rate of the naive-Bayes classifier when its conditional-independence assumption is satisfied.
Correct answer: C
The Bayes classifier picks $\arg\max_k \Pr(Y=k\mid X)$ using the true posterior; its expected $0/1$ loss is the irreducible floor, the classification analogue of $\sigma^2$ in regression.
A confuses the abstract Bayes classifier with a fitted Bayes-network model. B is the empirical training error, not the population-level rate. D conflates the Bayes classifier with naive Bayes — same word, different objects (one is the optimal abstraction, the other a specific generative model).
Atoms: classification-setup, diagnostic-vs-sampling-paradigm. Lecture: L07-classif-1.
Question 2
4 points
ISLP §4 Q9
On average, what fraction of credit-card holders with odds $0.37$ of defaulting will in fact default?
- A $0.37$
- B $0.63$
- C $0.59$
- D $0.27$
Correct answer: D
Invert odds → probability: $p = \text{odds}/(1+\text{odds}) = 0.37/1.37 \approx 0.27$.
A reports the odds as if it were already a probability (the most common slip; odds and probability are different scales). B treats the odds as a probability and reports its complement, $1 - 0.37 = 0.63$. C treats $0.37$ as a log-odds and computes $e^{0.37}/(1+e^{0.37}) \approx 0.59$, another scale mix-up.
Atoms: odds-and-log-odds. Lecture: L27-summary ("This is the kind of question I would ask. It's simple. You calculate it.").
Question 3
4 points
Ex4.3
An individual has a $16\%$ chance of defaulting on her credit-card payment. What are the odds that she will default?
- A $0.16$
- B $0.84$
- C $0.19$
- D $5.25$
Correct answer: C
Odds $= p/(1-p) = 0.16/0.84 \approx 0.19$.
A reports the probability itself as the odds (no conversion). B reports $1-p$. D inverts the ratio: $0.84/0.16 \approx 5.25$ — the odds against rather than the odds for the event.
Atoms: odds-and-log-odds.
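A quick numeric check of the two conversions in Questions 2 and 3 (a minimal Python sketch; the helper names are ours, not course code):

```python
# Odds <-> probability conversions used in Questions 2 and 3.

def odds_to_prob(odds):
    return odds / (1 + odds)

def prob_to_odds(p):
    return p / (1 - p)

print(odds_to_prob(0.37))   # ~0.270  (Question 2, answer D)
print(prob_to_odds(0.16))   # ~0.190  (Question 3, answer C)
```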
Question 4
4 points
Ex4.4
A logistic model for $Y = I(\text{student gets an A})$ has $\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$ (study hours), $\hat\beta_2 = 1$ (GPA). For a student with $40$ hours and GPA $3.5$, what is $\hat p$?
- A $\hat p \approx 0.38$
- B $\hat p \approx 0.5$
- C $\hat p \approx 0.62$
- D $\hat p \approx 0.73$
Correct answer: A
Linear predictor: $\hat\eta = -6 + 0.05\cdot 40 + 1\cdot 3.5 = -0.5$. Sigmoid: $\hat p = e^{-0.5}/(1+e^{-0.5}) = 1/(1+e^{0.5}) \approx 0.378$.
B corresponds to $\hat\eta = 0$, i.e. an arithmetic slip when summing the linear predictor. C reports $\sigma(0.5) \approx 0.62$, a sign error on $\hat\eta$. D reports $\sigma(1) \approx 0.73$, e.g. feeding the GPA coefficient alone through the sigmoid.
Atoms: logistic-regression.
Question 5
4 points
Ex4.4
Same model as Question 4 ($\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$, $\hat\beta_2 = 1$). At GPA $3.5$, how many study hours give $\hat p = 0.5$?
- A $40$ hours
- B $90$ hours
- C $55$ hours
- D $50$ hours
Correct answer: D
$\hat p = 0.5$ iff $\hat\eta = 0$. Solve $-6 + 0.05 x + 1\cdot 3.5 = 0$ → $0.05 x = 2.5$ → $x = 50$.
A is the value used in Q4 (it gave $\hat p \approx 0.38$, not $0.5$). B and C mishandle the constants: B corresponds to solving $0.05x = 4.5$ and C to $0.05x = 2.75$, instead of the correct $0.05x = 2.5$.
Atoms: logistic-regression, odds-and-log-odds.
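The arithmetic in Questions 4 and 5 is easy to sanity-check in a few lines (illustrative sketch, not the official solution code):

```python
import math

# Logistic-regression arithmetic for Questions 4 and 5.
b0, b_hours, b_gpa = -6.0, 0.05, 1.0

def p_hat(hours, gpa):
    eta = b0 + b_hours * hours + b_gpa * gpa   # linear predictor
    return 1 / (1 + math.exp(-eta))            # sigmoid

print(p_hat(40, 3.5))                          # ~0.378  (Q4, answer A)

# p_hat = 0.5 iff eta = 0; solve for hours at GPA 3.5:
hours_for_half = -(b0 + b_gpa * 3.5) / b_hours
print(hours_for_half)                          # 50.0    (Q5, answer D)
```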
Question 6
4 points
Exam 2025 P5
A logistic model for default uses SEX (1 = male, 2 = female), PAY_0, and the interaction SEX:PAY_0. Output gives $\hat\beta_{\text{PAY}_0} = 0.80$ and $\hat\beta_{\text{SEX:PAY}_0} = 0.15$. Holding other covariates fixed, by what factor do the odds of default change for one additional month of delayed payment, separately for males and females?
- A Both groups: $e^{0.80} \approx 2.23$.
- B Males: $e^{0.80} \approx 2.23$; females: $e^{0.80 + 0.15} \approx 2.59$.
- C Males: $e^{0.80} \approx 2.23$; females: $e^{0.15} \approx 1.16$.
- D Both groups: $e^{0.80 + 0.15} \approx 2.59$.
Correct answer: B
Treating SEX as a factor with male (SEX $= 1$) as the reference level, one extra month of delay adds $\beta_{\text{PAY}_0}$ to the linear predictor for males and $\beta_{\text{PAY}_0} + \beta_{\text{SEX:PAY}_0}$ for females. On the odds scale that is $e^{0.80} \approx 2.23$ for males and $e^{0.95} \approx 2.59$ for females. The exam key reports $2.22$ and $2.57$, presumably from less-rounded coefficients.
A ignores the interaction entirely — the canonical interaction trap the prof flagged in L27. C reports the interaction coefficient alone for females, forgetting that the main effect is still active. D applies the female factor to both groups.
Atoms: odds-and-log-odds, logistic-regression. Lecture: L27-summary ("we need to be able to do it for the men and the women").
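A sketch of the two odds factors, assuming the factor coding described above (male as the reference level):

```python
import math

# Question 6: per-unit change in the odds of default, by sex.
b_pay = 0.80       # main effect of PAY_0
b_sex_pay = 0.15   # interaction, applied to the non-reference (female) level

odds_factor_male = math.exp(b_pay)                # ~2.23
odds_factor_female = math.exp(b_pay + b_sex_pay)  # ~2.59
print(odds_factor_male, odds_factor_female)
```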
Question 7
Mark each statement about logistic regression as true or false.
- False — $\beta_j$ is the change in log-odds; the corresponding probability change is non-constant (depends on where on the sigmoid you are). Reporting it as an additive probability change is the canonical pitfall.
- False — the score equations are non-linear in $\boldsymbol\beta$; fit by Newton-Raphson / Fisher scoring (no closed form).
- True — the linear predictor sign flips with the encoding; every $\beta_j$ flips. State your encoding when interpreting output.
- True — the prof: "the collinearity problem we talked about a minute ago, that can happen here, and then this thing's no longer having a single maximum and it gets weird."
Atoms: logistic-regression, odds-and-log-odds. Lecture: L08-classif-2.
Question 8
4 points
Exam 2025 P5
A KNN classifier with $K=35$ on credit-default data gives the test confusion matrix below (rows = true, columns = predicted):
$\begin{array}{c|cc} & \hat y = 0 & \hat y = 1 \\\hline y = 0 & 1820 & 60 \\ y = 1 & 380 & 240 \end{array}$
What are the sensitivity and specificity (positive class = default = 1)?
- A Sensitivity $\approx 0.39$, specificity $\approx 0.97$.
- B Sensitivity $\approx 0.97$, specificity $\approx 0.39$.
- C Sensitivity $\approx 0.80$, specificity $\approx 0.83$.
- D Sensitivity $\approx 0.39$, specificity $\approx 0.86$.
Correct answer: A
Sensitivity $= TP/(TP+FN) = 240/(240+380) \approx 0.387$. Specificity $= TN/(TN+FP) = 1820/(1820+60) \approx 0.968$.
B swaps the two, the classic "sniffs out positives / spares the negatives" mix-up. C reports precision ($240/300 = 0.80$) and negative predictive value ($1820/2200 \approx 0.83$), the column-wise rates instead of the row-wise ones. D keeps the correct sensitivity but mis-computes the specificity (the denominator should be $TN + FP = 1880$).
Atoms: confusion-matrix, sensitivity-specificity, knn-classification. Lecture: L27-summary.
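The row-wise vs column-wise distinction is easy to verify directly from the four cell counts (sketch; variable names are ours):

```python
# Question 8 confusion matrix (rows = true class, columns = predicted, positive = default = 1).
TN, FP = 1820, 60    # true class 0
FN, TP = 380, 240    # true class 1

sensitivity = TP / (TP + FN)   # ~0.387
specificity = TN / (TN + FP)   # ~0.968
precision   = TP / (TP + FP)   # ~0.800 (what option C reports)
print(sensitivity, specificity, precision)
```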
Question 9
Mark each statement about sensitivity and specificity as true or false.
- True — by definition. Mnemonic: sensitivity "Sniffs out positives".
- True — more positive predictions → more $TP$ (sens ↑) but also more $FP$ (spec ↓).
- False — that's precision (positive predictive value). Sensitivity uses $FN$ in the denominator, not $FP$.
- True — exactly the prof's class-imbalance vignette: high accuracy hides that the model never catches a positive.
Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.
Question 10
4 points
Exam 2025 P5
Which statement most accurately describes a ROC curve and AUC for a binary classifier producing scores $\hat p(x)$?
- A The $y$-axis is sensitivity, the $x$-axis is specificity, and the curve is traced by sweeping the classification threshold; AUC near $1$ indicates a good classifier.
- B The $y$-axis is sensitivity, the $x$-axis is $1-$specificity; the curve sweeps the threshold; AUC equals the probability that a random positive scores higher than a random negative.
- C The curve plots overall accuracy against the threshold value; AUC is the area under that curve and equals exactly $1$ when the classifier is perfect.
- D The curve plots precision against recall as the threshold sweeps; an AUC near $0$ means the classifier carries essentially no useful information about class.
Correct answer: B
The ROC plots TPR (= sensitivity) on the $y$-axis against FPR (= $1-$specificity) on the $x$-axis as the threshold sweeps from $1$ down to $0$. AUC is the probability $\Pr(\hat p(X^+) > \hat p(X^-))$.
A swaps the $x$-axis convention: with specificity on $x$ the diagonal would slope the wrong way (some software uses this; the prof's slide deck and ISLP Fig. 4.8 do not). C reports an accuracy-vs-threshold curve, which is a different object. D describes a precision-recall curve (information-retrieval style), and the AUC-near-zero claim is wrong (AUC $0.5$ is useless; AUC $0$ is perfectly inverted, hence informative).
Atoms: roc-auc, sensitivity-specificity.
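A tiny sketch of the ranking interpretation of AUC, on made-up scores (the numbers are illustrative, not exam data):

```python
# AUC as Pr(score of a random positive > score of a random negative).
pos_scores = [0.9, 0.7, 0.4]        # predicted p-hat for true positives
neg_scores = [0.8, 0.3, 0.2, 0.1]   # predicted p-hat for true negatives

pairs = [(p, n) for p in pos_scores for n in neg_scores]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)   # ~0.833; the same value the trapezoidal ROC area (e.g. sklearn.metrics.roc_auc_score) gives
```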
Question 11
Mark each statement about ROC curves and AUC as true or false.
- True — the diagonal of the ROC plot.
- False — AUC $0.2$ means the scoring is informative but inverted; flipping all predictions gives AUC $0.8$. "Below the diagonal → invert your classifier."
- True — AUC summarizes the curve over all thresholds; you can pick an operating point afterwards.
Atoms: roc-auc.
Question 12
In LDA, why is the discriminant $\delta_k(x)$ linear in $x$, while in QDA it is quadratic?
- A In LDA the priors $\pi_k$ are required to be equal across classes, while in QDA the priors differ, and this difference contributes the quadratic term in the discriminant.
- B LDA assumes the class-conditionals are Gaussian, while QDA drops this assumption and replaces $f_k(x)$ with a kernel density estimate, which makes the discriminant locally quadratic in $x$.
- C LDA writes the discriminant in terms of $\Sigma$ while QDA writes it in terms of $\Sigma^{-1}$, and the squared inverse turns the linear cross-term into a quadratic form in $x$.
- D Pooling $\Sigma$ in LDA makes the term $-\tfrac12 x^\top\Sigma^{-1}x$ independent of $k$ and it cancels in $\arg\max_k$; QDA keeps a class-specific $\Sigma_k$, so $-\tfrac12 x^\top\Sigma_k^{-1}x$ depends on $k$ and survives.
Correct answer: D
Take the log of $\pi_k f_k(x)$. The exponent contains $-\tfrac12 x^\top\Sigma_k^{-1}x$. With pooled $\Sigma$ this term has no $k$ — drop it. With class-specific $\Sigma_k$ the term depends on $k$ and stays, making $\delta_k(x)$ quadratic. The prof flagged "where does the quadratic come from?" as a typical exam question.
A is irrelevant — both methods take general priors. B misstates QDA: QDA also assumes Gaussian class-conditionals, just with class-specific $\Sigma_k$. C garbles the algebra (both methods use $\Sigma_k^{-1}$ in the discriminant; squaring the inverse is not the mechanism).
Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, discriminant-score-and-decision-boundary. Lecture: L09-classif-3.
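The contrast is easiest to see with the two discriminants written out (standard Gaussian-discriminant formulas, in the book's notation):
$\delta_k^{\mathrm{LDA}}(x) = x^\top\Sigma^{-1}\mu_k - \tfrac12\mu_k^\top\Sigma^{-1}\mu_k + \log\pi_k$
$\delta_k^{\mathrm{QDA}}(x) = -\tfrac12 x^\top\Sigma_k^{-1}x + x^\top\Sigma_k^{-1}\mu_k - \tfrac12\mu_k^\top\Sigma_k^{-1}\mu_k - \tfrac12\log|\Sigma_k| + \log\pi_k$
Only the QDA line carries a term quadratic in $x$, and only because $\Sigma_k$ is class-specific.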
Question 13
Two-class LDA with $\mu_A = (1,1)^\top$, $\mu_B = (3,3)^\top$, shared covariance $\Sigma = 2I$, equal priors $\pi_A = \pi_B = 0.5$. The decision boundary is:
- A $x_1 + x_2 = 2$
- B $x_1 + x_2 = 4$
- C $x_1 - x_2 = 0$
- D $x_1^2 + x_2^2 = 8$
Correct answer: B
$\Sigma^{-1} = \tfrac12 I$. Equating $\delta_A(x) = \delta_B(x)$: cross term coefficient $\Sigma^{-1}(\mu_A - \mu_B) = \tfrac12(-2,-2) = (-1,-1)$. Intercepts $-\tfrac12\mu_A^\top\Sigma^{-1}\mu_A = -\tfrac12$ and $+\tfrac12\mu_B^\top\Sigma^{-1}\mu_B = +\tfrac92$ sum to $4$. With equal priors, $\log\pi_A - \log\pi_B = 0$. So $-x_1 - x_2 + 4 = 0$, i.e. $x_1 + x_2 = 4$.
A has the right orientation but passes through $\mu_A = (1,1)$ instead of the midpoint $(2,2)$. C is the line through the two means themselves (direction $\mu_B - \mu_A$), not their perpendicular bisector. D imagines a quadratic boundary, which would only arise under QDA.
Atoms: discriminant-score-and-decision-boundary, linear-discriminant-analysis. Lecture: L09-classif-3 ("I didn't ask for a line. I just equated the two things and then I solved for them and then it became a line.").
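The boundary coefficients can also be recovered numerically in a couple of lines (sketch, assuming numpy is available):

```python
import numpy as np

# Question 13: delta_A(x) - delta_B(x) = w . x + c; the boundary is where this is zero.
mu_A, mu_B = np.array([1.0, 1.0]), np.array([3.0, 3.0])
Sigma_inv = np.linalg.inv(2 * np.eye(2))

w = Sigma_inv @ (mu_A - mu_B)                                        # (-1, -1)
c = -0.5 * mu_A @ Sigma_inv @ mu_A + 0.5 * mu_B @ Sigma_inv @ mu_B   # +4 (equal priors, so no log-prior term)
print(w, c)   # boundary: -x1 - x2 + 4 = 0, i.e. x1 + x2 = 4
```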
Question 14
In 1D LDA with two classes, $\mu_1 = 2$, $\mu_2 = 8$, shared variance $\sigma^2 = 9$, and equal priors. The decision threshold $x^\star$ where $\delta_1(x) = \delta_2(x)$ is:
- A $x^\star = 5$
- B $x^\star = (\mu_1 + \mu_2)/2 - \tfrac{\sigma^2}{\mu_2 - \mu_1} \approx 3.5$
- C $x^\star = \tfrac{8 \cdot 2 + 2 \cdot 8}{9} \approx 3.6$
- D $x^\star = \sqrt{2 \cdot 8} \approx 4$
Correct answer: A
Solving $x\mu_1/\sigma^2 - \mu_1^2/(2\sigma^2) = x\mu_2/\sigma^2 - \mu_2^2/(2\sigma^2)$ with equal priors gives $x^\star = (\mu_1 + \mu_2)/2 = 5$ — the midpoint, independent of $\sigma^2$.
B, C, D smuggle $\sigma^2$ into the threshold. B looks like the unequal-prior correction term ($\sigma^2/(\mu_2 - \mu_1) \cdot \log(\pi_2/\pi_1)$) but with the prior log-ratio replaced by $1$, so it shifts the midpoint for no reason. With equal priors and equal variances the variance terms cancel; only the means matter.
Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.
Question 15
In LDA, you raise the prior $\pi_A$ from $0.5$ to $0.7$ (and lower $\pi_B$ to $0.3$) without changing the means or the shared covariance. What happens to the decision boundary between classes $A$ and $B$?
- A It shifts in space away from $\mu_A$ and toward $\mu_B$, so more of the input space is classified as class $A$.
- B Its intercept moves by $\log(\pi_B/\pi_A)$ but its orientation is unchanged, so equal volumes of the input space are still classified as $A$ and $B$ — only the location of the divide drifts symmetrically.
- C It is unchanged, since the LDA boundary depends only on the class means $\mu_k$ and the pooled covariance $\Sigma$.
- D It rotates around the midpoint $(\mu_A + \mu_B)/2$ by an angle proportional to $\log(\pi_A/\pi_B)$.
Correct answer: A
The boundary equation contains $\log\pi_A - \log\pi_B$. Increasing $\pi_A$ raises $\delta_A$ everywhere; the equality $\delta_A = \delta_B$ now holds at a point closer to $\mu_B$, i.e. the boundary shifts away from $\mu_A$, expanding the region classified as $A$.
B gets the parallel-shift geometry right but botches the consequence: a parallel translation of the boundary in input space necessarily changes the volumes assigned to each class (the more-likely class claims more space). C ignores the $\log\pi_k$ term entirely. D mistakenly imagines a rotation; for fixed $\mu_k, \Sigma$, the boundary is just translated parallel to itself.
Atoms: linear-discriminant-analysis, discriminant-score-and-decision-boundary.
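A small sketch tying Questions 14 and 15 together: the 1-D threshold formula, with and without equal priors (illustrative code, not from the course):

```python
import math

# 1-D LDA threshold: solve delta_1(x) = delta_2(x) for x.
mu1, mu2, sigma2 = 2.0, 8.0, 9.0

def threshold(pi1, pi2):
    return (mu1 + mu2) / 2 + sigma2 / (mu2 - mu1) * math.log(pi1 / pi2)

print(threshold(0.5, 0.5))   # 5.0, the midpoint; sigma^2 drops out (Question 14)
print(threshold(0.7, 0.3))   # ~6.27; raising pi_1 pushes the boundary toward mu_2 (the Question 15 mechanism)
```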
Question 16
3 points
ISLP §4 Q5
Mark each statement comparing LDA and QDA as true or false.
- False — LDA's pooling reduces variance; on small samples with a truly linear boundary, LDA usually wins on test error. The iris example: LDA test 0.17 vs QDA test 0.32.
- True — QDA nests LDA, so the more flexible maximum-likelihood fit will generally achieve training misclassification no higher than LDA's.
- True — with more data, QDA can afford its $K\cdot p(p+1)/2$ covariance parameters, and the lower-bias model wins.
- False — LDA pools across classes for one $\Sigma$ ($p(p+1)/2$ total). QDA keeps $K$ separate $\Sigma_k$'s, $K \cdot p(p+1)/2$ total.
Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff. Lecture: L09-classif-3.
Question 17
For $p = 10$ predictors and $K = 3$ classes, how many free covariance parameters does QDA estimate (counting only the class covariance matrices, not means or priors)?
- A $55$
- B $100$
- C $300$
- D $165$
Correct answer: D
Each $\Sigma_k$ is $p\times p$ symmetric → $p(p+1)/2 = 10 \cdot 11/2 = 55$ free parameters. With $K = 3$ class-specific covariances: $3 \cdot 55 = 165$.
A is one $\Sigma$ alone — that's the LDA pooled-covariance count. B forgets symmetry and reports $p^2 = 100$ for one matrix. C forgets symmetry and reports $K \cdot p^2 = 300$ — same mistake compounded.
Atoms: quadratic-discriminant-analysis.
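The counting behind Questions 16, 17 and 23 in one place (sketch):

```python
# Covariance-parameter counts for LDA, QDA, and (Gaussian) naive Bayes.
p, K = 10, 3

lda_cov = p * (p + 1) // 2        # one pooled Sigma:                 55
qda_cov = K * p * (p + 1) // 2    # K full class-specific Sigma_k:    165
nb_cov  = K * p                   # K diagonal Sigma_k (naive Bayes): 30
print(lda_cov, qda_cov, nb_cov)
```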
Question 18
What is the defining modeling assumption of (Gaussian) naive Bayes?
- A The predictors $X_1, \dots, X_p$ are unconditionally independent of each other in the population, regardless of the class label.
- B The class prior $\pi_k$ is assumed uniform across the $K$ classes, so that all classification information comes from the class-conditional density.
- C The predictors $X_1, \dots, X_p$ are conditionally independent given the class; in the Gaussian case, this is equivalent to QDA with each $\Sigma_k$ restricted to be diagonal.
- D The Bayes posterior is computed using a directed-graphical Bayesian network rather than via the Bayes-classifier $\arg\max_k$ abstraction.
Correct answer: C
$f_k(x) = \prod_j f_{kj}(x_j)$ — predictors are independent within each class. With Gaussian marginals plus class-specific variances, this is exactly QDA with diagonal $\Sigma_k$. The win is the parameter count, $O(p)$ vs $O(p^2)$.
A drops the "given class" qualifier — that would be a much stronger (essentially never holding) assumption. B is a uniform-prior assumption used by some other classifiers; naive Bayes still estimates $\pi_k$ from frequencies. D conflates the Bayes classifier abstraction with naive Bayes (same prof-flagged confusion).
Atoms: naive-bayes, diagnostic-vs-sampling-paradigm.
Question 19
Mark each statement about the diagnostic vs sampling (generative) paradigm as true or false.
- True — the canonical roster the prof drew up in L07/L09.
- True — the Bayes-flip $\Pr(Y=k\mid X) \propto \pi_k f_k(x)$ is the engine for LDA/QDA/naive Bayes.
- True — both yield a linear log-odds; LDA leans on a Gaussian $X$ assumption, logistic does not.
- False — naive Bayes is a generative (sampling) classifier: it models $\Pr(X \mid Y)$ via a product of marginals.
Atoms: diagnostic-vs-sampling-paradigm, logistic-regression, linear-discriminant-analysis, naive-bayes.
Question 20
4 points
Ex4.1
A 7-point training set with two predictors and labels $\{A, A, A, B, B, B, B\}$. We want to predict the class at $x_0 = (1,2)$ using KNN. The prof's exercise asks why $K = 7$ is a poor choice. The best explanation is:
- A With $K = 7$ on a $7$-point training set, every training point is a neighbor; the prediction equals the global majority class and entirely ignores the location of $x_0$.
- B With $K = 7$ the classifier overfits drastically because it uses too many neighbors and chases noise in the training labels.
- C A choice of $K = 7$ on a $7$-point training set produces ties and leaves the majority-vote prediction mathematically undefined here.
- D $K = 7$ is fine in principle; the actual issue is that with only $7$ training points the local-neighbourhood estimate of $\Pr(Y\mid X)$ has too high variance to use any KNN at all here.
Correct answer: A
$K = n$ collapses KNN to "predict the majority class everywhere" — pure underfit, not overfit. Here that majority is $B$ (4 of 7), regardless of where $x_0$ lives.
B inverts the bias-variance direction: large $K$ means more smoothing (high bias / low variance), not overfitting. C is a non-issue here ($4{:}3$ split, no tie). D blames sample size rather than the specific failure of $K = n$; the prof's question pinpoints the locality-collapse, not a generic small-$n$ complaint.
Atoms: knn-classification, bias-variance-tradeoff.
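A minimal sketch of the $K = n$ collapse (the training labels are the ones in the question; the code is ours):

```python
from collections import Counter

# Question 20: with K = n, every training point is a neighbour of any query point,
# so the vote is just the global majority class.
labels = ["A", "A", "A", "B", "B", "B", "B"]   # the 7 training labels
K = len(labels)                                # K = 7 = n

neighbors = labels                             # all points are "neighbours" of x0 = (1, 2)
prediction = Counter(neighbors).most_common(1)[0][0]
print(prediction)                              # 'B', regardless of where x0 lies
```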
Question 21
Mark each statement about KNN classification as true or false.
- True — $K$ is the flexibility knob; small $K$ chases noise (overfit), large $K$ over-smooths (underfit).
- False — wiggly Bayes boundary needs a flexible KNN, i.e. small $K$ to track it. Exercise 4.1c.
- True — KNN is not scale-invariant; one feature in millimetres swamps another in years if you don't standardize.
- False — using the test set to tune $K$ leaks information; tune via CV on the training set, evaluate on a held-out test set.
Atoms: knn-classification, cross-validation, standardization, bias-variance-tradeoff.
Question 22
4 points
ISLP §4 Q4
Observations are uniformly distributed on the unit hypercube $[0,1]^p$. To predict at a new point we use only training observations whose coordinates each lie within a $10\%$ window centred on the test point's coordinate. As $p$ grows from $1$ to $100$, the average fraction of training points actually used shrinks roughly as:
- A $0.1$ for any $p$, since each coordinate independently keeps the same $10\%$ window around the test point.
- B $0.1^p$ — exponentially toward zero, so at $p = 100$ essentially no training points qualify as "near".
- C $0.1 \cdot p$ — linearly growing, so the curse of dimensionality only kicks in at very large $p$.
- D $1 - 0.1^p$ — most points are nearby in high dimensions because the unit-cube volume concentrates near its centre.
Correct answer: B
Each coordinate independently keeps a fraction $0.1$, so the joint event has probability $0.1^p$. At $p=100$ that is $10^{-100}$ — no training points are "local" to a test point. This is the geometric core of the curse of dimensionality and the reason KNN breaks in high $p$.
A treats the dimensions as if they shared a window rather than each contributing a factor. C is a linear model of an exponential phenomenon. D inverts the direction — in high $p$, the bulk of the volume is near the boundary, not near the centre.
Atoms: curse-of-dimensionality, knn-classification.
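The shrinking fraction is a one-line computation (sketch):

```python
# Question 22: fraction of training points kept when each of p coordinates must
# lie in a 10% window around the test point.
for p in (1, 2, 5, 10, 100):
    print(p, 0.1 ** p)   # 0.1, 0.01, 1e-05, 1e-10, ~1e-100
```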
Question 23
Mark each statement about high-dimensional classification as true or false.
- True — distance-based methods like KNN take the biggest hit; parametric/generative methods make stronger assumptions and survive better.
- False — standardizing equalises feature scales (which KNN needs anyway), but does not change the geometric concentration of distances.
- True — diagonal $\Sigma_k$ cuts the covariance count from $K\cdot p(p+1)/2$ to $Kp$ (roughly $2pK$ parameters in total once the means are included), dodging the variance explosion.
Atoms: curse-of-dimensionality, knn-classification, naive-bayes.
Question 24
4 points
Exam 2025 P5
In the credit-default data, only about $5\%$ of clients default. A trivial classifier that predicts "no default" for every client achieves around $95\%$ accuracy. The most useful conclusion to draw from this is:
- A Accuracy is misleading under class imbalance; sensitivity is $0$ and specificity is $1$, so the classifier is useless for catching defaulters.
- B Accuracy is the right summary metric here; any candidate classifier whose accuracy is below this $95\%$ baseline should simply be discarded.
- C The trivial classifier has good sensitivity but poor specificity, since it never raises false alarms but also catches every actual default.
- D The trivial classifier's $95\%$ accuracy reflects a genuinely high specificity at the chosen threshold; on this dataset the operating point is simply too aggressive on negatives, and dialling the threshold down to $0.2$ would recover useful sensitivity from the same model.
Correct answer: A
Predicting "no default" always gives $TP = 0, FN = $ all defaulters, $TN = $ all non-defaulters, $FP = 0$. So sensitivity $= 0/P = 0$ and specificity $= N/N = 1$. Accuracy is dominated by the majority class. The prof: "models that are really naive and only predict that it's going to be a zero are already going to do pretty well."
B treats accuracy as definitive — exactly the trap the prof flagged. C swaps sensitivity and specificity: sensitivity is $0$ (no positives caught) and specificity is $1$ (no false alarms). D conflates a constant classifier with a thresholded probabilistic one: the trivial rule has no score to threshold on, so changing a "threshold" cannot recover sensitivity here.
Atoms: sensitivity-specificity, confusion-matrix. Lecture: L27-summary.
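The confusion-matrix entries of the always-negative classifier, on hypothetical counts with a $5\%$ default rate (sketch, not exam data):

```python
# Question 24: "always predict no default" on a population with ~5% defaulters.
n, default_rate = 10_000, 0.05
P = int(n * default_rate)    # 500 actual defaulters
N = n - P                    # 9500 non-defaulters

TP, FP, FN, TN = 0, 0, P, N  # the positive class is never predicted

accuracy    = (TP + TN) / n        # 0.95
sensitivity = TP / (TP + FN)       # 0.0
specificity = TN / (TN + FP)       # 1.0
print(accuracy, sensitivity, specificity)
```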
Question 25
For an iris-like 2-predictor / 3-class dataset, you observe the following error rates:
$\begin{array}{lcc} \text{Method} & \text{Train error} & \text{Test error} \\\hline \text{LDA} & 0.19 & 0.17 \\ \text{QDA} & 0.17 & 0.32 \end{array}$
Which method should you prefer, and why?
- A QDA: it has the lowest training error, which is the most reliable indicator of generalisation.
- B QDA: it is strictly more flexible, so any apparent training-error advantage will translate to test-error advantage with enough regularisation.
- C Either; the difference between $0.17$ and $0.32$ is too small to matter without a formal hypothesis test.
- D LDA: its test error is lower, suggesting the extra parameters in QDA hurt more (variance) than they help (bias) on this sample size.
Correct answer: D
Test error is the only honest generalisation signal. QDA's tighter training fit but worse test fit is the canonical bias-variance overfit pattern; pooling $\Sigma$ in LDA is paying off on this sample size. The prof's iris example, verbatim.
A treats training error as truth — the classic overfitting trap. B is hand-waving: regularising QDA to look like LDA just becomes LDA. C is wrong about the magnitude — nearly doubling test error is large; you don't need a formal test to pick the better generaliser.
Atoms: quadratic-discriminant-analysis, linear-discriminant-analysis, bias-variance-tradeoff, confusion-matrix. Lecture: L09-classif-3.
Question 26
4 points
ISLP §4 Q7
A 1D LDA setup: companies that issued a dividend ($Y = 1$) had $\bar X = 10$, those that did not had $\bar X = 0$, with shared $\hat\sigma^2 = 36$. The prior is $\pi_1 = 0.8$. Assuming Gaussian class-conditionals, what is $\Pr(Y = 1 \mid X = 4)$?
- A $\approx 0.20$
- B $\approx 0.50$
- C $\approx 0.75$
- D $\approx 0.94$
Correct answer: C
Bayes' theorem: $\Pr(Y=1\mid X=4) = \dfrac{\pi_1 f_1(4)}{\pi_1 f_1(4) + \pi_0 f_0(4)}$.
$f_1(4) \propto \exp(-(4-10)^2/(2\cdot 36)) = \exp(-0.5)$.
$f_0(4) \propto \exp(-(4-0)^2/(2\cdot 36)) = \exp(-8/36) = \exp(-2/9)$.
Numerator: $0.8 \exp(-0.5) \approx 0.4852$.
Denominator: $0.4852 + 0.2 \exp(-2/9) \approx 0.4852 + 0.1602 = 0.6454$.
Ratio: $\approx 0.752$.
A is essentially the prior of the no-dividend class, $\pi_0 = 0.2$, reported as a posterior. B drops the prior weighting; with equal priors the posterior would be $\approx 0.43$, roughly a coin flip. D lets the prior dominate, understating how much more likely $X = 4$ is under the no-dividend class.
Atoms: linear-discriminant-analysis, diagnostic-vs-sampling-paradigm, multivariate-normal.
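The same computation as a short script (sketch; unnormalised densities would do, since the Gaussian constant cancels):

```python
import math

# Question 26: Bayes' theorem with Gaussian class-conditionals.
mu1, mu0, sigma2, pi1 = 10.0, 0.0, 36.0, 0.8

def normal_pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

x = 4.0
num = pi1 * normal_pdf(x, mu1)
den = num + (1 - pi1) * normal_pdf(x, mu0)
print(num / den)   # ~0.752  (answer C)
```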
Question 27
Mark each statement about LDA's class-conditional density assumption as true or false.
- True — exactly the LDA assumption set.
- True — the standard pooled estimator, an unbiased combination of within-class sample covariances.
- False — the boundary's normal direction is $\Sigma^{-1}(\mu_0 - \mu_1)$; the covariance rotates and rescales it. They coincide only when $\Sigma \propto I$.
- True — under the model assumptions LDA equals the Bayes classifier with plug-in MLEs; no other procedure does asymptotically better.
Atoms: linear-discriminant-analysis, multivariate-normal, discriminant-score-and-decision-boundary.
Question 28
3 points
Ex4.5
Two competing classifiers $p(x)$ and $q(x)$ produce probabilities of disease for the same population. On independent validation sets, $p(x)$ has AUC $0.6$ and $q(x)$ has AUC $0.7$. Which method would you prefer, and why?
- A $q(x)$, but only after the operating threshold is fixed: an AUC gap of $0.1$ is meaningless until you read off the corresponding sensitivities at one common threshold.
- B $q(x)$: higher AUC means a larger probability that a random positive scores higher than a random negative, summarising better trade-offs across all thresholds.
- C Cannot tell: AUCs computed on independent validation sets are not comparable across classifiers in any meaningful way.
- D Cannot tell: AUC is unreliable when the validation set has class imbalance, so without the prevalence we cannot know which classifier discriminates better.
Correct answer: B
Higher AUC = better discrimination across all thresholds. AUC $= \Pr(\hat p(X^+) > \hat p(X^-))$, so $q(x)$'s $0.7$ beats $p(x)$'s $0.6$.
A misreads AUC as a threshold-dependent metric — its whole point is that it summarises the ROC across all thresholds, so no operating point is needed to compare. C is too strong — comparing AUCs across validation sets is harder than on the same set, but the prof's exercise (Ex4.5c) does ask you to prefer the higher one. D leans on a real concern (PR vs ROC under heavy imbalance) but exaggerates it: AUC is well-defined regardless of prevalence, and the prof's exercise treats the comparison as direct.
Atoms: roc-auc.