Module 04: Classification — Book delta

ISLP §4.3–§4.5 is a remarkably complete treatment of logistic regression, LDA, QDA, naive Bayes, and AUC: the model assumptions, the discriminant-score formulas (eqs. 4.18, 4.24, 4.28), and the bias-variance argument for choosing among them are all in the book. The deltas in this module are not large doctrinal pieces — they are concrete formulas and worked artifacts that the prof (or the slide deck he taught from) wrote down explicitly, and that ISLP either omits, sketches but never closes, or hides inside a verbal aside.

The biggest single delta is the MLE machinery for logistic regression: ISLP gives the product-form likelihood (eq. 4.5) and then says “the mathematical details of maximum likelihood are beyond the scope of this book,” whereas Benjamin’s slide deck writes the log-likelihood, takes derivatives, and names Newton–Raphson. Second-biggest: the explicit multivariate pooled covariance estimator (the 1D version is ISLP eq. 4.20; the $p > 1$ formula is only described verbally). Third: the prof’s two flagged decision-boundary derivation patterns (the 1D-LDA boundary in the form he asked for in R, and the 2D worked example with given $\mu_1$, $\mu_2$, $\Sigma$, which he flagged as the exam template for module 4).

Everything below is in scope per docs/scope.md. Out-of-scope items (multinomial logistic regression beyond the brief mention, probit/cloglog link functions, Fisher’s eigenvalue derivation of LDA, asymmetric ROC analysis for imbalanced classes) get no atom here.


1. Logistic regression: the MLE machinery

1.1 The three equivalent forms of the log-likelihood

[L07, L08, slide deck §“Estimating the regression coefficients with ML”; concept: logistic-regression]

ISLP eq. (4.5) gives only the product-form likelihood:

$$\ell(\beta_0, \beta_1) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr).$$

Then ISLP says “the mathematical details of maximum likelihood are beyond the scope of this book.” Benjamin’s slide deck explicitly derives the log-likelihood in three equivalent forms — this is the object you actually differentiate, and the form he taught from:

$$\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\bigr]
            = \sum_{i=1}^{n} \bigl[\, y_i \eta_i + \log(1 - p_i) \,\bigr]
            = \sum_{i=1}^{n} \bigl[\, y_i \eta_i - \log(1 + e^{\eta_i}) \,\bigr],$$

where $\eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$ is the linear predictor and $p_i = p(x_i) = e^{\eta_i} / (1 + e^{\eta_i})$.

The third form is the clean one: $y_i \eta_i$ is linear in $\beta$, and $\log(1 + e^{\eta_i})$ is the log-partition term that makes $\ell(\beta)$ concave in $\beta$ — that concavity is why the MLE is unique (when it exists) and why the iterative solver converges.

1.2 Score equations and Newton–Raphson

[L07, slide deck §“Estimating the regression coefficients with ML”; concept: logistic-regression]

ISLP does not name Newton–Raphson; the slide deck does. The prof’s framing (L07):

“MLE has no closed form; you solve the score equations numerically with Newton’s method (Newton–Raphson / Fisher scoring).”

The mechanics, reproduced from the slide deck and L07:

  1. Differentiate $\ell(\beta)$ in (1.1) with respect to each $\beta_j$:

    $$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij}.$$

    (Take $x_{i0} = 1$ for the intercept.) Stack into a vector:

    $$\nabla \ell(\beta) = X^T (y - p),$$

    where $X$ is the $n \times (p+1)$ design matrix, $y$ the $n$-vector of responses, $p$ the $n$-vector of fitted probabilities. Setting this to zero gives $p + 1$ nonlinear equations in $\beta$ — no closed form.

  2. Solve numerically by Newton–Raphson: at iterate $\beta^{(t)}$,

    $$\beta^{(t+1)} = \beta^{(t)} - H\bigl(\beta^{(t)}\bigr)^{-1} \nabla\ell\bigl(\beta^{(t)}\bigr) = \beta^{(t)} + (X^T W X)^{-1} X^T (y - p),$$

    where the Hessian is

    $$H(\beta) = \frac{\partial^2 \ell}{\partial \beta\, \partial \beta^T} = -X^T W X, \qquad W = \mathrm{diag}\bigl(p_i (1 - p_i)\bigr).$$

    Fisher scoring replaces the observed Hessian with its expectation; for canonical-link GLMs (the logit is canonical for Bernoulli) the two are identical, so the distinction is verbal here.

  3. Convergence is fast (quadratic near the optimum) because $\ell(\beta)$ is concave. The standard error of $\hat\beta_j$ comes from the inverse Fisher information at convergence: $\widehat{\mathrm{Var}}(\hat\beta) = (X^T \hat W X)^{-1}$ evaluated at $\hat\beta$. This is the engine behind the R-style GLM summary table (estimate / SE / $z$-value). A minimal R sketch of these steps follows the list.
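
A minimal base-R sketch of the Newton–Raphson loop above. The function name, tolerances, and simulated data are illustrative choices, not the course's code; the final line just checks the result against glm().

    # Newton-Raphson for the logistic-regression MLE (illustration only)
    logistic_newton <- function(X, y, tol = 1e-8, max_iter = 25) {
      X <- cbind(1, X)                     # prepend the intercept column
      beta <- rep(0, ncol(X))              # start at beta = 0
      for (iter in seq_len(max_iter)) {
        eta <- drop(X %*% beta)            # linear predictor eta_i
        p   <- 1 / (1 + exp(-eta))         # fitted probabilities p_i
        W   <- p * (1 - p)                 # diagonal of the weight matrix
        score <- crossprod(X, y - p)       # gradient  X'(y - p)
        info  <- crossprod(X, W * X)       # X' W X  (minus the Hessian)
        step  <- solve(info, score)
        beta  <- beta + drop(step)
        if (max(abs(step)) < tol) break
      }
      list(coef = beta, se = sqrt(diag(solve(info))))  # SEs from inverse information
    }

    set.seed(1)
    x <- matrix(rnorm(200), ncol = 2)
    y <- rbinom(100, 1, plogis(0.5 + x %*% c(1, -1)))
    cbind(newton = logistic_newton(x, y)$coef,
          glm    = coef(glm(y ~ x, family = binomial)))   # should agree closely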

Exam-relevant takeaway

You will not be asked to run a Newton step by hand. You may be asked to write — in math, English, or pseudocode — how logistic-regression coefficients are estimated. The answer is: write the log-likelihood, differentiate, set $\nabla\ell(\beta) = 0$, observe there is no closed form, iterate by Newton–Raphson. ISLP gives you the first half; this section gives you the second.

1.3 Closed-form inversion: predicted-probability ↔ required covariate value

[L07; Exercise 4.4b; concept: logistic-regression]

Not in ISLP as a named result, but worth pinning down for hand calculation. Given fitted $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$, the predicted probability at $x$ is the sigmoid

$$\hat p(x) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_p x_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_p x_p}}.$$

Inverting for “what $x_j$ gives $\hat p(x) = p^*$?” with all other covariates held fixed: take the logit of both sides,

$$\log\frac{p^*}{1 - p^*} = \hat\beta_0 + \sum_{k} \hat\beta_k x_k,$$

solve linearly:

$$x_j = \frac{\log\bigl(p^*/(1 - p^*)\bigr) - \hat\beta_0 - \sum_{k \neq j} \hat\beta_k x_k}{\hat\beta_j}.$$

Special case $p^* = 0.5$: $\log\bigl(p^*/(1 - p^*)\bigr) = 0$, so the threshold-crossing covariate value is $x_j = -\bigl(\hat\beta_0 + \sum_{k \neq j} \hat\beta_k x_k\bigr)/\hat\beta_j$. Calculator-friendly, the kind of problem Exercise 4.4b drills.
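
A quick numeric check in R; the coefficient values are made up for illustration, not taken from any course dataset.

    # Invert a one-predictor logistic fit for the x that gives a target probability
    b0 <- -6; b1 <- 0.05                         # hypothetical fitted intercept and slope
    p_star <- 0.5
    x_needed <- (log(p_star / (1 - p_star)) - b0) / b1
    x_needed                                     # 120: logit(0.5) = 0, so x = -b0/b1
    1 / (1 + exp(-(b0 + b1 * x_needed)))         # plugging back in recovers 0.5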


2. LDA: the full pooled-covariance toolkit

2.1 Multivariate pooled covariance estimator (the formula)

[L09, slide deck §“Estimators for p>1”; concept: linear-discriminant-analysis]

ISLP gives the 1D pooled-variance estimator explicitly in eq. (4.20):

$$\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2.$$

ISLP then says, of the $p > 1$ case, only that “the formulas are similar to those used in the one-dimensional case, given in (4.20).” It does not write them out. The slide deck does, and Anders will want this at the exam table:

Per-class sample covariance:

$$\hat\Sigma_k = \frac{1}{n_k - 1} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T.$$

Pooled covariance (the LDA $\hat\Sigma$):

$$\hat\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T = \frac{\sum_{k=1}^{K} (n_k - 1)\,\hat\Sigma_k}{n - K}.$$

Equivalently: a weighted average of per-class sample covariances, weighted by degrees of freedom. The denominator is $n - K$ (the degrees of freedom of the pooled estimator), not $n$. In 1D the formula collapses to ISLP eq. (4.20). Exercise 4.2a drills exactly this for a 2-class bank-note example.
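
A short R sketch of the same estimator, assuming a numeric predictor matrix X and a class-label factor y; the iris call at the end is just a convenient built-in example, not a course dataset.

    # Pooled within-class covariance: sum the per-class scatter, divide by n - K
    pooled_cov <- function(X, y) {
      y <- as.factor(y)
      n <- nrow(X); K <- nlevels(y)
      scatter <- lapply(levels(y), function(k) {
        Xk <- X[y == k, , drop = FALSE]
        (nrow(Xk) - 1) * cov(Xk)        # (n_k - 1) * per-class sample covariance
      })
      Reduce(`+`, scatter) / (n - K)     # pooled degrees-of-freedom denominator
    }

    pooled_cov(as.matrix(iris[, 1:4]), iris$Species)   # 4 x 4 pooled Sigma-hat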

2.2 1D-LDA decision boundary, two classes, unequal priors

[L09, slide deck §“Parameter estimators”, R code chunk; concept: discriminant-score-and-decision-boundary]

ISLP gives the equal-priors 1D boundary (eq. 4.19):

$$x = \frac{\hat\mu_1 + \hat\mu_2}{2}.$$

The unequal-priors version is in the slide deck’s R example but never written algebraically. The prof’s R rule line:

# train1, train2: the two classes' 1D training samples; var.pool: pooled variance;
# n1train, n2train, n: class counts and total size (priors estimated as n_k / n)
rule = 0.5*(mean(train1) + mean(train2)) +
       var.pool*(log(n2train/n) - log(n1train/n)) / (mean(train1) - mean(train2))

is exactly the closed-form solution of $\delta_1(x) = \delta_2(x)$ for 1D LDA. Reproducing the derivation:

Set $\delta_1(x) = \delta_2(x)$ with $\delta_k(x) = x\,\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log\pi_k$ (the 1D LDA discriminant of ISLP eq. 4.18):

$$x\,\frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log\pi_1 = x\,\frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log\pi_2.$$

Collect:

$$x\,\frac{\mu_1 - \mu_2}{\sigma^2} = \frac{\mu_1^2 - \mu_2^2}{2\sigma^2} + \log\pi_2 - \log\pi_1.$$

Solve for $x$ (dividing both sides by $(\mu_1 - \mu_2)/\sigma^2$, valid whenever the means differ):

$$x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2\,(\log\pi_2 - \log\pi_1)}{\mu_1 - \mu_2}.$$

Equal priors: the second term vanishes and the rule reduces to ISLP’s midpoint rule. Direction of effect: increasing $\pi_1$ makes $\log\pi_2 - \log\pi_1$ more negative, which (with $\mu_1 > \mu_2$, so the factor $\sigma^2/(\mu_1 - \mu_2)$ is positive) decreases $x^*$ — the boundary moves away from class 1’s mean, i.e. more $x$-values get classified as class 1. The prof flagged the sign direction as a common trap.
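
A sanity-check sketch in R on simulated data (not the slide-deck example): the closed-form rule should sit exactly where the two estimated discriminants cross.

    set.seed(2)
    train1 <- rnorm(70, mean = 2)          # class 1, larger estimated prior
    train2 <- rnorm(30, mean = 0)          # class 2
    n1train <- length(train1); n2train <- length(train2); n <- n1train + n2train
    var.pool <- ((n1train - 1)*var(train1) + (n2train - 1)*var(train2)) / (n - 2)

    rule <- 0.5*(mean(train1) + mean(train2)) +
      var.pool*(log(n2train/n) - log(n1train/n)) / (mean(train1) - mean(train2))

    # 1D LDA discriminant (ISLP eq. 4.18) with plug-in estimates
    delta <- function(x, mu, prior) x*mu/var.pool - mu^2/(2*var.pool) + log(prior)
    delta(rule, mean(train1), n1train/n) - delta(rule, mean(train2), n2train/n)  # ~ 0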

2.3 The multivariate worked example (given $\mu_1$, $\mu_2$, shared $\Sigma$) — full algebra

[L09, slide deck §“Back to our synthetic example”; concept: discriminant-score-and-decision-boundary]

This is the prof’s flagged exam template (“And that’s often an exam question … you have the pi’s, you have the mu’s, and you’d have a value for the standard deviation, and then you would solve for where the decision point is”). The slide deck stops at the answer; the algebra is reproduced step by step here.

Setup: two classes in $\mathbb{R}^2$ with given means $\mu_1$, $\mu_2$, shared covariance $\Sigma$, equal priors $\pi_1 = \pi_2 = 1/2$.

The multivariate LDA discriminant (ISLP eq. 4.24):

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2}\,\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k.$$

Equating $\delta_1(x) = \delta_2(x)$:

$$x^T \Sigma^{-1}(\mu_1 - \mu_2) = \tfrac{1}{2}\bigl(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\bigr) + \log\pi_2 - \log\pi_1.$$

Compute each piece.

Cross-term coefficient $\Sigma^{-1}(\mu_1 - \mu_2)$: invert the $2 \times 2$ matrix $\Sigma$ by hand and multiply into $\mu_1 - \mu_2$; the resulting vector holds the coefficients of $x_1$ and $x_2$ on the left-hand side.

Intercept piece $\tfrac{1}{2}\bigl(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\bigr)$: two quadratic forms with the same $\Sigma^{-1}$, evaluated at the given means.

Prior piece: equal priors, so $\log\pi_2 - \log\pi_1 = 0$ and the term drops.

Assemble:

$$(\mu_1 - \mu_2)^T \Sigma^{-1} x = \tfrac{1}{2}\bigl(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\bigr),$$

one linear equation in $(x_1, x_2)$.

A line, falling straight out of the LDA assumption — the prof’s emphasized observation: “I didn’t ask for a line. I just equated the two things and then I solved for them and then it became a line.”

Procedural template for exam day

Whenever a 2-class LDA problem in 2D drops with given $\mu_1$, $\mu_2$, $\Sigma$, $\pi_1$, $\pi_2$ (a runnable version follows this list):

  1. Compute $\Sigma^{-1}$ (small dimension, hand-invert).
  2. Compute the cross-term row vector $a^T = (\mu_1 - \mu_2)^T \Sigma^{-1}$. The boundary’s $x$-coefficient row is exactly $a^T$.
  3. Compute the constant $c = \tfrac{1}{2}\bigl(\mu_1^T \Sigma^{-1} \mu_1 - \mu_2^T \Sigma^{-1} \mu_2\bigr) + \log\pi_2 - \log\pi_1$.
  4. Boundary equation: $a^T x = c$. In 2D, solve for $x_2$ as a linear function of $x_1$.
  5. Normal direction is $\Sigma^{-1}(\mu_1 - \mu_2)$, not $\mu_1 - \mu_2$. This is the subtlety ISLP glosses: $\Sigma^{-1}$ rotates and rescales the obvious connecting vector. Only when $\Sigma \propto I$ do the two coincide.
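
A minimal R version of the template, with made-up parameter values standing in for whatever the exam hands you.

    mu1 <- c(2, 1); mu2 <- c(0, 0)
    Sigma <- matrix(c(1.0, 0.3,
                      0.3, 1.0), nrow = 2)
    pi1 <- 0.5; pi2 <- 0.5

    Sinv <- solve(Sigma)                               # step 1
    a <- drop(Sinv %*% (mu1 - mu2))                    # step 2: x-coefficient vector
    c0 <- drop(0.5*(t(mu1) %*% Sinv %*% mu1 - t(mu2) %*% Sinv %*% mu2)) +
      log(pi2) - log(pi1)                              # step 3: constant

    # step 4: a[1]*x1 + a[2]*x2 = c0, i.e. x2 = c0/a[2] - (a[1]/a[2])*x1
    c(intercept = c0/a[2], slope = -a[1]/a[2])
    # step 5: the boundary's normal is a, not mu1 - mu2 (they coincide only when
    # Sigma is proportional to the identity)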

2.4 Posterior-probability recovery: the softmax of discriminant scores

[L09, slide deck §“Posterior probabilities”; concept: linear-discriminant-analysis]

ISLP gives the Bayes-rule expression (eq. 4.15) and notes you compute it explicitly. The slide deck adds a much cleaner recovery formula directly from the discriminant scores, which is what you’d actually use once you have the $\hat\delta_k(x)$’s:

$$\hat P(Y = k \mid X = x) = \frac{e^{\hat\delta_k(x)}}{\sum_{l=1}^{K} e^{\hat\delta_l(x)}}.$$

Why this works. $\delta_k(x) = \log\bigl(\pi_k f_k(x)\bigr) + c(x)$, where $c(x)$ collects the terms that do not depend on $k$. Exponentiating and normalizing kills the additive class-independent constant in the numerator and denominator simultaneously, so the softmax of the $\delta_k$’s returns the correct posterior $\pi_k f_k(x) / \sum_l \pi_l f_l(x)$. It works identically for QDA (just use the QDA $\delta_k$) and for naive Bayes.

Practical consequence: once you’ve computed $\hat\delta_k(x)$ for all $k$, you have everything — the classification (largest $\hat\delta_k$) and the probability estimates — without ever explicitly evaluating the Gaussian density.
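
A few lines of R illustrating the recovery step; the score values are invented for the example.

    delta <- c(class1 = 3.2, class2 = 2.1, class3 = -0.4)   # hypothetical delta_k(x)
    post <- exp(delta - max(delta))   # subtracting the max is another class-independent shift
    post / sum(post)                  # softmax: the posterior probabilities
    which.max(delta)                  # classification: largest discriminant score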

2.5 LDA as dimensionality reduction: $K$ classes → $K - 1$ scores

[L08, L09; concept: linear-discriminant-analysis]

Not in ISLP §4.4 proper (it lives behind the optional “Fisher’s discriminant” treatment, the material the prof skipped). The prof’s framing:

“$K$ classes → $K - 1$ discriminant scores … you’re basically taking a big matrix and transforming it into a new matrix, now of discriminant scores, that if you have fewer categories than you did originally, then actually you have a new transformation of the matrix specifically designed such that these things are best separated with a line.”

Concretely: adding any class-independent function of $x$ to all the discriminant scores leaves the posteriors unchanged (they are the softmax of the $\delta_k$’s), so only differences between scores carry information and only $K - 1$ of them are effectively free. The working representation of $x$ for classification purposes is the $(K - 1)$-vector of contrasts $\bigl(\delta_1(x) - \delta_K(x), \dots, \delta_{K-1}(x) - \delta_K(x)\bigr)$, which lives in $\mathbb{R}^{K-1}$, independent of $p$.

This is why LDA is robust in high $p$ where KNN dies of the curse of dimensionality: LDA implicitly projects $p$-dimensional $x$ down to a $(K - 1)$-dimensional discriminant-score representation in which the class boundaries are linear by construction. For $K = 2$, you get a single discriminant score and the entire problem reduces to thresholding it.
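
The same reduction is visible in MASS::lda’s output; iris here is just a convenient built-in with $K = 3$, $p = 4$, not a course dataset.

    library(MASS)
    fit <- lda(Species ~ ., data = iris)
    scores <- predict(fit)$x   # n x (K-1) matrix of discriminant scores (LD1, LD2)
    dim(scores)                # 150 x 2: four predictors reduced to K - 1 = 2 scores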


3. QDA: the compact form and the parameter count

3.1 QDA discriminant in compact (Mahalanobis) form

[L09, slide deck §“Quadratic Discriminant Analysis”; concept: quadratic-discriminant-analysis]

ISLP gives the expanded form (eq. 4.28) — both equivalent forms are shown in the slide deck, and the compact one is much easier to keep in working memory:

$$\delta_k(x) = -\tfrac{1}{2}\,(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log\pi_k.$$

Reading: (negative half squared Mahalanobis distance to the class mean) + (volume penalty for class $k$) + (log-prior of class $k$).

  • The Mahalanobis term $(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$ measures how far $x$ is from $\mu_k$, accounting for the shape of $\Sigma_k$.
  • The $-\tfrac{1}{2}\log\lvert\Sigma_k\rvert$ is a “volume penalty” — wider class distributions (large $\lvert\Sigma_k\rvert$) lose, because a wide-density class explains a fixed $x$ less well per unit volume than a tight-density class. This term is what survives the QDA derivation that didn’t survive LDA’s (in LDA, $\Sigma$ is the same across $k$ and cancels).
  • The $\log\pi_k$ term boosts more-populous classes.

This three-piece reading is the prof’s go-to mental model and the cleanest setup for an exam-day derivation of “where does the quadratic come from.”
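
A small R sketch of the compact form for a single class; every input (x, the means, covariances, priors) is a made-up stand-in, not a slide-deck value.

    qda_delta <- function(x, mu_k, Sigma_k, pi_k) {
      d <- x - mu_k
      mahal <- drop(t(d) %*% solve(Sigma_k) %*% d)      # squared Mahalanobis distance
      -0.5*mahal -                                      # distance-to-mean term
        0.5*as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus) +  # volume penalty
        log(pi_k)                                       # log-prior
    }

    x  <- c(1.0, 0.5)
    d1 <- qda_delta(x, mu_k = c(0, 0), Sigma_k = diag(c(1, 1)),   pi_k = 0.6)
    d2 <- qda_delta(x, mu_k = c(2, 1), Sigma_k = diag(c(2, 0.5)), pi_k = 0.4)
    which.max(c(d1, d2))   # classify x to the class with the larger discriminant score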

3.2 Parameter count: LDA vs QDA vs naive Bayes (the bias-variance ledger)

[L09, slide deck §“LDA vs QDA”; concepts: linear-discriminant-analysis, quadratic-discriminant-analysis, naive-bayes]

ISLP states (§4.4.3, p. 152) “QDA estimates a separate covariance matrix for each class, for a total of $Kp(p+1)/2$ parameters” and contrasts with “LDA … there are $Kp$ linear coefficients to estimate” — the latter is loose (it’s counting the discriminant-function coefficients, not the underlying covariance parameters). Benjamin’s slide deck gives the clean apples-to-apples count of covariance parameters only, which is the headline trade-off:

Method | Covariance parameters | Notes
LDA | $p(p+1)/2$ | One shared $\Sigma$; a symmetric $p \times p$ matrix has $p(p+1)/2$ free entries
QDA | $K\,p(p+1)/2$ | $K$ class-specific $\Sigma_k$’s, each symmetric $p \times p$
Naive Bayes (Gaussian, class-specific $\Sigma_k$) | $Kp$ (variances) | Diagonal $\Sigma_k$’s; no off-diagonal parameters
Naive Bayes (Gaussian, pooled $\Sigma$) | $p$ | Single diagonal $\Sigma$; equivalent to LDA with diagonal $\Sigma$

Add the $Kp$ mean parameters and $K - 1$ free priors to each row to get the full count, but the covariance row is the one the bias-variance argument lives on.

Slide-deck numerical example: plug the deck’s $p$ and $K$ into the counts above.

  • LDA covariance parameters: $p(p+1)/2$.
  • QDA covariance parameters: $K\,p(p+1)/2$.
  • Gaussian naive Bayes (class-specific) covariance parameters: $Kp$.

For total parameter count, naive Bayes scales linearly in $p$, LDA quadratically in $p$, and QDA quadratically in $p$ for each of the $K$ classes. That’s the bias-variance ledger.

Symmetry constraint trap: a $p \times p$ covariance matrix has $p(p+1)/2$ free parameters, not $p^2$. The matrix is symmetric, so only the upper triangle (including the diagonal) is free: $p$ diagonal entries + $p(p-1)/2$ off-diagonal entries = $p(p+1)/2$.
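
If you want to check the arithmetic for a given problem size, a throwaway helper; the $p$ and $K$ below are examples, not the slide-deck values.

    cov_params <- function(p, K) {
      c(LDA          = p*(p + 1)/2,       # one shared symmetric Sigma
        QDA          = K*p*(p + 1)/2,     # K class-specific symmetric Sigma_k's
        NB_classwise = K*p,               # K diagonal Sigma_k's (variances only)
        NB_pooled    = p)                 # one shared diagonal Sigma
    }
    cov_params(p = 4, K = 3)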


4. ROC and AUC: the formal pieces ISLP leaves verbal

4.1 Threshold-indexed TPR and FPR formulas

[L09, L10, slide deck §“ROC curves and AUC”; concept: roc-auc]

ISLP §4.4.2 describes the ROC curve verbally and shows Figure 4.8, but never writes the threshold-indexed empirical formulas. For pseudocode questions (“write how you’d compute X”), the prof wants these:

For a probabilistic classifier producing scores $\hat p_i = \hat P(Y_i = 1 \mid x_i)$ on a test set with binary $y_i \in \{0, 1\}$, at threshold $t$:

$$\mathrm{TPR}(t) = \frac{\#\{i : \hat p_i \ge t,\ y_i = 1\}}{\#\{i : y_i = 1\}}, \qquad \mathrm{FPR}(t) = \frac{\#\{i : \hat p_i \ge t,\ y_i = 0\}}{\#\{i : y_i = 0\}}.$$

Constructing the ROC curve. Sweep $t$ over the distinct values of $\hat p_i$ (in practice, also include $t = 0$ and $t = 1$ for the endpoints), compute $\bigl(\mathrm{FPR}(t), \mathrm{TPR}(t)\bigr)$, plot (a computational sketch follows the list). By construction:

  • $t = 1$ (or just above the largest score) → never predict positive → both rates $0$ → origin $(0, 0)$.
  • $t = 0$ → always predict positive → both rates $1$ → corner $(1, 1)$.
  • The curve is monotone non-decreasing in both axes as $t$ decreases.
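
A minimal base-R sketch of exactly this sweep, on simulated scores and labels (not a course dataset), with AUC by the trapezoidal rule.

    roc_points <- function(scores, labels) {
      thresholds <- sort(unique(c(0, scores, 1)), decreasing = TRUE)
      t(sapply(thresholds, function(thr) {
        pred <- scores >= thr
        c(fpr = mean(pred[labels == 0]),   # FPR(t): share of negatives flagged positive
          tpr = mean(pred[labels == 1]))   # TPR(t): share of positives caught
      }))
    }

    set.seed(3)
    labels <- rbinom(200, 1, 0.3)
    scores <- plogis(2*labels - 1 + rnorm(200))        # informative but noisy scores
    roc <- roc_points(scores, labels)
    auc <- sum(diff(roc[, "fpr"]) *
               (head(roc[, "tpr"], -1) + tail(roc[, "tpr"], -1)) / 2)  # trapezoids
    auc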

4.2 Probabilistic interpretation of AUC

[L09, L10; concept: roc-auc]

ISLP says only that AUC is the area under the ROC curve and that an ideal classifier has AUC $1$, chance gives AUC $0.5$. The probabilistic interpretation is the load-bearing one — it’s how the prof characterizes “what AUC = 0.7 actually means”:

$$\mathrm{AUC} = P\bigl(\hat p(X^{+}) > \hat p(X^{-})\bigr),$$

where $X^{+}$ is a randomly drawn positive ($y = 1$) example and $X^{-}$ a randomly drawn negative ($y = 0$), drawn independently. AUC is the probability that a random positive scores higher than a random negative (estimated directly in the sketch after the list below).

Consequences (the prof’s exam-style direction-of-effect facts):

  • AUC $= 0.5$: chance; the score has no separating power.
  • AUC $< 0.5$: classifier is genuinely informative but ordered backwards. Invert predictions to get $1 - \mathrm{AUC}$.
  • AUC $= 1$: perfect ordering (every positive scores higher than every negative).
  • AUC is invariant under any monotone increasing transformation of the scores. So you can compare classifiers across different score scales; only the ranking matters.
  • AUC is independent of class prevalence, unlike accuracy. That’s why it’s the stable metric in medicine.
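
The pairwise definition can be computed directly; a hedged check on the same simulated scores as above (ties counted as one half).

    set.seed(3)
    labels <- rbinom(200, 1, 0.3)
    scores <- plogis(2*labels - 1 + rnorm(200))
    pos <- scores[labels == 1]; neg <- scores[labels == 0]
    mean(outer(pos, neg, ">") + 0.5*outer(pos, neg, "=="))  # P(positive outranks negative)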

4.3 Qualitative AUC scale (prof-specific)

[L09, slide deck; concept: roc-auc]

ISLP doesn’t quantify “what’s a good AUC?” The slide deck and L09 give a working scale:

Reading the scale from 0.5 up to 1.0:

  • 0.5: useless (chance).
  • “OK” — informative but not great.
  • “Good.”
  • “Very good.”
  • 1.0: perfect (suspicious if seen in a real-world dataset).

Slide-deck reference points: LDA on the Default data gets AUC = 0.95 (“close to the maximum of 1.0, so would be considered very good”); logistic on the SAheart data gets AUC = 0.78.

4.4 LDA and logistic produce nearly identical ROC curves

[L09, slide deck §“Linearity”; concepts: roc-auc, linear-discriminant-analysis, logistic-regression]

ISLP mentions in passing (§4.4.2, caption of Fig. 4.8) that “the ROC curve for the logistic regression model … is virtually indistinguishable from this one for the LDA model.” The slide deck makes the reason explicit; it is reproduced here as a formal observation:

For a two-class problem, both LDA and logistic regression produce a posterior with the same functional form:

$$P(Y = 1 \mid X = x) = \frac{e^{c_0 + c^T x}}{1 + e^{c_0 + c^T x}}, \qquad \text{i.e.} \quad \log\frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = c_0 + c^T x.$$

The two methods estimate the coefficients differently — LDA plugs in Gaussian MLEs and Bayes’ rule; logistic regression maximizes the Bernoulli likelihood directly — but they produce the same family of decision boundaries (linear in $x$) and monotonically related score functions. Since AUC depends only on the ranking, the two AUCs are forced to be close when the parameter estimates are close.

When do they actually differ? When the Gaussian assumption is badly off (logistic wins, because it doesn’t lean on the assumption) or when the classes are very well separated (LDA’s MLE is more stable; logistic’s MLE can be unstable or fail to exist due to perfect separation).
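
A hedged simulation check (the data-generating settings are invented): when the Gaussian, shared-covariance assumption holds, the two fitted scores are essentially monotone transforms of one another, so their rankings, and hence their ROC curves, nearly coincide.

    library(MASS)
    set.seed(4)
    n <- 500
    y <- rbinom(n, 1, 0.4)
    d <- data.frame(y  = factor(y),
                    x1 = rnorm(n, mean = 1.5*y),    # class-shifted means,
                    x2 = rnorm(n, mean = -1.0*y))   # shared identity covariance

    p_lda <- predict(lda(y ~ x1 + x2, data = d))$posterior[, "1"]
    p_log <- predict(glm(y ~ x1 + x2, data = d, family = binomial), type = "response")
    cor(p_lda, p_log, method = "spearman")   # rank correlation near 1: near-identical ROC/AUC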


5. Notation and naming differences

The prof and the slide deck deviate from ISLP notation in a few small ways. These are pure relabelings, not separate concepts.

Quantity | ISLP notation | Prof / slide-deck notation
Linear predictor | $\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$, written out | $\eta$ (occasionally written out)
Logistic link | “logistic function” (no symbol) | Logit link: $\log\bigl(p/(1-p)\bigr) = \eta$
Loss function (classification) | Implicit, called “error rate” | “0/1 loss”
Estimator paradigms | Not named; described in §4.4 prose | Diagnostic (logistic, KNN) vs sampling/generative (LDA, QDA, naive Bayes)
ROC curve x-axis | “false positive rate” (Fig. 4.8 label) | “1 − specificity” (used interchangeably)
Justice-system analogy for sens/spec trade-off | Not present | Lecture-only verbal device
Bayes-flip denominator | Just “sum over classes” prose | “Partition function if you’re from physics” (substitute lecturer, L07)
Naive Bayes alternate name | “naive Bayes” | “Idiot’s Bayes” (slide deck variant)

ISLP also reserves “Bayes classifier” for the abstraction (the optimal decision rule under the true posterior) and uses “naive Bayes” for the specific generative classifier — the prof keeps this distinction strictly and warned about confusing the two; both terms mean what ISLP means.