Linear discriminant analysis (LDA)
The prof’s canonical generative classifier: model the class-conditional density $f_k(x) = p(x \mid y = k)$ as Gaussian with class-specific means $\mu_k$ and a pooled covariance $\Sigma$, plus a class prior $\pi_k$, then flip with Bayes. The resulting discriminant score $\delta_k(x)$ is linear in $x$; that’s the “L.” Decision-boundary derivation is exam-flagged twice.
Definition (prof’s framing)
“We’re not trying to model $P(Y \mid X)$ directly. We’re trying to get the other distributions to then use Bayes to get the classification.” - L09-classif-3
“We’re flipping it around. So now, instead of modeling this probability of Y given X directly, we want to make a model of what is the prior distribution of Y, like how likely is this class A or class B, and the probability of X given Y.” - L09-classif-3
The two assumptions that make it work:
- Within each class, $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma)$: a Gaussian class-conditional.
- The covariance $\Sigma$ is shared across classes (“pooled”). This is the L in LDA. Relax it → QDA.
Plus a class prior $\pi_k = P(Y = k)$ for each class.
Notation & setup
- $K$ classes labeled $k = 1, \dots, K$; binary case: $K = 2$.
- $x \in \mathbb{R}^p$, with $p$ predictors.
- $\mu_k \in \mathbb{R}^p$, class-$k$ mean vector.
- $\Sigma \in \mathbb{R}^{p \times p}$, shared covariance matrix (positive definite).
- $f_k(x)$, the multivariate normal density with mean $\mu_k$ and covariance $\Sigma$.
For $p = 1$, replace $\Sigma$ with the shared scalar $\sigma^2$.
Formula(s) to know cold
Bayes’ rule (the engine):

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

1D Gaussian discriminant score (linear in $x$; drop terms not depending on $k$, take logs):

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k$$

Multivariate Gaussian discriminant score:

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\,\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

Classify by $\hat{y}(x) = \arg\max_k \delta_k(x)$.

Posterior recovery (softmax over discriminants):

$$P(Y = k \mid X = x) = \frac{e^{\delta_k(x)}}{\sum_{l=1}^{K} e^{\delta_l(x)}}$$
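A minimal R sketch of the multivariate score, argmax rule, and softmax recovery; all parameter values are illustrative assumptions, not course data:

```r
# Multivariate LDA: discriminant scores, classification, softmax posterior.
mu    <- list(c(0, 0), c(2, 1))            # class means mu_k (made up)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)   # pooled covariance (made up)
prior <- c(0.5, 0.5)                       # class priors pi_k
Sinv  <- solve(Sigma)                      # precision matrix Sigma^{-1}

delta <- function(x, k) {
  drop(t(x) %*% Sinv %*% mu[[k]] - 0.5 * t(mu[[k]]) %*% Sinv %*% mu[[k]] + log(prior[k]))
}

x_new  <- c(1.5, 0.2)
scores <- sapply(seq_along(mu), function(k) delta(x_new, k))
which.max(scores)                # predicted class: argmax_k delta_k(x)
exp(scores) / sum(exp(scores))   # posterior P(Y = k | X = x) via softmax
```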
Parameter estimators (plug-in MLEs)
- $\hat\pi_k = n_k / n$ (class frequency).
- $\hat\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$.
- Pooled covariance, the formula to know:

$$\hat\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top$$

where $n_k = \#\{i : y_i = k\}$ and $n = \sum_k n_k$.
In 1D: $\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2$.
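A sketch of these estimators in R (the function and variable names are mine, not the course’s):

```r
# Plug-in LDA parameter estimates from a predictor matrix X (n x p)
# and a class label vector y.
estimate_lda <- function(X, y) {
  y   <- as.factor(y)
  n   <- nrow(X); K <- nlevels(y)
  pri <- table(y) / n                      # pi_hat_k = n_k / n
  mus <- lapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE]))
  # Pooled covariance: sum the within-class scatter matrices, divide by n - K.
  S <- Reduce(`+`, lapply(levels(y), function(k) {
    Xc <- scale(X[y == k, , drop = FALSE], center = TRUE, scale = FALSE)
    t(Xc) %*% Xc
  })) / (n - K)
  list(prior = pri, mu = mus, Sigma = S)
}
```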
Why the discriminant is linear
Take the log of $\pi_k f_k(x)$ with $f_k$ multivariate Gaussian. The exponent has $-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$, which expands as

$$-\tfrac{1}{2}\, x^\top \Sigma^{-1} x \;+\; x^\top \Sigma^{-1} \mu_k \;-\; \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k.$$

The $x^\top \Sigma^{-1} x$ term has no $k$ (because $\Sigma$ is shared) → it cancels in the $\arg\max$ → drop. What’s left is linear in $x$.
“The $x^\top \Sigma^{-1} x$ term has no $k$, gone… [in QDA we don’t pool], so the coefficient becomes $k$-dependent, the term survives, and the discriminant becomes quadratic.” - L09-classif-3
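A numeric check of the cancellation, assuming the `mvtnorm` package for the Gaussian density (all values made up):

```r
library(mvtnorm)  # for dmvnorm(); assumed installed

mu1 <- c(0, 0); mu2 <- c(2, 1)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2); Sinv <- solve(Sigma)
pi1 <- 0.4; pi2 <- 0.6
x <- c(1.2, -0.5)

# Full log-ratio from the Gaussian densities (quadratic term present)...
full <- log(pi1) + dmvnorm(x, mu1, Sigma, log = TRUE) -
        log(pi2) - dmvnorm(x, mu2, Sigma, log = TRUE)

# ...equals the difference of the *linear* discriminant scores, because
# the x' Sinv x term is identical for both classes and cancels.
delta <- function(mu, p) drop(t(x) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(p))
all.equal(full, delta(mu1, pi1) - delta(mu2, pi2))  # TRUE
```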
Insights & mental models
- The decision boundary is a consequence of the assumed model, not a fitted parameter. Contrast with logistic regression, where the boundary’s slope is a parameter. In LDA, boundaries fall out from equating $\delta_k(x) = \delta_l(x)$. - L09-classif-3
- Priors literally move the boundary. Increasing $\pi_k$ shifts the boundary away from class $k$ (more things classified as $k$). The product-curve picture is the prof’s go-to visualization.
- Equal priors, 2 classes, equal $\sigma^2$, 1D: boundary at $x = (\mu_1 + \mu_2)/2$. Solve $\delta_1(x) = \delta_2(x)$ to verify (see the sketch after this list).
- LDA = dimensionality reduction. $K$ classes → a $(K-1)$-dimensional discriminant space. “Mapping it down to a dimension specifically to separate out these categories.” - L09-classif-3. Useful in high $p$ where KNN is cursed.
- In-sample LDA can beat the Bayes-optimal classifier on training data. Sounds paradoxical; it’s just overfitting. The Bayes classifier uses true parameters; the fitted LDA chases noise. - L09-classif-3
- LDA ↔ logistic regression are very close. Same linear log-odds form (slide deck note: $\log \frac{P(Y=1 \mid x)}{P(Y=2 \mid x)}$ is linear in $x$ for both). Different parameter-estimation routes: Gaussian plug-in vs. MLE.
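A minimal 1D check of the midpoint/prior bullets above (parameter values are illustrative assumptions). Setting $\delta_1(x) = \delta_2(x)$ with the 1D score gives $x^\* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \log(\pi_2/\pi_1)}{\mu_1 - \mu_2}$:

```r
# 1D LDA boundary: the x where delta_1(x) = delta_2(x). Illustrative numbers.
mu1 <- 0; mu2 <- 4; sigma2 <- 1

boundary <- function(pi1) {
  pi2 <- 1 - pi1
  (mu1 + mu2) / 2 + sigma2 * log(pi2 / pi1) / (mu1 - mu2)
}

boundary(0.5)  # equal priors: exactly the midpoint, 2
boundary(0.8)  # ~2.35: raising pi_1 pushes the boundary toward mu2,
               # enlarging class 1's region -- "away from class 1"
```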
Worked 2D example, the prof’s recurring template
Setup: $p = 2$, two classes with means $\mu_1, \mu_2$, shared $\Sigma$, equal priors. Equate $\delta_1(x) = \delta_2(x)$:

$$x^\top \Sigma^{-1} (\mu_1 - \mu_2) = \tfrac{1}{2}\left(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\right)$$

With $\Sigma = \sigma^2 I$, this collapses to $x^\top(\mu_1 - \mu_2) = \tfrac{1}{2}(\|\mu_1\|^2 - \|\mu_2\|^2)$, i.e. a straight line perpendicular to $\mu_1 - \mu_2$ through the midpoint $(\mu_1 + \mu_2)/2$.
“I didn’t ask for a line. I just equated the two things and then I solved for them and then it became a line. I didn’t tell the math to give me a line.” - L09-classif-3
The prof did the algebra live, slipped on a cancellation, and came back from break with the corrected version. Worth knowing because this is the exam pattern (next section).
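The same algebra in R, a sketch with made-up means and covariance (only the recipe matters): the boundary is the line $a^\top x = c$ with $a = \Sigma^{-1}(\mu_1 - \mu_2)$.

```r
# 2D LDA decision boundary between classes 1 and 2, equal priors.
# delta_1(x) = delta_2(x)  <=>  a' x = c0, with a and c0 as below.
mu1 <- c(1, 1); mu2 <- c(3, 2)              # illustrative means
Sigma <- matrix(c(1, 0.2, 0.2, 1), 2, 2)    # illustrative pooled covariance
Sinv  <- solve(Sigma)

a  <- Sinv %*% (mu1 - mu2)                  # normal vector of the line
c0 <- 0.5 * (t(mu1) %*% Sinv %*% mu1 - t(mu2) %*% Sinv %*% mu2)

# Rearranged as x2 = intercept + slope * x1:
slope     <- -a[1] / a[2]
intercept <- drop(c0) / a[2]
c(intercept = intercept, slope = slope)
```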
Exam signals
“And that’s often an exam question. Or that would be a typical exam question. So you would, given an LDA setting and here are the values for the parameters, you have the pies, you have the mu’s, and you’d have a value for the standard deviation, and then you would solve for where is the decision point.” - L09-classif-3
“This would be another kind of question that you could ask on an exam, like find the equation for the decision boundary between these two categories. Then you solve for the equation for the line or the plane or whatever it happens to be depending on the dimension of the X’s.” - L09-classif-3
“That would be another question one could ask if I was so inspired, show that this leads to this thing.” - L09-classif-3 (re: deriving $\delta_k(x)$ from the Gaussian)
The prof flagged decision-boundary derivation twice, across two different lectures’ worth of material; see discriminant-score-and-decision-boundary for the standalone procedural atom.
Pitfalls
- Forgetting that $\Sigma$ is pooled. If you use class-specific $\Sigma_k$, you’ve done QDA, not LDA. Boundary becomes quadratic.
- Confusing $\Sigma$ with $\Sigma^{-1}$ in the discriminant formula. The matrix in $\delta_k(x)$ is $\Sigma^{-1}$, the precision matrix. Don’t drop the inverse.
- Wrong direction of “decision boundary moves with prior.” Increasing $\pi_k$ moves the boundary away from class $k$ (more area classified as $k$). Easy to flip in a hurry.
- Treating in-sample confusion matrix as the truth. “You’re not looking at how well you classified out of sample. This is in-sample, on the data you actually train on. So you’re going to do really well because you have all the noise in the data.” - L09-classif-3
- Gaussian assumption violated. “Maybe it’s not a good idea to pretend that the X’s are well modeled by a Gaussian, that’s a good way to break a model.” - L09-classif-3
- Pooling assumption violated: when class covariances genuinely differ, bias goes up.
Scope vs ISLP
- In scope: Bayes’ rule for class probabilities, Gaussian class-conditionals, pooled covariance, derivation of $\delta_k(x)$ (1D and multivariate), decision-boundary derivation, parameter estimation (plug-in MLEs), softmax recovery of the posterior, comparison with logistic regression, LDA-as-dimensionality-reduction.
- Look up in ISLP: §4.4.1 (LDA for $p = 1$), §4.4.2 (LDA for $p > 1$), pp. 145–155. Equations (4.18) and (4.24) are the canonical formulas; Figure 4.6 is the 3-class decision-boundary picture.
- Skip in ISLP:
- Fisher’s discriminant derivation (within-class vs between-class variance ratio, eigenvectors of $\Sigma_W^{-1} \Sigma_B$); the slide deck section is marked “Optional” and the prof never lectured on it.
- Multinomial-logistic-vs-LDA detailed mapping: prof skipped multinomial logistic.
Exercise instances
- Exercise4.2a: write the pooled covariance estimator across two groups (genuine vs fake bank notes); plug in the per-group estimates, with equal sample sizes.
- Exercise4.2b: state the LDA assumptions; write the classification rule for a new observation (need to assume normality within each class and equal $\Sigma$).
- Exercise4.2c: classify a bank note with length 214 / diagonal 140.4 using LDA. R-friendly matrix calc (generic recipe sketched after this list).
- Exercise4.6e: `lda(Direction ~ Lag2)` on the `Weekly` data; held-out confusion matrix.
- CE1 problem 3d: explain $\pi_k$ (prior), $\mu_k$ (class mean vector), $\Sigma$ (pooled covariance), $f_k(x)$ (multivariate Gaussian density) in words.
- CE1 problem 3e: derive $\delta_k(x)$ from Bayes’ rule, solve for the boundary in the form of a line, plot it.
- CE1 problem 3f: `lda()` in R, confusion matrix, sensitivity/specificity on the tennis test set.
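As referenced in the 4.2c bullet, a generic matrix-calc recipe; every parameter value below is a placeholder, not the exercise’s actual estimate:

```r
# Generic LDA classification of a new observation x0 (two predictors).
# mu_g, mu_f, Sigma_hat, prior are hypothetical placeholders -- substitute
# the estimates computed in the exercise.
mu_g <- c(215.0, 141.5)                      # hypothetical "genuine" mean
mu_f <- c(214.8, 139.5)                      # hypothetical "fake" mean
Sigma_hat <- matrix(c(0.15, 0.05, 0.05, 0.20), 2, 2)
prior <- c(genuine = 0.5, fake = 0.5)
Sinv  <- solve(Sigma_hat)

x0 <- c(214, 140.4)                          # new bank note: length, diagonal
delta <- function(mu, p) drop(t(x0) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(p))
scores <- c(genuine = delta(mu_g, prior[1]), fake = delta(mu_f, prior[2]))
names(which.max(scores))                     # predicted class
```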
How it might appear on the exam
- Decision-boundary derivation (the prof’s flagged pattern): Given $\pi_k$, $\mu_k$, $\Sigma$ (or $\sigma^2$ in 1D), solve $\delta_1(x) = \delta_2(x)$ for $x$. 1D → a point; 2D → a line. The prof said this twice.
- Discriminant-score derivation: Show how $\delta_k(x)$ comes from $\pi_k f_k(x)$ by dropping $k$-independent terms. Show why the $x^\top \Sigma^{-1} x$ term drops out (the $\Sigma$ is pooled).
- Pooled-covariance computation: Given per-class $\hat\Sigma_k$’s and $n_k$’s, compute $\hat\Sigma$.
- Output interpretation: Given `lda()` output (means, prior, scaling), classify a new observation; or read off whether a sample is closer to class A or B in the discriminant space (see the sketch after this list).
- Method comparison: “When would you prefer LDA to logistic regression?” → Gaussian assumption holds, well-separated classes, small $n$, multi-class. Or “to QDA?” → small $n$, equal-covariance assumption defensible (bias-variance argument).
- Confusion-matrix companion: sensitivity, specificity from an LDA confusion matrix.
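A sketch of the `lda()` workflow these questions assume, using `MASS::lda` on simulated stand-in data (variable names are illustrative):

```r
library(MASS)  # lda()

# Simulated two-class data standing in for the exercise datasets.
set.seed(1)
n <- 200
y <- factor(rep(c("A", "B"), each = n / 2))
X <- rbind(matrix(rnorm(n, mean = 0),   ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
dat <- data.frame(x1 = X[, 1], x2 = X[, 2], y = y)

train <- sample(n, n / 2)
fit   <- lda(y ~ x1 + x2, data = dat, subset = train)
fit$prior; fit$means; fit$scaling   # the pieces an "output interpretation" question hands you

pred <- predict(fit, newdata = dat[-train, ])$class
cm   <- table(predicted = pred, actual = dat$y[-train])  # held-out confusion matrix

sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # true-positive rate, "B" as positive
specificity <- cm["A", "A"] / sum(cm[, "A"])  # true-negative rate
```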
Related
- discriminant-score-and-decision-boundary: the standalone procedural atom for deriving and solving for the boundary.
- quadratic-discriminant-analysis: drop the pooled- assumption, get quadratic boundaries.
- naive-bayes: assume $\Sigma$ is diagonal (predictors conditionally independent given class).
- logistic-regression: same linear log-odds form, different fitting route.
- multivariate-normal: the class-conditional density assumption.
- diagnostic-vs-sampling-paradigm: LDA is the canonical sampling method.
- dimensionality-reduction: LDA as a $(K-1)$-D projection.
- bias-variance-tradeoff: pooling reduces variance (fewer parameters) at the cost of bias if the true $\Sigma_k$ differ.
- confusion-matrix, sensitivity-specificity, roc-auc: performance metrics.