Linear discriminant analysis (LDA)
The prof’s canonical generative classifier: model the class-conditional density $f_k(x) = p(x \mid y = k)$ as Gaussian with class-specific means $\mu_k$ and a pooled covariance $\Sigma$, plus a class prior $\pi_k$, then flip with Bayes. The resulting discriminant score $\delta_k(x)$ is linear in $x$; that’s the “L.” Decision-boundary derivation is exam-flagged twice.
Definition (prof’s framing)
“We’re not trying to model $P(Y \mid X)$ directly. We’re trying to get the other distributions to then use Bayes to get the classification.” - L09-classif-3
“We’re flipping it around. So now, instead of modeling this probability of Y given X directly, we want to make a model of what is the prior distribution of Y, like how likely is this class A or class B, and the probability of X given Y.” - L09-classif-3
The two assumptions that make it work:
- Within each class, $X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma)$: a Gaussian class-conditional.
- The covariance $\Sigma$ is shared across classes (“pooled”). This is the L in LDA. Relax it → QDA.
Plus a class prior $\pi_k = P(Y = k)$ for each class.
Notation & setup
- $K$ classes labeled $k = 1, \dots, K$; binary case: $K = 2$.
- $x \in \mathbb{R}^p$, with $p$ predictors.
- $\mu_k \in \mathbb{R}^p$, class-$k$ mean vector.
- $\Sigma \in \mathbb{R}^{p \times p}$, shared covariance matrix (positive definite).
- $f_k(x)$, the multivariate normal density with mean $\mu_k$ and covariance $\Sigma$.
For $p = 1$, replace $\Sigma$ with the shared scalar $\sigma^2$.
Formula(s) to know cold
Bayes’ rule (the engine):

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

1D Gaussian discriminant score (linear in $x$; drop terms not depending on $k$, take logs):

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k$$

Multivariate Gaussian discriminant score:

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\,\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

Classify by $\hat{y}(x) = \arg\max_k \delta_k(x)$.

Posterior recovery (softmax over discriminants):

$$P(Y = k \mid X = x) = \frac{e^{\delta_k(x)}}{\sum_{l=1}^{K} e^{\delta_l(x)}}$$
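A minimal R sketch of the multivariate score, argmax rule, and softmax recovery; all parameter values are illustrative assumptions, not course data:

```r
# Multivariate LDA: discriminant scores, classification, softmax posterior.
mu    <- list(c(0, 0), c(2, 1))            # class means mu_k (made up)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2)   # pooled covariance (made up)
prior <- c(0.5, 0.5)                       # class priors pi_k
Sinv  <- solve(Sigma)                      # precision matrix Sigma^{-1}

delta <- function(x, k) {
  drop(t(x) %*% Sinv %*% mu[[k]] - 0.5 * t(mu[[k]]) %*% Sinv %*% mu[[k]] + log(prior[k]))
}

x_new  <- c(1.5, 0.2)
scores <- sapply(seq_along(mu), function(k) delta(x_new, k))
which.max(scores)                # predicted class: argmax_k delta_k(x)
exp(scores) / sum(exp(scores))   # posterior P(Y = k | X = x) via softmax
```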
Parameter estimators (plug-in MLEs)
- $\hat\pi_k = n_k / n$ (class frequency).
- $\hat\mu_k = \frac{1}{n_k} \sum_{i:\, y_i = k} x_i$.
- Pooled covariance, the formula to know:

$$\hat\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top$$

where $n_k = \#\{i : y_i = k\}$ and $n = \sum_k n_k$.
In 1D: $\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i:\, y_i = k} (x_i - \hat\mu_k)^2$.
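A sketch of these estimators in R (the function and variable names are mine, not the course’s):

```r
# Plug-in LDA parameter estimates from a predictor matrix X (n x p)
# and a class label vector y.
estimate_lda <- function(X, y) {
  y   <- as.factor(y)
  n   <- nrow(X); K <- nlevels(y)
  pri <- table(y) / n                      # pi_hat_k = n_k / n
  mus <- lapply(levels(y), function(k) colMeans(X[y == k, , drop = FALSE]))
  # Pooled covariance: sum the within-class scatter matrices, divide by n - K.
  S <- Reduce(`+`, lapply(levels(y), function(k) {
    Xc <- scale(X[y == k, , drop = FALSE], center = TRUE, scale = FALSE)
    t(Xc) %*% Xc
  })) / (n - K)
  list(prior = pri, mu = mus, Sigma = S)
}
```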
Why the discriminant is linear
Take the log of $\pi_k f_k(x)$ with $f_k$ multivariate Gaussian. The exponent has $-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$, which expands as

$$-\tfrac{1}{2}\, x^\top \Sigma^{-1} x \;+\; x^\top \Sigma^{-1} \mu_k \;-\; \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k.$$

The $x^\top \Sigma^{-1} x$ term has no $k$ (because $\Sigma$ is shared) → it cancels in the $\arg\max$ → drop. What’s left is linear in $x$.
“The $x^\top \Sigma^{-1} x$ term has no $k$, gone… [in QDA we don’t pool], so the coefficient becomes $k$-dependent, the term survives, and the discriminant becomes quadratic.” - L09-classif-3
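A numeric check of the cancellation, assuming the `mvtnorm` package for the Gaussian density (all values made up):

```r
library(mvtnorm)  # for dmvnorm(); assumed installed

mu1 <- c(0, 0); mu2 <- c(2, 1)
Sigma <- matrix(c(1, 0.3, 0.3, 1), 2, 2); Sinv <- solve(Sigma)
pi1 <- 0.4; pi2 <- 0.6
x <- c(1.2, -0.5)

# Full log-ratio from the Gaussian densities (quadratic term present)...
full <- log(pi1) + dmvnorm(x, mu1, Sigma, log = TRUE) -
        log(pi2) - dmvnorm(x, mu2, Sigma, log = TRUE)

# ...equals the difference of the *linear* discriminant scores, because
# the x' Sinv x term is identical for both classes and cancels.
delta <- function(mu, p) drop(t(x) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(p))
all.equal(full, delta(mu1, pi1) - delta(mu2, pi2))  # TRUE
```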
Insights & mental models
- The decision boundary is a consequence of the assumed model, not a fitted parameter. Contrast with logistic regression, where the boundary’s slope is a parameter. In LDA, boundaries fall out from equating $\delta_k(x) = \delta_l(x)$. - L09-classif-3
- Priors literally move the boundary. Increasing $\pi_k$ shifts the boundary away from class $k$ (more things classified as $k$). The product-curve picture is the prof’s go-to visualization.
- Equal priors, 2 classes, equal $\sigma^2$, 1D: boundary at $x = (\mu_1 + \mu_2)/2$. Solve $\delta_1(x) = \delta_2(x)$ to verify (see the sketch after this list).
- LDA = dimensionality reduction. $K$ classes → a $(K-1)$-dimensional discriminant space. “Mapping it down to a dimension specifically to separate out these categories.” - L09-classif-3. Useful in high $p$ where KNN is cursed.
- In-sample LDA can beat the Bayes-optimal classifier on training data. Sounds paradoxical; it’s just overfitting. The Bayes classifier uses true parameters; the fitted LDA chases noise. - L09-classif-3
- LDA ↔ logistic regression are very close. Same linear log-odds form (slide deck note: $\log \frac{P(Y=1 \mid x)}{P(Y=2 \mid x)}$ is linear in $x$ for both). Different parameter-estimation routes: Gaussian plug-in vs. MLE.
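A minimal 1D check of the midpoint/prior bullets above (parameter values are illustrative assumptions). Setting $\delta_1(x) = \delta_2(x)$ with the 1D score gives $x^\* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \log(\pi_2/\pi_1)}{\mu_1 - \mu_2}$:

```r
# 1D LDA boundary: the x where delta_1(x) = delta_2(x). Illustrative numbers.
mu1 <- 0; mu2 <- 4; sigma2 <- 1

boundary <- function(pi1) {
  pi2 <- 1 - pi1
  (mu1 + mu2) / 2 + sigma2 * log(pi2 / pi1) / (mu1 - mu2)
}

boundary(0.5)  # equal priors: exactly the midpoint, 2
boundary(0.8)  # ~2.35: raising pi_1 pushes the boundary toward mu2,
               # enlarging class 1's region -- "away from class 1"
```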
Worked 2D example, the prof’s recurring template
Setup: $p = 2$, two classes with means $\mu_1, \mu_2$, shared $\Sigma$, equal priors. Equate $\delta_1(x) = \delta_2(x)$:

$$x^\top \Sigma^{-1} (\mu_1 - \mu_2) = \tfrac{1}{2}\left(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\right)$$

With $\Sigma = \sigma^2 I$, this collapses to $x^\top(\mu_1 - \mu_2) = \tfrac{1}{2}(\|\mu_1\|^2 - \|\mu_2\|^2)$, i.e. a straight line perpendicular to $\mu_1 - \mu_2$ through the midpoint $(\mu_1 + \mu_2)/2$.
“I didn’t ask for a line. I just equated the two things and then I solved for them and then it became a line. I didn’t tell the math to give me a line.” - L09-classif-3
The prof did the algebra live, slipped on a cancellation, and came back from break with the corrected version. Worth knowing because this is the exam pattern (next section).
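The same algebra in R, a sketch with made-up means and covariance (only the recipe matters): the boundary is the line $a^\top x = c$ with $a = \Sigma^{-1}(\mu_1 - \mu_2)$.

```r
# 2D LDA decision boundary between classes 1 and 2, equal priors.
# delta_1(x) = delta_2(x)  <=>  a' x = c0, with a and c0 as below.
mu1 <- c(1, 1); mu2 <- c(3, 2)              # illustrative means
Sigma <- matrix(c(1, 0.2, 0.2, 1), 2, 2)    # illustrative pooled covariance
Sinv  <- solve(Sigma)

a  <- Sinv %*% (mu1 - mu2)                  # normal vector of the line
c0 <- 0.5 * (t(mu1) %*% Sinv %*% mu1 - t(mu2) %*% Sinv %*% mu2)

# Rearranged as x2 = intercept + slope * x1:
slope     <- -a[1] / a[2]
intercept <- drop(c0) / a[2]
c(intercept = intercept, slope = slope)
```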
Exam signals
“And that’s often an exam question. Or that would be a typical exam question. So you would, given an LDA setting and here are the values for the parameters, you have the pies, you have the mu’s, and you’d have a value for the standard deviation, and then you would solve for where is the decision point.” - L09-classif-3
“This would be another kind of question that you could ask on an exam, like find the equation for the decision boundary between these two categories. Then you solve for the equation for the line or the plane or whatever it happens to be depending on the dimension of the X’s.” - L09-classif-3
“That would be another question one could ask if I was so inspired, show that this leads to this thing.” - L09-classif-3 (re: deriving $\delta_k(x)$ from the Gaussian)
The prof flagged decision-boundary derivation twice, across two different lectures’ worth of material; see discriminant-score-and-decision-boundary for the standalone procedural atom.
Pitfalls
- Forgetting that $\Sigma$ is pooled. If you use class-specific $\Sigma_k$, you’ve done QDA, not LDA. Boundary becomes quadratic.
- Confusing $\Sigma$ with $\Sigma^{-1}$ in the discriminant formula. The matrix in $\delta_k(x)$ is $\Sigma^{-1}$, the precision matrix. Don’t drop the inverse.
- Wrong direction of “decision boundary moves with prior.” Increasing $\pi_k$ moves the boundary away from class $k$ (more area classified as $k$). Easy to flip in a hurry.
- Treating in-sample confusion matrix as the truth. “You’re not looking at how well you classified out of sample. This is in-sample, on the data you actually train on. So you’re going to do really well because you have all the noise in the data.” - L09-classif-3
- Gaussian assumption violated. “Maybe it’s not a good idea to pretend that the X’s are well modeled by a Gaussian, that’s a good way to break a model.” - L09-classif-3
- Pooling assumption violated: when class covariances genuinely differ, bias goes up.
Scope vs ISLP
- In scope: Bayes’ rule for class probabilities, Gaussian class-conditionals, pooled covariance, derivation of $\delta_k(x)$ (1D and multivariate), decision-boundary derivation, parameter estimation (plug-in MLEs), softmax recovery of the posterior, comparison with logistic regression, LDA-as-dimensionality-reduction.
- Look up in ISLP: §4.4.1 (LDA for $p = 1$), §4.4.2 (LDA for $p > 1$), pp. 145–155. Equations (4.18) and (4.24) are the canonical formulas; Figure 4.6 is the 3-class decision-boundary picture.
- Skip in ISLP:
- Fisher’s discriminant derivation (within-class vs between-class variance ratio, eigenvectors of $\Sigma_W^{-1} \Sigma_B$); the slide deck section is marked “Optional” and the prof never lectured on it.
- Multinomial-logistic-vs-LDA detailed mapping: prof skipped multinomial logistic.
Exercise instances
- Exercise4.2a: write the pooled covariance estimator across two groups (genuine vs fake bank notes); plug in the per-group estimates, with equal sample sizes.
- Exercise4.2b: state the LDA assumptions; write the classification rule for a new observation (need to assume normality within each class and equal $\Sigma$).
- Exercise4.2c: classify a bank note with length 214 / diagonal 140.4 using LDA. R-friendly matrix calc (generic recipe sketched after this list).
- Exercise4.6e: `lda(Direction ~ Lag2)` on the `Weekly` data; held-out confusion matrix.
- CE1 problem 3d: explain $\pi_k$ (prior), $\mu_k$ (class mean vector), $\Sigma$ (pooled covariance), $f_k(x)$ (multivariate Gaussian density) in words.
- CE1 problem 3e: derive $\delta_k(x)$ from Bayes’ rule, solve for the boundary in the form of a line, plot it.
- CE1 problem 3f: `lda()` in R, confusion matrix, sensitivity/specificity on the tennis test set.
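As referenced in the 4.2c bullet, a generic matrix-calc recipe; every parameter value below is a placeholder, not the exercise’s actual estimate:

```r
# Generic LDA classification of a new observation x0 (two predictors).
# mu_g, mu_f, Sigma_hat, prior are hypothetical placeholders -- substitute
# the estimates computed in the exercise.
mu_g <- c(215.0, 141.5)                      # hypothetical "genuine" mean
mu_f <- c(214.8, 139.5)                      # hypothetical "fake" mean
Sigma_hat <- matrix(c(0.15, 0.05, 0.05, 0.20), 2, 2)
prior <- c(genuine = 0.5, fake = 0.5)
Sinv  <- solve(Sigma_hat)

x0 <- c(214, 140.4)                          # new bank note: length, diagonal
delta <- function(mu, p) drop(t(x0) %*% Sinv %*% mu - 0.5 * t(mu) %*% Sinv %*% mu + log(p))
scores <- c(genuine = delta(mu_g, prior[1]), fake = delta(mu_f, prior[2]))
names(which.max(scores))                     # predicted class
```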
How it might appear on the exam
- Decision-boundary derivation (the prof’s flagged pattern): Given $\pi_k$, $\mu_k$, $\Sigma$ (or $\sigma^2$ in 1D), solve $\delta_1(x) = \delta_2(x)$ for $x$. 1D → a point; 2D → a line. The prof said this twice.
- Discriminant-score derivation: Show how $\delta_k(x)$ comes from $\pi_k f_k(x)$ by dropping $k$-independent terms. Show why the $x^\top \Sigma^{-1} x$ term drops out (the $\Sigma$ is pooled).
- Pooled-covariance computation: Given per-class $\hat\Sigma_k$’s and $n_k$’s, compute $\hat\Sigma$.
- Output interpretation: Given `lda()` output (means, prior, scaling), classify a new observation; or read off whether a sample is closer to class A or B in the discriminant space (see the sketch after this list).
- Method comparison: “When would you prefer LDA to logistic regression?” → Gaussian assumption holds, well-separated classes, small $n$, multi-class. Or “to QDA?” → small $n$, equal-covariance assumption defensible (bias-variance argument).
- Confusion-matrix companion: sensitivity, specificity from an LDA confusion matrix.
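A sketch of the `lda()` workflow these questions assume, using `MASS::lda` on simulated stand-in data (variable names are illustrative):

```r
library(MASS)  # lda()

# Simulated two-class data standing in for the exercise datasets.
set.seed(1)
n <- 200
y <- factor(rep(c("A", "B"), each = n / 2))
X <- rbind(matrix(rnorm(n, mean = 0),   ncol = 2),
           matrix(rnorm(n, mean = 1.5), ncol = 2))
dat <- data.frame(x1 = X[, 1], x2 = X[, 2], y = y)

train <- sample(n, n / 2)
fit   <- lda(y ~ x1 + x2, data = dat, subset = train)
fit$prior; fit$means; fit$scaling   # the pieces an "output interpretation" question hands you

pred <- predict(fit, newdata = dat[-train, ])$class
cm   <- table(predicted = pred, actual = dat$y[-train])  # held-out confusion matrix

sensitivity <- cm["B", "B"] / sum(cm[, "B"])  # true-positive rate, "B" as positive
specificity <- cm["A", "A"] / sum(cm[, "A"])  # true-negative rate
```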
Related
- discriminant-score-and-decision-boundary: the standalone procedural atom for deriving and solving for the boundary.
- quadratic-discriminant-analysis: drop the pooled- assumption, get quadratic boundaries.
- naive-bayes: assume $\Sigma$ is diagonal (predictors conditionally independent given class).
- logistic-regression: same linear log-odds form, different fitting route.
- multivariate-normal: the class-conditional density assumption.
- diagnostic-vs-sampling-paradigm: LDA is the canonical sampling method.
- dimensionality-reduction: LDA as a $(K-1)$-D projection.
- bias-variance-tradeoff: pooling reduces variance (fewer parameters) at the cost of bias if the true $\Sigma_k$ differ.
- confusion-matrix, sensitivity-specificity, roc-auc: performance metrics.