Multivariate normal distribution
The bridge between modules 2, 3, and 4. Module 2 introduces it as the joint distribution of a random vector $X$; module 3 uses it for the sampling distribution of $\hat\beta$ in OLS; module 4 uses it as the class-conditional density in LDA and QDA. Same density, three roles. The exam-bait facts: contour ellipsoids tell you the covariance structure (CE1 1g); marginals are normal; zero covariance ⇒ independence under joint normality (only here).
Definition (prof’s framing)
“Today we’re going to talk about random vectors. We’re going to talk about the covariance matrix, correlation matrix, what those are, and also the normal distribution, in particular the multivariate case. And this is setting us up to talk about regression.” - L04-statlearn-3
“Do you guys know the relationship between the normal distribution and regression? … If you minimize the normal distribution, if you assume your data is normally distributed and you have it the mean parameterized by some model, then that’s equivalent to linear regression. So this multivariate case is a way of understanding how we do regression in multiple variables.” - L04-statlearn-3
For $X \in \mathbb{R}^p$ with mean vector $\mu$ and covariance matrix $\Sigma$:

$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
Reduces to the univariate normal when $p = 1$: “you have to go really, really far out, but you can see how it reduces down.” (L04-statlearn-3)
Returns in other modules
- L04-statlearn-3: first introduction. Cork trees as the running example. Univariate Gaussian generalized to $p$ dimensions: $\mu$ becomes a vector, $\sigma^2$ becomes $\Sigma$, $(x-\mu)^2/\sigma^2$ becomes $(x-\mu)^\top \Sigma^{-1} (x-\mu)$ in the exponent, $\sigma$ in the normalizer becomes $|\Sigma|^{1/2}$. Ran out of time mid-contour-matching.
- L05-linreg-1: finishes the contour-matching exercise from L04. Then makes the mindset shift from the joint distribution of $(X, Y)$ to the conditional $Y \mid X$: “we will look at how things co-vary, but in the sense of how Y varies as a function of X.” This is the bridge to regression. Under Gaussian errors, $\hat\beta \sim N(\beta, \sigma^2 (X^\top X)^{-1})$, a multivariate normal in parameter space.
- L09-classif-3: used as the class-conditional density in LDA and QDA. $X \mid Y = k \sim N(\mu_k, \Sigma)$ for LDA (shared $\Sigma$); uses $N(\mu_k, \Sigma_k)$ for QDA. The whole module-4 generative-classifier story is multivariate normals. Naive Bayes = multivariate normal with diagonal $\Sigma_k$ (predictors conditionally independent given class).
Notation & setup
- $X = (X_1, \dots, X_p)^\top$: a $p$-dimensional random column vector.
- $\mu = E[X]$: the mean vector.
- $\Sigma = \mathrm{Cov}(X)$: the covariance matrix, $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$, with variances on the diagonal and covariances off.
- $|\Sigma|$: determinant of $\Sigma$. $|\Sigma| = 0$ is bad ($\Sigma$ singular): division by zero in the density, “yuck” (L04-statlearn-3).
- $\Sigma^{-1}$: the precision matrix. Plays the role of $1/\sigma^2$ in the univariate case.
Notation: $X \sim N_p(\mu, \Sigma)$ means $X$ has the multivariate normal distribution with mean $\mu$ and covariance $\Sigma$.
Formula(s) to know cold
The density:

$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
Mapping the pieces from the univariate Gaussian:
| Univariate | Multivariate |
|---|---|
| $\mu$ | $\mu$ (vector) |
| $(x-\mu)^2/\sigma^2$ | $(x-\mu)^\top \Sigma^{-1} (x-\mu)$ (Mahalanobis distance) |
| $\sqrt{2\pi}\,\sigma$ in normalizer | $(2\pi)^{p/2}\,\lvert\Sigma\rvert^{1/2}$ |
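A minimal base-R sketch of this formula (the helper name `dmvn_manual` is made up for this note), including a check that it collapses to `dnorm` when $p = 1$:

```r
# Multivariate normal density, written directly from the formula above.
# dmvn_manual is a made-up helper name, not a library function.
dmvn_manual <- function(x, mu, Sigma) {
  p    <- length(mu)
  diff <- x - mu
  quad <- as.numeric(t(diff) %*% solve(Sigma) %*% diff)   # Mahalanobis term in the exponent
  exp(-quad / 2) / sqrt((2 * pi)^p * det(Sigma))          # normalizer uses |Sigma|
}

# p = 1: the formula collapses to the univariate normal density.
dmvn_manual(x = 1.3, mu = 0, Sigma = matrix(4, 1, 1))
dnorm(1.3, mean = 0, sd = 2)                              # same number
```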
Useful properties (listed by the prof)
From L04-statlearn-3:
- Contours are ellipsoids. Level sets are ellipsoids centered at $\mu$, oriented and stretched by $\Sigma$.
- Linear combinations are normal. $AX + b \sim N(A\mu + b,\; A\Sigma A^\top)$ for any matrix $A$ and vector $b$. (See contrasts.)
- Marginals are normal. Any subset of the components is multivariate normal in its own right.
- Conditionals are normal. $Y \mid X = x$ extracted from a joint MVN is multivariate normal; this is what enables linear regression to be exact under joint normality.
- Zero covariance ⇒ independence, under joint normality only. This is special to Gaussians; in general, $\mathrm{Cov}(X_i, X_j) = 0$ does not imply $X_i \perp X_j$.
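A quick simulation check of the linear-combination and marginal claims (assumes the MASS package for `mvrnorm`, as in Exercise 2.4; the numbers are made up):

```r
library(MASS)   # for mvrnorm
set.seed(1)

mu    <- c(1, 2, 3)
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 0.5,
                  0, 0.5, 1), nrow = 3, byrow = TRUE)
X <- mvrnorm(n = 1e5, mu = mu, Sigma = Sigma)

# Linear combination AX + b: empirical mean/cov should match A mu + b and A Sigma A^T.
A <- matrix(c(1, -1, 0,
              0,  2, 1), nrow = 2, byrow = TRUE)
b <- c(0, 5)
Z <- X %*% t(A) + rep(b, each = nrow(X))
colMeans(Z)                 # ~ A mu + b
A %*% mu + b
cov(Z)                      # ~ A Sigma A^T
A %*% Sigma %*% t(A)

# Marginal: each component is itself normal.
qqnorm(X[, 1]); qqline(X[, 1])
```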
Use in OLS sampling
Under Gaussian errors $\varepsilon \sim N(0, \sigma^2 I)$:

$$\hat\beta \sim N\!\left(\beta,\; \sigma^2 (X^\top X)^{-1}\right),$$

the sampling distribution of the OLS estimator is a $(p+1)$-variate normal (one coordinate per coefficient, intercept included). (sampling-distribution-of-beta for the full derivation; covered in module 3.)
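A simulation sketch of that sampling distribution (design, $\beta$, and $\sigma$ are made up for illustration): repeat the OLS fit over fresh Gaussian errors and compare the empirical covariance of $\hat\beta$ to $\sigma^2 (X^\top X)^{-1}$.

```r
set.seed(2)
n <- 50; sigma <- 2
X    <- cbind(1, runif(n, 0, 10))        # fixed design: intercept + one predictor
beta <- c(1, 0.5)

betahats <- t(replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma)   # fresh Gaussian errors each repetition
  drop(solve(t(X) %*% X, t(X) %*% y))      # OLS estimate for this sample
}))

cov(betahats)                 # empirical covariance of beta-hat over repetitions
sigma^2 * solve(t(X) %*% X)   # theoretical covariance sigma^2 (X'X)^{-1}
```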
Use in LDA/QDA discriminants
LDA assumes $X \mid Y = k \sim N(\mu_k, \Sigma)$: pooled $\Sigma$, class-specific means. Apply Bayes, take logs, drop $k$-independent terms:

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k,$$

linear in $x$, because the $x^\top \Sigma^{-1} x$ term is the same for every $k$ and drops out of the $\arg\max$. (L09-classif-3)
QDA assumes class-specific $\Sigma_k$ → the quadratic term doesn’t cancel:

$$\delta_k(x) = -\tfrac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) - \tfrac{1}{2}\log|\Sigma_k| + \log \pi_k,$$

quadratic in $x$, hence “Quadratic Discriminant Analysis.” (L09-classif-3, the prof flagged “where does the quadratic come from in QDA?” as a typical exam question.)
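A sketch of both discriminant scores coded straight from these formulas (class parameters and the helper names `delta_lda` / `delta_qda` are made up):

```r
# Discriminant scores written directly from log(pi_k * N(x; mu_k, Sigma)),
# after dropping the k-independent terms.
delta_lda <- function(x, mu_k, Sigma, pi_k) {
  as.numeric(t(x) %*% solve(Sigma) %*% mu_k
             - 0.5 * t(mu_k) %*% solve(Sigma) %*% mu_k) + log(pi_k)
}
delta_qda <- function(x, mu_k, Sigma_k, pi_k) {
  as.numeric(-0.5 * t(x - mu_k) %*% solve(Sigma_k) %*% (x - mu_k)) -
    0.5 * log(det(Sigma_k)) + log(pi_k)
}

# Two classes, shared Sigma for LDA:
mu1 <- c(0, 0); mu2 <- c(2, 1); Sig <- diag(2)
x <- c(1.5, 0.2)
which.max(c(delta_lda(x, mu1, Sig, 0.5),
            delta_lda(x, mu2, Sig, 0.5)))   # predicted class for x
```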
Insights & mental models
Reading contour plots (CE1 1g, the explicit exam-bait)
Two-dimensional MVN contours are ellipses. Reading them:
| $\Sigma$ pattern | Contour appearance |
|---|---|
| $\sigma_1^2 = \sigma_2^2$, $\sigma_{12} = 0$ | Circle centered at $\mu$, equal variances, no correlation |
| $\sigma_1^2 \neq \sigma_2^2$, $\sigma_{12} = 0$ | Axis-aligned ellipse stretched along the larger-variance axis |
| $\sigma_{12} > 0$ | Ellipse tilted upward-right (positive diagonal) |
| $\sigma_{12} < 0$ | Ellipse tilted upward-left (negative diagonal) |
| $\sigma_1^2 \neq \sigma_2^2$, $\sigma_{12} \neq 0$ | Tilted ellipse with the long axis closer to whichever variance is larger |
L05-linreg-1 worked through these explicitly:
- “Circular ellipse, no diagonal pull → correlation 0, equal variances.”
- “Diagonal pull going up → positive correlation.”
- “Diagonal pull going down → negative correlation.”
- “Stretched only along one axis → unequal variances, no correlation.”
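A plotting sketch of the same four cases, evaluating the bivariate density on a grid with base R (the helper name `plot_mvn_contour` is made up):

```r
# Contours of the bivariate normal density for the four covariance patterns.
plot_mvn_contour <- function(Sigma, main) {
  g    <- seq(-4, 4, length.out = 101)
  Sinv <- solve(Sigma)
  norm <- 2 * pi * sqrt(det(Sigma))
  dens <- outer(g, g, Vectorize(function(x1, x2) {
    v <- c(x1, x2)
    as.numeric(exp(-0.5 * t(v) %*% Sinv %*% v)) / norm
  }))
  contour(g, g, dens, main = main)
}

par(mfrow = c(2, 2))
plot_mvn_contour(matrix(c(1,  0,    0,   1), 2), "equal var, no correlation")
plot_mvn_contour(matrix(c(4,  0,    0,   1), 2), "unequal var, no correlation")
plot_mvn_contour(matrix(c(1,  0.8,  0.8, 1), 2), "positive correlation")
plot_mvn_contour(matrix(c(1, -0.8, -0.8, 1), 2), "negative correlation")
```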
Why MVN is load-bearing for the whole course
“Today we will discuss Y given X. So not the joint distribution of them, but Y given X. So we’re trying to essentially make a model of it… we will look at how things co-vary, but in the sense of how Y varies as a function of X.” - L05-linreg-1
MVN gives a joint model of $(X, Y)$. The conditional $Y \mid X = x$ extracted from a joint MVN is linear in $x$ with constant variance; that is exactly the linear regression model. So all the OLS sampling theory (CIs, t-tests, F-tests, prediction intervals) is exactly correct when joint normality holds and only approximately correct otherwise.
Why MVN is load-bearing for classification
LDA / QDA / Naive Bayes are all generative classifiers: model $p(x \mid Y = k)$ (plus the prior $\pi_k$) instead of $p(Y = k \mid x)$ directly. The choice of $p(x \mid Y = k)$ as multivariate normal is the modeling assumption for module 4’s generative half. Three flavors:
- LDA: $X \mid Y = k \sim N(\mu_k, \Sigma)$, pooled covariance.
- QDA: $X \mid Y = k \sim N(\mu_k, \Sigma_k)$, class-specific covariance.
- Naive Bayes: $X \mid Y = k \sim N(\mu_k, \mathrm{diag}(\sigma_{k1}^2, \dots, \sigma_{kp}^2))$, predictors conditionally independent given class.
The “where does the quadratic come from” question is direct algebra on the MVN density.
Mahalanobis distance
The exponent of the MVN density is $-\tfrac{1}{2}\, d_M^2(x, \mu)$, where

$$d_M^2(x, \mu) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$$

is the Mahalanobis distance (squared): Euclidean distance after rotating and rescaling by $\Sigma^{-1/2}$. It’s the natural distance under joint normality. (Not on the syllabus by name, but the exponent of the density is exactly this.)
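Base R’s `mahalanobis()` computes exactly this squared distance; a quick check against the explicit quadratic form (numbers made up):

```r
x     <- c(2, 3)
mu    <- c(0, 1)
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2)

mahalanobis(x, center = mu, cov = Sigma)              # built-in squared Mahalanobis distance
as.numeric(t(x - mu) %*% solve(Sigma) %*% (x - mu))   # same quadratic form by hand
```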
Conditional MVN, why linear regression works
If $(X, Y)$ is jointly MVN with block structure $\mu = \begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix}$, $\Sigma = \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}$, then

$$Y \mid X = x \;\sim\; N\!\left(\mu_Y + \Sigma_{YX}\Sigma_{XX}^{-1}(x - \mu_X),\;\; \Sigma_{YY} - \Sigma_{YX}\Sigma_{XX}^{-1}\Sigma_{XY}\right),$$

linear in $x$ with constant variance. This is the linear regression model, derived from joint normality. The prof gestured at this in L04-statlearn-3 without writing it out: “the multivariate case is a way of understanding how we do regression in multiple variables.”
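A sketch of the conditional-mean formula in action: compute the slope $\Sigma_{YX}\Sigma_{XX}^{-1}$ from a made-up joint covariance and compare with `lm()` on data simulated from that joint MVN (assumes MASS):

```r
library(MASS)
set.seed(3)

# Joint MVN for (X1, X2, Y): last coordinate plays the role of Y.
mu    <- c(0, 0, 5)
Sigma <- matrix(c(1.0, 0.3, 0.8,
                  0.3, 1.0, 0.4,
                  0.8, 0.4, 2.0), 3, byrow = TRUE)
XY <- mvrnorm(1e5, mu, Sigma)

# Population regression slope from the block formula: Sigma_YX %*% solve(Sigma_XX).
Sigma_XX <- Sigma[1:2, 1:2]
Sigma_YX <- Sigma[3, 1:2]
slope_theory <- Sigma_YX %*% solve(Sigma_XX)

coef(lm(XY[, 3] ~ XY[, 1] + XY[, 2]))   # fitted slopes approach slope_theory
slope_theory
```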
Singular Σ
“If the determinant is zero, then that’s typically bad, because you divide by zero and yuck. And I think that’s the point they’re trying to make.” - L04-statlearn-3
$|\Sigma| = 0$ means the data lies on a lower-dimensional subspace; the density isn’t well-defined on the full $p$-dimensional space. This is the same pathology as collinearity in OLS ($X^\top X$ singular): extreme correlation between predictors. Fix: drop a variable, regularize, or use PCA.
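A small sketch of the pathology, using made-up data where one predictor is an exact linear function of another:

```r
set.seed(4)
x1 <- rnorm(100)
x2 <- 2 * x1 + 3                 # perfectly collinear with x1
X  <- cbind(x1, x2)

S <- cov(X)
det(S)                           # ~ 0: Sigma is singular, the density's 1/|Sigma| blows up
# solve(S)                       # errors: Sigma is (numerically) singular

# Same pathology in OLS: X'X is singular, lm() reports an NA coefficient for x2.
y <- x1 + rnorm(100)
coef(lm(y ~ x1 + x2))
```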
Exam signals
“Today we will discuss Y given X. So not the joint distribution of them, but Y given X.” - L05-linreg-1 (the conditional → regression bridge)
“If you minimize the normal distribution, if you assume your data is normally distributed and you have it the mean parameterized by some model, then that’s equivalent to linear regression. So this multivariate case is a way of understanding how we do regression in multiple variables.” - L04-statlearn-3
“That’s another good exam question, where does the quadratic come from in QDA? Or show that, yeah… it’s an interesting point that simply by making the sigma -dependent, we introduced a new term, and that term is quadratic.” - L09-classif-3
CE1 problem 1g (matching contour plot to covariance matrix) is the direct exam-style application, the prof has explicitly drilled it into a compulsory exercise.
Pitfalls
- Zero covariance ⇒ independence is special to MVN. In general, $\mathrm{Cov}(X_i, X_j) = 0$ does NOT mean $X_i \perp X_j$; it does only under joint normality. Don’t carry this over to other distributions.
- Normality of the marginals ≠ joint normality. Two normal marginals can have a non-Gaussian joint (counter-example: dependent normals concocted to be marginally Gaussian but jointly not). The MVN assumption is on the joint, not just on each component.
- $|\Sigma| = 0$ breaks the density: handle singular cases via dimensionality reduction or regularization.
- Don’t read “the variance is on the diagonal” as a special property: it’s the definition of $\Sigma$. Off-diagonal entries are covariances; rescaling entry $(i, j)$ by $\sigma_i \sigma_j$ gives the correlation matrix.
- Standardization doesn’t make data multivariate normal: z-scoring sets means to 0 and variances to 1, but doesn’t change the joint shape. Skewed data stays skewed after standardization.
- Spectral / eigen-decomposition of Σ is OUT of scope. L04-statlearn-3 verbatim: “we don’t talk about spectral decomposition”, deferred to Linear Statistical Models. You can use the contour-stretching intuition without the full eigen-machinery.
- Pooled covariance is just a convex combination of class-specific covariances: $\hat\Sigma = \sum_{k=1}^{K} \frac{n_k - 1}{n - K}\,\hat\Sigma_k$. Don’t confuse it with the unconditional sample covariance (see the sketch after this list).
- The MVN assumption is a modeling assumption, not a fact about your data, flagged for LDA/QDA in L09-classif-3: “Maybe it’s not a good idea to pretend that the X’s are well modeled by a Gaussian, that’s a good way to break a model.”
- Contour shapes are about $\Sigma$, not $\mu$: the mean vector just shifts the center of the ellipsoid; the shape and orientation are pure $\Sigma$.
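The pooled-covariance sketch promised above, on the built-in iris data (two predictors kept for brevity; weights $(n_k - 1)/(n - K)$ as in the bullet):

```r
X <- iris[, 1:2]                 # two predictors for simplicity
y <- iris$Species
K <- nlevels(y); n <- nrow(X)

# Class-specific sample covariances and their pooled (convex) combination.
Sigma_k <- lapply(levels(y), function(k) cov(X[y == k, ]))
n_k     <- table(y)
Sigma_pooled <- Reduce(`+`, Map(function(S, nk) (nk - 1) / (n - K) * S, Sigma_k, n_k))

Sigma_pooled
cov(X)                           # unconditional covariance: NOT the same thing
```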
Scope vs ISLP
- In scope: the density formula, what each piece means, contour-matching, the role in LDA/QDA discriminants (with the where-does-the-quadratic-come-from derivation), zero-covariance ⇒ independence under normality, the basic properties (linear combos / marginals / conditionals stay normal).
- Look up in ISLP: §4.4.2 (LDA for $p > 1$, with the MVN density); §4.4.3 (QDA, where $\Sigma$ becomes $\Sigma_k$); §4.4.4 (Naive Bayes, where $\Sigma_k$ becomes diagonal).
- Skip in ISLP (book-only, prof excluded): spectral decomposition / eigenanalysis of $\Sigma$ - L04-statlearn-3: “we don’t talk about spectral decomposition”, deferred to Linear Statistical Models. The eigenvalue-as-PC-variance fact comes back in PCA (principal-component-analysis) but the full spectral theory of $\Sigma$ doesn’t.
Exercise instances
- Exercise 2.4: simulate from a multivariate normal with `mvrnorm` under four different covariance settings (uncorrelated equal var, uncorrelated unequal var, positive correlation, negative correlation). The hands-on version of the contour-matching exercise; see the sketch after this list.
- CE1 problem 1g: match a contour plot to a given covariance matrix. Single-choice question, explicit exam-style. → nearly axis-aligned ellipse with the long axis along the higher-variance coordinate (variance 4 vs 1) and a slight upward tilt (positive correlation 0.1).
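A sketch of the Exercise 2.4 setup: sample with `mvrnorm` under the four covariance settings and eyeball the scatterplots (seed and sample size are arbitrary):

```r
library(MASS)
set.seed(5)

settings <- list(
  "equal var, rho = 0"   = matrix(c(1,  0,    0,   1), 2),
  "unequal var, rho = 0" = matrix(c(4,  0,    0,   1), 2),
  "positive correlation" = matrix(c(1,  0.8,  0.8, 1), 2),
  "negative correlation" = matrix(c(1, -0.8, -0.8, 1), 2)
)

par(mfrow = c(2, 2))
for (nm in names(settings)) {
  X <- mvrnorm(n = 500, mu = c(0, 0), Sigma = settings[[nm]])
  plot(X, main = nm, xlab = "x1", ylab = "x2", asp = 1)
}
```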
How it might appear on the exam
- Contour-matching question (CE1 1g style): given a $\Sigma$, pick the right contour plot. Or vice versa. Read the variances off the diagonal (which axis is longer?), read the correlation off the off-diagonal (tilted up or down? how much?).
- “Where does the quadratic come from in QDA?”: algebra on the MVN density: when $\Sigma$ is the same for all classes, the $x^\top \Sigma^{-1} x$ term cancels in the $\arg\max$; when $\Sigma_k$ differs by class, the term survives → quadratic in $x$.
- Derive the LDA discriminant from scratch: start from $p(Y = k \mid x) \propto \pi_k f_k(x)$, take logs, drop $k$-independent terms, end with $\delta_k(x) = x^\top \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^\top \Sigma^{-1}\mu_k + \log\pi_k$. Same template applies to QDA but you keep the quadratic term.
- T/F: “Zero covariance implies independence”: depends on the assumption. Under joint normality, TRUE. In general, FALSE.
- T/F: “If $X_1$ and $X_2$ are each marginally normal, then $(X_1, X_2)$ is multivariate normal”: FALSE. Counter-examples exist.
- “Why is the MVN assumption sometimes a bad idea?”: flagged in L09-classif-3: real predictors aren’t always Gaussian; LDA/QDA suffer if the MVN class-conditional model is far off.
- Connection to OLS sampling: under Gaussian errors, $\hat\beta$ is multivariate normal. From this you derive the sampling distribution $\hat\beta \sim N(\beta, \sigma^2 (X^\top X)^{-1})$ (and CIs and t-tests).
Related
- random-vector-and-covariance: the build-up: $E[X]$, $\mathrm{Cov}(X)$, expectation rules. Module 2 prerequisite.
- contrasts: linear combinations $a^\top X$ (more generally $AX + b$); under MVN, $AX + b$ is also MVN.
- linear-regression: the conditional of MVN gives the linear regression model exactly.
- gaussian-error-assumptions: the MVN of the error vector $\varepsilon \sim N(0, \sigma^2 I)$; the assumption that makes OLS = MLE.
- least-squares-and-mle: OLS minimization is MLE under Gaussian errors. Direct consequence of the MVN density’s exponent.
- sampling-distribution-of-beta: under Gaussian errors, $\hat\beta \sim N(\beta, \sigma^2 (X^\top X)^{-1})$.
- linear-discriminant-analysis: uses MVN class-conditionals with pooled $\Sigma$; discriminants linear in $x$.
- quadratic-discriminant-analysis: uses MVN class-conditionals with class-specific $\Sigma_k$; discriminants quadratic in $x$.
- naive-bayes: uses MVN class-conditionals with diagonal $\Sigma_k$; predictors conditionally independent given class.
- discriminant-score-and-decision-boundary: derived from the MVN-based discriminants.
- principal-component-analysis: eigenvectors of $\Sigma$ give the directions of maximal variance under joint normality.