PCA explained variance and scree plot
The companion question to PCA: how do you decide how many components to keep? Two plots, both built from the eigenvalues: the scree plot (PVE per PC) and the cumulative PVE curve. The recipe: chop where the marginal PVE flattens out, or where the cumulative curve crosses a threshold (90/95/99%). When PCA is the front end of a supervised pipeline, CV on the held-out response beats either heuristic.
Definition (prof’s framing)
“Each subsequent dimension is the direction of maximal variance … you’re going to be explaining less and less variance.” - L14-modelsel-3
The proportion of variance explained (PVE) by PC $m$ is the share of total data variance that PC $m$ accounts for:

$$\mathrm{PVE}_m = \frac{\operatorname{Var}(Z_m)}{\text{total variance}} = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}$$

The scree plot plots $\mathrm{PVE}_m$ against $m$ (decreasing); the cumulative PVE plot plots $\sum_{k=1}^{m} \mathrm{PVE}_k$ against $m$ (saturating toward 1). Both are read off the eigenvalues, and both have $p$ points on the x-axis (one per PC).
Notation & setup
- $\lambda_m = \operatorname{Var}(Z_m)$: variance of PC $m$, equivalently the $m$-th eigenvalue of the sample covariance matrix (after centering / standardization).
- $\sum_{j=1}^{p} \lambda_j = \sum_{j=1}^{p} \operatorname{Var}(X_j)$: total variance is preserved under the orthonormal rotation. After standardization each $\operatorname{Var}(X_j) = 1$, so total variance $= p$ (handy mental check).
- $\mathrm{PVE}_m = \lambda_m / \sum_{j=1}^{p} \lambda_j$; sums to 1 over $m = 1, \dots, p$.
- “Cumulative PVE” $= \sum_{k=1}^{m} \mathrm{PVE}_k$: the share of variance kept if you truncate at $m$ components.
Formula(s) to know cold
PVE for PC $m$ (book formula, ISL 12.10):

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2} = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}$$

Cumulative PVE through PC $M$:

$$\sum_{m=1}^{M} \mathrm{PVE}_m$$

The PVE-as-$R^2$ identity (ISL 12.11): the variance of the data decomposes as the variance kept by the first $M$ PCs plus the MSE of the $M$-dimensional approximation, so

$$\sum_{m=1}^{M} \mathrm{PVE}_m = 1 - \frac{\mathrm{RSS}}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}$$

where $\mathrm{RSS} = \sum_{j=1}^{p} \sum_{i=1}^{n} (x_{ij} - \hat{x}_{ij})^2$ is the squared reconstruction error.
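To make the formulas concrete, a minimal R sketch on USArrests (a hedged reconstruction, not the lecture's exact code); the last two lines check the ISL 12.11 identity at $M = 2$:

```r
# PVE and cumulative PVE from the eigenvalues (squared sdev's of prcomp)
X <- scale(USArrests)                 # center + standardize first
pc <- prcomp(X)
lambda <- pc$sdev^2                   # eigenvalues of the sample covariance
pve <- lambda / sum(lambda)           # ISL 12.10; sum(lambda) == p here
cumsum(pve)                           # cumulative PVE through each M

# Sanity check of the 1 - RSS/TSS identity (ISL 12.11) at M = 2
M <- 2
Xhat <- pc$x[, 1:M] %*% t(pc$rotation[, 1:M])  # rank-M reconstruction
1 - sum((X - Xhat)^2) / sum(X^2)               # equals cumsum(pve)[M]
```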
Insights & mental models
The chop-here recipe. “Here you can see what the first component actually explains … 60-something percent of the variance. And now with two components, you’re almost to 90% … that last bit going from the third to the fourth, that was small. So probably you could chop that away.” - L21-unsupervised-1 You’re looking for the elbow where the next PC stops earning its keep.
Three readings of the same plot (plot code after this list).
- Scree plot (PVE per PC): look for the “elbow”, the point where the curve goes from steep to flat. Components past the elbow are noise-tier.
- Cumulative PVE: pick a threshold (90 / 95 / 99%), find the smallest $M$ that crosses it. Common in chemometrics and high-dim genomics.
- Eyeball: the prof’s own move on USArrests was just to look, “first PC ~62%, two PCs ~87%, third tiny, chop here.”
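Both readings, drawn side by side; this continues the `pve` vector from the sketch above (styling choices are mine, not the slide deck's):

```r
# Scree plot (left) and cumulative-PVE plot (right)
par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE",
     main = "Scree plot")                      # look for the elbow
plot(cumsum(pve), type = "b", ylim = c(0, 1),
     xlab = "Principal component", ylab = "Cumulative PVE",
     main = "Cumulative PVE")
abline(h = 0.90, lty = 2)                      # threshold reading: first point above the line
```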
Two settings, two stopping rules.
- Unsupervised PCA (visualization, exploration): scree elbow / cumulative threshold. Subjective. The book is candid: “this type of visual analysis is inherently ad hoc … there is no well-accepted objective way to decide” (ISL §12.2.4).
- Supervised PCR / NN-feature-extractor: pick $M$ by cross-validating the downstream model. Treat $M$ as a tuning parameter and choose the value that minimizes CV-MSE / CV-misclassification. This is objective and what the prof actually does in PCR (sketch after this list).
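For the supervised route, a hedged sketch with the `pls` package; `df` (predictors plus response `y`) is a hypothetical data frame, and the one-sigma rule is one of several defensible picks:

```r
# CV-based choice of M for PCR: treat M as a tuning parameter
library(pls)
fit <- pcr(y ~ ., data = df, scale = TRUE, validation = "CV")  # df, y are placeholders
validationplot(fit, val.type = "MSEP")   # CV error as a function of M
selectNcomp(fit, method = "onesigma")    # smallest M within one SE of the minimum
```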
Total variance = sum of eigenvalues. Useful sanity check. Under standardization (variance 1 per column), total variance equals $p$, so $\mathrm{PVE}_m = \lambda_m / p$.
Scree plot magnitudes scale with the data. On USArrests ($n = 50$, $p = 4$): PC1 ~62%, PC2 ~25%, two PCs cover 87%. On the NYT stories data ($p \gg n$, very wide & sparse): the first two PCs explain only a small fraction; you need many more to cover 90%. In wide-data regimes the cumulative-PVE curve rises slowly and the elbow is gentle.
The prof’s saturating-curve heuristic. On a regression example, the 90% threshold lands at ~7 PCs. “Remember, with model selection, 7 is a lot better than 13 when you’re fitting a model.” - L14-modelsel-3 On a higher-dimensional example, 95% might give you 75-80 PCs, still a huge reduction.
Exam signals
“Probably you could chop that away.” - L21-unsupervised-1 (on the third-to-fourth USArrests gap)
“Number of PCs needed for 90% of total variance → cumulative sum: 0.54 + 0.30 = 0.84 (not enough), +0.10 → past 0.90, so 3 PCs.” - L27-summary (worked threshold-counting calc; verbatim recipe, see the one-liner after this list)
“This kind of question I would also say is fair game because it tests basic knowledge of PCA that we covered in class, with simple calculations. And again, show your work so that you can get partial credit when the inevitable little mistake happens.” - L27-summary (on the PCA explained-variance / scree question)
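The L27-summary counting recipe as R one-liners; the first three ratios are the quoted exam numbers, the tail values are made up to fill out the vector:

```r
# Threshold counting: smallest M whose cumulative PVE crosses 0.90
pve <- c(0.54, 0.30, 0.10, 0.04, 0.02)  # first three from L27-summary; tail illustrative
cumsum(pve)                             # 0.84 at M = 2, 0.94 at M = 3
which(cumsum(pve) >= 0.90)[1]           # -> 3 PCs
```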
Pitfalls
- PVE is computed after standardization. If you forget to standardize, your eigenvalues are dominated by the largest-unit variable and PC1 will look spuriously dominant. The slide deck demos exactly this on USArrests (`scale = FALSE` vs `scale = TRUE`); see the principal-component-analysis standardize-first warning and the demo after this list.
- Don’t confuse the scree plot with the cumulative plot. Scree = per-PC PVE (decreasing curve, elbow). Cumulative = running sum (increasing, saturating).
- The “elbow” is subjective. Two reasonable readers may differ by ±1 PC. Justify your call. For supervised use, defer to CV instead.
- When all PCs explain similar variance, PCA isn’t doing anything for you. “If the variables were already orthogonal then adding more PCs is just the same as adding another variable.” - L15-modelsel-4 Flat scree → no real low-dim structure; rethink.
- Cumulative PVE and “$R^2$ of the approximation” are the same number (ISL 12.11). Don’t be surprised that the cumulative PVE at $M = p$ equals 1; that’s just “100% of the variance reconstructed when you keep all components.”
- Cumulative PVE is monotone non-decreasing. A drop signals a coding bug.
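The standardization pitfall in a few lines, one plausible reconstruction of the slide-deck demo:

```r
# PVE with and without standardization on USArrests
pve_of <- function(pc) round(pc$sdev^2 / sum(pc$sdev^2), 3)
pve_of(prcomp(USArrests, scale. = FALSE))  # PC1 spuriously dominant (Assault's raw units)
pve_of(prcomp(USArrests, scale. = TRUE))   # PC1 ~ 0.62, PC2 ~ 0.25, as quoted in lecture
```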
Scope vs ISLP
- In scope: the PVE formula and its scree-plot / cumulative reading; the chop-here heuristic; the contrast with supervised CV-based selection of $M$ in PCR.
- Look up in ISLP: §12.2.3 (the PVE derivation, including the $1 - \mathrm{RSS}/\mathrm{TSS}$ identity); §12.2.4 (“Deciding how many principal components to use”, the elbow + the supervised-CV alternative).
- Skip in ISLP: spectral-decomposition theory of why the eigenvalues are the variances. L04-statlearn-3: “we don’t talk about spectral decomposition” (deferred to Linear Statistical Models). Use the fact, don’t derive it.
Exercise instances
- Exercise 6.7: How many PCs for the Credit dataset? Justify. Pure scree-plot / cumulative-PVE reading on a small, low-$p$ dataset. Two acceptable answers: elbow read, or threshold-based count.
- Exercise 10.1: On the NYT stories data, plot PVE and cumulative PVE; decide where to chop. With $p \gg n$ (very wide, sparse word counts), the cumulative curve rises slowly. The headline observation: even two PCs cover only a small share of the variance, but they still separate music from art (the supervised-validation move).
How it might appear on the exam
- Threshold counting (Q3e-style): given five eigenvalue ratios, “how many PCs for 90%?” Sum until you cross.
- Elbow reading: given a scree plot, identify the elbow. Justify the cut-point in one sentence.
- PVE arithmetic: “the first PC explains 62%, the second 25%, what’s the cumulative PVE through PC2?” Trivial sum, but write it out.
- Output interpretation: given a printout from `summary(prcomp(...))` (the standard R output has a “Proportion of Variance” row and a “Cumulative Proportion” row), read off how many PCs are needed for some threshold; see the snippet after this list.
- Method-of-selection question: “would you use a scree plot or cross-validation to choose here?” The answer keys on whether the downstream task is supervised: PCA-for-visualization → scree; PCR → CV.
- T-F traps:
- “Cumulative PVE is monotone non-decreasing.” → True.
- “PVE is invariant to the units of $X$.” → False (depends on standardization).
- “If all PCs have similar PVE, PCA has revealed strong low-dim structure.” → False (the opposite, flat scree means no structure).
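Reading the threshold off the `summary(prcomp(...))` printout programmatically; the `importance` matrix and its row names are standard R, the 0.90 threshold is arbitrary:

```r
# How many PCs for 90%? Read it off the summary object
s <- summary(prcomp(USArrests, scale. = TRUE))
s$importance  # rows: "Standard deviation", "Proportion of Variance", "Cumulative Proportion"
which(s$importance["Cumulative Proportion", ] >= 0.90)[1]
```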
Related
- principal-component-analysis: defines the eigenvalues this atom plots; this atom is the “now what” companion.
- principal-component-regression: supervised use of $M$; CV beats scree for this case.
- dimensionality-reduction: the umbrella; the scree plot is the standard “how much did we lose?” diagnostic.
- cross-validation: the better way to pick $M$ when there is a downstream model.