Module 10: Unsupervised Learning — Book delta

This file reproduces the concrete artifacts (formulas, derivations, templates, definitions) that the prof taught for module 10 and that are not clean lookup-able statements in wiki/book/12-unsupervised.md. The book has Eq 12.17 (k-means objective), Eq 12.18 (pairwise-to-centroid identity), Eq 12.10 (PVE in score-sum form), Table 12.3 (linkage definitions), Algorithms 12.2 and 12.3, the ordering fact, and the qualitative Euclidean-vs-correlation discussion (Fig 12.15). Everything in this delta file is something the prof said, derived, or computed that is additional to that.


1. PCA

1.1 PCA via eigendecomposition of the covariance matrix

The book footnotes that “the principal component directions are given by the ordered sequence of eigenvectors of the matrix $\mathbf{X}^\top\mathbf{X}$, and the variances of the components are the eigenvalues” (footnote 2 to §12.2.1) and otherwise says “the details are outside of the scope of this book.” The prof teaches this as the operational definition he wants the student to use on the exam (“the fact, don’t derive it”). L21

Let $X \in \mathbb{R}^{n \times p}$ be the centred-and-standardised data matrix. Let

$$S = \frac{1}{n-1} X^\top X$$

denote the sample covariance matrix of the columns of $X$. Because $S$ is symmetric positive semi-definite, it admits the eigendecomposition

$$S = W \Lambda W^\top,$$

with $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ a diagonal matrix of eigenvalues in decreasing order $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$, and $W$ an orthonormal matrix whose columns $w_1, \dots, w_p$ are the unit-norm eigenvectors of $S$. (Slide-deck “PCA — General setup”, L21.)

The first principal component loading vector is then

$$\phi_1 = w_1,$$

because for any unit-norm $\phi$ the variance of the scores $X\phi$ equals

$$\operatorname{Var}(X\phi) = \phi^\top S \phi,$$

and the maximum of $\phi^\top S \phi$ on the unit sphere is $\lambda_1$, achieved at the top eigenvector. The $m$-th PC is the eigenvector of the $m$-th largest eigenvalue:

$$\phi_m = w_m, \qquad \operatorname{Var}(X\phi_m) = \lambda_m.$$

“And this is the covariance matrix of your data. … The solution to the problem of finding the directions of maximal variance, the PCA problem, is actually solved by finding the eigenvalues and eigenvectors.” — L21

“Your $\phi_1$ ends up just being the eigenvector corresponding to the largest eigenvalue.” — L21

The full spectral-derivation theory is OUT of scope (“we don’t talk about spectral decomposition” — docs/scope.md). What is IN scope: the identity above as a one-line working tool.
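As a sanity check of the recipe, here is a minimal numpy sketch (the toy data and variable names are mine, not the prof's): standardise, form $S$, eigendecompose, read off loadings and scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # toy data: n=200, p=5

# Centre and standardise each column (z-scores; ddof=1 matches the n-1 convention).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Sample covariance matrix S = Z^T Z / (n - 1).
S = Z.T @ Z / (Z.shape[0] - 1)

# eigh returns eigenvalues in ascending order; flip to decreasing.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, W = eigvals[order], eigvecs[:, order]

phi_1 = W[:, 0]        # first loading vector = top eigenvector
scores = Z @ W         # column m holds the scores z_{im} on PC m
```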

1.2 PVE as a ratio of eigenvalues

Book Eq 12.10 writes the proportion of variance explained as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$

The prof’s working form, used in the Q3e exam template and the slide-deck “PCA — General setup” (L21), is the eigenvalue ratio:

$$\mathrm{PVE}_m = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}, \qquad \mathrm{CPVE}(M) = \frac{\sum_{m=1}^{M} \lambda_m}{\sum_{j=1}^{p} \lambda_j},$$

where $\mathrm{CPVE}(M)$ is the cumulative PVE through the first $M$ PCs (slide notation: “fraction of original variance kept by the first $M$ principal components”). The two forms are equivalent because $\sum_{i=1}^{n} z_{im}^2 = (n-1)\,\lambda_m$ (the normalisation factor cancels in the ratio).

Standardisation simplification. After standardisation each column has variance 1, so the total variance is $\sum_{j=1}^{p} \lambda_j = p$. Then

$$\mathrm{PVE}_m = \frac{\lambda_m}{p}, \qquad \mathrm{CPVE}(M) = \frac{1}{p} \sum_{m=1}^{M} \lambda_m.$$

This is the form Anders should expect on a Q3e-style hand calculation: eigenvalues are handed to him, he sums and divides by $p$. explained-variance-and-scree-plot

1.3 Q3e exam template — PCA hand calculation

Templated by the prof for the 2026 exam in L27. Setup: five standardised variables; the question hands you the eigenvalues $\lambda_1 \ge \dots \ge \lambda_5$, the loading matrix $\Phi = (\phi_1, \dots, \phi_5)$, and one observation $x = (x_1, \dots, x_5)$. The three sub-questions:

(a) Total variance explained by the first $M$ PCs.

After standardisation $\sum_{j=1}^{5} \lambda_j = p = 5$. Plug and divide:

$$\mathrm{CPVE}(M) = \frac{\lambda_1 + \dots + \lambda_M}{5}.$$

(b) Number of PCs needed for 90% (or 95% or 99%) variance. Compute the cumulative sum step-by-step until it crosses the threshold:

$$\frac{\lambda_1}{p},\ \frac{\lambda_1 + \lambda_2}{p},\ \frac{\lambda_1 + \lambda_2 + \lambda_3}{p},\ \dots$$

The prof’s worked numerical example in L27 walks exactly this ladder: the cumulative sum through the first two PCs falls short of 0.90, and the third pushes it past. Answer: 3 PCs. explained-variance-and-scree-plot

(c) Compute the score for one observation on one PC.

$$z_{im} = \phi_m^\top x_i = \sum_{j=1}^{p} \phi_{jm} x_{ij},$$

i.e. plug the observation into the $m$-th loading vector. The book defines this in Eq 12.2 in passing; the prof flags this exact plug-and-compute as a graded exam item, with the instruction:

“Show your work so that you can get partial credit when the inevitable little mistake happens.” — L27

principal-component-analysis §“How it might appear on the exam”
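The whole Q3e template fits in a few lines of numpy; the eigenvalues, loading column, and observation below are invented for illustration (the exam hands you real ones).

```python
import numpy as np

# Hypothetical Q3e inputs (values invented; eigenvalues sum to p = 5).
lam = np.array([2.1, 1.4, 1.0, 0.3, 0.2])
p = lam.size

# (a)/(b): cumulative PVE; first index past the 0.90 threshold.
cpve = np.cumsum(lam) / p                       # [0.42, 0.70, 0.90, 0.96, 1.0]
n_pcs = int(np.argmax(cpve >= 0.90)) + 1        # -> 3 PCs

# (c): score of one observation on PC m is the dot product with phi_m.
phi_m = np.array([0.5, -0.5, 0.5, -0.5, 0.0])   # hypothetical loading column
x = np.array([1.0, 0.2, -0.3, 0.8, -1.1])       # one standardised observation
z_m = phi_m @ x                                  # = -0.15
```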

1.4 The objective in terms of $S$ (variance maximisation as a quadratic form)

The book’s optimisation (12.3) is written in score-sum form:

$$\max_{\phi_{11}, \dots, \phi_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$$

The prof’s equivalent statement (slide “PCA — General setup”, L21) is the quadratic-form version that makes the connection to $S$ explicit:

$$\phi_1 = \arg\max_{\|\phi\| = 1} \phi^\top S \phi.$$

Subsequent PCs add the orthogonality constraint:

$$\phi_m = \arg\max_{\|\phi\| = 1,\ \phi \perp \phi_1, \dots, \phi_{m-1}} \phi^\top S \phi.$$
This is the form that immediately yields the eigendecomposition recipe of §1.1. L21 / principal-component-analysis

1.5 Standardisation z-score formula (load-bearing for module 10)

The book (§12.2.4 and §12.4.2) prescribes “scaling the variables to have standard deviation one” but never writes the formula. The prof writes it explicitly (slide-deck L21, also restated for clustering in L22):

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2.$$

After this transformation each column has mean 0 and sample variance 1; therefore the total variance is $p$. This is the input to the PCA / k-means / hierarchical pipelines as the prof teaches them. standardization / L21

“If you don’t standardize them so that their mean is zero and their variance is one, then if one had a standard deviation of like a million, then that will be your strongest variable.” — L21
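As a one-liner (a sketch; `ddof=1` matches the $n-1$ denominator, see the drift note in §4):

```python
import numpy as np

def standardise(X):
    # z-score each column: mean 0, sample variance 1 (ddof=1 = n-1 denominator).
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```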


2. K-means clustering

The book gives Eq 12.17 (objective), Eq 12.18 (the pairwise-to-centroid identity), and Algorithm 12.2. The book also states monotonicity in two sentences. The items below are the prof’s additions.

2.1 Structured monotonicity proof (Exercise 10.2 template)

The book says: “in Step 2(a) the cluster means for each feature are the constants that minimise the sum-of-squared deviations, and in Step 2(b), reallocating the observations can only improve (12.18).” The prof’s structured proof, the one graded for Exercise 10.2, decomposes this into the explicit steps below. L22 / k-means-clustering

Setup. The k-means objective is

$$W(C) = \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$

Step 1 — Rewrite via identity 12.18. Apply

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \qquad \bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij},$$

to each cluster. The objective becomes (up to a factor of 2)

$$\widetilde{W}(C) = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2.$$

This is a sum of squared deviations from cluster means.

Step 2 — Centroid step decreases $W$. Fix the partition and vary the per-cluster reference points $m_k$:

$$\bar{x}_k = \arg\min_{m_k} \sum_{i \in C_k} \| x_i - m_k \|^2.$$

(One-liner derivation: differentiate in $m_k$, set to zero, get $m_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$.) So step 2(a), which sets each $m_k$ to the cluster mean, weakly decreases every cluster’s contribution and therefore $W$.

Step 3 — Assignment step decreases $W$. Fix the centroids and reassign each observation to its nearest centroid:

$$C(i) \leftarrow \arg\min_{k} \| x_i - \bar{x}_k \|^2.$$

Each observation’s contribution to $W$ is now individually minimised over $k$, so $W$ weakly decreases.

Step 4 — Convergence. By Steps 2–3, the sequence of $W$ values is nonincreasing and bounded below, and there are only finitely many partitions of $n$ objects into $K$ groups. Therefore the algorithm must stabilise in finitely many iterations — and the limiting partition is a local (not necessarily global) optimum.

“It’s not clear that this will lead to a fixed point just from looking at it, but it does.” — L22

k-means-clustering §“Why monotone (Exercise 10.2)”
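A minimal Lloyd's-algorithm sketch that prints $W$ each iteration, so the monotone decrease of Steps 2–3 is visible (toy data and names are mine; assumes no cluster empties out along the way):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))              # random initial partition
    for _ in range(iters):
        # Step 2(a): centroids = per-cluster means (the minimisers of Step 2).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each point to its nearest centroid (Step 3).
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        print("W =", sq_dists[np.arange(len(X)), new_labels].sum())  # nonincreasing
        if np.array_equal(new_labels, labels):          # Step 4: fixed point
            break
        labels = new_labels
    return labels, centroids

labels, centroids = kmeans(np.random.default_rng(1).normal(size=(100, 2)), K=3)
```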

2.2 Ensemble fix for local-optima (the prof’s “funny way”)

A workaround the prof named explicitly in L22, not in the book. Run the k-means algorithm $B$ times from independent random initialisations, producing assignments $C^{(1)}, \dots, C^{(B)}$. Define a new dissimilarity between observations $i$ and $i'$ by the co-clustering frequency:

$$d(i, i') = 1 - \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{ C^{(b)}(i) = C^{(b)}(i') \}.$$

Then re-cluster the data using $d$ as the input dissimilarity (e.g. through hierarchical clustering, or another round of k-means with a kernel). The intuition is that pairs that consistently land together across initialisations are “really” similar; pairs that land together only occasionally are at the boundary.

“Actually that works quite well but it’s kind of a funny way of doing things.” — L22

This is an ensembling idea analogous to bagging-the-clusterer, and the prof connects it to the boosting / random-forest theme from modules 8–9. k-means-clustering
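A sketch of the idea (function name and parameter values are mine; assumes scikit-learn and scipy are available): run `KMeans` from $B$ random starts, accumulate co-clustering frequencies, then feed $1-\text{frequency}$ to hierarchical clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def coclustering_dissimilarity(X, K=3, B=50):
    """d(i, i') = 1 - fraction of the B runs in which i and i' share a cluster."""
    n = len(X)
    together = np.zeros((n, n))
    for b in range(B):
        lab = KMeans(n_clusters=K, init="random", n_init=1,
                     random_state=b).fit_predict(X)
        together += lab[:, None] == lab[None, :]
    return 1.0 - together / B

X = np.random.default_rng(0).normal(size=(60, 2))
D = coclustering_dissimilarity(X)
Z = linkage(squareform(D), method="average")   # re-cluster on the ensemble d
```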

2.3 Brute-force partition count

The book mentions “almost $K^n$ ways to partition $n$ observations into $K$ clusters.” The prof’s concrete instantiation, useful for an exam justification of why we need the iterative algorithm: up to relabelling of the clusters the count is roughly

$$\frac{K^n}{K!},$$

which is astronomically large already at modest $n$ and $K$ (L22). Use this as a one-line “why we don’t brute-force” answer.
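For a feel of the scale, a quick computation (the values $n=25$, $K=4$ are my illustration, not the prof's): the exact count of partitions into $K$ non-empty groups is the Stirling number of the second kind, which $K^n/K!$ approximates.

```python
from math import comb, factorial

n, K = 25, 4   # hypothetical sizes for illustration

# Exact: Stirling number of the second kind S(n, K).
stirling = sum((-1) ** k * comb(K, k) * (K - k) ** n
               for k in range(K + 1)) // factorial(K)
print(stirling)                 # 46771289738810  (about 4.7e13 partitions)
print(K ** n // factorial(K))   # 46912496118442  (the K^n / K! approximation)
```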

2.4 Trivial degenerate case

Not stated in the book. If $K = n$ then every observation is its own cluster, $|C_k| = 1$ for every $k$, and the objective evaluates to

$$W(C) = \sum_{k=1}^{n} \frac{1}{1} \sum_{i, i' \in C_k} \| x_i - x_{i'} \|^2 = 0,$$

because the only within-cluster pairs are $i = i'$.
This is the global minimum, but it is trivial: no structure has been learnt. L22 / k-means-clustering §“Insights & mental models”


3. Hierarchical clustering

The book gives Algorithm 12.3, Table 12.3 (linkage formulas), the reordering fact, and the qualitative gender-vs-nationality non-nested example. The book does not walk through a numerical worked example with a complete-linkage hand-computation; the prof does, and grades it with a $-1$-per-mistake rubric. L22 / L27 / hierarchical-clustering

3.1 Linkage formulas (notation cleaned)

The book’s Table 12.3 gives definitions in prose. Formally, with $A$ and $B$ two clusters and $d(\cdot, \cdot)$ the point-to-point dissimilarity:

  • Complete: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
  • Single: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
  • Average: $d(A, B) = \displaystyle\frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
  • Centroid: $d(A, B) = d(\bar{a}, \bar{b})$, the dissimilarity between the two cluster centroids

The prof also name-checks median linkage (in scope only as “exists”). Ward linkage is out of scope per docs/scope.md. The average-linkage definition is the unweighted mean (each pair contributes equally); not the weighted variant some packages use.

“Average and complete tend to yield more balanced clusters.” — L22

3.2 Hand-computation recipe (the $-1$-per-mistake exam question)

Templated by the prof in L27 as a graded exam item. From a given dissimilarity matrix $D$ and a chosen linkage:

Recipe

  1. Find the smallest off-diagonal entry $d_{ij}$. Fuse $i$ and $j$ at height $d_{ij}$.
  2. Update the matrix. Replace rows/columns $i$ and $j$ with a single row/column for the new cluster $\{i, j\}$. The new entries against any remaining cluster $\ell$ are computed by the linkage rule:
    • Complete: $d(\{i, j\}, \ell) = \max(d_{i\ell}, d_{j\ell})$.
    • Single: $d(\{i, j\}, \ell) = \min(d_{i\ell}, d_{j\ell})$.
    • Average: $d(\{i, j\}, \ell) = \tfrac{1}{2}(d_{i\ell} + d_{j\ell})$ when fusing singletons; in general, average over all cross-cluster pairs.
  3. Repeat on the smaller matrix: find next-smallest entry, fuse, recompute.
  4. Stop when one cluster remains.
  5. Draw the dendrogram with x-axis = arbitrary leaf order and y-axis = fusion heights. Each fusion is a horizontal bar at the recorded height linking its children.
  6. Cut at the requested height (or at the height that yields $K$ clusters) and report the cluster membership.

Grading rubric

“Negative one point for each mistake … I won’t give negative points, don’t worry.” — L27

Mistakes propagate: a wrong first fusion ruins the whole tree. Always recompute the matrix in writing after every step.
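The recipe is mechanical enough to express directly in code; this is my sketch (not the prof's), handling complete and single linkage on a square matrix:

```python
import numpy as np

def agglomerate(D, linkage="complete"):
    """Run the recipe on a square dissimilarity matrix D (0-based labels).

    Returns the fusion list [(members_a, members_b, height), ...].
    Naive O(n^3) -- fine for exam-sized matrices.
    """
    rule = {"complete": max, "single": min}[linkage]
    clusters = [frozenset([i]) for i in range(len(D))]
    dist = {(a, b): D[i][j]
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j}
    merges = []
    while len(clusters) > 1:
        # Step 1: smallest off-diagonal entry; fuse at that height.
        (a, b), h = min(dist.items(), key=lambda kv: kv[1])
        merged = a | b
        merges.append((sorted(a), sorted(b), h))
        # Step 2: drop rows/columns a and b, add one for the fused cluster,
        # recomputing its entries by the linkage rule (max or min).
        old, clusters = dist, [c for c in clusters if c not in (a, b)]
        dist = {p: v for p, v in old.items() if a not in p and b not in p}
        for c in clusters:
            d_ac = old.get((a, c), old.get((c, a)))
            d_bc = old.get((b, c), old.get((c, b)))
            dist[(merged, c)] = rule(d_ac, d_bc)
        clusters.append(merged)
    return merges
```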

3.3 Worked example — complete linkage on the prof’s 4×4 matrix

This is the prof’s slide-deck worked example for the L22 hand-computation drill (also Exercise 2 of ISLP §12.6, whose solution the book does not provide). L22 slides “Exercise 2 from the book” / hierarchical-clustering §“Hand-computation procedure”

Input matrix (ISLP §12.6 Exercise 2):

$$D = \begin{pmatrix} 0 & 0.3 & 0.4 & 0.7 \\ 0.3 & 0 & 0.5 & 0.8 \\ 0.4 & 0.5 & 0 & 0.45 \\ 0.7 & 0.8 & 0.45 & 0 \end{pmatrix}$$

Step 1. Smallest off-diagonal: $d_{12} = 0.3$. Fuse $\{1, 2\}$ at height $0.3$.

Step 2. Update the matrix under complete linkage:

$$d(\{1,2\}, 3) = \max(0.4, 0.5) = 0.5, \qquad d(\{1,2\}, 4) = \max(0.7, 0.8) = 0.8.$$

New matrix (rows: $\{1,2\}, 3, 4$):

$$\begin{pmatrix} 0 & 0.5 & 0.8 \\ 0.5 & 0 & 0.45 \\ 0.8 & 0.45 & 0 \end{pmatrix}$$

Step 3. Smallest entry: $d_{34} = 0.45$. Fuse $\{3, 4\}$ at height $0.45$.

Step 4. Update:

$$d(\{1,2\}, \{3,4\}) = \max(0.5, 0.8) = 0.8.$$

Step 5. Final fusion at height $0.8$, producing the single cluster $\{1, 2, 3, 4\}$.

Dendrogram (heights only — the x-ordering is arbitrary):

  • Bottom: leaves $1, 2, 3, 4$.
  • Bar at $0.3$ joining 1 and 2.
  • Bar at $0.45$ joining 3 and 4.
  • Bar at $0.8$ joining $\{1,2\}$ and $\{3,4\}$.

Cuts. A cut at any $h \in (0.45, 0.8)$ yields two clusters: $\{1,2\}$ and $\{3,4\}$. A cut at $h \in (0.3, 0.45)$ yields three clusters: $\{1,2\}$, $\{3\}$, $\{4\}$. A cut at $h < 0.3$ yields four singletons.

3.4 Worked example — single linkage on the same matrix

Same input matrix; swap $\max$ for $\min$. L22 slides Exercise 2 (b)

Step 1. Smallest entry $d_{12} = 0.3$. Fuse $\{1, 2\}$ at $0.3$.

Step 2. Update under single linkage:

$$d(\{1,2\}, 3) = \min(0.4, 0.5) = 0.4 \quad \text{and} \quad d(\{1,2\}, 4) = \min(0.7, 0.8) = 0.7.$$

Step 3. Smallest entry: $d(\{1,2\}, 3) = 0.4$. Fuse $\{1, 2, 3\}$ at $0.4$.

Step 4. Update:

$$d(\{1,2,3\}, 4) = \min(0.7, 0.45) = 0.45.$$

Step 5. Final fusion at height $0.45$.

Cut at $h \in (0.4, 0.45)$. Below that height the partition is $\{1,2,3\}$ and $\{4\}$, different from complete linkage, which gave $\{1,2\}$ and $\{3,4\}$. This is the canonical “linkage choice changes the answer” demonstration the prof uses.
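Both runs can be checked mechanically; a minimal scipy sketch (assuming scipy is installed; `squareform` converts the square matrix to the condensed vector `linkage` expects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0.0, 0.3, 0.4, 0.7],
              [0.3, 0.0, 0.5, 0.8],
              [0.4, 0.5, 0.0, 0.45],
              [0.7, 0.8, 0.45, 0.0]])
d = squareform(D)                                # condensed form

Z_complete = linkage(d, method="complete")       # fuses at 0.3, 0.45, 0.8
Z_single = linkage(d, method="single")           # fuses at 0.3, 0.4, 0.45

# Two-cluster cuts reproduce the hand computations:
print(fcluster(Z_complete, t=2, criterion="maxclust"))  # {1,2} vs {3,4}
print(fcluster(Z_single, t=2, criterion="maxclust"))    # {1,2,3} vs {4}
```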

3.5 Correlation-based distance, explicit formula

The book (§12.4.2, Fig 12.15) describes correlation-based distance qualitatively as “two observations are similar if their features are highly correlated”, but does not give the formula or its range. The prof writes it out explicitly in L22:

$$d(x_i, x_{i'}) = 1 - r_{ii'},$$

where $r_{ii'}$ is the Pearson correlation between the two observation profiles (each profile is a length-$p$ vector across features). The range:

$$r_{ii'} \in [-1, 1] \quad \Longrightarrow \quad d(x_i, x_{i'}) \in [0, 2],$$

with $d = 0$ at perfect positive correlation and $d = 2$ at perfect anti-correlation. The variant $1 - |r_{ii'}|$ is used when the sign of the correlation is not meaningful (range $[0, 1]$).

“Often people will say all right let’s minimize 1 minus the Pearson correlation. All right, that way, if they’re perfectly correlated, you have a dissimilarity of zero.” — L22

“Note: Correlation is actually a similarity measure, not a distance measure.” — slide-deck L21/L22

This is a similarity-to-distance conversion, not a proper metric: $1 - r$ does not satisfy the triangle inequality. For clustering this does not matter — only the ordering of pairs is needed. distance-metrics
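In numpy the whole observation-by-observation dissimilarity matrix is one call (a sketch; `np.corrcoef` treats rows as profiles):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(10, 8))  # 10 observations, 8 features
D = 1.0 - np.corrcoef(X)                           # entries in [0, 2]

# Scaling an observation preserves its profile shape, so its distance stays 0:
assert np.isclose(1.0 - np.corrcoef(X[0], 2 * X[0] + 3)[0, 1], 0.0)
```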

3.6 Squared Euclidean vs Euclidean inside k-means (ranking equivalence)

Not explicit in the book — the book uses squared Euclidean throughout §12.4.1 without commenting on why the square root is dropped. The prof’s justification, useful as a one-line exam answer:

“There’s no reason to take the square root, it’s just an extra step that won’t get you anything … it won’t change the goal or won’t change who wins or how they win.” — L22

Formally: $t \mapsto \sqrt{t}$ is strictly monotone on $[0, \infty)$, so

$$\| x_i - \bar{x}_k \|^2 \le \| x_i - \bar{x}_{k'} \|^2 \iff \| x_i - \bar{x}_k \| \le \| x_i - \bar{x}_{k'} \|;$$

every nearest-centroid comparison, and hence every decision the algorithm makes, comes out the same with or without the square root. The squared form is preferred because the centroid-as-minimizer property (§2.1, Step 2) holds in closed form for squared distance and not for plain Euclidean distance — for plain Euclidean, the cluster-wise minimizer is the geometric median (a much harder object). distance-metrics / k-means-clustering


4. Notation / terminology drift

A few small differences between the prof’s notation and the book’s that an exam reader should keep straight.

  • Covariance normalisation. The prof writes $S = \frac{1}{n-1} X^\top X$ (slide L21). The book uses $\frac{1}{n}$ in the variance form of Eq 12.10 (consistent with population variance). The eigenvectors are identical; only the eigenvalues scale by the constant factor $\frac{n}{n-1}$, and ratios (PVE) are unaffected.
  • Eigendecomposition shorthand. The slide writes $S = W \Lambda W^\top$. Because $S$ is symmetric, $W$ is orthonormal and $W^{-1} = W^\top$. Some packages return $W \Lambda W^\top$, others $W \Lambda W^{-1}$; the eigenvectors are the same up to sign.
  • “Dissimilarity” vs “distance”. The prof uses these interchangeably; the book technically distinguishes (a “dissimilarity” need not satisfy the triangle inequality). For module 10 they are the same object.
  • Average linkage. Both the prof and the book mean the unweighted mean over pairs (each cross-cluster pair contributes once with weight $\frac{1}{|A|\,|B|}$). Some R packages and the exam_analysis cheat-sheet phrase it as a “weighted mean” — that is loose language for the same formula. UPGMA = unweighted average linkage; do not confuse with WPGMA (weighted, not in scope).
  • PCA loading sign. The prof and the book both note that loadings are unique up to a sign flip. If two software packages disagree on the sign of a loading column, that is not an error.
  • Sample variance denominator inside standardisation. The prof writes $s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ (denominator $n-1$). Many packages use $\frac{1}{n}$. After dividing by $s_j$, the resulting columns have sample variance 1 vs population variance 1 — the difference is cosmetic for everything downstream.
  • $Z_m$ vs $z_{im}$. The book uses uppercase $Z_m$ for the score variable, lowercase $z_{im}$ for the score of observation $i$ on component $m$. The prof’s slide-deck uses both interchangeably; the wiki keeps the book convention.
  • The letter $K$. In hierarchical clustering, $K$ is the number of clusters obtained by cutting the dendrogram at some height. In k-means, $K$ is the pre-specified target. Same symbol, different roles.