Module 10: Unsupervised Learning — Book delta

This file reproduces the concrete artifacts (formulas, derivations, templates, definitions) that the prof taught for module 10 and that are not clean lookup-able statements in wiki/book/12-unsupervised.md. The book has Eq 12.17 (k-means objective), Eq 12.18 (pairwise-to-centroid identity), Eq 12.10 (PVE in score-sum form), Table 12.3 (linkage definitions), Algorithms 12.2 and 12.3, the ordering fact, and the qualitative Euclidean-vs-correlation discussion (Fig 12.15). Everything in this delta file is something the prof said, derived, or computed that is additional to that.


1. PCA

1.1 PCA via eigendecomposition of the covariance matrix

The book footnotes that “the principal component directions are given by the ordered sequence of eigenvectors of the matrix $\mathbf{X}^\top\mathbf{X}$, and the variances of the components are the eigenvalues” (footnote 2 to §12.2.1) and otherwise says “the details are outside of the scope of this book.” The prof teaches this as the operational definition he wants the student to use on the exam (“the fact, don’t derive it”). L21

Let $X \in \mathbb{R}^{n \times p}$ be the centred-and-standardised data matrix. Let

$$S = \frac{1}{n-1} X^\top X$$

denote the sample covariance matrix of the columns of $X$. Because $S$ is symmetric positive semi-definite, it admits the eigendecomposition

$$S = W \Lambda W^\top,$$

with $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ a diagonal matrix of eigenvalues in decreasing order $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$, and $W$ an orthonormal matrix whose columns $w_1, \dots, w_p$ are the unit-norm eigenvectors of $S$. (Slide-deck “PCA — General setup”, L21.)

The first principal component loading vector is then

$$\phi_1 = w_1,$$

because for any unit-norm $\phi$ the variance of the scores $X\phi$ equals

$$\operatorname{Var}(X\phi) = \phi^\top S \phi,$$

and the maximum of $\phi^\top S \phi$ on the unit sphere is $\lambda_1$, achieved at the top eigenvector. The $m$-th PC is the eigenvector of the $m$-th largest eigenvalue:

$$\phi_m = w_m, \qquad \operatorname{Var}(X\phi_m) = \lambda_m.$$

“And this is the covariance matrix of your data. … The solution to the problem of finding the directions of maximal variance, the PCA problem, is actually solved by finding the eigenvalues and eigenvectors.” — L21

“Your $\phi_1$ ends up just being the eigenvector corresponding to the largest eigenvalue.” — L21

The full spectral-derivation theory is OUT of scope (“we don’t talk about spectral decomposition” — docs/scope.md). What is IN scope: the identity above as a one-line working tool.
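As a sanity check of the recipe, here is a minimal numpy sketch (the toy data and variable names are mine, not the prof's): standardise, form $S$, eigendecompose, read off loadings and scores.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # toy data: n=200, p=5

# Centre and standardise each column (z-scores; ddof=1 matches the n-1 convention).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Sample covariance matrix S = Z^T Z / (n - 1).
S = Z.T @ Z / (Z.shape[0] - 1)

# eigh returns eigenvalues in ascending order; flip to decreasing.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, W = eigvals[order], eigvecs[:, order]

phi_1 = W[:, 0]        # first loading vector = top eigenvector
scores = Z @ W         # column m holds the scores z_{im} on PC m
```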

1.2 PVE as a ratio of eigenvalues

Book Eq 12.10 writes the proportion of variance explained as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2}.$$

The prof’s working form, used in the Q3e exam template and the slide-deck “PCA — General setup” (L21), is the eigenvalue ratio:

$$\mathrm{PVE}_m = \frac{\lambda_m}{\sum_{j=1}^{p} \lambda_j}, \qquad \mathrm{CPVE}(M) = \frac{\sum_{m=1}^{M} \lambda_m}{\sum_{j=1}^{p} \lambda_j},$$

where $\mathrm{CPVE}(M)$ is the cumulative PVE through the first $M$ PCs (slide notation: “fraction of original variance kept by the first $M$ principal components”). The two forms are equivalent because $\sum_{i=1}^{n} z_{im}^2 = (n-1)\,\lambda_m$ (the normalisation factor cancels in the ratio).

Standardisation simplification. After standardisation each column has variance 1, so the total variance is $\sum_{j=1}^{p} \lambda_j = p$. Then

$$\mathrm{PVE}_m = \frac{\lambda_m}{p}, \qquad \mathrm{CPVE}(M) = \frac{1}{p} \sum_{m=1}^{M} \lambda_m.$$

This is the form Anders should expect on a Q3e-style hand calculation: eigenvalues are handed to him, he sums and divides by $p$. explained-variance-and-scree-plot

1.3 Q3e exam template — PCA hand calculation

Templated by the prof for the 2026 exam in L27. Setup: five standardised variables; the question hands you the eigenvalues $\lambda_1 \ge \dots \ge \lambda_5$, the loading matrix $\Phi = (\phi_1, \dots, \phi_5)$, and one observation $x = (x_1, \dots, x_5)$. The three sub-questions:

(a) Total variance explained by the first $M$ PCs.

After standardisation $\sum_{j=1}^{5} \lambda_j = p = 5$. Plug and divide:

$$\mathrm{CPVE}(M) = \frac{\lambda_1 + \dots + \lambda_M}{5}.$$

(b) Number of PCs needed for 90% (or 95% or 99%) variance. Compute the cumulative sum step-by-step until it crosses the threshold:

$$\frac{\lambda_1}{p},\ \frac{\lambda_1 + \lambda_2}{p},\ \frac{\lambda_1 + \lambda_2 + \lambda_3}{p},\ \dots$$

The prof’s worked numerical example in L27 walks exactly this ladder: the cumulative sum through the first two PCs falls short of 0.90, and the third pushes it past. Answer: 3 PCs. explained-variance-and-scree-plot

(c) Compute the score for one observation on one PC.

$$z_{im} = \phi_m^\top x_i = \sum_{j=1}^{p} \phi_{jm} x_{ij},$$

i.e. plug the observation into the $m$-th loading vector. The book defines this in Eq 12.2 in passing; the prof flags this exact plug-and-compute as a graded exam item, with the instruction:

“Show your work so that you can get partial credit when the inevitable little mistake happens.” — L27

principal-component-analysis §“How it might appear on the exam”
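The whole Q3e template fits in a few lines of numpy; the eigenvalues, loading column, and observation below are invented for illustration (the exam hands you real ones).

```python
import numpy as np

# Hypothetical Q3e inputs (values invented; eigenvalues sum to p = 5).
lam = np.array([2.1, 1.4, 1.0, 0.3, 0.2])
p = lam.size

# (a)/(b): cumulative PVE; first index past the 0.90 threshold.
cpve = np.cumsum(lam) / p                       # [0.42, 0.70, 0.90, 0.96, 1.0]
n_pcs = int(np.argmax(cpve >= 0.90)) + 1        # -> 3 PCs

# (c): score of one observation on PC m is the dot product with phi_m.
phi_m = np.array([0.5, -0.5, 0.5, -0.5, 0.0])   # hypothetical loading column
x = np.array([1.0, 0.2, -0.3, 0.8, -1.1])       # one standardised observation
z_m = phi_m @ x                                  # = -0.15
```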

1.4 The objective in terms of $S$ (variance maximisation as a quadratic form)

The book’s optimisation (12.3) is written in score-sum form:

$$\max_{\phi_{11}, \dots, \phi_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^{p} \phi_{j1}^2 = 1.$$

The prof’s equivalent statement (slide “PCA — General setup”, L21) is the quadratic-form version that makes the connection to $S$ explicit:

$$\phi_1 = \arg\max_{\|\phi\| = 1} \phi^\top S \phi.$$

Subsequent PCs add the orthogonality constraint:

$$\phi_m = \arg\max_{\|\phi\| = 1,\ \phi \perp \phi_1, \dots, \phi_{m-1}} \phi^\top S \phi.$$
This is the form that immediately yields the eigendecomposition recipe of §1.1. L21 / principal-component-analysis

1.5 Standardisation z-score formula (load-bearing for module 10)

The book (§12.2.4 and §12.4.2) prescribes “scaling the variables to have standard deviation one” but never writes the formula. The prof writes it explicitly (slide-deck L21, also restated for clustering in L22):

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2.$$

After this transformation each column has mean 0 and sample variance 1; therefore the total variance is $p$. This is the input to the PCA / k-means / hierarchical pipelines as the prof teaches them. standardization / L21

“If you don’t standardize them so that their mean is zero and their variance is one, then if one had a standard deviation of like a million, then that will be your strongest variable.” — L21
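As a one-liner (a sketch; `ddof=1` matches the $n-1$ denominator, see the drift note in §4):

```python
import numpy as np

def standardise(X):
    # z-score each column: mean 0, sample variance 1 (ddof=1 = n-1 denominator).
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```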


2. K-means clustering

The book gives Eq 12.17 (objective), Eq 12.18 (the pairwise-to-centroid identity), and Algorithm 12.2. The book also states monotonicity in two sentences. The items below are the prof’s additions.

2.1 Structured monotonicity proof (Exercise 10.2 template)

The book says: “in Step 2(a) the cluster means for each feature are the constants that minimise the sum-of-squared deviations, and in Step 2(b), reallocating the observations can only improve (12.18).” The prof’s structured proof, the one graded for Exercise 10.2, decomposes this into the explicit steps below. L22 / k-means-clustering

Setup. The k-means objective is

$$W(C) = \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$

Step 1 — Rewrite via identity 12.18. Apply

$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \qquad \bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij},$$

to each cluster. The objective becomes (up to a factor of 2)

$$\widetilde{W}(C) = \sum_{k=1}^{K} \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2.$$

This is a sum of squared deviations from cluster means.

Step 2 — Centroid step decreases $W$. Fix the partition and vary the per-cluster reference points $m_k$:

$$\bar{x}_k = \arg\min_{m_k} \sum_{i \in C_k} \| x_i - m_k \|^2.$$

(One-liner derivation: differentiate in $m_k$, set to zero, get $m_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$.) So step 2(a), which sets each $m_k$ to the cluster mean, weakly decreases every cluster’s contribution and therefore $W$.

Step 3 — Assignment step decreases $W$. Fix the centroids and reassign each observation to its nearest centroid:

$$C(i) \leftarrow \arg\min_{k} \| x_i - \bar{x}_k \|^2.$$

Each observation’s contribution to $W$ is now individually minimised over $k$, so $W$ weakly decreases.

Step 4 — Convergence. By Steps 2–3, the sequence of $W$ values is nonincreasing and bounded below, and there are only finitely many partitions of $n$ objects into $K$ groups. Therefore the algorithm must stabilise in finitely many iterations — and the limiting partition is a local (not necessarily global) optimum.

“It’s not clear that this will lead to a fixed point just from looking at it, but it does.” — L22

k-means-clustering §“Why monotone (Exercise 10.2)”
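A minimal Lloyd's-algorithm sketch that prints $W$ each iteration, so the monotone decrease of Steps 2–3 is visible (toy data and names are mine; assumes no cluster empties out along the way):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))              # random initial partition
    for _ in range(iters):
        # Step 2(a): centroids = per-cluster means (the minimisers of Step 2).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each point to its nearest centroid (Step 3).
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        print("W =", sq_dists[np.arange(len(X)), new_labels].sum())  # nonincreasing
        if np.array_equal(new_labels, labels):          # Step 4: fixed point
            break
        labels = new_labels
    return labels, centroids

labels, centroids = kmeans(np.random.default_rng(1).normal(size=(100, 2)), K=3)
```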

2.2 Ensemble fix for local-optima (the prof’s “funny way”)

A workaround the prof named explicitly in L22, not in the book. Run the k-means algorithm $B$ times from independent random initialisations, producing assignments $C^{(1)}, \dots, C^{(B)}$. Define a new dissimilarity between observations $i$ and $i'$ by the co-clustering frequency:

$$d(i, i') = 1 - \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{ C^{(b)}(i) = C^{(b)}(i') \}.$$

Then re-cluster the data using $d$ as the input dissimilarity (e.g. through hierarchical clustering, or another round of k-means with a kernel). The intuition is that pairs that consistently land together across initialisations are “really” similar; pairs that land together only occasionally are at the boundary.

“Actually that works quite well but it’s kind of a funny way of doing things.” — L22

This is an ensembling idea analogous to bagging-the-clusterer, and the prof connects it to the boosting / random-forest theme from modules 8–9. k-means-clustering
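A sketch of the idea (function name and parameter values are mine; assumes scikit-learn and scipy are available): run `KMeans` from $B$ random starts, accumulate co-clustering frequencies, then feed $1-\text{frequency}$ to hierarchical clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def coclustering_dissimilarity(X, K=3, B=50):
    """d(i, i') = 1 - fraction of the B runs in which i and i' share a cluster."""
    n = len(X)
    together = np.zeros((n, n))
    for b in range(B):
        lab = KMeans(n_clusters=K, init="random", n_init=1,
                     random_state=b).fit_predict(X)
        together += lab[:, None] == lab[None, :]
    return 1.0 - together / B

X = np.random.default_rng(0).normal(size=(60, 2))
D = coclustering_dissimilarity(X)
Z = linkage(squareform(D), method="average")   # re-cluster on the ensemble d
```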

2.3 Brute-force partition count

The book mentions “almost $K^n$ ways to partition $n$ observations into $K$ clusters.” The prof’s concrete instantiation, useful for an exam justification of why we need the iterative algorithm: up to relabelling of the clusters the count is roughly

$$\frac{K^n}{K!},$$

which is astronomically large already at modest $n$ and $K$ (L22). Use this as a one-line “why we don’t brute-force” answer.
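For a feel of the scale, a quick computation (the values $n=25$, $K=4$ are my illustration, not the prof's): the exact count of partitions into $K$ non-empty groups is the Stirling number of the second kind, which $K^n/K!$ approximates.

```python
from math import comb, factorial

n, K = 25, 4   # hypothetical sizes for illustration

# Exact: Stirling number of the second kind S(n, K).
stirling = sum((-1) ** k * comb(K, k) * (K - k) ** n
               for k in range(K + 1)) // factorial(K)
print(stirling)                 # 46771289738810  (about 4.7e13 partitions)
print(K ** n // factorial(K))   # 46912496118442  (the K^n / K! approximation)
```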

2.4 Trivial degenerate case

Not stated in the book. If $K = n$ then every observation is its own cluster, $|C_k| = 1$ for every $k$, and the objective evaluates to

$$W(C) = \sum_{k=1}^{n} \frac{1}{1} \sum_{i, i' \in C_k} \| x_i - x_{i'} \|^2 = 0,$$

because the only within-cluster pairs are $i = i'$.
This is the global minimum, but it is trivial: no structure has been learnt. L22 / k-means-clustering §“Insights & mental models”


3. Hierarchical clustering

The book gives Algorithm 12.3, Table 12.3 (linkage formulas), the reordering fact, and the qualitative gender-vs-nationality non-nested example. The book does not walk through a numerical worked example with a complete-linkage hand-computation; the prof does, and grades it with a $-1$-per-mistake rubric. L22 / L27 / hierarchical-clustering

3.1 Linkage formulas (notation cleaned)

The book’s Table 12.3 gives definitions in prose. Formally, with $A$ and $B$ two clusters and $d(\cdot, \cdot)$ the point-to-point dissimilarity:

  • Complete: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
  • Single: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
  • Average: $d(A, B) = \displaystyle\frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
  • Centroid: $d(A, B) = d(\bar{a}, \bar{b})$, the dissimilarity between the two cluster centroids

The prof also name-checks median linkage (in scope only as “exists”). Ward linkage is out of scope per docs/scope.md. The average-linkage definition is the unweighted mean (each pair contributes equally); not the weighted variant some packages use.

“Average and complete tend to yield more balanced clusters.” — L22

3.2 Hand-computation recipe (the $-1$-per-mistake exam question)

Templated by the prof in L27 as a graded exam item. From a given dissimilarity matrix $D$ and a chosen linkage:

Recipe

  1. Find the smallest off-diagonal entry $d_{ij}$. Fuse $i$ and $j$ at height $d_{ij}$.
  2. Update the matrix. Replace rows/columns $i$ and $j$ with a single row/column for the new cluster $\{i, j\}$. The new entries against any remaining cluster $\ell$ are computed by the linkage rule:
    • Complete: $d(\{i, j\}, \ell) = \max(d_{i\ell}, d_{j\ell})$.
    • Single: $d(\{i, j\}, \ell) = \min(d_{i\ell}, d_{j\ell})$.
    • Average: $d(\{i, j\}, \ell) = \tfrac{1}{2}(d_{i\ell} + d_{j\ell})$ when fusing singletons; in general, average over all cross-cluster pairs.
  3. Repeat on the smaller matrix: find next-smallest entry, fuse, recompute.
  4. Stop when one cluster remains.
  5. Draw the dendrogram with x-axis = arbitrary leaf order and y-axis = fusion heights. Each fusion is a horizontal bar at the recorded height linking its children.
  6. Cut at the requested height (or at the height that yields $K$ clusters) and report the cluster membership.

Grading rubric

“Negative one point for each mistake … I won’t give negative points, don’t worry.” — L27

Mistakes propagate: a wrong first fusion ruins the whole tree. Always recompute the matrix in writing after every step.
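The recipe is mechanical enough to express directly in code; this is my sketch (not the prof's), handling complete and single linkage on a square matrix:

```python
import numpy as np

def agglomerate(D, linkage="complete"):
    """Run the recipe on a square dissimilarity matrix D (0-based labels).

    Returns the fusion list [(members_a, members_b, height), ...].
    Naive O(n^3) -- fine for exam-sized matrices.
    """
    rule = {"complete": max, "single": min}[linkage]
    clusters = [frozenset([i]) for i in range(len(D))]
    dist = {(a, b): D[i][j]
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j}
    merges = []
    while len(clusters) > 1:
        # Step 1: smallest off-diagonal entry; fuse at that height.
        (a, b), h = min(dist.items(), key=lambda kv: kv[1])
        merged = a | b
        merges.append((sorted(a), sorted(b), h))
        # Step 2: drop rows/columns a and b, add one for the fused cluster,
        # recomputing its entries by the linkage rule (max or min).
        old, clusters = dist, [c for c in clusters if c not in (a, b)]
        dist = {p: v for p, v in old.items() if a not in p and b not in p}
        for c in clusters:
            d_ac = old.get((a, c), old.get((c, a)))
            d_bc = old.get((b, c), old.get((c, b)))
            dist[(merged, c)] = rule(d_ac, d_bc)
        clusters.append(merged)
    return merges
```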

3.3 Worked example — complete linkage on the prof’s 4×4 matrix

This is the prof’s slide-deck worked example for the L22 hand-computation drill (also Exercise 2 of ISLP §12.6, whose solution the book does not provide). L22 slides “Exercise 2 from the book” / hierarchical-clustering §“Hand-computation procedure”

Input matrix (ISLP §12.6 Exercise 2):

$$D = \begin{pmatrix} 0 & 0.3 & 0.4 & 0.7 \\ 0.3 & 0 & 0.5 & 0.8 \\ 0.4 & 0.5 & 0 & 0.45 \\ 0.7 & 0.8 & 0.45 & 0 \end{pmatrix}$$

Step 1. Smallest off-diagonal: $d_{12} = 0.3$. Fuse $\{1, 2\}$ at height $0.3$.

Step 2. Update the matrix under complete linkage:

$$d(\{1,2\}, 3) = \max(0.4, 0.5) = 0.5, \qquad d(\{1,2\}, 4) = \max(0.7, 0.8) = 0.8.$$

New matrix (rows: $\{1,2\}, 3, 4$):

$$\begin{pmatrix} 0 & 0.5 & 0.8 \\ 0.5 & 0 & 0.45 \\ 0.8 & 0.45 & 0 \end{pmatrix}$$

Step 3. Smallest entry: $d_{34} = 0.45$. Fuse $\{3, 4\}$ at height $0.45$.

Step 4. Update:

$$d(\{1,2\}, \{3,4\}) = \max(0.5, 0.8) = 0.8.$$

Step 5. Final fusion at height $0.8$, producing the single cluster $\{1, 2, 3, 4\}$.

Dendrogram (heights only — the x-ordering is arbitrary):

  • Bottom: leaves $1, 2, 3, 4$.
  • Bar at $0.3$ joining 1 and 2.
  • Bar at $0.45$ joining 3 and 4.
  • Bar at $0.8$ joining $\{1,2\}$ and $\{3,4\}$.

Cuts. A cut at any $h \in (0.45, 0.8)$ yields two clusters: $\{1,2\}$ and $\{3,4\}$. A cut at $h \in (0.3, 0.45)$ yields three clusters: $\{1,2\}$, $\{3\}$, $\{4\}$. A cut at $h < 0.3$ yields four singletons.

3.4 Worked example — single linkage on the same matrix

Same input matrix; swap $\max$ for $\min$. L22 slides Exercise 2 (b)

Step 1. Smallest entry $d_{12} = 0.3$. Fuse $\{1, 2\}$ at $0.3$.

Step 2. Update under single linkage:

$$d(\{1,2\}, 3) = \min(0.4, 0.5) = 0.4 \quad \text{and} \quad d(\{1,2\}, 4) = \min(0.7, 0.8) = 0.7.$$

Step 3. Smallest entry: $d(\{1,2\}, 3) = 0.4$. Fuse $\{1, 2, 3\}$ at $0.4$.

Step 4. Update:

$$d(\{1,2,3\}, 4) = \min(0.7, 0.45) = 0.45.$$

Step 5. Final fusion at height $0.45$.

Cut at $h \in (0.4, 0.45)$. Below that height the partition is $\{1,2,3\}$ and $\{4\}$, different from complete linkage, which gave $\{1,2\}$ and $\{3,4\}$. This is the canonical “linkage choice changes the answer” demonstration the prof uses.
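Both runs can be checked mechanically; a minimal scipy sketch (assuming scipy is installed; `squareform` converts the square matrix to the condensed vector `linkage` expects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0.0, 0.3, 0.4, 0.7],
              [0.3, 0.0, 0.5, 0.8],
              [0.4, 0.5, 0.0, 0.45],
              [0.7, 0.8, 0.45, 0.0]])
d = squareform(D)                                # condensed form

Z_complete = linkage(d, method="complete")       # fuses at 0.3, 0.45, 0.8
Z_single = linkage(d, method="single")           # fuses at 0.3, 0.4, 0.45

# Two-cluster cuts reproduce the hand computations:
print(fcluster(Z_complete, t=2, criterion="maxclust"))  # {1,2} vs {3,4}
print(fcluster(Z_single, t=2, criterion="maxclust"))    # {1,2,3} vs {4}
```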

3.5 Correlation-based distance, explicit formula

The book (§12.4.2, Fig 12.15) describes correlation-based distance qualitatively as “two observations are similar if their features are highly correlated”, but does not give the formula or its range. The prof writes it out explicitly in L22:

$$d(x_i, x_{i'}) = 1 - r_{ii'},$$

where $r_{ii'}$ is the Pearson correlation between the two observation profiles (each profile is a length-$p$ vector across features). The range:

$$r_{ii'} \in [-1, 1] \quad \Longrightarrow \quad d(x_i, x_{i'}) \in [0, 2],$$

with $d = 0$ at perfect positive correlation and $d = 2$ at perfect anti-correlation. The variant $1 - |r_{ii'}|$ is used when the sign of the correlation is not meaningful (range $[0, 1]$).

“Often people will say all right let’s minimize 1 minus the Pearson correlation. All right, that way, if they’re perfectly correlated, you have a dissimilarity of zero.” — L22

“Note: Correlation is actually a similarity measure, not a distance measure.” — slide-deck L21/L22

This is a similarity-to-distance conversion, not a proper metric: $1 - r$ does not satisfy the triangle inequality. For clustering this does not matter — only the ordering of pairs is needed. distance-metrics
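In numpy the whole observation-by-observation dissimilarity matrix is one call (a sketch; `np.corrcoef` treats rows as profiles):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(10, 8))  # 10 observations, 8 features
D = 1.0 - np.corrcoef(X)                           # entries in [0, 2]

# Scaling an observation preserves its profile shape, so its distance stays 0:
assert np.isclose(1.0 - np.corrcoef(X[0], 2 * X[0] + 3)[0, 1], 0.0)
```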

3.6 Squared Euclidean vs Euclidean inside k-means (ranking equivalence)

Not explicit in the book — the book uses squared Euclidean throughout §12.4.1 without commenting on why the square root is dropped. The prof’s justification, useful as a one-line exam answer:

“There’s no reason to take the square root, it’s just an extra step that won’t get you anything … it won’t change the goal or won’t change who wins or how they win.” — L22

Formally: $t \mapsto \sqrt{t}$ is strictly monotone on $[0, \infty)$, so

$$\| x_i - \bar{x}_k \|^2 \le \| x_i - \bar{x}_{k'} \|^2 \iff \| x_i - \bar{x}_k \| \le \| x_i - \bar{x}_{k'} \|;$$

every nearest-centroid comparison, and hence every decision the algorithm makes, comes out the same with or without the square root. The squared form is preferred because the centroid-as-minimizer property (§2.1, Step 2) holds in closed form for squared distance and not for plain Euclidean distance — for plain Euclidean, the cluster-wise minimizer is the geometric median (a much harder object). distance-metrics / k-means-clustering


4. Notation / terminology drift

A few small differences between the prof’s notation and the book’s that an exam reader should keep straight.

  • Covariance normalisation. The prof writes $S = \frac{1}{n-1} X^\top X$ (slide L21). The book uses $\frac{1}{n}$ in the variance form of Eq 12.10 (consistent with population variance). The eigenvectors are identical; only the eigenvalues scale by the constant factor $\frac{n}{n-1}$, and ratios (PVE) are unaffected.
  • Eigendecomposition shorthand. The slide writes $S = W \Lambda W^\top$. Because $S$ is symmetric, $W$ is orthonormal and $W^{-1} = W^\top$. Some packages return $W \Lambda W^\top$, others $W \Lambda W^{-1}$; the eigenvectors are the same up to sign.
  • “Dissimilarity” vs “distance”. The prof uses these interchangeably; the book technically distinguishes (a “dissimilarity” need not satisfy the triangle inequality). For module 10 they are the same object.
  • Average linkage. Both the prof and the book mean the unweighted mean over pairs (each cross-cluster pair contributes once with weight $\frac{1}{|A|\,|B|}$). Some R packages and the exam_analysis cheat-sheet phrase it as a “weighted mean” — that is loose language for the same formula. UPGMA = unweighted average linkage; do not confuse with WPGMA (weighted, not in scope).
  • PCA loading sign. The prof and the book both note that loadings are unique up to a sign flip. If two software packages disagree on the sign of a loading column, that is not an error.
  • Sample variance denominator inside standardisation. The prof writes $s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ (denominator $n-1$). Many packages use $\frac{1}{n}$. After dividing by $s_j$, the resulting columns have sample variance 1 vs population variance 1 — the difference is cosmetic for everything downstream.
  • $Z_m$ vs $z_{im}$. The book uses uppercase $Z_m$ for the score variable, lowercase $z_{im}$ for the score of observation $i$ on component $m$. The prof’s slide-deck uses both interchangeably; the wiki keeps the book convention.
  • The letter $K$. In hierarchical clustering, $K$ is the number of clusters obtained by cutting the dendrogram at some height. In k-means, $K$ is the pre-specified target. Same symbol, different roles.