
Module 10 — Unsupervised learning

24 questions · 100 points · ~40 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

The first principal component of a centered data matrix $X \in \mathbb{R}^{n \times p}$ is best defined as the direction $\phi_1 \in \mathbb{R}^p$ that:

Show answer
Correct answer: B

This is the prof's verbatim framing: "find the best way of rotating our shit such that now it has a maximum variance" with a unit-norm constraint so the algorithm rotates rather than rescales.

A confuses PCA with K-means (within-cluster sum-of-squares minimization). C describes partial least squares (PLS), which is supervised and uses $Y$ — PCA is unsupervised. D mistakes "direction of largest variance" for "variable with largest variance"; PCA finds linear combinations, not single columns.
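
To see the definition in action, here is a minimal numpy sketch (illustrative, not course code): the top eigenvector of the sample covariance is exactly the unit-norm direction that maximizes the variance of the projected data.

```python
# Minimal numpy sketch: PC1 = top eigenvector of the sample covariance,
# and no other unit-norm direction gives higher projected variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 1, 0], [0, 1, 0], [0, 0, 0.5]])
X = X - X.mean(axis=0)                      # center, as PCA requires

cov = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                       # unit-norm loading vector

var_pc1 = np.var(X @ phi1, ddof=1)          # equals the top eigenvalue
for _ in range(1000):                       # random unit directions never beat it
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.var(X @ v, ddof=1) <= var_pc1 + 1e-9
print(round(var_pc1, 3), round(eigvals[-1], 3))
```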

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1.

Question 2 5 points Exam 2025 P2f

PCA is performed on a dataset with 5 standardized variables. The eigenvalues are:

$\lambda_1 = 2.7,\ \lambda_2 = 1.5,\ \lambda_3 = 0.5,\ \lambda_4 = 0.2,\ \lambda_5 = 0.1.$

How many principal components must you keep to retain at least 90% of the total variance?

Show answer
Correct answer: D

Total variance $= 2.7+1.5+0.5+0.2+0.1 = 5$ (which equals $p$, since the variables are standardized). Cumulative PVE: $2.7/5 = 0.54$; $0.54+0.30 = 0.84$ (still under 0.90); $0.84+0.10 = 0.94$ — threshold crossed at $M = 3$.
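
A quick numpy check of the cumulative-PVE arithmetic (a sketch, not the exam's expected working):

```python
import numpy as np

eigvals = np.array([2.7, 1.5, 0.5, 0.2, 0.1])
cum_pve = np.cumsum(eigvals) / eigvals.sum()   # [0.54, 0.84, 0.94, 0.98, 1.0]
M = int(np.argmax(cum_pve >= 0.90)) + 1        # first component crossing 90%
print(cum_pve, M)                              # M = 3
```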

A stops at 0.84 and reads "almost 90%" as enough, but the threshold is 0.90 and 0.84 falls short. B and C add components past the threshold; once 0.94 is reached, more components do not earn their keep.

Atoms: explained-variance-and-scree-plot, principal-component-analysis. Lecture: L27-summary.

Question 3 5 points Exam 2025 P2f

Continuing the PCA above, the loadings of the first principal component are $\phi_1 = (0.85,\ 0.2,\ 0.2,\ 0.1,\ 0.05)^\top$. For an observation with standardized values $x = (1,\ 0.5,\ -0.5,\ -1,\ 0)$, what is the score $z_1 = \phi_1^\top x$ on PC1?

Show answer
Correct answer: B

$z_1 = 0.85(1) + 0.2(0.5) + 0.2(-0.5) + 0.1(-1) + 0.05(0) = 0.85 + 0.1 - 0.1 - 0.1 + 0 = 0.75$.
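
The score is a plain dot product, easy to verify in numpy:

```python
import numpy as np

phi1 = np.array([0.85, 0.2, 0.2, 0.1, 0.05])
x = np.array([1.0, 0.5, -0.5, -1.0, 0.0])
print(phi1 @ x)   # 0.75
```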

A drops the $0.85 \cdot 1$ contribution and keeps only the rest. C drops the negative terms (sign error on $X_3$ or $X_4$). D adds $|0.85| + |0.1| + |0.1| + |0.1| + 0$, ignoring signs entirely — a common "absolute-value" mistake.

Atoms: principal-component-analysis. Lecture: L27-summary.

Question 4 4 points

Mark each statement as true or false.

Show answer
  1. False — PCA is not scale invariant: the largest-unit variable dominates the first PC purely because its numbers are bigger. The prof's slide bullet: "PCA is not scale invariant."
  2. True — Euclidean distance is dominated by the largest-scale variable; standardize so all coordinates contribute on equal footing.
  3. False — same Euclidean-distance issue applies to hierarchical clustering as to K-means.
  4. False — OLS is scale-invariant in the fit ($\hat\beta_j$ rescales to compensate); standardization for OLS is cosmetic, useful for comparing $\hat\beta$'s but not required.

Atoms: standardization, principal-component-analysis, k-means-clustering. Lecture: L21-unsupervised-1.

Question 5 3 points

Suppose the variance of PC1 is $\lambda_1 = 6$ and the total variance of the centered data is 12. What is the proportion of variance explained by PC1?

Show answer
Correct answer: D

$\text{PVE}_1 = \lambda_1 / \sum_k \lambda_k = 6 / 12 = 0.5$.

A divides $\lambda_1^2 / \text{total}^2$ or otherwise squares the wrong quantity. C uses $\sqrt{\lambda_1 / \text{total}}$ — confuses variance with standard deviation. B inverts the ratio (total / $\lambda_1$); a PVE cannot exceed 1.

Atoms: explained-variance-and-scree-plot.

Question 6 5 points

Mark each statement about principal component analysis as true or false.

Show answer
  1. True — loading vectors are unit norm by construction; this is the constraint that prevents PCA from cheating by inflating variance through scaling.
  2. False — PCA is unsupervised and never sees $Y$; max-variance directions of $X$ may be unrelated to $Y$. This is precisely PCR's failure mode that motivates PLS.
  3. True — orthogonal loadings yield uncorrelated scores; the prof's framing is that PCA "removes all these annoying correlations in your data." (Both this and item 1 are checked in the numpy sketch after this list.)
  4. False — PCs are unique up to a sign flip. "The sign flip is boring; it just means which direction it is." Different software packages may report opposite signs for the same component.
  5. False — keeping all $p$ PCs is a rotation of coordinates, not a reduction. The reduction comes from truncating to the first $M < p$.
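
A short numpy sketch (illustrative, not course code) that checks items 1 and 3 directly: SVD loadings have unit norm, and the resulting scores are uncorrelated.

```python
# Checks items 1 and 3: unit-norm loadings, uncorrelated scores.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                                # columns are phi_1 ... phi_4
scores = X @ loadings

print(np.linalg.norm(loadings, axis=0))        # all exactly 1
print(np.round(np.corrcoef(scores, rowvar=False), 6))  # identity matrix
# Item 4: negating any column of `loadings` negates that score, nothing else.
```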

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1.

Question 7 4 points

You run PCA on the USArrests dataset with four columns: Murder (per 100,000, sample variance ≈ 19), Assault (per 100,000, variance ≈ 6945), UrbanPop (% of population, variance ≈ 210), and Rape (per 100,000, variance ≈ 88). You forget to standardize. Which of the following is the most likely outcome?

Show answer
Correct answer: A

PCA chases total variance, and unscaled Assault has variance ≈ 6945, roughly 33× larger than the next biggest column. The first PC's loading on Assault approaches 1 and the substantive "overall criminality" axis is lost. Standardize, and PC1 picks up the meaningful all-three-crime axis instead.
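
A hedged simulation of the effect, using synthetic independent columns with the variances quoted above (not the real USArrests data):

```python
# Synthetic columns with USArrests-like variances; not the real dataset.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) * np.sqrt([19.0, 6945.0, 210.0, 88.0])
X = X - X.mean(axis=0)

def pc1(M):
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return np.round(Vt[0], 3), round(float((s**2 / (s**2).sum())[0]), 3)

print(pc1(X))                          # raw: Assault loading ~ +/-1, PVE ~ 0.96
print(pc1(X / X.std(axis=0, ddof=1)))  # standardized: PVE drops toward 1/p
```

Since these simulated columns are independent, standardizing leaves no dominant direction at all; on the real data, PC1 instead becomes the shared crime axis.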

B confuses qualitative interpretability with the numerical objective; PCA does not know what variables "mean." C contradicts the prof's slide bullet "PCA is not scale invariant" — this is the canonical scope trap. D states a common but false belief: PCA does not standardize for you. You either pass the covariance matrix (raw, scale-sensitive) or the correlation matrix (standardized); the choice is yours.

Atoms: principal-component-analysis, standardization. Lecture: L21-unsupervised-1.

Question 8 4 points

A PCA on $p = 8$ standardized variables yields a scree plot in which all eight bars are roughly equal in height. What is the most defensible conclusion?

Show answer
Correct answer: C

A flat scree plot means each PC explains roughly $1/p$ of the variance — i.e., the variables were already nearly orthogonal. The prof: "if the variables were already orthogonal then adding more PCs is just the same as adding another variable." There is no low-dim structure to exploit.
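
You can reproduce a flat scree plot by construction: PCA on independent standardized noise gives eight near-equal eigenvalue shares (a simulation sketch, not course code).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 8))            # independent, already standardized
X = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
print(np.round(s**2 / (s**2).sum(), 3))   # eight shares all near 1/8 = 0.125
```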

A miscounts: each of eight equal bars carries $\approx 12.5\%$, so PC1+PC2 cover only $\approx 25\%$ — not 90%. B describes the opposite extreme (one PC explaining ≈ 100%); perfect collinearity gives a steep, not flat, scree. D invents a "rescale by kurtosis" fix that is not part of PCA — the input was already standardized, and standardization is exactly why each PC is equal-share.

Atoms: explained-variance-and-scree-plot. Lecture: L15-modelsel-4.

Question 9 6 points Exam 2025 P3b

You have four observations with the following Euclidean dissimilarity matrix:

$$D = \begin{pmatrix} 0 & 6.5 & 5 & 7 \\ 6.5 & 0 & 6 & 4 \\ 5 & 6 & 0 & 2 \\ 7 & 4 & 2 & 0 \end{pmatrix}.$$

Perform hierarchical clustering with complete linkage. At which fusion heights do the three merges occur, in order?

Show answer
Correct answer: C

Smallest entry is $d_{34} = 2$ → fuse $\{3, 4\}$ at height 2. Recompute with complete linkage (max): $D(\{3,4\}, 1) = \max(5, 7) = 7$; $D(\{3,4\}, 2) = \max(6, 4) = 6$; $d_{12} = 6.5$ unchanged. Smallest in the 3×3 is $D(\{3,4\}, 2) = 6$ → fuse $\{2, 3, 4\}$ at height 6. Final: $D(\{2,3,4\}, 1) = \max(6.5, 5, 7) = 7$ → fuse at height 7.
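
A scipy cross-check of the hand computation; it also prints the single-linkage heights used in Q10. (Assuming scipy's conventions: squareform converts the square matrix into the condensed vector that linkage() expects.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0.0, 6.5, 5.0, 7.0],
              [6.5, 0.0, 6.0, 4.0],
              [5.0, 6.0, 0.0, 2.0],
              [7.0, 4.0, 2.0, 0.0]])
d = squareform(D)                               # condensed pairwise distances
print(linkage(d, method='complete')[:, 2])      # fusion heights: [2. 6. 7.]
print(linkage(d, method='single')[:, 2])        # Q10's heights:  [2. 4. 5.]
```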

A takes the second merge as $d_{13} = 5$, reading a stale entry from the original matrix: once observation 3 sits inside $\{3,4\}$, the relevant distance is the recomputed $D(\{3,4\}, 1) = 7$, not the raw $d_{13}$. B uses single linkage (min) instead of complete (max), a common linkage mix-up. D uses single linkage at step 2 ($\min(6,4) = 4$) but complete at step 3, mixing the two rules within one dendrogram.

Atoms: hierarchical-clustering. Lecture: L27-summary. Note: −1 point per mistake on this question type.

Question 10 5 points

Using the same dissimilarity matrix as Q9, perform hierarchical clustering with single linkage instead. What are the three fusion heights, in order?

Show answer
Correct answer: A

First fusion is the same: $d_{34} = 2$ → fuse $\{3, 4\}$ at height 2. Single linkage (min): $D(\{3,4\}, 1) = \min(5, 7) = 5$; $D(\{3,4\}, 2) = \min(6, 4) = 4$; $d_{12} = 6.5$. Smallest = $D(\{3,4\}, 2) = 4$ → fuse $\{2, 3, 4\}$ at height 4. Final: $D(\{2,3,4\}, 1) = \min(6.5, 5, 7) = 5$ → fuse at height 5. Note that single-linkage merges typically occur at lower heights than complete-linkage merges on the same data.

B takes the third fusion at 6.5, reading the stale $d_{12}$ entry even though observation 2 has already merged into $\{2,3,4\}$, so $\{1,2\}$ is no longer a possible fusion; the correct height is $\min(6.5, 5, 7) = 5$. C is the complete-linkage answer (Q9); the scipy sketch there prints these single-linkage heights as well. D uses single for the second merge but complete for the third.

Atoms: hierarchical-clustering.

Question 11 4 points

Mark each statement about hierarchical clustering linkage as true or false.

Show answer
  1. True — prof verbatim: "Average and complete tend to yield more balanced clusters."
  2. True — single linkage's $\min$ rule means a single near-neighbor pulls a whole cluster in; the result is a chain rather than a compact group.
  3. True — for any sets $A, B$: $\max_{i \in A, j \in B} d(x_i, x_j) \ge \min_{i \in A, j \in B} d(x_i, x_j)$. In particular, when $A$ and $B$ are singletons all three linkages collapse to the same point-to-point distance.
  4. False — all three linkages are computed from the same $\binom{n}{2}$ pairwise distances; complete uses the max, single the min, average the mean. Computational cost is the same.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 12 4 points

On a hierarchical-clustering dendrogram, observations 9 and 2 sit far apart on the horizontal (x) axis but their vertical fusion line meets at a low height. Observations 9 and 7 sit next to each other on the x-axis but only fuse at a much higher height. Which pair is most similar?

Show answer
Correct answer: B

Only the fusion height (vertical axis) carries information. The prof: "Don't assume that 9 and 2 are somehow closer together for some reason. Don't interpret this [horizontal layout] as being distances between things." There are $2^{n-1}$ valid horizontal orderings of the same dendrogram (children can be swapped at every fusion); horizontal proximity is arbitrary.

A is the canonical dendrogram-reading trap. C is overly cautious — fusion height does encode similarity; you just have to read the right axis. D drags in a real concern (linkage choice does shift fusion heights) but irrelevantly: once a single dendrogram is drawn, that linkage is already fixed, and pair similarities are read directly off the y-axis.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 13 3 points

Which optimization criterion does K-means clustering minimize?

Show answer
Correct answer: A

K-means minimizes the within-cluster sum of pairwise squared Euclidean distances, normalized by cluster size (book Eq. 12.17):

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$

B replaces the sum with a max — that would be a "diameter" objective, not K-means. C is the lasso objective. D is the regression-tree splitting criterion. Each is the right answer to a different question.

Atoms: k-means-clustering.

Question 14 5 points

Mark each statement about the K-means algorithm as true or false.

Show answer
  1. True — both the centroid step (cluster mean is the minimizer of squared deviations) and the assign step (each point moves to its closest centroid) weakly decrease the same nonnegative objective.
  2. True — random init is the prof's headline pitfall: "this random initialization can really screw you." Standard fix: run many times (e.g. nstart = 20) and keep the best.
  3. False — there is no nesting between K-means partitions across $K$. The $K = 4$ and $K = 5$ solutions can be totally unrelated. This is a core contrast with hierarchical clustering, where the dendrogram makes partitions nested by construction.
  4. False — K-means converges to a local minimum only. Brute force is infeasible: there are $K^n$ possible assignments (e.g. $10^{1000}$ for $n = 1000$, $K = 10$).
  5. True — every observation is its own cluster, so the within-cluster pairwise sum is empty and the objective is 0. This is the trivial degenerate solution; you have learned nothing.

Atoms: k-means-clustering, hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 15 5 points

You run K-means with $K = 2$ on three observations in 1-D: $x_1 = 0,\ x_2 = 2,\ x_3 = 8$. The random init assigns $x_1$ and $x_3$ to cluster A and $x_2$ to cluster B. After one centroid step plus one reassignment step, which cluster is each observation in?

Show answer
Correct answer: D

Centroid step: $\bar x_A = (0 + 8)/2 = 4$, $\bar x_B = 2$. Reassignment: distance from $x_1 = 0$ to centroid A is 4, to centroid B is 2 → move to B. $x_2 = 2$ stays in B (distance 0 < 2). $x_3 = 8$ stays in A (distance 4 < 6). Final: $x_1 \in B, x_2 \in B, x_3 \in A$.
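
Replaying the two steps in numpy (a sketch of the hand computation):

```python
import numpy as np

x = np.array([0.0, 2.0, 8.0])
labels = np.array(['A', 'B', 'A'])        # given init: x1, x3 in A; x2 in B

cents = {c: x[labels == c].mean() for c in ('A', 'B')}  # A -> 4.0, B -> 2.0
new = np.array(['A' if abs(v - cents['A']) < abs(v - cents['B']) else 'B'
                for v in x])
print(cents, new)                          # ['B' 'B' 'A']
```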

A misses the reassignment step entirely. C swaps the cluster names, putting $x_3$ alone in B; but the labels follow the centroids, and $x_3$ is nearest the updated centroid of A ($|8 - 4| = 4 < |8 - 2| = 6$), so its singleton keeps label A. B moves $x_3$ rather than $x_1$, which would require $x_3$ to be closer to centroid B = 2 than to centroid A = 4, but $|8 - 2| = 6 > |8 - 4| = 4$.

Atoms: k-means-clustering. Lecture: L27-summary (k-means hand-computation flagged as fair game).

Question 16 4 points

An online retailer wants to cluster shoppers so that customers with similar tastes (the relative mix of products in their basket) end up together, regardless of how much they spend overall. Which dissimilarity measure best fits this goal?

Show answer
Correct answer: C

Correlation distance measures whether the shapes of two profiles match. Two shoppers with identical proportional preferences but different overall volumes have $\rho \approx 1$ → distance ≈ 0. The prof's verbatim example: "Euclidean distance might group together infrequent shoppers — people just don't shop a lot — whereas the correlation distance would find people who have similar preferences."
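
Two hypothetical shoppers make the contrast concrete: identical mix, tenfold volume difference.

```python
import numpy as np

light = np.array([1.0, 2.0, 3.0, 4.0])    # basket counts across 4 categories
heavy = 10 * light                         # same taste profile, 10x the volume

print(1 - np.corrcoef(light, heavy)[0, 1])   # correlation distance ~ 0
print(np.linalg.norm(light - heavy))         # Euclidean distance ~ 49.3
```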

A groups infrequent shoppers together regardless of taste, the canonical wrong choice. B handles units but not the volume-vs-shape distinction; standardizing columns does not make Euclidean distance shape-sensitive. D is the Jaccard-style overlap measure — it captures "did both buy in the same categories" but ignores the proportions, missing the "similar tastes" signal.

Atoms: distance-metrics, hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 17 4 points

Which of the following best distinguishes hierarchical clustering from K-means clustering?

Show answer
Correct answer: A

Hierarchical clustering produces a tree (the dendrogram) you can cut at any height — one fit, all $K$'s available simultaneously. K-means must be re-run for each $K$, and partitions across $K$ are not nested.

B is wrong on both counts — both methods are unsupervised. C invents a non-existent guarantee: hierarchical merges are deterministic, but the result depends on the linkage and dissimilarity choices, and neither method "globally optimizes" anything in a useful sense. D inverts the cost story: hierarchical agglomerative clustering builds and updates a full $n\times n$ dissimilarity matrix (at least $O(n^2)$ memory), while K-means costs only $O(nKp)$ per iteration and is the cheaper option for large $n$.

Atoms: hierarchical-clustering, k-means-clustering.

Question 18 3 points

You suspect that a long, snake-like cluster will form in your data if the wrong linkage is used, and you want to discourage that. Which linkage rule should you choose?

Show answer
Correct answer: C

Single linkage's $\min$ rule is what produces chaining: a single near-neighbor pulls a whole cluster forward one step at a time. Complete linkage uses $\max$, so a candidate merge has to be close to every point in the existing cluster, producing balanced compact clusters. (Average is also fine; it isn't offered here.)

A is the linkage that causes chaining — the opposite of what was asked. B (centroid) can produce dendrogram inversions and is generally not preferred; the prof name-checks it but doesn't recommend it. D mislocates the cause of chaining: it comes from single linkage's $\min$ rule (one near-neighbor pulls the cluster forward), not from the choice of metric. Switching Euclidean for correlation does not fix it; switching $\min$ for $\max$ does.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 19 5 points ISLP §12 Q4

For a particular dataset, you build two dendrograms — one with single linkage, one with complete linkage. Mark each statement as true or false.

Show answer
  1. True — when the same two clusters fuse, complete linkage uses the max pairwise distance and single uses the min, so complete ≥ single (with equality only in degenerate cases).
  2. True — for singletons, all linkages collapse to the point-to-point distance $d(x_5, x_6)$. Linkage formulas only differ for set-to-set distances.
  3. False — different linkages produce different fusion heights, so the same horizontal cut crosses different numbers of vertical lines, yielding different cluster counts in general.
  4. False — even when the cluster counts coincide, the partitions can differ because the merging order differs.

Atoms: hierarchical-clustering.

Question 20 4 points

A colleague runs K-means once on a dataset with $K = 3$ and reports the final cluster assignments. You re-run the same code and get a different answer. What is the most likely cause and the standard fix?

Show answer
Correct answer: B

K-means converges to a local minimum determined by the random initial assignment. Different seeds → different basins of attraction → different final clusterings. The standard fix is to run many initializations and keep the one with the smallest within-cluster sum of squares.
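
A sketch of the fix in scikit-learn (an assumed illustrative setup; the nstart = 20 mentioned in Q14 is the R analogue): n_init re-runs K-means from many random starts and keeps the solution with the lowest within-cluster sum of squares (inertia_).

```python
# Assumed illustrative setup; n_init is scikit-learn's analogue of R's nstart.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 5, 10)])

one = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
best = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(one.inertia_, best.inertia_)   # the 20-restart fit is no worse in practice
```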

A is a real concern in general but does not explain run-to-run variation when the only thing changing is the seed. C addresses the curse of dimensionality, a separate failure mode unrelated to randomness. D is wrong: the objective generically has multiple local minima, and not all are global; picking arbitrarily forfeits the lowest-objective solution.

Atoms: k-means-clustering. Lecture: L22-unsupervised-2.

Question 21 4 points

Among PCA, partial least squares (PLS), lasso, and LDA, which one performs variable selection — i.e. produces a model in which some original predictors $X_j$ have coefficient exactly zero?

Show answer
Correct answer: C

The lasso's $\ell_1$ penalty has corners on the coordinate axes, so the optimum lies on these corners for large enough $\lambda$ — driving some $\hat\beta_j$ to exactly zero. The other three are dimensionality-reduction methods: each new variable $Z_m = \sum_j \phi_{jm} X_j$ is a linear combination of all $X_j$'s; the back-implied $\hat\beta_j = \sum_m \theta_m \phi_{jm}$ is generically nonzero for every $j$.
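
A scikit-learn contrast (illustrative, assumed setup): lasso coefficients hit exact zeros, while PC1's loadings stay dense.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only X1, X2 matter

print(Lasso(alpha=0.1).fit(X, y).coef_)        # exact zeros on X3..X6
print(PCA(n_components=1).fit(X).components_)  # all six loadings nonzero
```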

A: PCA picks max-variance directions of $X$, no variable is dropped. B: PLS uses $Y$ to pick directions but still mixes all variables. D: LDA gives discriminant scores that are linear combinations of all features. None of these zero out individual $X_j$'s.

Atoms: dimensionality-reduction, principal-component-analysis. Lecture: L15-modelsel-4.

Question 22 4 points

Mark each statement about dimensionality reduction as true or false.

Show answer
  1. True — PCA is unsupervised; loadings are eigenvectors of the sample covariance of $X$, with no $Y$ involvement. This is exactly why PCR can fail when the high-variance directions of $X$ are unrelated to $Y$.
  2. True — PLS is the supervised counterpart of PCR. Each direction maximizes $\text{Cov}(Z_m, Y)$ subject to orthogonality.
  3. False — PCA is linear and "cannot do anything non-linear." On a curve, the first two PCs cut through the cloud rather than wrapping along the manifold; clustering or nonlinear dim-reduction is needed.
  4. True — for supervised use, treat $M$ as a tuning parameter and pick the value minimizing CV error on the downstream model. Scree-plot heuristics are for pure visualization, not for PCR.

Atoms: dimensionality-reduction, principal-component-analysis, explained-variance-and-scree-plot. Lecture: L21-unsupervised-1.

Question 23 4 points Exam 2024 P2g

A PCA biplot of a four-variable standardized dataset shows the first two PCs explaining 62% and 25% of the total variance. The arrow for variable $X_2$ points sharply along the PC2 axis with large magnitude; the arrows for $X_1$ and $X_3$ point along PC1 with similar magnitudes; $X_4$'s arrow is short and ambiguous. Which interpretation is best supported?

Show answer
Correct answer: A

Biplot arrows show variable loadings projected onto the first two PCs. Arrow direction = which PC the variable loads on; arrow length = magnitude of those loadings. A short arrow means the variable contributes little to PC1 or PC2 (most of its variance lives in higher PCs).

B drags in the response $Y$, but PCA never sees $Y$; the biplot says nothing about prediction. C confuses an individual variable's loading with the variance the PC explains: 25% is the share PC2 captures across all four variables, not what $X_2$ alone explains. D inverts the geometry — the sample variance of each standardized variable is exactly 1; arrow length reflects how that variance is distributed across PCs, not its absolute magnitude.

Atoms: principal-component-analysis, explained-variance-and-scree-plot.

Question 24 3 points

In which of the following situations is hierarchical clustering most likely to perform worse than K-means?

Show answer
Correct answer: D

Hierarchical clustering imposes nested structure by construction. When the true partitions don't nest — gender vs. nationality is the prof's example — hierarchical clustering forces the data into a tree it doesn't have, and K-means at the appropriate $K$ may recover the structure better. "If you're trying to cluster data that doesn't necessarily have a hierarchical structure to it, then probably bad."

A is precisely where hierarchical shines (the prof's animal-behavior data, taxonomic data). C is a strength, not a weakness — the dendrogram is the visualization. B is also a strength: cutting the dendrogram at any height gives any $K$, no need to commit in advance.

Atoms: hierarchical-clustering, k-means-clustering. Lecture: L22-unsupervised-2.