
Module 10 — Unsupervised learning

24 questions · 100 points · ~40 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

The first principal component of a centered data matrix $X \in \mathbb{R}^{n \times p}$ is best defined as the direction $\phi_1 \in \mathbb{R}^p$ that:

Show answer
Correct answer: B

This is the prof's verbatim framing: "find the best way of rotating our shit such that now it has a maximum variance" with a unit-norm constraint so the algorithm rotates rather than rescales.

A confuses PCA with K-means (within-cluster sum-of-squares minimization). C describes partial least squares (PLS), which is supervised and uses $Y$ — PCA is unsupervised. D mistakes "direction of largest variance" for "variable with largest variance"; PCA finds linear combinations, not single columns.
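
To see the definition in action, here is a minimal numpy sketch (illustrative, not course code): the top eigenvector of the sample covariance is exactly the unit-norm direction that maximizes the variance of the projected data.

```python
# Minimal numpy sketch: PC1 = top eigenvector of the sample covariance,
# and no other unit-norm direction gives higher projected variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[3.0, 1, 0], [0, 1, 0], [0, 0, 0.5]])
X = X - X.mean(axis=0)                      # center, as PCA requires

cov = X.T @ X / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
phi1 = eigvecs[:, -1]                       # unit-norm loading vector

var_pc1 = np.var(X @ phi1, ddof=1)          # equals the top eigenvalue
for _ in range(1000):                       # random unit directions never beat it
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.var(X @ v, ddof=1) <= var_pc1 + 1e-9
print(round(var_pc1, 3), round(eigvals[-1], 3))
```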

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1.

Question 2 5 points Exam 2025 P2f

PCA is performed on a dataset with 5 standardized variables. The eigenvalues are:

$\lambda_1 = 2.7,\ \lambda_2 = 1.5,\ \lambda_3 = 0.5,\ \lambda_4 = 0.2,\ \lambda_5 = 0.1.$

How many principal components must you keep to retain at least 90% of the total variance?

Show answer
Correct answer: D

Total variance $= 2.7+1.5+0.5+0.2+0.1 = 5$ (which equals $p$, since the variables are standardized). Cumulative PVE: $2.7/5 = 0.54$; $0.54+0.30 = 0.84$ (still under 0.90); $0.84+0.10 = 0.94$ — threshold crossed at $M = 3$.
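
A quick numpy check of the cumulative-PVE arithmetic (a sketch, not the exam's expected working):

```python
import numpy as np

eigvals = np.array([2.7, 1.5, 0.5, 0.2, 0.1])
cum_pve = np.cumsum(eigvals) / eigvals.sum()   # [0.54, 0.84, 0.94, 0.98, 1.0]
M = int(np.argmax(cum_pve >= 0.90)) + 1        # first component crossing 90%
print(cum_pve, M)                              # M = 3
```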

A stops at 0.84 and reads "almost 90%" as enough, but the threshold is 0.90 and 0.84 falls short. B and C add components past the threshold; once 0.94 is reached, more components do not earn their keep.

Atoms: explained-variance-and-scree-plot, principal-component-analysis. Lecture: L27-summary.

Question 3 5 points Exam 2025 P2f

Continuing the PCA above, the loadings of the first principal component are $\phi_1 = (0.85,\ 0.2,\ 0.2,\ 0.1,\ 0.05)^\top$. For an observation with standardized values $x = (1,\ 0.5,\ -0.5,\ -1,\ 0)$, what is the score $z_1 = \phi_1^\top x$ on PC1?

Show answer
Correct answer: B

$z_1 = 0.85(1) + 0.2(0.5) + 0.2(-0.5) + 0.1(-1) + 0.05(0) = 0.85 + 0.1 - 0.1 - 0.1 + 0 = 0.75$.
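
The score is a plain dot product, easy to verify in numpy:

```python
import numpy as np

phi1 = np.array([0.85, 0.2, 0.2, 0.1, 0.05])
x = np.array([1.0, 0.5, -0.5, -1.0, 0.0])
print(phi1 @ x)   # 0.75
```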

A drops the $0.85 \cdot 1$ contribution and keeps only the rest. C drops the negative terms (sign error on $X_3$ or $X_4$). D adds $|0.85| + |0.1| + |0.1| + |0.1| + 0$, ignoring signs entirely — a common "absolute-value" mistake.

Atoms: principal-component-analysis. Lecture: L27-summary.

Question 4 4 points

Mark each statement as true or false.

Show answer
  1. False — PCA is not scale invariant: the largest-unit variable dominates the first PC purely because its numbers are bigger. The prof's slide bullet: "PCA is not scale invariant."
  2. True — Euclidean distance is dominated by the largest-scale variable; standardize so all coordinates contribute on equal footing.
  3. False — same Euclidean-distance issue applies to hierarchical clustering as to K-means.
  4. False — OLS is scale-invariant in the fit ($\hat\beta_j$ rescales to compensate); standardization for OLS is cosmetic, useful for comparing $\hat\beta$'s but not required.

Atoms: standardization, principal-component-analysis, k-means-clustering. Lecture: L21-unsupervised-1.

Question 5 3 points

Suppose the variance of PC1 is $\lambda_1 = 6$ and the total variance of the centered data is 12. What is the proportion of variance explained by PC1?

Show answer
Correct answer: D

$\text{PVE}_1 = \lambda_1 / \sum_k \lambda_k = 6 / 12 = 0.5$.

A divides $\lambda_1^2 / \text{total}^2$ or otherwise squares the wrong quantity. C uses $\sqrt{\lambda_1 / \text{total}}$ — confuses variance with standard deviation. B inverts the ratio (total / $\lambda_1$); a PVE cannot exceed 1.

Atoms: explained-variance-and-scree-plot.

Question 6 5 points

Mark each statement about principal component analysis as true or false.

Show answer
  1. True — loading vectors are unit norm by construction; this is the constraint that prevents PCA from cheating by inflating variance through scaling.
  2. False — PCA is unsupervised and never sees $Y$; max-variance directions of $X$ may be unrelated to $Y$. This is precisely PCR's failure mode that motivates PLS.
  3. True — orthogonal loadings yield uncorrelated scores; the prof's framing is that PCA "removes all these annoying correlations in your data." (Both this and item 1 are checked in the numpy sketch after this list.)
  4. False — PCs are unique up to a sign flip. "The sign flip is boring; it just means which direction it is." Different software packages may report opposite signs for the same component.
  5. False — keeping all $p$ PCs is a rotation of coordinates, not a reduction. The reduction comes from truncating to the first $M < p$.
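
A short numpy sketch (illustrative, not course code) that checks items 1 and 3 directly: SVD loadings have unit norm, and the resulting scores are uncorrelated.

```python
# Checks items 1 and 3: unit-norm loadings, uncorrelated scores.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                                # columns are phi_1 ... phi_4
scores = X @ loadings

print(np.linalg.norm(loadings, axis=0))        # all exactly 1
print(np.round(np.corrcoef(scores, rowvar=False), 6))  # identity matrix
# Item 4: negating any column of `loadings` negates that score, nothing else.
```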

Atoms: principal-component-analysis, dimensionality-reduction. Lecture: L21-unsupervised-1.

Question 7 4 points

You run PCA on the USArrests dataset with four columns: Murder (per 100,000, sample variance ≈ 19), Assault (per 100,000, variance ≈ 6945), UrbanPop (% of population, variance ≈ 210), and Rape (per 100,000, variance ≈ 88). You forget to standardize. Which of the following is the most likely outcome?

Show answer
Correct answer: A

PCA chases total variance, and unscaled Assault has variance ≈ 6945, roughly 33× larger than the next biggest column. The first PC's loading on Assault approaches 1 and the substantive "overall criminality" axis is lost. Standardize, and PC1 picks up the meaningful all-three-crime axis instead.
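
A hedged simulation of the effect, using synthetic independent columns with the variances quoted above (not the real USArrests data):

```python
# Synthetic columns with USArrests-like variances; not the real dataset.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) * np.sqrt([19.0, 6945.0, 210.0, 88.0])
X = X - X.mean(axis=0)

def pc1(M):
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    return np.round(Vt[0], 3), round(float((s**2 / (s**2).sum())[0]), 3)

print(pc1(X))                          # raw: Assault loading ~ +/-1, PVE ~ 0.96
print(pc1(X / X.std(axis=0, ddof=1)))  # standardized: PVE drops toward 1/p
```

Since these simulated columns are independent, standardizing leaves no dominant direction at all; on the real data, PC1 instead becomes the shared crime axis.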

B confuses qualitative interpretability with the numerical objective; PCA does not know what variables "mean." C contradicts the prof's slide bullet "PCA is not scale invariant" — this is the canonical scope trap. D states a common but false belief: PCA does not standardize for you. You either pass the covariance matrix (raw, scale-sensitive) or the correlation matrix (standardized); the choice is yours.

Atoms: principal-component-analysis, standardization. Lecture: L21-unsupervised-1.

Question 8 4 points

A PCA on $p = 8$ standardized variables yields a scree plot in which all eight bars are roughly equal in height. What is the most defensible conclusion?

Show answer
Correct answer: C

A flat scree plot means each PC explains roughly $1/p$ of the variance — i.e., the variables were already nearly orthogonal. The prof: "if the variables were already orthogonal then adding more PCs is just the same as adding another variable." There is no low-dim structure to exploit.
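
You can reproduce a flat scree plot by construction: PCA on independent standardized noise gives eight near-equal eigenvalue shares (a simulation sketch, not course code).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 8))            # independent, already standardized
X = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(X, full_matrices=False)
print(np.round(s**2 / (s**2).sum(), 3))   # eight shares all near 1/8 = 0.125
```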

A miscounts: each of eight equal bars carries $\approx 12.5\%$, so PC1+PC2 cover only $\approx 25\%$ — not 90%. B describes the opposite extreme (one PC explaining ≈ 100%); perfect collinearity gives a steep, not flat, scree. D invents a "rescale by kurtosis" fix that is not part of PCA — the input was already standardized, and standardization is exactly why each PC is equal-share.

Atoms: explained-variance-and-scree-plot. Lecture: L15-modelsel-4.

Question 9 6 points Exam 2025 P3b

You have four observations with the following Euclidean dissimilarity matrix:

$$D = \begin{pmatrix} 0 & 6.5 & 5 & 7 \\ 6.5 & 0 & 6 & 4 \\ 5 & 6 & 0 & 2 \\ 7 & 4 & 2 & 0 \end{pmatrix}.$$

Perform hierarchical clustering with complete linkage. At which fusion heights do the three merges occur, in order?

Show answer
Correct answer: C

Smallest entry is $d_{34} = 2$ → fuse $\{3, 4\}$ at height 2. Recompute with complete linkage (max): $D(\{3,4\}, 1) = \max(5, 7) = 7$; $D(\{3,4\}, 2) = \max(6, 4) = 6$; $d_{12} = 6.5$ unchanged. Smallest in the 3×3 is $D(\{3,4\}, 2) = 6$ → fuse $\{2, 3, 4\}$ at height 6. Final: $D(\{2,3,4\}, 1) = \max(6.5, 5, 7) = 7$ → fuse at height 7.
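
A scipy cross-check of the hand computation; it also prints the single-linkage heights used in Q10. (Assuming scipy's conventions: squareform converts the square matrix into the condensed vector that linkage() expects.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0.0, 6.5, 5.0, 7.0],
              [6.5, 0.0, 6.0, 4.0],
              [5.0, 6.0, 0.0, 2.0],
              [7.0, 4.0, 2.0, 0.0]])
d = squareform(D)                               # condensed pairwise distances
print(linkage(d, method='complete')[:, 2])      # fusion heights: [2. 6. 7.]
print(linkage(d, method='single')[:, 2])        # Q10's heights:  [2. 4. 5.]
```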

A takes the second merge as $d_{13} = 5$, reading a stale entry from the original matrix: once observation 3 sits inside $\{3,4\}$, the relevant distance is the recomputed $D(\{3,4\}, 1) = 7$, not the raw $d_{13}$. B uses single linkage (min) instead of complete (max), a common linkage mix-up. D uses single linkage at step 2 ($\min(6,4) = 4$) but complete at step 3, mixing the two rules within one dendrogram.

Atoms: hierarchical-clustering. Lecture: L27-summary. Note: −1 point per mistake on this question type.

Question 10 5 points

Using the same dissimilarity matrix as Q9, perform hierarchical clustering with single linkage instead. What are the three fusion heights, in order?

Show answer
Correct answer: A

First fusion is the same: $d_{34} = 2$ → fuse $\{3, 4\}$ at height 2. Single linkage (min): $D(\{3,4\}, 1) = \min(5, 7) = 5$; $D(\{3,4\}, 2) = \min(6, 4) = 4$; $d_{12} = 6.5$. Smallest = $D(\{3,4\}, 2) = 4$ → fuse $\{2, 3, 4\}$ at height 4. Final: $D(\{2,3,4\}, 1) = \min(6.5, 5, 7) = 5$ → fuse at height 5. Note that single-linkage merges typically occur at lower heights than complete-linkage merges on the same data.

B takes the third fusion at 6.5, reading the stale $d_{12}$ entry even though observation 2 has already merged into $\{2,3,4\}$, so $\{1,2\}$ is no longer a possible fusion; the correct height is $\min(6.5, 5, 7) = 5$. C is the complete-linkage answer (Q9); the scipy sketch there prints these single-linkage heights as well. D uses single for the second merge but complete for the third.

Atoms: hierarchical-clustering.

Question 11 4 points

Mark each statement about hierarchical clustering linkage as true or false.

Show answer
  1. True — prof verbatim: "Average and complete tend to yield more balanced clusters."
  2. True — single linkage's $\min$ rule means a single near-neighbor pulls a whole cluster in; the result is a chain rather than a compact group.
  3. True — for any sets $A, B$: $\max_{i \in A, j \in B} d(x_i, x_j) \ge \min_{i \in A, j \in B} d(x_i, x_j)$. In particular, when $A$ and $B$ are singletons all three linkages collapse to the same point-to-point distance.
  4. False — all three linkages are computed from the same $\binom{n}{2}$ pairwise distances; complete uses the max, single the min, average the mean. Computational cost is the same.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 12 4 points

On a hierarchical-clustering dendrogram, observations 9 and 2 sit far apart on the horizontal (x) axis but their vertical fusion line meets at a low height. Observations 9 and 7 sit next to each other on the x-axis but only fuse at a much higher height. Which pair is most similar?

Show answer
Correct answer: B

Only the fusion height (vertical axis) carries information. The prof: "Don't assume that 9 and 2 are somehow closer together for some reason. Don't interpret this [horizontal layout] as being distances between things." There are $2^{n-1}$ valid horizontal orderings of the same dendrogram (children can be swapped at every fusion); horizontal proximity is arbitrary.

A is the canonical dendrogram-reading trap. C is overly cautious — fusion height does encode similarity; you just have to read the right axis. D drags in a real concern (linkage choice does shift fusion heights) but irrelevantly: once a single dendrogram is drawn, that linkage is already fixed, and pair similarities are read directly off the y-axis.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 13 3 points

Which optimization criterion does K-means clustering minimize?

Show answer
Correct answer: A

K-means minimizes the within-cluster sum of pairwise squared Euclidean distances, normalized by cluster size (book Eq. 12.17):

$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2.$$

B replaces the sum with a max — that would be a "diameter" objective, not K-means. C is the lasso objective. D is the regression-tree splitting criterion. Each is the right answer to a different question.

Atoms: k-means-clustering.

Question 14 5 points

Mark each statement about the K-means algorithm as true or false.

Show answer
  1. True — both the centroid step (cluster mean is the minimizer of squared deviations) and the assign step (each point moves to its closest centroid) weakly decrease the same nonnegative objective.
  2. True — random init is the prof's headline pitfall: "this random initialization can really screw you." Standard fix: run many times (e.g. nstart = 20) and keep the best.
  3. False — there is no nesting between K-means partitions across $K$. The $K = 4$ and $K = 5$ solutions can be totally unrelated. This is a core contrast with hierarchical clustering, where the dendrogram makes partitions nested by construction.
  4. False — K-means converges to a local minimum only. Brute force is infeasible: there are $K^n$ possible assignments (e.g. $10^{1000}$ for $n = 1000$, $K = 10$).
  5. True — every observation is its own cluster, so the within-cluster pairwise sum is empty and the objective is 0. This is the trivial degenerate solution; you have learned nothing.

Atoms: k-means-clustering, hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 15 5 points

You run K-means with $K = 2$ on three observations in 1-D: $x_1 = 0,\ x_2 = 2,\ x_3 = 8$. The random init assigns $x_1$ and $x_3$ to cluster A and $x_2$ to cluster B. After one centroid step plus one reassignment step, which cluster is each observation in?

Show answer
Correct answer: D

Centroid step: $\bar x_A = (0 + 8)/2 = 4$, $\bar x_B = 2$. Reassignment: distance from $x_1 = 0$ to centroid A is 4, to centroid B is 2 → move to B. $x_2 = 2$ stays in B (distance 0 < 2). $x_3 = 8$ stays in A (distance 4 < 6). Final: $x_1 \in B, x_2 \in B, x_3 \in A$.
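
Replaying the two steps in numpy (a sketch of the hand computation):

```python
import numpy as np

x = np.array([0.0, 2.0, 8.0])
labels = np.array(['A', 'B', 'A'])        # given init: x1, x3 in A; x2 in B

cents = {c: x[labels == c].mean() for c in ('A', 'B')}  # A -> 4.0, B -> 2.0
new = np.array(['A' if abs(v - cents['A']) < abs(v - cents['B']) else 'B'
                for v in x])
print(cents, new)                          # ['B' 'B' 'A']
```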

A misses the reassignment step entirely. C swaps the cluster names, putting $x_3$ alone in B; but the labels follow the centroids, and $x_3$ is nearest the updated centroid of A ($|8 - 4| = 4 < |8 - 2| = 6$), so its singleton keeps label A. B moves $x_3$ rather than $x_1$, which would require $x_3$ to be closer to centroid B = 2 than to centroid A = 4, but $|8 - 2| = 6 > |8 - 4| = 4$.

Atoms: k-means-clustering. Lecture: L27-summary (k-means hand-computation flagged as fair game).

Question 16 4 points

An online retailer wants to cluster shoppers so that customers with similar tastes (the relative mix of products in their basket) end up together, regardless of how much they spend overall. Which dissimilarity measure best fits this goal?

Show answer
Correct answer: C

Correlation distance measures whether the shapes of two profiles match. Two shoppers with identical proportional preferences but different overall volumes have $\rho \approx 1$ → distance ≈ 0. The prof's verbatim example: "Euclidean distance might group together infrequent shoppers — people just don't shop a lot — whereas the correlation distance would find people who have similar preferences."
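
Two hypothetical shoppers make the contrast concrete: identical mix, tenfold volume difference.

```python
import numpy as np

light = np.array([1.0, 2.0, 3.0, 4.0])    # basket counts across 4 categories
heavy = 10 * light                         # same taste profile, 10x the volume

print(1 - np.corrcoef(light, heavy)[0, 1])   # correlation distance ~ 0
print(np.linalg.norm(light - heavy))         # Euclidean distance ~ 49.3
```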

A groups infrequent shoppers together regardless of taste, the canonical wrong choice. B handles units but not the volume-vs-shape distinction; standardizing columns does not make Euclidean distance shape-sensitive. D is the Jaccard-style overlap measure — it captures "did both buy in the same categories" but ignores the proportions, missing the "similar tastes" signal.

Atoms: distance-metrics, hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 17 4 points

Which of the following best distinguishes hierarchical clustering from K-means clustering?

Show answer
Correct answer: A

Hierarchical clustering produces a tree (the dendrogram) you can cut at any height — one fit, all $K$'s available simultaneously. K-means must be re-run for each $K$, and partitions across $K$ are not nested.

B is wrong on both counts — both methods are unsupervised. C invents a non-existent guarantee: hierarchical merges are deterministic, but the result depends on the linkage and dissimilarity choices, and neither method "globally optimizes" anything in a useful sense. D inverts the cost story: hierarchical agglomerative clustering builds and updates a full $n\times n$ dissimilarity matrix (at least $O(n^2)$ memory), while K-means costs only $O(nKp)$ per iteration and is the cheaper option for large $n$.

Atoms: hierarchical-clustering, k-means-clustering.

Question 18 3 points

You suspect that a long, snake-like cluster will form in your data if the wrong linkage is used, and you want to discourage that. Which linkage rule should you choose?

Show answer
Correct answer: C

Single linkage's $\min$ rule is what produces chaining: a single near-neighbor pulls a whole cluster forward one step at a time. Complete linkage uses $\max$, so a candidate merge has to be close to every point in the existing cluster, producing balanced compact clusters. (Average is also fine; it isn't offered here.)

A is the linkage that causes chaining — the opposite of what was asked. B (centroid) can produce dendrogram inversions and is generally not preferred; the prof name-checks it but doesn't recommend it. D mislocates the cause of chaining: it comes from single linkage's $\min$ rule (one near-neighbor pulls the cluster forward), not from the choice of metric. Switching Euclidean for correlation does not fix it; switching $\min$ for $\max$ does.

Atoms: hierarchical-clustering. Lecture: L22-unsupervised-2.

Question 19 5 points ISLP §12 Q4

For a particular dataset, you build two dendrograms — one with single linkage, one with complete linkage. Mark each statement as true or false.

Show answer
  1. True — when the same two clusters fuse, complete linkage uses the max pairwise distance and single uses the min, so complete ≥ single (with equality only in degenerate cases).
  2. True — for singletons, all linkages collapse to the point-to-point distance $d(x_5, x_6)$. Linkage formulas only differ for set-to-set distances.
  3. False — different linkages produce different fusion heights, so the same horizontal cut crosses different numbers of vertical lines, yielding different cluster counts in general.
  4. False — even when the cluster counts coincide, the partitions can differ because the merging order differs.

Atoms: hierarchical-clustering.

Question 20 4 points

A colleague runs K-means once on a dataset with $K = 3$ and reports the final cluster assignments. You re-run the same code and get a different answer. What is the most likely cause and the standard fix?

Show answer
Correct answer: B

K-means converges to a local minimum determined by the random initial assignment. Different seeds → different basins of attraction → different final clusterings. The standard fix is to run many initializations and keep the one with the smallest within-cluster sum of squares.
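
A sketch of the fix in scikit-learn (an assumed illustrative setup; the nstart = 20 mentioned in Q14 is the R analogue): n_init re-runs K-means from many random starts and keeps the solution with the lowest within-cluster sum of squares (inertia_).

```python
# Assumed illustrative setup; n_init is scikit-learn's analogue of R's nstart.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (0, 5, 10)])

one = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X)
best = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(one.inertia_, best.inertia_)   # the 20-restart fit is no worse in practice
```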

A is a real concern in general but does not explain run-to-run variation when the only thing changing is the seed. C addresses the curse of dimensionality, a separate failure mode unrelated to randomness. D is wrong: the objective generically has multiple local minima, and not all are global; picking arbitrarily forfeits the lowest-objective solution.

Atoms: k-means-clustering. Lecture: L22-unsupervised-2.

Question 21 4 points

Among PCA, partial least squares (PLS), lasso, and LDA, which one performs variable selection — i.e. produces a model in which some original predictors $X_j$ have coefficient exactly zero?

Show answer
Correct answer: C

The lasso's $\ell_1$ penalty has corners on the coordinate axes, so the optimum lies on these corners for large enough $\lambda$ — driving some $\hat\beta_j$ to exactly zero. The other three are dimensionality-reduction methods: each new variable $Z_m = \sum_j \phi_{jm} X_j$ is a linear combination of all $X_j$'s; the back-implied $\hat\beta_j = \sum_m \theta_m \phi_{jm}$ is generically nonzero for every $j$.
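
A scikit-learn contrast (illustrative, assumed setup): lasso coefficients hit exact zeros, while PC1's loadings stay dense.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only X1, X2 matter

print(Lasso(alpha=0.1).fit(X, y).coef_)        # exact zeros on X3..X6
print(PCA(n_components=1).fit(X).components_)  # all six loadings nonzero
```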

A: PCA picks max-variance directions of $X$, no variable is dropped. B: PLS uses $Y$ to pick directions but still mixes all variables. D: LDA gives discriminant scores that are linear combinations of all features. None of these zero out individual $X_j$'s.

Atoms: dimensionality-reduction, principal-component-analysis. Lecture: L15-modelsel-4.

Question 22 4 points

Mark each statement about dimensionality reduction as true or false.

Show answer
  1. True — PCA is unsupervised; loadings are eigenvectors of the sample covariance of $X$, with no $Y$ involvement. This is exactly why PCR can fail when the high-variance directions of $X$ are unrelated to $Y$.
  2. True — PLS is the supervised counterpart of PCR. Each direction maximizes $\text{Cov}(Z_m, Y)$ subject to orthogonality.
  3. False — PCA is linear and "cannot do anything non-linear." On a curve, the first two PCs cut through the cloud rather than wrapping along the manifold; clustering or nonlinear dim-reduction is needed.
  4. True — for supervised use, treat $M$ as a tuning parameter and pick the value minimizing CV error on the downstream model. Scree-plot heuristics are for pure visualization, not for PCR.

Atoms: dimensionality-reduction, principal-component-analysis, explained-variance-and-scree-plot. Lecture: L21-unsupervised-1.

Question 23 4 points Exam 2024 P2g

A PCA biplot of a four-variable standardized dataset shows the first two PCs explaining 62% and 25% of the total variance. The arrow for variable $X_2$ points sharply along the PC2 axis with large magnitude; the arrows for $X_1$ and $X_3$ point along PC1 with similar magnitudes; $X_4$'s arrow is short and ambiguous. Which interpretation is best supported?

Show answer
Correct answer: A

Biplot arrows show variable loadings projected onto the first two PCs. Arrow direction = which PC the variable loads on; arrow length = magnitude of those loadings. A short arrow means the variable contributes little to PC1 or PC2 (most of its variance lives in higher PCs).

B drags in the response $Y$, but PCA never sees $Y$; the biplot says nothing about prediction. C confuses an individual variable's loading with the variance the PC explains: 25% is the share PC2 captures across all four variables, not what $X_2$ alone explains. D inverts the geometry — the sample variance of each standardized variable is exactly 1; arrow length reflects how that variance is distributed across PCs, not its absolute magnitude.

Atoms: principal-component-analysis, explained-variance-and-scree-plot.

Question 24 3 points

In which of the following situations is hierarchical clustering most likely to perform worse than K-means?

Show answer
Correct answer: D

Hierarchical clustering imposes nested structure by construction. When the true partitions don't nest — gender vs. nationality is the prof's example — hierarchical clustering forces the data into a tree it doesn't have, and K-means at the appropriate $K$ may recover the structure better. "If you're trying to cluster data that doesn't necessarily have a hierarchical structure to it, then probably bad."

A is precisely where hierarchical shines (the prof's animal-behavior data, taxonomic data). C is a strength, not a weakness — the dendrogram is the visualization. B is also a strength: cutting the dendrogram at any height gives any $K$, no need to commit in advance.

Atoms: hierarchical-clustering, k-means-clustering. Lecture: L22-unsupervised-2.