Module 01 — Introduction

14 questions · 100 points · ~30 min

Question 1 6 points

A quant team fits a deep neural network on intraday Bitcoin tick data. The fund uses the model to place trades the following morning; nobody on the team can describe what the network has learned, and they are content with that as long as the next-day forecast is accurate and well-calibrated. How is this task best classified along the prediction/inference axis?

Correct answer: D

The defining tell is "nobody can describe what the network has learned, and they are content with that as long as the forecast is accurate." That is the prof's canonical prediction setup: "you don't really care if the model is right. That's secondary. Being right is not as important as being able to predict well… with knowing your uncertainty."

A confuses "make decisions" with "infer structure" — quants make decisions on a black-box forecast every day. B inverts the rule: parameter count is irrelevant to the prediction-vs-inference axis; what matters is whether the analyst cares about the form of $f$. C abuses the everyday word "inference" — in this course "inference" specifically means caring about the form of $f$ (which knobs matter, by how much), not "reasoning about the future." Forecasting tomorrow's price under uncertainty is the canonical prediction task, regardless of how downstream decisions feel.

Atoms: prediction-vs-inference, supervised-vs-unsupervised. Lecture: L01-intro.

Question 2 6 points

Researchers fit a linear regression of systolic blood pressure on age, sex, BMI, and smoking status from a Framingham-style cohort. The $R^2$ is small (the predictor cloud is "blob-grade") but the paper is celebrated as seminal. What is the most accurate way to characterise the goal of the analysis?

Correct answer: C

The prof's stock illustration: "It's not that you want to predict your death. You're just trying to change it. You're trying to make decisions based off of a model that gives you an understanding." Goal: identify which knobs (BMI, smoking) move blood pressure so doctors can prescribe action. That is inference.

A reverses the prof's framing — the same regression can serve either goal; "regression $\Rightarrow$ prediction" is not a rule. B misreads the paper: the cohort is studied to identify risk factors at population scale, not to forecast individual patients' future blood pressure. D inverts cause and effect: low $R^2$ is compatible with inference (Framingham), but it does not define a study as inference — a high-$R^2$ study can also be inference if you care about the coefficients.

Atoms: prediction-vs-inference. Lecture: L01-intro.

Question 3 6 points

A large language model is trained on raw text scraped from the internet. No human-written labels are attached to any sentence. According to the prof's framing, is the training task supervised or unsupervised, and why?

Correct answer: A

The prof's distinctive point: "Often what looks like unsupervised is really supervised in disguise. Best example: large language models. The training task is just predict the next word — perfectly supervised, perfectly defined." The label comes from the data itself (next token), not a human, but the loss is fully supervised.

B equates "supervised" with "human-labelled," which is the standard misconception the prof is pushing back against — what makes a task supervised is having a target $y$, regardless of who/what produced it. C describes a different kind of task (clustering documents) and projects it onto LLMs; LLM training is not "discover groups," it is "predict the next word." D invents a third axis the prof explicitly rejected — within this course's vocabulary, predict-next-word is supervised, no third category needed.
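The "label comes from the data itself" point is easy to make concrete. A minimal sketch in plain Python (the toy sentence is invented for the demo) of how raw, unlabelled text yields fully supervised (context, next-token) pairs:

```python
# Raw text with no human-written labels attached.
text = "the training task is just predict the next word"
tokens = text.split()

# Every position supplies a supervised example: the context so far is the
# input, and the token that follows it is the target y.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(pairs[0])          # (['the'], 'training')
print(pairs[-1][1])      # word
```

The target $y$ exists for every example, which is all "supervised" requires; no human ever attached it.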

Atoms: supervised-vs-unsupervised. Lecture: L02-statlearn-1.

Question 4 7 points

Which feature most precisely defines the boundary between supervised and unsupervised learning, as the prof framed the distinction in lecture?

Correct answer: D

The prof's verbatim framing: "Importantly, it's supervised in the sense that you know what you want to classify." The defining feature is whether $y$ exists, not how the data looks or how the method is built.

A confuses motivation with definition — high-dim data is a common setting for unsupervised methods (PCA, clustering for visualisation), but supervised methods also handle high-dim data (lasso, neural nets). B mixes up an orthogonal axis (parametric vs non-parametric) with the supervised/unsupervised axis — KNN is non-parametric and supervised; PCA is parametric in spirit and unsupervised. C invents a sample-size threshold the prof never claimed; supervised methods work fine with small $n$ (and unsupervised methods often need more data to find structure honestly).

Atoms: supervised-vs-unsupervised. Lecture: L01-intro.

Question 5 7 points

How does the prof position statistical learning against machine learning and classical statistics?

Correct answer: A

Verbatim: "Statistical learning was a way for statisticians to get in on machine learning… same models as ML but with the statistician's care about uncertainty, bias-variance, distributional assumptions." Plus: classical stats states the model first; statistical learning has the data come first and operates in the misspecified-model regime.

B inflates the field — the prof treats stat learning, ML, and classical stats as overlapping with different emphasis, not as nested sets. C miscasts the contrast — both classical stats and statistical learning use likelihoods and least squares; the real split is "model-first vs data-first" / "well-specified vs misspecified." D collapses the distinction the prof works hard to draw: statistical learning is not just classical stats with a new sticker — it operates in the misspecified-model / data-first regime and inherits algorithms (trees, boosting, neural nets) that classical stats never owned.

Atoms: statistical-learning. Lectures: L01-intro, L02-statlearn-1.

Question 6 7 points Ex2.1

A hospital wants to use patient records (age, prior diagnoses, lab values, vital signs) to label each newly admitted patient as likely-to-be-readmitted-within-30-days or not, so the discharge nurse can decide whether to schedule a follow-up call. Pick the most accurate four-tag description of the task.

Correct answer: B

Walk the four tags. (1) The response is "readmitted within 30 days" — categorical with two levels — so this is classification, not regression. (2) Past records carry the readmission outcome, so the response is observed, hence supervised. (3) The hospital uses the model to flag the next patient, not to interpret which lab value drives readmission, so the goal is prediction. (4) The predictors are the remaining covariates: age, prior diagnoses, lab values, vital signs.

A swaps response and predictors — readmission status is the target, age is just one covariate. C drops the response variable on the floor and reads the task as "find groups," which would mean ignoring the labelled outcomes — a clear waste of supervision. D miscodes the response (the indicator is a binary outcome, not a continuous risk score) and reframes the goal as inference, but the hospital is making per-patient decisions, not investigating which risk factor is causal.

Atoms: supervised-vs-unsupervised, prediction-vs-inference.

Question 7 7 points

Of the four tasks below, which is genuinely unsupervised in the prof's sense?

Correct answer: C

"Group anonymous shoppers with no a priori segment labels" is the textbook setup for unsupervised clustering: only $X$, no $y$, and no objective measure of what "right" looks like. Once you have the clusters you can evaluate them against a downstream supervised task (e.g. do these segments lift recommendation accuracy?), which is the prof's recommended healthy pattern.

A is a regression / time-series prediction task — supervised, with tomorrow's demand as $y$. B is the canonical supervised classification example; the digit labels are the response. D is the prof's running trap example: masked-word prediction looks unsupervised but is "supervised in disguise" — the model is trained to predict tokens it temporarily hides, fully supervised loss.

Atoms: supervised-vs-unsupervised. Lectures: L01-intro, L02-statlearn-1.

Question 8 7 points

The prof flagged one common piece of regression vocabulary as misleading and recommended dropping it. Which term, and why?

Correct answer: A

Verbatim: "I wouldn't typically use the word independent variables. I would say this has more meaning than the others… because most things are not independent." Daylight hours and temperature are correlated; sex and age in a clinical study are correlated; calling them "independent" attaches a property they don't have.

B is a term the prof recommends ("predictors / regressors / covariates / features / variables" are all fine in his lecture). C is also a term he tolerates — feature is fine. D inverts the prof's actual aside on covariates: he noted "covariates often implies time or space" as a stylistic flavour, not as a reason to drop the word.

Atoms: statistical-learning. Lecture: L02-statlearn-1.

Question 9 8 points

Mark each statement about the prediction / inference distinction as true or false.

  1. True — prof's recurring framing: "same model, two uses." The prediction-vs-inference axis is set by the analyst's question, not by which method is fit.
  2. False — for prediction the headline metric is held-out accuracy (test MSE / AUC). p-values and coefficient SEs are inference statistics; you can have a great predictor with no individually significant coefficients (collinearity), or a model with huge $t$-stats but terrible test MSE.
  3. True — the prof's note: "even when prediction isn't your real goal, it's often a good way of evaluating a model." If the inference model can't predict at all, the inferred coefficients are probably noise too.
  4. False — Framingham is the prof's stock counterexample: low $R^2$, "blob-grade" correlation, still a seminal inference paper. $R^2$ is a prediction-quality summary; for inference what matters is whether the slope estimates are interpretable and reasonably stable.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
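Statement 2's point (the prediction scorecard is held-out error, not p-values) can be sketched in a few lines of plain Python. The data, split sizes, and coefficients below are invented for the demo:

```python
import random

random.seed(0)

# Simulated data with a known relationship: y = 2 + 0.5 x + noise.
x = [random.uniform(0, 10) for _ in range(100)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Fit on the first 70 points, score on the held-out 30.
x_tr, y_tr = x[:70], y[:70]
x_te, y_te = x[70:], y[70:]

# Closed-form simple-regression coefficients on the training half.
mx = sum(x_tr) / len(x_tr)
my = sum(y_tr) / len(y_tr)
b1 = sum((a - mx) * (b - my) for a, b in zip(x_tr, y_tr)) / sum(
    (a - mx) ** 2 for a in x_tr
)
b0 = my - b1 * mx

# The prediction-side headline number: MSE on data the fit never saw.
test_mse = sum((b0 + b1 * a - b) ** 2 for a, b in zip(x_te, y_te)) / len(x_te)
print(round(test_mse, 2))
```

Nothing in the scorecard mentions standard errors or p-values; for inference you would instead ask how stable and interpretable `b0` and `b1` are.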

Atoms: prediction-vs-inference. Lectures: L01-intro, L02-statlearn-1.

Question 10 8 points

For each method below, mark the statement "this method is a supervised method" as true or false.

  1. False — k-means is unsupervised; only $X$ is used. The "labels" the algorithm produces are cluster ids, not a target it was trained to match.
  2. True — LDA needs the class labels to estimate per-class means and the pooled covariance; without $y$ there is no LDA.
  3. False — PCA uses only the $X$ matrix; no response variable enters the loss. Classic unsupervised dimension-reduction.
  4. True — the LLM trap: the next word in the corpus serves as $y$, so the loss is fully supervised even though no human attached labels. The prof's headline "supervised in disguise" example.

Each sub-statement is scored independently, worth $8/4 = 2$ points.

Atoms: supervised-vs-unsupervised. Lecture: L02-statlearn-1.

Question 11 8 points

Mark each statement about the prof's framing of statistical learning as true or false.

  1. True — the prof's working definition: "our model is misspecified because they're missing a lot of things… and we just assume they're not. And then we try to work from there."
  2. False — data science spans roughly six steps (hypothesise → scrape → structure → model → analyse → communicate); statistical learning owns the bottom three (model, analyse, communicate). Acquisition and cleaning are out of scope.
  3. True — verbatim from L01: "Three umbrella problem types for the whole course: regression, classification, unsupervised."
  4. False — the assumption is $\mathbb{E}[\varepsilon] = 0$ and $\varepsilon$ independent of $X$. Baseline shifts are absorbed by the intercept $\beta_0$ in $f$, not by the noise term.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
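Statement 4's bookkeeping can be written out explicitly: shifting every response by a constant $c$ lands in the intercept, not in the noise term.

```latex
Y = \beta_0 + \beta_1 X + \varepsilon,
  \qquad \mathbb{E}[\varepsilon] = 0,\ \varepsilon \perp X
% add a constant c to every response:
Y + c = (\beta_0 + c) + \beta_1 X + \varepsilon
% the intercept becomes \beta_0 + c; \varepsilon still has mean zero.
```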

Atoms: statistical-learning. Lectures: L01-intro, L02-statlearn-1, L03-statlearn-2.

Question 12 8 points

The prof repeatedly framed unsupervised analysis as "dangerous statistics." Mark each statement about that danger as true or false.

  1. True — verbatim: "Unsupervised methods are going to be more subjective because you're not training for a specific goal… it's very hard to assess results."
  2. False — the prof's exact pet peeve. "All of the statistical arguments for your finding tend to fall apart after you've done all this subjective exploring. But people don't know that and they just lie." The p-values ignore the search you did.
  3. True — the healthy pattern: unsupervised exploration → hypothesis-driven supervised study. Cluster shoppers, then check whether the clusters lift recommendation accuracy. CV / a held-out task closes the loop.
  4. True — verbatim: "If you explore and explore and explore, eventually you will find something no matter what." The point is that finding something is the default outcome, not evidence of structure.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
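Statement 4 ("explore enough and you will find something") is easy to demonstrate in plain Python. Every variable below is pure noise, and the sizes $n = 30$, $p = 100$ are arbitrary choices for the demo:

```python
import random
import statistics

random.seed(1)
n, p = 30, 100

# A response that is pure noise -- there is nothing real to find.
y = [random.gauss(0, 1) for _ in range(n)]

def corr(a, b):
    # Pearson correlation, written out by hand.
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a)
           * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

# "Explore": try 100 noise predictors and keep the strongest |correlation|.
best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n)], y)) for _ in range(p)
)
print(round(best, 2))  # comfortably large despite zero true signal
```

A naive p-value for `best` would ignore the 99 discarded searches, which is exactly the lie the prof is warning about.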

Atoms: supervised-vs-unsupervised, cross-validation. Lecture: L01-intro.

Question 13 7 points

The course previews the bias-variance theme already in module 1, even though the formal decomposition is derived in module 2. Which statement best reflects the prof's preview-level framing of why the bias-variance idea matters at all?

Correct answer: B

The prof's framing in L01–L03: bias-variance is the running theme of the course because it is the lens through which statisticians look at flexible ML models. "Statistical learning was a way for statisticians to get in on machine learning" — bias-variance is one of the main perspectives they brought, and the prof flagged it as guaranteed exam material starting day one.

A invents a feature-selection rule that does not exist; bias-variance is a property of predictions, not individual predictors. C states a one-sided rule the prof explicitly rejects — chasing zero bias is exactly what produces overfitting; the whole point of the trade-off is that you cannot just push bias down and clean up variance later. D miscategorises the result: bias-variance is a decomposition of expected test error at a fixed point, not an asymptotic statement about $\hat\beta$ distributions; it speaks directly about test-set behaviour, which is why the prof flags it as the running theme.
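For reference, the decomposition being previewed (derived formally in module 2) splits expected test error at a fixed point $x_0$ into three pieces:

```latex
\mathbb{E}\big[(Y - \hat f(x_0))^2\big]
  = \mathrm{Bias}\big(\hat f(x_0)\big)^2
  + \mathrm{Var}\big(\hat f(x_0)\big)
  + \sigma^2
```

The $\sigma^2$ term is irreducible, which is why "push bias to zero and clean up later" cannot work: lowering bias by adding flexibility raises the variance term.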

Atoms: statistical-learning, bias-variance-tradeoff. Lectures: L01-intro, L03-statlearn-2.

Question 14 8 points Exam 2025 P1

Read the passage and pick the correct fill-in for the highlighted blank. The passage: "In supervised learning there are two main purposes. In one we want to learn from data and build a model that relates a set of variables to an outcome, and we care which variables matter and how — we want to interpret the coefficients. In the other case we also build a model relating variables to an outcome, but we do not care about the actual model parameters, because we do not want to interpret them." The latter case is best described as:

Correct answer: C

The trap is the prof's classic Q1 hook: the answer is buried in subtle wording, not the obvious word at the top of the paragraph. The phrase "we do not care about the actual model parameters" is the canonical signature of prediction — black-box forecasting where only $\hat Y$ matters.

A inverts the definition: inference is precisely "we care which parameters matter and how to interpret them," which the passage explicitly contrasts with the latter case. B is wrong on the first word of the passage — the setup says "in supervised learning," and unsupervised learning has no response to relate variables to anyway. D confuses an outcome type (continuous-$Y$ problem) with a purpose: regression and classification are problem types defined by the response, while the passage asks which of the two purposes of supervised learning is being described, interpretation or forecasting.

Atoms: prediction-vs-inference, supervised-vs-unsupervised. Lecture: L27-summary (Q1 walkthrough).