Module 01 — Introduction

14 questions · 100 points · ~30 min

Question 1 6 points

A quant team fits a deep neural network on intraday Bitcoin tick data. The fund uses the model to place trades the following morning; nobody on the team can describe what the network has learned, and they are content with that as long as the next-day forecast is accurate and well-calibrated. How is this task best classified along the prediction/inference axis?

Correct answer: D

The defining tell is "nobody can describe what the network has learned, and they are content with that as long as the forecast is accurate." That is the prof's canonical prediction setup: "you don't really care if the model is right. That's secondary. Being right is not as important as being able to predict well… with knowing your uncertainty."

A confuses "make decisions" with "infer structure" — quants make decisions on a black-box forecast every day. B inverts the rule: parameter count is irrelevant to the prediction-vs-inference axis; what matters is whether the analyst cares about the form of $f$. C abuses the everyday word "inference" — in this course "inference" specifically means caring about the form of $f$ (which knobs matter, by how much), not "reasoning about the future." Forecasting tomorrow's price under uncertainty is the canonical prediction task, regardless of how downstream decisions feel.

Atoms: prediction-vs-inference, supervised-vs-unsupervised. Lecture: L01-intro.

Question 2 6 points

Researchers fit a linear regression of systolic blood pressure on age, sex, BMI, and smoking status from a Framingham-style cohort. The $R^2$ is small (the predictor cloud is "blob-grade") but the paper is celebrated as seminal. What is the most accurate way to characterise the goal of the analysis?

Correct answer: C

The prof's stock illustration: "It's not that you want to predict your death. You're just trying to change it. You're trying to make decisions based off of a model that gives you an understanding." Goal: identify which knobs (BMI, smoking) move blood pressure so doctors can prescribe action. That is inference.

A reverses the prof's framing — the same regression can serve either goal; "regression $\Rightarrow$ prediction" is not a rule. B misreads the paper: the cohort is studied to identify risk factors at population scale, not to forecast individual patients' future blood pressure. D inverts cause and effect: low $R^2$ is compatible with inference (Framingham), but it does not define a study as inference — a high-$R^2$ study can also be inference if you care about the coefficients.

Atoms: prediction-vs-inference. Lecture: L01-intro.

Question 3 6 points

A large language model is trained on raw text scraped from the internet. No human-written labels are attached to any sentence. According to the prof's framing, is the training task supervised or unsupervised, and why?

Correct answer: A

The prof's distinctive point: "Often what looks like unsupervised is really supervised in disguise. Best example: large language models. The training task is just predict the next word — perfectly supervised, perfectly defined." The label comes from the data itself (next token), not a human, but the loss is fully supervised.

B equates "supervised" with "human-labelled," which is the standard misconception the prof is pushing back against — what makes a task supervised is having a target $y$, regardless of who/what produced it. C describes a different kind of task (clustering documents) and projects it onto LLMs; LLM training is not "discover groups," it is "predict the next word." D invents a third axis the prof explicitly rejected — within this course's vocabulary, predict-next-word is supervised, no third category needed.
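The "label comes from the data itself" point is easy to make concrete. A minimal sketch in plain Python (the toy sentence is invented for the demo) of how raw, unlabelled text yields fully supervised (context, next-token) pairs:

```python
# Raw text with no human-written labels attached.
text = "the training task is just predict the next word"
tokens = text.split()

# Every position supplies a supervised example: the context so far is the
# input, and the token that follows it is the target y.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

print(pairs[0])          # (['the'], 'training')
print(pairs[-1][1])      # word
```

The target $y$ exists for every example, which is all "supervised" requires; no human ever attached it.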

Atoms: supervised-vs-unsupervised. Lecture: L02-statlearn-1.

Question 4 7 points

Which feature most precisely defines the boundary between supervised and unsupervised learning, as the prof framed the distinction in lecture?

Correct answer: D

The prof's verbatim framing: "Importantly, it's supervised in the sense that you know what you want to classify." The defining feature is whether $y$ exists, not how the data looks or how the method is built.

A confuses motivation with definition — high-dim data is a common setting for unsupervised methods (PCA, clustering for visualisation), but supervised methods also handle high-dim data (lasso, neural nets). B mixes up an orthogonal axis (parametric vs non-parametric) with the supervised/unsupervised axis — KNN is non-parametric and supervised; PCA is parametric in spirit and unsupervised. C invents a sample-size threshold the prof never claimed; supervised methods work fine with small $n$ (and unsupervised methods often need more data to find structure honestly).

Atoms: supervised-vs-unsupervised. Lecture: L01-intro.

Question 5 7 points

How does the prof position statistical learning against machine learning and classical statistics?

Correct answer: A

Verbatim: "Statistical learning was a way for statisticians to get in on machine learning… same models as ML but with the statistician's care about uncertainty, bias-variance, distributional assumptions." Plus: classical stats states the model first; statistical learning has the data come first and operates in the misspecified-model regime.

B inflates the field — the prof treats stat learning, ML, and classical stats as overlapping with different emphasis, not as nested sets. C miscasts the contrast — both classical stats and statistical learning use likelihoods and least squares; the real split is "model-first vs data-first" / "well-specified vs misspecified." D collapses the distinction the prof works hard to draw: statistical learning is not just classical stats with a new sticker — it operates in the misspecified-model / data-first regime and inherits algorithms (trees, boosting, neural nets) that classical stats never owned.

Atoms: statistical-learning. Lectures: L01-intro, L02-statlearn-1.

Question 6 7 points Ex2.1

A hospital wants to use patient records (age, prior diagnoses, lab values, vital signs) to label each newly admitted patient as likely-to-be-readmitted-within-30-days or not, so the discharge nurse can decide whether to schedule a follow-up call. Pick the most accurate four-tag description of the task.

Correct answer: B

Walk the four tags. (1) The response is "readmitted within 30 days" — categorical with two levels — so this is classification, not regression. (2) Past records carry the readmission outcome, so the response is observed, hence supervised. (3) The hospital uses the model to flag the next patient, not to interpret which lab value drives readmission, so the goal is prediction. (4) The predictors are the remaining covariates: age, prior diagnoses, lab values, vital signs.

A swaps response and predictors — readmission status is the target, age is just one covariate. C drops the response variable on the floor and reads the task as "find groups," which would mean ignoring the labelled outcomes — a clear waste of supervision. D miscodes the response (the indicator is a binary outcome, not a continuous risk score) and reframes the goal as inference, but the hospital is making per-patient decisions, not investigating which risk factor is causal.

Atoms: supervised-vs-unsupervised, prediction-vs-inference.

Question 7 7 points

Of the four tasks below, which is genuinely unsupervised in the prof's sense?

Correct answer: C

"Group anonymous shoppers with no a priori segment labels" is the textbook setup for unsupervised clustering: only $X$, no $y$, and no objective measure of what "right" looks like. Once you have the clusters you can evaluate them against a downstream supervised task (e.g. do these segments lift recommendation accuracy?), which is the prof's recommended healthy pattern.

A is a regression / time-series prediction task — supervised, with tomorrow's demand as $y$. B is the canonical supervised classification example; the digit labels are the response. D is the prof's running trap example: masked-word prediction looks unsupervised but is "supervised in disguise" — the model is trained to predict tokens it temporarily hides, fully supervised loss.

Atoms: supervised-vs-unsupervised. Lectures: L01-intro, L02-statlearn-1.

Question 8 7 points

The prof flagged one common piece of regression vocabulary as misleading and recommended dropping it. Which term, and why?

Correct answer: A

Verbatim: "I wouldn't typically use the word independent variables. I would say this has more meaning than the others… because most things are not independent." Daylight hours and temperature are correlated; sex and age in a clinical study are correlated; calling them "independent" attaches a property they don't have.

B is a term the prof recommends ("predictors / regressors / covariates / features / variables" are all fine in his lecture). C is also a term he tolerates — feature is fine. D inverts the prof's actual aside on covariates: he noted "covariates often implies time or space" as a stylistic flavour, not as a reason to drop the word.

Atoms: statistical-learning. Lecture: L02-statlearn-1.

Question 9 8 points

Mark each statement about the prediction / inference distinction as true or false.

  1. True — prof's recurring framing: "same model, two uses." The prediction-vs-inference axis is set by the analyst's question, not by which method is fit.
  2. False — for prediction the headline metric is held-out accuracy (test MSE / AUC). p-values and coefficient SEs are inference statistics; you can have a great predictor with no individually significant coefficients (collinearity), or a model with huge $t$-stats but terrible test MSE.
  3. True — the prof's note: "even when prediction isn't your real goal, it's often a good way of evaluating a model." If the inference model can't predict at all, the inferred coefficients are probably noise too.
  4. False — Framingham is the prof's stock counterexample: low $R^2$, "blob-grade" correlation, still a seminal inference paper. $R^2$ is a prediction-quality summary; for inference what matters is whether the slope estimates are interpretable and reasonably stable.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
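Statement 2's point (the prediction scorecard is held-out error, not p-values) can be sketched in a few lines of plain Python. The data, split sizes, and coefficients below are invented for the demo:

```python
import random

random.seed(0)

# Simulated data with a known relationship: y = 2 + 0.5 x + noise.
x = [random.uniform(0, 10) for _ in range(100)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

# Fit on the first 70 points, score on the held-out 30.
x_tr, y_tr = x[:70], y[:70]
x_te, y_te = x[70:], y[70:]

# Closed-form simple-regression coefficients on the training half.
mx = sum(x_tr) / len(x_tr)
my = sum(y_tr) / len(y_tr)
b1 = sum((a - mx) * (b - my) for a, b in zip(x_tr, y_tr)) / sum(
    (a - mx) ** 2 for a in x_tr
)
b0 = my - b1 * mx

# The prediction-side headline number: MSE on data the fit never saw.
test_mse = sum((b0 + b1 * a - b) ** 2 for a, b in zip(x_te, y_te)) / len(x_te)
print(round(test_mse, 2))
```

Nothing in the scorecard mentions standard errors or p-values; for inference you would instead ask how stable and interpretable `b0` and `b1` are.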

Atoms: prediction-vs-inference. Lectures: L01-intro, L02-statlearn-1.

Question 10 8 points

For each method below, mark the statement "this method is a supervised method" as true or false.

  1. False — k-means is unsupervised; only $X$ is used. The "labels" the algorithm produces are cluster ids, not a target it was trained to match.
  2. True — LDA needs the class labels to estimate per-class means and the pooled covariance; without $y$ there is no LDA.
  3. False — PCA uses only the $X$ matrix; no response variable enters the loss. Classic unsupervised dimension-reduction.
  4. True — the LLM trap: the next word in the corpus serves as $y$, so the loss is fully supervised even though no human attached labels. The prof's headline "supervised in disguise" example.

Each sub-statement is scored independently, worth $8/4 = 2$ points.

Atoms: supervised-vs-unsupervised. Lecture: L02-statlearn-1.

Question 11 8 points

Mark each statement about the prof's framing of statistical learning as true or false.

  1. True — the prof's working definition: "our model is misspecified because they're missing a lot of things… and we just assume they're not. And then we try to work from there."
  2. False — data science spans roughly six steps (hypothesise → scrape → structure → model → analyse → communicate); statistical learning owns the bottom three (model, analyse, communicate). Acquisition and cleaning are out of scope.
  3. True — verbatim from L01: "Three umbrella problem types for the whole course: regression, classification, unsupervised."
  4. False — the assumption is $\mathbb{E}[\varepsilon] = 0$ and $\varepsilon$ independent of $X$. Baseline shifts are absorbed by the intercept $\beta_0$ in $f$, not by the noise term.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
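Statement 4's bookkeeping can be written out explicitly: shifting every response by a constant $c$ lands in the intercept, not in the noise term.

```latex
Y = \beta_0 + \beta_1 X + \varepsilon,
  \qquad \mathbb{E}[\varepsilon] = 0,\ \varepsilon \perp X
% add a constant c to every response:
Y + c = (\beta_0 + c) + \beta_1 X + \varepsilon
% the intercept becomes \beta_0 + c; \varepsilon still has mean zero.
```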

Atoms: statistical-learning. Lectures: L01-intro, L02-statlearn-1, L03-statlearn-2.

Question 12 8 points

The prof repeatedly framed unsupervised analysis as "dangerous statistics." Mark each statement about that danger as true or false.

  1. True — verbatim: "Unsupervised methods are going to be more subjective because you're not training for a specific goal… it's very hard to assess results."
  2. False — the prof's exact pet peeve. "All of the statistical arguments for your finding tend to fall apart after you've done all this subjective exploring. But people don't know that and they just lie." The p-values ignore the search you did.
  3. True — the healthy pattern: unsupervised exploration → hypothesis-driven supervised study. Cluster shoppers, then check whether the clusters lift recommendation accuracy. CV / a held-out task closes the loop.
  4. True — verbatim: "If you explore and explore and explore, eventually you will find something no matter what." The point is that finding something is the default outcome, not evidence of structure.

Each sub-statement is scored independently, worth $8/4 = 2$ points.
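Statement 4 ("explore enough and you will find something") is easy to demonstrate in plain Python. Every variable below is pure noise, and the sizes $n = 30$, $p = 100$ are arbitrary choices for the demo:

```python
import random
import statistics

random.seed(1)
n, p = 30, 100

# A response that is pure noise -- there is nothing real to find.
y = [random.gauss(0, 1) for _ in range(n)]

def corr(a, b):
    # Pearson correlation, written out by hand.
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = (sum((u - ma) ** 2 for u in a)
           * sum((v - mb) ** 2 for v in b)) ** 0.5
    return num / den

# "Explore": try 100 noise predictors and keep the strongest |correlation|.
best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n)], y)) for _ in range(p)
)
print(round(best, 2))  # comfortably large despite zero true signal
```

A naive p-value for `best` would ignore the 99 discarded searches, which is exactly the lie the prof is warning about.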

Atoms: supervised-vs-unsupervised, cross-validation. Lecture: L01-intro.

Question 13 7 points

The course previews the bias-variance theme already in module 1, even though the formal decomposition is derived in module 2. Which statement best reflects the prof's preview-level framing of why the bias-variance idea matters at all?

Correct answer: B

The prof's framing in L01–L03: bias-variance is the running theme of the course because it is the lens through which statisticians look at flexible ML models. "Statistical learning was a way for statisticians to get in on machine learning" — bias-variance is one of the main perspectives they brought, and the prof flagged it as guaranteed exam material starting day one.

A invents a feature-selection rule that does not exist; bias-variance is a property of predictions, not individual predictors. C states a one-sided rule the prof explicitly rejects — chasing zero bias is exactly what produces overfitting; the whole point of the trade-off is that you cannot just push bias down and clean up variance later. D miscategorises the result: bias-variance is a decomposition of expected test error at a fixed point, not an asymptotic statement about $\hat\beta$ distributions; it speaks directly about test-set behaviour, which is why the prof flags it as the running theme.
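For reference, the decomposition being previewed (derived formally in module 2) splits expected test error at a fixed point $x_0$ into three pieces:

```latex
\mathbb{E}\big[(Y - \hat f(x_0))^2\big]
  = \mathrm{Bias}\big(\hat f(x_0)\big)^2
  + \mathrm{Var}\big(\hat f(x_0)\big)
  + \sigma^2
```

The $\sigma^2$ term is irreducible, which is why "push bias to zero and clean up later" cannot work: lowering bias by adding flexibility raises the variance term.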

Atoms: statistical-learning, bias-variance-tradeoff. Lectures: L01-intro, L03-statlearn-2.

Question 14 8 points Exam 2025 P1

Read the passage and pick the correct fill-in for the highlighted blank. The passage: "In supervised learning there are two main purposes. In one we want to learn from data and build a model that relates a set of variables to an outcome, and we care which variables matter and how — we want to interpret the coefficients. In the other case we also build a model relating variables to an outcome, but we do not care about the actual model parameters, because we do not want to interpret them." The latter case is best described as:

Correct answer: C

The trap is the prof's classic Q1 hook: the answer is buried in subtle wording, not the obvious word at the top of the paragraph. The phrase "we do not care about the actual model parameters" is the canonical signature of prediction — black-box forecasting where only $\hat Y$ matters.

A inverts the definition: inference is precisely "we care which parameters matter and how to interpret them," which the passage explicitly contrasts with the latter case. B is wrong on the first word of the passage — the setup says "in supervised learning," and unsupervised learning has no response to relate variables to anyway. D confuses an outcome type (continuous-$Y$ problem) with a purpose: regression and classification are problem types defined by the response, while the passage asks which of the two purposes of supervised learning is being described, interpretation or forecasting.

Atoms: prediction-vs-inference, supervised-vs-unsupervised. Lecture: L27-summary (Q1 walkthrough).