K-nearest neighbors (regression)

The regression sibling of knn-classification: at a test point x_0, find its K nearest training points by Euclidean distance and predict the average of their y-values. Same K-as-flexibility story, same bias-variance picture, same curse-of-dimensionality death. The prof uses it as the standing example when explaining how to pick a hyperparameter via cross-validation in M5.

Definition (prof’s framing)

“Recall KNN regression: ŷ(x_0) = (1/K) Σ_{i ∈ N_0} y_i, where N_0 is the set of the K nearest training points. K small → high complexity (jagged, K=1 hits every training point); K large → low complexity (smooth, K = number-of-points = horizontal mean).” - L10-resample-1

The prof literally uses KNN regression as the running model-selection example in M5; “K in KNN, polynomial degree” are the two prototype hyperparameters CV is meant to tune.

Notation & setup

  • Training data (x_1, y_1), …, (x_n, y_n) with x_i ∈ R^p, y_i ∈ R (continuous response).
  • N_0 = index set of the K training points closest to x_0 in Euclidean distance.
  • Loss for tuning: MSE = (1/n) Σ_i (y_i − ŷ_i)² on a held-out / CV set.
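
A minimal sketch of this prediction rule in Python (NumPy only; the helper name knn_predict and the toy numbers are made up for illustration, not from the course materials):

import numpy as np

def knn_predict(X_train, y_train, x0, K):
    # Euclidean distances from the test point x0 to every training point
    dists = np.sqrt(np.sum((X_train - x0) ** 2, axis=1))
    # N_0: indices of the K nearest training points
    N0 = np.argsort(dists)[:K]
    # prediction = average of their y-values
    return y_train[N0].mean()

# toy example: one predictor, five training points
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
print(knn_predict(X_train, y_train, np.array([2.6]), K=3))  # averages y at x = 2, 3, 4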

Formula(s) to know cold

KNN regression prediction (ISLP’s Equation (3.39)):

ŷ(x_0) = (1/K) Σ_{i ∈ N_0} y_i

Validation MSE (e.g. inside a CV loop):

MSE_j = (1/|C_j|) Σ_{i ∈ C_j} (y_i − ŷ^(−j)(x_i))²

where ŷ^(−j) is the KNN fit using all the data except the validation fold C_j. CE1 problem 4a wants this written as 10-fold CV pseudocode.
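
A sketch of the per-fold piece of that formula, using scikit-learn's KNeighborsRegressor as a stand-in KNN implementation (the function name fold_mse is hypothetical):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fold_mse(X_fit, y_fit, X_val, y_val, K):
    # the fit that excludes the validation fold (the "all data except C_j" fit)
    model = KNeighborsRegressor(n_neighbors=K).fit(X_fit, y_fit)
    preds = model.predict(X_val)
    # MSE_j = (1/|C_j|) * sum over the fold of (y_i - prediction_i)^2
    return np.mean((y_val - preds) ** 2)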

Insights & mental models

KNN regression is the same mechanism as KNN classification, with the mean swapping in for the majority vote. Same flexibility knob (K), same trade-off, same fix (CV for K).

The K-flexibility picture, regression flavor (slide in L10-resample-1: KNN fits at a few values of K, e.g. K = 1 and K = 25, on a 61-point dataset, repeated several times):

“1 is going to suck, it’s going to jump to every single point, and then 25 is going to be a lot, because there’s 61 points … you’re using like almost more than a third of the data, so that’s going to be way too smooth.” - L10-resample-1

  • K = 1: the fit passes exactly through the training points → training MSE = 0. Test MSE high (variance).
  • K = n: the same prediction everywhere, a horizontal line at the global mean of y. Pure underfit.
  • Optimal K is intermediate, recovered by CV (Exercise CE1.4a is the canonical procedural drill); the two extremes above are checked numerically right after this list.
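
A quick numerical check of the two extremes (simulated data; the sine curve, the noise level, and the 61-point sample size are assumptions chosen to echo the slide, not the slide's actual data):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 61
X = np.sort(rng.uniform(0, 10, n)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(0, 0.3, n)

for K in (1, n):
    fit = KNeighborsRegressor(n_neighbors=K).fit(X, y)
    print(K, np.mean((y - fit.predict(X)) ** 2))
# K = 1: training MSE = 0.0 -- the fit passes through every training point
# K = 61: every prediction equals the global mean of y -- a horizontal line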

The bias-variance plot for KNN regression over K (Figure 1 of CE1, used in problem 1e): squared bias rises with K (a smoother fit can’t track wiggles → bias), variance falls with K (averaging more points lowers Var), irreducible error is flat, total error is U-shaped, minimum somewhere in the middle. Reading this plot is the exam-relevant skill: identify the U, identify the minimum, name what each curve is.
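
A small simulation sketch of that decomposition (the true function, noise level, test point, and K grid are all assumptions for the demo, not CE1's Figure 1):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

def f(x):                  # assumed "true" regression function
    return np.sin(2 * x)

sigma = 0.3                # irreducible noise sd (assumed)
x0 = 1.5                   # fixed test point
n, reps = 61, 500          # training-set size, number of simulated training sets

for K in (1, 5, 15, 30, 61):
    preds = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(0, np.pi, (n, 1))
        y = f(X.ravel()) + rng.normal(0, sigma, n)
        preds[r] = KNeighborsRegressor(n_neighbors=K).fit(X, y).predict([[x0]])[0]
    bias2 = (preds.mean() - f(x0)) ** 2    # squared bias: rises with K
    var = preds.var()                      # variance: falls with K
    print(f"K={K:2d}  bias^2={bias2:.4f}  var={var:.4f}  total={bias2 + var + sigma**2:.4f}")

The printed totals should trace out the U: largest at K = 1 (variance-dominated), elevated again at K = 61 (bias-dominated), smallest somewhere in between.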

One subtlety the prof flagged (L10-resample-1): in the slide’s plot, “bias” is computed on the fit data, not from the true model. “In other parts, we talk about the bias as how biased it is from the true model, not from the fit data.” In real settings you don’t know the truth, so you have to estimate these quantities from data; that’s why we need resampling.

Why KNN regression is the canonical CV target: it has exactly one tuning parameter (K), the loss (MSE) is unambiguous, it works for any continuous y, and it’s the simplest nonparametric regressor. So in M5 the worked CV example is “tune K in KNN regression by 10-fold CV”: write the loop, compute MSE per fold, average, pick the K at the minimum. (L10-resample-1 / Exercise 5.1.)

The pseudocode template (CE1 problem 4a)

Input: training data (x_i, y_i), candidate K values
Randomly partition the training data into 10 folds C_1, ..., C_10 (one partition, reused for every K)
For each K in the candidate set:
  For j = 1, ..., 10:
    Hold out fold C_j; the remaining nine folds are the training subset.
    For each x_i in C_j: predict ŷ_i = (1/K) * sum of y over the K nearest x's in the training subset.
    Compute MSE_j = (1/|C_j|) * sum_{i in C_j} (y_i - ŷ_i)^2
  CV(K) = (1/10) * sum_j MSE_j
Return the K minimizing CV(K)

Math notation, English, or pseudocode are all acceptable per scope. The 2026 format would not require R syntax.
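
For reference, a runnable sketch of the same loop, using scikit-learn's KFold and KNeighborsRegressor as stand-ins for the hand-written steps (the candidate K grid and the simulated data are arbitrary; this is a sketch, not the CE1 answer key):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor

def choose_K_by_cv(X, y, candidate_Ks, n_folds=10, seed=0):
    # one random partition into folds (fixed by the seed), reused for every K
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    cv_error = {}
    for K in candidate_Ks:
        fold_mses = []
        for fit_idx, val_idx in folds.split(X):
            model = KNeighborsRegressor(n_neighbors=K).fit(X[fit_idx], y[fit_idx])
            preds = model.predict(X[val_idx])                     # predictions on the held-out fold
            fold_mses.append(np.mean((y[val_idx] - preds) ** 2))  # MSE_j
        cv_error[K] = np.mean(fold_mses)                          # CV(K) = average over the 10 folds
    return min(cv_error, key=cv_error.get), cv_error              # K with the smallest CV(K)

# toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, 100)
best_K, cv_error = choose_K_by_cv(X, y, candidate_Ks=range(1, 31))
print(best_K)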

Standardization

Same warning as for KNN classification: Euclidean distance is scale-sensitive. Standardize the predictors (z-score) before fitting KNN. See standardization; KNN regression is on the list of methods that are not scale-invariant.
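
A sketch of the standardize-then-fit order, using scikit-learn's StandardScaler as one way to z-score (toy numbers made up; the point is that the scaling is learned from the training data and applied to any new point before distances are computed):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

# toy data: two predictors on wildly different scales
X_train = np.array([[1500.0, 2.1], [2300.0, 3.4], [1800.0, 1.9], [2100.0, 2.8]])
y_train = np.array([200.0, 340.0, 215.0, 310.0])

scaler = StandardScaler().fit(X_train)            # z-score parameters from the training data only
knn = KNeighborsRegressor(n_neighbors=2).fit(scaler.transform(X_train), y_train)

x_new = np.array([[1900.0, 2.5]])
print(knn.predict(scaler.transform(x_new)))       # distances are now computed on standardized scales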

Pitfalls

  • Same K-direction trap as classification. Small K = more flexible. Easy T/F mistake.
  • No standardization. Distances are dominated by the largest-scale predictor; the results are garbage.
  • No extrapolation. ŷ(x_0) for x_0 outside the training range is the average of the K nearest training points’ y-values, pulled toward whatever boundary points exist, not extrapolated along any trend (see the sketch after this list).
  • Curse of dimensionality. Same as classification: in high p, no point is meaningfully “near” any other. KNN’s mechanism breaks.
  • K must be an integer. Doesn’t matter for the math, but a candidate sweep is over integer values of K (up to at most n).
  • Tie-breaking for “the K nearest”: if multiple points are equidistant, libraries vary; doesn’t affect exam-style hand calculations.
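
A quick illustration of the no-extrapolation pitfall (simulated linear data; the numbers are made up):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, 61)).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(0, 0.5, 61)       # true trend keeps rising

fit = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(fit.predict([[9.5], [20.0], [100.0]]))
# at x = 20 and x = 100 the predictions are identical: the average y of the
# five right-most training points, no matter how far past the data you go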

Scope vs ISLP

  • In scope: the formula, K as flexibility, the bias-variance interpretation, picking K by CV, the use of KNN regression as the running model-selection example in M5.
  • Look up in ISLP: §3.5 (“Comparison of Linear Regression with K-Nearest Neighbors”), the KNN regression formula, the comparison with linear regression, the curse-of-dimensionality discussion. Equation (3.39) is the formula above; Figures 3.16–3.20 show the contrast with linear models in 1D and as p grows.
  • Skip in ISLP: weighted KNN (out of scope). Distance metrics other than Euclidean (M10 distance-metrics discusses options for clustering, but KNN here is Euclidean only).

Exercise instances

  • CE1 problem 4a: write 10-fold CV pseudocode for KNN regression with MSE, including the formula for the validation error. Procedural template.
  • CE1 problem 1e: given a bias-variance plot of KNN regression over K (Figure 1 of CE1), evaluate four T/F statements: (i) decreasing K increases flexibility; (ii) squared bias always contributes the most; (iii) you have enough info to pick K; (iv) for the plotted range, would overfitting or underfitting be preferable. The classic plot-reading + concept-application combo.

How it might appear on the exam

  • Pseudocode question (the prof’s flagged 2026 format from scope §3): “Write out how you’d choose K for KNN regression by 10-fold cross-validation.” Math, English, or pseudocode all OK. CE1 problem 4a is the template.
  • Bias-variance plot reading. Given curves of bias², variance, irreducible error, and total error over K, identify the U-shape minimum, name each curve, decide which side of the U is overfit and which is underfit.
  • Method comparison. “When would KNN regression beat linear regression?” → strongly nonlinear f, low p, lots of data. (When would linear win? → low n, high p, additive structure.)
  • Direction-of-effect T/F. Same family as KNN classification: small K = high flexibility, low bias, high variance.
  • Hand calculation (less common, but plausible): given a small dataset and a test point, compute ŷ by averaging the K nearest y-values (worked micro-example below). Direct analog of Exercise 4.1.
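
A worked micro-example of that hand calculation (the five training points and the test point are made up):

# training points (x, y): (1, 2.0), (2, 3.0), (4, 5.0), (7, 8.0), (9, 9.0); test point x0 = 3, K = 2
# distances from x0 = 3: 2, 1, 1, 4, 6  ->  the two nearest are x = 2 and x = 4
# prediction: (3.0 + 5.0) / 2 = 4.0
import numpy as np
X = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.0, 3.0, 5.0, 8.0, 9.0])
x0, K = 3.0, 2
N0 = np.argsort(np.abs(X - x0))[:K]    # indices of the K nearest training points
print(y[N0].mean())                    # 4.0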