Module 11 — Neural Networks
29 questions · 100 points · ~45 min
A feedforward network has $p$ inputs, one hidden layer of width $M$, and $C$ output units, with biases
on every non-input unit. Which expression equals the total parameter count?
- A $pM + MC + p + C$
- B $(p+1)(M+1)(C+1)$
- C $(p+1)M + (M+1)C$
- D $p(M+1) + M(C+1)$
Show answer
Correct answer: C
Each hidden unit has $p$ input weights plus one bias, so input→hidden contributes $(p+1)M$. Each output unit has $M$ weights from hidden plus one bias, so hidden→output contributes $(M+1)C$. The prof's headline formula.
A starts from the bias-less weight count and adds $p + C$ instead of the correct $M + C$ — the canonical "added the wrong receiving-side count" slip the prof flagged. B multiplies the three layer counts as if they composed, conflating "how many connections in series" with "how many parameters total"; the formula triple-counts every layer. D puts the bias on the sending side instead of the receiving side; the $+1$ lives on whichever layer is actually receiving the bias unit, never on the sender.
Atoms: nn-parameter-count, feedforward-network. Lecture: L23-nnet-1.
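A quick way to sanity-check the formula is to fold the bias into each receiving layer's input count. A minimal sketch in plain Python; the widths below are chosen only for illustration:

```python
def n_params(widths):
    """Parameter count of a fully-connected net with a bias on every
    non-input unit: each receiving layer of width fan_out sees fan_in + 1 inputs."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(widths[:-1], widths[1:]))

# One hidden layer: widths = [p, M, C] gives (p+1)*M + (M+1)*C.
p, M, C = 4, 3, 2  # illustrative values, not from the exam
assert n_params([p, M, C]) == (p + 1) * M + (M + 1) * C
```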
Question 2
4 points
Exam 2025 P2c
A fully-connected feedforward network has 3 inputs, one hidden layer with 4 ReLU neurons, and a single
output unit (regression). Biases on all hidden and output units. How many parameters does the network estimate?
- A $16$
- B $17$
- C $20$
- D $21$
Show answer
Correct answer: D
Hidden layer: each of $4$ neurons has $3$ input weights $+ 1$ bias $= 4$ parameters → $4 \cdot 4 = 16$. Output: $4$ weights from the hidden units $+ 1$ bias $= 5$. Total $= 16 + 5 = 21$.
A counts only the hidden layer ($4 \cdot 4 = 16$) and forgets the output unit entirely. B drops the hidden-layer biases but keeps everything else ($3 \cdot 4 + 4 + 1 = 17$). C drops only the output bias ($4 \cdot 4 + 4 = 20$). The bias-omission distractors are the canonical trap the prof flagged.
Atoms: nn-parameter-count. Lecture: L27-summary (Q3b walkthrough).
Question 3
4 points
Exam 2023 P5e
A fully-connected feedforward network for 5-class classification has input dimension $128$, two hidden
layers of widths $32$ and $64$ (in that order), and a softmax output layer. Biases everywhere; dropout
of $20\%$ is applied in each hidden layer during training. How many parameters are estimated?
- A $128 \cdot 32 + 32 \cdot 64 + 64 \cdot 5 = 6464$
- B $129 \cdot 32 + 33 \cdot 64 + 65 \cdot 5 = 6565$
- C $0.8 \cdot (129 \cdot 32 + 33 \cdot 64 + 65 \cdot 5) = 5252$
- D $129 \cdot 32 + 32 \cdot 64 + 64 \cdot 5 = 6496$
Show answer
Correct answer: B
Layer-by-layer with biases on every receiving layer: $128\to 32$ has $(128+1) \cdot 32 = 4128$; $32 \to 64$ has $(32+1) \cdot 64 = 2112$; $64 \to 5$ softmax has $(64+1) \cdot 5 = 325$. Sum $= 4128 + 2112 + 325 = 6565$.
A drops every bias — the canonical wrong answer the prof said students always pick. C applies the dropout rate to the parameter count; dropout zeros activations during training, it does not delete parameters, so all weights are still estimated. D drops the biases on the second and third layers but keeps them on the first.
Atoms: nn-parameter-count, nn-regularization. Lecture: L27-summary.
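If Keras is available, `model.count_params()` reproduces the $6565$ figure and makes the dropout point concrete: the `Dropout` layers add no parameters. A sketch, assuming TensorFlow/Keras:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # no parameters
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # no parameters
    tf.keras.layers.Dense(5, activation="softmax"),
])
print(model.count_params())         # 6565 = 129*32 + 33*64 + 65*5
```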
Question 4
4 points
Exam 2025 P2c(ii)
A single hidden ReLU neuron receives inputs $x_1 = -1,\ x_2 = 2,\ x_3 = 0$ with weights
$w_1 = 2,\ w_2 = 0,\ w_3 = 1$ and bias $b = 0$. What is the neuron's output?
Show answer
Correct answer: D
Pre-activation $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = (-1)(2) + (2)(0) + (0)(1) + 0 = -2$. ReLU clamps negatives to zero: $\max(0, -2) = 0$.
A reports the raw pre-activation, forgetting to apply ReLU. B uses the absolute value of the pre-activation, ignoring that ReLU is one-sided. C flips the sign of $x_1$ (treats it as $+1$), which makes the pre-activation positive, so ReLU passes it through unchanged. The canonical trap is forgetting that ReLU outputs exactly zero whenever the pre-activation is negative.
Atoms: activation-functions, feedforward-network. Lecture: L27-summary.
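A three-line NumPy check of this arithmetic, using the values from the question:

```python
import numpy as np

x = np.array([-1.0, 2.0, 0.0])
w = np.array([2.0, 0.0, 1.0])
b = 0.0
z = w @ x + b              # pre-activation: -2.0
print(np.maximum(0.0, z))  # ReLU output: 0.0
```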
A single ReLU hidden neuron has weights $w = (0.3,\ -0.7,\ 0.5,\ 0.2)$, bias $b = 0.4$, and
inputs $x = (2,\ 1,\ -1,\ 3)$. What is the output of the neuron?
- A $0.4$
- B $0$
- C $1.0$
- D $-0.4$
Show answer
Correct answer: A
Dot product: $(0.3)(2) + (-0.7)(1) + (0.5)(-1) + (0.2)(3) = 0.6 - 0.7 - 0.5 + 0.6 = 0.0$. Add bias: $0.0 + 0.4 = 0.4$. ReLU is the identity on positives: $\max(0, 0.4) = 0.4$.
B is the "forgot the bias" trap: $w \cdot x = 0.0$, and ReLU of zero is zero. C flips the sign of the third term (reads $(0.5)(-1)$ as $+0.5$) and drops the bias, landing on $1.0$. D subtracts the bias instead of adding it ($0.0 - 0.4 = -0.4$) and then skips the ReLU clamp — a negative pre-activation would output $0$, and the true pre-activation here is positive anyway.
Atoms: activation-functions, feedforward-network. Lecture: L24-nnet-2.
Question 6
3 points
Exam 2025 P2g
Why are nonlinear activation functions necessary in a feedforward neural network?
- A They reduce the number of parameters the model has to estimate.
- B They make the network fully connected between consecutive layers.
- C They let the network represent complex nonlinear functions of the input.
- D They guarantee convergence of stochastic gradient descent to the global optimum.
Show answer
Correct answer: C
The prof's verbatim point: a linear sum of linear things is linear, so without a nonlinear hidden activation the entire stacked network collapses to ordinary linear regression. Nonlinearity is what gives the network expressive power and makes the universal-approximation result possible.
A is wrong: activation choice is independent of parameter count — swapping ReLU for sigmoid does not change how many weights you estimate. B confuses activation with topology — full connectivity is a separate architectural choice. D is wrong: gradient descent on a non-convex NN loss is not guaranteed to find the global optimum, regardless of activation. The prof flagged the "reduces parameters" and "makes fully connected" answers as canonical wrong picks.
Atoms: activation-functions, universal-approximation. Lecture: L23-nnet-1.
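The "linear sum of linear things is linear" point can be seen numerically: with identity activations, two stacked layers are exactly one linear map. A small NumPy sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)  # input(3) -> hidden(5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)  # hidden(5) -> output(2)
x = rng.normal(size=3)

two_linear_layers = W2 @ (W1 @ x + b1) + b2
one_linear_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)     # same map, fewer parameters
assert np.allclose(two_linear_layers, one_linear_layer)
```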
Mark each statement about activation functions as true or false.
Show answer
- True — sigmoid + BCE is the natural pairing because BCE is the negative log-likelihood of a Bernoulli model, exactly the GLM identification the prof emphasised.
- False — a stack of linear layers composes to a single linear function. Linear hidden activation collapses the network to linear regression regardless of width.
- True — softmax normalises by $\sum_k e^{z_k}$ so the outputs lie on the probability simplex by construction.
- False — ReLU is $\max(0, z)$: zero on the negative side, identity on the positive side. Distinct from $|z|$.
Atoms: activation-functions, nn-loss-functions, feedforward-network.
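Two of the statements above are easy to verify numerically — softmax outputs sum to one, and ReLU differs from $|z|$ on the negative side. A quick NumPy check:

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])
softmax = np.exp(z) / np.exp(z).sum()
print(softmax.sum())                      # 1.0 (up to rounding): on the simplex

neg = np.array([-3.0])
print(np.maximum(0.0, neg), np.abs(neg))  # [0.] vs [3.]: ReLU is not |z|
```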
A network for $C = 10$-class image classification (e.g. CIFAR-10) is being designed. Which output-activation
+ loss pairing is appropriate?
- A Linear output with mean-squared-error loss.
- B Sigmoid output with binary cross-entropy loss.
- C ReLU output with mean-absolute-error loss.
- D Softmax output with categorical cross-entropy loss.
Show answer
Correct answer: D
Multi-class output with one-hot labels uses softmax across $C$ output nodes (probabilities summing to 1) paired with categorical cross-entropy (the negative log-likelihood of a multinomial). This matches the prof's slide-deck pairing table.
A treats the class label as a continuous response — wrong family of loss; gives nonsensical gradients for classification. B is the binary pairing and forces class probabilities into a single number, which can't represent 10 mutually-exclusive classes. C mismatches an output-only activation (ReLU is for hidden layers) with a regression loss, breaking the classification interpretation entirely.
Atoms: nn-loss-functions, activation-functions, convolutional-neural-network.
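In Keras the pairing from option D is spelled out at compile time. A sketch, assuming TensorFlow/Keras and one-hot labels (use `sparse_categorical_crossentropy` for integer labels); the architecture here is just a stand-in:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # C = 10 class probabilities
])
model.compile(optimizer="sgd", loss="categorical_crossentropy",
              metrics=["accuracy"])
```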
Given the formula $\hat y(\mathbf x) = \beta_0 + \sum_{m=1}^5 \beta_m \cdot \max\!\left(\alpha_{0m} + \sum_{j=1}^{10} \alpha_{jm} x_j, 0\right)$,
identify the architecture and count its parameters.
- A 1 hidden layer, ReLU activation, linear output; $50$ parameters.
- B 1 hidden layer, ReLU activation, linear output; $55$ parameters.
- C 1 hidden layer, ReLU activation, linear output; $61$ parameters.
- D 2 hidden layers, sigmoid activation, linear output; $61$ parameters.
Show answer
Correct answer: C
The $\max(\cdot, 0)$ identifies ReLU; the outer sum has no nonlinearity so the output activation is linear; one level of nesting means one hidden layer. With $p = 10,\ M = 5,\ C = 1$: parameters $= (10+1) \cdot 5 + (5+1) \cdot 1 = 55 + 6 = 61$.
A counts only the hidden-layer weights ($10 \cdot 5 = 50$), dropping all biases and the output weights. B counts every weight but no bias ($10 \cdot 5 + 5 = 55$) — equivalently, the hidden layer alone with its biases, $(10+1)\cdot 5$. D misreads $\max(\cdot, 0)$ as a sigmoid and counts an extra layer the formula does not contain — there is only one level of $\max$ nesting.
Atoms: nn-parameter-count, feedforward-network, activation-functions.
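The formula can be typed almost verbatim in NumPy, which also makes the $61$-parameter tally explicit. A sketch with placeholder coefficient values:

```python
import numpy as np

p, M = 10, 5
alpha0 = np.zeros(M)       # hidden biases:   5 parameters
alpha = np.zeros((p, M))   # hidden weights: 50 parameters
beta0 = 0.0                # output bias:     1 parameter
beta = np.zeros(M)         # output weights:  5 parameters

def y_hat(x):
    hidden = np.maximum(alpha0 + x @ alpha, 0.0)  # ReLU hidden layer
    return beta0 + hidden @ beta                  # linear output

print(alpha0.size + alpha.size + 1 + beta.size)   # 61
```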
Which of the following is a sufficient condition for a feedforward neural network to satisfy the
universal-approximation property (Borel-measurable functions, arbitrary $\varepsilon$)?
- A A skip connection from the input layer to the output, an identity hidden activation, and arbitrarily large width.
- B Linear output, a squashing (nonlinear) hidden activation, sufficient width.
- C Recurrent connections from the output back to the input layer of the network.
- D Sigmoid output, sigmoid hidden activation, with exactly two hidden layers stacked.
Show answer
Correct answer: B
The theorem requires (i) a linear output layer, (ii) at least one hidden layer with a "squashing" / nonlinear activation, and (iii) enough hidden units. Depth is not required — one wide hidden layer suffices.
A combines a skip connection (which UAT does not require and is out of scope here) with an identity hidden activation, so the network collapses to a linear function regardless of width — no squashing means no universal approximation. C invokes recurrence, which is unrelated to UAT (recurrent networks add a different capability). D over-specifies with sigmoid output and a fixed depth — the theorem is silent on output activation choice and does not require a particular depth.
Atoms: universal-approximation, activation-functions. Lecture: L23-nnet-1.
Mark each statement about the universal-approximation theorem as true or false.
Show answer
- True — that's the existence statement.
- False — UAT is purely an existence result. It says nothing about width bounds, optimisation, or whether SGD will find the weights.
- True — depth and width trade off; modern practice favours depth.
- False — the proof requires measure theory and the prof explicitly excluded it: "we don't talk about that in this class."
Atoms: universal-approximation. Lecture: L23-nnet-1.
The mini-batch SGD update for parameters $\boldsymbol\theta$ with learning rate $\lambda$ uses a
gradient estimated from a random subset of $m \ll N$ samples. Which statement best describes mini-batch SGD?
- A Its gradient estimate is biased, so it converges to a different minimum than full-batch GD does.
- B It is slower per step than full-batch GD because it has to randomly sample the data each iteration.
- C Its gradient is unbiased and the noise from a small batch provides implicit L2 regularisation.
- D It is mathematically equivalent to full-batch GD whenever the batch size is a power of two units.
Show answer
Correct answer: C
A random mini-batch is an unbiased estimator of the full gradient (same expectation, larger variance). The prof's headline NN fact: in the over-parameterised regime mini-batch SGD picks the minimum-norm interpolator among the infinitely many zero-loss solutions — that is the implicit L2 regularisation, "and it has been proven."
A confuses noise with bias — mini-batch gradients are noisy but unbiased. B reverses the speed claim: small batches are faster per step (less computation, parallelisable). D conflates a hardware convention (powers of two for GPU efficiency) with a mathematical equivalence; batch size has nothing to do with statistical equivalence to full-batch GD.
Atoms: gradient-descent-and-sgd, regularization. Lecture: L23-nnet-1.
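The unbiasedness claim can be checked empirically: average many random mini-batch gradients and they converge to the full-batch gradient. A sketch for squared-error loss on a linear model with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
theta = np.zeros(3)

def grad(idx):
    """Gradient of mean squared error over the rows in idx."""
    r = X[idx] @ theta - y[idx]
    return 2 * X[idx].T @ r / len(idx)

full = grad(np.arange(len(y)))
minibatch_mean = np.mean(
    [grad(rng.choice(len(y), size=32, replace=False)) for _ in range(5000)],
    axis=0)
print(full, minibatch_mean)  # nearly identical: unbiased, just noisier per batch
```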
Mark each statement about gradient descent and mini-batch SGD as true or false.
Show answer
- False — direction reversed. The prof's words: "$\eta = 2$ is horrible, bouncing everywhere; $\eta \approx 0.1$ is fine." Too-large learning rates overshoot.
- True — bigger batch → less noise → less implicit regularisation. The §4b direction-of-effect row in the exam analysis.
- True — verbatim prof framing: "powers of two because that's what you always do in machine learning, hardware-efficiency."
- False — NN losses are highly non-convex; the prof noted that the local minima are forgivingly connected, but they are not unique global minima.
Atoms: gradient-descent-and-sgd. Lecture: L24-nnet-2.
Backpropagation is the standard way to compute gradients for training a feedforward network. Which
of the following best describes what backpropagation does?
- A It performs the parameter update by stepping in the direction of the negative gradient and shrinking the step size with a learning-rate schedule.
- B It computes the loss on a held-out validation set after each epoch to detect overfitting and decide when to stop training.
- C It is a specialised second-order optimiser that replaces stochastic gradient descent and is the default in modern deep-learning libraries.
- D It applies the chain rule efficiently by reusing intermediate values stored during the forward pass.
Show answer
Correct answer: D
Backprop is the chain rule, organised so that pre-activations and activations stored during the forward pass get reused on the way back. That makes one forward + one backward pass enough to compute the gradient with respect to every parameter — the algorithmic breakthrough that pulled NNs out of the AI winter.
A describes the gradient-descent update, not how the gradient is computed; backprop is the calculator, SGD is the updater. B confuses backprop with validation/early-stopping. C reverses the relationship: backprop computes the gradient, and an optimiser (SGD, Adam, …) consumes it; backprop does not replace optimisers.
Atoms: backpropagation, gradient-descent-and-sgd. Lecture: L24-nnet-2.
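A minimal sketch of the idea for one hidden ReLU layer and squared-error loss: the forward pass stores the pre-activation and activation, and the backward pass reuses them via the chain rule (generic notation, not the prof's exact symbols):

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass: store intermediates for reuse.
    z1 = W1 @ x + b1              # hidden pre-activation
    a1 = np.maximum(z1, 0.0)      # hidden activation (ReLU)
    y_hat = W2 @ a1 + b2          # linear output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule, reusing a1 and z1 from the forward pass.
    delta_out = y_hat - y                      # dLoss/dy_hat
    grad_W2 = np.outer(delta_out, a1)
    grad_b2 = delta_out
    delta_hid = (W2.T @ delta_out) * (z1 > 0)  # ReLU derivative is 0 or 1
    grad_W1 = np.outer(delta_hid, x)
    grad_b1 = delta_hid
    return loss, (grad_W1, grad_b1, grad_W2, grad_b2)
```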
Mark each statement about backpropagation as true or false.
Show answer
- True — the prof: "if you had loops you're kind of screwed". The acyclic structure is what lets the backward sweep work.
- True — $\delta^{\text{hid}} = \delta^{\text{out}} \cdot \beta \cdot g'(v)$. For ReLU, $g'(v)$ is 0 or 1, which is what makes the gradient computation cheap.
- False — recurrent networks need a specialised variant ("backpropagation through time", out of scope). The basic algorithm needs modification because the hidden state at $t$ depends on itself at $t-1$.
Atoms: backpropagation, recurrent-neural-network.
Exercise 11.1d: you have a feedforward network with $10\,000$ weights but only $1\,000$ training observations.
Which of the following is the strongest single argument that the model can still generalise well?
- A The universal-approximation theorem guarantees that, with enough hidden units, the trained network attains low test error on any Borel-measurable target.
- B The bias-variance trade-off implies that adding more parameters monotonically reduces test error, since extra flexibility always lowers bias more than it raises variance.
- C Parameter count is irrelevant for generalisation; only the choice of activation function (ReLU vs sigmoid vs tanh) controls how well a trained NN extrapolates to unseen data.
- D Mini-batch SGD with regularisation (L1/L2, dropout, early stopping, augmentation) implicitly selects a low-norm interpolator that generalises.
Show answer
Correct answer: D
The textbook answer to "$10\,000$ weights, $1\,000$ samples" is regularisation — explicit (L1/L2, dropout, early stopping, data augmentation, label smoothing, transfer learning) and implicit (mini-batch SGD picks the minimum-norm interpolator, the double-descent mechanism). The prof's iron rule: never train an NN without regularisation.
A confuses existence (UAT) with generalisation: UAT promises a network exists that approximates any function on training data, but says nothing about test performance. B inverts the trade-off — more flexibility raises variance and typically hurts test error in the under-parameterised regime. C ignores parameter count entirely; choice of activation does not by itself prevent overfitting.
Atoms: nn-regularization, double-descent, gradient-descent-and-sgd.
Mark each statement about NN regularisation as true or false.
Show answer
- True — the Goodfellow definition the prof read aloud: "a modification intended to reduce its generalisation error but not its training error." It can hurt training fit by design.
- False — direction reversed. The prof's words: "$20\%$ is very common, never use $50\%$." Fifty percent is too aggressive.
- True — dropout is a training-time noise injection; at inference you evaluate the intact network (or scale outputs appropriately).
- False — that's L1 (lasso-style sparsity). L2 (ridge-style) shrinks weights smoothly toward zero but rarely produces exact zeros.
- True — that is the definition of early stopping. The prof confessed it "feels like cheating", but it is the most commonly used regulariser.
Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.
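The training-only nature of dropout is visible directly in Keras: the layer only zeroes activations when called with `training=True`. A sketch, assuming TensorFlow/Keras:

```python
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.2)  # the commonly recommended 20% rate
x = tf.ones((1, 10))

print(drop(x, training=True))   # roughly 20% of entries zeroed, the rest scaled up
print(drop(x, training=False))  # identity: the intact network is used at inference
```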
Question 18
3 points
Exam 2024 P1
In the prof's framing of regularisation across modules, tree-based models perform shrinkage-type
regularisation via tree pruning. Which of the following lists is the analogous menu for neural networks?
- A Bagging, random forests, gradient boosting, and post-hoc cost-complexity pruning of classification trees.
- B Forward stepwise, backward stepwise, best-subset selection, AIC, BIC, and adjusted-$R^2$ for choosing the predictor set.
- C Data augmentation, label smoothing, dropout, and early stopping (plus L1/L2 weight decay).
- D One-hot encoding of categorical inputs, batch normalisation between layers, and tuning of advanced optimisers like Adam or RMSProp.
Show answer
Correct answer: C
The 2024 exam answer key gives this menu verbatim: data augmentation, label smoothing, dropout, early stopping (the slide bracket); plus the explicit weight-penalty options L1/L2 from the lecture. Together these are the prof's regularisation toolbox for NNs.
A lists ensembling tools — those are regularisation, but they belong to the tree side of the analogy, not the NN side. B is model-selection on linear models (information criteria, stepwise), unrelated to NN training tricks. D mixes encoding (a preprocessing step, not regularisation) with batch normalisation and optimisers — both of which are explicitly out of scope per the prof's statements about "advanced optimizers" and "batch normalisation".
Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.
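Most of the menu in option C maps onto one or two Keras lines each. A sketch, assuming TensorFlow/Keras; layer sizes and hyperparameter values are illustrative only:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-3)),  # L2 weight decay
    tf.keras.layers.Dropout(0.2),                                              # dropout
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)       # early stopping
# model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop], ...)
```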
For the Boston Housing regression problem (13 numerical predictors, continuous response, 506 observations),
Exercise 11.3 trains a Keras feedforward network with two ReLU hidden layers (64 → 32 → 1, linear output).
What preprocessing step is essential before fitting, and why?
- A One-hot encode every numerical predictor as if it were categorical, so the network learns a separate weight per observed value of each variable.
- B Standardise each predictor (subtract its training-set mean, divide by its training-set sd) so no single variable dominates the gradient by scale.
- C Apply principal-component analysis to the 13 predictors and feed the network only the top 5 components, discarding the remaining variance to control flexibility.
- D Convert the continuous response into one-hot indicator vectors of fixed-width housing-price bins and fit the network as a multi-class classifier instead.
Show answer
Correct answer: B
The prof's words on this exact exercise: "you don't want one variable to basically suck up all the variance, just like in PCA." Standardisation is mandatory before NNs, ridge, lasso, PCA, k-means, hierarchical clustering, and KNN — every method that is not scale-invariant.
A treats continuous predictors as categorical and explodes the input dimension nonsensically. C is allowable but discards information and is not the standard preprocessing for an NN regression; the exercise solution does not call for it. D destroys the regression target — Boston housing is a continuous-response regression problem, not a classification problem.
Atoms: standardization, feedforward-network. Lecture: L24-nnet-2.
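The standard recipe, e.g. with scikit-learn: fit the scaler on the training split only and apply the same transform to the test split, so no test information leaks into the preprocessing. A sketch with hypothetical array names:

```python
from sklearn.preprocessing import StandardScaler

# X_train, X_test: hypothetical train/test predictor matrices
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/sd from training data only
X_test_std = scaler.transform(X_test)        # reuse the training mean/sd
```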
Mark each statement about CNNs as true or false.
Show answer
- True — the prof's headline: "CNN is just a neural network… still no loops backwards. So backprop drops in." The trick is weight sharing across spatial locations.
- False — that's the pre-CNN classical-vision approach. CNNs learn the filter weights from data, which is the algorithmic shift LeCun introduced.
- True — the canonical conv → ReLU → max-pool block in Exercise 11.4.
- True — same image, perturbed pose, same label — directly applicable to CNNs and a core part of Exercise 11.4.2.
Atoms: convolutional-neural-network, nn-regularization, backpropagation. Lecture: L24-nnet-2.
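The "canonical block" named in the third statement looks like this in Keras (filter count and input shape are illustrative, not the Exercise 11.4 values):

```python
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # conv + ReLU, shared filter weights
    tf.keras.layers.MaxPooling2D((2, 2)),                   # max-pool
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```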
Question 21
4 points
Ex11.4.1b
After training a CIFAR-10 CNN, the test confusion matrix shows $7\,200$ correctly classified and
$2\,800$ misclassified out of $10\,000$ test images. What is the misclassification rate?
- A $0.10$
- B $0.28$
- C $0.39$
- D $0.72$
Show answer
Correct answer: B
Misclassification rate $= \frac{\text{incorrect}}{\text{total}} = \frac{2800}{10000} = 0.28$.
A is a round-number trap unrelated to the data. C divides incorrect by correct ($2800 / 7200 \approx 0.39$) — that's the odds, not the rate. D reports the accuracy ($7200/10000 = 0.72$), which is one minus the misclassification rate. The trap is mixing up rate, accuracy, and odds.
Atoms: convolutional-neural-network.
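The same arithmetic, plus the usual idiom for reading the rate off a full confusion matrix:

```python
correct, total = 7200, 10000
print((total - correct) / total)  # 0.28 misclassification rate

# With a full confusion matrix cm (a NumPy array), the equivalent idiom is
# error_rate = 1.0 - np.trace(cm) / cm.sum()
```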
A practitioner has a sequential dataset (e.g. NYSE daily volume, returns, volatility for the past
$L$ days, predicting tomorrow's volume). Which architecture is the most appropriate match for the
sequential structure within the scope of this course?
- A A recurrent network that propagates a hidden state $A_t$ across the sequence with shared weights $W, U, B$.
- B A standard 2-D convolutional network (CNN with $3 \times 3$ filters).
- C An ordinary feedforward network with one large hidden layer trained on each day independently.
- D Principal component analysis followed by linear regression on the leading components.
Show answer
Correct answer: A
RNNs are designed exactly for sequential data: the hidden state $A_t = \sigma(b + W X_t + U A_{t-1})$ carries information forward through the sequence, and the weights $W, U, B$ are shared across positions (the prof's emphasised weight-sharing point).
B is for 2-D image-like spatial structure; the NYSE setup is one-dimensional and sequential, not spatial. C ignores ordering by training each day in isolation — exactly what the prof flagged as the limitation an RNN is meant to overcome ("if there was no order, there would be no obvious ordering"). D is unsupervised dimensionality reduction; it does not address sequence structure at all.
Atoms: recurrent-neural-network, feedforward-network. Lecture: L26-nnet-3.
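The recurrence in option A is a short loop: the same $W$, $U$, $b$ are applied at every time step, and only $A_t$ changes. A NumPy sketch with illustrative dimensions and tanh as a stand-in activation:

```python
import numpy as np

rng = np.random.default_rng(2)
L, p, h = 20, 3, 8                     # sequence length, input dim, hidden width
X = rng.normal(size=(L, p))            # e.g. (volume, return, volatility) per day
W, U, b = rng.normal(size=(h, p)), rng.normal(size=(h, h)), np.zeros(h)

A = np.zeros(h)                        # hidden state A_0
for t in range(L):
    A = np.tanh(b + W @ X[t] + U @ A)  # shared weights at every step
# A now summarises the whole sequence; a linear readout of A predicts tomorrow's volume.
```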
Mark each statement about RNNs as true or false.
Show answer
- True — verbatim from the prof: "we don't have… the weights changing every time" — same $W, U, B$ at every step is what keeps training tractable.
- True — the prof: "if $L$ is a billion, even the first thing in your sequence is affecting the billionth output." (Information may degrade in practice — the motivation for LSTMs/attention, out of scope here.)
- False — RNNs require a specialised variant ("backpropagation through time") because the recurrence creates an implicit loop in the dependency graph. BPTT itself is out of scope; the answer required is just "no, not unmodified".
Atoms: recurrent-neural-network, backpropagation. Lecture: L26-nnet-3.
Question 24
4 points
Exam 2024 P3b
Consider an additive-error regression model $Y_i = f_\theta(X_i) + \varepsilon_i$ with $\varepsilon_i \sim
\mathcal N(0, \sigma^2)$, i.i.d. The log-likelihood as a function of $\theta$ is
$\ell(\theta) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_i (y_i - f_\theta(x_i))^2$.
Which conclusion follows directly?
- A Maximising $\ell(\theta)$ in $\theta$ is equivalent to minimising $\sum_i (y_i - f_\theta(x_i))^2$, the least-squares objective.
- B The MLE of $\theta$ depends on the noise variance $\sigma^2$ through the multiplicative factor in front of the sum, so changing $\sigma^2$ shifts the argmin.
- C Maximising $\ell(\theta)$ in $\theta$ is equivalent to maximising the residual sum of squares $\sum_i (y_i - f_\theta(x_i))^2$.
- D Without further distributional assumptions on $\varepsilon_i$, MLE and ordinary least squares give different estimates of $\theta$ even under Gaussian noise.
Show answer
Correct answer: A
The first term $-\tfrac{n}{2}\log(2\pi\sigma^2)$ does not depend on $\theta$. The second term contains $\theta$ only through $\sum_i (y_i - f_\theta(x_i))^2$, with a negative sign and a positive coefficient $\tfrac{1}{2\sigma^2}$. So maximising $\ell$ in $\theta$ is the same as minimising the residual sum of squares. Under additive Gaussian noise, MLE = LS — the mathy 2024 question the prof flagged as a likely template.
B is wrong: $\sigma^2$ enters as a positive multiplicative constant in front of $\sum (y - f)^2$, so it scales the objective but does not change the argmin in $\theta$ — the MLE depends on the residuals through this very sum. C reverses the sign — maximising the log-likelihood means minimising, not maximising, the SSE. D is incorrect: under Gaussian errors with constant $\sigma^2$, the two estimators coincide; the equivalence is the whole point of this derivation.
Atoms: nn-loss-functions, least-squares-and-mle. Lecture: L27-summary (mathy question).
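The $\sigma^2$-invariance of the argmax is easy to confirm numerically: over a grid of candidate $\theta$ values, the log-likelihood and the negative SSE peak at the same place whatever $\sigma^2$ is. A sketch for the one-parameter model $f_\theta(x) = \theta x$ on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.7 * x + rng.normal(scale=0.5, size=200)

thetas = np.linspace(0.0, 3.0, 601)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])

for sigma2 in (0.1, 1.0, 10.0):
    loglik = -0.5 * len(y) * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)
    assert thetas[np.argmax(loglik)] == thetas[np.argmin(sse)]  # same optimiser in theta
```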
Mark each statement about double descent as true or false.
Show answer
- False — the prof was emphatic: bias and variance still always add up; double descent just changes the shape of the curve, not the identity.
- True — the peak is at the interpolation threshold ($\#\text{params} \approx \#\text{samples}$) where the variance explodes, and test error often comes down a second time past that threshold.
- True — this is the prof's preferred mechanism. Past the interpolation point the optimisation effectively becomes "minimise $\sum \beta^2$ subject to fitting all data exactly" — the min-norm interpolator.
- False — the prof and the slide explicitly say most methods covered in the course do not show double descent; it is mostly a deep-learning phenomenon, and the slide says "we typically do not want to rely on this behaviour."
Atoms: double-descent, bias-variance-tradeoff, gradient-descent-and-sgd. Lecture: L26-nnet-3.
How does the prof reconcile the bias-variance decomposition with the second descent in the
over-parameterised regime?
- A Past the interpolation point, the model has infinitely many zero-training-loss fits and SGD picks the minimum-norm one, which keeps variance low even at extreme flexibility.
- B Past the interpolation point both the bias and the variance of the estimator vanish to zero simultaneously, so the bias-variance trade-off disappears entirely in the over-parameterised regime.
- C The bias-variance decomposition is a theorem about ordinary least squares only, so it breaks down for any nonlinear model and need not apply to neural networks at all.
- D The double-descent curve is observed only for ridge regression with vanishing penalty and has never been documented for neural networks trained with stochastic gradient descent.
Show answer
Correct answer: A
The prof's preferred framing: past the interpolation point the optimiser is no longer balancing fit and penalty — every model fits the training data exactly. So you are choosing among interpolators, and SGD's implicit bias selects the minimum-norm one, which has low variance and generalises well.
B is wrong: bias does not go to zero — the training fit is exact but bias is about expectations across training sets, not training residuals. C is wrong: the decomposition is a general identity for squared-error loss, not specific to linear regression. D is wrong: double descent was popularised in the NN context; it appears in over-parameterised polynomial fits, neural networks, and many flexible models.
Atoms: double-descent, bias-variance-tradeoff. Lecture: L26-nnet-3.
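The "choosing among interpolators" picture can be illustrated with an underdetermined linear system: infinitely many coefficient vectors fit the training data exactly, and the pseudoinverse solution (which `np.linalg.lstsq` returns in this regime) is the one with the smallest norm. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 50))  # 10 observations, 50 parameters: over-parameterised
y = rng.normal(size=10)

beta_min_norm = np.linalg.lstsq(X, y, rcond=None)[0]  # minimum-norm interpolator
beta_other = beta_min_norm + 0.5 * rng.normal(size=50)
beta_other -= X.T @ np.linalg.solve(X @ X.T, X @ beta_other - y)  # project back onto the interpolators

assert np.allclose(X @ beta_min_norm, y) and np.allclose(X @ beta_other, y)
print(np.linalg.norm(beta_min_norm), np.linalg.norm(beta_other))  # min-norm one is smaller
```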
You want to choose the dropout rate, the L2 weight-decay strength, and the hidden width of a feedforward
network for a small tabular dataset. The prof's preferred selection strategy in this course is:
- A Pick whichever combination of dropout rate, weight-decay strength, and width minimises the training-set loss after enough epochs to drive that loss to zero.
- B Use AIC or BIC computed on the training set to penalise complexity automatically, leveraging the fact that they are derived from the same Gaussian likelihood as the loss.
- C Tune the hyperparameters using $k$-fold cross-validation (or a held-out validation set), comparing CV/validation error across candidate values.
- D Rely on SGD's implicit minimum-norm-interpolator selection in the over-parameterised regime, which removes the need for any explicit hyperparameter tuning whatsoever.
Show answer
Correct answer: C
The prof's repeated stance: hyperparameters (dropout rate, weight-decay strength, depth/width) are chosen via cross-validation or a held-out validation set. He explicitly distrusts AIC/BIC ("they're making assumptions that probably won't hold") and prefers CV.
A overfits: training error is monotone non-increasing in flexibility, so it always pushes toward the most complex model. B contradicts the prof's distrust of information criteria, and AIC/BIC are also flagged as having out-of-scope derivations. D misuses the min-norm-interpolator concept — it is a description of what SGD does in the post-interpolation regime, not a hyperparameter-tuning recipe.
Atoms: cross-validation, nn-regularization.
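A minimal $k$-fold loop over a small hyperparameter grid, assuming a hypothetical `build_model(width, dropout, weight_decay)` helper that returns a compiled Keras model; `X`, `y`, and the grid values are likewise illustrative:

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

grid = itertools.product([16, 64], [0.1, 0.2], [1e-4, 1e-2])  # width, dropout, L2
results = {}

for width, dropout, weight_decay in grid:
    fold_errors = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = build_model(width, dropout, weight_decay)     # hypothetical helper
        model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
        fold_errors.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
    results[(width, dropout, weight_decay)] = np.mean(fold_errors)

best = min(results, key=results.get)  # combination with the lowest CV error
```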
Mark each statement about feedforward networks vs. classical methods as true or false.
Show answer
- True — Exercise 11.1c: with linear hidden activation, the composition reduces to a single linear predictor passed through a sigmoid, the logistic-regression form.
- True — Exercise 11.2c. GAMs are additive and per-feature; FNNs mix features inside each hidden unit and so capture interactions natively (at the cost of interpretability).
- False — direction reversed. On the Hitters data (263 observations: linear ≈ 0.56, lasso ≈ 0.50, unregularised NN slightly worse), small data plus an unregularised NN is the losing combination — the prof's verbatim "if you constructed your model this way you doing it wrong". His recommendation: "if you don't have a lot of data and need interpretability, probably don't use neural networks at all. Use trees."
- False — the prof's words: "if you fit a complicated neural network, in the end you don't know what you have." Hidden units are not directly interpretable; this is precisely why the prof recommends trees / classical models when interpretability matters.
Atoms: feedforward-network, nn-regularization. Lecture: L26-nnet-3.
A hidden-layer ReLU neuron has 4 inputs and one bias. Inputs are
$x = (1,\ -2,\ 3,\ 0)$, weights $w = (1,\ 1,\ -2,\ 2)$, bias $b = 1$.
What is the neuron's output?
Show answer
Correct answer: A
Pre-activation: $(1)(1) + (1)(-2) + (-2)(3) + (2)(0) + 1 = 1 - 2 - 6 + 0 + 1 = -6$. ReLU clamps negatives to zero: $\max(0, -6) = 0$.
B reports the raw pre-activation, forgetting to apply ReLU. C reports the absolute value of the pre-activation, treating ReLU as $|z|$. D is the sign-slip distractor: flipping the sign of the $w_3 x_3 = -6$ term turns the pre-activation positive, so the ReLU clamp never fires. The canonical trap is exactly the "forgot to apply $\max(0,\cdot)$" slip or the sign error in the $w_3 x_3$ term.
Atoms: activation-functions, feedforward-network.