
Module 11 — Neural Networks

29 questions · 100 points · ~45 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

A feedforward network has $p$ inputs, one hidden layer of width $M$, and $C$ output units, with biases on every non-input unit. Which expression equals the total parameter count?

Show answer
Correct answer: C

Each hidden unit has $p$ input weights plus one bias, so input→hidden contributes $(p+1)M$. Each output unit has $M$ weights from hidden plus one bias, so hidden→output contributes $(M+1)C$. The prof's headline formula.

A starts from the bias-less weight count and adds $p + C$ instead of the correct $M + C$ — the canonical "added the wrong receiving-side count" slip the prof flagged. B multiplies the three layer counts as if they composed, conflating "how many connections in series" with "how many parameters total"; the formula triple-counts every layer. D puts the bias on the sending side instead of the receiving side; the $+1$ lives on whichever layer is actually receiving the bias unit, never on the sender.
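The count can be sanity-checked in a couple of lines of Python (the helper name is my own, not lecture code):

```python
# Hypothetical helper: parameter count of a p -> M -> C fully-connected
# network with biases on every non-input unit.
def n_params(p, M, C):
    return (p + 1) * M + (M + 1) * C  # input->hidden plus hidden->output

print(n_params(3, 4, 1))   # -> 21 (Question 2's network)
print(n_params(10, 5, 1))  # -> 61 (Question 9's formula)
```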

Atoms: nn-parameter-count, feedforward-network. Lecture: L23-nnet-1.

Question 2 4 points Exam 2025 P2c

A fully-connected feedforward network has 3 inputs, one hidden layer with 4 ReLU neurons, and a single output unit (regression). Biases on all hidden and output units. How many parameters does the network estimate?

Show answer
Correct answer: D

Hidden layer: each of $4$ neurons has $3$ input weights $+ 1$ bias $= 4$ parameters → $4 \cdot 4 = 16$. Output: $4$ weights from the hidden units $+ 1$ bias $= 5$. Total $= 16 + 5 = 21$.

A counts only the hidden-layer weights and biases ($4 \cdot 4 = 16$) — forgets the output unit entirely. B drops the output bias ($16 + 4 = 20$). C drops the hidden-layer biases ($3 \cdot 4 + 4 + 1 = 17$). The bias-omission distractors are the canonical trap the prof flagged.

Atoms: nn-parameter-count. Lecture: L27-summary (Q3b walkthrough).

Question 3 4 points Exam 2023 P5e

A fully-connected feedforward network for 5-class classification has input dimension $128$, two hidden layers of widths $32$ and $64$ (in that order), and a softmax output layer. Biases everywhere; dropout of $20\%$ is applied in each hidden layer during training. How many parameters are estimated?

Show answer
Correct answer: B

Layer-by-layer with biases on every receiving layer: $128\to 32$ has $(128+1) \cdot 32 = 4128$; $32 \to 64$ has $(32+1) \cdot 64 = 2112$; $64 \to 5$ softmax has $(64+1) \cdot 5 = 325$. Sum $= 4128 + 2112 + 325 = 6565$.

A drops every bias — the canonical wrong answer the prof said students always pick. C applies the dropout rate to the parameter count; dropout zeros activations during training, it does not delete parameters, so all weights are still estimated. D drops the biases on the second and third layers but keeps them on the first.
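The same bookkeeping generalises to any stack of layers; a minimal sketch (function name is my own shorthand):

```python
# Layer-by-layer parameter count with a bias on every receiving layer.
# widths lists the layer sizes input -> ... -> output.
def count_params(widths):
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))

print(count_params([128, 32, 64, 5]))  # -> 6565 = 4128 + 2112 + 325
```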

Atoms: nn-parameter-count, nn-regularization. Lecture: L27-summary.

Question 4 4 points Exam 2025 P2c(ii)

A single hidden ReLU neuron receives inputs $x_1 = -1,\ x_2 = 2,\ x_3 = 0$ with weights $w_1 = 2,\ w_2 = 0,\ w_3 = 1$ and bias $b = 0$. What is the neuron's output?

Show answer
Correct answer: D

Pre-activation $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = (-1)(2) + (2)(0) + (0)(1) + 0 = -2$. ReLU clamps negatives to zero: $\max(0, -2) = 0$.

A reports the raw pre-activation, forgetting to apply ReLU. B uses the absolute value of the pre-activation, ignoring that ReLU is one-sided. C flips the sign on $x_1$: treating it as $+1$ gives pre-activation $+2$, which ReLU passes through unchanged. The canonical trap is forgetting that ReLU outputs exactly zero whenever the pre-activation is negative.
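The forward computation is two lines of plain Python (a sketch, no framework assumed):

```python
# One ReLU neuron: dot product plus bias, then clamp negatives to zero.
def relu_neuron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation
    return max(0.0, z)                            # ReLU

print(relu_neuron([-1, 2, 0], [2, 0, 1], 0))                   # -> 0.0 (this question)
print(relu_neuron([2, 1, -1, 3], [0.3, -0.7, 0.5, 0.2], 0.4))  # ~ 0.4 (Question 5)
```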

Atoms: activation-functions, feedforward-network. Lecture: L27-summary.

Question 5 3 points

A single ReLU hidden neuron has weights $w = (0.3,\ -0.7,\ 0.5,\ 0.2)$, bias $b = 0.4$, and inputs $x = (2,\ 1,\ -1,\ 3)$. What is the output of the neuron?

Show answer
Correct answer: A

Dot product: $(0.3)(2) + (-0.7)(1) + (0.5)(-1) + (0.2)(3) = 0.6 - 0.7 - 0.5 + 0.6 = 0.0$. Add bias: $0.0 + 0.4 = 0.4$. ReLU is the identity on positives: $\max(0, 0.4) = 0.4$.

B treats the pre-activation as if it were negative — the canonical "forgot to add the bias" or "flipped a sign" trap that yields exactly $0$ after ReLU. C comes from a sign slip on the third term: reading $(0.5)(-1)$ as $+0.5$ pushes the dot product to $1.0$ and, with the bias, to $1.4$. D reports a negative pre-activation as the output — forgets that ReLU clamps negatives to zero (and the pre-activation here is in fact positive after the bias).

Atoms: activation-functions, feedforward-network. Lecture: L24-nnet-2.

Question 6 3 points Exam 2025 P2g

Why are nonlinear activation functions necessary in a feedforward neural network?

Show answer
Correct answer: C

The prof's verbatim point: a linear sum of linear things is linear, so without a nonlinear hidden activation the entire stacked network collapses to ordinary linear regression. Nonlinearity is what gives the network expressive power and makes the universal-approximation result possible.

A is wrong: activation choice is independent of parameter count — swapping ReLU for sigmoid does not change how many weights you estimate. B confuses activation with topology — full connectivity is a separate architectural choice. D is wrong: gradient descent on a non-convex NN loss is not guaranteed to find the global optimum, regardless of activation. The prof flagged the "reduces parameters" and "makes fully connected" answers as canonical wrong picks.

Atoms: activation-functions, universal-approximation. Lecture: L23-nnet-1.

Question 7 4 points

Mark each statement about activation functions as true or false.

Show answer
  1. True — sigmoid + BCE is the natural pairing because BCE is the negative log-likelihood of a Bernoulli model, exactly the GLM identification the prof emphasised.
  2. False — a stack of linear layers composes to a single linear function. Linear hidden activation collapses the network to linear regression regardless of width.
  3. True — softmax normalises by $\sum_k e^{z_k}$ so the outputs lie on the probability simplex by construction.
  4. False — ReLU is $\max(0, z)$: zero on the negative side, identity on the positive side. Distinct from $|z|$.
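Statement 3's simplex claim is easy to verify numerically (a minimal sketch with a stability shift, not lecture code):

```python
import math

# Softmax over raw scores: exponentiate, then normalise by the sum,
# so the outputs are positive and sum to 1 by construction.
def softmax(z):
    m = max(z)                            # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

p = softmax([2.0, -1.0, 0.5])
print(p, sum(p))  # three probabilities on the simplex; the sum is 1
```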

Atoms: activation-functions, nn-loss-functions, feedforward-network.

Question 8 3 points

A network for $C = 10$-class image classification (e.g. CIFAR-10) is being designed. Which output-activation + loss pairing is appropriate?

Show answer
Correct answer: D

Multi-class output with one-hot labels uses softmax across $C$ output nodes (probabilities summing to 1) paired with categorical cross-entropy (the negative log-likelihood of a multinomial). This matches the prof's slide-deck pairing table.

A treats the class label as a continuous response — wrong family of loss; gives nonsensical gradients for classification. B is the binary pairing and forces class probabilities into a single number, which can't represent 10 mutually-exclusive classes. C mismatches an output-only activation (ReLU is for hidden layers) with a regression loss, breaking the classification interpretation entirely.

Atoms: nn-loss-functions, activation-functions, convolutional-neural-network.

Question 9 3 points

Given the formula $\hat y(\mathbf x) = \beta_0 + \sum_{m=1}^5 \beta_m \cdot \max\!\left(\alpha_{0m} + \sum_{j=1}^{10} \alpha_{jm} x_j, 0\right)$, identify the architecture and count its parameters.

Show answer
Correct answer: C

The $\max(\cdot, 0)$ identifies ReLU; the outer sum has no nonlinearity so the output activation is linear; one level of nesting means one hidden layer. With $p = 10,\ M = 5,\ C = 1$: parameters $= (10+1) \cdot 5 + (5+1) \cdot 1 = 55 + 6 = 61$.

A counts only the hidden-layer weights ($10 \cdot 5 = 50$), dropping every bias and the output unit's parameters. B drops all the biases ($10 \cdot 5 + 5 \cdot 1 = 55$). D misreads $\max(\cdot, 0)$ as a sigmoid and counts an extra layer the formula does not contain — there is only one $\max$ nesting.

Atoms: nn-parameter-count, feedforward-network, activation-functions.

Question 10 3 points

Which of the following is a sufficient condition for a feedforward neural network to satisfy the universal-approximation property (Borel-measurable functions, arbitrary $\varepsilon$)?

Show answer
Correct answer: B

The theorem requires (i) a linear output layer, (ii) at least one hidden layer with a "squashing" / nonlinear activation, and (iii) enough hidden units. Depth is not required — one wide hidden layer suffices.

A combines a skip connection (which UAT does not require and is out of scope here) with an identity hidden activation, so the network collapses to a linear function regardless of width — no squashing means no universal approximation. C invokes recurrence, which is unrelated to UAT (recurrent networks add a different capability). D over-specifies with sigmoid output and a fixed depth — the theorem is silent on output activation choice and does not require a particular depth.

Atoms: universal-approximation, activation-functions. Lecture: L23-nnet-1.

Question 11 4 points

Mark each statement about the universal-approximation theorem as true or false.

Show answer
  1. True — that's the existence statement.
  2. False — UAT is purely an existence result. It says nothing about width bounds, optimisation, or whether SGD will find the weights.
  3. True — depth and width trade off; modern practice favours depth.
  4. False — the proof requires measure theory and the prof explicitly excluded it: "we don't talk about that in this class."

Atoms: universal-approximation. Lecture: L23-nnet-1.

Question 12 3 points

The mini-batch SGD update for parameters $\boldsymbol\theta$ with learning rate $\lambda$ uses a gradient estimated from a random subset of $m \ll N$ samples. Which statement best describes mini-batch SGD?

Show answer
Correct answer: C

A random mini-batch is an unbiased estimator of the full gradient (same expectation, larger variance). The prof's headline NN fact: in the over-parameterised regime mini-batch SGD picks the minimum-norm interpolator among the infinitely many zero-loss solutions — that is the implicit L2 regularisation, "and it has been proven."

A confuses noise with bias — mini-batch gradients are noisy but unbiased. B reverses the speed claim: small batches are faster per step (less computation, parallelisable). D conflates a hardware convention (powers of two for GPU efficiency) with a mathematical equivalence; batch size has nothing to do with statistical equivalence to full-batch GD.
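The unbiasedness claim can be seen on synthetic data (my construction): average many mini-batch gradients at a fixed parameter value and compare with the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 1000, 32
x = rng.normal(size=N)
y = 3 * x + rng.normal(size=N)
theta = 0.0  # evaluate all gradients at the same fixed parameter

def grad(idx):
    r = theta * x[idx] - y[idx]     # residuals on the (mini-)batch
    return 2 * np.mean(r * x[idx])  # gradient of the mean squared error

full = grad(np.arange(N))
mini = [grad(rng.choice(N, size=m, replace=False)) for _ in range(5000)]
print(full, np.mean(mini))  # nearly equal: same expectation, larger variance
```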

Atoms: gradient-descent-and-sgd, regularization. Lecture: L23-nnet-1.

Question 13 4 points

Mark each statement about gradient descent and mini-batch SGD as true or false.

Show answer
  1. False — direction reversed. The prof's words: "$\eta = 2$ is horrible, bouncing everywhere; $\eta \approx 0.1$ is fine." Too-large learning rates overshoot.
  2. True — bigger batch → less noise → less implicit regularisation. The §4b direction-of-effect row in the exam analysis.
  3. True — verbatim prof framing: "powers of two because that's what you always do in machine learning, hardware-efficiency."
  4. False — NN losses are highly non-convex; the prof noted that the local minima are forgivingly connected, but they are not unique global minima.
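Statement 1's direction of effect shows up already on $f(\theta) = \theta^2$ (gradient $2\theta$), where each step multiplies $\theta$ by $(1 - 2\eta)$ — a toy of my own, not lecture code:

```python
# Gradient descent on f(theta) = theta^2: eta = 0.1 contracts, eta = 2 explodes.
def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta -= eta * 2 * theta  # theta <- theta * (1 - 2 * eta)
    return theta

print(abs(run_gd(0.1)))  # tiny: 0.8^20, steady shrink toward the minimum
print(abs(run_gd(2.0)))  # huge: 3^20, bouncing in sign every step
```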

Atoms: gradient-descent-and-sgd. Lecture: L24-nnet-2.

Question 14 3 points

Backpropagation is the standard way to compute gradients for training a feedforward network. Which of the following best describes what backpropagation does?

Show answer
Correct answer: D

Backprop is the chain rule, organised so that pre-activations and activations stored during the forward pass get reused on the way back. That makes one forward + one backward pass enough to compute the gradient with respect to every parameter — the algorithmic breakthrough that pulled NNs out of the AI winter.

A describes the gradient-descent update, not how the gradient is computed; backprop is the calculator, SGD is the updater. B confuses backprop with validation/early-stopping. C reverses the relationship: backprop computes the gradient, and an optimiser (SGD, Adam, …) consumes it; backprop does not replace optimisers.
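The "chain rule with cached forward values" idea fits in a dozen lines for a 1-input / 1-ReLU-hidden / 1-output net, checked against finite differences (names and numbers are mine, not lecture code):

```python
# Forward pass caches v and h; backward pass reuses them via the chain rule.
def loss_and_grads(a0, a1, b0, b1, x, y):
    v = a0 + a1 * x                  # hidden pre-activation (cached)
    h = max(0.0, v)                  # hidden activation (cached)
    yhat = b0 + b1 * h
    d_out = 2 * (yhat - y)                        # dL/d yhat
    d_hid = d_out * b1 * (1.0 if v > 0 else 0.0)  # delta_hid = delta_out * beta * g'(v)
    return (yhat - y) ** 2, [d_hid, d_hid * x, d_out, d_out * h]

params, x, y = [0.2, 0.5, 0.1, 0.7], 1.5, 2.0
_, g = loss_and_grads(*params, x, y)

def fd(i, eps=1e-6):                 # central finite difference in parameter i
    p = params.copy(); p[i] += eps
    up = loss_and_grads(*p, x, y)[0]
    p[i] -= 2 * eps
    dn = loss_and_grads(*p, x, y)[0]
    return (up - dn) / (2 * eps)

print([round(gi - fd(i), 8) for i, gi in enumerate(g)])  # ~ all zeros
```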

Atoms: backpropagation, gradient-descent-and-sgd. Lecture: L24-nnet-2.

Question 15 3 points

Mark each statement about backpropagation as true or false.

Show answer
  1. True — the prof: "if you had loops you're kind of screwed". The acyclic structure is what lets the backward sweep work.
  2. True — $\delta^{\text{hid}} = \delta^{\text{out}} \cdot \beta \cdot g'(v)$. For ReLU, $g'(v)$ is 0 or 1, which is what makes the gradient computation cheap.
  3. False — recurrent networks need a specialised variant ("backpropagation through time", out of scope). The basic algorithm needs modification because the hidden state at $t$ depends on itself at $t-1$.

Atoms: backpropagation, recurrent-neural-network.

Question 16 3 points

Exercise 11.1d: you have a feedforward network with $10\,000$ weights but only $1\,000$ training observations. Which of the following is the strongest single argument that the model can still generalise well?

Show answer
Correct answer: D

The textbook answer to "$10\,000$ weights, $1\,000$ samples" is regularisation — explicit (L1/L2, dropout, early stopping, data augmentation, label smoothing, transfer learning) and implicit (mini-batch SGD picks the minimum-norm interpolator, the double-descent mechanism). The prof's iron rule: never train an NN without regularisation.

A confuses existence (UAT) with generalisation: UAT promises a network exists that approximates any function on training data, but says nothing about test performance. B inverts the trade-off — more flexibility raises variance and typically hurts test error in the under-parameterised regime. C ignores parameter count entirely; choice of activation does not by itself prevent overfitting.

Atoms: nn-regularization, double-descent, gradient-descent-and-sgd.

Question 17 5 points

Mark each statement about NN regularisation as true or false.

Show answer
  1. True — the Goodfellow definition the prof read aloud: "a modification intended to reduce its generalisation error but not its training error." It can hurt training fit by design.
  2. False — direction reversed. The prof's words: "$20\%$ is very common, never use $50\%$." Fifty percent is too aggressive.
  3. True — dropout is a training-time noise injection; at inference you evaluate the intact network (or scale outputs appropriately).
  4. False — that's L1 (lasso-style sparsity). L2 (ridge-style) shrinks weights smoothly toward zero but rarely produces exact zeros.
  5. True — that is the definition of early stopping. The prof confessed it "feels like cheating", but it is the most commonly used regulariser.
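Statements 2 and 3 together are the "inverted dropout" convention; a minimal sketch at the prof's $20\%$ rate (framework implementations differ in details):

```python
import random

# Training time: zero each activation with prob 0.2 and scale survivors by
# 1/0.8, so the expected activation matches the intact network at inference.
def dropout(acts, rate=0.2, training=True, rng=random.Random(0)):
    if not training:
        return list(acts)            # inference: evaluate the intact network
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in acts]

h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h))                    # some entries zeroed, survivors scaled by 1.25
print(dropout(h, training=False))    # unchanged at inference
```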

Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.

Question 18 3 points Exam 2024 P1

In the prof's framing of regularisation across modules, tree-based models perform shrinkage-type regularisation via tree pruning. Which of the following lists is the analogous menu for neural networks?

Show answer
Correct answer: C

The 2024 exam answer key gives this menu verbatim: data augmentation, label smoothing, dropout, early stopping (the slide bracket); plus the explicit weight-penalty options L1/L2 from the lecture. Together these are the prof's regularisation toolbox for NNs.

A lists ensembling tools — those are regularisation, but they belong to the tree side of the analogy, not the NN side. B is model-selection on linear models (information criteria, stepwise), unrelated to NN training tricks. D mixes encoding (a preprocessing step, not regularisation) with batch normalisation and optimisers — both of which are explicitly out of scope per the prof's statements about "advanced optimizers" and "batch normalisation".

Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.

Question 19 3 points

For the Boston Housing regression problem (13 numerical predictors, continuous response, 506 observations), Exercise 11.3 trains a Keras feedforward network with two ReLU hidden layers (64 → 32 → 1, linear output). What preprocessing step is essential before fitting, and why?

Show answer
Correct answer: B

The prof's words on this exact exercise: "you don't want one variable to basically suck up all the variance, just like in PCA." Standardisation is mandatory before NNs, ridge, lasso, PCA, k-means, hierarchical clustering, and KNN — every method that is not scale-invariant.

A treats continuous predictors as categorical and explodes the input dimension nonsensically. C is allowable but discards information and is not the standard preprocessing for an NN regression; the exercise solution does not call for it. D destroys the regression target — Boston housing is a continuous-response regression problem, not a classification problem.
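The standardisation step itself is a few lines of NumPy (a minimal sketch; in a Keras workflow it is often done with sklearn's StandardScaler, fitted on the training split only):

```python
import numpy as np

# Column-wise z-scores; the test set is scaled with the *training* mean and sd.
def standardize(X_train, X_test):
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sd, (X_test - mu) / sd

rng = np.random.default_rng(1)
Xtr = rng.normal(5.0, 10.0, size=(100, 3))  # deliberately far from unit scale
Xte = rng.normal(5.0, 10.0, size=(20, 3))
Ztr, Zte = standardize(Xtr, Xte)
print(Ztr.mean(axis=0).round(6), Ztr.std(axis=0).round(6))  # ~ 0 and ~ 1
```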

Atoms: standardization, feedforward-network. Lecture: L24-nnet-2.

Question 20 4 points

Mark each statement about CNNs as true or false.

Show answer
  1. True — the prof's headline: "CNN is just a neural network… still no loops backwards. So backprop drops in." The trick is weight sharing across spatial locations.
  2. False — that's the pre-CNN classical-vision approach. CNNs learn the filter weights from data, which is the algorithmic shift LeCun introduced.
  3. True — the canonical conv → ReLU → max-pool block in Exercise 11.4.
  4. True — same image, perturbed pose, same label — directly applicable to CNNs and a core part of Exercise 11.4.2.

Atoms: convolutional-neural-network, nn-regularization, backpropagation. Lecture: L24-nnet-2.

Question 21 4 points Ex11.4.1b

After training a CIFAR-10 CNN, the test confusion matrix shows $7\,200$ correctly classified and $2\,800$ misclassified out of $10\,000$ test images. What is the misclassification rate?

Show answer
Correct answer: B

Misclassification rate $= \frac{\text{incorrect}}{\text{total}} = \frac{2800}{10000} = 0.28$.

A is a round-number trap unrelated to the data. C divides incorrect by correct ($2800 / 7200 \approx 0.39$) — that's the odds, not the rate. D reports the accuracy ($7200/10000 = 0.72$), which is one minus the misclassification rate. The trap is mixing up rate, accuracy, and odds.

Atoms: convolutional-neural-network.

Question 22 3 points

A practitioner has a sequential dataset (e.g. NYSE daily volume, returns, volatility for the past $L$ days, predicting tomorrow's volume). Which architecture is the most appropriate match for the sequential structure within the scope of this course?

Show answer
Correct answer: A

RNNs are designed exactly for sequential data: the hidden state $A_t = \sigma(b + W X_t + U A_{t-1})$ carries information forward through the sequence, and the weights $W, U, B$ are shared across positions (the prof's emphasised weight-sharing point).

B is for 2-D image-like spatial structure; the NYSE setup is one-dimensional and sequential, not spatial. C ignores ordering by training each day in isolation — exactly what the prof flagged as the limitation an RNN is meant to overcome ("if there was no order, there would be no obvious ordering"). D is unsupervised dimensionality reduction; it does not address sequence structure at all.
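The recurrence is short enough to write out; a scalar-state sketch of $A_t = \sigma(b + W X_t + U A_{t-1})$ with tanh and toy numbers of my own choosing:

```python
import math

# The same (w, u, b) is reused at every step -- the weight-sharing point.
def rnn_forward(xs, w=0.5, u=0.8, b=0.1, A=0.0):
    states = []
    for x_t in xs:                 # strictly in sequence order
        A = math.tanh(b + w * x_t + u * A)
        states.append(A)
    return states

print(rnn_forward([1.0, -0.5, 2.0, 0.0]))  # each state depends on all earlier inputs
```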

Atoms: recurrent-neural-network, feedforward-network. Lecture: L26-nnet-3.

Question 23 3 points

Mark each statement about RNNs as true or false.

Show answer
  1. True — verbatim from the prof: "we don't have… the weights changing every time" — same $W, U, B$ at every step is what keeps training tractable.
  2. True — the prof: "if $L$ is a billion, even the first thing in your sequence is affecting the billionth output." (Information may degrade in practice — the motivation for LSTMs/attention, out of scope here.)
  3. False — RNNs require a specialised variant ("backpropagation through time") because the recurrence creates an implicit loop in the dependency graph. BPTT itself is out of scope; the answer required is just "no, not unmodified".

Atoms: recurrent-neural-network, backpropagation. Lecture: L26-nnet-3.

Question 24 4 points Exam 2024 P3b

Consider an additive-error regression model $Y_i = f_\theta(X_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal N(0, \sigma^2)$, i.i.d. The log-likelihood as a function of $\theta$ is $\ell(\theta) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_i (y_i - f_\theta(x_i))^2$. Which conclusion follows directly?

Show answer
Correct answer: A

The first term $-\tfrac{n}{2}\log(2\pi\sigma^2)$ does not depend on $\theta$. The second term contains $\theta$ only through $\sum_i (y_i - f_\theta(x_i))^2$, with a negative sign and a positive coefficient $\tfrac{1}{2\sigma^2}$. So maximising $\ell$ in $\theta$ is the same as minimising the residual sum of squares. Under additive Gaussian noise, MLE = LS — the mathy 2024 question the prof flagged as a likely template.

B is wrong: $\sigma^2$ enters as a positive multiplicative constant in front of $\sum (y - f)^2$, so it scales the objective but does not change the argmin in $\theta$ — the MLE depends on the residuals through this very sum. C reverses the sign — maximising the log-likelihood means minimising, not maximising, the SSE. D is incorrect: under Gaussian errors with constant $\sigma^2$, the two estimators coincide; the equivalence is the whole point of this derivation.
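The equivalence is easy to confirm numerically on a synthetic one-parameter model (my construction): the grid point minimising the SSE is the log-likelihood argmax for any $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)
thetas = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])

def loglik(sigma2, n=50):
    # first term is constant in theta; second is -SSE / (2 sigma^2)
    return -n / 2 * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

print(thetas[np.argmin(sse)],
      thetas[np.argmax(loglik(1.0))],
      thetas[np.argmax(loglik(4.0))])  # the same theta all three times
```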

Atoms: nn-loss-functions, least-squares-and-mle. Lecture: L27-summary (mathy question).

Question 25 4 points

Mark each statement about double descent as true or false.

Show answer
  1. False — the prof was emphatic: bias and variance still always add up; double descent just changes the shape of the curve, not the identity.
  2. True — the peak is at the interpolation threshold ($\#\text{params} \approx \#\text{samples}$) where the variance explodes, and test error often comes down a second time past that threshold.
  3. True — this is the prof's preferred mechanism. Past the interpolation point the optimisation effectively becomes "minimise $\sum \beta^2$ subject to fitting all data exactly" — the min-norm interpolator.
  4. False — the prof and the slide explicitly say most methods covered in the course do not show double descent; it is mostly a deep-learning phenomenon, and the slide says "we typically do not want to rely on this behaviour."

Atoms: double-descent, bias-variance-tradeoff, gradient-descent-and-sgd. Lecture: L26-nnet-3.

Question 26 3 points

How does the prof reconcile the bias-variance decomposition with the second descent in the over-parameterised regime?

Show answer
Correct answer: A

The prof's preferred framing: past the interpolation point the optimiser is no longer balancing fit and penalty — every model fits the training data exactly. So you are choosing among interpolators, and SGD's implicit bias selects the minimum-norm one, which has low variance and generalises well.

B is wrong: bias does not go to zero — the training fit is exact but bias is about expectations across training sets, not training residuals. C is wrong: the decomposition is a general identity for squared-error loss, not specific to linear regression. D is wrong: double descent was popularised in the NN context; it appears in over-parameterised polynomial fits, neural networks, and many flexible models.
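The "choosing among interpolators" picture can be made concrete with an over-parameterised linear toy (my construction): `np.linalg.lstsq` returns the minimum-norm interpolator, and adding any null-space direction gives another exact fit with larger norm.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 10))   # 3 observations, 10 coefficients
y = rng.normal(size=3)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # the min-norm interpolator
null_dir = np.linalg.svd(X)[2][-1]            # a unit vector with X @ v = 0
beta2 = beta + null_dir                       # a different zero-loss solution

print(np.allclose(X @ beta, y), np.allclose(X @ beta2, y))  # True True
print(np.linalg.norm(beta) < np.linalg.norm(beta2))         # True
```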

Atoms: double-descent, bias-variance-tradeoff. Lecture: L26-nnet-3.

Question 27 3 points

You want to choose the dropout rate, the L2 weight-decay strength, and the hidden width of a feedforward network for a small tabular dataset. The prof's preferred selection strategy in this course is:

Show answer
Correct answer: C

The prof's repeated stance: hyperparameters (dropout rate, weight-decay strength, depth/width) are chosen via cross-validation or a held-out validation set. He explicitly distrusts AIC/BIC ("they're making assumptions that probably won't hold") and prefers CV.

A overfits: training error is monotone non-increasing in flexibility, so it always pushes toward the most complex model. B contradicts the prof's distrust of information criteria, and AIC/BIC are also flagged as having out-of-scope derivations. D misuses the min-norm-interpolator concept — it is a description of what SGD does in the post-interpolation regime, not a hyperparameter-tuning recipe.

Atoms: cross-validation, nn-regularization.

Question 28 4 points

Mark each statement about feedforward networks vs. classical methods as true or false.

Show answer
  1. True — Exercise 11.1c: with linear hidden activation, the composition reduces to a single linear predictor passed through a sigmoid, the logistic-regression form.
  2. True — Exercise 11.2c. GAMs are additive and per-feature; FNNs mix features inside each hidden unit and so capture interactions natively (at the cost of interpretability).
  3. False — direction reversed. The Hitters comparison (linear ≈ 0.56, lasso ≈ 0.50, unregularised NN slightly worse on 263 observations) shows that small data plus an unregularised NN is a bad combination — the prof's verbatim warning: "if you constructed your model this way, you're doing it wrong." His advice: "if you don't have a lot of data and need interpretability, probably don't use neural networks at all. Use trees."
  4. False — the prof's words: "if you fit a complicated neural network, in the end you don't know what you have." Hidden units are not directly interpretable; this is precisely why the prof recommends trees / classical models when interpretability matters.

Atoms: feedforward-network, nn-regularization. Lecture: L26-nnet-3.

Question 29 3 points

A hidden-layer ReLU neuron has 4 inputs and one bias. Inputs are $x = (1,\ -2,\ 3,\ 0)$, weights $w = (1,\ 1,\ -2,\ 2)$, bias $b = 1$. What is the neuron's output?

Show answer
Correct answer: A

Pre-activation: $(1)(1) + (1)(-2) + (-2)(3) + (2)(0) + 1 = 1 - 2 - 6 + 0 + 1 = -6$. ReLU clamps negatives to zero: $\max(0, -6) = 0$.

B reports the raw pre-activation, forgetting to apply ReLU. C reports the absolute value of the pre-activation, treating ReLU as $|z|$. D is the sign-slip distractor: flipping the sign on both the $w_3 x_3 = -6$ term and the bias gives $1 - 2 + 6 + 0 - 1 = 4$. The canonical trap is exactly the "forgot to apply $\max(0,\cdot)$" or a sign error in the $w_3 x_3$ term.

Atoms: activation-functions, feedforward-network.