
Module 11 — Neural Networks

29 questions · 100 points · ~45 min

Click an option to lock the answer; the explanation auto-opens. Score tracker bottom-left.

Question 1 3 points

A feedforward network has $p$ inputs, one hidden layer of width $M$, and $C$ output units, with biases on every non-input unit. Which expression equals the total parameter count?

Show answer
Correct answer: C

Each hidden unit has $p$ input weights plus one bias, so input→hidden contributes $(p+1)M$. Each output unit has $M$ weights from hidden plus one bias, so hidden→output contributes $(M+1)C$. The prof's headline formula.

A starts from the bias-less weight count and adds $p + C$ instead of the correct $M + C$ — the canonical "added the wrong receiving-side count" slip the prof flagged. B multiplies the three layer counts as if they composed, conflating "how many connections in series" with "how many parameters total"; the formula triple-counts every layer. D puts the bias on the sending side instead of the receiving side; the $+1$ lives on whichever layer is actually receiving the bias unit, never on the sender.
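The count can be sanity-checked in a couple of lines of Python (the helper name is my own, not lecture code):

```python
# Hypothetical helper: parameter count of a p -> M -> C fully-connected
# network with biases on every non-input unit.
def n_params(p, M, C):
    return (p + 1) * M + (M + 1) * C  # input->hidden plus hidden->output

print(n_params(3, 4, 1))   # -> 21 (Question 2's network)
print(n_params(10, 5, 1))  # -> 61 (Question 9's formula)
```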

Atoms: nn-parameter-count, feedforward-network. Lecture: L23-nnet-1.

Question 2 4 points Exam 2025 P2c

A fully-connected feedforward network has 3 inputs, one hidden layer with 4 ReLU neurons, and a single output unit (regression). Biases on all hidden and output units. How many parameters does the network estimate?

Show answer
Correct answer: D

Hidden layer: each of $4$ neurons has $3$ input weights $+ 1$ bias $= 4$ parameters → $4 \cdot 4 = 16$. Output: $4$ weights from the hidden units $+ 1$ bias $= 5$. Total $= 16 + 5 = 21$.

A counts only the hidden-layer weights and biases ($4 \cdot 4 = 16$) — forgets the output unit entirely. B drops the output bias ($16 + 4 = 20$). C drops the hidden-layer biases ($3 \cdot 4 + 4 + 1 = 17$). The bias-omission distractors are the canonical trap the prof flagged.

Atoms: nn-parameter-count. Lecture: L27-summary (Q3b walkthrough).

Question 3 4 points Exam 2023 P5e

A fully-connected feedforward network for 5-class classification has input dimension $128$, two hidden layers of widths $32$ and $64$ (in that order), and a softmax output layer. Biases everywhere; dropout of $20\%$ is applied in each hidden layer during training. How many parameters are estimated?

Show answer
Correct answer: B

Layer-by-layer with biases on every receiving layer: $128\to 32$ has $(128+1) \cdot 32 = 4128$; $32 \to 64$ has $(32+1) \cdot 64 = 2112$; $64 \to 5$ softmax has $(64+1) \cdot 5 = 325$. Sum $= 4128 + 2112 + 325 = 6565$.

A drops every bias — the canonical wrong answer the prof said students always pick. C applies the dropout rate to the parameter count; dropout zeros activations during training, it does not delete parameters, so all weights are still estimated. D drops the biases on the second and third layers but keeps them on the first.
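The same bookkeeping generalises to any stack of layers; a minimal sketch (function name is my own shorthand):

```python
# Layer-by-layer parameter count with a bias on every receiving layer.
# widths lists the layer sizes input -> ... -> output.
def count_params(widths):
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))

print(count_params([128, 32, 64, 5]))  # -> 6565 = 4128 + 2112 + 325
```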

Atoms: nn-parameter-count, nn-regularization. Lecture: L27-summary.

Question 4 4 points Exam 2025 P2c(ii)

A single hidden ReLU neuron receives inputs $x_1 = -1,\ x_2 = 2,\ x_3 = 0$ with weights $w_1 = 2,\ w_2 = 0,\ w_3 = 1$ and bias $b = 0$. What is the neuron's output?

Show answer
Correct answer: D

Pre-activation $z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b = (-1)(2) + (2)(0) + (0)(1) + 0 = -2$. ReLU clamps negatives to zero: $\max(0, -2) = 0$.

A reports the raw pre-activation, forgetting to apply ReLU. B uses the absolute value of the pre-activation, ignoring that ReLU is one-sided. C flips the sign on $x_1$: treating it as $+1$ gives pre-activation $+2$, which ReLU passes through unchanged. The canonical trap is forgetting that ReLU outputs exactly zero whenever the pre-activation is negative.
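The forward computation is two lines of plain Python (a sketch, no framework assumed):

```python
# One ReLU neuron: dot product plus bias, then clamp negatives to zero.
def relu_neuron(x, w, b):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation
    return max(0.0, z)                            # ReLU

print(relu_neuron([-1, 2, 0], [2, 0, 1], 0))                   # -> 0.0 (this question)
print(relu_neuron([2, 1, -1, 3], [0.3, -0.7, 0.5, 0.2], 0.4))  # ~ 0.4 (Question 5)
```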

Atoms: activation-functions, feedforward-network. Lecture: L27-summary.

Question 5 3 points

A single ReLU hidden neuron has weights $w = (0.3,\ -0.7,\ 0.5,\ 0.2)$, bias $b = 0.4$, and inputs $x = (2,\ 1,\ -1,\ 3)$. What is the output of the neuron?

Show answer
Correct answer: A

Dot product: $(0.3)(2) + (-0.7)(1) + (0.5)(-1) + (0.2)(3) = 0.6 - 0.7 - 0.5 + 0.6 = 0.0$. Add bias: $0.0 + 0.4 = 0.4$. ReLU is the identity on positives: $\max(0, 0.4) = 0.4$.

B treats the pre-activation as if it were negative — the canonical "forgot to add the bias" or "flipped a sign" trap that yields exactly $0$ after ReLU. C comes from a sign slip on the third term: reading $(0.5)(-1)$ as $+0.5$ pushes the dot product to $1.0$ and, with the bias, to $1.4$. D reports a negative pre-activation as the output — forgets that ReLU clamps negatives to zero (and the pre-activation here is in fact positive after the bias).

Atoms: activation-functions, feedforward-network. Lecture: L24-nnet-2.

Question 6 3 points Exam 2025 P2g

Why are nonlinear activation functions necessary in a feedforward neural network?

Show answer
Correct answer: C

The prof's verbatim point: a linear sum of linear things is linear, so without a nonlinear hidden activation the entire stacked network collapses to ordinary linear regression. Nonlinearity is what gives the network expressive power and makes the universal-approximation result possible.

A is wrong: activation choice is independent of parameter count — swapping ReLU for sigmoid does not change how many weights you estimate. B confuses activation with topology — full connectivity is a separate architectural choice. D is wrong: gradient descent on a non-convex NN loss is not guaranteed to find the global optimum, regardless of activation. The prof flagged the "reduces parameters" and "makes fully connected" answers as canonical wrong picks.

Atoms: activation-functions, universal-approximation. Lecture: L23-nnet-1.

Question 7 4 points

Mark each statement about activation functions as true or false.

Show answer
  1. True — sigmoid + BCE is the natural pairing because BCE is the negative log-likelihood of a Bernoulli model, exactly the GLM identification the prof emphasised.
  2. False — a stack of linear layers composes to a single linear function. Linear hidden activation collapses the network to linear regression regardless of width.
  3. True — softmax normalises by $\sum_k e^{z_k}$ so the outputs lie on the probability simplex by construction.
  4. False — ReLU is $\max(0, z)$: zero on the negative side, identity on the positive side. Distinct from $|z|$.
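Statement 3's simplex claim is easy to verify numerically (a minimal sketch with a stability shift, not lecture code):

```python
import math

# Softmax over raw scores: exponentiate, then normalise by the sum,
# so the outputs are positive and sum to 1 by construction.
def softmax(z):
    m = max(z)                            # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

p = softmax([2.0, -1.0, 0.5])
print(p, sum(p))  # three probabilities on the simplex; the sum is 1
```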

Atoms: activation-functions, nn-loss-functions, feedforward-network.

Question 8 3 points

A network for $C = 10$-class image classification (e.g. CIFAR-10) is being designed. Which output-activation + loss pairing is appropriate?

Show answer
Correct answer: D

Multi-class output with one-hot labels uses softmax across $C$ output nodes (probabilities summing to 1) paired with categorical cross-entropy (the negative log-likelihood of a multinomial). This matches the prof's slide-deck pairing table.

A treats the class label as a continuous response — wrong family of loss; gives nonsensical gradients for classification. B is the binary pairing and forces class probabilities into a single number, which can't represent 10 mutually-exclusive classes. C mismatches an output-only activation (ReLU is for hidden layers) with a regression loss, breaking the classification interpretation entirely.

Atoms: nn-loss-functions, activation-functions, convolutional-neural-network.

Question 9 3 points

Given the formula $\hat y(\mathbf x) = \beta_0 + \sum_{m=1}^5 \beta_m \cdot \max\!\left(\alpha_{0m} + \sum_{j=1}^{10} \alpha_{jm} x_j, 0\right)$, identify the architecture and count its parameters.

Show answer
Correct answer: C

The $\max(\cdot, 0)$ identifies ReLU; the outer sum has no nonlinearity so the output activation is linear; one level of nesting means one hidden layer. With $p = 10,\ M = 5,\ C = 1$: parameters $= (10+1) \cdot 5 + (5+1) \cdot 1 = 55 + 6 = 61$.

A counts only the hidden-layer weights ($10 \cdot 5 = 50$), dropping every bias and the output unit's parameters. B drops all the biases ($10 \cdot 5 + 5 \cdot 1 = 55$). D misreads $\max(\cdot, 0)$ as a sigmoid and counts an extra layer the formula does not contain — there is only one $\max$ nesting.

Atoms: nn-parameter-count, feedforward-network, activation-functions.

Question 10 3 points

Which of the following is a sufficient condition for a feedforward neural network to satisfy the universal-approximation property (Borel-measurable functions, arbitrary $\varepsilon$)?

Show answer
Correct answer: B

The theorem requires (i) a linear output layer, (ii) at least one hidden layer with a "squashing" / nonlinear activation, and (iii) enough hidden units. Depth is not required — one wide hidden layer suffices.

A combines a skip connection (which UAT does not require and is out of scope here) with an identity hidden activation, so the network collapses to a linear function regardless of width — no squashing means no universal approximation. C invokes recurrence, which is unrelated to UAT (recurrent networks add a different capability). D over-specifies with sigmoid output and a fixed depth — the theorem is silent on output activation choice and does not require a particular depth.

Atoms: universal-approximation, activation-functions. Lecture: L23-nnet-1.

Question 11 4 points

Mark each statement about the universal-approximation theorem as true or false.

Show answer
  1. True — that's the existence statement.
  2. False — UAT is purely an existence result. It says nothing about width bounds, optimisation, or whether SGD will find the weights.
  3. True — depth and width trade off; modern practice favours depth.
  4. False — the proof requires measure theory and the prof explicitly excluded it: "we don't talk about that in this class."

Atoms: universal-approximation. Lecture: L23-nnet-1.

Question 12 3 points

The mini-batch SGD update for parameters $\boldsymbol\theta$ with learning rate $\lambda$ uses a gradient estimated from a random subset of $m \ll N$ samples. Which statement best describes mini-batch SGD?

Show answer
Correct answer: C

A random mini-batch is an unbiased estimator of the full gradient (same expectation, larger variance). The prof's headline NN fact: in the over-parameterised regime mini-batch SGD picks the minimum-norm interpolator among the infinitely many zero-loss solutions — that is the implicit L2 regularisation, "and it has been proven."

A confuses noise with bias — mini-batch gradients are noisy but unbiased. B reverses the speed claim: small batches are faster per step (less computation, parallelisable). D conflates a hardware convention (powers of two for GPU efficiency) with a mathematical equivalence; batch size has nothing to do with statistical equivalence to full-batch GD.
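The unbiasedness claim can be seen on synthetic data (my construction): average many mini-batch gradients at a fixed parameter value and compare with the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 1000, 32
x = rng.normal(size=N)
y = 3 * x + rng.normal(size=N)
theta = 0.0  # evaluate all gradients at the same fixed parameter

def grad(idx):
    r = theta * x[idx] - y[idx]     # residuals on the (mini-)batch
    return 2 * np.mean(r * x[idx])  # gradient of the mean squared error

full = grad(np.arange(N))
mini = [grad(rng.choice(N, size=m, replace=False)) for _ in range(5000)]
print(full, np.mean(mini))  # nearly equal: same expectation, larger variance
```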

Atoms: gradient-descent-and-sgd, regularization. Lecture: L23-nnet-1.

Question 13 4 points

Mark each statement about gradient descent and mini-batch SGD as true or false.

Show answer
  1. False — direction reversed. The prof's words: "$\eta = 2$ is horrible, bouncing everywhere; $\eta \approx 0.1$ is fine." Too-large learning rates overshoot.
  2. True — bigger batch → less noise → less implicit regularisation. The §4b direction-of-effect row in the exam analysis.
  3. True — verbatim prof framing: "powers of two because that's what you always do in machine learning, hardware-efficiency."
  4. False — NN losses are highly non-convex; the prof noted that the local minima are forgivingly connected, but they are not unique global minima.
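Statement 1's direction of effect shows up already on $f(\theta) = \theta^2$ (gradient $2\theta$), where each step multiplies $\theta$ by $(1 - 2\eta)$ — a toy of my own, not lecture code:

```python
# Gradient descent on f(theta) = theta^2: eta = 0.1 contracts, eta = 2 explodes.
def run_gd(eta, steps=20, theta=1.0):
    for _ in range(steps):
        theta -= eta * 2 * theta  # theta <- theta * (1 - 2 * eta)
    return theta

print(abs(run_gd(0.1)))  # tiny: 0.8^20, steady shrink toward the minimum
print(abs(run_gd(2.0)))  # huge: 3^20, bouncing in sign every step
```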

Atoms: gradient-descent-and-sgd. Lecture: L24-nnet-2.

Question 14 3 points

Backpropagation is the standard way to compute gradients for training a feedforward network. Which of the following best describes what backpropagation does?

Show answer
Correct answer: D

Backprop is the chain rule, organised so that pre-activations and activations stored during the forward pass get reused on the way back. That makes one forward + one backward pass enough to compute the gradient with respect to every parameter — the algorithmic breakthrough that pulled NNs out of the AI winter.

A describes the gradient-descent update, not how the gradient is computed; backprop is the calculator, SGD is the updater. B confuses backprop with validation/early-stopping. C reverses the relationship: backprop computes the gradient, and an optimiser (SGD, Adam, …) consumes it; backprop does not replace optimisers.
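The "chain rule with cached forward values" idea fits in a dozen lines for a 1-input / 1-ReLU-hidden / 1-output net, checked against finite differences (names and numbers are mine, not lecture code):

```python
# Forward pass caches v and h; backward pass reuses them via the chain rule.
def loss_and_grads(a0, a1, b0, b1, x, y):
    v = a0 + a1 * x                  # hidden pre-activation (cached)
    h = max(0.0, v)                  # hidden activation (cached)
    yhat = b0 + b1 * h
    d_out = 2 * (yhat - y)                        # dL/d yhat
    d_hid = d_out * b1 * (1.0 if v > 0 else 0.0)  # delta_hid = delta_out * beta * g'(v)
    return (yhat - y) ** 2, [d_hid, d_hid * x, d_out, d_out * h]

params, x, y = [0.2, 0.5, 0.1, 0.7], 1.5, 2.0
_, g = loss_and_grads(*params, x, y)

def fd(i, eps=1e-6):                 # central finite difference in parameter i
    p = params.copy(); p[i] += eps
    up = loss_and_grads(*p, x, y)[0]
    p[i] -= 2 * eps
    dn = loss_and_grads(*p, x, y)[0]
    return (up - dn) / (2 * eps)

print([round(gi - fd(i), 8) for i, gi in enumerate(g)])  # ~ all zeros
```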

Atoms: backpropagation, gradient-descent-and-sgd. Lecture: L24-nnet-2.

Question 15 3 points

Mark each statement about backpropagation as true or false.

Show answer
  1. True — the prof: "if you had loops you're kind of screwed". The acyclic structure is what lets the backward sweep work.
  2. True — $\delta^{\text{hid}} = \delta^{\text{out}} \cdot \beta \cdot g'(v)$. For ReLU, $g'(v)$ is 0 or 1, which is what makes the gradient computation cheap.
  3. False — recurrent networks need a specialised variant ("backpropagation through time", out of scope). The basic algorithm needs modification because the hidden state at $t$ depends on itself at $t-1$.

Atoms: backpropagation, recurrent-neural-network.

Question 16 3 points

Exercise 11.1d: you have a feedforward network with $10\,000$ weights but only $1\,000$ training observations. Which of the following is the strongest single argument that the model can still generalise well?

Show answer
Correct answer: D

The textbook answer to "$10\,000$ weights, $1\,000$ samples" is regularisation — explicit (L1/L2, dropout, early stopping, data augmentation, label smoothing, transfer learning) and implicit (mini-batch SGD picks the minimum-norm interpolator, the double-descent mechanism). The prof's iron rule: never train an NN without regularisation.

A confuses existence (UAT) with generalisation: UAT promises a network exists that approximates any function on training data, but says nothing about test performance. B inverts the trade-off — more flexibility raises variance and typically hurts test error in the under-parameterised regime. C ignores parameter count entirely; choice of activation does not by itself prevent overfitting.

Atoms: nn-regularization, double-descent, gradient-descent-and-sgd.

Question 17 5 points

Mark each statement about NN regularisation as true or false.

Show answer
  1. True — the Goodfellow definition the prof read aloud: "a modification intended to reduce its generalisation error but not its training error." It can hurt training fit by design.
  2. False — direction reversed. The prof's words: "$20\%$ is very common, never use $50\%$." Fifty percent is too aggressive.
  3. True — dropout is a training-time noise injection; at inference you evaluate the intact network (or scale outputs appropriately).
  4. False — that's L1 (lasso-style sparsity). L2 (ridge-style) shrinks weights smoothly toward zero but rarely produces exact zeros.
  5. True — that is the definition of early stopping. The prof confessed it "feels like cheating", but it is the most commonly used regulariser.
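Statements 2 and 3 together are the "inverted dropout" convention; a minimal sketch at the prof's $20\%$ rate (framework implementations differ in details):

```python
import random

# Training time: zero each activation with prob 0.2 and scale survivors by
# 1/0.8, so the expected activation matches the intact network at inference.
def dropout(acts, rate=0.2, training=True, rng=random.Random(0)):
    if not training:
        return list(acts)            # inference: evaluate the intact network
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in acts]

h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h))                    # some entries zeroed, survivors scaled by 1.25
print(dropout(h, training=False))    # unchanged at inference
```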

Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.

Question 18 3 points Exam 2024 P1

In the prof's framing of regularisation across modules, tree-based models perform shrinkage-type regularisation via tree pruning. Which of the following lists is the analogous menu for neural networks?

Show answer
Correct answer: C

The 2024 exam answer key gives this menu verbatim: data augmentation, label smoothing, dropout, early stopping (the slide bracket); plus the explicit weight-penalty options L1/L2 from the lecture. Together these are the prof's regularisation toolbox for NNs.

A lists ensembling tools — those are regularisation, but they belong to the tree side of the analogy, not the NN side. B is model-selection on linear models (information criteria, stepwise), unrelated to NN training tricks. D mixes encoding (a preprocessing step, not regularisation) with batch normalisation and optimisers — both of which are explicitly out of scope per the prof's statements about "advanced optimizers" and "batch normalisation".

Atoms: nn-regularization, regularization. Lecture: L24-nnet-2.

Question 19 3 points

For the Boston Housing regression problem (13 numerical predictors, continuous response, 506 observations), Exercise 11.3 trains a Keras feedforward network with two ReLU hidden layers (64 → 32 → 1, linear output). What preprocessing step is essential before fitting, and why?

Show answer
Correct answer: B

The prof's words on this exact exercise: "you don't want one variable to basically suck up all the variance, just like in PCA." Standardisation is mandatory before NNs, ridge, lasso, PCA, k-means, hierarchical clustering, and KNN — every method that is not scale-invariant.

A treats continuous predictors as categorical and explodes the input dimension nonsensically. C is allowable but discards information and is not the standard preprocessing for an NN regression; the exercise solution does not call for it. D destroys the regression target — Boston housing is a continuous-response regression problem, not a classification problem.
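The standardisation step itself is a few lines of NumPy (a minimal sketch; in a Keras workflow it is often done with sklearn's StandardScaler, fitted on the training split only):

```python
import numpy as np

# Column-wise z-scores; the test set is scaled with the *training* mean and sd.
def standardize(X_train, X_test):
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_train - mu) / sd, (X_test - mu) / sd

rng = np.random.default_rng(1)
Xtr = rng.normal(5.0, 10.0, size=(100, 3))  # deliberately far from unit scale
Xte = rng.normal(5.0, 10.0, size=(20, 3))
Ztr, Zte = standardize(Xtr, Xte)
print(Ztr.mean(axis=0).round(6), Ztr.std(axis=0).round(6))  # ~ 0 and ~ 1
```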

Atoms: standardization, feedforward-network. Lecture: L24-nnet-2.

Question 20 4 points

Mark each statement about CNNs as true or false.

Show answer
  1. True — the prof's headline: "CNN is just a neural network… still no loops backwards. So backprop drops in." The trick is weight sharing across spatial locations.
  2. False — that's the pre-CNN classical-vision approach. CNNs learn the filter weights from data, which is the algorithmic shift LeCun introduced.
  3. True — the canonical conv → ReLU → max-pool block in Exercise 11.4.
  4. True — same image, perturbed pose, same label — directly applicable to CNNs and a core part of Exercise 11.4.2.

Atoms: convolutional-neural-network, nn-regularization, backpropagation. Lecture: L24-nnet-2.

Question 21 4 points Ex11.4.1b

After training a CIFAR-10 CNN, the test confusion matrix shows $7\,200$ correctly classified and $2\,800$ misclassified out of $10\,000$ test images. What is the misclassification rate?

Show answer
Correct answer: B

Misclassification rate $= \frac{\text{incorrect}}{\text{total}} = \frac{2800}{10000} = 0.28$.

A is a round-number trap unrelated to the data. C divides incorrect by correct ($2800 / 7200 \approx 0.39$) — that's the odds, not the rate. D reports the accuracy ($7200/10000 = 0.72$), which is one minus the misclassification rate. The trap is mixing up rate, accuracy, and odds.

Atoms: convolutional-neural-network.

Question 22 3 points

A practitioner has a sequential dataset (e.g. NYSE daily volume, returns, volatility for the past $L$ days, predicting tomorrow's volume). Which architecture is the most appropriate match for the sequential structure within the scope of this course?

Show answer
Correct answer: A

RNNs are designed exactly for sequential data: the hidden state $A_t = \sigma(b + W X_t + U A_{t-1})$ carries information forward through the sequence, and the weights $W, U, B$ are shared across positions (the prof's emphasised weight-sharing point).

B is for 2-D image-like spatial structure; the NYSE setup is one-dimensional and sequential, not spatial. C ignores ordering by training each day in isolation — exactly what the prof flagged as the limitation an RNN is meant to overcome ("if there was no order, there would be no obvious ordering"). D is unsupervised dimensionality reduction; it does not address sequence structure at all.
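The recurrence is short enough to write out; a scalar-state sketch of $A_t = \sigma(b + W X_t + U A_{t-1})$ with tanh and toy numbers of my own choosing:

```python
import math

# The same (w, u, b) is reused at every step -- the weight-sharing point.
def rnn_forward(xs, w=0.5, u=0.8, b=0.1, A=0.0):
    states = []
    for x_t in xs:                 # strictly in sequence order
        A = math.tanh(b + w * x_t + u * A)
        states.append(A)
    return states

print(rnn_forward([1.0, -0.5, 2.0, 0.0]))  # each state depends on all earlier inputs
```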

Atoms: recurrent-neural-network, feedforward-network. Lecture: L26-nnet-3.

Question 23 3 points

Mark each statement about RNNs as true or false.

Show answer
  1. True — verbatim from the prof: "we don't have… the weights changing every time" — same $W, U, B$ at every step is what keeps training tractable.
  2. True — the prof: "if $L$ is a billion, even the first thing in your sequence is affecting the billionth output." (Information may degrade in practice — the motivation for LSTMs/attention, out of scope here.)
  3. False — RNNs require a specialised variant ("backpropagation through time") because the recurrence creates an implicit loop in the dependency graph. BPTT itself is out of scope; the answer required is just "no, not unmodified".

Atoms: recurrent-neural-network, backpropagation. Lecture: L26-nnet-3.

Question 24 4 points Exam 2024 P3b

Consider an additive-error regression model $Y_i = f_\theta(X_i) + \varepsilon_i$ with $\varepsilon_i \sim \mathcal N(0, \sigma^2)$, i.i.d. The log-likelihood as a function of $\theta$ is $\ell(\theta) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_i (y_i - f_\theta(x_i))^2$. Which conclusion follows directly?

Show answer
Correct answer: A

The first term $-\tfrac{n}{2}\log(2\pi\sigma^2)$ does not depend on $\theta$. The second term contains $\theta$ only through $\sum_i (y_i - f_\theta(x_i))^2$, with a negative sign and a positive coefficient $\tfrac{1}{2\sigma^2}$. So maximising $\ell$ in $\theta$ is the same as minimising the residual sum of squares. Under additive Gaussian noise, MLE = LS — the mathy 2024 question the prof flagged as a likely template.

B is wrong: $\sigma^2$ enters as a positive multiplicative constant in front of $\sum (y - f)^2$, so it scales the objective but does not change the argmin in $\theta$ — the MLE depends on the residuals through this very sum. C reverses the sign — maximising the log-likelihood means minimising, not maximising, the SSE. D is incorrect: under Gaussian errors with constant $\sigma^2$, the two estimators coincide; the equivalence is the whole point of this derivation.
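The equivalence is easy to confirm numerically on a synthetic one-parameter model (my construction): the grid point minimising the SSE is the log-likelihood argmax for any $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)
thetas = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])

def loglik(sigma2, n=50):
    # first term is constant in theta; second is -SSE / (2 sigma^2)
    return -n / 2 * np.log(2 * np.pi * sigma2) - sse / (2 * sigma2)

print(thetas[np.argmin(sse)],
      thetas[np.argmax(loglik(1.0))],
      thetas[np.argmax(loglik(4.0))])  # the same theta all three times
```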

Atoms: nn-loss-functions, least-squares-and-mle. Lecture: L27-summary (mathy question).

Question 25 4 points

Mark each statement about double descent as true or false.

Show answer
  1. False — the prof was emphatic: bias and variance still always add up; double descent just changes the shape of the curve, not the identity.
  2. True — the peak is at the interpolation threshold ($\#\text{params} \approx \#\text{samples}$) where the variance explodes, and test error often comes down a second time past that threshold.
  3. True — this is the prof's preferred mechanism. Past the interpolation point the optimisation effectively becomes "minimise $\sum \beta^2$ subject to fitting all data exactly" — the min-norm interpolator.
  4. False — the prof and the slide explicitly say most methods covered in the course do not show double descent; it is mostly a deep-learning phenomenon, and the slide says "we typically do not want to rely on this behaviour."

Atoms: double-descent, bias-variance-tradeoff, gradient-descent-and-sgd. Lecture: L26-nnet-3.

Question 26 3 points

How does the prof reconcile the bias-variance decomposition with the second descent in the over-parameterised regime?

Show answer
Correct answer: A

The prof's preferred framing: past the interpolation point the optimiser is no longer balancing fit and penalty — every model fits the training data exactly. So you are choosing among interpolators, and SGD's implicit bias selects the minimum-norm one, which has low variance and generalises well.

B is wrong: bias does not go to zero — the training fit is exact but bias is about expectations across training sets, not training residuals. C is wrong: the decomposition is a general identity for squared-error loss, not specific to linear regression. D is wrong: double descent was popularised in the NN context; it appears in over-parameterised polynomial fits, neural networks, and many flexible models.
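The "choosing among interpolators" picture can be made concrete with an over-parameterised linear toy (my construction): `np.linalg.lstsq` returns the minimum-norm interpolator, and adding any null-space direction gives another exact fit with larger norm.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 10))   # 3 observations, 10 coefficients
y = rng.normal(size=3)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # the min-norm interpolator
null_dir = np.linalg.svd(X)[2][-1]            # a unit vector with X @ v = 0
beta2 = beta + null_dir                       # a different zero-loss solution

print(np.allclose(X @ beta, y), np.allclose(X @ beta2, y))  # True True
print(np.linalg.norm(beta) < np.linalg.norm(beta2))         # True
```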

Atoms: double-descent, bias-variance-tradeoff. Lecture: L26-nnet-3.

Question 27 3 points

You want to choose the dropout rate, the L2 weight-decay strength, and the hidden width of a feedforward network for a small tabular dataset. The prof's preferred selection strategy in this course is:

Show answer
Correct answer: C

The prof's repeated stance: hyperparameters (dropout rate, weight-decay strength, depth/width) are chosen via cross-validation or a held-out validation set. He explicitly distrusts AIC/BIC ("they're making assumptions that probably won't hold") and prefers CV.

A overfits: training error is monotone non-increasing in flexibility, so it always pushes toward the most complex model. B contradicts the prof's distrust of information criteria, and AIC/BIC are also flagged as having out-of-scope derivations. D misuses the min-norm-interpolator concept — it is a description of what SGD does in the post-interpolation regime, not a hyperparameter-tuning recipe.

Atoms: cross-validation, nn-regularization.

Question 28 4 points

Mark each statement about feedforward networks vs. classical methods as true or false.

Show answer
  1. True — Exercise 11.1c: with linear hidden activation, the composition reduces to a single linear predictor passed through a sigmoid, the logistic-regression form.
  2. True — Exercise 11.2c. GAMs are additive and per-feature; FNNs mix features inside each hidden unit and so capture interactions natively (at the cost of interpretability).
  3. False — direction reversed. The Hitters comparison (linear ≈ 0.56, lasso ≈ 0.50, unregularised NN slightly worse on 263 observations) shows that small data plus an unregularised NN is a bad combination — the prof's verbatim warning: "if you constructed your model this way, you're doing it wrong." His advice: "if you don't have a lot of data and need interpretability, probably don't use neural networks at all. Use trees."
  4. False — the prof's words: "if you fit a complicated neural network, in the end you don't know what you have." Hidden units are not directly interpretable; this is precisely why the prof recommends trees / classical models when interpretability matters.

Atoms: feedforward-network, nn-regularization. Lecture: L26-nnet-3.

Question 29 3 points

A hidden-layer ReLU neuron has 4 inputs and one bias. Inputs are $x = (1,\ -2,\ 3,\ 0)$, weights $w = (1,\ 1,\ -2,\ 2)$, bias $b = 1$. What is the neuron's output?

Show answer
Correct answer: A

Pre-activation: $(1)(1) + (1)(-2) + (-2)(3) + (2)(0) + 1 = 1 - 2 - 6 + 0 + 1 = -6$. ReLU clamps negatives to zero: $\max(0, -6) = 0$.

B reports the raw pre-activation, forgetting to apply ReLU. C reports the absolute value of the pre-activation, treating ReLU as $|z|$. D is the sign-slip distractor: flipping the sign on both the $w_3 x_3 = -6$ term and the bias gives $1 - 2 + 6 + 0 - 1 = 4$. The canonical trap is exactly the "forgot to apply $\max(0,\cdot)$" or a sign error in the $w_3 x_3$ term.

Atoms: activation-functions, feedforward-network.