Neural network parameter count
The prof’s framing: a hand calculation the exam will ask for, and the canonical wrong answer is forgetting the biases. His advice: draw the network, then count layer by layer.
Definition (prof’s framing)
“Q3b, Neural network parameter count + forward pass: Given a feed-forward neural network with stated input/hidden/output sizes: ‘How many weights total, including biases?’” - L27-summary
“His advice for the count: draw the network, then count the parameters, including bias terms.” - L27-summary
For a feedforward network, count, for each layer with n_in inputs (from the previous layer) and n_out units in this layer:
- Weights: n_in × n_out (one per (incoming, this-unit) pair)
- Biases: n_out (one per receiving unit in the new layer)
- Layer total: n_out × (n_in + 1)
Sum across all layers between input and output.
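A minimal sketch of the per-layer count in plain Python (the helper name layer_params is an illustration, not from the course code):

```python
def layer_params(n_in: int, n_out: int) -> int:
    """Weights: n_in * n_out (one per connection); biases: n_out (one per receiving unit)."""
    weights = n_in * n_out
    biases = n_out
    return weights + biases  # = n_out * (n_in + 1)

print(layer_params(3, 4))  # 16: the input -> hidden layer of the 2025 Q3b network
```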
Notation & setup
For a single-hidden-layer FNN:
- p: input dimension (number of inputs)
- M: hidden-layer width
- K: output dimension
- W^(1) matrix: (p + 1) × M for input → hidden (+1 row for the biases), i.e. M(p + 1) parameters
- W^(2) matrix: (M + 1) × K for hidden → output (+1 row for the biases), i.e. K(M + 1) parameters
Formula(s) to know cold
Single hidden layer (memorize): total parameters = M(p + 1) + K(M + 1)
Breakdown:
- Input → hidden: p weights per hidden unit, +1 for the bias, on each of M hidden units → M(p + 1); that’s the whole W^(1) matrix.
- Hidden → output: M weights per output unit, +1 for the bias, on each of K output units → K(M + 1); the whole W^(2) matrix.
Multilayer (general recipe)
For an FNN with input dim p, hidden widths M_1, …, M_L, output dim K: total = M_1(p + 1) + M_2(M_1 + 1) + … + M_L(M_{L-1} + 1) + K(M_L + 1).
Each layer contributes a term of the form (units in this layer) × (units in the previous layer + 1).
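The same recipe as a short sketch (plain Python; the function name total_params and the widths-list convention are assumptions for illustration):

```python
def total_params(widths: list[int]) -> int:
    """widths = [p, M_1, ..., M_L, K]; each transition contributes n_out * (n_in + 1)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(widths, widths[1:]))

print(total_params([3, 4, 1]))         # 21   (2025 exam Q3b.i below)
print(total_params([128, 32, 64, 5]))  # 6565 (2023 exam Q5e below)
```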
Insights & mental models
Why the +1
Every non-input layer has a bias per unit. Think of it as the intercept in regression: each hidden / output unit has its own “intercept” added to the weighted sum. Mechanically, you can imagine a constant input “1” feeding every layer with a separate weight for each receiving unit; that constant “1” carries the bias.
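A quick numpy sketch of the constant-1 view (illustrative only; the shapes follow the (p + 1) × M convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 observations, p = 3 inputs
W = rng.normal(size=(3, 4))   # input -> hidden weights, p x M
b = rng.normal(size=4)        # one bias per hidden unit, length M

pre_act = X @ W + b           # the usual "weighted sum plus bias"

# Append a constant-1 column and fold the biases into an extra weight row:
X_aug = np.hstack([X, np.ones((5, 1))])  # shape (5, p + 1)
W_aug = np.vstack([W, b])                # shape (p + 1, M)
assert np.allclose(pre_act, X_aug @ W_aug)
```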
Draw, don’t compute from memory
The prof’s exam advice is procedural, not formulaic. Draw boxes, draw connections, count edges (weights), count receiving nodes (biases). The formula falls out and you avoid sign mistakes.
Why this is exam-flagged
It’s a one-step, low-conceptual, calculator-free question that everyone should get right but that has a famous failure mode (skipping the biases). High discrimination value; the exam writers love these.
Worked examples (Exercise11 and past exams)
Exercise11.2a, network given as a formula:
- Single hidden layer, ReLU hidden activation, linear output
- Params: M(p + 1) + K(M + 1) for the stated input, hidden, and output sizes
Exercise11.2b, network given as a formula:
- Two hidden layers, ReLU at both hidden layers, sigmoid at output
- Note: the formula as written has no bias at the first hidden layer (the intercept term is missing); that’s a quirk of how this exercise is written, so double-check whether the formula includes biases explicitly.
- Standard answer (with biases at all layers): M_1(p + 1) + M_2(M_1 + 1) + K(M_2 + 1).
- If the first hidden layer truly omits biases (as written): subtract M_1 from that total. The prof would likely take the standard interpretation, but flag the ambiguity in the formula in your answer if you spot it (per L27 advice).
2025 exam Q3b.i (verbatim cited in L27-summary):
- 3 inputs, 1 hidden layer with 4 neurons, 1 output (regression)
- Hidden: 4 × (3 + 1) = 16. Output: 1 × (4 + 1) = 5. Total 21.
2023 exam Q5e (L27-summary grading reference):
- 128 inputs, hidden layers of 32 and 64, 5 output classes (softmax)
- Layer 1: 32 × (128 + 1) = 4128. Layer 2: 64 × (32 + 1) = 2112. Output: 5 × (64 + 1) = 325. Total 6565.
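To sanity-check by machine rather than by hand, a PyTorch count matches (this assumes torch is installed; nn.Linear includes a bias by default):

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

q3b_2025 = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
q5e_2023 = nn.Sequential(nn.Linear(128, 32), nn.ReLU(),
                         nn.Linear(32, 64), nn.ReLU(),
                         nn.Linear(64, 5))

print(n_params(q3b_2025))  # 21
print(n_params(q5e_2023))  # 6565
```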
Exam signals
“Q3b … given the network with stated input/hidden/output sizes: ‘How many weights total, including biases?’” - L27-summary
“His advice for the count: draw the network, then count the parameters, including bias terms.” - L27-summary
“Each hidden neuron has 3 input weights + 1 bias = 4 parameters. Total for hidden layer 4 × 4 = 16. Output neuron has 4 input weights + 1 bias = 5 parameters. Final total = 16 + 5 = 21.”, 2025 exam solution, L27-summary
The 2024 exam also had an architecture-interpretation MC referring to weight counts; the 2023 exam Q5e (“How many parameters do we need to estimate in total?”) is a near-identical task.
Pitfalls
- Forgetting biases: the canonical wrong answer. The 2023 exam Q5e listed “(ii) 6464” (no biases) as a distractor; the correct answer was “(i) 6565” (with biases). The prof in the answer key: “remember that there is one bias-node in each layer!”
- Counting the “+1” in the wrong place. The bias sits on the receiving layer (the new units), not the sending layer. So a transition from n_in inputs to n_out outputs has n_out × (n_in + 1) = n_in × n_out + n_out params, not n_in × n_out + n_in.
- Adding an extra bias for the input layer. Inputs are observed, no biases.
- Dropout is not a parameter reduction. Dropout multiplies activations by a random mask during training but doesn’t remove parameters. The 2023 exam Q5e had a distractor, “(iv)”, that applied 20% dropout to the parameter count; wrong, all weights are still estimated (see the check after this list).
- Conflating “neurons” with “parameters”. A network with M = 4 hidden neurons does not have 4 parameters; it has M(p + 1) + K(M + 1) (21 in the 2025 Q3b setup with p = 3 and K = 1). Neurons are the structure; parameters live on the connections + biases.
- Off-by-one when reading multilayer formulas. Trace each summation index back to the input dim (p) and output dim (K) carefully; missing a layer or miscounting an M is easy.
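A quick check of the dropout pitfall (again assuming PyTorch): adding Dropout layers changes nothing in the parameter count.

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

plain = nn.Sequential(nn.Linear(128, 32), nn.ReLU(),
                      nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 5))
dropped = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Dropout(p=0.2),
                        nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.2),
                        nn.Linear(64, 5))

print(count(plain), count(dropped))  # 6565 6565 -- dropout removes no parameters
```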
Scope vs ISLP
- In scope: the M(p + 1) + K(M + 1) formula for one hidden layer, layer-by-layer counting for deeper networks, biases included, dropout doesn’t reduce the count.
- Look up in ISLP: §10.1 (one hidden layer); the parameter-counting walkthrough is mostly in the slides, not in ISLP.
- Skip in ISLP: parameter-sharing accounting in CNNs (kernel sizes × depth, not on this exam since the prof excluded CNN architecture details, see L27-summary and the convolutional-neural-network atom).
Exercise instances
- Exercise11.2a: 1-hidden-layer ReLU FNN; parameters = M(p + 1) + K(M + 1) for the stated sizes.
- Exercise11.2b: 2-hidden-layer ReLU + sigmoid FNN; standard count M_1(p + 1) + M_2(M_1 + 1) + K(M_2 + 1).
How it might appear on the exam
- Direct question (2023, 2024, 2025 all had this): “Given network architecture X, how many parameters?” Multiple-choice with the no-bias version as a distractor.
- Drawing-from-formula style (Exercise11.2 verbatim): “Given the formula, identify the architecture and count parameters”, combine identification of activations with the count.
- Trap inside a longer question: a question may casually mention a “1049-parameter NN” (Hitters Comparison in L26-nnet-3); don’t try to verify the count under time pressure unless the question demands it.
- Combined with 2025 Q3b.ii style: parameter count as part (i), forward pass through one neuron as part (ii). Standard pairing.
Related
- feedforward-network: where the architecture and the layer widths live
- activation-functions: choice of activation does not change the parameter count
- nn-regularization: Exercise11.1d (“how can a 10000-weight model fit on 1000 obs?”) explicitly references the parameter overage; answer is regularization
- double-descent: the prof’s reason for being okay with more parameters than observations