Neural network parameter count
The prof’s framing: a hand calculation the exam will ask for, and the canonical wrong answer is forgetting the biases. His advice: draw the network, then count layer by layer.
Definition (prof’s framing)
“Q3b, Neural network parameter count + forward pass: Given a feed-forward neural network with stated input/hidden/output sizes: ‘How many weights total, including biases?’” - L27-summary
“His advice for the count: draw the network, then count the parameters, including bias terms.” - L27-summary
For a feedforward network, count, for each layer with n_in inputs (from the previous layer) and n_out units in this layer:
- Weights: n_in × n_out (one per (incoming, this-unit) pair)
- Biases: n_out (one per receiving unit in the new layer)
- Layer total: n_out × (n_in + 1)
Sum across all layers between input and output.
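A minimal sketch of the per-layer count in plain Python (the helper name layer_params is an illustration, not from the course code):

```python
def layer_params(n_in: int, n_out: int) -> int:
    """Weights: n_in * n_out (one per connection); biases: n_out (one per receiving unit)."""
    weights = n_in * n_out
    biases = n_out
    return weights + biases  # = n_out * (n_in + 1)

print(layer_params(3, 4))  # 16: the input -> hidden layer of the 2025 Q3b network
```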
Notation & setup
For a single-hidden-layer FNN:
- p: input dimension (number of inputs)
- M: hidden-layer width
- K: output dimension
- W^(1) matrix: (p + 1) × M for input → hidden (+1 row for the biases), i.e. M(p + 1) parameters
- W^(2) matrix: (M + 1) × K for hidden → output (+1 row for the biases), i.e. K(M + 1) parameters
Formula(s) to know cold
Single hidden layer (memorize): total parameters = M(p + 1) + K(M + 1)
Breakdown:
- Input → hidden: p weights per hidden unit, +1 for the bias, on each of M hidden units → M(p + 1); that’s the whole W^(1) matrix.
- Hidden → output: M weights per output unit, +1 for the bias, on each of K output units → K(M + 1); the whole W^(2) matrix.
Multilayer (general recipe)
For an FNN with input dim p, hidden widths M_1, …, M_L, output dim K: total = M_1(p + 1) + M_2(M_1 + 1) + … + M_L(M_{L-1} + 1) + K(M_L + 1).
Each layer contributes a term of the form (units in this layer) × (units in the previous layer + 1).
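The same recipe as a short sketch (plain Python; the function name total_params and the widths-list convention are assumptions for illustration):

```python
def total_params(widths: list[int]) -> int:
    """widths = [p, M_1, ..., M_L, K]; each transition contributes n_out * (n_in + 1)."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(widths, widths[1:]))

print(total_params([3, 4, 1]))         # 21   (2025 exam Q3b.i below)
print(total_params([128, 32, 64, 5]))  # 6565 (2023 exam Q5e below)
```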
Insights & mental models
Why the +1
Every non-input layer has a bias per unit. Think of it as the intercept in regression: each hidden / output unit has its own “intercept” added to the weighted sum. Mechanically, you can imagine a constant input “1” feeding every layer with a separate weight for each receiving unit; that constant “1” carries the bias.
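A quick numpy sketch of the constant-1 view (illustrative only; the shapes follow the (p + 1) × M convention above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 observations, p = 3 inputs
W = rng.normal(size=(3, 4))   # input -> hidden weights, p x M
b = rng.normal(size=4)        # one bias per hidden unit, length M

pre_act = X @ W + b           # the usual "weighted sum plus bias"

# Append a constant-1 column and fold the biases into an extra weight row:
X_aug = np.hstack([X, np.ones((5, 1))])  # shape (5, p + 1)
W_aug = np.vstack([W, b])                # shape (p + 1, M)
assert np.allclose(pre_act, X_aug @ W_aug)
```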
Draw, don’t compute from memory
The prof’s exam advice is procedural, not formulaic. Draw boxes, draw connections, count edges (weights), count receiving nodes (biases). The formula falls out and you avoid sign mistakes.
Why this is exam-flagged
It’s a one-step, low-conceptual, calculator-free question that everyone should get right but that has a famous failure mode (skipping the biases). High discrimination value; the exam writers love these.
Worked examples (Exercise11 and past exams)
Exercise11.2a, network given as a formula:
- Single hidden layer, ReLU hidden activation, linear output
- Params: M(p + 1) + K(M + 1) for the stated input, hidden, and output sizes
Exercise11.2b, network given as a formula:
- Two hidden layers, ReLU at both hidden layers, sigmoid at output
- Note: the formula as written has no bias at the first hidden layer (the intercept term is missing); that’s a quirk of how this exercise is written, so double-check whether the formula includes biases explicitly.
- Standard answer (with biases at all layers): M_1(p + 1) + M_2(M_1 + 1) + K(M_2 + 1).
- If the first hidden layer truly omits biases (as written): subtract M_1 from that total. The prof would likely take the standard interpretation, but flag the ambiguity in the formula in your answer if you spot it (per L27 advice).
2025 exam Q3b.i (verbatim cited in L27-summary):
- 3 inputs, 1 hidden layer with 4 neurons, 1 output (regression)
- Hidden: 4 × (3 + 1) = 16. Output: 1 × (4 + 1) = 5. Total 21.
2023 exam Q5e (L27-summary grading reference):
- 128 inputs, hidden layers of 32 and 64, 5 output classes (softmax)
- Layer 1: 32 × (128 + 1) = 4128. Layer 2: 64 × (32 + 1) = 2112. Output: 5 × (64 + 1) = 325. Total 6565.
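To sanity-check by machine rather than by hand, a PyTorch count matches (this assumes torch is installed; nn.Linear includes a bias by default):

```python
import torch.nn as nn

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

q3b_2025 = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 1))
q5e_2023 = nn.Sequential(nn.Linear(128, 32), nn.ReLU(),
                         nn.Linear(32, 64), nn.ReLU(),
                         nn.Linear(64, 5))

print(n_params(q3b_2025))  # 21
print(n_params(q5e_2023))  # 6565
```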
Exam signals
“Q3b … given the network with stated input/hidden/output sizes: ‘How many weights total, including biases?’” - L27-summary
“His advice for the count: draw the network, then count the parameters, including bias terms.” - L27-summary
“Each hidden neuron has 3 input weights + 1 bias = 4 parameters. Total for hidden layer 4 × 4 = 16. Output neuron has 4 input weights + 1 bias = 5 parameters. Final total = 16 + 5 = 21.”, 2025 exam solution, L27-summary
The 2024 exam also had an architecture-interpretation MC referring to weight counts; the 2023 exam Q5e (“How many parameters do we need to estimate in total?”) is a near-identical task.
Pitfalls
- Forgetting biases: the canonical wrong answer. The 2023 exam Q5e listed “(ii) 6464” (no biases) as a distractor; the correct answer was “(i) 6565” (with biases). The prof in the answer key: “remember that there is one bias-node in each layer!”
- Counting the “+1” in the wrong place. The bias sits on the receiving layer (the new units), not the sending layer. So a transition from n_in inputs to n_out outputs has n_out × (n_in + 1) = n_in × n_out + n_out params, not n_in × n_out + n_in.
- Adding an extra bias for the input layer. Inputs are observed, no biases.
- Dropout is not a parameter reduction. Dropout multiplies activations by a random mask during training but doesn’t remove parameters. The 2023 exam Q5e had a distractor, “(iv)”, that applied 20% dropout to the parameter count; wrong, all weights are still estimated (see the check after this list).
- Conflating “neurons” with “parameters”. A network with M = 4 hidden neurons does not have 4 parameters; it has M(p + 1) + K(M + 1) (21 in the 2025 Q3b setup with p = 3 and K = 1). Neurons are the structure; parameters live on the connections + biases.
- Off-by-one when reading multilayer formulas. Trace each summation index back to the input dim (p) and output dim (K) carefully; missing a layer or miscounting an M is easy.
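A quick check of the dropout pitfall (again assuming PyTorch): adding Dropout layers changes nothing in the parameter count.

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())

plain = nn.Sequential(nn.Linear(128, 32), nn.ReLU(),
                      nn.Linear(32, 64), nn.ReLU(),
                      nn.Linear(64, 5))
dropped = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Dropout(p=0.2),
                        nn.Linear(32, 64), nn.ReLU(), nn.Dropout(p=0.2),
                        nn.Linear(64, 5))

print(count(plain), count(dropped))  # 6565 6565 -- dropout removes no parameters
```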
Scope vs ISLP
- In scope: the M(p + 1) + K(M + 1) formula for one hidden layer, layer-by-layer counting for deeper networks, biases included, dropout doesn’t reduce the count.
- Look up in ISLP: §10.1 (one hidden layer); the parameter-counting walkthrough is mostly in the slides, not in ISLP.
- Skip in ISLP: parameter-sharing accounting in CNNs (kernel sizes × depth, not on this exam since the prof excluded CNN architecture details, see L27-summary and the convolutional-neural-network atom).
Exercise instances
- Exercise11.2a: 1-hidden-layer ReLU FNN; parameters = M(p + 1) + K(M + 1) for the stated sizes.
- Exercise11.2b: 2-hidden-layer ReLU + sigmoid FNN; standard count M_1(p + 1) + M_2(M_1 + 1) + K(M_2 + 1).
How it might appear on the exam
- Direct question (2023, 2024, 2025 all had this): “Given network architecture X, how many parameters?” Multiple-choice with the no-bias version as a distractor.
- Drawing-from-formula style (Exercise11.2 verbatim): “Given the formula, identify the architecture and count parameters”, combine identification of activations with the count.
- Trap inside a longer question: a question may casually mention a “1049-parameter NN” (Hitters Comparison in L26-nnet-3); don’t try to verify the count under time pressure unless the question demands it.
- Combined with 2025 Q3b.ii style: parameter count as part (i), forward pass through one neuron as part (ii). Standard pairing.
Related
- feedforward-network: where the architecture and the layer widths live
- activation-functions: choice of activation does not change the parameter count
- nn-regularization: Exercise11.1d (“how can a 10000-weight model fit on 1000 obs?”) explicitly references the parameter overage; answer is regularization
- double-descent: the prof’s reason for being okay with more parameters than observations