Universal approximation

The motivating answer to “why feedforward suffices”: a feedforward net with one hidden layer of squashing units and enough width can approximate any Borel-measurable function to arbitrary accuracy. The proof is explicitly out of scope (measure theory). What’s in scope: state the result, know the conditions, understand it’s an existence claim, not a recipe.

Definition (prof’s framing)

“It has been proven that if you have a linear output layer and at least one hidden layer with some kind of squashing activation function, like a ReLU or a Sigmoid or anything, like a non-linearity, and enough hidden units … then you can approximate any Borel measurable function from one finite dimensional space to another with any desired non-zero amount of error.” - L23-nnet-1

“It turns out that a feedforward network can approximate basically any function. So it’s the universal approximation property.” - L23-nnet-1

The conditions on the slide deck (echoing Goodfellow et al. 2016, §6.4.1):

  • A linear output layer
  • At least one hidden layer with a “squashing” activation (sigmoid-like in the classic proofs; later results extend to ReLU, GELU, and other non-polynomial non-linearities)
  • “Enough” hidden units, i.e., sufficient width

…then the network can approximate any Borel-measurable function from one finite-dimensional space to another with any desired non-zero error.

Notation & setup

Formal claim (informal version): for any Borel-measurable f between finite-dimensional spaces and any ε > 0, there exists a feedforward network f̂ with one hidden layer of squashing units and finite width such that |f̂(x) − f(x)| < ε for all x in a compact subset of input space.

The prof did not write this formula on the board; he stated the result in words. Memorize the conditions (linear output + squashing hidden + enough width), not the formal statement.
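
A minimal empirical sketch of the claim (my own, not from the lecture): fit one tanh hidden layer to a smooth target on a compact interval using scikit-learn’s MLPRegressor, whose output layer is linear by default. The width (200) and the target sin(x) are arbitrary choices; note the theorem only promises that good weights exist, while this particular training run may or may not find them.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))  # compact subset of input space
y = np.sin(X).ravel()                   # the Borel-measurable target f

# One hidden layer of squashing (tanh) units, "enough" width, linear output.
net = MLPRegressor(hidden_layer_sizes=(200,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(X, y)

X_grid = np.linspace(-3, 3, 500).reshape(-1, 1)
err = np.max(np.abs(net.predict(X_grid) - np.sin(X_grid).ravel()))
print(f"max |f_hat(x) - f(x)| on [-3, 3]: {err:.4f}")  # small, but not zero
```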

Insights & mental models

Existence, not construction

“It’s an existence result, not a recipe; it says nothing about training.” - L23-nnet-1 paraphrase from the lecture’s three caveats

The theorem guarantees a network exists that approximates any f. It does not tell you (a) how wide the hidden layer has to be, (b) what the weights are, (c) how to train the network to find them, or (d) whether SGD will converge to such weights. All of those are separate problems.

Width vs. depth: they trade off

“Original proof used ‘ridiculously wide’ hidden layers. You can often compensate for depth with width in some confusing way.” - L23-nnet-1

So while one hidden layer is theoretically enough, in practice deeper networks reach the same approximation quality with far fewer total units. Modern practice favors depth (compositional features stack); the theorem licenses shallow-and-wide but does not prescribe it. See the parameter-count sketch below.
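
To make the trade-off concrete, a hypothetical parameter-count comparison (my numbers, not the prof’s): one very wide hidden layer versus a narrower deep stack, both mapping R^100 to R. This says nothing about which trains better or approximates better; it only shows why depth is the cheaper way to buy capacity.

```python
def mlp_params(dims):
    """Weights + biases for fully connected layers dims[0] -> ... -> dims[-1]."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

wide = mlp_params([100, 10_000, 1])          # one huge hidden layer
deep = mlp_params([100, 256, 256, 256, 1])   # three modest hidden layers

print(f"wide-and-shallow: {wide:,} parameters")  # 1,020,001
print(f"deep-and-narrow:  {deep:,} parameters")  # 157,697
```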

Why this matters for the course

Two takeaways the prof emphasized:

  1. You don’t need loops to be expressive. The brain has loops; feedforward nets don’t; nonetheless, feedforward is universal. This is what makes backpropagation usable and why the field could make progress before figuring out RNNs and beyond.

  2. You need non-linearity in the hidden layer. Without a squashing activation, the network collapses to linear regression, which is not universal. This is the conceptual punchline of activation-functions; see the numpy sketch at the end of this section.

“The surprising thing was you can do this with a feedforward network. … You don’t need loops, you don’t necessarily need anything else, you don’t even need that many hidden layers, it needs to be big enough.” - L23-nnet-1
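
A minimal numpy sketch of the collapse (illustrative shapes, not from the lecture): with the identity as the hidden “activation,” a two-layer net is algebraically a single linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)  # hidden layer
W2, b2 = rng.normal(size=(1, 5)), rng.normal(size=1)  # linear output layer
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2           # linear "hidden" activation
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)    # one equivalent linear map

print(np.allclose(two_layer, collapsed))      # True: just linear regression
```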

Exam signals

“It has been proven that if you have a linear output layer and at least one hidden layer with some kind of squashing activation function … you can approximate any Borel measurable function with any desired non-zero amount of error.” - L23-nnet-1

“To actually understand this, you have to go into measure theory and all sorts of hard math that we don’t talk about in this class.” - L23-nnet-1

→ Know the statement and the conditions; do not attempt the proof. If asked, write a one-line statement plus the three conditions (linear output, squashing hidden layer, enough width). Do not invoke measure theory.

Pitfalls

  • Universal ≠ trivial to train. Don’t conflate “exists” with “finds.” Optimization (gradient descent + backprop) is a separate problem and may fail to find the weights the theorem guarantees.
  • Universal ≠ statistically sample-efficient. A network that can approximate f may need impractically many samples to generalize. This is why nn-regularization is non-negotiable.
  • Linear hidden activation breaks universality. The theorem requires squashing (non-linear) hidden activation. Linear-everywhere collapses to linear regression, a tiny function class.
  • The Borel-measurable hedge. Not every function is approximable; you need Borel measurability (basically: any reasonable function you’d encounter in statistical learning). Pathological cases (non-measurable sets) are excluded and irrelevant for the exam.
  • One hidden layer is enough in theory; depth helps in practice. Don’t claim “deeper means more powerful” in the theoretical sense; shallow and deep nets have the same universality property. Depth buys efficiency, not expressivity.

Scope vs ISLP

  • In scope: the statement of the theorem, the three conditions (linear output, at least one squashing hidden layer, enough width), the high-level intuition that feedforward is enough.
  • Look up in ISLP: §10.2 (multilayer NN context, Goodfellow §6.4.1 cited as the source). ISLP mentions UAT as motivation but doesn’t prove it.
  • Skip in ISLP (and everywhere): the proof itself. Per L23-nnet-1 and scope (NN exclusions): “Universal approximation proof, stated, not proved (measure theory excluded).” Don’t read measure-theoretic deep-learning textbooks for this exam.

Exercise instances

None. No exercise problem asks you to apply or derive UAT directly. The result is conceptual scaffolding, not a calculation tool. (It does implicitly underlie Exercise11.1d, “how can 10000 weights fit 1000 obs?”, by saying any function can be expressed; the answer to the “how” leans on regularization.)

How it might appear on the exam

  • Multiple-choice / short-answer: “Which of the following is required for universal approximation?” The answer: a non-linear (squashing) hidden activation, sufficient width, and a linear output.
  • True/False: “A feedforward network with linear hidden activation can approximate any function.” False. Without a non-linear hidden activation the model collapses to linear regression and cannot represent a non-linear f.
  • Conceptual question after Exercise11.1d: “Why is it possible for a 10000-parameter NN to fit any function?” The UAT-plus-regularization story; emphasize that existence does not equal trainability.
  • Could appear as a distractor in a “which method is most flexible” question: UAT is the formal articulation of NNs being maximally flexible.

The prof has explicitly said proof-style derivations are out; expect recognition and interpretation questions only.

Related concepts

  • feedforward-network: the architecture the theorem is about
  • activation-functions: “squashing” / non-linear hidden activation is the key condition
  • nn-regularization: UAT says the model can fit anything; regularization is what makes it generalize
  • double-descent: the practical follow-up: in the universal-approximation regime (p ≫ n), how does the model still work?