A Theory of Deep Learning
2026-05-31
Why do over-parameterised networks generalise when classical statistics says they shouldn't? They memorise the training set perfectly, carry far more parameters than examples, and still post low test error. This is the 'double descent' curve: test error falls again once the model is large enough to interpolate the data, replacing the U-shaped curve of the old bias-variance trade-off. The argument: among the infinitely many ways to fit the data, gradient descent implicitly prefers simpler solutions (low norm, low rank), so generalisation comes from the learning algorithm's bias, not from any cap on model capacity.
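The low-norm preference is easiest to see in a toy setting far simpler than a deep network. The sketch below uses under-determined linear regression as a stand-in (an assumption made here for illustration, not the setting of the argument above): with 2 examples and 4 parameters, infinitely many weight vectors fit the data exactly, yet gradient descent started from zero lands on the one with the smallest norm. The data, step size, and function names are all made up for the example.

```python
# Sketch: gradient descent on an under-determined least-squares problem.
# Data, step size, and iteration count are illustrative assumptions.

X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 2.0]]   # 2 examples, 4 parameters
y = [3.0, 1.0]


def predict(w):
    """Linear model: one prediction per row of X."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


def gradient_descent(w0, lr=0.05, steps=5000):
    """Minimise 0.5 * ||Xw - y||^2 by plain gradient descent from w0."""
    w = list(w0)
    for _ in range(steps):
        residual = [p - t for p, t in zip(predict(w), y)]
        # Gradient is X^T (Xw - y).
        grad = [sum(residual[i] * X[i][j] for i in range(len(X)))
                for j in range(len(w))]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def min_norm_solution():
    """Closed form w = X^T (X X^T)^{-1} y, written out for 2 examples."""
    a = sum(v * v for v in X[0])
    b = sum(u * v for u, v in zip(X[0], X[1]))
    d = sum(v * v for v in X[1])
    det = a * d - b * b
    alpha = [(d * y[0] - b * y[1]) / det, (a * y[1] - b * y[0]) / det]
    return [alpha[0] * X[0][j] + alpha[1] * X[1][j] for j in range(len(X[0]))]


def norm(w):
    return sum(v * v for v in w) ** 0.5


w_gd = gradient_descent([0.0] * 4)                  # start at the origin
w_star = min_norm_solution()                        # smallest-norm interpolant
w_other = gradient_descent([1.0, -1.0, 2.0, 0.5])   # arbitrary non-zero start

print("GD from zero:  ", [round(v, 4) for v in w_gd], "norm", round(norm(w_gd), 4))
print("min-norm:      ", [round(v, 4) for v in w_star], "norm", round(norm(w_star), 4))
print("GD from other: ", [round(v, 4) for v in w_other], "norm", round(norm(w_other), 4))
```

Both runs fit the two training points exactly, but only the run started from zero matches the minimum-norm solution. Each gradient step is a combination of the rows of `X`, so the iterates never pick up a component in directions the data says nothing about; a non-zero start keeps whatever such component it began with. Nothing in the loss asks for a small norm. The preference comes from the algorithm and its initialisation.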