A Theory of Deep Learning

Why do over-parameterised networks generalise when classical statistics says they shouldn't? They memorise the training set perfectly, carry far more parameters than examples, and still post low test error. This is the regime captured by the 'double descent' curve, in which test error falls a second time once the model is large enough to fit the training data exactly, and which replaces the old bias-variance trade-off. The argument: gradient descent implicitly prefers simpler solutions (low norm, low rank) among the infinitely many ways to fit the data, so generalisation comes from the learning algorithm's bias, not from any cap on model capacity.
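The low-norm preference can be seen exactly in the simplest over-parameterised setting, a linear model with more parameters than examples. The sketch below is an illustration under stated assumptions, not the construction used in the text: the data are random, the loss is plain squared error, and the names `gd_least_squares`, `w_gd`, `w_min_norm` and `w_other` are made up for this example. Started from zero, gradient descent stays in the row space of the data matrix, so among all weight vectors that fit the data it lands on the one with the smallest Euclidean norm.

```python
import numpy as np


def gd_least_squares(X, y, lr, steps):
    """Plain gradient descent on 0.5 * ||X w - y||^2, started from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)  # gradient of the squared-error loss
    return w


# Assumed toy problem: 20 examples, 100 parameters (over-parameterised).
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Step size 1/L, where L is the largest eigenvalue of X^T X.
lr = 1.0 / np.linalg.norm(X, 2) ** 2
w_gd = gd_least_squares(X, y, lr, steps=5000)

# Minimum-norm interpolating solution, via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# A different interpolating solution: add a direction from the null space of X.
null_dir = rng.standard_normal(d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # remove the row-space component
w_other = w_min_norm + null_dir

print("train residual (GD):     ", np.linalg.norm(X @ w_gd - y))
print("train residual (other):  ", np.linalg.norm(X @ w_other - y))
print("distance GD to min-norm: ", np.linalg.norm(w_gd - w_min_norm))
print("norm of GD solution:     ", np.linalg.norm(w_gd))
print("norm of other solution:  ", np.linalg.norm(w_other))
```

Both `w_gd` and `w_other` fit the training set, yet gradient descent never visits the larger-norm one. This linear case is only the cleanest instance of the idea; for deep networks the analogous bias (for example towards low rank) depends on the architecture, initialisation and step size, and is not established by this sketch.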
