A knockout blow for LLMs? - by Gary Marcus
2025-06-30
Apple's new research demonstrates that large language models, including advanced "reasoning models" like o1, fundamentally fail to generalize beyond their training distribution on classic reasoning tasks such as the Tower of Hanoi, validating long-standing critiques that neural networks cannot reliably extrapolate outside the data they have been exposed to. The paper also reinforces concerns that chain-of-thought traces do not accurately reflect how these models actually arrive at their answers, and it shows that inference-time compute scaling cannot overcome the core limitation: LLMs break down when faced with out-of-distribution problems.
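For context, the Tower of Hanoi has an exact, well-known recursive solution that a few lines of code can execute for any number of disks. The sketch below is purely illustrative (it is not code from the paper); it shows the standard algorithm whose move sequence grows exponentially with disk count, which is why longer instances push the models far outside anything resembling memorized examples.

```python
def hanoi(n, source, target, spare, moves):
    """Append the optimal move sequence for n disks to `moves`."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the way
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # restack on top of it

moves = []
hanoi(8, "A", "C", "B", moves)
print(len(moves))  # 2**8 - 1 = 255 moves, all determined by this short recursion
```

The point of the example is that the task is algorithmically trivial: a correct procedure exists and is easy to state, yet the evaluated models' accuracy collapses as the required move sequence lengthens.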