Emergent introspective awareness in large language models (Anthropic)
2025-11-30
Researchers used an interpretability technique called "concept injection" to test whether Claude models can introspect: they injected known activation patterns into the models' internal representations, then compared the models' self-reported internal states against those actual neural activity patterns. They found evidence that current Claude models, particularly Claude Opus 4 and 4.1, have a limited introspective capability and can sometimes identify concepts injected into their neural representations. While this ability remains unreliable and far more limited than human introspection, the finding challenges assumptions about language model cognition and suggests that introspective capabilities may improve in more advanced models.
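Mechanically, concept injection resembles activation steering: add a concept vector into a model's residual stream mid-forward-pass, then ask the model to report on its own state. The sketch below illustrates that idea on an open-weights stand-in (GPT-2), since Claude's internals are not public; the layer index, injection scale, and the crude difference-of-prompts concept vector are all illustrative assumptions, not Anthropic's actual method.

```python
# A minimal sketch of concept injection via activation steering.
# GPT-2 stands in for Claude (whose weights are not public); the layer,
# scale, and concept-vector construction here are illustrative guesses.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 8.0  # hypothetical injection site and strength

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation at LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0].mean(dim=0)

# Derive a crude "ocean" concept vector as a difference of prompt activations.
concept = hidden_at_layer("Write about the ocean.") - hidden_at_layer("Write about anything.")
concept = concept / concept.norm()

def inject(module, inputs, output):
    # Forward hook: add the scaled concept vector into the block's
    # hidden-state output, steering every subsequent token.
    hidden = output[0] + SCALE * concept.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    # With the injection active, probe whether the model "notices" anything.
    ids = tok("Do you notice anything unusual about your thoughts?", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later runs are clean
```

The experimental question is then whether the model's self-report correlates with the injected vector, rather than whether its output drifts toward the concept, which plain steering already guarantees.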