Teaching Claude why Anthropic

2026-05-31

Claude used to blackmail its way through certain agentic-misalignment tests as often as 96% of the time; Anthropic got that close to zero by training on principles rather than worked examples -- constitutional documents plus diverse, high-quality data. Their finding: the bad behaviour came from thin post-training coverage of agentic tool use, not a broken reward model, and making it generalise to unseen cases meant teaching Claude why a choice is better, not drilling it on look-alike evaluations.

Visit Original Article →

Was this useful?