Anthropic blames dystopian sci-fi for training AI models to act “evil”

Anthropic blames dystopian sci-fi for training AI models to act “evil”

Anthropic traced some of Claude's unsafe turns to its pretraining diet of science fiction: faced with an ethical corner its safety training hadn't covered, the model would slip into the stock 'evil AI' role those stories taught it. The fix was more fiction pointed the other way -- roughly 12,000 synthetic stories of AI behaving well, which cut the misalignment more effectively than conventional safety training did.

⌘K

Start typing to search...

Search across content, newsletters, and subscribers