Persona vectors: Monitoring and controlling character traits in language models (Anthropic)

2026-03-31

Anthropic found that character traits like helpfulness, deception, and sycophancy correspond to specific activation patterns in neural networks, which it calls "persona vectors." A vector is extracted by comparing the model's activations while it exhibits a trait against its activations while it exhibits the opposite behavior. With the vectors in hand, you can monitor personality drift, steer the model away from undesirable traits, and trace which training data caused a problematic behavioral shift. The work is a step toward mechanistic control of model personality rather than guessing at it through prompts.
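To make the extract, monitor, and steer steps concrete, here is a minimal sketch on synthetic data. It assumes that "comparing activations during opposing behaviors" can be approximated by a difference of mean activations, which is one common reading and not necessarily Anthropic's exact procedure. The function names (`persona_vector`, `trait_score`, `steer`), the 16-dimensional activations, and the planted trait direction are all hypothetical, chosen only for illustration.

```python
import numpy as np

def persona_vector(trait_acts, baseline_acts):
    """Unit-length difference between the mean activations of two behaviors.

    Assumption: a difference of means stands in for "comparing activations".
    """
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def trait_score(activation, vector):
    """Monitoring: project one activation onto the persona vector."""
    return float(activation @ vector)

def steer(activation, vector, alpha):
    """Steering: shift an activation along the vector (negative alpha moves away)."""
    return activation + alpha * vector

# Synthetic stand-in for real model activations (hypothetical data).
rng = np.random.default_rng(0)
dim = 16
planted = np.zeros(dim)
planted[0] = 1.0  # the "trait" lives along the first axis in this toy setup

baseline_acts = rng.normal(size=(200, dim))
trait_acts = rng.normal(size=(200, dim)) + 3.0 * planted

v = persona_vector(trait_acts, baseline_acts)

h = rng.normal(size=dim)           # one new activation to monitor
steered = steer(h, v, alpha=-2.0)  # push it away from the trait

print("score before steering:", trait_score(h, v))
print("score after steering: ", trait_score(steered, v))
```

Because `v` has unit length, steering with `alpha=-2.0` lowers the projected score by exactly 2. In a real model the activations would come from a chosen layer, and the steering strength would need tuning against output quality.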

