Persona vectors: Monitoring and controlling character traits in language models (Anthropic)
2026-03-31
Anthropic found that character traits such as sycophancy, evil, and a tendency to hallucinate correspond to specific directions in a model's activation space, which it calls "persona vectors." A vector is extracted by comparing the model's activations when it exhibits a trait with its activations when it does not. Once extracted, the vectors can be used to monitor personality drift, steer the model away from unwanted traits, and flag training data likely to cause a problematic behavioral shift. It is a step toward mechanistic control of model personality, rather than guessing at it through prompts.
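The extract, monitor, and steer steps can be sketched with a simple difference of means. This is a minimal illustration, not Anthropic's implementation: the activations are synthetic random arrays standing in for hidden states captured from a real model, and the names `persona_vector`, `trait_score`, and `steer` are hypothetical.

```python
import numpy as np


def persona_vector(trait_acts, baseline_acts):
    """Unit-length difference between mean trait and mean baseline activations."""
    diff = trait_acts.mean(axis=0) - baseline_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def trait_score(activations, vector):
    """Monitoring: project activations onto the persona vector."""
    return activations @ vector


def steer(activations, vector, alpha):
    """Steering: shift activations along the vector (negative alpha steers away)."""
    return activations + alpha * vector


# Assumption: synthetic stand-ins for hidden states. In practice these would be
# collected from one layer of a model under trait-eliciting and neutral prompts.
rng = np.random.default_rng(0)
dim = 16
true_direction = np.zeros(dim)
true_direction[0] = 1.0  # the hidden "trait" axis in this toy setup

baseline_acts = rng.normal(size=(200, dim))
trait_acts = rng.normal(size=(200, dim)) + 3.0 * true_direction

v = persona_vector(trait_acts, baseline_acts)
scores_trait = trait_score(trait_acts, v)
scores_base = trait_score(baseline_acts, v)

# Steer the trait-heavy activations back toward the baseline.
steered = steer(trait_acts, v, alpha=-3.0)
scores_steered = trait_score(steered, v)

print("mean score (trait):   ", scores_trait.mean())
print("mean score (baseline):", scores_base.mean())
print("mean score (steered): ", scores_steered.mean())
```

The same projection could in principle be applied to activations computed on individual training samples, so that high-scoring samples are flagged before fine-tuning. That use is an extrapolation from the summary above, not something this sketch demonstrates.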