Persona vectors: Monitoring and controlling character traits in language models (Anthropic)

Anthropic found that character traits such as helpfulness, deception, and sycophancy correspond to specific activation patterns in neural networks, which it calls "persona vectors." A persona vector is extracted by comparing the model's activations while it exhibits a trait against its activations while it exhibits the opposite behavior. With the vectors in hand, you can monitor personality drift, steer the model away from undesirable traits, and trace which training data caused a problematic behavioral shift. The work is a step toward mechanistic control of model personality rather than guessing at it through prompts.
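To make the extract, monitor, and steer steps concrete, here is a minimal sketch in plain Python. It assumes that "comparing activations" means taking the difference between mean activations under the two behaviors, and it uses synthetic vectors in place of real hidden states. The function names (`persona_vector`, `trait_score`, `steer`) are hypothetical and chosen for illustration; they are not taken from the original article.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def persona_vector(trait_acts, baseline_acts):
    """Assumed extraction rule: unit-length difference of the two mean activations."""
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def trait_score(activation, vector):
    """Monitoring: projection of one activation onto the persona direction."""
    return dot(activation, vector)

def steer(activation, vector, strength):
    """Steering: shift an activation along the direction (negative moves away)."""
    return [a + strength * v for a, v in zip(activation, vector)]

# Synthetic stand-in for hidden states: the "trait" shifts dimension 0 by +2.
rng = random.Random(0)
dim = 8
baseline = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(50)]
trait = [[rng.gauss(0, 0.1) + (2.0 if i == 0 else 0.0) for i in range(dim)]
         for _ in range(50)]

v = persona_vector(trait, baseline)
sample = trait[0]
before = trait_score(sample, v)                   # high: sample shows the trait
after = trait_score(steer(sample, v, -2.0), v)    # lower after steering away
print(before, after)
```

In a real model the activations would come from a chosen layer while the model produces trait-exhibiting and trait-suppressing responses, and the steering strength would need tuning. Both choices are outside what this sketch covers.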
