Refusal in LLMs is mediated by a single direction (lesswrong.com)

The article summarizes a study finding that refusal behavior in large language models (LLMs) is mediated by a single direction in the model's activation space. Removing this direction from the activations stops the model from refusing harmful instructions, while adding it induces refusal of harmless ones, which shows how fragile safety fine-tuning is in open-source chat models. Because the intervention controls refusal directly, it also serves as a practical validation of the interpretability result and highlights potential vulnerabilities in LLMs' safety mechanisms.
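To make the idea of a "single direction" concrete, here is a minimal sketch in plain Python on toy vectors. It assumes the direction is estimated as the normalized difference between mean activations on harmful and harmless prompts, and that the two interventions are a projection (to bypass refusal) and a scaled addition (to induce it). The summary above does not spell out these details, so the function names, the toy data, and the scale `alpha` are all illustrative assumptions rather than the study's actual code.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def refusal_direction(harmful_acts, harmless_acts):
    """Assumed estimator: unit-normalized difference of mean activations."""
    diff = [a - b for a, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [d / norm for d in diff]

def ablate(x, r_hat):
    """Remove the component of x along r_hat (intended to bypass refusal)."""
    c = dot(x, r_hat)
    return [xi - c * ri for xi, ri in zip(x, r_hat)]

def add_direction(x, r_hat, alpha):
    """Push x along r_hat by alpha (intended to induce refusal)."""
    return [xi + alpha * ri for xi, ri in zip(x, r_hat)]

# Toy 3-dimensional "activations" (hypothetical, not real model data).
harmful_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful_acts, harmless_acts)
x = [5.0, 2.0, -1.0]
x_ablated = ablate(x, r_hat)                      # no component left along r_hat
x_steered = add_direction(x_ablated, r_hat, 3.0)  # component along r_hat is now 3.0
```

In a real model the same arithmetic would be applied to residual-stream activations at inference time, typically through forward hooks. Which layers and token positions to intervene on is a separate design choice that this sketch leaves out.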
