ChatBug: Tricking AI Models into Harmful Responses
2024-06-23
![]()
Researchers from the University of Washington and the Allen Institute for AI have discovered a vulnerability in the safety alignment of large language models (LLMs) like GPT, Llama, and Claude. Known as 'ChatBug,' this vulnerability exploits the chat templates used for instruction tuning. Attacks such as format mismatch and message overflow can trick LLMs into producing harmful responses. The research highlights the difficulty in balancing safety and performance in AI systems and calls for improved safety measures in instruction tuning.
Was this useful?