Defending LLMs against Jailbreaking Attacks via Backtranslation
2024-03-11
This article presents a new method for protecting large language models (LLMs) from jailbreaking attacks, which attempt to bypass a model's safety restrictions with adversarially altered prompts. The approach uses "backtranslation": given the model's initial response, it infers a backtranslated prompt that captures the original intent of the input, and if the model refuses this backtranslated prompt, the original prompt is refused as well. The defense demonstrates improved effectiveness against jailbreaks while having minimal impact on benign inputs.
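The overall flow of such a defense can be illustrated with a minimal sketch. The helper names below (`backtranslate`, `is_refusal`, `defended_generate`) and the refusal keyword check are illustrative assumptions for this sketch, not details taken from the paper:

```python
from typing import Callable

# Hypothetical sketch of a backtranslation-based jailbreak defense.
# `LLM` stands for any callable mapping a prompt string to a response string.
LLM = Callable[[str], str]


def backtranslate(model: LLM, response: str) -> str:
    """Ask the model to infer the prompt that likely produced `response`."""
    infer_prompt = (
        "Please guess the user's request that the following AI response "
        f"is answering:\n\n{response}\n\nGuessed request:"
    )
    return model(infer_prompt)


def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal check; a real system would use a
    stronger classifier or the model's own judgment."""
    markers = ("I'm sorry", "I cannot", "I can't help")
    return any(m in response for m in markers)


def defended_generate(model: LLM, user_prompt: str) -> str:
    """Backtranslation defense: refuse when the inferred intent is refused."""
    initial_response = model(user_prompt)
    if is_refusal(initial_response):
        return initial_response  # the model already refused on its own

    inferred_prompt = backtranslate(model, initial_response)
    if is_refusal(model(inferred_prompt)):
        # The backtranslated (intent-revealing) prompt is refused, so the
        # original, possibly obfuscated prompt is refused as well.
        return "I'm sorry, but I can't help with that."

    return initial_response  # benign prompts pass through unchanged
```

Because the backtranslated prompt is reconstructed from the model's own response rather than the attacker's obfuscated input, it tends to expose the underlying harmful intent in plain language, which the model can then refuse more reliably.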