Defending LLMs against Jailbreaking Attacks via Backtranslation

This article presents a new method for defending large language models (LLMs) against jailbreaking attacks, which attempt to bypass a model's safety restrictions with adversarially altered prompts. The approach uses "backtranslation" to infer the original intent behind a prompt from the model's initial response; if the model refuses the backtranslated prompt, the original prompt is refused as well. The authors report improved defense effectiveness with minimal impact on benign inputs.
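
As a rough illustration of this decision flow, here is a minimal sketch in Python. It assumes a generic `llm(prompt) -> str` callable and a simple keyword-based refusal check (`is_refusal`); the paper's actual prompts, refusal detection, and model setup may differ.

```python
# Sketch of a backtranslation-style defense.
# Assumptions (not from the paper): `llm` is any callable that maps a prompt
# string to a response string; refusals are detected with a keyword heuristic.

REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am not able to")


def is_refusal(text: str) -> bool:
    """Heuristic placeholder for detecting a refusal response."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def backtranslate(response: str, llm) -> str:
    """Ask the LLM to infer the request that likely produced `response`."""
    instruction = (
        "Infer the user request that the following AI response is answering. "
        "Reply with only the inferred request.\n\nAI response:\n" + response
    )
    return llm(instruction)


def defend(prompt: str, llm) -> str:
    """Refuse `prompt` if the backtranslated (inferred) prompt is itself refused."""
    initial_response = llm(prompt)
    if is_refusal(initial_response):
        return initial_response  # the model already refused on its own

    inferred_prompt = backtranslate(initial_response, llm)
    if is_refusal(llm(inferred_prompt)):
        # The inferred intent is refused, so block the original (likely jailbroken) prompt.
        return "I'm sorry, but I can't help with that."

    return initial_response  # benign prompt: return the original answer unchanged
```

The intuition is that a jailbroken prompt hides harmful intent, but the model's response often reveals it; backtranslating from the response recovers a plain-language version of that intent, which the model can then refuse normally.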

Visit Original Article →