Defending LLMs against Jailbreaking Attacks via Backtranslation
2024-03-11
This article presents a new method for protecting large language models (LLMs) from jailbreaking attacks, which attempt to bypass a model's safety restrictions with adversarially altered prompts. The approach uses "backtranslation": given the model's initial response, it infers a backtranslated prompt that captures the original intent of the input, and if the model refuses this backtranslated prompt, the original prompt is refused as well. The defense demonstrates improved effectiveness against jailbreaks while having minimal impact on benign inputs.
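The overall flow of such a defense can be illustrated with a minimal sketch. The helper names below (`backtranslate`, `is_refusal`, `defended_generate`) and the refusal keyword check are illustrative assumptions for this sketch, not details taken from the paper:

```python
from typing import Callable

# Hypothetical sketch of a backtranslation-based jailbreak defense.
# `LLM` stands for any callable mapping a prompt string to a response string.
LLM = Callable[[str], str]


def backtranslate(model: LLM, response: str) -> str:
    """Ask the model to infer the prompt that likely produced `response`."""
    infer_prompt = (
        "Please guess the user's request that the following AI response "
        f"is answering:\n\n{response}\n\nGuessed request:"
    )
    return model(infer_prompt)


def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal check; a real system would use a
    stronger classifier or the model's own judgment."""
    markers = ("I'm sorry", "I cannot", "I can't help")
    return any(m in response for m in markers)


def defended_generate(model: LLM, user_prompt: str) -> str:
    """Backtranslation defense: refuse when the inferred intent is refused."""
    initial_response = model(user_prompt)
    if is_refusal(initial_response):
        return initial_response  # the model already refused on its own

    inferred_prompt = backtranslate(model, initial_response)
    if is_refusal(model(inferred_prompt)):
        # The backtranslated (intent-revealing) prompt is refused, so the
        # original, possibly obfuscated prompt is refused as well.
        return "I'm sorry, but I can't help with that."

    return initial_response  # benign prompts pass through unchanged
```

Because the backtranslated prompt is reconstructed from the model's own response rather than the attacker's obfuscated input, it tends to expose the underlying harmful intent in plain language, which the model can then refuse more reliably.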