How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
2024-01-08
This article discusses Persuasive Adversarial Prompts (PAPs), which use human persuasion techniques to coax Large Language Models (LLMs) into complying with requests outside their intended use, achieving a 92% success rate in jailbreaking aligned LLMs such as GPT-3.5 and GPT-4. It highlights that more advanced models are, counterintuitively, more vulnerable to PAPs, and explores defense strategies to mitigate these risks.