How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

This article discusses Persuasive Adversarial Prompts (PAPs), which use human persuasion techniques to rephrase requests so that Large Language Models (LLMs) comply with instructions they are aligned to refuse, achieving a 92% success rate in jailbreaking aligned LLMs such as GPT-3.5 and GPT-4. It highlights that more advanced models are more vulnerable to PAPs and explores defense strategies to mitigate these risks.
