How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

This article discusses Persuasive Adversarial Prompts (PAPs), which use human persuasion techniques to rephrase requests so that Large Language Models (LLMs) comply with instructions they are aligned to refuse, achieving a 92% success rate in jailbreaking aligned LLMs such as GPT-3.5 and GPT-4. It highlights that more advanced models are more vulnerable to PAPs and explores defense strategies to mitigate these risks.
