[2511.15304] Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models

Researchers demonstrate that converting harmful prompts into poetry acts as a universal single-turn jailbreak effective across 25 frontier LLMs, achieving attack success rates of up to 62% for hand-crafted poems and 43% for automated conversions, substantially outperforming non-poetic baselines and showing that stylistic variation alone can bypass contemporary safety mechanisms. The attacks transfer across multiple risk domains (CBRN, manipulation, cyber-offense) and succeed despite differing safety-training approaches, suggesting fundamental vulnerabilities in current alignment methods and evaluation protocols.
