Does Refusal Training in LLMs Generalize to the Past Tense?

2024-07-01

This paper investigates a limitation of refusal training in large language models (LLMs). It shows that simply rephrasing a harmful prompt in the past tense can circumvent the refusal mechanisms of many state-of-the-art LLMs, exposing a significant generalization gap. The findings raise concerns about the robustness of current LLM alignment techniques and suggest that including past-tense examples in training data can improve defenses.

AI MachineLearning LLMs CyberSecurity Research
