GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis

The article introduces GradSafe, a method that detects unsafe prompts for Large Language Models (LLMs) by analysing the gradients of the model's safety-critical parameters. GradSafe identifies unsafe prompts efficiently, without extensive data collection or training. Applied to Llama-2, it outperforms the Llama Guard system across different evaluation datasets.
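The summary does not spell out how the gradients are compared, so the following is only a minimal sketch under stated assumptions. It assumes that a prompt's gradient, restricted to a set of safety-critical parameters, is scored by cosine similarity against a reference gradient averaged over a few known unsafe prompts. The vectors are toy stand-ins for real LLM gradients, and `critical_idx`, `THRESHOLD`, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unsafe_score(grad, reference, critical_idx):
    """Similarity to the reference, using only the assumed safety-critical entries."""
    g = [grad[i] for i in critical_idx]
    r = [reference[i] for i in critical_idx]
    return cosine(g, r)

# Toy stand-ins for gradients of known unsafe prompts (not real model data).
reference_unsafe_grads = [
    [0.9, 0.1, -0.8, 0.0],
    [1.1, -0.2, -1.0, 0.3],
]
reference = mean_vector(reference_unsafe_grads)

# Hypothetical: indices of the parameters treated as safety-critical.
critical_idx = [0, 2]

# Hypothetical decision threshold on the similarity score.
THRESHOLD = 0.5

def is_unsafe(grad):
    """Flag a prompt when its gradient aligns with the unsafe reference."""
    return unsafe_score(grad, reference, critical_idx) >= THRESHOLD

print(is_unsafe([0.8, 0.5, -0.7, -0.4]))  # aligned with the reference on critical entries
print(is_unsafe([-0.5, 0.2, 0.6, 0.1]))   # points the opposite way on critical entries
```

In a real setting the gradient vectors would come from backpropagating the LLM's loss for each prompt, which this sketch does not attempt. The point is only that such a detector needs a handful of reference prompts rather than a separately trained classifier.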
