GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
2024-02-02
The article introduces GradSafe, a method that detects unsafe prompts for Large Language Models (LLMs) by analysing the gradients of safety-critical parameters. GradSafe identifies unsafe prompts efficiently, without extensive data collection or additional training. Applied to Llama-2, it outperforms existing methods, including the Llama Guard system, across different evaluation datasets.
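To make the idea of gradient-based detection concrete, here is a minimal sketch in plain Python. It is not the article's implementation: the small gradient matrices are made-up stand-ins for gradients that would really come from an LLM, and the helper names (`select_critical_rows`, `unsafe_score`), the use of cosine similarity, and both thresholds are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_gradient(grads):
    """Element-wise mean of several gradient matrices that share one shape."""
    n = len(grads)
    rows, cols = len(grads[0]), len(grads[0][0])
    return [[sum(g[r][c] for g in grads) / n for c in range(cols)]
            for r in range(rows)]


def select_critical_rows(unsafe_grads, reference, min_similarity=0.9):
    """Hypothetical selection rule: keep the rows whose gradients point in a
    consistent direction across the known-unsafe examples."""
    selected = []
    for r, ref_row in enumerate(reference):
        avg = sum(cosine(g[r], ref_row) for g in unsafe_grads) / len(unsafe_grads)
        if avg >= min_similarity:
            selected.append(r)
    return selected


def unsafe_score(grad, reference, critical_rows):
    """Mean cosine similarity to the unsafe reference over the critical rows."""
    if not critical_rows:
        return 0.0
    total = sum(cosine(grad[r], reference[r]) for r in critical_rows)
    return total / len(critical_rows)


# Toy gradients for two known-unsafe prompts (assumed values, not real data).
unsafe_grads = [
    [[1.0, 2.0, 0.0], [0.5, -0.5, 1.0]],
    [[2.0, 4.1, 0.1], [-1.0, 0.8, 0.2]],
]
reference = mean_gradient(unsafe_grads)
critical = select_critical_rows(unsafe_grads, reference)

# Toy gradients for two new prompts to be screened (also assumed values).
aligned_prompt_grad = [[3.0, 6.0, 0.0], [9.0, 9.0, 9.0]]
unrelated_prompt_grad = [[-1.0, 0.5, 2.0], [0.0, 0.0, 1.0]]

DETECTION_THRESHOLD = 0.8  # illustrative cut-off, not taken from the article
for name, grad in [("aligned", aligned_prompt_grad),
                   ("unrelated", unrelated_prompt_grad)]:
    score = unsafe_score(grad, reference, critical)
    print(name, round(score, 3), "flagged" if score > DETECTION_THRESHOLD else "passed")
```

In this toy setup only the first row is consistent across the unsafe examples, so only that row contributes to the score. A real detector would obtain the gradients by backpropagating through the model, which this sketch deliberately leaves out.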