GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis

2024-02-02

The article introduces GradSafe, a method that detects unsafe prompts for Large Language Models (LLMs) by analysing the gradients of safety-critical parameters. It identifies unsafe prompts efficiently, without extensive data collection or training. In the reported evaluation, GradSafe applied to Llama-2 outperforms the Llama Guard system across several datasets.
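
The summary above does not say how gradients are turned into a decision, so the following is only a minimal sketch of one plausible reading: compare a prompt's gradient on the safety-critical parameters with a reference gradient averaged over a few known-unsafe prompts, and flag the prompt when the two point in a similar direction. The cosine-similarity scoring, the 0.5 threshold, the function names, and the three-element toy vectors are all assumptions made for illustration. In practice the gradients would come from a backward pass through the model; here they are hard-coded.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def is_unsafe(prompt_grad, reference_grad, threshold=0.5):
    """Flag a prompt whose gradient aligns with the unsafe reference gradient.

    The threshold is a hypothetical value chosen for this toy example.
    """
    return cosine_similarity(prompt_grad, reference_grad) >= threshold


# Toy stand-ins for gradients taken on safety-critical parameters (assumed data).
reference_unsafe = mean_vector([[0.9, -0.4, 0.1], [1.1, -0.6, 0.0]])

candidate_a = [0.8, -0.5, 0.05]  # points the same way as the reference
candidate_b = [-0.2, 0.9, 0.7]   # points a different way

print(is_unsafe(candidate_a, reference_unsafe))
print(is_unsafe(candidate_b, reference_unsafe))
```

A detector along these lines needs only a handful of reference prompts and no classifier training, which matches the summary's point about avoiding extensive data collection.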

ai machinelearning cybersecurity gradsafe llms
