Direct Preference Optimisation: Your Language Model is Secretly a Reward Model
2024-01-08
This paper introduces Direct Preference Optimisation (DPO), a new method for training language models to align with human preferences without explicit reward modelling or reinforcement learning. DPO offers a simpler, more stable, and more efficient alternative to existing methods, performing as well as or better than them on tasks such as sentiment modulation, summarisation, and single-turn dialogue.
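For concreteness, the heart of DPO is a classification-style loss over preference pairs: it pushes the policy to assign a higher likelihood (relative to a frozen reference model) to the preferred response than to the dispreferred one. Below is a minimal PyTorch sketch of that loss under my own naming conventions; the function name, tensor shapes, and toy numbers are illustrative, not the paper's reference implementation, and `beta` is the temperature that controls how far the policy may drift from the reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is a tensor of per-example sequence log-probabilities
    log pi(y|x) for the preferred (chosen) and dispreferred (rejected) responses.
    """
    # Implicit reward margin: how much more the policy favours y_w over y_l,
    # measured relative to the frozen reference model.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratio - ref_logratio)
    # Binary cross-entropy-style loss on the preference pairs.
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -9.5])
ref_chosen = torch.tensor([-12.0, -10.0])
ref_rejected = torch.tensor([-13.5, -9.9])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because the loss depends only on log-probability ratios, no separate reward model or RL rollout loop is needed; training reduces to ordinary gradient descent on logged preference data.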