Direct Preference Optimisation: Your Language Model is Secretly a Reward Model
2024-01-08
This paper introduces Direct Preference Optimisation (DPO), a new method for training language models to align with human preferences without explicit reward modelling or reinforcement learning. DPO offers a simpler, more stable, and more efficient alternative to existing methods, performing as well as or better than them on tasks such as sentiment modulation, summarisation, and single-turn dialogue.
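For concreteness, the heart of DPO is a classification-style loss over preference pairs: it pushes the policy to assign a higher likelihood (relative to a frozen reference model) to the preferred response than to the dispreferred one. Below is a minimal PyTorch sketch of that loss under my own naming conventions; the function name, tensor shapes, and toy numbers are illustrative, not the paper's reference implementation, and `beta` is the temperature that controls how far the policy may drift from the reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective:
    -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    Each argument is a tensor of per-example sequence log-probabilities
    log pi(y|x) for the preferred (chosen) and dispreferred (rejected) responses.
    """
    # Implicit reward margin: how much more the policy favours y_w over y_l,
    # measured relative to the frozen reference model.
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_logratio - ref_logratio)
    # Binary cross-entropy-style loss on the preference pairs.
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -9.5])
ref_chosen = torch.tensor([-12.0, -10.0])
ref_rejected = torch.tensor([-13.5, -9.9])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because the loss depends only on log-probability ratios, no separate reward model or RL rollout loop is needed; training reduces to ordinary gradient descent on logged preference data.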