In this post, I’ll share my thoughts on the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, usually called DPO. It approaches Reinforcement Learning from Human Feedback (RLHF) in a simpler way.

The paper starts with the motivation behind alignment: making language models (LMs) match human preferences. Existing methods like PPO-based RLHF are effective but notoriously complex and unstable. DPO simplifies this by rethinking reward modeling and preference alignment.


Key concepts

Direct Preference Optimization (DPO)

DPO removes the explicit reward modeling step that’s common in RLHF. Instead of first training a separate reward model and then using RL to maximize that reward, DPO formulates the reward implicitly within the language model itself. This allows solving the alignment problem using a supervised classification objective.

Bradley-Terry model

DPO builds upon the Bradley-Terry model, a statistical approach for estimating the probability that one option is preferred over another. At first, it seemed random that they picked such an old statistical model, but it has been used previously by OpenAI and DeepMind in RL settings. Under Bradley-Terry, each option has a latent score, and the probability that one option beats another is a sigmoid of the difference between their scores, which makes it a natural fit for learning from pairwise human feedback.
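To make the Bradley-Terry idea concrete, here is a minimal sketch in plain Python. The function name and the example score gaps are my own choices for illustration, not something taken from the paper.

```python
import math

def bradley_terry_prob(score_a: float, score_b: float) -> float:
    """Probability that option A is preferred over option B under Bradley-Terry.

    Scores are latent "strengths" on a log scale (rewards, in the RLHF setting),
    so the preference probability is a sigmoid of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Equal scores give a coin flip; a larger gap gives a more confident preference.
for gap in (0.0, 1.0, 3.0):
    prob = bradley_terry_prob(gap, 0.0)
    print(f"score gap {gap:.1f} -> P(A preferred) = {prob:.3f}")
```

Only the difference between the two scores matters, so adding the same constant to both leaves the preference probability unchanged. That property is what later lets a prompt-dependent constant drop out of the DPO objective.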

Implicit reward representation

The key insight in DPO is that the “reward” for a given output can be represented implicitly as the log ratio of that output’s probability under the learned LM to its probability under a reference LM (usually the supervised fine-tuned model), scaled by a coefficient β. Instead of explicitly modeling rewards, DPO directly optimizes this ratio to reflect human preferences.
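In practice, a sequence’s log-probability is the sum of its per-token log-probabilities, so the implicit reward can be computed from two lists of token log-probs. The sketch below assumes you already have those lists from the policy and the reference model. The function name, the numbers, and the default beta of 0.1 are illustrative assumptions.

```python
def implicit_reward(policy_token_logprobs, ref_token_logprobs, beta=0.1):
    """Implicit DPO reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    Each argument holds the per-token log-probabilities of the SAME response
    under the trainable policy and the frozen reference model, respectively.
    """
    policy_logp = sum(policy_token_logprobs)  # log pi_theta(y | x)
    ref_logp = sum(ref_token_logprobs)        # log pi_ref(y | x)
    return beta * (policy_logp - ref_logp)

# Made-up token log-probs for a three-token response.
policy_lp = [-0.5, -1.0, -0.25]
ref_lp = [-1.0, -1.5, -0.75]
print(implicit_reward(policy_lp, ref_lp))  # positive: policy favors this response more than the reference does
```

A positive value means the policy assigns the response more probability than the reference does, and a negative value means less. When the two models agree, the implicit reward is zero.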

Simplified objective, no RL required

By using the implicit reward representation, DPO turns preference alignment into a supervised learning problem. The objective is to increase the probability of preferred outputs and decrease the probability of non-preferred ones, with each example weighted by how strongly the model’s implicit reward currently ranks the pair in the wrong order.
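Here is a minimal single-pair sketch of that objective in plain Python, assuming the sequence log-probabilities under the policy and the reference model are already available. The function names, the argument layout, and the default beta are my own assumptions. A real implementation would work on batched tensors in a framework such as PyTorch or JAX.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit reward of the chosen response minus that of the rejected one."""
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    return chosen_reward - rejected_reward

def dpo_pair_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(reward margin)."""
    return -log_sigmoid(reward_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta))

def wrongness_weight(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Gradient weight sigmoid(-margin): near 1 when the pair is ranked wrongly, near 0 when ranked correctly."""
    return math.exp(log_sigmoid(-reward_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)))

# Policy identical to the reference: zero margin.
print(dpo_pair_loss(-10.0, -12.0, -10.0, -12.0))
print(wrongness_weight(-10.0, -12.0, -10.0, -12.0))
```

When the policy equals the reference, the margin is zero, so the loss is log 2 (about 0.693) and the weight is 0.5. Raising the chosen response’s log-probability relative to the reference lowers the loss.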


What I learned

Removing the explicit reward model

Initially, I thought the explicit reward model was essential to RLHF. It seemed natural: first learn rewards from humans, then optimize those rewards. DPO surprised me by showing that you can skip that step. Instead of explicitly modeling human preferences, DPO encodes them directly into the language model’s probabilities. It is neat, elegant, and much simpler.

The Bradley-Terry model as the theoretical foundation

At first glance, the Bradley-Terry model felt arbitrary to me because it is a decades-old statistical model. But its use in RLHF contexts dates back to earlier work by OpenAI and DeepMind. Bradley-Terry is intuitive because it translates pairwise human preference data directly into probabilities. DPO uses this to avoid a more complicated RL setup.

DPO’s loss function

The loss function in DPO initially confused me because of its signs and terms. The intuition eventually clicked: it increases the likelihood of preferred outputs and decreases the likelihood of dispreferred ones. The loss is weighted by how “wrong” the model is about these preferences. If the model is confident but incorrect, it makes larger corrections.
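Writing out the gradient helped me with the signs. The expression below is my transcription of the paper’s gradient, so treat the notation as a sketch. Here y_w is the preferred response, y_l is the dispreferred one, σ is the sigmoid, and r̂ is the implicit reward.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\Big[
  \underbrace{\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)}_{\text{larger when the pair is ranked wrongly}}
  \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big)
\Big],
\qquad
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

The leading minus sign belongs to the loss, so a gradient descent step moves in the direction that raises log π_θ(y_w | x) and lowers log π_θ(y_l | x). The sigmoid term approaches 1 when the implicit reward of y_l is well above that of y_w, and approaches 0 when the pair is already ranked correctly by a wide margin.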

Why stability matters

Standard RLHF with PPO is often unstable: it optimizes against a learned reward model, samples from the policy during training, and requires careful tuning. DPO avoids this by minimizing a fixed loss on a static preference dataset, with a frozen reference model, usually the SFT model, as the anchor. By avoiding explicit reward estimation and complicated online updates, DPO remains more stable.

Performance without complex tuning

The experimental results show that DPO achieves comparable or better performance than PPO-based RLHF on the tasks the paper evaluates. It reaches high alignment quality without extensive hyperparameter tuning and without sampling from the policy during training.

Philosophical shift: overfitting vs. generalization

An interesting aspect of DPO, and similar methods like ORPO, is that it operates entirely offline. Traditional RL involves interactive environments and trajectories. DPO treats preference alignment as supervised learning. It feels like a shift from broad generalization toward targeted optimization, or even intentional overfitting to human preferences. It changed how I think about RLHF.

KL divergence and model drift

I found the authors’ point about KL divergence important, though not fully explained in the paper. DPO achieves high alignment quality without significant drift from the original supervised fine-tuned model. This matters because too much drift can degrade coherence or factual accuracy. Staying close to the original SFT model helps maintain overall quality.
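A toy example helped me see how the KL term controls drift. The KL-regularized objective has a closed-form optimum in which the reference distribution is reweighted by exp(reward / β). The sketch below applies that to a made-up three-response distribution and measures how far the result moves from the reference for different β. All names and numbers are illustrative assumptions, not values from the paper.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """Optimum of the KL-regularized objective: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(unnormalized)  # normalizer Z, a single number in this toy one-prompt setting
    return [u / z for u in unnormalized]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical reference distribution over three candidate responses, with made-up rewards.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

for beta in (0.1, 1.0, 10.0):
    pi_star = tilted_policy(ref_probs, rewards, beta)
    drift = kl_divergence(pi_star, ref_probs)
    print(f"beta={beta:>4}: KL(pi* || pi_ref) = {drift:.4f}")
```

A small β lets the policy chase reward and pile almost all probability on the highest-reward response, so the drift is large. A large β keeps the policy close to the reference.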


Things that initially confused me

  • The leap in equation (4): I struggled at first to understand how they substituted the optimal policy into their objective so cleanly. It made perfect sense mathematically afterward, but felt like a creative jump rather than an obvious derivation.
  • Weighted gradient interpretation: Initially, the weighting term in the gradient was counterintuitive due to signs. Eventually, I understood it as correcting more strongly when the model’s implicit reward ordering is confidently wrong.
  • KL divergence interpretation: It wasn’t clear to me initially why minimizing KL divergence from the reference policy is inherently desirable. Later I understood it as preventing excessive drift from a stable baseline, which preserves other useful model properties.
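
For the first point, it helped me to write the chain of steps out in one place. This is my own reconstruction of the argument, so the notation may not match the paper line by line. Z(x) is the normalizing constant for prompt x, and π* is the optimal policy for reward r.

```latex
\begin{align*}
\text{objective:}\quad
  & \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\
\text{optimal policy:}\quad
  & \pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
\text{solve for } r\text{:}\quad
  & r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
\text{Bradley-Terry:}\quad
  & p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
    = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                   - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
\end{align*}
```

The β log Z(x) term depends only on the prompt, so it cancels in the reward difference. That cancellation is why the substitution looks so clean: the intractable normalizer never has to be computed. Replacing π* with the trainable π_θ and taking the negative log-likelihood over the preference data gives the DPO loss.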

Summary

DPO simplifies RLHF by removing the explicit reward modeling step, making fine-tuning easier and more stable. It is not strictly reinforcement learning in the traditional sense, since it lacks online updates and explicit reward signals, but it captures the useful part of preference alignment in a cleaner way.

The core idea is appealing: encode rewards implicitly inside the language model instead of training a separate reward model. DPO is not the only method, and ORPO takes this further by also dropping the reference model, but DPO made the simplification feel obvious in retrospect.