April 2024 brought a wave of open LLM releases from major labs. Mixtral 8x22B, Llama 3, Phi-3, and OpenELM all dropped in quick succession, and comparing them directly seems like the natural move. But here is the honest truth: the sources available for this piece do not contain the architecture details, parameter counts, or benchmark numbers needed to compare these four models side by side.
What the Sources Actually Cover
Instead of model specs and benchmark scores, the available sources focus on a different but related question: how exactly do these open LLMs get aligned to human preferences after pretraining?
Researchers categorize RLHF methods into two groups: reward-based and reward-free. Models like ChatGPT and Claude use reward-based RLHF with PPO, which relies on a trained reward model to score outputs. DPO, on the other hand, skips the reward model entirely and is classified as a reward-free RLHF method.
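To make the distinction concrete, here is a minimal sketch of the DPO objective as given in the original DPO paper. The function name and inputs are illustrative, not from any particular library: it takes per-sequence log-probabilities for a preferred ("chosen") and a dispreferred ("rejected") response under both the policy being trained and a frozen reference model, and shows why no separate reward model is needed.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    The log-prob ratio against the frozen reference model acts as an
    implicit reward, so no separately trained reward model is required.
    """
    chosen_reward = logp_chosen - ref_logp_chosen
    rejected_reward = logp_rejected - ref_logp_rejected
    # Bradley-Terry preference: push the chosen implicit reward
    # above the rejected one; beta controls how hard we push.
    margin = beta * (chosen_reward - rejected_reward)
    # -log sigmoid(margin), i.e. lower loss when chosen is preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A reward-based method like PPO would instead train a separate reward model on the same preference pairs, then optimize the policy against that model's scores; DPO collapses both steps into the single loss above, which is exactly the simplicity that made it popular.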
The PPO vs. DPO Debate
DPO has become the trendy choice in open-source LLM training. It is simpler to implement and removes the headache of training a separate reward model. But simpler does not always mean better.
A study from Shusheng Xu and eight co-authors challenges the assumption that DPO is superior. Their paper tested alignment methods across a collection of RLHF testbeds, ranging from dialogue to code generation. The finding was clear: PPO surpassed other alignment methods in every case they tested and achieved state-of-the-art results in challenging code competitions.
Why This Matters for Open LLMs
The paper was accepted as an Oral at ICML 2024, which signals strong interest from the research community. The core takeaway is straightforward: the field may have swung too hard toward DPO for convenience, leaving performance on the table.
Now, which of the four April releases use PPO and which use DPO? The sources do not say. Without that information, you cannot connect these alignment findings to any specific model. You also cannot evaluate whether Mixtral 8x22B's mixture-of-experts approach outperforms Llama 3's dense architecture on any benchmark. The same goes for Phi-3's efficiency claims versus OpenELM's layer-wise scaling design. The sources simply do not provide those details.
The Honest Verdict
A proper head-to-head comparison of Mixtral 8x22B, Llama 3, Phi-3, and OpenELM requires architecture breakdowns, parameter counts, training data details, and benchmark results for each model. None of that is available in the current sources.
What we can say is that the alignment method powering these models matters more than many people assume. If you are evaluating open LLMs for your own projects, it is worth digging into which alignment technique each model uses. Have you noticed a difference in output quality between models aligned with PPO versus DPO?