DeepSeek-V2 - Review | Hun Tae Kim

In this post, I’ll discuss the DeepSeek-V2 paper, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model”, released by DeepSeek-AI. DeepSeek has been steadily releasing and open-sourcing language models, and their recent work on Mixture-of-Experts (MoE) architectures and reinforcement learning methods is worth paying attention to.

DeepSeek-V2 uses a new Multi-head Latent Attention (MLA) mechanism and an optimized MoE architecture called DeepSeekMoE. It activates only 21B of its 236B parameters per token, reducing inference and training costs while maintaining performance. This makes DeepSeek-V2 notable at a time when models are becoming increasingly large and computationally demanding.

Key concepts

Mixture-of-Experts (DeepSeekMoE)

DeepSeek-V2 uses a mixture-of-experts model, where each token activates only a small fraction (21B out of 236B) of the total parameters. Rather than a few large experts, it uses many smaller experts, of which a router selects a handful per token, and it adds a “shared expert” that every token passes through regardless of routing. The shared expert rests on the observation that some computations are common to all tokens, so the routed experts are free to specialize. This helps balance computational efficiency with model performance.
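To make the routing idea concrete, here is a minimal sketch of one MoE layer for a single token. It is not the paper’s implementation: the function names, the toy “experts” (plain scaling functions), and the choice to gate each routed expert by its softmax affinity are my assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, shared_experts, routed_experts, router_weights, top_k):
    """Toy MoE layer: shared experts always run, routed experts run only if selected."""
    # Affinity of this token to each routed expert (hypothetical router: dot product + softmax).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    affinity = softmax(logits)
    # Keep only the top_k routed experts for this token.
    ranked = sorted(range(len(routed_experts)), key=lambda i: affinity[i], reverse=True)
    chosen = ranked[:top_k]
    out = list(x)  # residual connection
    for expert in shared_experts:
        out = [o + e for o, e in zip(out, expert(x))]
    for i in chosen:
        out = [o + affinity[i] * e for o, e in zip(out, routed_experts[i](x))]
    return out, chosen

def make_expert(scale):
    # Stand-in for a feed-forward expert; a real expert is a small MLP.
    return lambda x: [scale * v for v in x]

shared = [make_expert(0.5)]
routed = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out, chosen = moe_layer([2.0, 1.0], shared, routed, router, top_k=2)
print(chosen)  # indices of the routed experts this token used
```

Only the chosen routed experts are evaluated, so compute per token scales with `top_k` rather than with the total number of experts, while the shared expert contributes for every token.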

Multi-head Latent Attention (MLA)

The MLA mechanism compresses the Key-Value (KV) cache into a latent representation. The paper reports a KV cache about 93% smaller than that of DeepSeek 67B, which leaves room for larger batches and faster inference. Instead of caching full keys and values for every attention head, MLA caches one compact latent vector per token and reconstructs keys and values from it when needed.
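The caching idea can be sketched in a few lines. This is a toy version under my own assumptions: the class name, the tiny hand-picked matrices, and the dimensions are illustrative, and it leaves out details of the real mechanism such as the separate rotary-position key described in the paper.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

class LatentKVCache:
    """Toy cache that stores one small latent vector per token instead of full keys and values."""

    def __init__(self, w_down, w_up_k, w_up_v):
        self.w_down = w_down  # (d_latent x d_model) down-projection
        self.w_up_k = w_up_k  # (d_kv x d_latent) key up-projection
        self.w_up_v = w_up_v  # (d_kv x d_latent) value up-projection
        self.latents = []

    def append(self, hidden):
        # Only the compressed latent is kept in the cache.
        self.latents.append(matvec(self.w_down, hidden))

    def keys_values(self):
        # Keys and values are rebuilt from the latents on demand.
        keys = [matvec(self.w_up_k, c) for c in self.latents]
        values = [matvec(self.w_up_v, c) for c in self.latents]
        return keys, values

    def stored_floats(self):
        return sum(len(c) for c in self.latents)

# Hypothetical sizes: d_model = 4, d_latent = 2, d_kv = 4.
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_up_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_up_v = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 0.0]]

cache = LatentKVCache(w_down, w_up_k, w_up_v)
cache.append([1.0, 2.0, 3.0, 4.0])
cache.append([0.0, 1.0, 0.0, 0.0])

full_cache_floats = 2 * (4 + 4)  # two tokens, each with a 4-dim key and a 4-dim value
print(cache.stored_floats(), "floats cached instead of", full_cache_floats)
```

The saving comes from the latent width being much smaller than the combined width of all per-head keys and values; the price is the extra up-projection when keys and values are needed.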

Group Relative Policy Optimization (GRPO)

For reinforcement learning alignment, DeepSeek-V2 uses GRPO, a variant of PPO that removes the need for a separate value model. Instead, GRPO samples a group of responses for the same prompt, scores each one, and uses the group’s average reward as the baseline: a response’s advantage is how far its reward sits above or below that average. No additional critic network has to be trained or kept in memory, which reduces computational overhead.
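The advantage computation itself is short. Here is a minimal sketch, assuming scalar rewards for one prompt’s group of sampled responses; the function name, the use of the population standard deviation, and the small `eps` guard against division by zero are my choices, not details taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four sampled responses to the same prompt (made-up numbers).
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so the group itself plays the role the critic plays in PPO.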

Load balancing via auxiliary loss

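To distribute computation across multiple GPUs, DeepSeek adds auxiliary losses to the training objective that penalize uneven routing: one set targets the compute load on each device, another the communication load between devices. These losses do nothing for model quality on their own, but without them a few experts or devices can end up with most of the tokens, which wastes the rest of the hardware. That makes them essential for training large MoE models with limited GPU resources.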
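As a rough illustration of what such a loss looks like, here is a sketch of a balance penalty of the form alpha * sum_i f_i * P_i, where f_i is how often expert i is selected (scaled so that uniform routing gives 1) and P_i is its mean routing affinity. This is my understanding of the per-expert form used in this family of models; the function name, the `alpha` value, and the toy inputs are assumptions, and a device-level version would aggregate the same quantities over the experts placed on each device.

```python
def balance_loss(selected, affinities, num_experts, top_k, alpha=0.01):
    """alpha * sum_i f_i * P_i over experts.

    selected:   per-token list of chosen expert indices
    affinities: per-token list of routing scores, one per expert
    """
    num_tokens = len(selected)
    total = 0.0
    for i in range(num_experts):
        count = sum(1 for chosen in selected if i in chosen)
        f_i = num_experts / (top_k * num_tokens) * count  # 1.0 under uniform routing
        p_i = sum(a[i] for a in affinities) / num_tokens   # mean affinity to expert i
        total += f_i * p_i
    return alpha * total

# Four tokens, four experts, top-1 routing (made-up scores).
balanced = balance_loss(
    selected=[[0], [1], [2], [3]],
    affinities=[
        [0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.7, 0.1],
        [0.1, 0.1, 0.1, 0.7],
    ],
    num_experts=4,
    top_k=1,
)
collapsed = balance_loss(
    selected=[[0], [0], [0], [0]],
    affinities=[[0.7, 0.1, 0.1, 0.1]] * 4,
    num_experts=4,
    top_k=1,
)
print(balanced, collapsed)
```

The penalty is smallest when tokens are spread evenly and grows when routing collapses onto one expert, which is the behavior the optimizer is nudged away from.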

What I learned

Simplicity that works (MLA mechanism)

At first, I didn’t expect a compression idea as straightforward as MLA to matter this much. Compressing the KV cache into a latent representation is simple, but it directly attacks the memory bottleneck of long-context inference.

Sharing experts is a smart move

Using a shared expert that is active for every token initially sounded counterintuitive for an MoE model built around sparse computation. But it makes sense in practice, since some computations inevitably overlap across tokens. This balances specialization, which is MoE’s strength, with necessary shared processing.

Removing the value model from PPO (GRPO)

GRPO was another interesting idea. PPO typically uses a critic (value) model to calculate advantages, but DeepSeek simplified this by sampling several outputs per prompt and using the average reward of that group as the baseline. Removing the extra value model cuts memory and compute cost, and the paper’s results suggest this does not come at the expense of performance.

Careful load balancing as a necessary trade-off

Using auxiliary losses to balance GPU load and communication initially felt odd because it seemed unrelated to improving model capability directly. But with resource constraints, this trade-off is practical and necessary. DeepSeek’s work is a reminder that model development is not only about ideal algorithms; it is also about making the system trainable.

DeepSeek’s incremental but thorough progression

Following DeepSeek’s releases over time, I noticed a consistent pattern of careful incremental improvements. Unlike some models that seem to appear out of nowhere, DeepSeek’s progress reflects systematic changes in both architecture and training methods. That accumulated work is why DeepSeek-V2 strikes such a good balance between performance and efficiency.

Summary

DeepSeek-V2 balances efficiency, performance, and cost through careful engineering choices. MLA, shared MoE experts, and GRPO are not conceptually flashy, but they are practical and well executed.

I appreciated their transparency and documentation. DeepSeek-V2 is a good example of scaling models by attacking concrete bottlenecks rather than adding complexity for its own sake.