DeepSeek GRM - Review
This time, I’m looking at the paper “Inference-Time Scaling for Generalist Reward Modeling” by Liu et al. from DeepSeek-AI. Given DeepSeek’s recent work, I went into this with some anticipation, though my initial feeling was that it might be less of a fundamentally new algorithm and more a detailed concretization of concepts like RLAIF and Constitutional AI (CAI). Still, the paper frames the problem very effectively and the results look promising.
The core challenge addressed is obtaining accurate reward signals for LLMs in general domains, where tasks aren’t easily verifiable like math problems. While methods like RLAIF help scale beyond human feedback, the paper points out issues with existing reward models (RMs): inflexibility to input types, accuracy limitations, and poor scaling with inference-time compute. This work proposes a way to improve generalist RMs, specifically focusing on making them better when given more thinking time (inference compute).
Key Concepts
Pointwise Generative Reward Modeling (GRM)
The authors adopt a GRM approach. Instead of just outputting a scalar score or a pairwise preference, the RM generates textual output (critiques based on principles) and assigns individual (pointwise) scores to each response (e.g., 1-10). This architecture provides flexibility in handling single, paired, or multiple responses using the same format.
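To make the pointwise format concrete, here is a minimal sketch of what consuming a GRM output could look like. The output template and the regex are my own assumptions for illustration, not the paper's actual specification.

```python
import re

def parse_pointwise_scores(grm_output: str, num_responses: int) -> list[int]:
    """Extract one 1-10 score per response from a generated critique.

    Assumes the GRM ends its critique with lines like 'Response 1: 7';
    the actual template used in the paper may differ.
    """
    scores = [0] * num_responses
    for idx, score in re.findall(r"Response\s+(\d+):\s*(\d+)", grm_output):
        i = int(idx) - 1
        if 0 <= i < num_responses:
            scores[i] = int(score)
    return scores

# The same format handles one response or many:
critique = "Principle: prefer factual accuracy.\nResponse 1: 8\nResponse 2: 3"
print(parse_pointwise_scores(critique, 2))  # [8, 3]
```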
Self-Principled Critique Tuning (SPCT)
This is the proposed training methodology. Unlike CAI which uses a fixed, human-defined constitution, SPCT aims to train the GRM to dynamically generate relevant principles and critiques based on the specific input query and responses.
SPCT Stage 1: Rejective Fine-Tuning (RFT)
A “cold start” phase to get the GRM generating principles and critiques in the correct format. It uses existing RM datasets and samples trajectories (principle + critique + score). Trajectories are rejected if the predicted reward is incorrect (doesn’t match ground truth preference) or if the task is “too easy” (all sampled trajectories for a given input are correct).
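A rough sketch of the rejection logic as I understand it (the trajectory structure and the "all correct means too easy" check are my paraphrase, not the paper's code):

```python
def keep_for_rft(sampled_trajectories: list[dict], ground_truth_best: int) -> list[dict]:
    """Filter sampled (principle + critique + scores) trajectories for RFT.

    A trajectory is 'correct' if its scores rank the ground-truth best
    response highest. Incorrect trajectories are dropped, and if every
    trajectory is correct, the whole example is dropped as too easy.
    """
    def is_correct(traj: dict) -> bool:
        scores = traj["scores"]
        return scores.index(max(scores)) == ground_truth_best

    correct = [t for t in sampled_trajectories if is_correct(t)]
    if len(correct) == len(sampled_trajectories):
        return []  # "too easy": the model already gets it right every time
    return correct
```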
SPCT Stage 2: Rule-Based Reinforcement Learning
An online RL phase (using a GRPO setup) to further improve the GRM. The key here is that the RL optimizes the reward model itself. The reward signal for this RL process is based on simple accuracy rules – does the GRM’s generated critique and score correctly identify the best response according to the ground truth preference label from the dataset? A KL penalty is used to maintain stability.
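The rule-based reward itself fits in a few lines. This is a hedged sketch: I'm assuming a simple +1/-1 accuracy rule on whether the generated scores pick the ground-truth best response; the exact values may differ, and the KL penalty lives in the GRPO objective rather than here.

```python
def rule_based_reward(predicted_scores: list[float], ground_truth_best: int) -> float:
    """Accuracy rule for the online RL stage that trains the GRM.

    +1 if the critique's scores rank the labeled best response strictly
    highest, -1 otherwise (assumed scheme). Ties count as incorrect to
    discourage degenerate all-equal scoring.
    """
    best = max(predicted_scores)
    top = [i for i, s in enumerate(predicted_scores) if s == best]
    return 1.0 if top == [ground_truth_best] else -1.0
```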
Inference-Time Scaling via Sampling & Voting
To leverage more compute at inference, the paper proposes sampling k times from the trained GRM for the same input. Each sample yields a potentially different set of principles, critiques, and scores. The final score for a response is obtained by summing the scores across all k samples (“Voting”). This expands the effective reward range, allowing for finer granularity.
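A minimal sketch of the voting step, using hypothetical per-sample score lists (shape: k samples × n responses):

```python
def vote(per_sample_scores: list[list[float]]) -> list[float]:
    """Sum pointwise scores over k sampled critiques for each response.

    With k samples on a 1-10 scale, the effective range per response grows
    to k..10k, which is where the finer granularity comes from.
    """
    k = len(per_sample_scores)
    n = len(per_sample_scores[0])
    return [sum(per_sample_scores[s][r] for s in range(k)) for r in range(n)]

# Three samples scoring two responses:
print(vote([[8, 5], [7, 6], [9, 4]]))  # [24, 15] -> response 0 wins
```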
Meta Reward Model (Meta RM)
As an enhancement to voting, a separate, smaller RM is trained to evaluate the quality of the principles and critiques generated by the main GRM in each of the k samples. During inference, this Meta RM filters the k samples down to the top k_meta based on critique quality, and voting is performed only on these higher-quality samples.
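And the Meta RM filtering, sketched with a placeholder `meta_score` function standing in for the trained Meta RM (both the function and the sample dict layout are assumptions):

```python
def meta_guided_vote(samples: list[dict], k_meta: int, meta_score) -> list[float]:
    """Keep the k_meta samples whose critiques the Meta RM rates highest,
    then vote (sum scores) over that filtered subset.

    `meta_score(sample) -> float` stands in for the trained Meta RM; each
    sample dict is assumed to hold 'critique' text and a 'scores' list.
    """
    ranked = sorted(samples, key=meta_score, reverse=True)[:k_meta]
    n = len(ranked[0]["scores"])
    return [sum(s["scores"][r] for s in ranked) for r in range(n)]
```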
Key Takeaways (What I Learned)
Problem Framing: Generalist Rewards & Scaling
The paper does a good job setting up the motivation. Getting good rewards for complex, open-ended tasks is hard. While CAI introduced principles, this work focuses on making the principle application dynamic and scaling the RM’s quality with compute. The four challenges identified (flexibility, accuracy, inference scaling, learning scalable behaviors) provide a clear target.
Dynamic Principles vs. Fixed Constitution
The shift from CAI’s static constitution to SPCT’s dynamically generated principles felt like a notable difference. The idea is that the RM learns to adapt its evaluation criteria (the principles) to the specific context, rather than relying on a predefined, potentially inflexible set. This seems like a natural evolution.
Improving the Reward Model, Not Just the Policy
A point that required careful distinction during discussion was the target of the RL. While methods like GRPO use RL to improve the policy LLM based on a reward signal, the rule-based RL in SPCT is used to improve the reward model itself – making it better at generating principles and critiques that align with ground truth preferences. It’s a bit “meta” – training the judge to be a better judge.
The “Rejecting Too Easy” Strategy
The RFT stage’s approach of discarding trajectories where the GRM was correct every time was interesting. There are a few angles: Is it just removing uninformative data where the model already performs perfectly? Or is it actively forcing the model to focus on harder examples where it might struggle, thereby promoting more robust learning? Or perhaps it simplifies downstream ranking if all examples aren’t trivially correct? It seems like a pragmatic way to focus the training signal.
How Inference Scaling Works (Summing, Not Averaging)
The voting mechanism (Eq. 6) initially seemed odd – why sum scores instead of averaging? But the thing to remember is that the goal is ranking responses. Summing preserves the relative ranking just as averaging would, while directly reflecting the expanded granularity gained from sampling multiple “perspectives” (principles). Since the absolute value isn’t fed into a downstream loss (unlike scalar value learning), keeping a fixed 1-10 scale via averaging isn’t strictly necessary.
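A quick check of that ranking argument, with made-up numbers:

```python
scores = [[8, 5], [7, 6], [9, 4]]            # k=3 samples, 2 responses
sums = [sum(col) for col in zip(*scores)]    # [24, 15]
means = [s / len(scores) for s in sums]      # [8.0, 5.0]
assert sums.index(max(sums)) == means.index(max(means))  # same winner either way
```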
Performance: Scaling Matters
The results (Tables 2, 3, 6, Figure 4) show that the SPCT-trained DeepSeek-GRM performs well, beating baselines and competing with strong models. More importantly, the inference-time scaling demonstrably works. Using more samples (k=32) or the Meta RM (k=8 or k=16) allows the 27B GRM to match or exceed the performance of much larger models (like a 671B RFT model) that use less inference compute. This supports the core premise that investing compute at inference time can be highly effective if the model is trained appropriately (via SPCT).
Summary & Final Thoughts
This paper presents Self-Principled Critique Tuning (SPCT) as a method to train Pointwise Generative Reward Models (GRMs) that generate dynamic principles and critiques, enabling effective inference-time scaling. By training the RM itself using a combination of rejective fine-tuning and rule-based online RL, the authors create a system (DeepSeek-GRM) that improves its reward signal quality when given more compute via sampling.
While building on ideas from RLAIF and CAI, the dynamic principle generation and the explicit focus on training the RM for inference-time scalability (including the Meta RM) feel like distinct contributions. The empirical results strongly suggest that scaling inference compute via sampling, especially when guided by a meta-judge, can be a very effective way to boost reward quality, potentially rivaling the gains from simply scaling model size during training. It makes me curious about how these improved reward models will be used to train DeepSeek’s next generation of policy models.