This time, I’m looking at “Inference-Time Scaling for Generalist Reward Modeling” by Liu et al. from DeepSeek-AI. Given DeepSeek’s recent work, I went into this with some anticipation. My initial feeling was that it might be less of a fundamentally new algorithm and more a concrete version of ideas like RLAIF and Constitutional AI (CAI). Still, the paper frames the problem clearly, and the results are interesting.

The core challenge is getting accurate reward signals for LLMs in general domains, where answers can't be checked automatically the way math problems can. Methods like RLAIF help scale beyond human feedback, but the paper points out problems with existing reward models (RMs): inflexibility across input types, limited accuracy, and poor scaling with inference-time compute. This work tries to make generalist RMs improve when given more inference-time compute.


Key concepts

Pointwise generative reward modeling

The authors adopt a generative reward model (GRM) approach. Instead of only outputting a scalar score or a pairwise preference, the reward model generates text, such as critiques based on principles, and assigns an individual pointwise score to each response, for example on a 1-10 scale. This format can handle single, paired, or multiple responses.
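To make the pointwise format concrete, here is a minimal sketch of pulling per-response scores out of a generated critique. The `Scores: [...]` line and the `extract_scores` helper are my own assumptions for illustration; the paper's actual output format may look different.

```python
import re


def extract_scores(generation: str, num_responses: int) -> list[int]:
    """Pull one integer score per response out of a generated critique.

    Assumes (hypothetically) that the text contains a line such as
    'Scores: [7, 3]'. The real GRM output format may differ.
    """
    match = re.search(r"Scores:\s*\[([^\]]*)\]", generation)
    if match is None:
        raise ValueError("no score list found in generation")
    scores = [int(s) for s in match.group(1).split(",")]
    if len(scores) != num_responses:
        raise ValueError("expected one score per response")
    if any(not 1 <= s <= 10 for s in scores):
        raise ValueError("scores must lie in 1..10")
    return scores


# The same parser covers one, two, or many responses.
sample = "Principles: ...\nCritique: ...\nScores: [7, 3]"
print(extract_scores(sample, num_responses=2))
```

Because each response gets its own score, a pairwise preference can be read off by comparing two entries, and a single response can still be rated on its own.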

Self-Principled Critique Tuning (SPCT)

This is the proposed training method. Unlike CAI, which uses a fixed, human-defined constitution, SPCT trains the GRM to dynamically generate relevant principles and critiques based on the specific input query and responses.

SPCT Stage 1: rejective fine-tuning

A “cold start” phase gets the GRM to generate principles and critiques in the correct format. It uses existing RM datasets and samples trajectories: principle + critique + score. Trajectories are rejected if the predicted reward is incorrect, meaning it doesn’t match the ground-truth preference, or if the task is “too easy”, meaning all sampled trajectories for a given input are correct.
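The two rejection rules can be sketched roughly as follows. This assumes at least two responses per input and a ground-truth label given as the index of the best response; `Trajectory`, `is_correct`, and `rejective_filter` are hypothetical names of mine, not the paper's code.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str          # generated principles + critique
    scores: list[int]  # one pointwise score per response


def is_correct(scores: list[int], best_index: int) -> bool:
    """True if the labelled-best response gets the strictly highest score.

    Assumes at least two responses per input.
    """
    best = scores[best_index]
    return all(best > s for i, s in enumerate(scores) if i != best_index)


def rejective_filter(trajectories: list[Trajectory], best_index: int) -> list[Trajectory]:
    """Apply both rejection rules to the samples for a single input."""
    correct = [t for t in trajectories if is_correct(t.scores, best_index)]
    if len(correct) == len(trajectories):
        return []  # "too easy": every sample was already right, drop the input
    return correct  # otherwise keep only the correct trajectories


samples = [Trajectory("a", [8, 3]), Trajectory("b", [4, 6]), Trajectory("c", [5, 5])]
print([t.text for t in rejective_filter(samples, best_index=0)])
```

Note that a tie counts as incorrect in this sketch, since the best response has to score strictly higher than the others.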

SPCT Stage 2: rule-based reinforcement learning

An online RL phase, using a GRPO setup, further improves the GRM. The key point is that RL optimizes the reward model itself. The reward signal for this RL process is based on simple accuracy rules: do the GRM's generated critique and scores correctly identify the best response according to the ground-truth preference label from the dataset? A KL penalty is used for stability.
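A rough sketch of what such an accuracy rule could look like, together with the group-relative normalization that GRPO-style setups apply to the rewards of samples drawn for the same input. The +1/-1 reward values and both function names are assumptions for illustration, and the KL penalty is left out entirely.

```python
import statistics


def rule_based_reward(scores: list[float], best_index: int) -> float:
    """+1 if the labelled-best response is scored strictly highest, else -1.

    The +1/-1 values are an assumption for this sketch.
    """
    best = scores[best_index]
    others = [s for i, s in enumerate(scores) if i != best_index]
    return 1.0 if all(best > s for s in others) else -1.0


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of samples for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled critiques for one input; the label says response 0 is best.
group_scores = [[8, 3], [4, 6], [9, 2], [7, 5]]
rewards = [rule_based_reward(s, best_index=0) for s in group_scores]
print(rewards)
print(group_advantages(rewards))
```

If every sample in a group gets the same reward, all advantages are zero and that input contributes no learning signal, which is one way to see why filtering out "too easy" inputs earlier is sensible.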

Inference-time scaling via sampling and voting

To use more compute at inference, the paper samples k times from the trained GRM for the same input. Each sample gives a potentially different set of principles, critiques, and scores. The final score for a response is obtained by summing scores across all k samples, which they call “Voting”. This expands the effective reward range and gives finer granularity.
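The voting step itself is simple enough to sketch in a few lines. Here `sampled_scores[j][i]` is the score that sample `j` gave to response `i`; the `vote` function name and the toy numbers are mine.

```python
def vote(sampled_scores: list[list[int]]) -> list[int]:
    """Sum the pointwise scores for each response across all k samples."""
    if not sampled_scores:
        raise ValueError("need at least one sample")
    num_responses = len(sampled_scores[0])
    if any(len(row) != num_responses for row in sampled_scores):
        raise ValueError("every sample must score every response")
    return [sum(column) for column in zip(*sampled_scores)]


# k = 3 samples, 2 responses.
print(vote([[7, 6], [5, 6], [8, 4]]))  # → [20, 16]
```

With k samples on a 1-10 scale, the summed score lies between k and 10k, which is where the finer granularity comes from.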

Meta reward model

As an addition to voting, a separate, smaller RM is trained to evaluate the quality of the principles and critiques generated by the main GRM in each of the k samples. During inference, this Meta RM filters the k samples down to the top k_meta based on critique quality, and voting is performed only on these higher-quality samples.
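A minimal sketch of the filtering step, assuming the Meta RM has already produced one quality score per sample. The `meta_scores` list stands in for that model's output, and `meta_filtered_vote` is a hypothetical helper, not the paper's implementation.

```python
def meta_filtered_vote(
    sampled_scores: list[list[int]],
    meta_scores: list[float],
    k_meta: int,
) -> list[int]:
    """Keep the k_meta samples rated best by the meta RM, then sum their scores."""
    if len(sampled_scores) != len(meta_scores):
        raise ValueError("need one meta score per sample")
    ranked = sorted(
        range(len(sampled_scores)), key=lambda j: meta_scores[j], reverse=True
    )
    keep = ranked[:k_meta]
    num_responses = len(sampled_scores[0])
    return [sum(sampled_scores[j][i] for j in keep) for i in range(num_responses)]


sampled = [[7, 6], [2, 9], [8, 4], [6, 5]]  # k = 4 samples, 2 responses
meta = [0.9, 0.1, 0.8, 0.4]                 # hypothetical meta RM outputs
print(meta_filtered_vote(sampled, meta, k_meta=2))  # → [15, 10]
```

In this toy example, summing over all four samples would give [23, 24] and prefer the second response, while dropping the two lowest-rated critiques flips the preference to the first.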


What I learned

Generalist rewards are hard

The paper does a good job setting up the motivation. Getting good rewards for complex, open-ended tasks is hard. CAI introduced principles, but this work focuses on making the principle application dynamic and scaling the reward model’s quality with compute. The four challenges they identify, flexibility, accuracy, inference scaling, and learning scalable behaviors, make the target clear.

Dynamic principles vs. a fixed constitution

The shift from CAI’s static constitution to SPCT’s dynamically generated principles felt important. The idea is that the RM learns to adapt its evaluation criteria to the specific context instead of relying on a predefined set. This seems like a natural direction.

Improving the reward model, not just the policy

A point that required careful distinction during discussion was the target of the RL. Algorithms like GRPO are usually used to improve the policy LLM based on a reward signal. Here, rule-based RL is used to improve the reward model itself, making it better at generating principles and critiques that align with ground-truth preferences. It’s a bit meta: training the judge to be a better judge.

Rejecting “too easy” examples

The RFT stage discards every trajectory for an input when the GRM's sampled trajectories were all correct. I found this interesting. Is it just removing uninformative data where the model already performs perfectly? Is it forcing the model to focus on harder examples where it might struggle? Or does it simplify downstream ranking by avoiding trivially correct examples? In any case, it seems like a pragmatic way to focus the training signal.

Summing, not averaging

The voting mechanism (Eq. 6) initially seemed odd. Why sum scores instead of averaging? The key is that the goal is ranking responses. Summing preserves the relative ranking just as averaging would, but it also reflects the expanded granularity from sampling multiple “perspectives” or principles. Since the absolute value isn’t used directly in a later loss function, maintaining a fixed 1-10 scale via averaging isn’t strictly necessary.
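To spell out why the ranking is unchanged, write S_{i,j} for the score sample j gives to response i. This notation is my own and may not match Eq. 6 exactly.

```latex
S_i^{\mathrm{sum}} = \sum_{j=1}^{k} S_{i,j} = k \cdot \bar{S}_i,
\qquad
\bar{S}_i = \frac{1}{k} \sum_{j=1}^{k} S_{i,j}
```

Since k > 0, S_a^sum > S_b^sum holds exactly when the averages satisfy the same inequality, so both aggregations order the responses identically. The difference is only in scale: with each S_{i,j} in {1, ..., 10}, the sum takes integer values in [k, 10k] instead of staying within [1, 10].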

Inference scaling matters

The results show that the SPCT-trained DeepSeek-GRM performs well, beating baselines and competing with strong models. More importantly, inference-time scaling works. Using more samples (k=32) or the Meta RM (k=8 or k=16) allows the 27B GRM to match or exceed much larger models, like a 671B RFT model, that use less inference compute. This supports the basic premise: inference compute can improve reward quality if the model is trained for it.


Summary

This paper presents Self-Principled Critique Tuning (SPCT) as a way to train Pointwise Generative Reward Models (GRMs) that generate dynamic principles and critiques. By training the RM itself with rejective fine-tuning and rule-based online RL, DeepSeek-GRM improves its reward signal quality when given more inference compute through sampling.

The paper builds on RLAIF and CAI, but the dynamic principle generation and explicit focus on reward-model inference scaling feel like the main differences. It makes me curious about how these improved reward models will be used to train DeepSeek’s next generation of policy models.