DeepSeek Collaborates with Tsinghua University: Innovating Reward Model Inference Techniques for Enhanced Scalability
2025-04-06
Author: Site editor

DeepSeek, in collaboration with researchers from Tsinghua University, has introduced Self-Principled Critique Tuning (SPCT) together with a meta reward model, two techniques designed to improve the inference-time scalability of reward models. The team built the DeepSeek-GRM series of models, applying SPCT across two training phases, which markedly improves both the quality and the scalability of the generalist reward model (GRM). Experiments show that DeepSeek-GRM-27B performs strongly: by combining voting over sampled rewards with meta-reward-model guidance, the researchers scaled performance at inference time, and this approach proved more effective than the conventional strategy of simply increasing model size.
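The inference-time scaling described above can be pictured as guided voting: sample several principle-and-critique generations from the GRM, let the meta reward model rate how trustworthy each sample is, and aggregate only the best-rated ones. The sketch below is a minimal illustration under assumed interfaces; the `guided_vote` helper, the `(score, quality)` pairing, and the 1-10 score range are illustrative stand-ins, not DeepSeek's actual API.

```python
import random

def guided_vote(samples, top=4):
    """Aggregate sampled GRM judgments by meta-RM-guided voting: keep the
    `top` samples the meta reward model rates most trustworthy, then sum
    their scores. `samples` is a list of (score, meta_quality) pairs."""
    kept = sorted(samples, key=lambda s: s[1], reverse=True)[:top]
    return sum(score for score, _ in kept)

# Simulate k = 8 sampled critiques for one response: each sample carries a
# discrete pointwise score (1-10) and a meta-RM trustworthiness estimate.
rng = random.Random(42)
samples = [(rng.randint(1, 10), rng.random()) for _ in range(8)]
print(guided_vote(samples))
```

Sampling more critiques (larger `k`) spends extra compute at inference time rather than on a bigger model, which is the scaling trade-off the article highlights.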