Microsoft Research, together with Tsinghua University and Peking University, has introduced Reward Reasoning Models (RRMs). These models work through an explicit reasoning process before producing a final reward, which lets them spend more computation on the evaluation of harder tasks and less on easy ones. RRMs are built on the Qwen2 model with a Transformer-decoder architecture, and they recast reward modeling as a text completion task. On the RewardBench and PandaLM Test benchmarks, RRMs outperform baseline models by a significant margin, with the advantage most visible on complex queries, where the models make effective use of additional test-time compute. The research also suggests that RRM accuracy continues to improve as model size and the reasoning budget increase.
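To make the "reward modeling as text completion" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the RRM implementation: the prompt wording, the `Verdict: N` output format, and the names `build_prompt`, `parse_verdict`, and `judge` are all hypothetical, and `generate` stands in for any text-completion model. Sampling several completions and taking a majority vote is one simple way to spend more test-time compute on a single judgment.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical prompt; the wording actually used by RRMs is not given in the article.
PROMPT_TEMPLATE = (
    "You are comparing two assistant responses to a query.\n"
    "Query: {query}\n\n"
    "Response 1: {first}\n\n"
    "Response 2: {second}\n\n"
    "Reason step by step, then end with 'Verdict: 1' or 'Verdict: 2'."
)


def build_prompt(query: str, first: str, second: str) -> str:
    """Turn a pairwise comparison into a plain text-completion prompt."""
    return PROMPT_TEMPLATE.format(query=query, first=first, second=second)


def parse_verdict(completion: str) -> Optional[int]:
    """Return 1 or 2 from the last 'Verdict: N' in the completion, else None."""
    matches = re.findall(r"Verdict:\s*([12])", completion)
    return int(matches[-1]) if matches else None


def judge(
    generate: Callable[[str], str],
    query: str,
    first: str,
    second: str,
    samples: int = 1,
) -> Optional[int]:
    """Sample `samples` reasoning traces and majority-vote the parsed verdicts.

    `generate` is an assumed stand-in for a model call (prompt -> completion).
    Ties resolve to whichever verdict was seen first.
    """
    prompt = build_prompt(query, first, second)
    votes = []
    for _ in range(samples):
        verdict = parse_verdict(generate(prompt))
        if verdict is not None:
            votes.append(verdict)
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, raising `samples` for a difficult query trades extra computation for a more stable verdict, while an easy query can be judged with a single completion.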
