NVIDIA has released Polar, an open-source AI framework designed to connect agent frameworks such as Codex to Group Relative Policy Optimization (GRPO) training without disrupting their existing processes for tool invocation, context organization, and patch submission. GRPO fine-tunes a model's policy using reward signals, which improves how code agents perform on multi-step decision-making tasks. Polar deploys agents at the model API boundary, which preserves their original operational logic while adding task submission, session scheduling, and state persistence to streamline training. In experiments on SWE-Bench Verified, agents trained with Polar and GRPO improved markedly: Codex's pass@1 score rose from 3.8% to 26.4%. Training also ran approximately 5.39 times faster, with a substantial increase in average GPU utilization.
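
To make the role of reward signals in GRPO concrete, the sketch below shows the group-relative advantage calculation that gives the method its name: several attempts at the same task are scored, and each reward is normalized against the mean and standard deviation of its group. This is a minimal illustration, not Polar's implementation; the function name `group_relative_advantages`, the `eps` parameter, and the pass/fail reward values are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other attempts at the same task.

    Hypothetical helper for illustration only; it is not part of Polar.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every attempt earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed rewards for four attempts at one task:
# 1.0 if the submitted patch passed the tests, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Attempts that beat the group average receive a positive advantage and the rest a negative one, so the policy update favors the behavior that led to passing patches. When every attempt in a group earns the same reward, all advantages are zero and that group contributes no learning signal.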
