The Qwen Pilot team from Ali Tongyi Laboratory has unveiled FIPO, a new algorithm designed to overcome the reasoning limitations of large-scale models. Conventional reinforcement learning techniques often fail to identify the tokens that matter most. FIPO tackles the problem of 'reasoning length stagnation' with a Future-KL mechanism that rewards tokens exerting a strong influence on subsequent reasoning, and it uses the sign of log-probability differences to determine the optimization direction. Experiments show that, in a 32B-scale pure reinforcement learning setup, FIPO outperforms models of comparable size. It breaks through the reasoning-length bottleneck of zero-shot models, pushing the average reasoning length beyond 10,000 tokens. This markedly improves reasoning accuracy and underscores the method's promise on complex mathematical reasoning tasks.
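
The two ideas described above can be sketched roughly as follows. This is an illustrative approximation, not the published FIPO implementation: the function names, the suffix-sum form of the future-influence weight, and the way the sign and weight are combined are all assumptions made for exposition, since the paragraph only describes the mechanism at a high level.

```python
import math

def future_kl_weights(per_token_kl):
    """Credit each token by the total per-token KL of the positions *after* it.

    This mirrors the Future-KL intuition: a token matters insofar as the
    reasoning that follows it shifts the policy's distribution. The suffix-sum
    form here is an illustrative assumption, not the exact FIPO formulation.
    """
    weights, suffix = [], 0.0
    for kl in reversed(per_token_kl):
        weights.append(suffix)   # influence attributed to everything after this token
        suffix += kl
    return weights[::-1]

def sign_of_logprob_gap(logp_cur, logp_ref):
    """Per-token optimization direction from the sign of the log-prob gap
    between the current policy and a reference policy."""
    return [math.copysign(1.0, c - r) if c != r else 0.0
            for c, r in zip(logp_cur, logp_ref)]

def token_scores(logp_cur, logp_ref, per_token_kl):
    """Combine the future-influence weight with the sign-based direction
    to get a per-token update signal (illustrative combination)."""
    w = future_kl_weights(per_token_kl)
    s = sign_of_logprob_gap(logp_cur, logp_ref)
    return [wi * si for wi, si in zip(w, s)]
```

For example, `token_scores([-1.0, -2.0, -1.0], [-1.5, -1.5, -1.0], [1.0, 2.0, 3.0])` yields `[5.0, -3.0, 0.0]`: the first token gets a large positive signal because the tokens after it carry high KL and its probability rose relative to the reference, while the last token, with no future influence, gets none.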
