Tencent Hunyuan and Fudan University Unveil CL-bench Benchmark, Exposing Major Flaws in Language Models' Contextual Learning
Author: Editor

The Tencent Hunyuan team and Fudan University have jointly published a study showing that while leading large language models (e.g., GPT-5.1, Claude Opus) excel at static knowledge assessments, they struggle to acquire new knowledge in real time from dynamic contexts. To address this gap, the researchers introduced CL-bench, an evaluation benchmark comprising 500 complex scenarios and approximately 32,000 validation metrics. In their experiments, the ten strongest state-of-the-art models solved an average of only 17.2% of CL-bench tasks, with the best performer, GPT-5.1, reaching just 23.7%. These results expose substantial limitations in current models' contextual learning abilities, leaving them ill-suited to practical, real-world applications. The study argues that strengthening models' contextual learning is essential for deploying AI in high-stakes scenarios.
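As a rough illustration of how a task resolution rate like the reported 17.2% might be computed, here is a minimal Python sketch. The scoring rule assumed here — a task counts as resolved only if every one of its validation checks passes — is an illustrative assumption, not a detail taken from the paper:

```python
# Hypothetical sketch: computing a benchmark's task resolution rate.
# Assumption: a task is "resolved" only if ALL of its validation
# checks pass; the actual CL-bench scoring rule may differ.

def resolution_rate(results):
    """results: list of tasks, each a list of per-check booleans."""
    solved = sum(1 for checks in results if checks and all(checks))
    return solved / len(results)

# Toy example: 4 tasks, each with a few validation checks.
tasks = [
    [True, True, True],   # all checks pass -> resolved
    [True, False, True],  # one check fails -> not resolved
    [False, False],       # not resolved
    [True, True],         # resolved
]
rate = resolution_rate(tasks)
print(f"{rate:.1%}")  # 2 of 4 tasks resolved -> 50.0%
```

Under this all-or-nothing rule, a model can pass most individual checks yet still post a low headline rate, which is one plausible reason benchmark scores like these sit far below per-check accuracy.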