Large Models Falter in Latest Chinese Web Search Test, GPT-4o Scores a Mere 6.2% Accuracy

2025-05-06

Author: Staff editor

BrowseComp-ZH, a newly released benchmark, has proven a significant challenge for mainstream AI models. It was jointly released by institutions including HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, and NIO. More than 20 prominent large models from China and abroad struggled on the test: most scored below 10% accuracy, and GPT-4o reached only 6.2%. The top performer, OpenAI DeepResearch, achieved 42.9%. Designed specifically for Chinese web search, BrowseComp-ZH exposes the weaknesses of large models in the Chinese information landscape and points to the need for LLMs that handle this setting better.
