Large Models Falter in Latest Chinese Web Search Test, GPT-4o Scores a Mere 6.2% Accuracy
2025-05-06
Author: Editor

BrowseComp-ZH, a newly released benchmark built specifically for Chinese web search, has proven difficult for mainstream AI models. It was released jointly by HKUST (Guangzhou), Peking University, Zhejiang University, Alibaba, ByteDance, NIO, and other institutions. More than 20 prominent large models from China and abroad were tested, and most scored below 10% accuracy; GPT-4o reached only 6.2%. Even the top performer, OpenAI DeepResearch, scored just 42.9%. The results expose the weaknesses of large models in the Chinese information landscape and underscore the need to improve how LLMs are applied to Chinese-language web search.