Alibaba’s Large Voice Model Claims Top Spot in China, Fifth Place Worldwide on Speech Arena
19 hours ago
Author: Staff Editor

On May 28, the Speech Arena rankings were updated. Speech Arena is a globally recognized AI evaluation platform operated by Artificial Analysis. Alibaba’s large voice model, Fun-Realtime-TTS-Preview, placed fifth globally and first in China with an Elo score of 1190. The model also ranked first in China across three key categories: ASR (speech-to-text), Chat (speech understanding and dialogue), and TTS (text-to-speech).
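For readers unfamiliar with Elo scores, the sketch below shows how a rating gap translates into an expected head-to-head preference rate under the standard Elo formula. It is an illustration only: the article does not say exactly how Speech Arena computes its scores, and the opponent rating of 1150 is a hypothetical value chosen for the example.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head wins for A over B under standard Elo.

    Assumption: the classic logistic Elo curve with a 400-point scale.
    The arena's actual scoring method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1190 is the score reported in the article; 1150 is a hypothetical opponent.
reported_rating = 1190
hypothetical_opponent = 1150
print(round(elo_expected_score(reported_rating, hypothetical_opponent), 3))
```

Under these assumptions, a 40-point lead corresponds to being preferred in roughly 56% of pairwise comparisons, so small gaps between neighboring entries on the leaderboard reflect modest differences in listener preference.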

Earlier, Fun-Realtime-ASR and Fun-Realtime-AudioChat, both released at the Alibaba Cloud Summit on May 20, outperformed leading international models on three key metrics: listening accuracy (word error rate), understanding (speech reasoning), and conversational fluency, earning the top global ranking.

Fun-Realtime-ASR achieves a word error rate of 1.8%, responds with millisecond-level latency, and supports more than 30 languages and seven major Chinese dialects. It can accurately recognize accents from more than 20 regions. Meanwhile, Fun-Realtime-AudioChat scored 97.6% in speech reasoning and 97.8% in conversational fluency, approaching human-level performance.
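Word error rate is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The sketch below is a minimal illustration of that definition; the function name and sample sentences are made up for the example, and the article does not specify the test set or text normalization behind the reported figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

By this definition, a 1.8% word error rate corresponds to about 18 word-level errors per 1,000 reference words.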

These models are currently deployed in applications such as the Qianwen App, Amap, and DingTalk, where they power services including real-time speech-to-text conversion, intelligent navigation interaction, and automated generation of meeting minutes.