Volcano Engine Unveils Doubao Speech Recognition Model 2.0, Elevating Accuracy in Multi-Language Recognition
2025-12-05
Author: Editor

Volcano Engine has recently released Doubao Speech Recognition Model 2.0 (Doubao-Seed-ASR-2.0). The new version substantially improves inference capability, enabling highly accurate recognition of multiple languages as well as visual information.

The model builds on its predecessor's high-performance audio encoder. It improves recognition in complex scenarios, and achieves precise recognition through an advanced Proximal Policy Optimization (PPO) algorithm. PPO is a well-regarded and widely used reinforcement learning algorithm, known for its training stability and its effectiveness in teaching models to make better decisions.
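
The announcement does not say how PPO is applied inside the model. For readers unfamiliar with the algorithm, the following is a minimal sketch of the standard clipped surrogate objective that PPO is known for, not Volcano Engine's implementation. The function name, the parameter names, and the clipping value of 0.2 are illustrative assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective over a batch (illustrative only).

    ratios:     new-policy / old-policy probability ratio for each sample
    advantages: estimated advantage for each sample
    eps:        clipping range (0.2 is a common choice, assumed here)
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal in length")

    total = 0.0
    for r, a in zip(ratios, advantages):
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = max(1.0 - eps, min(r, 1.0 + eps))
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(r * a, clipped * a)
    return total / len(ratios)


if __name__ == "__main__":
    # A ratio well above 1 with a positive advantage is capped at 1 + eps.
    print(ppo_clip_objective([1.5], [1.0]))
```

Taking the minimum of the two terms removes any incentive to push the policy far from its previous version in a single update, which is the usual explanation for PPO's stability.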

The model also has multimodal understanding capabilities: it can take image content into account while performing speech recognition, which reduces recognition errors. For instance, when a speaker refers to something visible in an image during a conversation, the model can combine the audio and visual cues to improve accuracy.

The model supports 13 overseas languages, significantly broadening the range of cross-language application scenarios. As communication across languages becomes increasingly common, this support is highly valuable.

The model has now been officially launched and is available through API services, and further improvements are planned. The release showcases Volcano Engine's innovation and technical strength in speech recognition, and it is expected to have a far-reaching, positive impact.