Ant Group Makes Full-Modal Large Model Ming-Flash-Omni 2.0 Open-Source
Author: Editor

On February 11, 2026, Ant Group officially open-sourced its full-modal large model Ming-Flash-Omni 2.0. Ant Group presents it as the industry's first unified audio generation model covering all scenarios: it can generate speech, ambient sound effects, and music together on a single audio track, and users can precisely control parameters such as voice timbre, speech rate, intonation, volume, emotion, and dialect through natural-language instructions.
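The announcement does not include sample code, but instruction-based control of this kind typically embeds the desired voice attributes directly in the text prompt. The sketch below assumes the model is exposed through the generic Hugging Face transformers text-to-audio pipeline and uses a guessed repository id; the model's actual inference entry point may differ, so check the official model card before running it.

```python
# A minimal sketch, assuming the model works with the generic
# transformers "text-to-audio" pipeline; the repo id is a guess,
# not the confirmed release name.
from transformers import pipeline
import scipy.io.wavfile as wavfile

tts = pipeline("text-to-audio", model="inclusionAI/Ming-Flash-Omni-2.0")  # assumed repo id

# Control is expressed in the natural-language instruction itself:
# timbre, pace, volume, emotion, and dialect all live in the prompt.
instruction = (
    "Speak as a calm elderly male narrator, slow pace, low volume, "
    "gentle Cantonese accent: 'The harbor was quiet at dawn.'"
)

out = tts(instruction)
wavfile.write("controlled_speech.wav", rate=out["sampling_rate"],
              data=out["audio"].squeeze())
```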

At inference time, the model runs at a frame rate of just 3.1 Hz, enabling real-time, high-fidelity generation of audio clips lasting several minutes. Across numerous public benchmarks, Ming-Flash-Omni 2.0 performs strongly on core capabilities such as visual-language understanding, controllable speech generation, and image generation and editing, with some metrics surpassing Gemini 2.5 Pro.
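To put the 3.1 Hz figure in perspective, a back-of-the-envelope calculation shows why minutes-long real-time generation is feasible. The sketch below assumes one decoding step per generated frame, which is an interpretation of the announced number rather than a documented implementation detail.

```python
import math

# Back-of-the-envelope: decoding steps needed at a 3.1 Hz inference
# frame rate, assuming one autoregressive step per generated frame
# (an assumption, not a documented detail of the model).
FRAME_RATE_HZ = 3.1  # inference frames per second of output audio

def decoding_steps(duration_seconds: float) -> int:
    """Steps required to cover `duration_seconds` of generated audio."""
    return math.ceil(duration_seconds * FRAME_RATE_HZ)

for minutes in (1, 3, 5):
    print(f"{minutes} min of audio -> ~{decoding_steps(minutes * 60)} steps")
# 1 min -> 186 steps, 3 min -> 558, 5 min -> 930: a low enough step
# count that generation can comfortably outpace real-time playback.
```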

The model weights and inference code are now available on open-source platforms such as Hugging Face, and developers can also try the model online through Ant Group's Ling Studio platform.
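For developers who want the weights locally, the standard huggingface_hub download flow should apply; the repository id below is an assumption, so verify it against the official release page.

```python
# A minimal download sketch using huggingface_hub's snapshot_download,
# which fetches all files from a model repository. The repo id is an
# assumed placeholder; confirm the real one on the release page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/Ming-Flash-Omni-2.0")  # assumed id
print(f"Model files downloaded to: {local_dir}")
```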