Tongyi Lab Unveils PrismAudio, a Framework for Generating Ambient Sound for Videos
Author: Staff editor

On March 24, 2026, Alibaba's Tongyi Lab introduced PrismAudio, a framework for generating ambient sound for videos. It combines reinforcement learning with chain-of-thought reasoning and focuses on synchronizing sound effects, such as hoofbeats, wind, and rain, with the corresponding visuals. Human speech and voiceover are deliberately out of scope.

The framework uses a decomposed reasoning process in which four specialized "teachers" (semantic, temporal, aesthetic, and spatial) jointly evaluate and refine the output. With the Fast-GRPO algorithm, PrismAudio reaches in 200 training steps the performance that conventional methods need 600 steps to achieve. The model is compact, with 518 million parameters, and generates a 9-second audio clip in 0.63 seconds.
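To put these figures in perspective, the short Python sketch below works out the ratios they imply and illustrates how scores from four teachers might feed a GRPO-style update. Only the constants come from the article. The equal-weight averaging of the four scores, the function names, and the plain group-relative normalization are assumptions made for illustration; the article does not describe how Fast-GRPO actually combines or normalizes rewards.

```python
from statistics import mean, pstdev

# Figures reported in the article.
BASELINE_STEPS = 600        # training steps for conventional methods
PRISMAUDIO_STEPS = 200      # training steps with Fast-GRPO
CLIP_SECONDS = 9.0          # length of the generated audio clip
GENERATION_SECONDS = 0.63   # reported time to generate that clip

# The four "teacher" dimensions named in the article.
DIMENSIONS = ("semantic", "temporal", "aesthetic", "spatial")


def step_reduction(baseline_steps, new_steps):
    """How many times fewer training steps are needed."""
    return baseline_steps / new_steps


def realtime_factor(clip_seconds, generation_seconds):
    """Seconds of audio produced per second of compute."""
    return clip_seconds / generation_seconds


def combined_reward(scores):
    """ASSUMPTION: equal-weight mean of the four teacher scores.

    The article does not say how the teachers' scores are aggregated.
    """
    return mean(scores[d] for d in DIMENSIONS)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of samples.

    ASSUMPTION: this is the generic group-relative form, not
    necessarily what Fast-GRPO does internally.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(f"Step reduction: {step_reduction(BASELINE_STEPS, PRISMAUDIO_STEPS):.1f}x")
    print(f"Real-time factor: {realtime_factor(CLIP_SECONDS, GENERATION_SECONDS):.1f}x")
```

Under these assumptions, the reported numbers correspond to a threefold reduction in training steps and generation roughly 14 times faster than real time.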

The research has been accepted at ICLR 2026, and the PrismAudio code is slated for open-source release.