On February 9, 2026, Apple, in collaboration with Renmin University of China, unveiled an AI model named VSSFlow. The model is the first to generate both ambient sound effects and spoken dialogue from silent video within a single unified system. Built on a 10-layer architecture, VSSFlow fuses video frames with text-to-phoneme sequences and uses flow matching to reconstruct high-quality audio from noise. The researchers report that jointly training speech and sound-effect generation yields a "mutual enhancement" effect, improving overall audio-visual quality. The project's code is openly available, with model weights and inference demos to be released later.
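To give a sense of the flow-matching idea mentioned above: the model learns a velocity field that transports Gaussian noise to the target audio along a simple interpolation path, and generation integrates that field as an ODE. The toy sketch below is a minimal rectified-flow illustration, not VSSFlow's actual implementation; the 1-D "audio" signal and the oracle velocity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "audio" target: a fixed 1-D waveform we want to reach from noise.
x1 = np.sin(np.linspace(0, 2 * np.pi, 64))   # data sample (illustrative)
x0 = rng.standard_normal(64)                 # Gaussian noise starting point

# Rectified-flow interpolation: x_t = (1 - t) * x0 + t * x1,
# whose target velocity is the constant v = x1 - x0.
# In practice a neural network is trained to predict this velocity
# from (x_t, t) plus conditioning (video frames, phonemes); here we
# use the exact "oracle" velocity to show the transport itself.
def target_velocity(x0, x1):
    return x1 - x0

# Euler integration of dx/dt = v(x, t) from noise (t=0) to data (t=1).
def sample(x0, velocity_fn, steps=10):
    x, dt = x0.copy(), 1.0 / steps
    for _ in range(steps):
        x = x + dt * velocity_fn(x)
    return x

recon = sample(x0, lambda x: target_velocity(x0, x1))
print(np.allclose(recon, x1))  # True: the straight-line flow reaches the target
```

With a learned velocity network in place of the oracle, the same integration loop turns noise into audio conditioned on the video and text inputs.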
