On March 16, 2026, Tongyi Lab released and open-sourced Fun-CineForge, the first multimodal large model to support multi-scene dubbing at professional film and television quality, and also shared the methodology used to construct its high-quality training dataset. Built on an integrated 'data + model' approach, Fun-CineForge targets the four main challenges AI voiceover faces in film and television production: lip synchronization, emotional nuance, timbre consistency, and temporal precision.

Notably, Fun-CineForge introduces a 'temporal modality' that aligns visual, textual, and audio inputs to produce well-synchronized voiceovers in complex scenarios. The model performs especially well in two-person and multi-person dialogue scenes, where it keeps speech natural and correctly paced. Fun-CineForge is currently free to use, supports voiceovers in both Chinese and English, and handles video clips up to 30 seconds long. Developers can explore and experiment with the model on GitHub, HuggingFace, and ModelScope.
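For developers who want a concrete starting point, the sketch below shows one way to fetch the released weights from HuggingFace. The repository id Tongyi-Lab/Fun-CineForge is an assumed placeholder for illustration, and the inference interface itself is not shown here, since it is defined by the project's own documentation rather than by this sketch.

```python
# Minimal sketch: download the released model weights from HuggingFace.
# The repo id "Tongyi-Lab/Fun-CineForge" is an assumed placeholder --
# use the id published in the project's GitHub README. ModelScope users
# can substitute the equivalent modelscope.snapshot_download helper.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(repo_id="Tongyi-Lab/Fun-CineForge")
print(f"Model files downloaded to: {model_dir}")

# Inference (e.g., dubbing a clip of up to 30 seconds in Chinese or
# English) should follow the commands documented in the repository.
```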
