Microsoft Unveils Lightweight Real-Time TTS Model: VibeVoice-Realtime-0.5B
4 days ago
Author: Editor

Microsoft has recently introduced VibeVoice-Realtime-0.5B, a lightweight real-time text-to-speech (TTS) model. It accepts streaming text input and generates long-form speech, with a first-audio latency of about 300 milliseconds. That makes it well suited to applications such as interactive agent dialogue and real-time data broadcasting, where fast and accurate responses are crucial.
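To make the latency figure concrete, here is a minimal Python sketch of how first-audio latency could be measured for any streaming synthesizer. It does not use the VibeVoice API: `fake_streaming_tts` is a hypothetical stand-in that yields dummy audio bytes, and `measure_first_audio_latency` is an illustrative helper name, not something from Microsoft's release.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fake_streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine (NOT the real model).

    Consumes text incrementally and yields one dummy audio chunk per text chunk.
    """
    for chunk in text_chunks:
        time.sleep(0.01)  # pretend to do some synthesis work
        yield b"\x00" * (len(chunk) * 100)  # silent placeholder "audio"


def measure_first_audio_latency(
    synthesize: Callable[[Iterable[str]], Iterator[bytes]],
    text_chunks: Iterable[str],
) -> Tuple[float, int]:
    """Return (seconds until the first audio chunk arrived, total chunk count)."""
    start = time.perf_counter()
    first_latency = None
    n_chunks = 0
    for _audio in synthesize(text_chunks):
        if first_latency is None:
            # Time from the request to the first playable audio.
            first_latency = time.perf_counter() - start
        n_chunks += 1
    if first_latency is None:
        raise ValueError("synthesizer produced no audio")
    return first_latency, n_chunks


latency_s, n_chunks = measure_first_audio_latency(
    fake_streaming_tts, ["Hello, ", "this is ", "a streaming test."]
)
print(f"first audio after {latency_s * 1000:.1f} ms, {n_chunks} chunks in total")
```

With a real engine, the same wrapper would be pointed at the model's streaming generator instead of the stand-in. The number it reports is the delay a listener experiences before playback can begin.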

The model uses an interleaved window design, which contributes to its performance. On the LibriSpeech test set, it achieves a zero-shot word error rate (WER) as low as 2.00%. VibeVoice-Realtime-0.5B supports both Chinese and English for transcription and speech generation. It can also stably generate up to 90 minutes of speech, allowing for natural multi-speaker dialogue and a range of emotional expression.
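For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. For a TTS model, the transcript usually comes from running a speech recognizer on the generated audio. The article does not describe Microsoft's exact evaluation pipeline, so the Python sketch below only illustrates the standard formula; the function name and the lowercase, whitespace-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2%}")  # one substitution + one deletion over 9 words: 22.22%
```

By this definition, a WER of 2.00% corresponds to roughly two word errors per hundred reference words.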