AI startup Kyutai has released Pocket TTS, a compact text-to-speech model with only 100 million parameters that also supports voice cloning. Given a 5-second audio clip, the model reproduces the timbre, emotion, and other vocal characteristics of the target voice. Pocket TTS runs in real time on a standard laptop CPU, which is made possible by its continuous latent variable architecture and by techniques such as Lagrangian self-distillation. It outperforms several larger models on both Word Error Rate and audio quality, and it is billed as the only high-quality TTS system that can generate speech faster than real time on a CPU. Kyutai has released Pocket TTS under the MIT license, making it freely available to the public. The model was trained entirely on publicly available English corpora totaling 88,000 hours of audio.
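
For readers unfamiliar with the two metrics mentioned above, here is a minimal Python sketch of how Word Error Rate and a real-time factor are commonly computed. It is a generic illustration, not Kyutai's evaluation code: the function names `word_error_rate` and `real_time_factor` are hypothetical, the text normalization (lowercasing and whitespace splitting) is an assumption, and the numbers in the usage lines are made up rather than measurements of Pocket TTS.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Compute time spent per second of audio produced.

    A value below 1.0 means generation is faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative values only, not benchmark results.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
rtf = real_time_factor(synthesis_seconds=2.0, audio_seconds=10.0)
print(f"WER: {wer:.3f}, real-time factor: {rtf:.2f}")
```

In this framing, a lower Word Error Rate means a speech recognizer transcribing the synthesized audio recovers the input text more faithfully, and a real-time factor below 1.0 corresponds to the faster-than-real-time generation described above.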
