Apple Inc., in collaboration with Tel Aviv University, has released a research paper introducing a 'Principled Coarse-Grained' (PCG) method for speech synthesis that targets the speed limits of current AI text-to-speech (TTS) technology. The autoregressive models that dominate the industry predict their output sequentially, one step at a time, and strict validation of each prediction slows voice generation. PCG relaxes single-point validation to range validation: it groups acoustically similar candidates into clusters, so a prediction can be accepted when it falls in the same cluster as the expected one instead of having to match it exactly. It pairs this with speculative decoding, a dual-model setup in which one model drafts predictions and the other verifies them, to balance efficiency against accuracy. With PCG, voice generation ran roughly 40% faster while audio quality was maintained, and performance stayed consistent under rigorous stress tests. PCG is an inference-stage optimization that requires no retraining of existing models and needs only about 37MB of extra memory to store the acoustically similar clusters.
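The difference between single-point and range validation can be illustrated with a small Python sketch. It is a simplified, greedy illustration under stated assumptions, not the acceptance rule from the paper: the function `accepted_prefix_len`, the `cluster_of` mapping, and the toy token values are all hypothetical, and the real method may decide acceptance differently.

```python
from typing import Dict, List, Optional


def accepted_prefix_len(
    draft: List[int],
    target: List[int],
    cluster_of: Optional[Dict[int, int]] = None,
) -> int:
    """Count how many drafted tokens are accepted before the first rejection.

    draft      -- tokens proposed by the draft model (hypothetical values)
    target     -- tokens the verifying model would have chosen at each position
    cluster_of -- optional map from token id to cluster id (an assumption);
                  if None, fall back to exact, single-point matching
    """
    accepted = 0
    for d, t in zip(draft, target):
        if cluster_of is None:
            ok = d == t                            # single-point validation
        else:
            ok = cluster_of[d] == cluster_of[t]    # range (cluster) validation
        if not ok:
            break  # everything after the first rejection is discarded
        accepted += 1
    return accepted


# Toy vocabulary of six tokens grouped into three clusters of similar sounds.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
draft = [0, 3, 4, 1]
target = [1, 2, 4, 5]

exact = accepted_prefix_len(draft, target)
coarse = accepted_prefix_len(draft, target, cluster_of)
print(exact, coarse)
```

In this toy case exact matching rejects the very first drafted token, while cluster-level matching accepts the first three, so more of the draft model's work is kept per verification pass. The cluster table is the kind of lookup data that would account for the extra memory mentioned above.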
