The Massachusetts Institute of Technology (MIT), working with NVIDIA and other institutions, has introduced 'Taming Long Tails' (TLT), a technique that substantially improves the training efficiency of large language models (LLMs) built for reasoning. During reinforcement learning training, these models demand significant computational resources and energy, with the inference (generation) stage alone consuming 85% of total training time. On top of that, uneven task completion times across processors create further efficiency bottlenecks.
TLT addresses this with 'speculative decoding': a lightweight, pre-trained 'draft model' predicts the larger model's output, and the larger model then verifies those predictions in a single batched pass. The system also includes an 'adaptive draft trainer' and an 'adaptive inference engine' that keep the draft model synchronized with the larger model without adding computational overhead.
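To make the draft-then-verify idea concrete, here is a minimal sketch of greedy speculative decoding with toy stand-in "models". The function names (`draft_next`, `target_next`, `speculative_step`) and the toy arithmetic models are illustrative assumptions, not TLT's actual API; the point is the control flow: the cheap draft model proposes a run of tokens, and the expensive target model checks them all in one pass, accepting the longest matching prefix.

```python
def draft_next(seq):
    # Toy draft model (assumption, not a real LLM): cheap to run,
    # usually agrees with the target but drifts on some contexts.
    return (seq[-1] + 1) % 10 if len(seq) % 4 else (seq[-1] + 2) % 10

def target_next(seq):
    # Toy target model: the expensive "ground truth" next token.
    return (seq[-1] + 1) % 10

def speculative_step(seq, k):
    # 1. Draft model autoregressively proposes k tokens (cheap).
    proposal = list(seq)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    drafted = proposal[len(seq):]

    # 2. Target model verifies all k positions; in a real system this is
    #    one batched forward pass, simulated here by checking each prefix.
    accepted = []
    ctx = list(seq)
    for tok in drafted:
        t = target_next(ctx)
        if t == tok:
            accepted.append(tok)          # draft token confirmed
            ctx.append(tok)
        else:
            accepted.append(t)            # first mismatch: keep the
            ctx.append(t)                 # target's token, drop the rest
            break
    else:
        # All k drafts accepted: the same verification pass also yields
        # one extra target token "for free".
        accepted.append(target_next(ctx))

    return seq + accepted

print(speculative_step([0], 3))           # all 3 drafts accepted + bonus token
print(speculative_step([0, 1, 2, 3], 3))  # first draft rejected, 1 token kept
```

When the draft model agrees with the target, each verification pass emits up to k+1 tokens instead of one, which is the source of the speed-up; when it disagrees, output still matches what the target alone would produce, so accuracy is preserved.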
In tests, TLT accelerated the training of multiple reasoning-oriented LLMs by 70% to 210% while preserving accuracy, and the trained draft models can be reused efficiently in later deployment stages. Looking ahead, the research team plans to integrate the technique into a broader range of frameworks, with the goal of reducing AI development costs and improving energy efficiency.
