On March 22, 2026, researchers at NVIDIA introduced KVTC (KV Cache Transformation Coding), a technique that compresses the KV cache—the key and value tensors a large language model (LLM) accumulates while processing a conversation—achieving up to a 20-fold reduction in memory usage without any changes to model code. The KV cache, often likened to the model's short-term memory, can balloon to several gigabytes over long conversations, consuming GPU memory and throttling throughput. NVIDIA senior engineer Adrian Lancucki noted that model inference is now frequently bottlenecked by GPU memory rather than by compute.
Drawing inspiration from JPEG compression, KVTC compresses the cache through a three-step pipeline: principal component analysis, adaptive quantization, and entropy coding. This preserves the essential information while supporting block-based decompression, so the model can keep responding in real time. Tests on models ranging from 1.5 billion to 70 billion parameters (including the Llama 3 series and R1-Qwen 2.5) show that KVTC loses less than 1% accuracy even at 20-fold compression, whereas traditional methods suffer significant accuracy drops at just 5-fold compression.
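To make the three stages concrete, here is a minimal sketch of a JPEG-style pipeline applied to one block of cached keys or values. This is an illustrative reconstruction, not NVIDIA's implementation: the function names are hypothetical, PCA is done with a plain SVD, quantization uses a simple uniform scale rather than KVTC's adaptive scheme, and `zlib` stands in for a real entropy coder.

```python
import numpy as np
import zlib

def compress_kv_block(kv, rank=8, bits=4):
    """Hypothetical KVTC-style compression of one KV cache block.
    kv: (tokens, dim) float array of cached keys or values."""
    mean = kv.mean(axis=0)
    centered = kv - mean
    # Step 1: PCA via SVD -- keep only the top `rank` principal components.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                         # (rank, dim) projection basis
    coeffs = centered @ basis.T               # (tokens, rank) coefficients
    # Step 2: uniform quantization to `bits` bits per coefficient
    # (KVTC uses an adaptive scheme; a single global scale is shown here).
    scale = np.abs(coeffs).max() / (2 ** (bits - 1) - 1)
    q = np.round(coeffs / scale).astype(np.int8)
    # Step 3: entropy coding of the quantized stream
    # (zlib as a stand-in for an arithmetic/range coder).
    payload = zlib.compress(q.tobytes())
    return payload, basis, mean, scale, q.shape

def decompress_kv_block(payload, basis, mean, scale, shape):
    """Invert the pipeline for one block, enabling block-wise decompression."""
    q = np.frombuffer(zlib.decompress(payload), dtype=np.int8).reshape(shape)
    return (q * scale) @ basis + mean

# Usage: a toy 128-token, 64-dimensional cache block.
rng = np.random.default_rng(0)
kv = rng.normal(size=(128, 64)).astype(np.float32)
payload, basis, mean, scale, shape = compress_kv_block(kv)
restored = decompress_kv_block(payload, basis, mean, scale, shape)
ratio = kv.nbytes / len(payload)   # compressed size vs. raw float32 bytes
```

Because each block decompresses independently, the model can rehydrate only the cache regions needed for the next attention step, which is what keeps responses real-time.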
When processing 8,000 tokens on an H100 GPU, KVTC cuts the initial response time from 3 seconds to 380 milliseconds, roughly an 8-fold improvement. The technique is particularly well-suited to long-conversation workloads such as programming assistants and iterative reasoning tasks. NVIDIA plans to integrate KVTC into its Dynamo framework and keep it compatible with open-source engines like vLLM. Industry observers expect that as conversation lengths continue to grow, KVTC could become a standard compression tool for AI deployment, significantly cutting hardware costs for businesses.
