NVIDIA Diffusion LLM Hits 2.42x Throughput Without Retraining: Nemotron TwoTower Released

3 hours ago / About a 30-minute read

Source: TechTimes

Visitors gather at Nvidia's AI Factory MGX Ecosystem display during Computex in Taipei on June 2, 2026. CHENG Yu-chen/Getty Images

NVIDIA's research team on Wednesday published open weights and training code for Nemotron-Labs-TwoTower, a discrete diffusion language model that generates text 2.42 times faster than standard autoregressive decoding while retaining 98.7 percent of its baseline benchmark quality — and does so without requiring a full re-pretraining run.

That second fact matters as much as the throughput number. The dominant cost in large language model development today is pretraining at scale — the tens of trillions of tokens and months of GPU time required to build a capable foundation model from scratch. TwoTower demonstrates that a lab with an existing autoregressive checkpoint can add diffusion-based parallel generation to it by training only a second denoising network on a fraction of the original data budget. The architecture does not discard what a model has already learned; it layers parallel generation capability on top of it. That shifts the economics of the diffusion LM transition from a cliff to a ramp.

The weights are available on the NVIDIA Nemotron TwoTower collection on Hugging Face under the NVIDIA Nemotron Open Model License, which permits commercial use.

One Bottleneck, Two Towers

Every large language model in wide deployment today generates text autoregressively: one token per forward pass, left to right, with each new token requiring the model's active weights to be read from GPU memory before any computation on it can proceed. At low batch sizes (the typical operating condition for interactive chat, agentic loops, and API calls with one or two users), this makes the inference workload memory-bandwidth-bound rather than compute-bound. The GPU spends most of its time waiting on data movement, not doing math.
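
A rough back-of-envelope calculation shows why single-stream decoding tends to be limited by memory traffic. The sketch below is illustrative only: the bandwidth and peak-FLOP figures are assumed H100-class numbers rather than values from the article, and it ignores the fact that in a mixture-of-experts model different tokens may route to different experts and so touch more weights.

```python
# Back-of-envelope roofline for one decoding forward pass.
# Hardware figures are ASSUMPTIONS for illustration, not from the article.
ACTIVE_PARAMS = 3e9        # ~3B active parameters per token (from the article)
BYTES_PER_PARAM = 2        # BF16
HBM_BANDWIDTH = 3.35e12    # bytes/s, assumed H100-class memory bandwidth
PEAK_FLOPS = 9.9e14        # FLOP/s, assumed dense BF16 peak


def decode_step_times(tokens_per_pass):
    """Return (memory_time, compute_time) in seconds for one forward pass."""
    weight_bytes = ACTIVE_PARAMS * BYTES_PER_PARAM
    memory_time = weight_bytes / HBM_BANDWIDTH
    # Roughly 2 FLOPs (multiply + add) per active parameter per token.
    flops = 2 * ACTIVE_PARAMS * tokens_per_pass
    compute_time = flops / PEAK_FLOPS
    return memory_time, compute_time


for n in (1, 16):
    mem_t, cmp_t = decode_step_times(n)
    bound = "memory" if mem_t > cmp_t else "compute"
    print(f"{n:>2} tokens/pass: memory {mem_t * 1e3:.2f} ms, "
          f"compute {cmp_t * 1e3:.3f} ms -> {bound}-bound")
```

Under these assumptions the weight read dominates even when 16 tokens share a pass, which is why processing several tokens per forward pass can raise throughput at little extra cost.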

Discrete diffusion language models address this by replacing one-token-at-a-time commitment with iterative block denoising. Rather than emitting tokens in order, a diffusion LM starts from a block of masked or noisy positions and refines all of them in parallel over several steps, committing confident tokens early and continuing to refine uncertain ones. The result is multiple tokens committed per forward pass, which is what drives the wall-clock speedup.
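
The commit-when-confident loop can be sketched in a few lines. This is a toy under stated assumptions, not NVIDIA's decoder: `toy_denoiser` is a hypothetical stand-in that returns random tokens and confidences, `gamma` is a confidence threshold, and the fallback that always commits the single most confident position is an assumption added so the loop is guaranteed to finish.

```python
import random

MASK = None  # marker for a position that has not been committed yet


def toy_denoiser(block, rng):
    """Hypothetical denoiser: propose (token, confidence) for each masked slot."""
    return {i: (rng.randrange(1000), rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def decode_block(block_size=16, gamma=0.8, seed=0):
    """Refine one block until every position is committed.

    Returns (block, passes), where passes counts denoiser forward passes.
    """
    rng = random.Random(seed)
    block = [MASK] * block_size
    passes = 0
    while any(tok is MASK for tok in block):
        proposals = toy_denoiser(block, rng)
        passes += 1
        # Commit every proposal whose confidence clears the threshold.
        confident = {i: tok for i, (tok, conf) in proposals.items()
                     if conf >= gamma}
        if not confident:
            # Assumed fallback: commit the best slot so decoding makes progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            confident = {best: proposals[best][0]}
        for i, tok in confident.items():
            block[i] = tok
    return block, passes


block, passes = decode_block()
print(f"committed {len(block)} tokens in {passes} forward passes")
```

Lowering `gamma` commits more positions per pass and finishes in fewer passes; raising it approaches one token per pass, which is the autoregressive extreme.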

But prior diffusion LMs assigned a single network two conflicting jobs at once: represent the clean context tokens that have already been committed, and denoise the noisy block being generated. Forcing one network to do both things simultaneously limited how well it could do either.

TwoTower's central contribution is to split those responsibilities cleanly into two separate networks built on the same backbone: a frozen autoregressive context tower and a trained diffusion denoiser tower.

How the Architecture Actually Works

Both towers share the same base: the Nemotron-3-Nano-30B-A3B backbone, a 30-billion-parameter hybrid model that interleaves three types of layers across 52 total — 23 Mamba-2 layers, 6 self-attention layers, and 23 mixture-of-experts layers. At inference time, only a fraction of those parameters are active for any given token: approximately 3 billion active parameters per token per tower, selected by a routing mechanism across 128 available experts, 6 of which activate per token alongside 2 shared experts.
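
To make the "6 of 128 experts" routing concrete, here is a minimal top-k routing sketch. The expert counts come from the article; the scoring rule (softmax over the selected logits) is an assumption chosen for illustration, and the actual Nemotron router may score and normalize differently.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts (from the article)
TOP_K = 6           # routed experts active per token (from the article)
NUM_SHARED = 2      # shared experts that run for every token (from the article)


def route(router_logits, top_k=TOP_K):
    """Pick the top-k experts and softmax-normalize their logits into weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[e] for e in top)          # for numerical stability
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
chosen = route(logits)
print("routed experts:", [e for e, _ in chosen])
print("experts run for this token:", len(chosen) + NUM_SHARED)
```

Because only the selected experts run, the per-token compute and weight traffic track the roughly 3 billion active parameters rather than the full 30 billion.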

The first tower — the AR context tower — runs exactly as a standard autoregressive model would. It processes the prompt and all committed output tokens causally, left to right, maintaining a reusable key-value cache for the attention layers and carrying forward the hidden state of the Mamba-2 layers across token positions. It is kept entirely frozen throughout the TwoTower training process.

The second tower, the diffusion denoiser, is where the training happens. It takes the noisy block of tokens being generated and refines them through a masked diffusion objective, with two critical architectural modifications. First, within each noisy block it uses bidirectional attention, meaning every masked position can attend to every other position in the block simultaneously rather than only looking left; across blocks and to the prompt, it remains causal. Second, at every layer the denoiser receives conditioning from the corresponding layer of the frozen context tower through cross-attention, inheriting the AR tower's rich representation of the prior context.
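
The attention pattern described here (bidirectional inside a block, causal across blocks and over the prompt) can be written as a boolean mask. The sketch below is a simplified single-sequence view under assumed sizes; in the released model the prior context reaches the denoiser through cross-attention to the frozen tower, which this toy does not model.

```python
def block_causal_mask(prompt_len, num_blocks, block_size):
    """Build mask[q][k]: True when query position q may attend to key position k."""
    total = prompt_len + num_blocks * block_size

    def last_visible(pos):
        # Prompt tokens see only themselves and earlier tokens (strictly causal).
        if pos < prompt_len:
            return pos
        # Generated tokens see everything up to the end of their own block.
        block_index = (pos - prompt_len) // block_size
        return prompt_len + (block_index + 1) * block_size - 1

    return [[k <= last_visible(q) for k in range(total)] for q in range(total)]


# Illustrative sizes only: 2 prompt tokens, then 2 blocks of 3 positions.
mask = block_causal_mask(prompt_len=2, num_blocks=2, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position; the square runs of `x` along the diagonal are the bidirectional blocks, and nothing to the right of a block is visible to it.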

A small additional mechanism — adaptive layer normalization conditioned on the diffusion timestep, adding roughly 1.5 million parameters to the denoiser — allows each layer to adjust its behavior based on how far into the denoising process the current step falls. Full architectural details appear in arXiv paper 2606.26493.
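
Adaptive layer normalization is commonly implemented as an ordinary layer norm followed by a scale and shift computed from the conditioning signal. The sketch below follows that common formulation as an assumption; the article does not specify how TwoTower computes the modulation, and the linear maps `w_scale` and `w_shift` are hypothetical placeholders for whatever small network produces it.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(x, t, w_scale, w_shift):
    """Normalize x, then apply a per-feature scale and shift derived from timestep t."""
    normed = layer_norm(x)
    # Hypothetical conditioning: linear in the scalar timestep. Real models
    # usually embed t and pass it through a small learned projection.
    scale = [w * t for w in w_scale]
    shift = [w * t for w in w_shift]
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


features = [1.0, 2.0, 4.0]
for t in (0.0, 0.5, 1.0):
    print(t, adaptive_layer_norm(features, t, [0.5] * 3, [0.1] * 3))
```

Writing the scale as `1 + s` means the layer reduces to a plain layer norm when the modulation is zero, so the conditioning only adds a timestep-dependent adjustment on top of the inherited behavior.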

The practical result: instead of committing one token at a time in sequence, TwoTower commits multiple tokens per refinement step early in decoding, which is the mechanism behind the 2.42× wall-clock speedup over the AR baseline.

The Training Cost That Makes This Viable

The denoiser tower was trained on approximately 2.1 trillion tokens. That is a meaningful number in two directions. It is large enough to recover most of the backbone's quality in a non-sequential generation regime. And it is roughly 8 percent of the 25 trillion tokens used to pretrain the underlying backbone.
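
The ratio is easy to check from the two token counts quoted above:

```python
DENOISER_TOKENS = 2.1e12   # denoiser training budget (from the article)
PRETRAIN_TOKENS = 25e12    # backbone pretraining budget (from the article)

fraction = DENOISER_TOKENS / PRETRAIN_TOKENS
print(f"denoiser budget: {fraction:.1%} of pretraining tokens")  # prints 8.4%
```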

The implication is practical: an organization that has already invested in pretraining a large autoregressive model does not need to start over to gain diffusion-based throughput advantages. The paper's methods section explicitly notes that TwoTower is a general approach applicable to any pretrained autoregressive language model, and the released training code supports adapting the method to other backbones.

What the Benchmarks Actually Show

The quality-throughput tradeoff is controlled by a confidence threshold, γ, which determines how many tokens are committed per refinement step. At the default operating point (γ = 0.8 with block size 16, measured on two H100 GPUs in BF16, which produces the 2.42× throughput figure), aggregate benchmark quality across knowledge, reasoning, and language tasks retains 98.7 percent of the AR baseline, a relative drop of about 1.3 percent.

The results are not uniform across task types, and the gaps follow a pattern consistent with what diffusion language model research has documented. Tasks that reward bidirectional context and global coherence — commonsense reasoning and multilingual evaluation — recover to or slightly above AR baseline levels. Tasks that depend on strict left-to-right sequential reasoning are more affected: HumanEval, OpenAI's 164-problem Python code generation benchmark, drops from 79.27 to 75.58. Math benchmarks show similar modest degradation.
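
It helps to separate the absolute drop in points from the relative drop, since the aggregate figure is quoted as a fraction of baseline. A small calculation using the HumanEval scores above:

```python
def score_drop(baseline, score):
    """Return (absolute drop in points, relative drop as a fraction of baseline)."""
    absolute = baseline - score
    return absolute, absolute / baseline


abs_pts, rel = score_drop(79.27, 75.58)   # HumanEval scores from the article
print(f"HumanEval: -{abs_pts:.2f} points, {rel:.1%} relative, "
      f"{1 - rel:.1%} of baseline retained")
```

On this benchmark the model retains roughly 95 percent of the baseline score, noticeably below the 98.7 percent aggregate, which is consistent with code being one of the more affected task types.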

This is the known limitation of parallel block generation: when a token's correct value depends strongly on the exact identity of tokens that will immediately follow it in the same block — as is often the case in code and math — bidirectional denoising within the block is not a full substitute for strict causal conditioning. The tradeoff is real and is reported transparently in the paper.

Pushing past 3× throughput is possible by increasing the number of tokens committed per step, but comes with larger quality penalties. The 2.42× figure represents the balanced default operating point.

What It Takes to Run This Model

The released checkpoint contains both towers, totaling roughly 60 billion parameters and approximately 126 gigabytes of storage. Full two-tower diffusion decoding requires two GPUs with approximately 59 gigabytes each in BF16 precision — typically two 80-gigabyte data-center GPUs such as the NVIDIA H100 or A100.
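
The per-GPU figure is roughly what a weights-only estimate predicts. The sketch below assumes exactly 30 billion parameters per tower at two bytes each and ignores the KV cache, Mamba state, and activations, so it is a sanity check rather than a sizing guide.

```python
PARAMS_PER_TOWER = 30e9      # nominal backbone size (from the article)
BYTES_PER_PARAM = 2          # BF16
GPU_MEMORY_GB = 80           # H100 / A100 80 GB class (from the article)
REPORTED_PER_GPU_GB = 59     # per-GPU requirement quoted in the article

weights_gb = PARAMS_PER_TOWER * BYTES_PER_PARAM / 1e9   # decimal gigabytes
headroom_gb = GPU_MEMORY_GB - REPORTED_PER_GPU_GB
fits = REPORTED_PER_GPU_GB <= GPU_MEMORY_GB

print(f"weights-only estimate per tower: {weights_gb:.0f} GB")
print(f"reported requirement per GPU:    {REPORTED_PER_GPU_GB} GB")
print(f"headroom on an {GPU_MEMORY_GB} GB GPU: {headroom_gb} GB (fits: {fits})")
```

The remaining headroom on each 80-gigabyte card is what is left for caches and activations, which is why one tower per GPU is the natural placement.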

For teams without the two-GPU setup, the same checkpoint supports two fallback modes: a mock-AR mode designed for compatibility testing and debugging, and standard AR-only decoding, which runs on a single 80-gigabyte GPU and matches the performance of the original autoregressive backbone. All three modes load from the same set of weights; switching between them requires only a configuration change rather than a different model file. The model loads via the standard Hugging Face Transformers library, with NVIDIA providing a place_towers_on_devices utility function to distribute the two towers across GPU devices.

What It Is Not Yet

The released checkpoint is a base model. It has not been instruction-tuned, and it has not been aligned for chat, coding assistant, or safe-deployment use cases. Building a production chat model or coding assistant on TwoTower would require fine-tuning and safety alignment as subsequent steps — the same work that converted the Nemotron backbone into the instruction-following Nemotron variants.

The unaligned base model status is the standard starting point for open research releases of this kind and does not diminish the architectural result. But it defines the gap between the paper's contribution — a demonstrated approach to high-throughput diffusion generation on a large hybrid backbone — and a deployable product.

The paper's authors — Fitsum Reda, John Kamalu, Roger Waleffe, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro — released training code alongside the weights, giving other research teams the material needed to adapt the TwoTower approach to different backbone models.

Frequently Asked Questions

What is a diffusion language model, and how does it differ from standard LLMs?

Standard large language models generate text autoregressively, predicting one token at a time from left to right. A diffusion language model replaces that sequential process with iterative denoising: it starts from a block of masked or noisy token positions and refines all of them in parallel over several steps, committing confident predictions early. This parallel generation is what allows diffusion LLMs to produce multiple tokens per forward pass and achieve substantially higher throughput than token-by-token autoregressive generation, at the cost of some accuracy on tasks requiring strict sequential reasoning.

Why do diffusion models perform worse on code and math tasks?

Code and math generation often requires strict sequential conditioning — the correct value of a token at position N depends critically on exactly what was committed at positions N-1 and N-2. When a diffusion model refines a block of tokens in parallel, it uses bidirectional attention within the block, which allows every token to inform every other token. That bidirectional context is powerful for tasks with global coherence requirements, but it is not a full substitute for strict left-to-right conditioning when token sequences are tightly ordered. For TwoTower, this manifests as a 3.7-point drop on HumanEval at the default operating point — a real limitation the paper reports transparently.

Can the TwoTower approach work on language models other than Nemotron?

According to the paper's authors, yes. The paper explicitly states that TwoTower is a general approach applicable to any pretrained autoregressive language model. The key requirement is that the backbone model's weights can be duplicated into two towers: one kept frozen for context modeling, one trained for denoising. NVIDIA has released the training code alongside the weights, which supports adapting the method to other base models.

What hardware does Nemotron TwoTower require for full diffusion decoding?

Full two-tower diffusion mode requires two GPUs with approximately 59 gigabytes each in BF16 precision — typically two 80-gigabyte data-center GPUs such as the NVIDIA H100 or A100. The same checkpoint also supports AR-only decoding on a single 80-gigabyte GPU, giving teams without dual-GPU setups access to the model in autoregressive mode while the full diffusion configuration remains a two-GPU workload.
