NVIDIA's NVFP4: Breaking the 4-Bit Pretraining Barrier for Large Language Models

NVIDIA has introduced NVFP4, a 4-bit floating-point format aimed at pretraining large language models more efficiently. Using native NVFP4 support in Blackwell Tensor Cores and a two-level microscaling scheme, NVIDIA researchers trained a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens, the longest publicly documented 4-bit training run. The model scores 62.58% on MMLU-Pro, close to the FP8 baseline's 62.62%, and FP4 GEMMs run at up to 6x BF16 throughput. This Q&A covers the technical ideas behind NVFP4 and what they mean for efficient AI training.

What is NVFP4 and how does it differ from standard 4-bit formats?

NVFP4 is a 4-bit microscaling format designed for NVIDIA Blackwell Tensor Cores. Standard MXFP4 uses 32-element blocks with UE8M0 scale factors, which can only be powers of two. NVFP4 changes three things. First, it reduces the block size from 32 to 16 elements, narrowing the dynamic range each scale must cover. Second, it stores block scale factors in E4M3 instead of UE8M0, trading exponent range for mantissa precision, so the per-block absolute maximum (amax) can be mapped much closer to the largest representable FP4 value. Third, it adds a second scaling level: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves stay in range. As a result, at least one value in each 16-element block (6.25%), the per-block amax, is represented at near-FP8 precision, while the rest sit in FP4. This reduces quantization error compared with plain power-of-two block scaling.
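The two-level scaling described above can be sketched in NumPy as a "fake quantization" round trip (quantize, then dequantize). This is a minimal illustration under stated assumptions, not NVIDIA's kernel: the function names are hypothetical, E4M3 is simulated in software with a maximum of 448, the FP4 (E2M1) magnitudes are taken to be 0, 0.5, 1, 1.5, 2, 3, 4, 6, and rounding ties on the FP4 grid resolve toward the smaller magnitude for simplicity.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX = 6.0
E4M3_MAX = 448.0
BLOCK = 16  # NVFP4 block size


def round_to_e4m3(x):
    """Simulate rounding non-negative values to E4M3 (3 mantissa bits)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, E4M3_MAX)
    _, exp = np.frexp(x)                 # x = m * 2**exp with m in [0.5, 1)
    e = np.maximum(exp - 1, -6)          # -6 is the smallest normal exponent
    step = np.ldexp(1.0, e - 3)          # spacing between neighbours in this binade
    return np.minimum(np.round(x / step) * step, E4M3_MAX)


def round_to_fp4(x):
    """Round to the nearest E2M1 value (ties go to the smaller magnitude)."""
    idx = np.abs(np.abs(x)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]


def nvfp4_fake_quant(x):
    """Quantize to simulated NVFP4 and dequantize; len(x) must divide by 16."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, BLOCK)
    tensor_amax = np.abs(blocks).max()
    if tensor_amax == 0.0:
        return np.zeros_like(x)
    # Level 2: FP32 per-tensor scale, chosen so block scales fit in E4M3.
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    # Level 1: per-block scale mapping the block amax near FP4_MAX, stored in E4M3.
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    block_scale = round_to_e4m3(block_amax / FP4_MAX / tensor_scale)
    scale = block_scale * tensor_scale
    safe = np.where(scale == 0.0, 1.0, scale)  # avoid division by zero
    q = round_to_fp4(np.clip(blocks / safe, -FP4_MAX, FP4_MAX))
    return (q * scale).reshape(x.shape)


rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = nvfp4_fake_quant(x)
print("max abs error:", np.abs(y - x).max())
```

Because the E4M3 block scale has mantissa bits, each block's largest element lands close to the FP4 maximum of 6 instead of being rounded to the nearest power of two, which is where the precision gain over UE8M0 scales comes from.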

Source: www.marktechpost.com

What speed advantages does NVFP4 offer on NVIDIA Blackwell hardware?

On NVIDIA Blackwell GPUs, FP4 GEMMs (general matrix multiplications) run at 4x BF16 throughput on GB200 and 6x on GB300. Compared to FP8, this translates to roughly 2x and 3x speedups, respectively. The operand memory footprint is also approximately halved compared to FP8, which reduces memory bandwidth pressure. These gains come from native NVFP4 support in the Tensor Cores, so the format does not have to be emulated in software. For large-scale pretraining, this can mean faster iteration cycles and lower energy use per run, or the same training budget spread over fewer GPUs.
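The "approximately halved" footprint and the FP8-relative speedups follow from simple arithmetic. The sketch below is a back-of-the-envelope check, assuming an 8-bit scale per block (E4M3 for NVFP4, UE8M0 for MXFP4), a negligible per-tensor FP32 scale, FP8 operands with negligible scale overhead, and FP8 GEMMs at 2x BF16 throughput (the ratio implied by the figures above). The helper name is hypothetical.

```python
def bits_per_element(value_bits, scale_bits, block_size):
    """Average storage per element, amortizing one scale over each block."""
    return value_bits + scale_bits / block_size


nvfp4_bits = bits_per_element(4, 8, 16)  # 4.5 bits
mxfp4_bits = bits_per_element(4, 8, 32)  # 4.25 bits
fp8_bits = 8.0                           # assumption: scale overhead ignored

footprint_vs_fp8 = nvfp4_bits / fp8_bits  # 0.5625, i.e. roughly half

fp4_over_bf16 = {"GB200": 4.0, "GB300": 6.0}  # from the text
fp8_over_bf16 = 2.0                           # assumption implied by the text
fp4_over_fp8 = {gpu: s / fp8_over_bf16 for gpu, s in fp4_over_bf16.items()}

print(nvfp4_bits, mxfp4_bits, footprint_vs_fp8, fp4_over_fp8)
```

The smaller block size costs NVFP4 an extra quarter of a bit per element relative to MXFP4, which is the price paid for finer-grained scaling.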

Which parts of the neural network are quantized to NVFP4 and which remain in higher precision?

Only the GEMMs inside linear (fully-connected) layers—Fprop, Dgrad, and Wgrad—actually run in NVFP4. Embeddings, the output projection head, normalization layers, non-linearities, and all attention components (including softmax, query-key and attention score-value batched GEMMs) stay in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are kept in FP32. Tensor parallel reductions run in BF16. This selective quantization ensures that the most compute-intensive operations benefit from NVFP4's speed, while maintaining numerical stability in sensitive components like attention and normalization.
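One way to keep this split straight is to write it down as a lookup table. The sketch below is only an illustration of the assignment described above: the component names and helper functions are hypothetical and do not correspond to any Transformer Engine API, and where the text says "BF16 or FP32" both are listed because the choice is not specified.

```python
NVFP4, BF16, FP32 = "nvfp4", "bf16", "fp32"

# Hypothetical component names; values are the precisions allowed per the text.
PRECISION_PLAN = {
    "linear.fprop": (NVFP4,),
    "linear.dgrad": (NVFP4,),
    "linear.wgrad": (NVFP4,),
    "tensor_parallel_reduction": (BF16,),
    "master_weights": (FP32,),
    "weight_grad_accumulation": (FP32,),
    "optimizer_state": (FP32,),
}

# Everything else (embeddings, output head, normalization, non-linearities,
# attention softmax and batched GEMMs) stays in BF16 or FP32.
HIGH_PRECISION = (BF16, FP32)


def precision_for(component):
    """Return the allowed precisions for a component, defaulting to high precision."""
    return PRECISION_PLAN.get(component, HIGH_PRECISION)


def uses_nvfp4(component):
    return NVFP4 in precision_for(component)


for name in ("linear.wgrad", "attention.softmax", "optimizer_state"):
    print(name, precision_for(name))
```

Defaulting unknown components to high precision mirrors the conservative stance in the text: only the three linear-layer GEMMs opt in to 4-bit arithmetic.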

How accurate was the 12B hybrid model trained with NVFP4 compared to FP8?

The 12-billion-parameter hybrid Mamba-Transformer model trained with NVFP4 achieved 62.58% on MMLU-Pro 5-shot, compared to the FP8 baseline's 62.62%. The gap of 0.04 percentage points is small enough to suggest that 4-bit pretraining with NVFP4 incurs little or no accuracy loss on this benchmark, even over a training horizon of 10 trillion tokens. NVFP4 training is also supported in NVIDIA's Transformer Engine, so practitioners can try the format without writing custom kernels.

What challenge did NVIDIA overcome to make 4-bit pretraining feasible?

Pretraining in 4-bit precision has historically been an open research problem because narrower formats compress dynamic range and amplify quantization error, especially at long token horizons. NVIDIA's initial experiments with straightforward NVFP4 quantization (default 1x16 block scaling, round-to-nearest-even, no transforms) diverged early in training. To overcome this, they developed a four-part training methodology: keeping a small number of numerically sensitive linear layers in higher precision, applying random Hadamard transforms to bound block-level outliers, using two-dimensional block scaling so that weights are quantized consistently in the forward and backward passes, and applying stochastic rounding to gradients so that rounding errors are unbiased. The successful 10-trillion-token run shows that careful engineering can stabilize 4-bit training at scale, and it may open the door to more aggressive quantization in the future.
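To see why the rounding mode can matter on a grid as coarse as FP4, the sketch below contrasts deterministic nearest rounding with stochastic rounding. It is a generic NumPy illustration, not NVIDIA's implementation: the function names are hypothetical, the signed E2M1 grid is assumed to be ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}, and ties in the nearest-rounding helper go to the lower grid point rather than to even.

```python
import numpy as np

FP4_GRID = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                     0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float64)


def round_nearest(x):
    """Deterministic rounding to the closest grid point."""
    idx = np.abs(x[..., None] - FP4_GRID).argmin(axis=-1)
    return FP4_GRID[idx]


def round_stochastic(x, rng):
    """Round up or down at random, with probability proportional to proximity."""
    x = np.clip(x, FP4_GRID[0], FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, x, side="left"),
                     1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (x - lo) / (hi - lo)          # 0 at lo, 1 at hi
    return np.where(rng.random(x.shape) < p_up, hi, lo)


rng = np.random.default_rng(0)
x = np.full(100_000, 0.2)                # a small "gradient" value

nearest_mean = round_nearest(x).mean()   # every element rounds to 0.0
stochastic_mean = round_stochastic(x, rng).mean()  # close to 0.2 on average
print(nearest_mean, stochastic_mean)
```

Nearest rounding sends every copy of 0.2 to zero, so the signal disappears entirely, while stochastic rounding returns 0 or 0.5 at random and preserves the value in expectation at the cost of added noise.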

Why is this 10 trillion token run considered a milestone for low-precision training?

Training a 12-billion-parameter model on 10 trillion tokens is the longest publicly documented training run in 4-bit precision. It shows that a 4-bit floating-point format can sustain accurate training over a multi-trillion-token horizon without divergence or visible degradation. The 2x to 3x GEMM throughput advantage over FP8 could translate into lower training costs, although end-to-end savings depend on how much of the run is spent in quantized GEMMs. The run also sets a precedent for research into sub-8-bit training, showing that with the right hardware and methodology, 4-bit pretraining is practical and not merely possible.

What are the broader implications of NVFP4 for the AI industry?

NVFP4 could broaden access to large language model training by lowering hardware requirements and energy costs. With GEMM speedups of 2x to 3x over FP8 and near-lossless accuracy in the reported run, organizations may be able to train larger models faster or use fewer GPUs. Native support in Blackwell Tensor Cores and Transformer Engine lowers the barrier to adoption. The result also shows that hybrid architectures such as Mamba-Transformer can tolerate reduced precision. Finally, the success of 4-bit pretraining may encourage research into even narrower formats.
