Details
- NVIDIA NeMo RL, an open-source library in the NeMo ecosystem, now supports end-to-end FP8 precision for reinforcement learning (RL) post-training to boost agentic performance.
- Achieves a 1.48x speedup (48% faster) on Qwen3-8B-Base workloads, enabling faster iteration for agentic tool use and multi-step reasoning.
- Maintains accuracy parity with BF16: validation accuracy of 0.613 on Llama 3.1 8B Instruct vs. 0.616 for BF16; scales to Qwen3-30B MoE models.
- Uses block-wise quantized FP8 (E4M3 format) with 128x128 granularity for weights and 1x128 for activations; FP8 linear layers deliver up to 2x the theoretical throughput of BF16, while attention remains in BF16.
- Integrates TransformerEngine for training and vLLM (via monkey patches) for FP8 generation; dynamic recalibration adds a minor 2-3% overhead per training step.
- Pre-configured recipes for Llama 3.1 8B and Moonlight 16B; requires Hopper GPUs and CUDA >=12.9.
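The block-wise scheme above assigns one scale factor per 128x128 weight tile, chosen so each tile's absolute maximum maps to the largest finite E4M3 value (448). A minimal numpy sketch of that idea, simulating the quantize-dequantize round trip (the function names are illustrative, not NeMo RL's API, and actual rounding to the FP8 grid is omitted):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def blockwise_fp8_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Per-tile scale factors for block-wise FP8 quantization (a sketch).

    Each block x block tile is scaled so its absolute maximum maps to E4M3_MAX.
    """
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = np.abs(tiles).max(axis=(1, 3))        # one amax per tile
    return np.maximum(amax, 1e-12) / E4M3_MAX    # avoid divide-by-zero on all-zero tiles

def quant_dequant(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Simulate the quantize -> dequantize round trip at tile granularity."""
    scales = blockwise_fp8_scales(w, block)
    s = np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
    q = np.clip(w / s, -E4M3_MAX, E4M3_MAX)      # values as they would be stored in FP8
    return q * s                                  # dequantized back to high precision
```

Activations use the same principle at 1x128 granularity, i.e. one scale per 128-element row segment rather than per square tile.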
Impact
NVIDIA's FP8 recipe in NeMo RL delivers 48% faster RL training at BF16-equivalent accuracy, reducing compute costs for agentic AI development on Hopper GPUs. This pressures rivals such as DeepMind and OpenAI, whose stacks rely on higher-precision BF16/FP16 training without comparable low-precision RL acceleration. By lowering the barrier to rapid policy iteration in tool-use and multi-step agents, it accelerates enterprise adoption of RLHF/RLAIF workflows, narrowing the gap with leaders in frontier model alignment while optimizing NVIDIA's GPU ecosystem for sustained AI infrastructure dominance.
