Details
- NVIDIA NeMo RL, an open-source library in the NeMo ecosystem, now supports end-to-end FP8 precision for reinforcement learning (RL) post-training to boost agentic performance.
- Achieves a 1.48x speedup (48% faster) on Qwen3-8B-Base workloads, enabling faster iteration for agentic tool use and multi-step reasoning.
- Maintains accuracy parity with BF16: validation accuracy of 0.613 on Llama 3.1 8B Instruct vs. 0.616 for BF16; scales to Qwen3-30B MoE models.
- Uses block-wise quantized FP8 (E4M3 format) with 128x128 granularity for weights and 1x128 for activations; FP8 linear layers deliver up to 2x the theoretical throughput of BF16, while attention remains in BF16.
- Integrates TransformerEngine for training and vLLM (via monkey patches) for FP8 generation; dynamic recalibration adds a minor 2-3% overhead per training step.
- Pre-configured recipes for Llama 3.1 8B and Moonlight 16B; requires Hopper GPUs and CUDA >=12.9.
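The block-wise scheme above assigns one scale factor per 128x128 weight tile, chosen so each tile's absolute maximum maps to the largest finite E4M3 value (448). A minimal numpy sketch of that idea, simulating the quantize-dequantize round trip (the function names are illustrative, not NeMo RL's API, and actual rounding to the FP8 grid is omitted):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def blockwise_fp8_scales(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Per-tile scale factors for block-wise FP8 quantization (a sketch).

    Each block x block tile is scaled so its absolute maximum maps to E4M3_MAX.
    """
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = np.abs(tiles).max(axis=(1, 3))        # one amax per tile
    return np.maximum(amax, 1e-12) / E4M3_MAX    # avoid divide-by-zero on all-zero tiles

def quant_dequant(w: np.ndarray, block: int = 128) -> np.ndarray:
    """Simulate the quantize -> dequantize round trip at tile granularity."""
    scales = blockwise_fp8_scales(w, block)
    s = np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
    q = np.clip(w / s, -E4M3_MAX, E4M3_MAX)      # values as they would be stored in FP8
    return q * s                                  # dequantized back to high precision
```

Activations use the same principle at 1x128 granularity, i.e. one scale per 128-element row segment rather than per square tile.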
Impact
NVIDIA's FP8 recipe in NeMo RL delivers 48% faster RL training at BF16-equivalent accuracy, reducing compute costs for agentic AI development on Hopper GPUs. This pressures rivals such as DeepMind and OpenAI, whose stacks rely on higher-precision BF16/FP16 training without comparable low-precision RL acceleration. By lowering the barrier to rapid policy iteration in tool-use and multi-step agents, it accelerates enterprise adoption of RLHF/RLAIF workflows, narrowing the gap with leaders in frontier model alignment while optimizing NVIDIA's GPU ecosystem for sustained AI infrastructure dominance.
