Details
- NVIDIA AI and Sakana AI Labs collaborated on an ICML 2026 paper introducing sparse transformer kernels and formats optimized for modern NVIDIA GPUs.
- Key innovations include the TwELL sparse packing technique and fused CUDA kernels for efficient execution (a packing sketch follows this list).
- Achieves speedups of over 20% in both inference and training at scale compared to prior methods.
- Builds on prior work such as Berkeley's 2020 sparse GPU kernels (roughly 27% of peak throughput on a V100) and the recent STOF framework for flexible masking.
- Paper and code released openly, enabling community adoption for sparse variants of models like Transformers and MobileNets.
- Targets hardware like NVIDIA Ampere GPUs, which support 2:4 structured-sparse matrix multiplies at up to 2x dense throughput (see the 2:4 pruning sketch after this list).
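
The TwELL format itself is not documented in this summary; as a rough sketch of the general idea behind ELL-style sparse packing (every row padded to a uniform slot count so the data lays out as dense, coalesced-friendly arrays), here is a minimal NumPy version. The helper names `ell_pack` and `ell_matvec` are illustrative only and not from the paper.

```python
import numpy as np

def ell_pack(w, pad_value=0.0):
    """Pack a sparse matrix into ELL-style (values, column-index) arrays.

    Each row gets the same number of slots (the max nonzeros per row),
    with short rows padded by zeros, so the packed data forms a dense
    2-D array that a GPU kernel can read with coalesced accesses.
    Hypothetical helper for illustration; not the paper's TwELL format.
    """
    rows, cols = w.shape
    width = int((w != 0).sum(axis=1).max())
    values = np.full((rows, width), pad_value, dtype=w.dtype)
    indices = np.zeros((rows, width), dtype=np.int32)
    for r in range(rows):
        nz = np.nonzero(w[r])[0]
        values[r, : len(nz)] = w[r, nz]
        indices[r, : len(nz)] = nz
    return values, indices

def ell_matvec(values, indices, x):
    """Sparse matrix-vector product over the packed representation.

    The gather-multiply-accumulate per row is what a fused GPU kernel
    would execute in a single pass over the packed arrays.
    """
    return (values * x[indices]).sum(axis=1)

# Quick check against the dense product.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)) * (rng.random((8, 16)) < 0.25)
x = rng.standard_normal(16)
vals, idx = ell_pack(w)
assert np.allclose(ell_matvec(vals, idx, x), w @ x)
```

A fused kernel would perform the gather, multiply, and accumulate of `ell_matvec` in one launch rather than materializing intermediate tensors, which is where the execution efficiency comes from.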
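Ampere's 2:4 structured sparsity constrains every group of four consecutive weights to at most two nonzeros, a pattern the Sparse Tensor Cores can then skip in hardware. Below is a minimal NumPy sketch of that pruning pattern, assuming magnitude-based selection; `prune_2_to_4` is a hypothetical helper, not a vendor API.

```python
import numpy as np

def prune_2_to_4(w):
    """Apply 2:4 structured pruning along the last dimension.

    In every contiguous group of 4 weights, keep the 2 with the largest
    magnitude and zero the rest -- the sparsity pattern that Ampere
    Sparse Tensor Cores execute at up to 2x dense throughput.
    Minimal sketch only; production flows use vendor tooling instead.
    """
    rows, cols = w.shape
    assert cols % 4 == 0, "2:4 pruning expects the inner dim to be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    # Indices of the two smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

w = np.random.default_rng(1).standard_normal((4, 8))
w_sparse = prune_2_to_4(w)
# Exactly half of the weights survive: two per group of four.
assert (w_sparse != 0).sum() == w.size // 2
```

A real deployment would produce and consume the compressed 2:4 representation through vendor libraries such as cuSPARSELt or framework pruning utilities; the sketch only shows the mask pattern the hardware expects.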
Impact
This collaboration advances sparse Transformers on NVIDIA GPUs, delivering 20%+ speedups that pressure rivals like AMD and Intel in AI acceleration. By optimizing for modern architectures with TwELL packing and fused kernels, it lowers compute costs and boosts throughput for large-scale training and inference, aligning with industry shifts toward sparsity for efficiency. Compared to STOF's flexible masking, it emphasizes scale, potentially widening NVIDIA's lead in high-performance AI workloads amid growing demand for memory-efficient models.
