Details
- Cursor announced a multi-agent system that autonomously builds and maintains complex software, and applied it in partnership with NVIDIA to optimize CUDA kernels for Blackwell B200 GPUs.
- In three weeks, the system optimized 235 problems from SOL-ExecBench, a benchmark derived from 124 production open-source models including DeepSeek, Qwen, Gemma, Kimi, and Stable Diffusion.
- Achieved a 38% geomean speedup over baselines, beating the baseline on 63% of problems (149/235) and delivering over 2x speedup on 19% (45/235).
- A planner agent distributed work across worker agents and learned to call benchmarking pipelines, enabling continuous testing, debugging, and optimization down to the assembly level.
- Such optimizations typically take expert engineers months or years; faster kernels raise GPU utilization for AI training and inference, lowering per-token costs.
- Validates multi-agent architectures on novel problems; the techniques will be integrated into Cursor's core product.
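The headline "geomean speedup" aggregates per-problem speedups with the geometric mean rather than the arithmetic mean, so a single outlier kernel cannot dominate the figure. A minimal sketch of the calculation; the speedup values below are hypothetical placeholders, not the benchmark's actual numbers:

```python
import math

def geomean(xs):
    """Geometric mean: the n-th root of the product of n values,
    computed in log space for numerical stability."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-problem speedups (baseline time / optimized time).
speedups = [1.05, 2.3, 1.0, 1.8, 0.95, 1.4]

g = geomean(speedups)
print(f"geomean speedup: {g:.2f}x ({(g - 1) * 100:.0f}% over baseline)")
# prints "geomean speedup: 1.34x (34% over baseline)"
```

Note that the 0.95x entry (a regression on one problem) pulls the aggregate down multiplicatively, which is why geomean is the standard way to report speedups across a benchmark suite.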
Impact
Cursor's multi-agent system delivers concrete efficiency gains in CUDA kernel optimization, achieving a 38% geomean speedup on Blackwell GPUs for work that takes human experts months. By automating low-level GPU code optimization, it pressures rivals in AI infrastructure and could lower training costs and accelerate model development. Compared with NVIDIA's own agentic research, such as Nemotron 3 Super and its orchestration agents, Cursor's approach demonstrates practical deployment on production kernels from models like DeepSeek and Gemma, narrowing the gap in agentic AI for hardware optimization while advancing cheaper inference at scale.
