We've been developing a multi-agent system that builds and maintains complex software autonomously. Recently, we partnered with NVIDIA to apply it to optimizing CUDA kernels. In 3 weeks, it delivered a 38% geomean speedup across 235 problems. https://t.co/0YvbXrzVfe
Cursor Autonomously Optimizes NVIDIA CUDA Kernels for 38 Percent Speedup
Cursor partnered with NVIDIA to apply a multi-agent system to CUDA kernel optimization, achieving a 38 percent geomean speedup on Blackwell GPUs. This demonstrates that autonomous agents can solve complex hardware engineering tasks that previously required months of manual effort from human experts.
Over three weeks, the system autonomously optimized CUDA kernels (software that executes math on GPUs) for NVIDIA Blackwell B200 hardware. It achieved a 38% geomean speedup across 235 problems, with 19% of solutions delivering improvements greater than 2x.

GPU performance is often limited by the manual effort required to tune individual operations for new hardware. This research indicates that multi-agent architectures can search the full space of low-level implementations, down to assembly. By automating this tuning, providers could significantly reduce the cost and energy consumption of inference (running a trained model).
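The headline numbers combine two metrics: a geometric-mean (geomean) speedup across all problems and the share of problems that sped up by more than 2x. The sketch below shows how such figures are commonly computed. It assumes speedup is defined as baseline runtime divided by optimized runtime, which is the usual convention; the article does not specify its exact formula or baseline. Under that convention, a 38% geomean speedup corresponds to a geometric-mean ratio of about 1.38. The function names and the four timings are hypothetical and purely illustrative, not data from the experiment.

```python
import math
from typing import Sequence


def geomean_speedup(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> float:
    """Geometric mean of per-problem speedups (baseline time / optimized time).

    Assumption: speedup = baseline / optimized; the article does not state its formula.
    """
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need equal-length, non-empty timing lists")
    # Average in log space so a 2x gain and a 2x slowdown cancel out exactly,
    # instead of letting one large outlier dominate as an arithmetic mean would.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_ms, optimized_ms))
    return math.exp(log_sum / len(baseline_ms))


def share_above(baseline_ms: Sequence[float], optimized_ms: Sequence[float],
                threshold: float = 2.0) -> float:
    """Fraction of problems whose speedup strictly exceeds `threshold`."""
    if not baseline_ms or len(baseline_ms) != len(optimized_ms):
        raise ValueError("need equal-length, non-empty timing lists")
    speedups = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return sum(s > threshold for s in speedups) / len(speedups)


if __name__ == "__main__":
    # Hypothetical runtimes (ms) for four problems; illustrative only.
    baseline = [4.0, 10.0, 3.0, 8.0]
    optimized = [2.0, 10.0, 1.0, 8.0]  # per-problem speedups: 2x, 1x, 3x, 1x

    g = geomean_speedup(baseline, optimized)
    print(f"geomean speedup: {g:.3f}x ({(g - 1) * 100:.1f}%)")  # prints "geomean speedup: 1.565x (56.5%)"
    print(f"share above 2x: {share_above(baseline, optimized):.0%}")  # prints "share above 2x: 25%"
```

Note that a problem sped up by exactly 2x does not count toward "greater than 2x", and unchanged problems (1x) still pull the geomean down, which is why the geomean is a stricter summary than the best individual result.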
While this experiment focused on Blackwell kernels, these multi-agent coordination techniques will soon be integrated into the core Cursor product. You can already use the editor for agentic coding; future updates are expected to handle more autonomous, long-running optimization workflows, including problems that fall outside standard training-data distributions.
Every HeadsUpAI update is written based on its original source and reviewed before it's published.



