HeadsUpAI

Cursor Autonomously Optimizes NVIDIA CUDA Kernels for 38 Percent Speedup


Cursor, an AI-native code editor built by Anysphere, developed a multi-agent system that autonomously optimized 235 CUDA kernels (programs that run math operations on GPUs) for NVIDIA Blackwell 200 hardware. Over three weeks, the system achieved a 38% geometric-mean (geomean) speedup across those kernels, with 19% of solutions running more than 2x faster.
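To make the headline metric concrete, here is a minimal sketch of how a geomean speedup and a "share above 2x" figure can be computed from per-kernel results. The function names and the sample ratios are hypothetical and chosen only for illustration; they are not Cursor's data or benchmark code. Each ratio is assumed to be baseline time divided by optimized time, so values above 1.0 mean the optimized kernel is faster.

```python
import math


def geomean_speedup(speedups):
    """Geometric mean of per-kernel speedup ratios (baseline_time / optimized_time)."""
    if not speedups:
        raise ValueError("need at least one speedup ratio")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedup ratios must be positive")
    # Average the logs, then exponentiate: equivalent to the n-th root of the
    # product, but numerically safer when there are many ratios.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def fraction_above(speedups, threshold=2.0):
    """Share of kernels whose speedup ratio strictly exceeds the threshold."""
    return sum(s > threshold for s in speedups) / len(speedups)


# Hypothetical ratios for four kernels (not real results).
example = [0.5, 1.0, 2.0, 4.0]
print(f"geomean speedup: {geomean_speedup(example):.3f}x")   # about 1.414x
print(f"share above 2x:  {fraction_above(example):.0%}")     # 25%
```

A geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel to exactly 1.0x, whereas an arithmetic mean of the same sample (1.875x here) would overstate the gain.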

GPU performance is often limited by the manual effort required to tune individual operations for each new hardware generation. This result suggests that multi-agent architectures can search the large solution space of low-level GPU code, down to the assembly level. By automating this optimization, providers could significantly reduce the cost and energy consumption of inference (running a trained model).

While this experiment focused on Blackwell kernels, these multi-agent coordination techniques will soon be integrated into the core Cursor product. The editor already supports agentic coding today; expect future updates to handle more autonomous, long-running optimization workflows, including problems that fall outside what models typically see in their training data.

Cursor (@cursor_ai) on X:

We've been developing a multi-agent system that builds and maintains complex software autonomously. Recently, we partnered with NVIDIA to apply it to optimizing CUDA kernels. In 3 weeks, it delivered a 38% geomean speedup across 235 problems. https://t.co/0YvbXrzVfe

