Cursor Releases Warp Decode for 1.84x Faster MoE Inference on Blackwell GPUs

Cursor

Apr 7, 2026 · Updated Apr 25, 2026

Anysphere rebuilt the Mixture of Experts inference path for NVIDIA Blackwell GPUs, achieving 1.84x higher throughput by assigning GPU warps to individual output neurons. This warp decode approach eliminates the data-shuffling overhead typical of expert-centric implementations, with a reported 1.4x improvement in output accuracy.

Cursor, an AI-native code editor built by Anysphere, introduced warp decode to optimize inference for Mixture of Experts (MoE) models, an architecture built from specialized sub-networks called experts. Traditional implementations organize computation around experts, requiring five bookkeeping stages to shuffle data. Warp decode flips this axis, assigning each 32-thread GPU warp to a single output value.
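To make the change of axis concrete, here is a minimal Python sketch. It is not Cursor's kernel: it assumes a toy MoE layer in which each expert is a single weight matrix (real experts use several projections and a nonlinearity), and the names `warp_reduce_sum`, `warp_decode_output`, and `expert_centric` are hypothetical. One simulated 32-lane warp computes one output value across all routed experts, and the result is compared with an expert-centric reference that stages per-expert outputs in intermediate buffers.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_reduce_sum(lanes):
    """Tree-reduce 32 per-lane partial sums, as a shuffle-based reduction would."""
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]


def warp_decode_output(x, expert_weights, routed, out_idx):
    """One simulated warp produces one output value (assumed toy layer).

    routed is a list of (expert_id, gate_weight) pairs for the current token.
    """
    lanes = [0.0] * WARP_SIZE
    for expert_id, gate in routed:
        row = expert_weights[expert_id][out_idx]
        for k, x_k in enumerate(x):
            # Lane k % 32 handles a strided slice of the input vector.
            lanes[k % WARP_SIZE] += gate * row[k] * x_k
    return warp_reduce_sum(lanes)


def expert_centric(x, expert_weights, routed):
    """Reference path: one intermediate buffer per expert, then a weighted combine."""
    buffers = []
    for expert_id, _ in routed:
        weights = expert_weights[expert_id]
        buffers.append([sum(w * x_k for w, x_k in zip(row, x)) for row in weights])
    n_out = len(buffers[0])
    return [
        sum(gate * buf[j] for (_, gate), buf in zip(routed, buffers))
        for j in range(n_out)
    ]


if __name__ == "__main__":
    # Illustrative inputs only: 3 experts, 4 outputs, 64 input features.
    x = [float(k % 5 - 2) for k in range(64)]
    experts = [
        [[float((e + j + k) % 3 - 1) for k in range(64)] for j in range(4)]
        for e in range(3)
    ]
    routed = [(0, 0.5), (2, 0.25)]
    per_warp = [warp_decode_output(x, experts, routed, j) for j in range(4)]
    print(per_warp, expert_centric(x, experts, routed))
```

In this sketch the output-centric path never materializes a per-expert buffer: each output accumulates its contributions from every routed expert in lane-local partial sums, and the only cross-lane step is the reduction inside the warp.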

Standard MoE paths struggle during single-token generation, where this bookkeeping overhead cannot be amortized across many tokens. By eliminating intermediate memory buffers and cross-warp synchronization, warp decode achieves 58% of the Blackwell B200's peak memory bandwidth. It also improves accuracy by keeping activations in BF16, avoiding the rounding errors found in common quantization methods.
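The accuracy point can be illustrated with a small Python sketch. It is a comparison under stated assumptions, not Cursor's measurement: the source does not say which quantization methods it has in mind, so per-tensor absmax INT8 stands in here as one common scheme, and `to_bf16`, `quantize_int8_absmax`, and `relative_error` are hypothetical helper names. BF16 keeps a floating-point exponent, so its relative rounding error stays bounded for small and large values alike, while a single shared integer scale can round small activations to zero.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest BF16 value (round-to-nearest-even).

    Returned as a Python float. NaN, infinity, and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 is the top 16 bits of float32; round away the lower 16 bits.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def quantize_int8_absmax(values):
    """Assumed scheme: per-tensor absmax INT8 with one shared scale."""
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def relative_error(approx, exact):
    """Relative error of approx with respect to a nonzero exact value."""
    return abs(approx - exact) / abs(exact)


if __name__ == "__main__":
    # Illustrative activations spanning several orders of magnitude.
    activations = [0.001, 0.5, 1.0, -3.0, 100.0]
    int8 = quantize_int8_absmax(activations)
    for a, q in zip(activations, int8):
        print(a, relative_error(to_bf16(a), a), relative_error(q, a))
```

With a wide dynamic range like the one above, the smallest activation falls below half of the shared INT8 step and is lost entirely, while its BF16 rounding error stays within the format's relative precision.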

These improvements accelerate the training pipeline for Composer, the model powering Cursor. Warp decode is not a replacement for expert-centric prefill, but it lets the team ship improved model versions more frequently. You will see these performance and accuracy gains reflected in the editor's responsiveness as the team scales its Blackwell infrastructure.

View the full update on cursor.com

Cursor (@cursor_ai) · Apr 6

We rebuilt how MoE models generate tokens on Blackwell GPUs, resulting in 1.84x faster inference and more accurate outputs. These improvements directly contribute to how we train Composer, allowing us to ship improved versions of the model more often. https://t.co/G7o1ZE29nO

Every HeadsUpAI update is written based on its original source and reviewed before it's published.

Keep reading

Cursor Autonomously Optimizes NVIDIA CUDA Kernels for 38 Percent Speedup

Cursor partnered with NVIDIA to apply a multi-agent system to CUDA kernel optimization, achieving a 38 percent geomean speedup on Blackwell GPUs. This demonstrates that autonomous agents can solve complex hardware engineering tasks that previously required months of manual effort from human experts.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity · May 12

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

Cohere Integrates W4A8 Inference into vLLM for Faster Hopper Performance

Cohere · Apr 24

Cohere released production-ready W4A8 quantization kernels for dense and Mixture of Experts models, now integrated into the vLLM inference framework. By combining 4-bit weights with 8-bit activations, the update achieves up to 58 percent faster prefill and 45 percent faster decoding on NVIDIA Hopper GPUs.