Together AI Delivers 31% Faster Coding Agent Inference on Blackwell

Together AI

Jun 15, 2026

Together AI published coding agent benchmarks showing that its inference engine delivers 31% more tokens per second than the next-fastest open-source engine running on the same NVIDIA Blackwell hardware. Together AI attributes the gain to custom kernels built for Blackwell's Tensor Core instructions. Cursor runs its real-time coding agents on this stack, where low inference latency keeps in-editor feedback loops fast.

View the full update on together.ai

Together AI (@togethercompute), 2d ago

The case for Blackwell in production agent infrastructure just got cleaner. @ArtificialAnlys AgentPerf gives the hardware picture. Together's coding agent benchmarks give the inference picture: 31% more TPS than the next-fastest OSS engine on the same hardware, through custom kernels built for Blackwell's Tensor Core instructions. Cursor runs their real-time coding agents on this stack. Learn more about how we built it in the 🧵

Keep reading

Together AI Delivers Real-Time Blackwell Inference Infrastructure for Cursor Agents

Together AI built a real-time inference stack for Cursor’s in-editor coding agents using NVIDIA Blackwell GB200 NVL72 and B200 GPUs. The infrastructure features custom kernels for Blackwell Tensor Core instructions, ARM host optimization, and a quantization pipeline that moves internally trained model weights to production test endpoints within days, ensuring predictable latency for real-time code refactoring.

Cursor Autonomously Optimizes NVIDIA CUDA Kernels for 38 Percent Speedup

Cursor, Apr 15

Cursor partnered with NVIDIA to apply a multi-agent system to CUDA kernel optimization, achieving a 38 percent geometric-mean speedup on Blackwell GPUs. The result suggests that autonomous agents can handle complex hardware engineering tasks that previously required months of manual effort from human experts.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen, May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI

LightSeek Foundation, May 7

LightSeek Foundation released TokenSpeed, an open-source inference engine designed specifically for the long-context and high-throughput demands of AI coding agents. By optimizing kernels for NVIDIA Blackwell hardware, the system achieves higher performance than TensorRT-LLM on agentic benchmarks while maintaining the usability of vLLM.