Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen

May 27, 2026 · Updated Jun 12, 2026

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

Qwen, the AI lab behind the Qwen series of open-source models, achieved 580 tokens per second (tps) for agentic workloads on NVIDIA Blackwell GPUs. The milestone was reached with TokenSpeed, an open-source inference engine built for complex multi-step agent tasks. The setup runs the FP8 open-weight release of Qwen3.5-397B-A17B.

Peak throughput: 580 tokens per second
Hardware: NVIDIA Blackwell B200
Context window: 1M tokens
Cache hit rate: over 90%
Model: Qwen3.5-397B-A17B

Agentic AI requires rapid multi-turn reasoning, but massive models often struggle with the latency of the agent loop. Qwen3.5 uses a hybrid architecture that mixes standard attention with linear attention layers to reduce computational complexity on long sequences. This approach joins NVIDIA's agent-native inference stack in prioritizing throughput for autonomous, long-running AI sessions.

You can deploy these optimizations via the TokenSpeed runner to handle complex agent patterns with cache hit rates above 90%. The engine maintains performance across long contexts, showing only a small throughput drop when scaling to one million tokens. Native Flash Attention 4 support for Blackwell is in development.

View the full update on pytorch.org

Qwen

@Alibaba_Qwen · May 27

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 https://t.co/p04wookcZj #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

Still wondering? A few quick answers below.

What is the TokenSpeed inference engine?

TokenSpeed is a high-performance, open-source inference engine developed by the LightSeek Foundation. It is purpose-built to accelerate agentic workloads, which involve complex multi-step tasks and tool calling. The engine uses a native architecture and static compilation to achieve high throughput and low latency, aiming for performance comparable to proprietary solutions like TensorRT-LLM.

How does Qwen3.5 achieve high speeds for agentic workloads?

The Qwen3.5 model uses a hybrid attention architecture that interleaves standard Transformer layers with linear attention layers called Gated Delta Networks. This design maintains strong reasoning capabilities while significantly reducing the computational complexity required for long-sequence inference. When paired with the TokenSpeed engine, it eliminates redundant memory copies and uses kernel fusion to keep the GPU saturated.
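To make the linear attention idea concrete, here is a minimal sketch of one token step of a gated delta rule, written in the simplified single-head form commonly used to describe Gated Delta Networks. Whether Qwen3.5's layers match this exact form is an assumption, and the function name, dimensions, and values are hypothetical. The point it illustrates is that the layer carries a fixed-size state matrix rather than a cache that grows with every token.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One token of a simplified gated delta rule (illustrative only).

    S: d_v x d_k state matrix as a list of rows; q and k have length d_k;
    v has length d_v. alpha decays old memory, beta sets write strength.
    """
    # What the state currently returns for key k: S @ k
    Sk = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    # S <- alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (row[j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i, row in enumerate(S)
    ]
    # Read out with the query: S_new @ q
    out = [sum(s * qj for s, qj in zip(row, q)) for row in S_new]
    return S_new, out


# Hypothetical 2x2 state; keys are unit vectors so writes are exact.
S = [[0.0, 0.0], [0.0, 0.0]]
S, _ = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], alpha=1.0, beta=1.0)
S, _ = gated_delta_step(S, q=[1.0, 0.0], k=[0.0, 1.0], v=[5.0, 6.0], alpha=1.0, beta=1.0)
# Writing a new value under the first key replaces the old one.
S, out = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[7.0, 8.0], alpha=1.0, beta=1.0)
state_shape = (len(S), len(S[0]))
print(out, state_shape)
```

After three tokens the state is still a 2x2 matrix, which is why the per-token cost of such a layer does not depend on how long the context is. Production kernels process tokens in chunks on the GPU; this loop-per-token form is only for reading.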

Is the Qwen3.5-397B-A17B model open source?

Yes, the Qwen3.5-397B-A17B model is part of Alibaba's open-source model family. It is a Mixture-of-Experts model that contains 397 billion total parameters but only activates 17 billion parameters per token. This allows the model to deliver frontier-level performance while remaining efficient enough for developers and researchers to self-host on standard high-end GPU infrastructure.
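The parameter counts above translate into two quick numbers worth having in mind. The sketch below uses only the figures quoted in this article plus the fact that FP8 stores one byte per weight. It is a rough estimate: it ignores quantization scales, any layers kept in higher precision, the KV cache, and activations.

```python
TOTAL_PARAMS = 397e9      # total parameters, from the article
ACTIVE_PARAMS = 17e9      # parameters activated per token, from the article
BYTES_PER_PARAM_FP8 = 1   # FP8 uses one byte per weight

# Share of the model that does work for any single token
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Approximate memory needed just to hold the weights in FP8
weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9

print(f"active per token: {active_fraction:.1%}")
print(f"FP8 weights only: about {weight_gb:.0f} GB")
```

All experts must still be resident in GPU memory even though only a small share is used per token, so the weight footprint is set by the total count and the per-token compute by the active count.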

What is prefix caching in the context of Qwen3.5 and TokenSpeed?

Prefix caching is a technique that stores previously processed conversation history so it does not have to be recomputed. For agentic workloads with multi-turn dialogues, TokenSpeed uses a dual-layer cache that stores both standard KV data and recurrent Mamba states. This system achieves hit rates over 90 percent, significantly reducing the time spent on initial processing.
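The general idea of prefix caching can be sketched in a few lines. This is a toy model, not TokenSpeed's implementation: the block size, class name, and simulated conversation are all hypothetical, and the cached value is a placeholder where a real engine would keep KV data and recurrent states. Each block of tokens is keyed by a hash of everything up to and including it, so a lookup only hits when the whole prefix matches.

```python
import hashlib

BLOCK = 4  # tokens per cache block (hypothetical value)


def block_keys(tokens, block=BLOCK):
    """One chained hash per full block; each key identifies the entire prefix."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % block
    for i in range(0, full, block):
        h.update(repr(tokens[i:i + block]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.store = {}        # prefix key -> placeholder for cached state
        self.hit_tokens = 0
        self.total_tokens = 0

    def process(self, tokens):
        """Return how many leading tokens were served from the cache."""
        keys = block_keys(tokens)
        reused_blocks = 0
        for key in keys:
            if key not in self.store:
                break
            reused_blocks += 1
        for key in keys[reused_blocks:]:
            self.store[key] = True  # a real engine would store state here
        reused = reused_blocks * BLOCK
        self.hit_tokens += reused
        self.total_tokens += len(tokens)
        return reused

    def hit_rate(self):
        return self.hit_tokens / self.total_tokens


# Simulated agent loop: a 40-token system prompt, then 8 new tokens per turn.
cache = PrefixCache()
history = list(range(40))
reused_per_turn = []
for turn in range(10):
    history += [1000 + turn * 8 + j for j in range(8)]
    reused_per_turn.append(cache.process(history))

print(reused_per_turn, f"{cache.hit_rate():.1%}")
```

Because every turn resends the full history, only the newest tokens miss the cache, and the cumulative hit rate climbs as the session gets longer. That is the pattern that makes multi-turn agent workloads a good fit for this technique.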

How does Qwen3.5 perform with very long context windows?

Qwen3.5 demonstrates high efficiency when handling long contexts up to one million tokens. Benchmarks show that decode throughput drops by only about 16 percent when scaling from 128,000 tokens to one million tokens. This stability is driven by the hybrid architecture, which limits the linear growth in memory-access cost typically seen in pure Transformer models.
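A short calculation puts the 16 percent figure in perspective. The context lengths and the throughput drop come from this article; the linear cost model in the second half is an assumption added for illustration, as is the premise that throughput is inversely proportional to per-token decode time at a fixed batch size.

```python
SHORT_CTX = 128_000      # tokens, from the article
LONG_CTX = 1_000_000     # tokens, from the article
THROUGHPUT_DROP = 0.16   # "about 16 percent", from the article

# How much longer the context gets, and how much slower each token becomes
ctx_ratio = LONG_CTX / SHORT_CTX
time_ratio = 1.0 / (1.0 - THROUGHPUT_DROP)

# Toy model (assumption): time(L) = fixed + per_token_cost * L.
# Then time_ratio = 1 + share * (ctx_ratio - 1), where share is the part of
# decode time at SHORT_CTX that scales with context length.
ctx_share_at_short = (time_ratio - 1.0) / (ctx_ratio - 1.0)

print(f"context grows {ctx_ratio:.2f}x, per-token time grows {time_ratio:.2f}x")
print(f"implied context-dependent share at 128K: {ctx_share_at_short:.1%}")
```

Under this toy model, only a few percent of per-token decode time at 128,000 tokens depends on context length, which is consistent with most layers carrying fixed-size state. The real cost curve need not be linear, so treat the share as an order-of-magnitude reading.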

Keep reading

LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI

LightSeek Foundation released TokenSpeed, an open-source inference engine designed specifically for the long-context and high-throughput demands of AI coding agents. By optimizing kernels for NVIDIA Blackwell hardware, the system achieves higher performance than TensorRT-LLM on agentic benchmarks while maintaining the usability of vLLM.

NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

NVIDIA · Apr 25

NVIDIA reported that DeepSeek-V4-Pro achieves over 150 tokens per second on Blackwell Ultra hardware. This performance level makes 1.6-trillion parameter models viable for real-time autonomous agents. Future software updates like Dynamo and NVFP4 are expected to push these speeds even higher.

Perplexity Benchmarks Blackwell Performance for High Throughput Large Model Inference

Perplexity · May 12

Perplexity published research showing that NVIDIA's GB200 Blackwell architecture nearly halves communication latency for large Mixture-of-Experts models compared to the previous generation. The findings suggest that Blackwell is a primary platform for reducing the cost and latency of serving frontier-scale AI search.

Cursor Releases Warp Decode for 1.84x Faster MoE Inference on Blackwell GPUs

Cursor · Apr 7

Anysphere rebuilt the Mixture of Experts inference path for NVIDIA Blackwell GPUs, achieving 1.84x faster throughput by assigning GPU warps to individual output neurons. This warp decode approach eliminates the data-shuffling overhead typical of expert-centric models while improving output accuracy by 1.4x.
