Fast, faster, Qwen. Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. Dive into the full @PyTorch blog post below! https://t.co/p04wookcZj #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance
Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs
Qwen, the AI lab behind the Qwen series of open-source models, reached 580 tokens per second (tps) for agentic workloads on NVIDIA Blackwell GPUs. The result was obtained with TokenSpeed, an open-source inference engine built for complex multi-step agent tasks, serving the FP8 open weights of Qwen3.5-397B-A17B.
- Peak throughput: 580 tokens per second
- Hardware: NVIDIA Blackwell B200
- Context window: 1M tokens
- Cache hit rate: over 90%
- Model: Qwen3.5-397B-A17B
Agentic AI requires rapid multi-turn reasoning, but massive models often struggle with the latency of the agent loop. Qwen3.5 uses a hybrid architecture that mixes standard attention layers with linear attention layers to reduce the computational cost of long sequences. This approach joins NVIDIA's agent-native inference stack in prioritizing throughput for autonomous, long-running AI sessions.
You can deploy these optimizations via the TokenSpeed runner to handle complex agent patterns with cache hit rates above 90%. The engine maintains performance across long contexts, showing only a small throughput drop when scaling to one million tokens. Native Flash Attention 4 support for Blackwell is in development.
Qwen
@Alibaba_Qwen
Still wondering? A few quick answers below.
TokenSpeed is a high-performance, open-source inference engine developed by the LightSeek Foundation. It is purpose-built to accelerate agentic workloads, which involve complex multi-step tasks and tool calling. The engine uses a native architecture and static compilation to achieve high throughput and low latency, aiming for performance comparable to proprietary solutions like TensorRT-LLM.
The Qwen3.5 model uses a hybrid attention architecture that interleaves standard Transformer attention layers with linear attention layers called Gated Delta Networks. This design maintains strong reasoning capabilities while significantly reducing the computational complexity of long-sequence inference. The TokenSpeed engine complements it by eliminating redundant memory copies and using kernel fusion to keep the GPU saturated.
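To make the idea of a linear attention layer concrete, here is a minimal pure-Python sketch of one commonly cited form of the gated delta rule. It is an assumption that this matches the layers in Qwen3.5, since the text above does not give the exact formulation. The function names and the tiny dimensions are hypothetical. The point it illustrates is that the recurrent state has a fixed size no matter how many tokens have been processed.

```python
# Toy gated delta-rule recurrence (illustrative; not taken from Qwen3.5 or TokenSpeed).
# The state S has shape d_v x d_k and does not grow with sequence length.

def matvec(S, x):
    """Multiply matrix S (a list of rows) by vector x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent update: S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T.

    alpha decays the whole state, beta controls how strongly the value
    currently stored under key k is replaced by the new value v.
    """
    old = matvec(S, k)  # value currently associated with key k
    return [
        [alpha * (S[i][j] - beta * old[i] * k[j]) + beta * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]

d_k, d_v = 2, 3
S = [[0.0] * d_k for _ in range(d_v)]
k = [1.0, 0.0]  # unit-norm key

S = gated_delta_step(S, k, [1.0, 2.0, 3.0], alpha=1.0, beta=1.0)
first_read = matvec(S, k)   # reads back the value written under k

S = gated_delta_step(S, k, [4.0, 5.0, 6.0], alpha=1.0, beta=1.0)
second_read = matvec(S, k)  # the second write replaced the first
```

With `alpha=1.0`, `beta=1.0` and a unit-norm key, the second write fully overwrites the first, and `S` is still a 3 x 2 matrix afterwards. A standard attention layer would instead have appended a new key and value to its cache on each step.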
Yes, the Qwen3.5-397B-A17B model is part of Alibaba's open-source model family. It is a Mixture-of-Experts model that contains 397 billion total parameters but only activates 17 billion parameters per token. This allows the model to deliver frontier-level performance while remaining efficient enough for developers and researchers to self-host on standard high-end GPU infrastructure.
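As a rough back-of-the-envelope check on those figures, the arithmetic below uses only the two parameter counts from the model name plus the fact that FP8 stores one byte per weight. It ignores the KV cache, activations, and any tensors kept in higher precision, so it should be read as a lower bound on memory rather than a sizing guide.

```python
TOTAL_PARAMS = 397e9        # total parameters, from the model name
ACTIVE_PARAMS = 17e9        # parameters activated per token
BYTES_PER_PARAM_FP8 = 1     # FP8 stores one byte per weight

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM_FP8 / 1e9

print(f"Active per token: {active_fraction:.1%}")      # Active per token: 4.3%
print(f"FP8 weight footprint: ~{weight_gb:.0f} GB")    # FP8 weight footprint: ~397 GB
```

Only about 4.3% of the weights take part in each token's forward pass, which is where the compute savings come from, while all of the weights still have to be held in memory.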
Prefix caching is a technique that stores previously processed conversation history so it does not have to be recomputed. For agentic workloads with multi-turn dialogues, TokenSpeed uses a dual-layer cache that stores both standard KV data and recurrent Mamba states. This system achieves hit rates over 90 percent, significantly reducing the time spent on initial processing.
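The mechanism can be sketched with a toy prefix cache. This is not TokenSpeed's implementation: the class name, the block size, and the placeholder entries are all hypothetical. Each entry stands in for the two things the text describes, KV data and a recurrent-state snapshot. A recurrent state cannot be sliced by position the way a KV cache can, so the sketch keeps one snapshot at every block boundary.

```python
import hashlib

BLOCK = 4  # tokens per cached block (hypothetical; real engines use larger blocks)

class ToyPrefixCache:
    """Maps a hash of each block-aligned token prefix to a saved entry."""

    def __init__(self):
        self.entries = {}
        self.hit_tokens = 0
        self.total_tokens = 0

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def process(self, tokens):
        """Return how many leading tokens were reused, then cache this request."""
        reused = 0
        # Find the longest block-aligned prefix that is already stored.
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            if self._key(tokens[:end]) in self.entries:
                reused = end
            else:
                break
        # "Compute" the remainder and store an entry at every block boundary.
        for end in range(reused + BLOCK, len(tokens) + 1, BLOCK):
            # Placeholders for KV tensors and a recurrent-state snapshot.
            self.entries[self._key(tokens[:end])] = {"kv": end, "state": end}
        self.hit_tokens += reused
        self.total_tokens += len(tokens)
        return reused

    def hit_rate(self):
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0

cache = ToyPrefixCache()
history = list(range(40))                   # turn 1: a 40-token prompt
first = cache.process(history)              # nothing cached yet
history = history + list(range(100, 104))   # turn 2 appends 4 new tokens
second = cache.process(history)             # the first 40 tokens are reused
```

In this toy run the second turn reuses 40 of its 44 tokens, so only the 4 new tokens need processing. Long agent sessions that keep appending to the same history are the case where such reuse pays off most.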
Qwen3.5 demonstrates high efficiency when handling long contexts up to one million tokens. Benchmarks show that decode throughput drops by only about 16 percent when scaling from 128,000 tokens to one million tokens. This stability is driven by the hybrid architecture, which limits the linear growth in memory-access costs typically seen in pure Transformer models as the context lengthens.
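To put those two numbers side by side, the short calculation below uses only the figures quoted above. The helper `throughput_at_long_context` is a hypothetical convenience. It should not be applied to the 580 tps headline figure, because the text does not say at which context length that was measured.

```python
SHORT_CONTEXT = 128_000
LONG_CONTEXT = 1_000_000
DECODE_DROP = 0.16  # "about 16 percent", from the text

context_growth = LONG_CONTEXT / SHORT_CONTEXT
retained = 1.0 - DECODE_DROP

def throughput_at_long_context(short_context_tps):
    """Scale a throughput measured at 128K tokens to the 1M-token case."""
    return short_context_tps * retained

print(f"Context grows {context_growth:.1f}x, decode throughput keeps {retained:.0%}")
# Context grows 7.8x, decode throughput keeps 84%
```

A nearly eightfold increase in context costs roughly one sixth of decode throughput.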





