Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
https://t.co/MJzhCEg7m8 https://t.co/anhoETwwS9 https://t.co/BWn4Me62x7
LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI
LightSeek Foundation released TokenSpeed, an open-source inference engine designed specifically for the long-context and high-throughput demands of AI coding agents. By optimizing kernels for NVIDIA Blackwell hardware, the system achieves higher performance than TensorRT-LLM on agentic benchmarks while maintaining the usability of vLLM.
- Throughput gain: 11% higher than TensorRT-LLM
- Latency reduction: Nearly 50% for decode workloads
- Hardware support: NVIDIA Blackwell B200
- License: MIT
- Availability: Performance preview on GitHub
Standard engines are often optimized for general chat, but agentic workloads (tasks where AI agents plan and act independently) require responsiveness across contexts exceeding 50,000 tokens. TokenSpeed targets this gap: on Blackwell hardware it delivers 11% higher throughput than TensorRT-LLM. The effort mirrors the optimizations in Perplexity's ROSE engine and NVIDIA Dynamo.
The source code is available on GitHub and can run models like DeepSeek V4. Its optimized attention kernels are already being adopted by vLLM. Production hardening is planned for next month. The project is a collaboration that includes engineers from NVIDIA, AMD, and Alibaba's Qwen team.