HeadsUpAI

LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI


LightSeek Foundation, a Silver Member of the PyTorch Foundation, introduced TokenSpeed in a performance preview. It is an MIT-licensed inference engine (software that runs trained AI models) built for the unique traffic patterns of autonomous agents. The system uses a compiler-backed modeling layer to automate parallel operations.
Throughput gain: 11% higher than TensorRT-LLM
Latency reduction: Nearly 50% for decode workloads
Hardware support: NVIDIA Blackwell B200
License: MIT
Availability: Performance preview on GitHub

Standard engines are often optimized for general chat, but agentic workloads (tasks where AI agents plan and act independently) must stay responsive across contexts exceeding 50,000 tokens. TokenSpeed targets this gap: on Blackwell hardware it delivers roughly 11% higher throughput than TensorRT-LLM. The approach parallels other inference optimizations such as Perplexity's ROSE engine and NVIDIA Dynamo.

The source code is available on GitHub and can run models such as DeepSeek V4. Its optimized attention kernels are already being adopted by vLLM. Production hardening is planned for next month. The project is a collaboration that includes engineers from NVIDIA, AMD, and Alibaba's Qwen team.

LightSeek Foundation (@lightseekorg) on X:

Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
https://t.co/MJzhCEg7m8 https://t.co/anhoETwwS9 https://t.co/BWn4Me62x7

Still wondering? A few quick answers below.

TokenSpeed is an open-source LLM inference engine designed by the LightSeek Foundation specifically for agentic AI workloads. It aims to provide the high performance of NVIDIA TensorRT-LLM with the ease of use found in vLLM, a popular open-source serving framework. The engine uses a compiler-backed modeling mechanism to optimize how models process large volumes of tokens.

TokenSpeed is designed to match or exceed the performance of TensorRT-LLM on NVIDIA Blackwell hardware for specific agentic tasks. Benchmarks show roughly 11 percent higher throughput than TensorRT-LLM when serving coding agents. The engine relies on optimized kernels for Multi-head Latent Attention (MLA), which nearly halve latency relative to other state-of-the-art engines on typical decode workloads.
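Both headline figures are relative measures, so it may help to see how such percentages would be derived from raw benchmark numbers. The Python sketch below uses hypothetical throughput and latency values (not taken from any published TokenSpeed benchmark), chosen only so the results land near the reported 11 percent and "nearly half". The function names are illustrative as well.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Fractional increase of `new` over `baseline` (used here for throughput)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


def relative_reduction(new: float, baseline: float) -> float:
    """Fractional decrease of `new` relative to `baseline` (used here for latency)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# Hypothetical measurements, for illustration only.
baseline_tokens_per_s = 1000.0   # assumed baseline engine throughput
engine_tokens_per_s = 1110.0     # assumed new engine throughput
baseline_latency_ms = 20.0       # assumed baseline decode latency
engine_latency_ms = 10.4         # assumed new engine decode latency

throughput_gain = relative_gain(engine_tokens_per_s, baseline_tokens_per_s)
latency_cut = relative_reduction(engine_latency_ms, baseline_latency_ms)

print(f"throughput gain: {throughput_gain:.1%}")
print(f"latency reduction: {latency_cut:.1%}")
```

Note that the two numbers measure different things: a throughput gain describes how much more work the whole system completes per second, while a latency reduction describes how much sooner an individual request gets its tokens.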

TokenSpeed is open source: it is released under the MIT license, making it free for both personal and commercial use. The source code is hosted on GitHub by the LightSeek Foundation. The current release is a performance preview intended for benchmarking and technical evaluation, and the team plans to provide production-hardened versions and additional feature updates over the coming month.

TokenSpeed is currently optimized for the NVIDIA Blackwell architecture, specifically the B200 GPU. It includes kernels designed to fully utilize Tensor Cores, which are hardware units built for fast matrix math. The development team is also working on platform optimizations for NVIDIA Hopper and AMD MI350 hardware, with support for these accelerators planned for future updates.

Agentic workloads refer to the specific demands of AI agents that perform multi-step tasks, such as autonomous coding. These workloads typically involve very long context windows, often exceeding 50,000 tokens, and require high tokens-per-second performance to remain responsive. TokenSpeed is built from first principles to handle these long-running, high-concurrency sessions more efficiently than general-purpose inference engines.
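To make the link between tokens-per-second and responsiveness concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a hypothetical coding-agent session of 20 steps that each generate 500 output tokens, and it models decode time only. Prefill over the long context, tool-call time, and scheduling effects are ignored, and the step counts and speeds are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    """One step of a hypothetical agent session."""
    output_tokens: int  # tokens the model generates in this step


def decode_seconds(steps: list[AgentStep], tokens_per_second: float) -> float:
    """Total time spent generating tokens across all steps (decode only)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_tokens = sum(step.output_tokens for step in steps)
    return total_tokens / tokens_per_second


# Hypothetical session: 20 steps, 500 generated tokens per step.
session = [AgentStep(output_tokens=500) for _ in range(20)]

slow = decode_seconds(session, tokens_per_second=50.0)   # assumed per-session speed
fast = decode_seconds(session, tokens_per_second=100.0)  # assumed doubled speed

print(f"at 50 tok/s: {slow:.0f} s, at 100 tok/s: {fast:.0f} s")
```

Because an agent's steps run one after another, per-step generation time adds up across the session, which is why decode speed affects agentic tasks more visibly than it affects a single chat reply.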
