LightSeek Foundation Launches TokenSpeed to Optimize Blackwell for Agentic AI

LightSeek Foundation

May 7, 2026 · Updated May 15, 2026

LightSeek Foundation released TokenSpeed, an open-source inference engine designed specifically for the long-context and high-throughput demands of AI coding agents. By optimizing kernels for NVIDIA Blackwell hardware, the system achieves higher performance than TensorRT-LLM on agentic benchmarks while maintaining the usability of vLLM.

LightSeek Foundation, a Silver Member of the PyTorch Foundation, introduced TokenSpeed in a performance preview. It is an MIT-licensed inference engine (software that runs trained AI models) built for the unique traffic patterns of autonomous agents. The system uses a compiler-backed modeling layer to automate parallel operations.

Throughput gain: 11% higher than TensorRT-LLM
Latency reduction: Nearly 50% for decode workloads
Hardware support: NVIDIA Blackwell B200
License: MIT
Availability: Performance preview on GitHub
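
To make the headline figures concrete, the sketch below applies the reported 11% throughput gain and the roughly 50% decode-latency reduction to a baseline. The baseline values (1,000 tokens per second, 20 ms per token) and the function names are hypothetical placeholders chosen for illustration, not measurements from TokenSpeed or TensorRT-LLM.

```python
# Illustrative arithmetic only: the baselines below are hypothetical, not benchmark data.

THROUGHPUT_GAIN = 0.11      # reported: 11% higher throughput than TensorRT-LLM
LATENCY_REDUCTION = 0.50    # reported: "nearly 50%" lower decode latency (rounded up)


def projected_throughput(baseline_tps: float, gain: float = THROUGHPUT_GAIN) -> float:
    """Tokens per second after applying a relative throughput gain."""
    return baseline_tps * (1.0 + gain)


def projected_latency(baseline_ms: float, reduction: float = LATENCY_REDUCTION) -> float:
    """Per-token latency in milliseconds after applying a relative reduction."""
    return baseline_ms * (1.0 - reduction)


if __name__ == "__main__":
    baseline_tps = 1000.0   # hypothetical baseline throughput
    baseline_ms = 20.0      # hypothetical baseline per-token decode latency
    print(f"throughput: {baseline_tps:.0f} -> {projected_throughput(baseline_tps):.0f} tokens/s")
    print(f"latency:    {baseline_ms:.1f} -> {projected_latency(baseline_ms):.1f} ms/token")
```

The two figures are relative to different baselines in the article (TensorRT-LLM for throughput, other engines on typical decode workloads for latency), so they should not be combined into a single speedup number.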

Standard engines are often optimized for general chat, but agentic workloads (tasks where AI agents plan and act independently) require responsiveness across contexts exceeding 50,000 tokens. TokenSpeed targets this gap, delivering 11% higher throughput than TensorRT-LLM on Blackwell hardware. The effort parallels other Blackwell-focused work, such as Perplexity's ROSE engine and NVIDIA's Dynamo optimizations.

The source code is available on GitHub and can run models such as DeepSeek V4. Its optimized attention kernels are already being adopted by vLLM. Production hardening is planned for next month. The project is a collaboration that includes engineers from NVIDIA, AMD, and Alibaba's Qwen team.

View the full update on lightseek.org

LightSeek Foundation

@lightseekorg · May 6

Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
https://t.co/MJzhCEg7m8 https://t.co/anhoETwwS9 https://t.co/BWn4Me62x7

Still wondering? A few quick answers below.

What is TokenSpeed?

TokenSpeed is an open-source LLM inference engine designed by the LightSeek Foundation specifically for agentic AI workloads. It aims to provide the high performance of NVIDIA TensorRT-LLM with the ease of use found in vLLM, a popular open-source serving framework. The engine uses a compiler-backed modeling mechanism to optimize how models process large volumes of tokens.

How does TokenSpeed compare to NVIDIA TensorRT-LLM?

TokenSpeed is designed to match or exceed the performance of TensorRT-LLM on NVIDIA Blackwell hardware for specific agentic tasks. Benchmarks show it achieves roughly 11 percent higher throughput than TensorRT-LLM when serving coding agents. It achieves this by using optimized kernels for Multi-head Latent Attention, which nearly halves latency relative to other state-of-the-art engines on typical decode workloads.

Is TokenSpeed open source?

Yes, TokenSpeed is released under the MIT license, making it free for both personal and commercial use. The source code is hosted on GitHub by the LightSeek Foundation. While the current release is a performance preview intended for benchmarking and technical evaluation, the team plans to provide production-hardened versions and additional feature updates over the coming month.

What hardware does TokenSpeed support?

TokenSpeed is currently optimized for NVIDIA Blackwell architecture, specifically the B200 GPU. It includes specialized kernels designed to fully utilize Tensor Cores, which are specialized hardware units for fast matrix math. The development team is also working on platform optimizations for NVIDIA Hopper and AMD MI350 hardware, with support for these accelerators planned for future updates.

What are agentic workloads in the context of TokenSpeed?

Agentic workloads refer to the specific demands of AI agents that perform multi-step tasks, such as autonomous coding. These workloads typically involve very long context windows, often exceeding 50,000 tokens, and require high tokens-per-second performance to remain responsive. TokenSpeed is built from first principles to handle these long-running, high-concurrency sessions more efficiently than general-purpose inference engines.
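
As a rough illustration of why decode speed matters for agents, the sketch below estimates how long a multi-step session spends generating tokens. The step count, tokens per step, and decode rates are hypothetical values, and the model deliberately ignores prefill over the long context, tool-call time, and batching effects.

```python
# Back-of-the-envelope sketch; all inputs are hypothetical, not TokenSpeed measurements.

def session_decode_seconds(steps: int, tokens_per_step: int, tokens_per_second: float) -> float:
    """Total time spent generating tokens across sequential agent steps.

    Ignores prefill, tool execution, and queueing, so it is a lower bound
    on wall-clock time for the session.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return steps * tokens_per_step / tokens_per_second


if __name__ == "__main__":
    steps, tokens_per_step = 40, 500          # hypothetical coding-agent session
    for rate in (50.0, 100.0):                # hypothetical per-session decode rates
        secs = session_decode_seconds(steps, tokens_per_step, rate)
        print(f"{rate:.0f} tokens/s -> {secs:.0f} s of decoding")
```

Under these assumptions, doubling the decode rate cuts the time spent generating tokens from 400 seconds to 200 seconds, which is why per-token latency compounds across a long-running agent session.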

Keep reading

NVIDIA Blackwell Ultra Powers DeepSeek V4 Pro at 150 Tokens Per Second

NVIDIA reported that DeepSeek-V4-Pro achieves over 150 tokens per second on Blackwell Ultra hardware. This performance level makes 1.6-trillion parameter models viable for real-time autonomous agents. Future software updates like Dynamo and NVFP4 are expected to push these speeds even higher.

Qwen Sets 580 TPS Record for Agentic Workloads on Blackwell GPUs

Qwen · May 27

Qwen achieved a record 580 tokens per second running its Qwen3.5-397B-A17B model on NVIDIA Blackwell GPUs using the TokenSpeed inference engine. The optimization targets agentic workloads, where multi-turn reasoning and tool-calling typically suffer from high latency. By combining a hybrid attention architecture with deep kernel fusion, the system maintains high throughput even as context scales to one million tokens.

Perplexity Launches ROSE Inference Engine to Optimize Blackwell GPU Performance

Perplexity · May 7

Perplexity developed a custom inference engine called ROSE and a domain-specific language to build specialized GPU kernels for NVIDIA hardware. By moving down the stack, the company can achieve peak performance on Blackwell chips and reduce latency for massive trillion-parameter models.

DeepSeek Launches V4 Preview With 1M Context and Agentic Coding Focus

DeepSeek · Apr 30

DeepSeek released the preview and open weights for DeepSeek-V4, a Mixture-of-Experts model family with a 1.6-trillion-parameter flagship and a 1M-token context window as the default. By introducing sparse attention and dual reasoning modes, the release delivers frontier-level agentic performance at lower compute costs.
