HeadsUpAI

Alibaba Qwen Releases FlashQLA to Speed Up On-Device Agentic AI


Qwen, the AI lab behind Alibaba's open-source models, has released FlashQLA, a library of high-performance kernels for linear attention (a memory-efficient alternative to standard attention). Built with TileLang (a domain-specific language for writing GPU kernels), it is reported to achieve a 2–3x forward speedup and a 2x backward speedup for models using the Gated DeltaNet architecture.
Forward speedup: 2–3x
Backward speedup: 2x
Architecture: Gated DeltaNet (GDN)
Programming language: TileLang
Pipeline stages: 16-stage warp-specialized
Availability: Open-source (GitHub)

This optimization targets a hardware bottleneck for long-running agents on personal devices. Standard attention slows down as context grows, because its cost rises quadratically with sequence length. Linear attention scales linearly, but it needs custom kernels like these to run efficiently on real hardware. The release mirrors Kimi's FlashKDA, signaling a broader industry shift toward making long-context prefill practical for edge-based agentic AI.
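One way to see the scaling difference is to compare what each approach has to keep around per layer while generating tokens. The sketch below is back-of-the-envelope arithmetic in Python; the head counts and dimensions are illustrative assumptions, not figures from the FlashQLA release, and both helper functions are hypothetical.

```python
def kv_cache_bytes(seq_len, num_heads, head_dim, bytes_per_elem=2):
    """Per-layer KV cache for standard attention: grows with sequence length."""
    # Keys and values are each stored as [seq_len, num_heads, head_dim].
    return 2 * seq_len * num_heads * head_dim * bytes_per_elem


def linear_state_bytes(num_heads, key_dim, value_dim, bytes_per_elem=2):
    """Per-layer recurrent state for linear attention: independent of sequence length."""
    # One key_dim x value_dim matrix per head, however long the context is.
    return num_heads * key_dim * value_dim * bytes_per_elem


# Assumed configuration: 8 heads of dimension 128, 16-bit elements.
for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(seq_len, num_heads=8, head_dim=128)
    state = linear_state_bytes(num_heads=8, key_dim=128, value_dim=128)
    print(f"{seq_len:>7} tokens: KV cache {kv:,} B, linear state {state:,} B")
```

Under these assumed dimensions the KV cache grows a hundredfold between the two context lengths while the linear-attention state stays the same size, which is the property that makes the approach attractive on memory-constrained devices.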

The open-source code is available on GitHub for integration into local inference pipelines. According to the team, the backward pass uses a 16-stage warp-specialized pipeline to work within tight on-chip memory constraints. The kernels specifically benefit Qwen 3.5 Small models and other lightweight architectures designed for local deployment.

Qwen (@Alibaba_Qwen) on X:

🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.
⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog: https://t.co/HF6opiR4yf
💻 Code: https://t.co/G3oaf5L1AZ

79 retweets, 772 likes

Still wondering? A few quick answers below.

FlashQLA is a library of high-performance linear attention kernels developed by the Qwen team at Alibaba. It is built using TileLang, a specialized programming language for AI kernels. The library is specifically designed to optimize the Gated DeltaNet architecture, providing a faster alternative to standard attention mechanisms for large language models and autonomous agents.

FlashQLA delivers a 2–3x speedup in the forward pass and a 2x speedup in the backward pass. It achieves this through a hardware-friendly algebraic reformulation, fused warp-specialized kernels, and automatic intra-device context parallelism (the "CP" in the announcement), which raises SM utilization on the GPU. The gains are most pronounced for small models, tensor-parallel setups, and long-context workloads.
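Reading "CP" in the announcement as context parallelism (splitting one long sequence into chunks that are processed concurrently), the reason this is possible for linear attention can be sketched in a few lines of Python. The example uses a simplified gated recurrence, S_t = alpha_t * S_{t-1} + v_t k_t^T, which omits Gated DeltaNet's delta-rule correction. It is an illustrative assumption about the general idea, not FlashQLA's actual algorithm, and all function names are hypothetical.

```python
def summarize_chunk(ks, vs, alphas):
    """Scan one chunk from a zero state; return (total_decay, local_state)."""
    d_v, d_k = len(vs[0]), len(ks[0])
    total_decay = 1.0
    local = [[0.0] * d_k for _ in range(d_v)]
    for k, v, alpha in zip(ks, vs, alphas):
        total_decay *= alpha
        # Decay the running state, then add the outer product v k^T.
        local = [
            [alpha * local[i][j] + v[i] * k[j] for j in range(d_k)]
            for i in range(d_v)
        ]
    return total_decay, local


def apply_summary(state_in, summary):
    """Carry an incoming state across a chunk using only that chunk's summary."""
    total_decay, local = summary
    return [
        [total_decay * s + l for s, l in zip(row_in, row_local)]
        for row_in, row_local in zip(state_in, local)
    ]


def chunked_final_state(ks, vs, alphas, chunk_size):
    """Final state of the recurrence, computed chunk by chunk."""
    # Step 1: chunk summaries do not depend on each other,
    # so on real hardware they could be computed concurrently.
    summaries = [
        summarize_chunk(ks[i:i + chunk_size], vs[i:i + chunk_size], alphas[i:i + chunk_size])
        for i in range(0, len(ks), chunk_size)
    ]
    # Step 2: a short sequential pass stitches the chunks together.
    d_v, d_k = len(vs[0]), len(ks[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    for summary in summaries:
        state = apply_summary(state, summary)
    return state
```

Because each chunk's effect on the state reduces to a decay factor plus a local contribution, the expensive per-token work can be spread across chunks and only the cheap stitching step remains sequential. Calling `chunked_final_state(ks, vs, alphas, chunk_size)` gives the same final state as scanning the whole sequence with `summarize_chunk`.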

Alibaba has made FlashQLA available to the community as an open-source project. Developers can access the source code and implementation details through the official QwenLM GitHub repository, so researchers and engineers can integrate the kernels into their own local inference pipelines or use them to build more efficient AI systems on consumer hardware.

Gated DeltaNet is a linear-time attention mechanism, an alternative to standard softmax attention, whose cost grows quadratically with sequence length. FlashQLA provides the optimized kernels needed to run this architecture efficiently. Rather than fusing the whole GDN flow into a single kernel, it splits the work into two kernels, a trade-off the team says performs better on edge devices and long-context workloads, where standard attention becomes a bottleneck.
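For readers who want to see the recurrence that such kernels accelerate, here is a minimal pure-Python reference of the gated delta rule as described in the Gated DeltaNet paper: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, with output o_t = S_t q_t. It is an unoptimized, single-head sketch for illustration only. The function names are hypothetical, and this is not FlashQLA's API or kernel code.

```python
def gated_delta_step(state, q, k, v, alpha, beta):
    """Advance the d_v x d_k state matrix by one token; return (new_state, output)."""
    d_v, d_k = len(v), len(k)
    # Gate: decay the whole memory by alpha (0 < alpha <= 1).
    decayed = [[alpha * state[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed memory currently returns for key k.
    predicted = [sum(decayed[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: correct the memory toward v with step size beta.
    new_state = [
        [decayed[i][j] + beta * (v[i] - predicted[i]) * k[j] for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query.
    output = [sum(new_state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_state, output


def gated_deltanet(qs, ks, vs, alphas, betas):
    """Run the recurrence over a sequence; the state size never depends on its length."""
    d_v, d_k = len(vs[0]), len(ks[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, alpha, beta in zip(qs, ks, vs, alphas, betas):
        state, out = gated_delta_step(state, q, k, v, alpha, beta)
        outputs.append(out)
    return outputs, state
```

With a unit-norm key and beta set to 1, writing a second value under the same key replaces the first one rather than adding to it, which is the behavior that distinguishes the delta rule from plain linear attention. The token-by-token loop here is exactly what is slow on a GPU, which is why optimized kernels matter for this architecture.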

FlashQLA is purpose-built for agentic AI running on personal hardware, where memory and compute are limited. For the backward pass, the team implemented a 16-stage warp-specialized pipeline to cope with extremely tight on-chip memory constraints. The design is intended to let AI agents process long contexts and perform multi-step reasoning locally, without relying on expensive or high-latency cloud infrastructure.
