🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.
⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog: https://t.co/HF6opiR4yf
💻 Code: https://t.co/G3oaf5L1AZ
Alibaba Qwen Releases FlashQLA to Speed Up On-Device Agentic AI
Alibaba's Qwen team launched FlashQLA, a library of high-performance linear attention kernels that deliver up to a 3x speedup for on-device models. By optimizing the Gated DeltaNet architecture for personal hardware, the release makes long-context agentic workflows more viable on consumer electronics.
Built on TileLang (a specialized language for hardware kernels), it achieves a 2–3x forward speedup and a 2x backward speedup for models using the Gated DeltaNet architecture.
- Forward speedup: 2–3x
- Backward speedup: 2x
- Architecture: Gated DeltaNet (GDN)
- Programming language: TileLang
- Pipeline stages: 16-stage warp-specialized
- Availability: Open-source (GitHub)
This optimization addresses a hardware bottleneck for long-running agents on personal devices. Standard softmax attention does work that grows quadratically with context length, so it slows down as the context grows. Linear attention scales linearly with sequence length and carries a fixed-size state, but it needs custom kernels such as these to run efficiently on real hardware. The release mirrors Kimi's FlashKDA, signaling a broader industry shift toward making long-context prefill practical for edge-based agentic AI.
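To make the linear attention idea concrete, here is a minimal reference sketch of a gated delta rule recurrence for a single head, in plain Python. It assumes the update from the published Gated DeltaNet formulation, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. The function names are hypothetical, and this is not FlashQLA's kernel code or API. It only shows that each token touches a fixed-size state rather than the whole history.

```python
# Reference (unoptimized) gated delta rule recurrence for one attention head.
# Assumption: the update follows the published Gated DeltaNet formulation
#   S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
#   o_t = S_t q_t
# Function names are hypothetical; this is not FlashQLA's code or API.

def gated_delta_step(S, q, k, v, alpha, beta):
    """Advance the state S (d_v rows of length d_k) by one token.

    Returns the new state and the output vector for this token.
    """
    S_new = []
    for i, row in enumerate(S):
        # (S k)_i: the value component the memory currently stores under key k.
        pred = sum(s * kj for s, kj in zip(row, k))
        # Erase the old association along k, decay by alpha, write the new value.
        S_new.append([
            alpha * (s - beta * pred * kj) + beta * v[i] * kj
            for s, kj in zip(row, k)
        ])
    # Read out with the query: o = S_new q.
    o = [sum(s * qj for s, qj in zip(row, q)) for row in S_new]
    return S_new, o


def gated_delta_attention(qs, ks, vs, alphas, betas):
    """Run the recurrence over a sequence; cost per token is O(d_v * d_k)."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_k for _ in range(d_v)]  # fixed-size state, independent of length
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return outputs
```

With a unit-norm key, `alpha = 1`, and `beta = 1`, writing two different values under the same key and then querying with that key returns the second value, because the delta rule overwrites the stored association instead of accumulating it. A production kernel would process the sequence in chunks on the GPU rather than token by token.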
The code is open source on GitHub and can be integrated into local inference pipelines. The backward pass uses a 16-stage warp-specialized pipeline to work within tight on-chip memory constraints. The release specifically benefits Qwen 3.5 Small models and other lightweight architectures designed for local deployment.