Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

2–3× forward speedup. 2× backward speedup. Purpose-built for agentic AI on your personal devices.

Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community!

Learn more:
Blog: https://t.co/HF6opiR4yf
Code: https://t.co/G3oaf5L1AZ
Alibaba Qwen Releases FlashQLA to Speed Up On-Device Agentic AI
Built on TileLang (a specialized language for hardware kernels), FlashQLA achieves a 2–3x forward speedup and a 2x backward speedup for models using the Gated DeltaNet architecture.
- Forward speedup: 2–3x
- Backward speedup: 2x
- Architecture: Gated DeltaNet (GDN)
- Programming language: TileLang
- Pipeline stages: 16-stage warp-specialized
- Availability: Open-source (GitHub)
This optimization addresses the hardware bottleneck for long-running agents on personal devices. Standard attention slows down as context grows because its cost scales quadratically with sequence length. Linear attention scales linearly, but it needs custom kernels like these to run efficiently on real hardware. The release mirrors Kimi's FlashKDA, signaling a broader industry shift toward making long-context prefill practical for edge-based agentic AI.
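To make the contrast with standard attention concrete, here is a minimal reference sketch of a gated delta-rule recurrence of the kind Gated DeltaNet is generally described as using. It is plain Python for readability, not FlashQLA's kernel code. The function names and the symbols `alpha` (decay gate) and `beta` (write strength) are illustrative assumptions and do not come from the announcement. The point is that each token updates a fixed-size state matrix, so per-token cost does not grow with context length.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step (illustrative, not FlashQLA's API).

    S is a d_v x d_k state matrix stored as a list of rows.
    Update: S_new = alpha * S (I - beta * k k^T) + beta * v k^T
    Output: o = S_new q
    """
    d_v, d_k = len(S), len(k)
    new_S = []
    for i in range(d_v):
        # What the current state row "recalls" for this key.
        s_dot_k = sum(S[i][j] * k[j] for j in range(d_k))
        # Erase the old value along k, decay by alpha, then write the new value.
        new_S.append([
            alpha * (S[i][j] - beta * s_dot_k * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ])
    out = [sum(new_S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_S, out


def gated_delta_attention(qs, ks, vs, alphas, betas):
    """Run the recurrence over a sequence; the state size never changes."""
    d_v, d_k = len(vs[0]), len(ks[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outputs.append(o)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.0]]
    values = [[2.0, 3.0], [5.0, 7.0]]
    # Writing a second value under the same key overwrites the first.
    print(gated_delta_attention(keys, keys, values, [1.0, 1.0], [1.0, 1.0]))
```

A token-by-token loop like this is easy to read but maps poorly onto GPU hardware, which is why fused, chunked kernels such as the ones described here matter in practice.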
You can access the open-source code on GitHub to integrate these kernels into local inference pipelines. The implementation uses a 16-stage warp-specialized pipeline to manage tight on-chip memory constraints, specifically benefiting Qwen 3.5 Small models and other lightweight architectures designed for local deployment.
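The announcement lists intra-card context parallelism (CP) as a key insight but does not describe the mechanism here. The sketch below shows only the generic property that makes sequence-level parallelism possible for gated linear recurrences. It uses a deliberately simplified scalar recurrence, `h_t = a_t * h_{t-1} + x_t`, as an assumption for illustration. In Gated DeltaNet the state and its transition are matrix-valued, and FlashQLA's actual scheme may differ. All function names are hypothetical.

```python
def sequential_scan(gates, inputs):
    """Baseline: process the whole sequence one step at a time."""
    h, out = 0.0, []
    for a, x in zip(gates, inputs):
        h = a * h + x
        out.append(h)
    return out


def chunk_local(gates, inputs):
    """Process one chunk from a zero state.

    Also returns the running product of gates, which tells us how much
    of any incoming state would survive at each position in the chunk.
    """
    h, decay = 0.0, 1.0
    local, decays = [], []
    for a, x in zip(gates, inputs):
        h = a * h + x
        decay *= a
        local.append(h)
        decays.append(decay)
    return local, decays


def chunked_scan(gates, inputs, chunk_size):
    # Phase 1: every chunk is independent, so this work could run in parallel.
    chunks = [
        chunk_local(gates[i:i + chunk_size], inputs[i:i + chunk_size])
        for i in range(0, len(inputs), chunk_size)
    ]
    # Phase 2: a short sequential pass that carries state across chunk boundaries.
    out, carry = [], 0.0
    for local, decays in chunks:
        out.extend(l + d * carry for l, d in zip(local, decays))
        carry = out[-1]
    return out


if __name__ == "__main__":
    gates = [0.5, 0.9, 1.0, 0.25, 0.8]
    inputs = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(sequential_scan(gates, inputs))
    print(chunked_scan(gates, inputs, chunk_size=2))
```

Both functions produce the same values up to floating-point rounding. The chunked version trades a small amount of extra bookkeeping for per-chunk work that no longer depends on earlier chunks, which is the general shape of the trade-off the announcement describes for long-context workloads.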