Alibaba Qwen Releases FlashQLA to Speed Up On-Device Agentic AI

Qwen

Apr 30, 2026 · Updated May 8, 2026

Alibaba's Qwen team launched FlashQLA, a library of high-performance linear attention kernels that delivers up to a 3x forward-pass speedup for models built on the Gated DeltaNet architecture. By tuning those kernels for personal hardware, the release makes long-context agentic workflows more viable on consumer devices.

Qwen, the AI lab behind Alibaba's open-source models, released FlashQLA — a library of high-performance kernels for linear attention (a memory-efficient alternative to standard attention). Built using TileLang (a specialized language for hardware kernels), it achieves a 2–3x forward speedup and a 2x backward speedup for models using the Gated DeltaNet architecture.

Forward speedup: 2–3x
Backward speedup: 2x
Architecture: Gated DeltaNet (GDN)
Programming language: TileLang
Backward-pass pipeline: 16-stage, warp-specialized
Availability: Open-source (GitHub)

This optimization targets a hardware bottleneck for long-running agents on personal devices. The cost of standard softmax attention grows quadratically with context length, while linear attention scales linearly. In practice, linear attention only delivers that advantage when custom kernels map it efficiently onto the hardware. The release mirrors Kimi's FlashKDA, signaling a broader industry shift toward making long-context prefill practical for edge-based agentic AI.
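As a rough illustration of the scaling argument, the sketch below compares the memory a softmax-attention KV cache needs with the fixed-size state a linear attention layer keeps. The model shape (24 layers, 8 heads, head dimension 128, 2-byte elements) and the function names are assumptions chosen for illustration, not figures from Qwen or FlashQLA.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Softmax attention keeps one key and one value vector per token,
    # per layer, per KV head, so the cache grows with sequence length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def linear_state_bytes(n_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    # Linear attention keeps one (value_dim x key_dim) state matrix per head,
    # per layer. Its size does not depend on sequence length.
    return n_layers * n_heads * key_dim * value_dim * bytes_per_elem


def report(seq_lens, n_layers=24, n_heads=8, head_dim=128):
    # Hypothetical small-model shape, for illustration only.
    state = linear_state_bytes(n_layers, n_heads, head_dim, head_dim)
    rows = []
    for n in seq_lens:
        cache = kv_cache_bytes(n, n_layers, n_heads, head_dim)
        rows.append((n, cache / 2**20, state / 2**20))  # sizes in MiB
    return rows


if __name__ == "__main__":
    for n, cache_mib, state_mib in report([4096, 32768, 262144]):
        print(f"{n:>7} tokens: KV cache {cache_mib:8.0f} MiB, "
              f"linear state {state_mib:4.0f} MiB")
```

Under these assumed shapes the cache grows in direct proportion to context length while the linear state stays put, which is why decoding with a long context puts far less pressure on memory bandwidth in the linear case.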

The open-source code is available on GitHub for integration into local inference pipelines. According to the Qwen team, the backward pass uses a 16-stage warp-specialized pipeline to work within tight on-chip memory constraints. This particularly benefits Qwen 3.5 Small models and other lightweight architectures designed for local deployment.

View the full update on qwen.ai

Qwen (@Alibaba_Qwen) · Apr 29

🚀 Introducing FlashQLA: high-performance linear attention kernels built on TileLang.

⚡ 2–3× forward speedup. 2× backward speedup.
💻 Purpose-built for agentic AI on your personal devices.

💡 Key insights:
1. Gate-driven automatic intra-card CP.
2. Hardware-friendly algebraic reformulation.
3. TileLang fused warp-specialized kernels.

FlashQLA boosts SM utilization via automatic intra-device CP. The gains are especially pronounced for TP setups, small models, and long-context workloads.

Instead of fusing the entire GDN flow into a single kernel, we split it into two kernels optimized for CP and backward efficiency. At large batch sizes this incurs extra memory I/O overhead vs. a fully fused approach, but it delivers better real-world performance on edge devices and long-context workloads.

The backward pass was the hardest part: we built a 16-stage warp-specialized pipeline under extremely tight on-chip memory constraints, ultimately achieving 2×+ kernel-level speedups.

We hope this is useful to the community! 🫶🫶

Learn more:
📖 Blog: https://t.co/HF6opiR4yf
💻 Code: https://t.co/G3oaf5L1AZ
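The post does not explain how the intra-device CP works, and reading "CP" as context parallelism is itself an assumption. As a generic illustration of why a gated linear recurrence can be split across sequence chunks at all, the sketch below uses a scalar recurrence `s_t = a_t * s_{t-1} + b_t`. All function names are hypothetical, and this is not FlashQLA's algorithm: each chunk is first summarized independently, a short pass stitches the chunk boundaries together, and each chunk is then replayed from its own starting state.

```python
def sequential_scan(gates, inputs, s0=0.0):
    # Reference: walk the whole sequence one step at a time.
    s, out = s0, []
    for a, b in zip(gates, inputs):
        s = a * s + b
        out.append(s)
    return out


def chunk_summary(gates, inputs):
    # Summarize a chunk as (decay, local) so that, for any incoming state,
    # state_out = decay * state_in + local.
    decay, local = 1.0, 0.0
    for a, b in zip(gates, inputs):
        decay *= a
        local = a * local + b
    return decay, local


def chunked_scan(gates, inputs, chunk_size):
    chunks = [(gates[i:i + chunk_size], inputs[i:i + chunk_size])
              for i in range(0, len(gates), chunk_size)]
    # Phase 1: summaries do not depend on each other, so in a real kernel
    # they could be computed in parallel.
    summaries = [chunk_summary(g, x) for g, x in chunks]
    # Phase 2: a short sequential pass over chunk boundaries only.
    starts, s = [], 0.0
    for decay, local in summaries:
        starts.append(s)
        s = decay * s + local
    # Phase 3: every chunk replays independently from its start state.
    out = []
    for (g, x), s0 in zip(chunks, starts):
        out.extend(sequential_scan(g, x, s0))
    return out
```

The chunked version returns the same states as the step-by-step scan, while only the boundary pass in phase 2 is inherently serial. Softmax attention has no equivalent fixed-size chunk summary, which is part of why linear recurrences lend themselves to this kind of splitting.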


Still wondering? A few quick answers below.

What is FlashQLA?

FlashQLA is a library of high-performance linear attention kernels developed by the Qwen team at Alibaba. It is built using TileLang, a specialized programming language for AI kernels. The library is specifically designed to optimize the Gated DeltaNet architecture, providing a faster alternative to standard attention mechanisms for large language models and autonomous agents.

How does FlashQLA improve AI performance?

FlashQLA delivers a 2–3x speedup in forward passes and a 2x speedup in backward passes. It achieves this through a hardware-friendly algebraic reformulation and automatic intra-device context parallelism (CP), which the announcement credits with boosting SM utilization. The gains are most pronounced for small models, tensor-parallel setups, and long-context workloads.

Is FlashQLA open source?

Yes, Alibaba has made FlashQLA available to the community as an open-source project. Developers can access the source code and implementation details through the official QwenLM GitHub repository. This allows researchers and engineers to integrate these high-performance kernels into their own local inference pipelines or use them to build more efficient AI systems on consumer hardware.

What is Gated DeltaNet in the context of FlashQLA?

Gated DeltaNet (GDN) is a linear-time attention mechanism that serves as an alternative to standard softmax attention, whose cost grows quadratically with sequence length. FlashQLA provides the optimized kernels needed to run this architecture efficiently. Rather than fusing the entire GDN computation into a single kernel, it splits the work into two kernels, a trade-off the Qwen team says performs better on edge devices and long-context workloads, where standard attention often becomes a bottleneck.
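For readers who want to see what "linear-time" means here, the following is a minimal reference sketch of the gated delta rule as it is commonly written in the Gated DeltaNet literature: `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T`, with output `o_t = S_t q_t`. It is plain Python for a single head, assumes unit-norm keys, and uses hypothetical function names. It says nothing about how FlashQLA's kernels are organized internally.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    # S is a (d_v x d_k) state matrix stored as a list of rows.
    d_v, d_k = len(S), len(S[0])
    # Gate: decay the whole state by alpha (0 forgets, 1 keeps everything).
    S = [[alpha * S[i][j] for j in range(d_k)] for i in range(d_v)]
    # What the decayed state currently returns for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Delta rule: move the stored value for k toward v by a factor beta.
    S = [[S[i][j] + beta * (v[i] - pred[i]) * k[j] for j in range(d_k)]
         for i in range(d_v)]
    # Read out with the query.
    out = [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S, out


def gated_delta_net(qs, ks, vs, alphas, betas, d_v, d_k):
    # One fixed-size state is carried across the sequence, so the work per
    # token is constant and the total cost is linear in sequence length.
    S = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outs.append(o)
    return outs
```

Writing the same key twice with `beta = 1` replaces the stored value instead of accumulating it, which is the behavior that distinguishes the delta rule from a purely additive linear attention update.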

Why is FlashQLA optimized for personal devices?

FlashQLA is purpose-built for agentic AI running on personal hardware, where memory and compute resources are limited. For the backward pass, the team implemented a 16-stage warp-specialized pipeline to work within extremely tight on-chip memory constraints. The design is intended to let AI agents process long contexts and perform multi-step reasoning locally, without relying on expensive or high-latency cloud infrastructure.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

