Kimi Open Sources FlashKDA to Double Long Context Prefill Speeds

LLM
AI Research
Performance

Kimi, an AI company building models with long-context capabilities, open-sourced FlashKDA. It is a high-performance implementation of Kimi Delta Attention (KDA) kernels—a linear attention mechanism (an architecture that processes text more efficiently) built using the CUTLASS library.

As Kimi shifts toward linear attention architectures, hardware-level kernel optimization has become the primary bottleneck for speed. FlashKDA achieves a 1.72x to 2.22x speedup in the prefill phase on NVIDIA H20 GPUs. This follows Kimi's recent efforts to slash long-context costs.

You can use FlashKDA as a drop-in backend for the flash-linear-attention library to improve performance. The code is available on GitHub, providing the CUDA kernels needed to accelerate inference. This enables faster execution for long-horizon coding agents that process massive codebases.


Frequently asked questions

What is FlashKDA?
FlashKDA is an open-source implementation of high-performance kernels for Kimi Delta Attention, a linear attention mechanism developed by Kimi. It is built using the CUTLASS library to optimize how AI models process long sequences of text. By improving the efficiency of attention calculations, it helps models handle massive context windows more effectively.
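To see why linear attention scales well to long contexts, consider a minimal sketch (not Kimi's KDA implementation): plain causal linear attention can be computed either as a quadratic masked score matrix or as an O(n) recurrence over a fixed-size state matrix. The recurrent form is what kernels like FlashKDA accelerate during prefill.

```python
import numpy as np

def linear_attention_recurrent(q, k, v):
    """O(n) recurrence: carry a (d_k x d_v) state matrix S."""
    n, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(k[t], v[t])  # accumulate key-value outer products
        out[t] = q[t] @ S          # read out with the current query
    return out

def linear_attention_quadratic(q, k, v):
    """Equivalent O(n^2) form: causally masked raw q.k scores (no softmax)."""
    scores = q @ k.T
    mask = np.tril(np.ones_like(scores))  # causal mask
    return (scores * mask) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
assert np.allclose(linear_attention_recurrent(q, k, v),
                   linear_attention_quadratic(q, k, v))
```

Because the recurrent state has a fixed size regardless of sequence length, memory stays constant as the context grows, which is why this family of mechanisms suits massive context windows.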
How much faster is FlashKDA compared to other linear attention tools?
FlashKDA achieves a significant performance boost during the prefill phase, which is the initial processing of input data. According to Kimi, it delivers a 1.72x to 2.22x speedup compared to the standard flash-linear-attention baseline. These performance gains were specifically measured on NVIDIA H20 GPUs, which are specialized hardware for AI workloads.
Is FlashKDA open source and where can I find it?
Yes, Kimi has officially open-sourced FlashKDA. The implementation is available for public exploration and use on GitHub under the MoonshotAI organization. Developers and researchers can access the repository to integrate these high-performance kernels into their own AI projects or to study the underlying implementation of Kimi Delta Attention.
Can I use FlashKDA as a replacement for existing linear attention libraries?
FlashKDA is designed to work as a drop-in backend for the flash-linear-attention library. This means developers already using that library can swap in FlashKDA to gain performance improvements without needing to rewrite their entire codebase. It specifically targets the Kimi Delta Attention kernels to provide a more efficient way to execute linear attention operations.
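As background on what those kernels compute, here is a generic delta-rule linear attention recurrence in the style of DeltaNet. This is a simplified sketch of the family Kimi Delta Attention belongs to, not Kimi's exact KDA formulation (which adds finer-grained gating); it illustrates the per-token state update that the optimized kernels execute in parallel.

```python
import numpy as np

def delta_rule_attention(q, k, v, beta):
    """Generic delta-rule recurrence (illustrative, not the KDA kernel).

    At each step, the state S is corrected toward storing v[t] under
    key k[t], with beta[t] controlling the write strength.
    """
    n, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((n, d_v))
    for t in range(n):
        pred = k[t] @ S  # value the current state predicts for k[t]
        S += np.outer(k[t], beta[t] * (v[t] - pred))  # delta-rule write
        out[t] = q[t] @ S
    return out

# With orthonormal one-hot keys and beta = 1, each write stores v[t]
# exactly, so querying with the same key retrieves it.
k = np.eye(4)
v = np.random.default_rng(1).standard_normal((4, 3))
out = delta_rule_attention(k, k, v, np.ones(4))
assert np.allclose(out, v)
```

The delta-rule correction term is what distinguishes this family from plain linear attention: instead of only accumulating key-value products, it overwrites stale associations, which is harder to parallelize and is where hand-tuned CUTLASS kernels pay off.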