Kimi Team Releases AttnRes to Replace Fixed Residual Connections in Transformers

Kimi

Mar 18, 2026 · Updated Apr 25, 2026

Kimi Team published Attention Residuals (AttnRes), which replaces the standard fixed residual connections in transformers with learned attention over preceding layer outputs. The memory-efficient variant, Block AttnRes, matches a baseline trained with 1.25x more compute, and AttnRes improves results on all evaluated benchmarks.

Attention Residuals (AttnRes), from Kimi Team, replaces the fixed-weight accumulation of standard residual connections with softmax attention over all preceding layer outputs. Each layer learns a single pseudo-query that it uses to selectively aggregate earlier representations. The memory-efficient Block AttnRes groups layers into blocks and applies attention only at block boundaries, reducing memory from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks (around 8), and d the hidden dimension. Integrated into Kimi Linear (48B total / 3B activated, trained on 1.4T tokens), AttnRes improves results on all evaluated benchmarks.
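To make the mechanism concrete, here is a minimal NumPy sketch of the idea as described above. It is not the Kimi Team implementation: the function names, the RMS normalization of the stacked outputs, the zero-initialized queries, the toy tanh layers, and the exact placement of the block boundaries are all assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each vector to roughly unit root-mean-square (assumed normalization).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def attn_residual(prev_outputs, pseudo_query):
    """Mix earlier outputs with softmax weights derived from one learned query.

    prev_outputs: list of n arrays, each of shape (tokens, d)
    pseudo_query: array of shape (d,), one per layer (or per block boundary)
    """
    stack = np.stack(prev_outputs)                      # (n, tokens, d)
    scores = rms_norm(stack) @ pseudo_query             # (n, tokens)
    scores = scores - scores.max(axis=0, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over depth
    mixed = (weights[..., None] * stack).sum(axis=0)    # (tokens, d)
    return mixed, weights

def block_attnres_forward(x, layer_fns, block_queries, block_size):
    """Toy block variant: plain residuals inside a block, attention at boundaries."""
    block_reps = [x]  # the embedding plus one entry per finished block
    h = x
    for i, f in enumerate(layer_fns):
        if i % block_size == 0:
            # Block boundary: attend over the stored block representations.
            h, _ = attn_residual(block_reps, block_queries[i // block_size])
        h = h + f(h)  # standard residual step inside the block
        if (i + 1) % block_size == 0:
            block_reps.append(h)
    return h, block_reps

# Hypothetical toy setup: 8 layers in 2 blocks of 4.
rng = np.random.default_rng(0)
tokens, d, num_layers, block_size = 4, 16, 8, 4
x = rng.standard_normal((tokens, d))
mats = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(num_layers)]
layer_fns = [lambda h, W=W: np.tanh(h @ W) for W in mats]
block_queries = [np.zeros(d) for _ in range(num_layers // block_size)]

out, block_reps = block_attnres_forward(x, layer_fns, block_queries, block_size)
print(out.shape, len(block_reps))
```

In this toy, the zero queries give uniform weights, so each boundary mix is a plain average of the stored representations. The stored list holds N + 1 entries (here 3) instead of one per layer, which is the intuition behind the O(Ld) to O(Nd) reduction.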

Standard residuals with PreNorm let hidden-state magnitudes grow unboundedly with depth, which dilutes each layer's contribution. AttnRes addresses this: its training dynamics show a more uniform gradient distribution and bounded output magnitudes. In compute terms, Block AttnRes matches a baseline trained with 1.25x more compute.
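The magnitude argument can be illustrated with a small numerical sketch. This is not an experiment from the paper: random vectors stand in for layer outputs, and the softmax logits are arbitrary. It only shows that a running sum grows with depth, while a softmax-weighted mix is a convex combination and so cannot exceed the largest input magnitude.

```python
import numpy as np

def rms(x):
    # Root-mean-square magnitude of a vector.
    return float(np.sqrt(np.mean(x * x)))

rng = np.random.default_rng(0)
d, depth = 256, 64

# Hypothetical stand-ins for per-layer outputs, each with roughly unit RMS.
layer_outputs = [rng.standard_normal(d) for _ in range(depth)]

# Standard residual stream: every layer output is added with fixed weight 1.
running = np.zeros(d)
sum_rms = []
for v in layer_outputs:
    running = running + v
    sum_rms.append(rms(running))

# Attention-style aggregation: softmax weights are non-negative and sum to 1.
logits = rng.standard_normal(depth)
weights = np.exp(logits - logits.max())
weights = weights / weights.sum()
mixed = sum(w * v for w, v in zip(weights, layer_outputs))

mix_rms = rms(mixed)
max_input_rms = max(rms(v) for v in layer_outputs)
print(f"sum RMS at depth 1: {sum_rms[0]:.2f}, at depth {depth}: {sum_rms[-1]:.2f}")
print(f"softmax-mix RMS: {mix_rms:.2f} (largest input RMS: {max_input_rms:.2f})")
```

Because the softmax weights form a convex combination, the bound on the mixed output holds for any choice of logits, not just the random ones used here.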

If you train transformers at scale, Block AttnRes is presented as a drop-in replacement for standard residuals. The paper and accompanying code provide the implementation details needed to evaluate it on your own runs.

View the full update on arxiv.org

Kimi.ai

@Kimi_Moonshot · Mar 16

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with https://t.co/gcWyzhZVc0


