Kimi Team Releases AttnRes to Replace Fixed Residual Connections in Transformers


Kimi Team published Attention Residuals (AttnRes), which replaces the standard fixed residual connections in transformers with learned attention over preceding layer outputs. The memory-efficient Block AttnRes variant matches a baseline trained with 1.25x more compute and improves results on all evaluated benchmarks.

Attention Residuals (AttnRes), from Kimi Team, replaces the fixed-weight accumulation of standard residual connections with softmax attention over all preceding layer outputs. Each layer learns a single pseudo-query that it uses to selectively aggregate earlier representations. The memory-efficient variant, Block AttnRes, groups layers into blocks and applies attention only at block boundaries, reducing memory from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks (around 8), and d the hidden dimension. Integrated into Kimi Linear (48B total / 3B activated parameters, trained on 1.4T tokens), AttnRes improves results on all evaluated benchmarks.
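To make the mechanism concrete, here is a minimal sketch in plain Python of the two aggregation rules described above. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the score is a raw dot product between the pseudo-query and each earlier output (the actual method may normalize or project these), and a block's summary is assumed to be the plain sum of its layers' outputs.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standard_residual(layer_outputs):
    """Fixed accumulation: every earlier output is added with weight 1."""
    d = len(layer_outputs[0])
    return [sum(v[i] for v in layer_outputs) for i in range(d)]

def attn_residual(layer_outputs, pseudo_query):
    """Softmax-weighted mix of earlier outputs, scored by one learned vector.

    Assumption: raw dot-product scores; hypothetical simplification.
    """
    weights = softmax([dot(pseudo_query, v) for v in layer_outputs])
    d = len(layer_outputs[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, layer_outputs))
             for i in range(d)]
    return mixed, weights

def block_summaries(layer_outputs, num_blocks):
    """Collapse L layer outputs into N block summaries (assumed: plain sums)."""
    if len(layer_outputs) % num_blocks != 0:
        raise ValueError("number of layers must be divisible by num_blocks")
    size = len(layer_outputs) // num_blocks
    return [standard_residual(layer_outputs[i:i + size])
            for i in range(0, len(layer_outputs), size)]

# Three toy layer outputs with hidden size 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fixed = standard_residual(outputs)                   # uniform sum of all layers
mixed, weights = attn_residual(outputs, [10.0, 0.0])  # favors layers 1 and 3
```

In this sketch, attending over `block_summaries(...)` instead of the raw per-layer outputs means the attention step reads N vectors rather than L, which is the source of the O(Ld) to O(Nd) reduction mentioned above.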

Standard residuals with PreNorm cause hidden-state magnitudes to grow unboundedly with depth, diluting each layer's contribution. AttnRes addresses this: its training dynamics show a more uniform gradient distribution and bounded output magnitudes. Block AttnRes matches a baseline trained with 1.25x more compute.
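The magnitude argument can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a deliberately simple worst case in which every layer emits the same unit vector, and all function names are hypothetical. Under fixed accumulation the hidden-state norm grows with depth and the newest layer's share of the stream shrinks, while a softmax-weighted mix is a convex combination and so cannot exceed the largest input norm.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def running_sum_norms(layer_outputs):
    """Norm of the hidden state after each layer when outputs are simply added."""
    state = [0.0] * len(layer_outputs[0])
    norms = []
    for out in layer_outputs:
        state = [s + o for s, o in zip(state, out)]
        norms.append(norm(state))
    return norms

def attn_mix_norm(layer_outputs, scores):
    """Norm of a softmax-weighted mix of the layer outputs.

    Softmax weights are non-negative and sum to 1, so the result is a convex
    combination and its norm is at most the largest input norm.
    """
    weights = softmax(scores)
    d = len(layer_outputs[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, layer_outputs))
             for i in range(d)]
    return norm(mixed)

# Toy worst case (assumption): 8 layers that all emit the same unit vector.
outputs = [[1.0, 0.0] for _ in range(8)]
sum_norms = running_sum_norms(outputs)        # grows 1.0, 2.0, ..., 8.0
last_layer_share = 1.0 / sum_norms[-1]        # newest layer is 1/8 of the stream
mix_norm = attn_mix_norm(outputs, [0.0] * 8)  # stays at 1.0 at any depth
```

Real layer outputs are not perfectly aligned, so the growth under plain summation is typically slower than in this example, but the bound on the softmax-weighted mix holds for any inputs and any scores.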

If you train transformers at scale, Block AttnRes is presented as a drop-in replacement for standard residuals. The paper and accompanying code provide the implementation details needed to evaluate it on your own runs.

Kimi.ai (@Kimi_Moonshot) on X:

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with https://t.co/gcWyzhZVc0
