HeadsUpAI

Kimi Team Releases AttnRes to Replace Fixed Residual Connections in Transformers

Attention Residuals (AttnRes), from Kimi Team, replaces the fixed-weight accumulation of standard residual connections with softmax attention over all preceding layer outputs. Each layer learns a single pseudo-query that it uses to selectively aggregate earlier representations. The memory-efficient variant, Block AttnRes, groups layers into blocks and applies attention only at block boundaries, reducing memory from O(Ld) to O(Nd), where L is the number of layers, N the number of blocks (around 8), and d the hidden size. Integrated into Kimi Linear (48B total / 3B activated parameters, trained on 1.4T tokens), AttnRes improves results on all evaluated benchmarks.
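As a rough illustration of the aggregation described above, here is a minimal pure-Python sketch. It is not the Kimi Team implementation: the function names, the 1/sqrt(d) score scaling, and the use of raw layer outputs as both keys and values are assumptions made for this example.

```python
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(prev_outputs, pseudo_query):
    """Mix earlier layer outputs using softmax weights from one query vector.

    prev_outputs: list of d-dimensional vectors, one per preceding layer.
    pseudo_query: d-dimensional vector owned by the current layer
                  (learned in practice, passed in by hand here).
    """
    d = len(pseudo_query)
    # Assumption: scaled dot-product scores between the query and each output.
    scores = [
        sum(q * h for q, h in zip(pseudo_query, out)) / math.sqrt(d)
        for out in prev_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * out[i] for w, out in zip(weights, prev_outputs))
        for i in range(d)
    ]
    return mixed, weights

def standard_residual(prev_outputs):
    # Fixed-weight accumulation: every earlier output contributes with weight 1.
    d = len(prev_outputs[0])
    return [sum(out[i] for out in prev_outputs) for i in range(d)]

outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attn_residual(outputs, [2.0, 0.0])
print(weights)                     # larger weight on outputs aligned with the query
print(standard_residual(outputs))  # plain sum of all earlier outputs
```

With a zero pseudo-query the weights are uniform, so the layer sees the mean of earlier outputs; a non-zero query shifts weight toward the outputs it aligns with. Under the same assumptions, a block variant would pass one summary vector per block as `prev_outputs`, so the list holds N entries rather than L.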

Standard residuals with PreNorm cause hidden-state magnitudes to grow unboundedly with depth, diluting each layer's contribution. AttnRes addresses this: training dynamics show a more uniform gradient distribution and bounded output magnitudes. Block AttnRes matches the performance of a baseline trained with 1.25x more compute.
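The magnitude argument can be seen in a toy setting. The sketch below is only an illustration under stated assumptions, not a reproduction of the paper's training dynamics: layer outputs are hypothetical unit vectors, and the softmax weights are taken to be uniform (the special case where all scores are equal). A running sum grows with depth, while any softmax mix is a convex combination and so cannot exceed the largest input norm.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit_outputs(num_layers, dim):
    # Hypothetical stand-in for layer outputs: unit vectors cycling through the axes.
    return [
        [1.0 if i == layer % dim else 0.0 for i in range(dim)]
        for layer in range(num_layers)
    ]

def running_sum_norms(outputs):
    # Standard residual stream: the state is the sum of all outputs so far.
    state = [0.0] * len(outputs[0])
    norms = []
    for out in outputs:
        state = [s + o for s, o in zip(state, out)]
        norms.append(norm(state))
    return norms

def uniform_mix_norms(outputs):
    # Softmax with equal scores reduces to the mean of all outputs so far.
    state = [0.0] * len(outputs[0])
    norms = []
    for depth, out in enumerate(outputs, start=1):
        state = [s + o for s, o in zip(state, out)]
        norms.append(norm([s / depth for s in state]))
    return norms

outputs = unit_outputs(num_layers=16, dim=4)
print(running_sum_norms(outputs)[-1])   # grows as layers are added
print(max(uniform_mix_norms(outputs)))  # never exceeds the unit input norm
```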

If you train transformers at scale, Block AttnRes is a drop-in replacement for standard residuals. The paper and accompanying code provide the implementation details needed to evaluate it on your own runs.

Kimi.ai
@Kimi_Moonshot
X

Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with https://t.co/gcWyzhZVc0
