Introducing ๐จ๐๐๐๐๐๐๐๐ ๐น๐๐๐๐ ๐๐๐๐: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with https://t.co/gcWyzhZVc0
Kimi Team Releases AttnRes to Replace Fixed Residual Connections in Transformers
ยท Updated
Attention Residuals (AttnRes), from Kimi Team, replaces the fixed-weight accumulation of standard residual connections with softmax attention over all preceding layer outputs. Each layer learns a single pseudo-query to selectively aggregate earlier representations. The memory-efficient Block AttnRes groups layers into blocks and applies attention only at block boundaries โ reducing memory from O(Ld) to O(Nd) with around 8 blocks. Integrated into Kimi Linear (48B total / 3B activated, trained on 1.4T tokens), AttnRes improves across all evaluated benchmarks.
Standard residuals with PreNorm cause hidden-state magnitudes to grow unboundedly with depth, diluting each layer's contribution. AttnRes addresses this: training dynamics show more uniform gradient distribution and bounded output magnitudes. Block AttnRes matches a baseline requiring 1.25x more compute.
If you train transformers at scale, Block AttnRes is a drop-in replacement for standard residuals โ the paper and accompanying code provide the implementation details needed to evaluate it on your own runs.
Kimi.ai
@Kimi_Moonshot
1.9kretweets
View on X



