Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with https://t.co/gcWyzhZVc0
Kimi Team Releases AttnRes to Replace Fixed Residual Connections in Transformers
Kimi Team published Attention Residuals (AttnRes), replacing standard fixed residual connections in transformers with learned attention over preceding layer outputs. Block AttnRes matches a baseline trained with 1.25x more compute, improving across all evaluated benchmarks.
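To make the contrast concrete, here is a minimal pure-Python sketch of uniform residual accumulation next to a softmax-weighted mix over preceding layer outputs. It is an illustration under stated assumptions, not the Kimi Team's implementation: the per-layer query vector, the dot-product scoring, and the function names `uniform_residual` and `attention_residual` are all hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def uniform_residual(outputs):
    # Standard residual stream: every earlier output is added with weight 1.
    dim = len(outputs[0])
    return [sum(o[j] for o in outputs) for j in range(dim)]

def attention_residual(outputs, query):
    # Hypothetical sketch: score each earlier output against a learned
    # query (assumed here to be one vector per layer), then mix the
    # outputs with the resulting softmax weights over depth.
    scores = [sum(q * x for q, x in zip(query, o)) for o in outputs]
    weights = softmax(scores)
    dim = len(outputs[0])
    mixed = [sum(w * o[j] for w, o in zip(weights, outputs)) for j in range(dim)]
    return mixed, weights

# Three preceding layer outputs (toy 2-d hidden states).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]  # a zero query gives equal weights, i.e. a plain average

summed = uniform_residual(outputs)
mixed, weights = attention_residual(outputs, query)
```

With a zero query the weights are equal, so the mix is a plain average; a trained query would instead shift weight toward whichever earlier layers are most useful to the current one.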
Standard residuals with PreNorm let hidden-state magnitudes grow without bound as depth increases, which dilutes each layer's contribution. AttnRes addresses this: the reported training dynamics show a more uniform gradient distribution and bounded output magnitudes.
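The magnitude argument can be illustrated with a toy calculation. The sketch below assumes the worst case for uniform accumulation, where every layer emits the same unit-norm vector, and uses arbitrary made-up scores for the softmax weights; neither assumption comes from the paper. Because softmax weights are non-negative and sum to 1, the mixed state can never have a larger norm than the largest input, while the plain sum grows linearly with depth.

```python
import math

def norm(v):
    # Euclidean norm of a list of floats.
    return math.sqrt(sum(x * x for x in v))

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

dim = 4
depth = 32
layer_output = [0.5] * dim  # toy assumption: identical unit-norm output per layer

history = []
running_sum = [0.0] * dim
sum_norms = []  # norm of the uniform residual sum after each layer
mix_norms = []  # norm of the softmax-weighted mix after each layer

for layer in range(depth):
    history.append(layer_output)
    running_sum = [a + b for a, b in zip(running_sum, layer_output)]
    sum_norms.append(norm(running_sum))

    # Arbitrary illustrative scores; in a real model these would be learned.
    weights = softmax([0.1 * i for i in range(len(history))])
    mixed = [sum(w * h[j] for w, h in zip(weights, history)) for j in range(dim)]
    mix_norms.append(norm(mixed))
```

In this toy setting `sum_norms` climbs to the depth (32 here), whereas every entry of `mix_norms` stays at the input norm of 1.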
If you train transformers at scale, Block AttnRes is a drop-in replacement for standard residuals. The paper and accompanying code provide the implementation details needed to evaluate it on your own runs.





