We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical. Validated on a 20x scaled-up Kimi Linear model: ✅ 1.54× throughput ✅ 64% ↓ P90 TTFT → Directly translating into lower token cost. More in Prefill-as-a-Service: https://t.co/If8fA3t9Og
Moonshot AI enables cross-datacenter inference to slash long-context costs
The key enabler is Kimi Linear, a hybrid-attention model that reduces KV cache (the memory state representing processed text) size by roughly 10x.

- Throughput increase: 1.54x
- P90 TTFT reduction: 64%
- KV cache reduction: roughly 10x
- Model architecture: Kimi Linear
- Network requirement: Commodity Ethernet
- Case study model size: 1T parameters
Standard models produce KV caches so large that moving them between prefill and decode nodes requires expensive RDMA networks. This bandwidth wall has historically kept teams from using fragmented GPU capacity across regions. With a cache roughly 10x smaller, Moonshot can stream the KV cache over commodity Ethernet, which makes prefill/decode disaggregation across datacenters practical for production workloads.
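To make the bandwidth wall concrete, here is a minimal back-of-the-envelope sketch in Python. It uses the generic sizing formula for a full-attention KV cache (keys plus values, per layer, per token) and compares transfer times before and after the roughly 10x reduction. The model shape, sequence length, and 25 Gb/s link speed are illustrative assumptions, not Kimi Linear's actual configuration or Moonshot's measured setup.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a full-attention KV cache: one key and one value vector
    per KV head, per layer, per token (bytes_per_elem=2 means 16-bit)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, link_gbps):
    """Ideal time to move num_bytes over a link rated in gigabits per second,
    ignoring protocol overhead and congestion."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model shape and prompt length, chosen only for illustration.
full = kv_cache_bytes(seq_len=128_000, n_layers=60, n_kv_heads=8, head_dim=128)
hybrid = full / 10  # the roughly 10x reduction quoted in the article

for name, size in [("full attention", full), ("hybrid, 10x smaller", hybrid)]:
    secs = transfer_seconds(size, link_gbps=25)  # assumed Ethernet link speed
    print(f"{name}: {size / 1e9:.1f} GB, {secs:.1f} s at 25 Gb/s")
```

Under these assumed numbers, the per-request transfer time drops by the same factor as the cache size, which is the difference between a transfer that dominates time to first token and one that can plausibly be hidden behind prefill compute.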
You can now scale long-context applications more efficiently by offloading heavy prefills to compute-dense clusters. In a case study with a 1T-parameter model, this approach delivered 1.54x higher throughput and a 64% reduction in P90 time to first token (TTFT). Architectural support is expanding across open inference frameworks like vLLM and SGLang.