HeadsUpAI

Moonshot AI enables cross-datacenter inference to slash long-context costs


Moonshot AI, an AI company building the Kimi model family, introduced Prefill-as-a-Service to enable cross-datacenter inference. This architecture separates prefill (the compute-intensive initial processing) from decode across loosely coupled clusters. It uses Kimi Linear, a hybrid-attention model that reduces KV cache (the memory state representing processed text) size by roughly 10x.
Throughput increase: 1.54x
P90 TTFT reduction: 64%
KV cache reduction: 10x
Model architecture: Kimi Linear
Network requirement: Commodity Ethernet
Case study model size: 1T parameters

Standard full-attention models produce KV caches so large that moving them between prefill and decode nodes requires expensive RDMA networks. This bandwidth wall has historically prevented teams from using fragmented GPU capacity across regions. By shrinking the cache, Moonshot can stream it over commodity Ethernet, which makes running prefill and decode in separate datacenters practical for production workloads.
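
A back-of-the-envelope sketch shows why cache size decides which network is viable. The cache sizes and the link speed below are hypothetical round numbers chosen for illustration, not figures from Moonshot AI, and `transfer_seconds` is a made-up helper that ignores protocol overhead and contention.

```python
def transfer_seconds(cache_gigabytes: float, link_gbit_per_s: float) -> float:
    """Ideal time to move a KV cache over a link, ignoring protocol overhead."""
    gigabits = cache_gigabytes * 8  # gigabytes -> gigabits
    return gigabits / link_gbit_per_s


# Hypothetical numbers: a 20 GB cache versus one about 10x smaller,
# both sent over an assumed 10 Gbit/s share of a commodity Ethernet link.
full_cache_s = transfer_seconds(20.0, 10.0)
small_cache_s = transfer_seconds(2.0, 10.0)
print(f"full-size cache: {full_cache_s:.1f} s, 10x smaller cache: {small_cache_s:.1f} s")
```

Under these assumptions the transfer time scales linearly with cache size, so a 10x smaller cache cuts the time a request waits on the network by the same factor.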

You can now scale long-context applications more efficiently by offloading heavy prefills to compute-dense clusters. In a case study with a 1T-parameter model, this approach delivered 1.54x higher throughput and a 64% reduction in P90 time to first token (TTFT). Architectural support is expanding across open inference frameworks like vLLM and SGLang.

Kimi.ai (@Kimi_Moonshot) on X:

We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical. Validated on a 20x scaled-up Kimi Linear model: ✅ 1.54× throughput ✅ 64% ↓ P90 TTFT → Directly translating into lower token cost. More in Prefill-as-a-Service: https://t.co/If8fA3t9Og


Still wondering? A few quick answers below.

Prefill-as-a-Service is a distributed architecture that separates the compute-intensive prefill phase of AI inference from the memory-intensive decode phase across different clusters. It selectively offloads long-context requests to standalone, compute-dense clusters. This allows operators to scale prefill and decode capacity independently across loosely coupled datacenters or regions using commodity Ethernet instead of expensive, high-bandwidth interconnects.

The Kimi Linear model uses a hybrid-attention architecture that interleaves a small number of full-attention layers with many linear-complexity layers. This design reduces the size of the KV cache (the memory state that must be kept for already-processed text) by roughly an order of magnitude. Because the resulting transfer is much smaller, it can travel over commodity Ethernet links without the congestion that the much larger caches of standard full-attention models would cause.
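
The effect of swapping most full-attention layers for linear-complexity ones can be sketched with simple arithmetic. Everything below is an assumption made for illustration: the layer counts, head sizes, and context length are hypothetical and are not Kimi Linear's actual configuration. The only structural facts used are that a full-attention layer stores keys and values for every token, while a linear-complexity layer keeps a fixed-size state.

```python
def full_attention_cache_bytes(layers, tokens, kv_heads, head_dim, bytes_per_value=2):
    # Keys and values (the factor of 2) are stored for every token in every layer.
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_value


def linear_state_bytes(layers, heads, head_dim, bytes_per_value=2):
    # A linear-complexity layer keeps one head_dim x head_dim state per head,
    # regardless of how many tokens have been processed.
    return layers * heads * head_dim * head_dim * bytes_per_value


TOKENS = 128 * 1024  # hypothetical long-context request

# Hypothetical baseline: 40 full-attention layers.
baseline = full_attention_cache_bytes(40, TOKENS, kv_heads=8, head_dim=128)

# Hypothetical hybrid: 4 full-attention layers plus 36 linear-complexity layers.
hybrid = (full_attention_cache_bytes(4, TOKENS, kv_heads=8, head_dim=128)
          + linear_state_bytes(36, heads=8, head_dim=128))

print(f"baseline {baseline / 2**30:.1f} GiB, hybrid {hybrid / 2**30:.2f} GiB, "
      f"ratio {baseline / hybrid:.1f}x")
```

With this made-up split, almost all of the remaining cache comes from the few full-attention layers, and the fixed-size linear states add only a few megabytes. That is why the reduction tracks the share of layers converted to linear attention.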

In a case study using a 1-trillion-parameter hybrid model, the PrfaaS-PD architecture (Prefill-as-a-Service combined with prefill/decode disaggregation) achieved 1.54 times higher serving throughput than a standard homogeneous deployment. It also reduced the P90 time to first token (TTFT) by 64 percent. P90 is the 90th percentile, so this means the delay before the model starts generating a response dropped substantially for the slowest tenth of requests.

Moonshot AI uses a dual-timescale scheduler that monitors network utilization and request queue depth in real time. It applies length-based threshold routing to offload only sufficiently long requests to the remote prefill cluster. This selective offloading ensures that cross-datacenter bandwidth is used only when the compute gains of remote acceleration outweigh the transfer costs, keeping link utilization stable.
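
The routing idea described above can be sketched as a small class. This is a minimal illustration under stated assumptions, not Moonshot AI's scheduler: the class name, every default value, and the way the two timescales are split (a per-request length check plus a periodic threshold adjustment) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PrefillRouter:
    """Hypothetical length-threshold router; all defaults are illustrative."""
    threshold_tokens: int = 32_000      # offload prompts at least this long
    max_link_utilization: float = 0.8   # assumed ceiling for the cross-DC link
    max_queue_depth: int = 64           # assumed ceiling for the remote queue
    step_tokens: int = 4_000
    min_threshold: int = 8_000
    max_threshold: int = 128_000

    def route(self, prompt_tokens: int) -> str:
        # Fast timescale: a per-request decision based on prompt length only.
        if prompt_tokens >= self.threshold_tokens:
            return "remote_prefill"
        return "local"

    def rebalance(self, link_utilization: float, remote_queue_depth: int) -> None:
        # Slow timescale: raise the threshold when the link or the remote queue
        # is overloaded (offload less), lower it when there is headroom.
        congested = (link_utilization > self.max_link_utilization
                     or remote_queue_depth > self.max_queue_depth)
        if congested:
            self.threshold_tokens = min(self.max_threshold,
                                        self.threshold_tokens + self.step_tokens)
        else:
            self.threshold_tokens = max(self.min_threshold,
                                        self.threshold_tokens - self.step_tokens)


router = PrefillRouter()
print(router.route(50_000), router.route(2_000))
router.rebalance(link_utilization=0.95, remote_queue_depth=10)
print(router.threshold_tokens)
```

Keeping the per-request path to a single comparison means routing adds negligible latency, while the periodic adjustment is what keeps link utilization stable as load shifts.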

The architecture is designed to be compatible with major open-source serving frameworks. Moonshot AI has collaborated with the developers of vLLM, SGLang, and Dynamo to integrate these disaggregated serving principles. The system builds on existing hybrid model managers to handle the specific storage and transfer requirements of both linear states and full-attention KV cache blocks across distributed clusters.
