Moonshot AI enables cross-datacenter inference to slash long-context costs

Kimi

Apr 18, 2026 · Updated May 4, 2026

Moonshot AI launched Prefill-as-a-Service, a distributed architecture that runs the compute-heavy prefill phase and the memory-heavy decode phase in different datacenters. By using a hybrid model to shrink the KV cache that must travel between them, the system achieved 1.54x higher throughput and a 64% lower P90 time to first token (TTFT) in a case study.

Moonshot AI, an AI company building the Kimi model family, introduced Prefill-as-a-Service to enable cross-datacenter inference. This architecture separates prefill (the compute-intensive initial processing) from decode across loosely coupled clusters. It uses Kimi Linear, a hybrid-attention model that reduces KV cache (the memory state representing processed text) size by roughly 10x.

Throughput increase: 1.54x
P90 TTFT reduction: 64%
KV cache reduction: 10x
Model architecture: Kimi Linear
Network requirement: Commodity Ethernet
Case study model size: 1T parameters

Standard models produce massive KV caches requiring expensive RDMA networks. This bandwidth wall has historically prevented teams from using fragmented GPU capacity across regions. By shrinking the cache, Moonshot can now stream inference data over commodity Ethernet, making distributed global infrastructure a practical reality for production workloads.
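To make the bandwidth wall concrete, here is a back-of-the-envelope sketch in Python. Every number in it (the per-token cache size, the prompt length, the link speed) is a hypothetical placeholder rather than a figure from Moonshot AI; only the roughly 10x reduction comes from the announcement.

```python
def transfer_seconds(cache_bytes: float, link_gbps: float) -> float:
    """Time to move a cache over a link, ignoring protocol overhead and latency."""
    return cache_bytes * 8 / (link_gbps * 1e9)


# Hypothetical workload: illustrative assumptions, not measured values.
BYTES_PER_TOKEN_FULL = 100_000  # assumed KV cache bytes per token for a standard model
PROMPT_TOKENS = 128_000         # assumed long-context prompt length
LINK_GBPS = 25                  # assumed commodity Ethernet bandwidth in Gbit/s
REDUCTION = 10                  # roughly 10x smaller cache, per the announcement

full_cache = BYTES_PER_TOKEN_FULL * PROMPT_TOKENS
hybrid_cache = full_cache / REDUCTION

full_s = transfer_seconds(full_cache, LINK_GBPS)
hybrid_s = transfer_seconds(hybrid_cache, LINK_GBPS)

print(f"standard cache: {full_cache / 1e9:.2f} GB, {full_s:.2f} s on the link")
print(f"hybrid cache:   {hybrid_cache / 1e9:.2f} GB, {hybrid_s:.2f} s on the link")
```

Under these assumed numbers the transfer time per request falls in direct proportion to the cache size, which is why a smaller cache can fit a slower link.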

You can now scale long-context applications more efficiently by offloading heavy prefills to compute-dense clusters. In a case study with a 1T-parameter model, this approach delivered 1.54x higher throughput and a 64% reduction in P90 time to first token (TTFT). Architectural support is expanding across open inference frameworks like vLLM and SGLang.

View the full update on arxiv.org

Kimi.ai

@Kimi_Moonshot · Apr 18

We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token. This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical. Validated on a 20x scaled-up Kimi Linear model: ✅ 1.54× throughput ✅ 64% ↓ P90 TTFT → Directly translating into lower token cost. More in Prefill-as-a-Service: https://t.co/If8fA3t9Og

View on X

Still wondering? A few quick answers below.

What is Prefill-as-a-Service (PrfaaS)?

Prefill-as-a-Service is a distributed architecture that separates the compute-intensive prefill phase of AI inference from the memory-intensive decode phase across different clusters. It selectively offloads long-context requests to standalone, compute-dense clusters. This allows operators to scale prefill and decode capacity independently across loosely coupled datacenters or regions using commodity Ethernet instead of expensive, high-bandwidth interconnects.

How does the Kimi Linear model enable cross-datacenter inference?

The Kimi Linear model uses a hybrid-attention architecture that interleaves a small number of full-attention layers with many linear-complexity layers. This design reduces the size of the KV cache (the memory state required to process text) by roughly an order of magnitude. Because the resulting data transfer is significantly smaller, it can cross commodity Ethernet links between datacenters without the congestion that the much larger caches of standard models would cause.
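The effect of interleaving can be sketched with a simple size model. In a full-attention layer the cache grows with sequence length, while a linear-complexity layer keeps a fixed-size state. The layer count, the one-in-ten ratio, and the byte sizes below are hypothetical values picked to land near the reported order of magnitude; they are not Kimi Linear's published configuration.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    num_layers: int          # total layers (assumed)
    full_attn_every: int     # one full-attention layer per this many layers (assumed)
    kv_bytes_per_token: int  # KV bytes per token in one full-attention layer (assumed)
    linear_state_bytes: int  # fixed state size of one linear layer (assumed)


def hybrid_cache_bytes(cfg: HybridConfig, seq_len: int) -> int:
    """Cache size when only some layers keep a per-token KV cache."""
    full_layers = cfg.num_layers // cfg.full_attn_every
    linear_layers = cfg.num_layers - full_layers
    return (full_layers * cfg.kv_bytes_per_token * seq_len
            + linear_layers * cfg.linear_state_bytes)


def full_cache_bytes(cfg: HybridConfig, seq_len: int) -> int:
    """Cache size if every layer used full attention."""
    return cfg.num_layers * cfg.kv_bytes_per_token * seq_len


cfg = HybridConfig(num_layers=40, full_attn_every=10,
                   kv_bytes_per_token=2048, linear_state_bytes=1_000_000)
seq_len = 100_000
ratio = full_cache_bytes(cfg, seq_len) / hybrid_cache_bytes(cfg, seq_len)
print(f"reduction at {seq_len} tokens: {ratio:.1f}x")
```

Because the linear states do not grow with the prompt, the reduction in this toy model approaches the layer ratio as prompts get longer.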

What performance improvements does the PrfaaS-PD architecture provide?

In a case study using a 1-trillion-parameter hybrid model, the PrfaaS-PD architecture achieved 1.54 times higher serving throughput than a standard homogeneous deployment. It also reduced the P90 Time to First Token (TTFT) by 64 percent, meaning the delay before the model starts generating a response drops sharply at the slow tail, where the longest requests typically sit.
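A quick calculation shows what these two figures could mean in practice. The 1.54x and 64 percent values come from the case study; the 10-second baseline and the assumption that hardware cost stays the same are hypothetical, and a heterogeneous deployment may well have a different cost base.

```python
THROUGHPUT_GAIN = 1.54  # reported in the case study
TTFT_REDUCTION = 0.64   # reported in the case study


def cost_per_token_ratio(throughput_gain: float) -> float:
    """Relative cost per token, assuming total hardware cost is unchanged."""
    return 1.0 / throughput_gain


def reduced_ttft(baseline_seconds: float, reduction: float) -> float:
    """P90 time to first token after applying a fractional reduction."""
    return baseline_seconds * (1.0 - reduction)


baseline_ttft = 10.0  # hypothetical baseline P90 TTFT in seconds
relative_cost = cost_per_token_ratio(THROUGHPUT_GAIN)
print(f"relative cost per token: {relative_cost:.3f}")
print(f"P90 TTFT: {baseline_ttft:.1f} s -> "
      f"{reduced_ttft(baseline_ttft, TTFT_REDUCTION):.1f} s")
```

Under the equal-cost assumption, 1.54x throughput works out to roughly 35 percent lower cost per token.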

How does Moonshot AI manage bandwidth during cross-cluster transfers?

Moonshot AI uses a dual-timescale scheduler that monitors network utilization and request queue depth in real time. It applies length-based threshold routing to offload only sufficiently long requests to the remote prefill cluster. This selective offloading ensures that cross-datacenter bandwidth is used only when the compute gains of remote acceleration outweigh the transfer costs, keeping link utilization stable.
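A minimal sketch of length-based threshold routing with two timescales might look like the following. It is an illustration under stated assumptions, not Moonshot AI's scheduler: the class names, default thresholds, and the doubling and halving rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LinkStats:
    utilization: float  # fraction of cross-datacenter bandwidth in use, 0.0 to 1.0
    queue_depth: int    # requests waiting at the remote prefill cluster


class ThresholdRouter:
    """Hypothetical router: fast per-request routing, slow threshold tuning."""

    def __init__(self, threshold_tokens: int = 32_000,
                 min_threshold: int = 8_000, max_threshold: int = 256_000,
                 target_utilization: float = 0.7, max_queue_depth: int = 64):
        self.threshold_tokens = threshold_tokens
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.target_utilization = target_utilization
        self.max_queue_depth = max_queue_depth

    def route(self, prompt_tokens: int) -> str:
        """Fast timescale: offload only sufficiently long prompts."""
        if prompt_tokens >= self.threshold_tokens:
            return "remote_prefill"
        return "local"

    def adjust(self, stats: LinkStats) -> None:
        """Slow timescale: called periodically with fresh link statistics."""
        overloaded = (stats.utilization > self.target_utilization
                      or stats.queue_depth > self.max_queue_depth)
        if overloaded:
            # Raise the bar so fewer requests cross the link.
            self.threshold_tokens = min(self.max_threshold,
                                        self.threshold_tokens * 2)
        elif stats.utilization < self.target_utilization / 2:
            # Plenty of headroom: let shorter requests use remote prefill.
            self.threshold_tokens = max(self.min_threshold,
                                        self.threshold_tokens // 2)


router = ThresholdRouter()
print(router.route(4_000), router.route(100_000))
router.adjust(LinkStats(utilization=0.95, queue_depth=10))
print(router.threshold_tokens, router.route(50_000))
```

Keeping the per-request check to a single comparison leaves the hot path cheap, while the periodic adjustment is what holds link utilization near its target in this sketch.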

Is the Prefill-as-a-Service architecture compatible with vLLM or SGLang?

Yes, the architecture is designed to be compatible with major open-source serving frameworks. Moonshot AI has collaborated with the developers of vLLM, SGLang, and Dynamo to integrate these disaggregated serving principles. The system builds on existing hybrid model managers to handle the specific storage and transfer requirements of both linear states and full-attention KV cache blocks across distributed clusters.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Kimi →

Keep reading

Moonshot AI Scales Kimi K2.6 Agent Swarm to 300 Parallel Subagents

Moonshot AI released Kimi K2.6 Agent Swarm, a system that coordinates up to 300 parallel sub-agents across 4,000 steps to complete complex tasks. The update shifts from conversational responses to production-ready file bundles, including 100-page reports and massive datasets.

Fireworks AI hosts MiniMax M3 with 15x faster long context decoding

Fireworks AI · Jun 4

Fireworks AI is now powering inference for MiniMax M3, a multimodal model featuring a novel sparse attention architecture. The partnership enables 15.6x faster decoding at 1-million-token context, making real-time agentic workflows viable at scale.

Z.ai Deploys ZCube Network to Slash Inference Costs and Latency

Zhipu AI · May 21

Z.ai successfully deployed its ZCube network architecture in production to power GLM-5.1 coding services, reducing hardware costs by 33% while boosting throughput. By flattening the network topology, the system eliminates the congestion typically caused by moving massive amounts of data between GPUs during long-context inference.

Cloudflare Workers AI Adds Kimi K2.5 for End-to-End Agent Workflows

Cloudflare · Mar 20

Cloudflare's Workers AI now supports Kimi K2.5, Moonshot AI's frontier open-source model with a 256k context window. Developers can build and run full agent workflows on Cloudflare's platform, with prefix caching and a new async API cutting inference costs.
