https://t.co/jaOMnP7Yud
Z.ai Deploys ZCube Network to Slash Inference Costs and Latency
- Network hardware CapEx reduction
- 33%
- GPU inference throughput increase
- 15%
- TTFT P99 latency reduction
- 40.6%
- Network diameter
- 2 switch hops
- Scalability
- 16,384 400Gbps NICs and more
Modern models use Prefill-Decode disaggregation, separating prompt processing from token generation. This creates asymmetric traffic as KV Caches move between nodes. Traditional networks suffer from hotspots during these transfers, but ZCube's topology distributes traffic across a broader path space. This mirrors Moonshot AI's distributed prefill architecture by treating compute phase separation as a primary infrastructure challenge.
In production tests for the GLM-5.1 coding model, ZCube reduced hardware costs by 33% and cut tail latency by 40%. While originally a research paper, this deployment proves that hardware-layer innovation can scale to tens of thousands of GPUs. These optimizations will likely underpin future high-concurrency agentic engineering services.
Still wondering? A few quick answers below.





