Z.ai Deploys ZCube Network to Slash Inference Costs and Latency

Zhipu AI

May 21, 2026 · Updated Jun 12, 2026

Z.ai has deployed its ZCube network architecture in production to power GLM-5.1 coding services, cutting network hardware costs by 33% while raising GPU inference throughput by 15%. By flattening the network topology, the system removes the congestion that typically arises when large volumes of data move between GPUs during long-context inference.

Z.ai, the AI lab behind the GLM model series, deployed ZCube, a flattened network architecture designed to eliminate bottlenecks in large-scale inference. Unlike traditional hierarchical designs, ZCube removes the spine switch layer and connects GPU nodes through a bipartite interconnect with a diameter of two switch hops. This follows Z.ai's LayerSplit optimization for long-context workloads.

Network hardware CapEx reduction: 33%
GPU inference throughput increase: 15%
TTFT P99 latency reduction: 40.6%
Network diameter: 2 switch hops
Scalability: 16,384 or more 400 Gbps NICs

Modern inference deployments often use prefill-decode disaggregation, which runs prompt processing and token generation on separate GPUs. This creates asymmetric traffic, because the KV Cache produced during prefill has to move to the nodes that handle decoding. Traditional networks develop hotspots during these transfers, whereas ZCube's topology distributes the traffic across a broader path space. Like Moonshot AI's distributed prefill architecture, the design treats compute phase separation as a primary infrastructure challenge.
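Why a broader path space reduces hotspots can be illustrated with a toy model. The sketch below is not a model of ZCube or of any real routing scheme: it assumes each flow is assigned to one path uniformly at random, and the flow and path counts are invented for illustration. It only shows that the busiest path carries fewer flows when the same traffic has more paths to spread over.

```python
import random
from collections import Counter


def max_path_load(num_flows: int, num_paths: int, seed: int = 0) -> int:
    """Return how many flows land on the busiest path.

    Toy assumption: every flow picks one of `num_paths` paths uniformly
    at random, with no awareness of current load.
    """
    rng = random.Random(seed)
    loads = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    return max(loads.values())


NUM_FLOWS = 64  # hypothetical number of concurrent KV Cache transfers

narrow = max_path_load(NUM_FLOWS, num_paths=4)   # few candidate paths
broad = max_path_load(NUM_FLOWS, num_paths=64)   # many candidate paths

print(f"4 paths:  busiest path carries {narrow} flows")
print(f"64 paths: busiest path carries {broad} flows")
```

With only 4 paths, at least 16 of the 64 flows must share one path no matter how they are assigned; with 64 paths the busiest one is typically far less loaded.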

In production tests for the GLM-5.1 coding model, ZCube reduced network hardware costs by 33% and cut P99 time-to-first-token latency by 40.6%. ZCube originated as a research paper, and this deployment indicates that the design holds up under live inference traffic. These optimizations will likely underpin future high-concurrency agentic engineering services.

View the full update on z.ai

Z.ai (@Zai_org), May 20: https://t.co/jaOMnP7Yud

Still wondering? A few quick answers below.

What is the ZCube network architecture?

ZCube is a flattened network topology designed by Z.ai to interconnect GPUs in large-scale AI clusters. Unlike traditional hierarchical designs that stack multiple layers of switches, ZCube removes the spine switch layer and uses a bipartite interconnect. This structure is specifically optimized to handle the heavy data traffic patterns required for modern long-context language model inference.

How does ZCube improve LLM inference performance?

ZCube improves performance by eliminating topology-induced congestion during the transfer of KV Cache data between GPU nodes. In production tests using the GLM-5.1 coding model, the architecture increased average GPU inference throughput by 15% and reduced the P99 tail latency for the time to first token by 40.6% compared to traditional rail-optimized fat-tree network designs.

Does ZCube reduce the cost of building AI clusters?

Yes, ZCube significantly lowers infrastructure costs by simplifying the network fabric. Because the architecture eliminates the spine switch layer, it reduces the required investment in switches and optical modules by approximately 33%. For a 10,000-GPU cluster, Z.ai estimates this architectural shift can save between 210 million and 640 million RMB in network hardware capital expenditures.
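The figures above can be put on a per-GPU basis with simple arithmetic. The sketch below uses only the numbers quoted in the answer; the implied baseline spend is a back-of-the-envelope inference that assumes the 33% reduction applies equally at both ends of the range, which the source does not state.

```python
SAVINGS_RANGE_RMB = (210e6, 640e6)  # estimated savings quoted for the cluster
CLUSTER_GPUS = 10_000               # cluster size quoted alongside the estimate
REDUCTION = 0.33                    # approximate network hardware CapEx reduction


def per_gpu_saving(saving_rmb: float, gpus: int) -> float:
    """Savings divided evenly across the GPUs in the cluster."""
    return saving_rmb / gpus


def implied_baseline(saving_rmb: float, reduction: float) -> float:
    """Network spend that would yield `saving_rmb` at the given reduction.

    Assumption: the same reduction ratio applies across the whole range.
    """
    return saving_rmb / reduction


for saving in SAVINGS_RANGE_RMB:
    print(
        f"saving {saving / 1e6:.0f}M RMB -> "
        f"{per_gpu_saving(saving, CLUSTER_GPUS):,.0f} RMB per GPU, "
        f"implied baseline about {implied_baseline(saving, REDUCTION) / 1e6:.0f}M RMB"
    )
```

Under these assumptions the savings work out to 21,000 to 64,000 RMB per GPU, against an implied baseline network spend of roughly 636 million to 1,939 million RMB.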

Why is network architecture becoming a bottleneck for AI models?

As models move toward prefill-decode disaggregation, where different GPUs handle prompt processing and token generation, they must move massive amounts of data across the network. Traditional networks were designed for the symmetric traffic of AI training and struggle with the asymmetric, dynamic data transfers of inference, leading to local hotspots and link congestion that slow down responses.
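To give a sense of scale, the size of a KV Cache handoff can be estimated with the standard formula for attention caches: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions below are hypothetical and are not GLM-5.1's; only the 400 Gbps link speed comes from the figures reported for ZCube. The transfer time is an idealized lower bound that ignores protocol overhead and contention.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes needed to store keys and values for `tokens` tokens."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Ideal time to push `num_bytes` over one link at full line rate."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Hypothetical model shape, chosen for illustration only.
size = kv_cache_bytes(
    tokens=128_000,      # one long-context prompt
    layers=64,
    kv_heads=8,
    head_dim=128,
    bytes_per_value=2,   # 16-bit values
)
seconds = transfer_seconds(size, link_gbps=400)

print(f"KV Cache size: {size / 1e9:.2f} GB")       # 33.55 GB
print(f"Ideal transfer time: {seconds:.3f} s")     # 0.671 s
```

Even in this idealized case a single long prompt occupies a 400 Gbps link for most of a second, which is why congestion on the path between prefill and decode nodes shows up directly in time to first token.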

Is ZCube available for large-scale deployments?

ZCube is already running in a live production environment powering Z.ai's GLM-5.1 coding services. The architecture is designed to be highly scalable: using standard 51.2T switches, it can support a network of 16,384 or more 400 Gbps network interface cards. Z.ai indicates the design can scale further to support clusters containing hundreds of thousands of GPUs for next-generation AI infrastructure.
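The relationship between the switch capacity and the NIC count can be checked with a short calculation. The sketch assumes each 51.2T switch is operated as ports of 400 Gbps, matching the NIC speed reported for ZCube. The observation that the NIC count equals the port count squared is arithmetic only; the source does not describe how ZCube actually wires its switches.

```python
SWITCH_CAPACITY_GBPS = 51_200  # a "51.2T" switch
NIC_SPEED_GBPS = 400           # NIC speed reported for ZCube


def ports_per_switch(capacity_gbps: int, port_speed_gbps: int) -> int:
    """Number of full-rate ports a switch of this capacity can expose."""
    return capacity_gbps // port_speed_gbps


radix = ports_per_switch(SWITCH_CAPACITY_GBPS, NIC_SPEED_GBPS)

print(f"400 Gbps ports per switch: {radix}")   # 128
print(f"radix squared: {radix * radix}")       # 16384
```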

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from Zhipu AI →

Keep reading

Z.ai Resolves GLM-5 Infrastructure Race Conditions to Stabilize Long Horizon Coding Agents

Z.ai identified and fixed low-level race conditions in its GLM-5 inference stack that caused garbled outputs and repetition during high-concurrency coding tasks. By introducing a layer-wise cache storage scheme called LayerSplit, the lab also increased system throughput by up to 132% for long-context workloads.

