HeadsUpAI

Z.ai Deploys ZCube Network to Slash Inference Costs and Latency

Z.ai, the AI lab behind the GLM model series, has deployed ZCube, a flattened network architecture designed to eliminate bottlenecks in large-scale inference. Unlike traditional hierarchical designs, ZCube removes the middle (spine) switch layer, so any two GPU nodes sit at most two switch hops apart. This follows Z.ai's LayerSplit optimization for long-context workloads.
Network hardware CapEx reduction: 33%
GPU inference throughput increase: 15%
TTFT P99 latency reduction: 40.6%
Network diameter: 2 switch hops
Scalability: 16,384+ NICs at 400 Gbps

Modern inference systems increasingly use prefill-decode disaggregation, which runs prompt processing and token generation on separate GPU nodes. This creates asymmetric traffic as KV Caches move from prefill nodes to decode nodes. Traditional networks suffer from hotspots during these transfers, whereas ZCube's topology spreads the traffic across a broader path space. The approach mirrors Moonshot AI's distributed prefill architecture in treating compute phase separation as a primary infrastructure challenge.
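To make the "broader path space" point concrete, here is a toy balls-in-bins sketch. It is not ZCube's routing algorithm, which the article does not describe. It assumes equal-size flows, each assigned to one path uniformly at random as a stand-in for hash-based path selection, and the flow and path counts are hypothetical.

```python
import random
from collections import Counter


def max_link_load(num_flows: int, num_paths: int, seed: int = 0) -> int:
    """Return the number of flows on the busiest path.

    Each flow picks one path uniformly at random, a simplified stand-in
    for hash-based path selection (an assumption, not ZCube's method).
    """
    rng = random.Random(seed)
    loads = Counter(rng.randrange(num_paths) for _ in range(num_flows))
    return max(loads.values())


if __name__ == "__main__":
    flows = 256  # hypothetical number of concurrent KV Cache transfers
    for paths in (8, 64):  # hypothetical narrow vs. broad path space
        worst = max_link_load(flows, paths)
        print(f"{paths:>3} paths: busiest path carries {worst} of {flows} flows")
```

With the same number of flows, the busiest path carries far fewer of them when more paths are available, which is the intuition behind spreading KV Cache transfers over a wider set of routes.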

In production tests for the GLM-5.1 coding model, ZCube reduced network hardware costs by 33% and cut P99 time-to-first-token (TTFT) latency by 40.6%. While ZCube began as a research paper, this deployment suggests that hardware-layer innovation of this kind can scale to clusters of tens of thousands of GPUs. These optimizations will likely underpin future high-concurrency agentic engineering services.

Still wondering? A few quick answers below.

ZCube is a flattened network topology designed by Z.ai to interconnect GPUs in large-scale AI clusters. Unlike traditional hierarchical designs that stack multiple layers of switches, ZCube removes the spine switch layer and uses a bipartite interconnect. This structure is specifically optimized to handle the heavy data traffic patterns required for modern long-context language model inference.

ZCube improves performance by eliminating topology-induced congestion during the transfer of KV Cache data between GPU nodes. In production tests using the GLM-5.1 coding model, the architecture increased average GPU inference throughput by 15% and reduced P99 time-to-first-token (TTFT) latency by 40.6% compared to traditional rail-optimized fat-tree network designs.

ZCube also lowers infrastructure costs by simplifying the network fabric. Because the architecture eliminates the spine switch layer, it reduces the required investment in switches and optical modules by approximately 33%. For a 10,000-GPU cluster, Z.ai estimates this architectural shift can save between 210 million and 640 million RMB in network hardware capital expenditures.
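The cost figures can be unpacked with some quick arithmetic. The sketch below assumes that the 33% reduction and the 210 to 640 million RMB savings range refer to the same baseline network spend, which the article implies but does not state outright. The function names are illustrative only.

```python
SAVINGS_FRACTION = 0.33        # from the article: ~33% network CapEx reduction
SAVINGS_RANGE_RMB = (210e6, 640e6)  # from the article: estimated savings
CLUSTER_GPUS = 10_000          # from the article: reference cluster size


def implied_baseline_rmb(savings_rmb: float, fraction: float = SAVINGS_FRACTION) -> float:
    """Baseline network CapEx implied by a savings amount and a savings fraction."""
    return savings_rmb / fraction


def savings_per_gpu_rmb(savings_rmb: float, gpus: int = CLUSTER_GPUS) -> float:
    """Savings spread evenly across every GPU in the cluster."""
    return savings_rmb / gpus


if __name__ == "__main__":
    for savings in SAVINGS_RANGE_RMB:
        baseline = implied_baseline_rmb(savings)
        per_gpu = savings_per_gpu_rmb(savings)
        print(f"savings {savings / 1e6:.0f}M RMB -> "
              f"implied baseline {baseline / 1e6:.0f}M RMB, "
              f"{per_gpu:,.0f} RMB saved per GPU")
```

Under that assumption, the savings work out to roughly 21,000 to 64,000 RMB per GPU, against an implied baseline network spend of roughly 640 million to 1.9 billion RMB for the cluster.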

As models move toward prefill-decode disaggregation, where different GPUs handle prompt processing and token generation, they must move massive amounts of data across the network. Traditional networks were designed for the symmetric traffic of AI training and struggle with the asymmetric, dynamic data transfers of inference, leading to local hotspots and link congestion that slow down responses.
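A back-of-the-envelope estimate shows how large these transfers can get. The sketch below uses the standard size formula for a conventional attention KV Cache (keys plus values, per layer, per KV head). The model dimensions are hypothetical and are not GLM-5.1's configuration, which the article does not give. Architectures that compress the cache would produce smaller numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of one request's KV Cache for standard attention.

    The leading factor of 2 counts one key tensor and one value tensor
    per layer. bytes_per_elem=2 corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes: int, link_gbps: float = 400.0) -> float:
    """Ideal transfer time over one link at full line rate (no overhead)."""
    return num_bytes * 8 / (link_gbps * 1e9)


if __name__ == "__main__":
    # Hypothetical model shape and context length, chosen for illustration.
    size = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128,
                          seq_len=128_000)
    print(f"KV Cache per request: {size / 1e9:.1f} GB")
    print(f"Ideal time on one 400 Gbps link: {transfer_seconds(size):.2f} s")
```

For this hypothetical shape, a single 128,000-token request carries about 33.6 GB of KV Cache, which takes about 0.67 seconds on one 400 Gbps link even at ideal line rate. Many such transfers converging on the same links is what produces the hotspots described above.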

ZCube is already running in a live production environment powering Z.ai's GLM-5.1 coding services. The architecture is designed to be highly scalable: using standard 51.2 Tbps switches, it can support a network of 16,384 or more 400 Gbps network interface cards (NICs). Z.ai indicates the design can scale further to support clusters containing hundreds of thousands of GPUs for next-generation AI infrastructure.
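The scale figure lines up with simple port arithmetic. A 51.2 Tbps switch split into 400 Gbps ports has a radix of 128, and 128 squared is 16,384. The article does not spell out how ZCube arrives at its number, so treat the match as an observation rather than a confirmed derivation. For comparison, the sketch also includes the textbook host counts for non-blocking two-tier leaf-spine and three-tier fat-tree networks built from the same switches.

```python
def ports_per_switch(switch_gbps: int, port_gbps: int) -> int:
    """Number of ports (radix) when switch capacity is split into equal ports."""
    return switch_gbps // port_gbps


def radix_squared(k: int) -> int:
    """k * k endpoints; matches the article's 16,384 figure for k = 128."""
    return k * k


def leaf_spine_hosts(k: int) -> int:
    """Non-blocking two-tier leaf-spine: k leaves, k/2 host ports each."""
    return k * k // 2


def fat_tree_hosts(k: int) -> int:
    """Non-blocking three-tier fat-tree: k^3 / 4 hosts."""
    return k ** 3 // 4


if __name__ == "__main__":
    k = ports_per_switch(switch_gbps=51_200, port_gbps=400)
    print(f"radix: {k} ports")
    print(f"radix squared: {radix_squared(k):,}")
    print(f"two-tier leaf-spine hosts: {leaf_spine_hosts(k):,}")
    print(f"three-tier fat-tree hosts: {fat_tree_hosts(k):,}")
```

On these numbers, a conventional non-blocking two-tier design tops out at 8,192 endpoints with the same switches, so reaching 16,384 within two switch hops would otherwise require adding a third tier.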
