Z.ai Resolves GLM-5 Infrastructure Race Conditions to Stabilize Long Horizon Coding Agents

Zhipu AI

Apr 30, 2026

Z.ai identified and fixed low-level race conditions in its GLM-5 inference stack that caused garbled outputs and repetition during high-concurrency coding tasks. By introducing a layer-wise cache storage scheme called LayerSplit, the lab also increased system throughput by up to 132% for long-context workloads.

Z.ai, the AI lab behind the GLM model series, published a technical post-mortem on the "Scaling Pain" it encountered while serving GLM-5.1 for complex coding tasks. The team identified two race conditions in its inference infrastructure (the systems that run a trained model) that corrupted the KV Cache, the memory that holds attention keys and values for previously processed tokens.

Throughput improvement: Up to 132%
Context length tested: 40K to 120K tokens
Optimization name: LayerSplit
Upstreamed to: SGLang (PR #22811)
Anomaly types fixed: Garbled output, repetition, rare characters

The investigation reflects a broader shift from vibe coding toward disciplined agentic engineering. As agents move from simple chat to long-running tasks supported by extended usage quotas, infrastructure reliability becomes as important as model weights. Standard metrics such as latency say little if cached state is silently corrupted along the way.

Alongside the race-condition fixes, Z.ai introduced LayerSplit, a scheme that partitions the KV Cache across GPUs by layer to raise throughput. The optimization parallels NVIDIA's inference stack rebuild for agentic workloads and is now live. Users can expect more stable performance for contexts up to 120K tokens, and some of the fixes have already been upstreamed to SGLang.

View the full update on z.ai

Z.ai (@Zai_org), Apr 29

Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain. https://t.co/81QCQw941P In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement. We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.

Still wondering? A few quick answers below.

What is LayerSplit?

LayerSplit is a layer-wise KV Cache storage scheme designed by Z.ai to improve inference performance for long-context coding tasks. Instead of keeping the cache for every model layer on every GPU, it partitions the KV Cache across the cluster by layer. This reduces memory pressure on each GPU and increases system throughput by up to 132 percent.
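To make the memory argument concrete, here is a minimal sketch of layer-wise partitioning. It assigns each layer's cache to one GPU and compares the per-GPU footprint with keeping every layer everywhere. The function names, the contiguous assignment policy, and the model dimensions are illustrative assumptions, not details from Z.ai's post.

```python
def kv_bytes_per_layer(tokens, kv_heads, head_dim, dtype_bytes=2):
    """Bytes needed to cache keys and values for one layer."""
    return 2 * tokens * kv_heads * head_dim * dtype_bytes  # 2 = key + value


def assign_layers(num_layers, num_gpus):
    """Split layers into contiguous, near-equal groups, one group per GPU."""
    base, extra = divmod(num_layers, num_gpus)
    plan, start = [], 0
    for gpu in range(num_gpus):
        count = base + (1 if gpu < extra else 0)
        plan.append(list(range(start, start + count)))
        start += count
    return plan


# Hypothetical model shape: 64 layers, 8 KV heads of dimension 128, fp16.
per_layer = kv_bytes_per_layer(tokens=120_000, kv_heads=8, head_dim=128)
plan = assign_layers(num_layers=64, num_gpus=8)

replicated = 64 * per_layer                               # every GPU holds every layer
split = max(len(layers) for layers in plan) * per_layer   # most loaded GPU

print(f"all layers on each GPU: {replicated / 2**30:.2f} GiB per GPU")
print(f"layer-wise split:       {split / 2**30:.2f} GiB per GPU")
```

With the layers spread evenly, each GPU holds only its own share of the cache, which is where the reduced memory pressure comes from. How the real scheme moves data between GPUs at attention time is not covered here.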

Why was GLM-5 producing garbled or repetitive text?

The abnormal outputs were caused by low-level race conditions in the inference infrastructure, not by the model itself. Specifically, a KV Cache reuse conflict occurred when aborted requests were not properly synchronized. Cache memory released by an aborted request could be handed to a new request while the old task was still writing to it, so the new request read corrupted data.
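One common way to close this kind of reuse window is to defer releasing a cache slot until every in-flight write to it has finished. The toy allocator below illustrates that idea only. The class, its methods, and the counting scheme are hypothetical and are not taken from Z.ai's or SGLang's code.

```python
class KVSlotPool:
    """Toy KV Cache slot allocator that delays reuse until writes finish."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))
        self.owner = {}            # slot -> request id
        self.pending_writes = {}   # slot -> number of in-flight writes
        self.deferred = set()      # slots aborted while writes were in flight

    def allocate(self, request_id):
        slot = self.free.pop()
        self.owner[slot] = request_id
        self.pending_writes[slot] = 0
        return slot

    def begin_write(self, slot):
        self.pending_writes[slot] += 1

    def end_write(self, slot):
        self.pending_writes[slot] -= 1
        if self.pending_writes[slot] == 0 and slot in self.deferred:
            self.deferred.discard(slot)
            self._release(slot)

    def abort(self, slot):
        """Release a slot, but only once no write can still touch it."""
        if self.pending_writes[slot] > 0:
            self.deferred.add(slot)   # unsafe to reuse yet
        else:
            self._release(slot)

    def _release(self, slot):
        del self.owner[slot]
        del self.pending_writes[slot]
        self.free.append(slot)


pool = KVSlotPool(num_slots=1)
slot = pool.allocate("old-request")
pool.begin_write(slot)                  # a kernel is still writing this slot
pool.abort(slot)                        # the request is aborted mid-write
reusable_too_early = slot in pool.free  # False: the slot is held back
pool.end_write(slot)                    # the write completes
reusable_after_write = slot in pool.free  # True: now safe to hand out
```

In the buggy pattern described above, `abort` would return the slot to the free list immediately, and the next `allocate` could hand out memory that an old write was still modifying.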

How does Z.ai detect anomalous model outputs?

Z.ai uses speculative decoding, a technique in which a small draft model proposes tokens that the larger model then verifies, as a real-time signal of output quality. Extremely low acceptance lengths often indicate garbled text, while unusually high acceptance rates can signal repetitive loops. If these metrics cross specific thresholds, the system proactively terminates the generation and triggers an automatic retry.
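The monitoring logic can be sketched as a sliding-window check on accepted draft lengths. The thresholds, window size, and names below are placeholder assumptions, since the post's actual values are not given here. For brevity the sketch inspects a finished attempt rather than interrupting generation midway.

```python
from collections import deque


class AcceptanceMonitor:
    """Flags a generation whose recent mean acceptance length looks abnormal."""

    def __init__(self, low=0.5, high=6.0, window=8):
        self.low = low      # hypothetical floor: below this, suspect garbling
        self.high = high    # hypothetical ceiling: above this, suspect loops
        self.history = deque(maxlen=window)

    def observe(self, accepted_len):
        self.history.append(accepted_len)
        if len(self.history) < self.history.maxlen:
            return "ok"     # not enough evidence yet
        mean = sum(self.history) / len(self.history)
        if mean < self.low:
            return "garbled"
        if mean > self.high:
            return "repetition"
        return "ok"


def first_anomaly(accepted_lengths, monitor):
    """Return (step, verdict) for the first abnormal step, or None."""
    for step, length in enumerate(accepted_lengths):
        verdict = monitor.observe(length)
        if verdict != "ok":
            return step, verdict
    return None


def generate_with_retry(run_attempt, max_retries=2):
    """run_attempt() returns (text, accepted_lengths); retry on anomaly."""
    for _ in range(max_retries + 1):
        text, lengths = run_attempt()
        if first_anomaly(lengths, AcceptanceMonitor()) is None:
            return text
    raise RuntimeError("generation kept tripping the anomaly monitor")
```

A production version would call `observe` after each verification step and stop decoding as soon as a verdict other than "ok" appears.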

What is HiCache and how was it fixed?

HiCache is a hierarchical KV caching system that swaps data between CPU and GPU memory to handle long-context inputs. Z.ai discovered a read-before-ready bug in which computation started before the data had been fully loaded from the CPU. The fix restructured the kernel pipeline to enforce explicit synchronization, so that all data is ready before the model begins its attention computation.
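The load-before-use ordering can be illustrated with plain Python threads, where a per-layer event stands in for whatever GPU-side synchronization the real pipeline uses. This is only an analogy under that assumption. The buffer class and function names are hypothetical and this is not the SGLang patch.

```python
import threading
import time


class LayerBuffer:
    """Holds one layer's cached data plus a flag saying it is fully loaded."""

    def __init__(self):
        self.data = None
        self.ready = threading.Event()


def load_layers(host_cache, buffers):
    """Simulate copying each layer's cache from CPU memory to the device."""
    for layer, payload in enumerate(host_cache):
        time.sleep(0.01)                 # stand-in for a slow host-to-device copy
        buffers[layer].data = payload
        buffers[layer].ready.set()       # publish only after the copy finishes


def attend(buffers):
    """Consume each layer only after its load has been signalled."""
    outputs = []
    for buf in buffers:
        buf.ready.wait()                 # explicit load-before-use ordering
        outputs.append(sum(buf.data))    # stand-in for attention over the layer
    return outputs


host_cache = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
buffers = [LayerBuffer() for _ in host_cache]
loader = threading.Thread(target=load_layers, args=(host_cache, buffers))
loader.start()
result = attend(buffers)
loader.join()
```

Removing the `wait()` call reproduces the read-before-ready pattern: the consumer reaches a buffer whose data has not arrived yet.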

Are the Z.ai GLM-5 infrastructure fixes open source?

While the full Z.ai inference stack is proprietary, the team has contributed specific fixes to the open-source community. For example, the fix for missing load-use ordering in hierarchical caching was submitted as a pull request to SGLang, an open-source inference framework. Other developers can therefore benefit from improved stability in high-concurrency, long-context AI agent scenarios.
