HeadsUpAI

Z.ai Resolves GLM-5 Infrastructure Race Conditions to Stabilize Long Horizon Coding Agents

Z.ai, the AI lab behind the GLM model series, published a technical post-mortem on the "Scaling Pain" it encountered while serving GLM-5.1 for complex coding tasks. The team identified two race conditions in its inference infrastructure (the systems that run a trained model) that corrupted the KV Cache (the memory that stores state for previously processed tokens).
Throughput improvement: up to 132%
Context length tested: 40K to 120K tokens
Optimization name: LayerSplit
Upstreamed to: SGLang (PR #22811)
Anomaly types fixed: garbled output, repetition, rare characters

This investigation marks a shift from vibe coding toward disciplined agentic engineering. As agents move from simple chat to long-running tasks supported by usage quota extensions, infrastructure reliability becomes as vital as model weights. Standard metrics like latency are insufficient if the system state is not perfectly preserved.

To address these bottlenecks, Z.ai introduced LayerSplit, a scheme that partitions the KV Cache across GPUs by layer. This optimization mirrors NVIDIA's inference stack rebuild for agentic workloads and is now live. Users can expect more stable performance for contexts up to 120K tokens, with some fixes already upstreamed to SGLang.

Z.ai (@Zai_org) on X:

Scaling laws push model capability forward. But whether that capability becomes reliable in production depends on how we handle Scaling Pain. https://t.co/81QCQw941P In our latest blog, we share how we debugged GLM-5 serving at scale: reproducing rare garbled outputs, repetition, and rare-character generation; tracing and eliminating KV Cache race conditions; fixing HiCache synchronization issues; and introducing LayerSplit for up to 132% throughput improvement. We hope these lessons help the community avoid similar pitfalls and build more robust inference infrastructure.


Still wondering? A few quick answers below.

LayerSplit is a layer-wise KV Cache storage scheme designed by Z.ai to improve inference performance for long-context coding tasks. Instead of storing the KV Cache for every model layer on every GPU, it partitions the cache (the memory that holds state for previous tokens) across the cluster by layer. This reduces per-GPU memory pressure and increases system throughput by up to 132 percent.
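
The idea of splitting KV storage by layer can be illustrated with a small sketch. This is not Z.ai's implementation: the post does not describe the assignment policy, so the contiguous, near-equal grouping and the names `assign_layers` and `kv_entries_per_gpu` are assumptions made for illustration only.

```python
def assign_layers(num_layers: int, num_gpus: int) -> list[list[int]]:
    """Split layer indices into contiguous, near-equal groups, one per GPU.

    Hypothetical policy: the real LayerSplit assignment is not described.
    """
    base, extra = divmod(num_layers, num_gpus)
    groups, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def kv_entries_per_gpu(groups: list[list[int]], tokens: int) -> list[int]:
    """Count the (layer, token) KV entries each GPU must hold."""
    return [len(layers) * tokens for layers in groups]


groups = assign_layers(num_layers=10, num_gpus=4)
replicated = 10 * 120_000            # every GPU stores every layer's KV
split = kv_entries_per_gpu(groups, tokens=120_000)
print(groups)
print(replicated, max(split))
```

In this toy setup, the busiest GPU holds KV entries for 3 of the 10 layers rather than all 10, which is the memory-pressure reduction the answer above describes.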

The abnormal outputs were caused by low-level race conditions in the inference infrastructure rather than by the model itself. Specifically, a KV Cache reuse conflict occurred when aborted requests were not properly synchronized: their cache memory was handed to new requests while old tasks were still writing to it. As a result, new requests read corrupted data from those memory addresses.
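
One common way to prevent this class of bug is to delay recycling a cache block until all in-flight writes to it have finished. The sketch below shows that pattern under stated assumptions: the `KVBlockPool` class and its methods are hypothetical, and the post does not say that Z.ai's fix takes this exact form.

```python
class KVBlockPool:
    """Toy KV block allocator that defers reuse of blocks with pending writes."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.inflight = {}      # block id -> number of pending writes
        self.draining = set()   # aborted blocks still being written

    def allocate(self) -> int:
        if not self.free:
            raise RuntimeError("no free KV blocks")
        block = self.free.pop()
        self.inflight[block] = 0
        return block

    def begin_write(self, block: int) -> None:
        self.inflight[block] += 1

    def end_write(self, block: int) -> None:
        self.inflight[block] -= 1
        # Recycle only once the last pending write has landed.
        if self.inflight[block] == 0 and block in self.draining:
            self._recycle(block)

    def abort(self, block: int) -> None:
        # An aborted request must not free memory that is still being written.
        if self.inflight[block] > 0:
            self.draining.add(block)
        else:
            self._recycle(block)

    def _recycle(self, block: int) -> None:
        self.draining.discard(block)
        del self.inflight[block]
        self.free.append(block)


pool = KVBlockPool(num_blocks=1)
block = pool.allocate()
pool.begin_write(block)
pool.abort(block)          # request aborted while a write is in flight
print(pool.free)           # block is not yet reusable
pool.end_write(block)
print(pool.free)           # block is reusable only after the write completes
```

Without the `draining` step, `abort` would return the block to the free list immediately, and a new request could read it while the old write was still landing.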

Z.ai uses speculative decoding (a technique where a small draft model proposes tokens that the larger model then verifies) as a real-time signal for output quality. Extremely low acceptance lengths often indicate garbled text, while unusually high acceptance rates can signal repetitive loops. If these metrics cross specific thresholds, the system proactively terminates the generation and triggers an automatic retry.
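
A monitor of this kind might look roughly like the following sketch. The sliding window, the threshold values, and the `AcceptanceMonitor` name are all assumptions for illustration. The post does not publish Z.ai's actual thresholds or logic.

```python
from collections import deque


class AcceptanceMonitor:
    """Flag a generation when the mean accepted draft length leaves a safe band."""

    def __init__(self, window: int = 8, low: float = 0.5, high: float = 3.5):
        # window, low, and high are made-up values, not Z.ai's settings.
        self.lengths = deque(maxlen=window)
        self.low = low
        self.high = high

    def observe(self, accepted_len: int) -> str:
        """Record one verification step and return a verdict."""
        self.lengths.append(accepted_len)
        if len(self.lengths) < self.lengths.maxlen:
            return "ok"  # not enough evidence yet
        mean = sum(self.lengths) / len(self.lengths)
        if mean < self.low:
            return "abort_garbled"      # draft tokens almost never accepted
        if mean > self.high:
            return "abort_repetition"   # draft tokens accepted suspiciously often
        return "ok"


monitor = AcceptanceMonitor(window=4)
for accepted in [0, 0, 0, 0]:
    verdict = monitor.observe(accepted)
print(verdict)
```

A serving loop would stop the generation and retry the request whenever the verdict is not `"ok"`.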

HiCache is a hierarchical KV caching system that swaps data between CPU and GPU memory to handle long-context inputs. Z.ai discovered a read-before-ready bug where computations started before data was fully loaded from the CPU. The fix involved restructuring the kernel pipeline to enforce explicit synchronization, ensuring all data is ready before the model begins its attention computation.
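
The load-before-use ordering can be shown with a CPU-only analogy. In the sketch below, a `threading.Event` stands in for GPU stream or event synchronization, and a simple sum stands in for attention. The function name and structure are hypothetical and do not reflect the actual HiCache kernel pipeline.

```python
import threading


def load_then_attend(host_blocks: list[int]) -> int:
    """Copy blocks to a 'device' buffer in the background, then compute on them."""
    device_blocks = [None] * len(host_blocks)
    ready = threading.Event()

    def loader() -> None:
        for i, block in enumerate(host_blocks):
            device_blocks[i] = block   # stand-in for a host-to-device copy
        ready.set()                    # signal only after every block is copied

    thread = threading.Thread(target=loader)
    thread.start()

    # Explicit synchronization: never read the buffer before the load is done.
    # Skipping this wait reproduces the read-before-ready hazard.
    ready.wait()
    result = sum(device_blocks)        # stand-in for the attention computation
    thread.join()
    return result


print(load_then_attend([1, 2, 3]))
```

The important design choice is that readiness is signaled by the loader and awaited by the consumer, rather than inferred from timing.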

While the full Z.ai inference stack is proprietary, the team has contributed specific fixes to the open-source community. For example, the fix for missing load-use ordering in hierarchical caching was submitted as a pull request to SGLang, an open-source inference framework. This allows other developers to benefit from improved stability in high-concurrency, long-context AI agent scenarios.
