HeadsUpAI

GLM-OCR Hits 3M Downloads, Technical Report Released on arXiv


GLM-OCR, a compact 0.9B-parameter multimodal model for document understanding from Z.ai, the AI company behind the GLM model family, has reached 3 million downloads. The accompanying technical report details its two-component architecture: a 0.4B-parameter CogViT visual encoder paired with a 0.5B-parameter GLM language decoder. To accelerate inference, the model introduces Multi-Token Prediction (MTP), which predicts multiple tokens per decoding step instead of one. This improves throughput, and shared parameters keep the memory overhead low. A two-stage pipeline first runs layout analysis with PP-DocLayout-V3, then recognizes the detected regions in parallel.
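The two-stage flow can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than GLM-OCR's actual code or API: `Region`, `detect_layout`, `recognize_region`, and `parse_page` are hypothetical names, and both stages are stubs standing in for the layout model and the OCR model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # hypothetical labels, e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels


def detect_layout(page):
    """Stage 1 stub: stands in for a layout model such as PP-DocLayout-V3."""
    # Fixed, made-up output; a real detector would inspect the page image.
    return [
        Region("text", (0, 0, 100, 20)),
        Region("table", (0, 30, 100, 80)),
        Region("formula", (0, 90, 100, 110)),
    ]


def recognize_region(page, region):
    """Stage 2 stub: stands in for running the OCR model on one cropped region."""
    return f"<{region.kind} at {region.bbox}>"


def parse_page(page, max_workers=4):
    """Detect regions first, then recognize all of them concurrently."""
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so outputs line up with regions.
        outputs = list(pool.map(lambda r: recognize_region(page, r), regions))
    return list(zip(regions, outputs))


if __name__ == "__main__":
    for region, text in parse_page("page.png"):
        print(region.kind, text)
```

Because each region is recognized independently, the second stage can be spread across workers (or batched on an accelerator) without changing the order of the results.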

According to the report, evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance across document parsing, formula transcription, table structure recovery, and key information extraction. Its compact architecture targets both edge deployment and large-scale production systems.

To check whether the MTP throughput gains hold for your workload, run the model on your own document processing pipeline. The technical report covers full benchmark results and architecture specs.
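One way to run that comparison is a small timing harness. The sketch below is an assumption about how you might measure it, not part of GLM-OCR: `measure_throughput` and `speedup` are hypothetical helpers, and `generate` is whatever callable wraps your own inference setup (for example, one configuration with MTP enabled and one without).

```python
import time


def measure_throughput(generate, inputs, count_tokens):
    """Return output tokens per second for `generate` run over `inputs`."""
    total_tokens = 0
    start = time.perf_counter()
    for item in inputs:
        output = generate(item)
        total_tokens += count_tokens(output)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast stubs.
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate to baseline throughput (above 1.0 means faster)."""
    return candidate_tps / baseline_tps
```

Run both configurations on the same representative sample of your documents, and repeat the measurement a few times so that warm-up and caching effects do not dominate the result.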

Z.ai (@Zai_org) on X:

GLM-OCR has accumulated over 3M downloads. We are releasing its technical report: https://t.co/KHFgnnDfYh We welcome your feedback!

