Small Volumes of Expert-Labeled Data Nearly Double AI Model Performance

Mercor


Mercor and Applied Compute post-trained an open-source model on fewer than 1,000 expert-labeled tasks and nearly doubled its Pass@1 score on the APEX-Agents benchmark. Corporate law scores tripled. The result suggests that, for specialized professional work, a small volume of high-quality data can matter more than dataset size.

Mercor, an AI talent and evaluation platform, partnered with Applied Compute, a startup founded by former OpenAI researchers, to post-train GLM 4.6 on 874 expert-labeled tasks spanning investment banking, consulting, and corporate law. They evaluated the model on APEX-Agents, a benchmark of 480 professional tasks created by VPs and Managing Directors with 10+ years at top firms. Pass@1 and mean scores nearly doubled, and corporate law scores tripled. The training curve was near-linear.
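For readers unfamiliar with the two metrics, here is a minimal sketch of how Pass@1 and a mean score are commonly computed from per-task results. It assumes Pass@1 is the share of tasks passed on a single attempt, averaged over independent runs, and that the mean score is the average per-task rubric fraction. The APEX-Agents report may define them differently, and the task names and numbers below are hypothetical toy data, not results from the experiment.

```python
from statistics import mean


def pass_at_1(attempts):
    """Estimate Pass@1 from repeated runs.

    attempts maps a task id to a list of booleans, one per independent run
    (True means the run met every criterion for that task). Each task's
    single-attempt pass rate is averaged over its runs, then over tasks.
    """
    per_task = [mean(1.0 if ok else 0.0 for ok in runs) for runs in attempts.values()]
    return mean(per_task)


def mean_score(scores):
    """Average rubric score, where each run is a fraction in [0, 1]."""
    per_task = [mean(runs) for runs in scores.values()]
    return mean(per_task)


# Hypothetical toy data: four tasks, two runs each.
toy_attempts = {
    "banking_1": [True, False],
    "consulting_1": [False, False],
    "law_1": [True, True],
    "law_2": [False, False],
}
toy_scores = {
    "banking_1": [1.0, 0.5],
    "consulting_1": [0.25, 0.25],
    "law_1": [1.0, 1.0],
    "law_2": [0.0, 0.5],
}

print(f"Pass@1: {pass_at_1(toy_attempts):.3f}")
print(f"Mean score: {mean_score(toy_scores):.3f}")
```

Averaging within each task first keeps a task with more recorded runs from weighing more heavily than the others.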

This challenges the assumption that improving models requires massive datasets. In the corporate law test, the baseline model produced a professional-looking but factually wrong memo. The post-trained model correctly identified tax code violations and cited specific U.S. Code (USC) sections, which points to better reasoning rather than pattern matching alone. A few hundred expert examples were enough to produce these gains.

The APEX-Agents benchmark and dataset are available on Mercor's Hugging Face page, and the experiment infrastructure is on its GitHub. The technical report includes trajectory-level observability showing how models attempted each task.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.
