HeadsUpAI

Small Volumes of Expert-Labeled Data Nearly Double AI Model Performance


Mercor, an AI talent and evaluation platform, partnered with Applied Compute, a startup founded by former OpenAI researchers, to post-train GLM 4.6 on 874 expert-labeled tasks spanning investment banking, consulting, and corporate law. They evaluated the result on APEX-Agents, a benchmark of 480 professional tasks written by VPs and Managing Directors with 10+ years at top firms. Pass@1 and mean scores nearly doubled overall, and the corporate law results tripled. The training curve was near-linear.
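For readers unfamiliar with the two metrics, here is a minimal sketch of how Pass@1 and a mean score could be computed from per-task results. It assumes Pass@1 is the share of tasks whose single attempt passes and that the mean score is the average of per-task scores between 0 and 1. The `TaskResult` fields, the function names, and the toy values are all hypothetical and are not taken from the APEX-Agents report or code.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One model attempt at one task (hypothetical schema, for illustration only)."""
    task_id: str
    domain: str
    score: float   # assumed: fraction of grading criteria met, from 0.0 to 1.0
    passed: bool   # assumed: True if the single attempt met the pass bar


def pass_at_1(results):
    """Share of tasks whose one and only attempt passed."""
    return sum(r.passed for r in results) / len(results)


def mean_score(results):
    """Average per-task score across all tasks."""
    return mean(r.score for r in results)


def by_domain(results):
    """Group results by domain and report (pass@1, mean score) for each."""
    groups = {}
    for r in results:
        groups.setdefault(r.domain, []).append(r)
    return {d: (pass_at_1(rs), mean_score(rs)) for d, rs in groups.items()}


# Toy values, invented purely to show the arithmetic (not benchmark data).
toy = [
    TaskResult("t1", "corporate_law", 1.0, True),
    TaskResult("t2", "corporate_law", 0.5, False),
]
print(pass_at_1(toy), mean_score(toy))  # 0.5 0.75
```

Comparing a baseline and a post-trained model then amounts to calling `by_domain` on each model's results and looking at the ratio per domain.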

This challenges the assumption that improving models requires massive datasets. In the corporate law test, the baseline produced a memo that looked professional but was factually wrong. The post-trained model correctly identified tax code violations and cited specific U.S. Code sections, which points to better reasoning rather than mere pattern matching. In this experiment, a few hundred expert-labeled examples delivered gains usually attributed to sheer data volume.

The APEX-Agents benchmark and dataset are available on Mercor's Hugging Face page, and the experiment infrastructure is on its GitHub. The technical report includes trajectory-level observability showing how models attempted each task.
