https://t.co/CxrLJa2Yk3
Small Volumes of Expert-Labeled Data Nearly Double AI Model Performance
· Updated
Mercor and Applied Compute post-trained an open-source model using fewer than 1,000 expert-labeled tasks, nearly doubling Pass@1 scores on the APEX-Agents benchmark. Corporate law scores tripled. Small volumes of high-quality data can dramatically outperform massive datasets for specialized professional work.
This challenges the assumption that improving models requires massive datasets. In the corporate law test, the baseline produced a professional-looking but factually wrong memo. The post-trained model correctly identified tax code violations, citing specific USC sections - better reasoning, not just pattern matching. Hundreds of expert examples outperformed volume.
The APEX-Agents benchmark and dataset are on Mercor's Hugging Face, with experiment infrastructure on their GitHub. The technical report includes trajectory-level observability showing how models attempted each task.
Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →





