Small Volumes of Expert-Labeled Data Nearly Double AI Model Performance

Mercor

Jan 28, 2026 · Updated Apr 25, 2026

Mercor and Applied Compute post-trained an open-source model on fewer than 1,000 expert-labeled tasks, nearly doubling its Pass@1 score on the APEX-Agents benchmark. Corporate law scores tripled. The result suggests that small volumes of high-quality data can deliver large gains on specialized professional work without massive datasets.

Mercor, an AI talent and evaluation platform, partnered with Applied Compute, a startup founded by former OpenAI researchers, to post-train GLM 4.6 on 874 expert-labeled tasks spanning investment banking, consulting, and corporate law. They evaluated the model on APEX-Agents, a benchmark of 480 professional tasks created by VPs and Managing Directors with 10+ years at top firms. Pass@1 and mean scores nearly doubled, and corporate law scores tripled. The training curve was near-linear.
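For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@1 score is commonly computed. It uses the standard unbiased pass@k estimator averaged over tasks. The article does not describe how APEX-Agents implements its scoring, so the function names, task IDs, and outcomes below are hypothetical and chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of them passing."""
    if n - c < k:
        # Fewer than k failing attempts: any sample of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average the per-task pass@1 estimates over all tasks."""
    per_task = [pass_at_k(len(runs), sum(runs), 1) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes (True = attempt met the task's criteria).
# These are NOT results from the APEX-Agents benchmark.
baseline = {
    "law-01": [False, False, False, True],
    "bank-01": [True, False, False, False],
}
post_trained = {
    "law-01": [True, True, False, True],
    "bank-01": [False, True, False, False],
}

before = benchmark_pass_at_1(baseline)
after = benchmark_pass_at_1(post_trained)
print(f"Pass@1 before: {before:.2f}, after: {after:.2f}, ratio: {after / before:.1f}x")
```

With k = 1 the estimator reduces to the fraction of passing attempts per task, so a benchmark-level Pass@1 "doubling" means the average first-attempt success rate doubled.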

This challenges the assumption that improving models requires massive datasets. In the corporate law test, the baseline model produced a professional-looking but factually wrong memo. The post-trained model correctly identified tax code violations and cited specific USC sections, a sign of better reasoning rather than pattern matching alone. A few hundred expert examples were enough to produce these gains.

The APEX-Agents benchmark and dataset are available on Mercor's Hugging Face page, and the experiment infrastructure is on its GitHub. The technical report includes trajectory-level observability showing how models attempted each task.

View the full update on x.com: Mercor (@mercor_ai), Jan 28

https://t.co/CxrLJa2Yk3

Keep reading

Mercor Launches APEX-SWE Benchmark for Real Production Software Engineering

Mercor and Cognition launched APEX-SWE, a benchmark testing AI models on real software engineering — system integration, debugging production failures — not just writing code. Traditional benchmarks miss 84% of dev work. Even the top model scores just 41.5%.

Google Gemini 3.5 Flash Beats Larger Models on Agentic Benchmark

Google AI Studio · May 22

Gemini 3.5 Flash has ranked first on the APEX-Agents-AA benchmark, outperforming larger frontier models in autonomous task execution. The result confirms that high-speed, low-cost models are now capable of handling complex agentic workflows previously reserved for larger architectures.

SkillsBench Measures Whether Agent Skills Actually Improve AI Performance

Kol Tregaskes · Mar 2

SkillsBench launched as a benchmark of 86 tasks across 11 domains, testing whether agent skills actually improve AI agent performance. Curated human-authored skills raise pass rates by 16.2 percentage points on average, while self-generated skills provide no benefit.

LangChain Research Makes AI Agent Post-Training Verification 1000x Cheaper

LangChain · Jun 7

LangChain Labs and Harvey published a study demonstrating how to significantly reduce the cost of LLM-as-judge verifiers for AI agents. Their research shows that batching verifier calls and using open-weight models can cut costs by up to 1,000 times. This makes it more practical to run extensive experiments and accelerate the iteration cycle for agent development, especially in complex domains like legal work.