New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel. On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest. https://t.co/BYqr76zxhk
Anthropic Benchmark Shows Claude Solving Biological Research Problems That Stump Experts
- Total evaluation problems: 99
- Human-difficult problems: 23
- Claude Mythos Preview solve rate (human-difficult): 30%
- Claude Opus 4.6 solve rate (human-solvable): 77.4%
- Claude Sonnet 4.6 solve rate (human-solvable): approx. 70%
- Availability: Hugging Face (BioMysteryBench-preview)
The shift toward agentic science evaluations addresses a gap in AI research: models have long passed medical exams, but BioMysteryBench measures whether they can function as autonomous researchers. Of its 99 problems, 23 stumped a panel of five domain experts. Claude Mythos Preview, a restricted model with near-superhuman reasoning, solved roughly 30% of those human-difficult problems.
The benchmark preview, BioMysteryBench-preview, is now available on Hugging Face for testing scientific agents in containerized environments. The data suggests frontier models are becoming viable collaborators for scientific discovery, though their wins on problems that stump experts remain brittle and less reliable than their performance on standard tasks.
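To put the percentages in terms of problem counts, here is a rough back-of-the-envelope sketch in Python. It assumes that every problem not labeled human-difficult counts as human-solvable, and that each solve rate applies to its whole subset. The implied counts are approximations rather than reported results; a fractional value may simply mean the rates were averaged over repeated runs.

```python
# Figures taken from the summary above.
TOTAL_PROBLEMS = 99
HUMAN_DIFFICULT = 23

# Assumption: all remaining problems form the human-solvable subset.
HUMAN_SOLVABLE = TOTAL_PROBLEMS - HUMAN_DIFFICULT

# Reported solve rates as (fraction, subset size).
# The Sonnet figure is only approximate in the source.
RATES = {
    "Claude Mythos Preview (human-difficult)": (0.30, HUMAN_DIFFICULT),
    "Claude Opus 4.6 (human-solvable)": (0.774, HUMAN_SOLVABLE),
    "Claude Sonnet 4.6 (human-solvable)": (0.70, HUMAN_SOLVABLE),
}


def implied_solved(rate: float, n_problems: int) -> float:
    """Approximate number of problems that a solve rate corresponds to."""
    return rate * n_problems


if __name__ == "__main__":
    print(f"Human-solvable problems: {HUMAN_SOLVABLE}")
    for label, (rate, n) in RATES.items():
        print(f"{label}: about {implied_solved(rate, n):.1f} of {n}")
```

Under these assumptions, the 30% figure corresponds to about 7 of the 23 problems the expert panel could not solve.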