HeadsUpAI

Anthropic Benchmark Shows Claude Solving Biological Research Problems That Stump Experts


Anthropic released BioMysteryBench, a bioinformatics benchmark that tasks AI with analyzing raw DNA, RNA, and metabolic data. Unlike earlier evaluations built on multiple-choice questions, it poses 99 problems whose answers can be verified against objective ground truth in messy, real-world biological datasets.
Total evaluation problems: 99
Human-difficult problems: 23
Claude Mythos Preview solve rate (human-difficult): 30%
Claude Opus 4.6 solve rate (human-solvable): 77.4%
Claude Sonnet 4.6 solve rate (human-solvable): approx. 70%
Availability: Hugging Face (BioMysteryBench-preview)

This shift toward agentic science evaluations addresses a critical gap in AI research. While models have long passed medical exams, BioMysteryBench measures whether they can function as autonomous researchers. The results show Claude Mythos Preview, a restricted model with near-superhuman reasoning, solving roughly 30% of the 23 human-difficult problems that a panel of five domain experts could not crack.

You can now access the benchmark preview on Hugging Face to test scientific agents in containerized environments. The data suggests frontier models are becoming viable collaborators for scientific discovery, though their superhuman wins remain brittle and less reliable than their performance on standard tasks.

Anthropic (@AnthropicAI) on X:

New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel. On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest. https://t.co/BYqr76zxhk

77 retweets, 955 likes

Still wondering? A few quick answers below.

BioMysteryBench is a bioinformatics evaluation framework developed by Anthropic to test AI models on complex, open-ended research problems. It consists of 99 questions derived from real-world biological data, such as DNA and RNA sequencing. Unlike traditional benchmarks, it requires models to use bioinformatics tools and databases to reach objective, verifiable conclusions about messy biological systems.

In Anthropic's testing, the latest Claude models handled open-ended scientific analysis well. A panel of five domain experts was unable to solve 23 of the most difficult problems, yet the Claude Mythos Preview model solved roughly 30% of those cases. On tasks that humans could solve, Claude Opus 4.6 reached a 77.4% solve rate and Claude Sonnet 4.6 roughly 70%, often mirroring or improving upon human analytical strategies.
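
To put those percentages in absolute terms, here is a rough back-of-envelope calculation. It assumes the 23 expert-stumping problems are a subset of the 99, so that the remaining 76 count as human-solvable; the reported figures do not state that split explicitly, and the fractional results suggest the published rates may be averaged over repeated runs.

```python
# Figures reported for BioMysteryBench.
TOTAL_PROBLEMS = 99
HUMAN_DIFFICULT = 23

# Assumption: every problem that is not human-difficult is human-solvable.
human_solvable = TOTAL_PROBLEMS - HUMAN_DIFFICULT


def expected_solved(num_problems: int, solve_rate: float) -> float:
    """Convert a solve rate into an expected number of solved problems."""
    return num_problems * solve_rate


# Claude Mythos Preview on the human-difficult set (reported as 30%).
difficult_solved = expected_solved(HUMAN_DIFFICULT, 0.30)

# Claude Opus 4.6 on the human-solvable set (reported as 77.4%).
solvable_solved = expected_solved(human_solvable, 0.774)

print(f"Human-solvable problems (assumed): {human_solvable}")
print(f"Human-difficult problems solved: about {difficult_solved:.1f} of {HUMAN_DIFFICULT}")
print(f"Human-solvable problems solved: about {solvable_solved:.1f} of {human_solvable}")
```

Under these assumptions, the 30% figure corresponds to roughly 7 of the 23 problems the expert panel could not solve.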

Claude employs two primary strategies when tackling bioinformatics tasks. First, it utilizes a vast internal knowledge base of structural biology and molecular profiles drawn from hundreds of thousands of scientific papers. Second, when uncertain, the model layers multiple analytical methods and combines different lines of evidence to verify its conclusions, a technique that helps it navigate noisy or complex datasets.

Yes, Anthropic has made a preview of BioMysteryBench available to the research community. Interested users can access the dataset and evaluation framework through Hugging Face to test the scientific capabilities of their own AI agents. The benchmark is designed to run in containerized environments where models have access to standard bioinformatics tools and external databases.

While Claude solves problems that stump experts, its performance on difficult tasks is currently brittle. On human-solvable problems, the model is highly consistent, but on harder questions, many correct answers come from reasoning paths it cannot reliably reproduce. This reliability gap suggests that while the capability frontier is moving, consistent superhuman performance is still developing.
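
One way to make "brittle" concrete is to rerun each problem several times and compare how many problems are solved at least once with how many are solved on most attempts. The sketch below is illustrative only: the function names, the 0.8 threshold, and the toy run data are assumptions for this example, not Anthropic's scoring method or real benchmark results.

```python
from typing import Dict, List


def solve_rates(runs: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-problem fraction of attempts that produced a correct answer."""
    return {pid: sum(attempts) / len(attempts) for pid, attempts in runs.items() if attempts}


def reliability_summary(runs: Dict[str, List[bool]], threshold: float = 0.8) -> Dict[str, float]:
    """Compare 'solved at least once' with 'solved on most attempts'.

    A large gap between the two fractions indicates brittle performance:
    correct answers exist but are not reproduced consistently.
    """
    rates = solve_rates(runs)
    if not rates:
        raise ValueError("no problems with recorded attempts")
    total = len(rates)
    ever_solved = sum(1 for rate in rates.values() if rate > 0)
    reliably_solved = sum(1 for rate in rates.values() if rate >= threshold)
    return {
        "solved_at_least_once": ever_solved / total,
        "solved_reliably": reliably_solved / total,
    }


# Hypothetical outcomes: five attempts on each of four made-up problems.
toy_runs = {
    "p1": [True, True, True, True, True],       # consistent
    "p2": [True, False, False, False, False],   # solved once, not reproduced
    "p3": [False, False, False, False, False],  # never solved
    "p4": [True, True, True, True, False],      # mostly consistent
}

summary = reliability_summary(toy_runs)
print(summary)
```

In this toy data, three of four problems are solved at least once but only two are solved reliably; that kind of gap is what the reliability concern on harder questions describes.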
