MotherDuck Uses Claude to Psychoanalyze Its Own Analytics Agent

MotherDuck

Mar 21, 2026 · Updated Apr 25, 2026

MotherDuck ran 50 BIRD-Bench text-to-SQL questions through Claude Opus 4.5 and its MCP Server, then used Claude sub-agents to classify every chain-of-thought trace. Single-shot answers hit 91% accuracy, while iterative loops landed at 64%, a gap that shows how agents reason under ambiguity.

MotherDuck, a cloud analytics platform built on DuckDB, ran a 50-question BIRD-Bench sample through Claude Opus 4.5 and the MotherDuck MCP Server. Every run captured a chain-of-thought JSON trace containing the agent's internal monologue, tool calls, and query results. Claude sub-agents then classified each trace in an "LLM as judge" setup: Opus orchestrated, and Sonnet sub-agents classified each trace along three dimensions: query iteration pattern, error recovery, and tool effectiveness.

The results show a clear split: single-shot executions (23 traces) succeeded 91% of the time, while iterative loops (25 traces) succeeded only 64% of the time. Iteration isn't straightforwardly bad, though: agents frequently hit a wall, pivoted, and recovered. The clearest failure came from semantically similar columns (position vs. rank), where the agent mapped the wrong concept even after writing the correct query moments before.
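The split described above is a per-pattern success rate over classified traces. A minimal sketch of that aggregation follows, assuming each judged trace is reduced to a record with an iteration pattern and a correctness flag. The `TraceVerdict` type, its field names, the pattern labels, and the sample records are all hypothetical illustrations, not MotherDuck's trace schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceVerdict:
    """Hypothetical record for one judged trace (not MotherDuck's schema)."""
    question_id: str
    iteration_pattern: str  # assumed labels, e.g. "single_shot" or "iterative"
    correct: bool


def success_rate_by_pattern(verdicts):
    """Return {pattern: fraction of traces marked correct} for each pattern seen."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for verdict in verdicts:
        totals[verdict.iteration_pattern] += 1
        if verdict.correct:
            hits[verdict.iteration_pattern] += 1
    return {pattern: hits[pattern] / totals[pattern] for pattern in totals}


# Toy records for illustration only; these are not the published results.
sample = [
    TraceVerdict("q1", "single_shot", True),
    TraceVerdict("q2", "single_shot", True),
    TraceVerdict("q3", "iterative", True),
    TraceVerdict("q4", "iterative", False),
]

rates = success_rate_by_pattern(sample)
for pattern, rate in rates.items():
    print(f"{pattern}: {rate:.0%}")
```

Keeping the verdict record separate from the aggregation means further judged dimensions, such as error recovery or tool effectiveness, could be added as fields and grouped the same way.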

Run the MotherDuck MCP Server against your own datasets and study the chain-of-thought traces: they expose where your agent's reasoning breaks down under schema ambiguity.

View the full update on motherduck.com

MotherDuck

@motherduck · Mar 20

What happens when you use Claude to psychoanalyze Claude? We ran 50 BIRD-Bench questions through a testing harness using Claude Opus 4.5 and our MCP Server, harvested every chain-of-thought trace, then deployed a team of Claude sub-agents to classify what went right, what went wrong, and why. We're calling it Claude-ception. New on the blog 👇️ https://t.co/VkJD1K58PL


