What happens when you use Claude to psychoanalyze Claude? We ran 50 BIRD-Bench questions through a testing harness using Claude Opus 4.5 and our MCP Server, harvested every chain-of-thought trace, then deployed a team of Claude sub-agents to classify what went right, what went wrong, and why. We're calling it Claude-ception. New on the blog: https://t.co/VkJD1K58PL
MotherDuck Uses Claude to Psychoanalyze Its Own Analytics Agent
MotherDuck, a cloud analytics platform built on DuckDB, ran a 50-question BIRD-Bench sample through Claude Opus 4.5 and the MotherDuck MCP Server. Every run captured a chain-of-thought JSON trace: the agent's internal monologue, tool calls, and query results. Claude sub-agents then classified each trace in an "LLM as judge" setup: Opus orchestrated, while Sonnet sub-agents classified each trace by query iteration pattern, error recovery, and tool effectiveness.
The results show a clear split: single-shot executions (23 traces) succeeded 91% of the time, while iterative loops (25 traces) succeeded only 64% of the time. Iteration isn't straightforwardly bad: agents frequently hit a wall, pivoted, and recovered. The clearest failure came from semantically similar columns (position vs rank): the agent mapped the wrong concept even after writing the correct query moments before.
Run the MotherDuck MCP Server against your own datasets and study the chain-of-thought traces. They expose where your agent's reasoning breaks down under schema ambiguity.
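If you collect and classify traces of your own, the single-shot versus iterative comparison comes down to a per-pattern tally. Here is a minimal sketch in Python. It assumes each classified trace is a dict with `iteration_pattern` and `correct` keys; those field names and the sample records are hypothetical, not the actual trace format or data from the study.

```python
from collections import defaultdict


def success_rate_by_pattern(traces):
    """Return {pattern: (n_traces, success_rate)} for classified traces."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for trace in traces:
        pattern = trace["iteration_pattern"]  # hypothetical field name
        totals[pattern] += 1
        if trace["correct"]:  # hypothetical field name
            passed[pattern] += 1
    return {p: (totals[p], passed[p] / totals[p]) for p in totals}


# Made-up sample records for illustration only, not data from the study.
sample = [
    {"iteration_pattern": "single_shot", "correct": True},
    {"iteration_pattern": "single_shot", "correct": True},
    {"iteration_pattern": "iterative", "correct": True},
    {"iteration_pattern": "iterative", "correct": False},
]

rates = success_rate_by_pattern(sample)
for pattern, (count, rate) in rates.items():
    print(f"{pattern}: {count} traces, {rate:.0%} success")
```

Reporting the trace count next to each rate matters with samples this small, since a single trace moves the percentage by several points.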
MotherDuck
@motherduck


