What happens when you use Claude to psychoanalyze Claude? We ran 50 BIRD-Bench questions through a testing harness using Claude Opus 4.5 and our MCP Server, harvested every chain-of-thought trace, then deployed a team of Claude sub-agents to classify what went right, what went wrong, and why. We're calling it Claude-ception. New on the blog 👇️ https://t.co/VkJD1K58PL
MotherDuck Uses Claude to Psychoanalyze Its Own Analytics Agent
MotherDuck ran 50 BIRD-Bench text-to-SQL questions through Claude Opus 4.5 and its MCP Server, then used Claude sub-agents to classify every chain-of-thought trace. Single-shot answers hit 91% accuracy, while iterative loops landed at 64%, a gap that shows where agents struggle under ambiguity.
MotherDuck ran a 50-question BIRD-Bench sample through Claude Opus 4.5 and the MotherDuck MCP Server. Every run captured a chain-of-thought JSON trace: the agent's internal monologue, tool calls, and query results. Claude sub-agents then classified each trace in an "LLM as judge" setup, with Opus orchestrating and Sonnet sub-agents classifying along three dimensions: query iteration pattern, error recovery, and tool effectiveness.
The results show a clear split: single-shot executions (23 traces) succeeded 91% of the time, while iterative loops (25 traces) succeeded only 64% of the time. Iteration isn't straightforwardly bad, though: agents frequently hit a wall, pivoted, and recovered. The clearest failure came from semantically similar columns (position vs rank): the agent mapped the wrong concept even after writing the correct query moments before.
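To make the split concrete, the reported percentages can be turned back into success counts. The sketch below is a back-of-the-envelope check, not MotherDuck's analysis code: it assumes the published rates are rounded to the nearest whole percent, and the names `reported` and `implied_successes` are made up for illustration. Under that assumption each group has exactly one consistent count, which also gives a combined rate across the 48 classified traces. The summary does not say how the remaining two of the 50 questions were categorized.

```python
# Figures as reported in the summary: (trace count, success rate in percent).
reported = {"single_shot": (23, 91), "iterative": (25, 64)}


def implied_successes(total: int, pct: int) -> list[int]:
    """Return every success count whose rounded percentage equals pct.

    Assumes the published rate was rounded to the nearest whole percent.
    """
    return [k for k in range(total + 1) if round(100 * k / total) == pct]


# Candidate success counts per group that are consistent with the rounding.
counts = {name: implied_successes(n, pct) for name, (n, pct) in reported.items()}

# Combine the groups, taking the single consistent count from each.
total_traces = sum(n for n, _ in reported.values())
total_successes = sum(candidates[0] for candidates in counts.values())
combined_rate = total_successes / total_traces

print(counts)
print(f"{total_successes}/{total_traces} = {combined_rate:.1%}")
```

If the rates were truncated rather than rounded, the implied counts could differ, so treat the combined figure as an estimate.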
Run the MotherDuck MCP Server against your own datasets and study the chain-of-thought traces — they expose where your agent's reasoning breaks down under schema ambiguity.
Every HeadsUpAI update is written based on its original source and reviewed before it's published.




