HeadsUpAI

Anthropic Shares Multi-Agent Harness Design for Long-Running App Development


Anthropic published an engineering post on how a generator-evaluator multi-agent architecture tackles poor self-evaluation and context degradation in long-running coding tasks. A standalone evaluator uses the Playwright Model Context Protocol (MCP) to interact with live pages, scoring against design quality, originality, craft, and functionality criteria, and feeding critique back across 5–15 iterations. Applied to full-stack development, a three-agent planner-generator-evaluator system produced a working retro game maker; a solo run produced a broken one.

Agents reliably praise their own work. A tuned, skeptical evaluator gives the generator concrete feedback to iterate against, which is more tractable than self-critique. With Claude Opus 4.6, stronger long-context performance let the team drop sprint constructs and session resets the earlier harness required.

Apply the generator-evaluator pattern to your own agent harness for tasks where quality is subjective or hard to verify in one pass. The post includes sprint contract examples and evaluator tuning notes.
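
As a rough illustration of that pattern, the loop can be sketched in a few lines of Python. This is a minimal sketch, not Anthropic's harness: `run_loop`, `Review`, the 0 to 10 score scale, the pass threshold, and the stub generator and evaluator are all hypothetical names and values chosen for the example. In a real harness the two callables would wrap model calls, and the evaluator would inspect the live page (for instance through Playwright MCP) rather than parse a string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four criteria named in the summary above.
CRITERIA = ("design quality", "originality", "craft", "functionality")


@dataclass
class Review:
    scores: Dict[str, int]  # criterion -> score; the 0..10 scale is an assumption
    critique: str           # feedback handed back to the generator


def run_loop(
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Review],
    task: str,
    max_iters: int = 15,   # upper end of the 5-15 iteration range
    pass_score: int = 8,   # hypothetical acceptance threshold
) -> Tuple[str, List[Review]]:
    """Alternate generation and independent evaluation until every
    criterion passes or the iteration budget is exhausted."""
    feedback = ""
    artifact = ""
    history: List[Review] = []
    for _ in range(max_iters):
        artifact = generate(task, feedback)
        review = evaluate(artifact)
        history.append(review)
        if all(review.scores[c] >= pass_score for c in CRITERIA):
            break
        feedback = review.critique  # critique drives the next attempt
    return artifact, history


# Stubs standing in for model calls (assumptions for demonstration only).
def stub_generate(task: str, feedback: str) -> str:
    revision = int(feedback.split("#")[-1]) + 1 if feedback else 1
    return f"{task} v{revision}"


def stub_evaluate(artifact: str) -> Review:
    version = int(artifact.rsplit("v", 1)[-1])
    score = min(10, 4 + 2 * version)  # later revisions score higher
    return Review({c: score for c in CRITERIA},
                  f"needs polish after revision #{version}")


if __name__ == "__main__":
    final, reviews = run_loop(stub_generate, stub_evaluate, "retro game maker")
    print(final, len(reviews))
```

Keeping the evaluator as a separate callable with its own scoring rubric is the point of the design: the generator never grades its own output, and only the critique text crosses the boundary between the two.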

Anthropic
@AnthropicAI
X

New on the Anthropic Engineering Blog: How we use a multi-agent harness to push Claude further in frontend design and long-running autonomous software engineering. Read more: https://t.co/HWvmXk1ykn

292 retweets