Where is AI-assisted web development heading? We've added new categories to Code Arena: Frontend, covering 7 domains across agentic web development. Learn about the ML methodology behind it, what the shifting data tells us about how people are actually using AI to build for the web, and which models are quietly excelling in specific niches. 0:00 What sparked the new 7 categories? 0:34 The classical ML approach: building a taxonomy from scratch 2:06 Clustering prompts at scale 2:19 Prototype extraction: using LLMs to name and label clusters 3:26 Why raw embeddings fail: language bias and multi-angle prompts 4:07 Breaking down user intent: what to build, style, components 5:09 From clusters to categories: the key research decision 6:43 Optimization goal #1: Coverage — how much of the data is represented? 7:12 Optimization goal #2: Boundary clarity — keeping definitions tight 8:19 The iterative refinement loop: human-in-the-loop + LLM polish 9:38 Measuring coverage (aiming for 80%+) and sampling the long tail 10:09 Optimization goal #3: Interpretability — titles that make intuitive sense 10:20 How available tools (web search, screenshots) shape the final categories 12:07 How prompt category distribution has shifted over time 14:40 Growing categories: brand/marketing sites and consumer products 14:41 Model-specific strengths: GPT-5.5 and Gemma-4-31b 15:09 Radar plots as a practical model-selection tool 16:11 Combining domain rankings with price/speed Pareto curves 16:52 Predictions: what new categories are coming next?
Arena.ai Adds Seven WebDev Categories to Reveal Niche Model Strengths
- Analysis sample size
- 250,000 prompts
- New categories
- 7 domains
- Total models ranked
- 81 models
- Most common category
- Reference-Based Design (29%)
- Top overall model
- Claude Opus 4.7 Thinking
This granular view supports Arena's task-specific leaderboards by revealing that user intent is shifting toward practical product builds. While gaming was an early favorite, users are increasingly building marketing sites and data-heavy applications. These tasks stress different model behaviors, a finding that validates Cursor's agentic coding benchmark research.
You can now use radar plots to identify specific model strengths, such as GPT-5.5 High for simulations. These category-level views help separate changes in user demand from total platform volume across 81 models. The updated rankings, incorporating the Alibaba Qwen3.7 Max leaderboard debut, are now live on the platform's WebDev leaderboard.
Still wondering? A few quick answers below.




