Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I https://t.co/Naw1BBwhKl
ToolCall-15 Open-Sources a Practical Tool-Calling Benchmark for Local LLMs
stevibe · Updated
ToolCall-15 is a newly open-sourced benchmark that tests how well local LLMs handle tool calling across 15 scenarios and 12 tools. Built for practitioners shipping agents, it runs against mocked tool responses at temperature 0, so results are deterministic and reproducible, and it reports them on a live pass/partial/fail dashboard. No research paper required.
Most tool-calling benchmarks are either buried in papers with no runnable code or built on abstract cases that don't reflect how agents actually fail. ToolCall-15 is fully inspectable: every prompt, tool definition, expected answer, and scoring rule is published so anyone can verify results.
If you're evaluating local models for an agentic pipeline, this gives you a structured way to test the failures that matter: whether a model picks the right tool, chains steps correctly, and refuses when no tool applies. The repo is MIT-licensed and includes a full methodology document.
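To make the pass/partial/fail idea concrete, here is a minimal Python sketch of how one scenario could be scored against a mocked expectation. It is not ToolCall-15's actual code or scoring rules: the `ToolCall` and `Scenario` types, the `score` function, the tool names, and the rule that "right tools, wrong arguments" counts as partial are all assumptions made for illustration. The published methodology document is the authority on how the benchmark really scores.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation emitted by a model (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Scenario:
    """A prompt plus the ordered tool calls we expect.

    An empty `expected` list means the model should NOT call any tool.
    """
    prompt: str
    expected: list[ToolCall]


def score(scenario: Scenario, actual: list[ToolCall]) -> str:
    """Return "pass", "partial", or "fail" (illustrative rule, not the real one)."""
    expected = scenario.expected
    if not expected:
        # Refusal case: any tool call at all is a failure.
        return "pass" if not actual else "fail"
    if actual == expected:
        # Same tools, same order, same arguments.
        return "pass"
    if [c.name for c in actual] == [c.name for c in expected]:
        # Assumption: correct tool sequence with wrong arguments is partial.
        return "partial"
    return "fail"


# Hypothetical scenarios; the tool names are made up for this sketch.
weather = Scenario(
    "What's the weather in Paris?",
    [ToolCall("get_weather", {"city": "Paris"})],
)
no_tool = Scenario("Tell me a joke.", [])

results = {
    "exact": score(weather, [ToolCall("get_weather", {"city": "Paris"})]),
    "wrong_args": score(weather, [ToolCall("get_weather", {"city": "Berlin"})]),
    "wrong_tool": score(weather, [ToolCall("search_web", {"q": "Paris weather"})]),
    "refused": score(no_tool, []),
    "over_eager": score(no_tool, [ToolCall("get_weather", {"city": "Paris"})]),
}
print(results)
```

Because the expected calls are fixed data and the comparison is pure, the same model output always gets the same verdict, which is the property that makes a benchmark like this reproducible when the model is also run at temperature 0.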