Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I https://t.co/Naw1BBwhKl
ToolCall-15 Open-Sources a Practical Tool-Calling Benchmark for Local LLMs
ToolCall-15, an open-source benchmark for LLM tool use, runs 15 fixed scenarios against a 12-tool set with mocked responses at temperature 0. The suite covers five categories (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint and Refusal, and Error Recovery) with three scenarios each, scored pass/partial/fail. A live dashboard renders the full matrix across OpenRouter, Ollama, and llama.cpp.
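As a rough illustration of that structure, a results matrix of five categories with three scenarios each could be tallied per category as sketched below. This is not ToolCall-15's actual code: the names CATEGORIES, OUTCOMES, tally, and example are hypothetical, and the sample outcomes are made up rather than real benchmark results.

```python
from collections import Counter

# Category names come from the benchmark description above;
# everything else in this sketch is illustrative.
CATEGORIES = [
    "Tool Selection",
    "Parameter Precision",
    "Multi-Step Chains",
    "Restraint and Refusal",
    "Error Recovery",
]
OUTCOMES = ("pass", "partial", "fail")
SCENARIOS_PER_CATEGORY = 3


def tally(results):
    """Count pass/partial/fail per category.

    `results` maps each category name to a list of three outcome strings.
    """
    summary = {}
    for category in CATEGORIES:
        outcomes = results[category]
        if len(outcomes) != SCENARIOS_PER_CATEGORY:
            raise ValueError(f"{category}: expected {SCENARIOS_PER_CATEGORY} outcomes")
        unknown = set(outcomes) - set(OUTCOMES)
        if unknown:
            raise ValueError(f"{category}: unknown outcomes {sorted(unknown)}")
        counts = Counter(outcomes)
        # Counter returns 0 for outcomes that never occurred.
        summary[category] = {outcome: counts[outcome] for outcome in OUTCOMES}
    return summary


# Made-up outcomes for a hypothetical model, not real benchmark results.
example = {
    "Tool Selection": ["pass", "pass", "pass"],
    "Parameter Precision": ["pass", "partial", "pass"],
    "Multi-Step Chains": ["pass", "fail", "partial"],
    "Restraint and Refusal": ["fail", "pass", "pass"],
    "Error Recovery": ["partial", "partial", "fail"],
}

for category, counts in tally(example).items():
    print(f"{category}: {counts}")
```

Keeping the three outcome counts separate, rather than collapsing them into one number, avoids assuming how the benchmark weights a partial result.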
Most tool-calling benchmarks are buried in papers with no runnable code or built on abstract cases that don't reflect how agents actually fail. ToolCall-15 is fully inspectable: every prompt, tool definition, expected answer, and scoring rule is published so anyone can verify results.
If you're evaluating local models for an agentic pipeline, this gives you a structured way to test the failures that matter: whether a model picks the right tool, chains steps correctly, and refuses when no tool applies. The repo is MIT-licensed and includes a full methodology document.
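To make those checks concrete, here is a minimal sketch of how a single scenario might be graded against a mocked expectation. It assumes a simplified format where each tool call is a dict with "name" and "arguments"; the grade function, the get_weather tool, and the rule that a correct tool with wrong arguments counts as partial are all assumptions for illustration, not the repo's published scoring rules.

```python
def grade(expected, calls):
    """Grade one scenario as "pass", "partial", or "fail".

    expected: None when the model should call no tool (a restraint case),
              otherwise a dict with "name" and "arguments".
    calls:    list of tool-call dicts the model emitted.
    """
    if expected is None:
        # Restraint: any tool call at all is a failure.
        return "pass" if not calls else "fail"
    if len(calls) != 1 or calls[0]["name"] != expected["name"]:
        # Tool selection: wrong tool, no tool, or extra calls.
        return "fail"
    # Assumed rule: right tool with wrong arguments earns partial credit.
    return "pass" if calls[0]["arguments"] == expected["arguments"] else "partial"


# Hypothetical tool and arguments, used only to exercise the function.
expected_call = {"name": "get_weather", "arguments": {"city": "Berlin"}}

print(grade(expected_call, [{"name": "get_weather", "arguments": {"city": "Berlin"}}]))
print(grade(expected_call, [{"name": "get_weather", "arguments": {"city": "Paris"}}]))
print(grade(None, [{"name": "get_weather", "arguments": {"city": "Berlin"}}]))
```

Multi-step chains would need an ordered list of expected calls rather than a single one; this sketch covers only the single-call and no-call cases.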
stevibe
@stevibe
143 retweets


