HeadsUpAI

ToolCall-15 Open-Sources a Practical Tool-Calling Benchmark for Local LLMs

ToolCall-15, an open-source benchmark for LLM tool use, runs 15 fixed scenarios against a 12-tool set with mocked responses at temperature 0. The suite covers five categories: Tool Selection, Parameter Precision, Multi-Step Chains, Restraint and Refusal, and Error Recovery. Each category has three scenarios, and each scenario is scored pass, partial, or fail. A live dashboard renders the full matrix across OpenRouter, Ollama, and llama.cpp.
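To make the shape of that results matrix concrete, here is a minimal Python sketch of how 15 pass/partial/fail outcomes could be tallied per category. Only the category names and the three-scenarios-per-category layout come from the description above. The `summarize` function, the `POINTS` weights, and the sample outcomes are hypothetical and are not taken from ToolCall-15's published scoring rules.

```python
from collections import Counter

# Category names are from the article; everything else is illustrative.
CATEGORIES = [
    "Tool Selection",
    "Parameter Precision",
    "Multi-Step Chains",
    "Restraint and Refusal",
    "Error Recovery",
]
SCENARIOS_PER_CATEGORY = 3

# Assumed weights, NOT ToolCall-15's actual scoring rule.
POINTS = {"pass": 2, "partial": 1, "fail": 0}


def summarize(results):
    """Tally outcomes per category and compute an overall score.

    `results` maps (category, scenario_index) to "pass", "partial", or "fail".
    """
    per_category = {}
    for category in CATEGORIES:
        outcomes = [results[(category, i)] for i in range(SCENARIOS_PER_CATEGORY)]
        per_category[category] = Counter(outcomes)
    earned = sum(POINTS[outcome] for outcome in results.values())
    maximum = len(results) * POINTS["pass"]
    return per_category, earned, maximum


# Made-up outcomes for one hypothetical model.
results = {
    (category, i): "pass"
    for category in CATEGORIES
    for i in range(SCENARIOS_PER_CATEGORY)
}
results[("Multi-Step Chains", 1)] = "partial"
results[("Error Recovery", 2)] = "fail"

per_category, earned, maximum = summarize(results)
print(f"{earned}/{maximum}")  # prints 27/30 under the assumed weights
```

Keeping the per-category counts separate from the overall total mirrors the matrix view: a model with a high total can still fail an entire category such as Error Recovery.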

Most tool-calling benchmarks are either buried in papers with no runnable code or built on abstract cases that don't reflect how agents actually fail. ToolCall-15 is fully inspectable: every prompt, tool definition, expected answer, and scoring rule is published so anyone can verify results.

If you're evaluating local models for an agentic pipeline, this gives you a structured way to test the failures that matter: whether a model picks the right tool, chains steps correctly, and refuses when no tool applies. The repo is MIT-licensed and includes a full methodology document.

stevibe (@stevibe) on X:

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I https://t.co/Naw1BBwhKl
