ToolCall-15 Open-Sources a Practical Tool-Calling Benchmark for Local LLMs

stevibe

Mar 26, 2026 · Updated Jun 6, 2026

ToolCall-15 is an open-source benchmark that tests how well local LLMs handle tool calling across 15 scenarios and 12 tools. Built for practitioners shipping agents, it produces deterministic, reproducible results with a live pass/partial/fail dashboard, no research paper required.

ToolCall-15, an open-source benchmark for LLM tool use, runs 15 fixed scenarios against a 12-tool set with mocked responses at temperature 0. The suite covers five categories (Tool Selection, Parameter Precision, Multi-Step Chains, Restraint and Refusal, and Error Recovery) with three scenarios each, and every scenario is scored pass, partial, or fail. A live dashboard renders the full results matrix for models served through OpenRouter, Ollama, and llama.cpp.
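To make the structure concrete, here is a minimal Python sketch of how per-scenario outcomes could be tallied into a category-by-status table. Only the five category names and the pass/partial/fail labels come from the description above. The `Result` record, the scenario identifiers, and the demo outcomes are hypothetical and are not taken from the ToolCall-15 repo.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    PARTIAL = "partial"
    FAIL = "fail"


# Category names come from the article; everything else is an assumption.
CATEGORIES = [
    "Tool Selection",
    "Parameter Precision",
    "Multi-Step Chains",
    "Restraint and Refusal",
    "Error Recovery",
]
SCENARIOS_PER_CATEGORY = 3  # 5 categories x 3 scenarios = 15


@dataclass(frozen=True)
class Result:
    scenario_id: str  # hypothetical identifier such as "TS-1"
    category: str
    status: Status


def tally_by_category(results):
    """Count pass/partial/fail outcomes for each category."""
    table = {name: Counter() for name in CATEGORIES}
    for r in results:
        table[r.category][r.status] += 1
    return table


# Made-up outcomes for illustration only; not real benchmark data.
demo = [
    Result("TS-1", "Tool Selection", Status.PASS),
    Result("TS-2", "Tool Selection", Status.PARTIAL),
    Result("TS-3", "Tool Selection", Status.FAIL),
    Result("RR-1", "Restraint and Refusal", Status.PASS),
]

table = tally_by_category(demo)
for name in CATEGORIES:
    print(name, {s.value: table[name][s] for s in Status})
```

The sketch deliberately counts statuses instead of converting them to a numeric score, since the weighting of a partial result is defined by the project's own methodology document.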

Most tool-calling benchmarks are either buried in papers with no runnable code or built on abstract cases that don't reflect how agents actually fail. ToolCall-15 is fully inspectable: every prompt, tool definition, expected answer, and scoring rule is published, so anyone can verify the results.

If you're evaluating local models for an agentic pipeline, ToolCall-15 gives you a structured way to test the failures that matter: whether a model picks the right tool, chains steps correctly, and refuses when no tool applies. The repo is MIT-licensed and includes a full methodology document.
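As a rough illustration of those three checks, the Python sketch below grades a model's tool calls against an expected sequence. The grading rules and the tool names (`get_weather`, `search_contacts`, `send_email`) are assumptions made for this example. ToolCall-15's actual scoring rules are published in its repo and may differ.

```python
def grade(expected_calls, actual_calls):
    """Grade tool calls against an expected sequence.

    Each call is a (tool_name, arguments) tuple. Illustrative rules:
      pass    - same tools, same order, same arguments
      partial - same tools, same order, but some arguments differ
      fail    - wrong tool, wrong order, or a missing or extra call
    An empty expected list means no tool applies, so any call fails.
    """
    expected_names = [name for name, _ in expected_calls]
    actual_names = [name for name, _ in actual_calls]
    if expected_names != actual_names:
        return "fail"
    pairs = zip(expected_calls, actual_calls)
    if all(e_args == a_args for (_, e_args), (_, a_args) in pairs):
        return "pass"
    return "partial"


# Hypothetical tool names, used only for illustration.
weather = ("get_weather", {"city": "Berlin"})
lookup = ("search_contacts", {"name": "Ana"})
email = ("send_email", {"to": "ana@example.com"})

# Right tool, right arguments.
selection = grade([weather], [weather])
# Right tool, wrong argument value.
precision = grade([weather], [("get_weather", {"city": "Bern"})])
# Two-step chain executed in the wrong order.
chain = grade([lookup, email], [email, lookup])
# No tool applies and the model calls none.
restraint_ok = grade([], [])
# No tool applies but the model calls one anyway.
restraint_bad = grade([], [weather])

print(selection, precision, chain, restraint_ok, restraint_bad)
```

Comparing tool names before arguments keeps the two failure types separate: a wrong or misordered tool is a hard failure, while a correct tool with a wrong argument earns partial credit under these assumed rules.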


stevibe (@stevibe) · Mar 25

Which local models can actually handle tool calling? I built a framework to find out. 15 scenarios. 12 tools. Mocked responses. Temperature 0. No cherry-picking. Tested every Qwen3.5 size from 0.8B to 397B, and since some of you asked after the distillation tests: yes, I https://t.co/Naw1BBwhKl


