Planning is where LLMs move from “saying” to “doing.” Tencent Hy, in collaboration with the Gaoling School of Artificial Intelligence at Renmin University of China, is excited to open-source PlanningBench - a scalable, verifiable framework for evaluating and training LLM planning capabilities.

With PlanningBench, you get:
✅ 30+ real-world planning tasks
✅ Automated verification
✅ Evaluation and training support

See how top-tier LLMs perform on PlanningBench 👇

Resources:
arXiv: https://t.co/N5xTRdo9KR
GitHub: https://t.co/XftHZrKGyB
HuggingFace: https://t.co/nBbddXnEDx

#PlanningBench #TencentHunyuan #OpenSource
Tencent Hunyuan Open-Sources PlanningBench to Advance LLM Planning Capabilities
Tencent Hunyuan, in collaboration with Renmin University of China, has open-sourced PlanningBench, a framework for evaluating and training large language model (LLM) planning capabilities. The release aims to help LLMs move from generating text to autonomously executing multi-step actions in complex scenarios.
- Evaluation features: automated verification, adaptive difficulty control, instance-level verification checklists
- Training support: reinforcement learning on verified data
- Task domains: Scheduling & Timetabling (28.42%), Project & Production Operations (21.90%), Routing & Traveling (17.21%), and 3 more
Planning is a fundamental capability for AI agents, enabling them to coordinate goals, constraints, and resources to achieve complex objectives. PlanningBench's evaluations show that current frontier models still struggle to produce complete solutions under coupled constraints, a challenge also observed in benchmarks like ITBench-AA for agentic tasks.
PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs. The framework is available on GitHub and Hugging Face. The release supports the development of more capable AI agents, a focus also seen in platforms like Arena.ai's Agent Mode.