Honored to see Gemma 4 31B on TERMS-Bench, a benchmark for LLM negotiation agents based on economic negotiation! 🤝
- Environment verifies outcomes (no LLM-as-judge)
- Top open-weight model alongside frontier peers
- Allow diagnosing why and where agents fail
https://t.co/cYnpVpVxkU
Google Gemma 4 31B Leads Open-Weight Models in Negotiation Benchmark
Google's Gemma 4 31B achieved the highest score among open-weight models on TERMS-Bench, a diagnostic benchmark for AI agents (autonomous systems that plan and act independently). Rather than relying on 'LLM-as-judge' grading, the framework uses a deterministic environment to measure how much of the surplus (the available bargaining profit) an agent captures while obeying price bounds.
- Benchmark (TERMS-Bench SE+): 0.640
- Agreement rate (AGR+): 99.8%
- Open-weight rank: #1
- Parameters: 31 billion
- Availability: Open weights, Google API
Negotiation is a demanding reasoning test for agentic AI: a model must maintain beliefs about an opponent's hidden constraints over a multi-step interaction. The result positions Gemma 4 31B as a credible alternative to proprietary systems for this kind of task. It follows Gemma 4's top vision rankings and strengthens its standing as a frontier-class open model.
Gemma 4 31B is a candidate for autonomous procurement or marketplace workflows where maximizing financial utility is critical. On this benchmark its surplus efficiency exceeded that of several larger proprietary models, which suggests it may secure competitive deals at lower cost, although benchmark results do not guarantee the same outcomes in production. The 31B model is available as an open-weight release or via the Google API.
Google Gemma
@googlegemma
26 retweets, 284 likes
Still wondering? A few quick answers below.
TERMS-Bench is a diagnostic benchmark designed to evaluate how well AI agents perform in bilateral economic negotiations. Unlike traditional evaluations that use other models to grade performance, this framework uses a verifiable environment to measure surplus efficiency, agreement rates, and whether an agent follows hard rules like price bounds and individual rationality.
Google Gemma 4 31B ranked as the top-performing open-weight model on the benchmark, outperforming several larger proprietary frontier models in surplus efficiency. It posted an SE+ score of 0.640 while maintaining a near-perfect agreement rate of 99.8%, which suggests that open-weight models are becoming viable for complex autonomous commerce tasks.
The benchmark uses a structured environment that enforces game-theoretic rules, such as turn budgets and price constraints. It calculates metrics like surplus efficiency and agreement calibration based on the actual economic outcomes of the negotiation. This deterministic approach allows researchers to diagnose exactly why and where an agent fails during a multi-step interaction.
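To make the idea of deterministic rule enforcement concrete, here is a minimal sketch of a transcript check. It is not TERMS-Bench's code: the `Offer` type, the `check_transcript` function, and the two rules it applies (a turn budget and a closed price interval) are assumptions chosen to mirror the constraints named above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Offer:
    turn: int       # 1-based turn index (hypothetical convention)
    proposer: str   # "buyer" or "seller"
    price: float


def check_transcript(offers: List[Offer], price_min: float, price_max: float,
                     turn_budget: int) -> List[str]:
    """Return human-readable rule violations found in an offer transcript.

    Assumed rules, for illustration only:
      - no offer may be made after the turn budget is exhausted
      - every offer price must lie within [price_min, price_max]
    """
    violations = []
    for offer in offers:
        if offer.turn > turn_budget:
            violations.append(
                f"turn {offer.turn}: exceeds turn budget {turn_budget}")
        if not (price_min <= offer.price <= price_max):
            violations.append(
                f"turn {offer.turn}: price {offer.price} outside "
                f"[{price_min}, {price_max}]")
    return violations


# Example transcript: the third offer breaks both assumed rules.
transcript = [
    Offer(1, "seller", 120.0),
    Offer(2, "buyer", 80.0),
    Offer(3, "seller", 250.0),
]
for problem in check_transcript(transcript, price_min=50.0,
                                price_max=200.0, turn_budget=2):
    print(problem)
```

Because the check depends only on the recorded offers and fixed rules, the same transcript always yields the same verdict, and each violation points to the turn where it occurred.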
The benchmark measures four primary axes: surplus efficiency, which is the fraction of profit captured; agreement rate, which tracks how often a deal is reached; opponent modeling, which tests the agent's ability to infer a counterpart's hidden constraints; and procedural robustness, which flags violations of price bounds or irrational deal acceptance.
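As a rough illustration of how outcome-based metrics like these could be computed, the sketch below scores a batch of finished negotiations. The source does not give the benchmark's formulas or define the "+" in SE+ and AGR+, so everything here is an assumption: the `Episode` fields, the pooled definition of surplus efficiency, and the choice to count agreement only over episodes where a profitable deal exists.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    buyer_value: float           # buyer's hidden maximum willingness to pay
    seller_cost: float           # seller's hidden minimum acceptable price
    deal_price: Optional[float]  # None if no agreement was reached
    agent_role: str              # "buyer" or "seller"


def agent_surplus(ep: Episode) -> float:
    """Profit the agent captured relative to its own reservation value."""
    if ep.deal_price is None:
        return 0.0
    if ep.agent_role == "buyer":
        return ep.buyer_value - ep.deal_price
    return ep.deal_price - ep.seller_cost


def summarize(episodes: List[Episode]) -> dict:
    # Assumed convention: a deal is feasible only if value exceeds cost.
    feasible = [ep for ep in episodes if ep.buyer_value > ep.seller_cost]
    deals = [ep for ep in feasible if ep.deal_price is not None]
    available = sum(ep.buyer_value - ep.seller_cost for ep in feasible)
    captured = sum(agent_surplus(ep) for ep in feasible)
    # Individual rationality: the agent should never accept a losing deal.
    ir_violations = sum(
        1 for ep in episodes
        if ep.deal_price is not None and agent_surplus(ep) < 0)
    return {
        "surplus_efficiency": captured / available if available else 0.0,
        "agreement_rate": len(deals) / len(feasible) if feasible else 0.0,
        "ir_violations": ir_violations,
    }


episodes = [
    Episode(100.0, 60.0, 90.0, "seller"),  # captures 30 of 40 available
    Episode(100.0, 60.0, None, "seller"),  # feasible, but no deal reached
    Episode(50.0, 70.0, 65.0, "seller"),   # sold below cost: IR violation
]
print(summarize(episodes))
```

Every number is derived from the recorded prices and reservation values, with no grader model involved. Opponent modeling is left out of this sketch because scoring it would require knowing how the benchmark elicits an agent's beliefs, which the source does not describe.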
Gemma 4 31B is an open-weight model that developers can download and run on their own hardware or access through the Google API. Its performance on negotiation tasks makes it a strong candidate for developers building local or edge-based agents for procurement, sales, or automated marketplace interactions.




