Google Gemma 4 31B Leads Open-Weight Models in Negotiation Benchmark

Google Gemma

May 28, 2026 · Updated Jun 12, 2026

Google's Gemma 4 31B ranked as the top-performing open-weight model on TERMS-Bench, a new evaluation for AI agents conducting economic negotiations. The benchmark uses a verifiable environment instead of LLM grading to measure an agent's ability to maximize profit while following strict financial constraints.

Google's Gemma 4 31B achieved the highest score among open-weight models on TERMS-Bench, a diagnostic benchmark for AI agents (autonomous systems that plan and act independently). Unlike 'LLM-as-judge' grading, this framework uses a deterministic environment to verify how much surplus (the profit available from bargaining) an agent captures while staying within price bounds.

Benchmark (TERMS-Bench SE+): 0.640
Agreement rate (AGR+): 99.8%
Open-weight rank: #1
Parameters: 31 billion
Availability: Open weights, Google API

Negotiation is a demanding test of agentic reasoning because a model must maintain beliefs about its counterpart's hidden constraints throughout a multi-step exchange. The result suggests Gemma 4 31B is a viable alternative to proprietary systems for this kind of task. It follows Gemma 4's strong showing on open vision leaderboards, which adds to its standing among frontier-class open models.

You can use Gemma 4 31B for autonomous procurement or marketplace workflows where maximizing financial utility is critical. Its high surplus efficiency on the benchmark suggests it may secure better deals than some larger proprietary models, and at lower cost. The 31B model is available as an open-weight release or via the Google API.

View the full update on terms-bench.github.io

Google Gemma (@googlegemma) · May 28

Honored to see Gemma 4 31B on TERMS-Bench, a benchmark for LLM negotiation agents based on economic negotiation! 🤝
- Environment verifies outcomes (no LLM-as-judge)
- Top open-weight model alongside frontier peers
- Allow diagnosing why and where agents fail
https://t.co/cYnpVpVxkU


Still wondering? A few quick answers below.

What is TERMS-Bench for AI agents?

TERMS-Bench is a diagnostic benchmark designed to evaluate how well AI agents perform in bilateral economic negotiations. Unlike traditional evaluations that use other models to grade performance, this framework uses a verifiable environment to measure surplus efficiency, agreement rates, and whether an agent follows hard rules like price bounds and individual rationality.

How did Google Gemma 4 31B perform on the negotiation benchmark?

Google Gemma 4 31B ranked as the top-performing open-weight model on the benchmark, outperforming several larger proprietary frontier models in surplus efficiency. It captured a high percentage of available profit in negotiation scenarios while maintaining a near-perfect agreement rate, which suggests that open-weight models are becoming commercially viable for complex autonomous commerce tasks.

How does TERMS-Bench evaluate AI negotiation agents without using an LLM judge?

The benchmark uses a structured environment that enforces game-theoretic rules, such as turn budgets and price constraints. It calculates metrics like surplus efficiency and agreement calibration based on the actual economic outcomes of the negotiation. This deterministic approach allows researchers to diagnose exactly why and where an agent fails during a multi-step interaction.
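As a rough illustration of how an environment can verify outcomes without an LLM judge, the sketch below replays a negotiation transcript against a turn budget and price bounds. Everything in it is hypothetical: the `Offer` and `verify_episode` names, the rule set, and the outcome labels are assumptions made for illustration, not TERMS-Bench's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Offer:
    """One negotiation turn: a proposed price, or acceptance of the last price."""
    price: Optional[float] = None  # None when the turn is an acceptance
    accept: bool = False


def verify_episode(
    offers: List[Offer],
    price_bounds: Tuple[float, float],
    turn_budget: int,
) -> Tuple[str, Optional[float]]:
    """Replay a transcript against fixed rules and return (outcome, deal_price).

    Hypothetical outcome labels: "deal", "no_deal", "bound_violation",
    "invalid_accept".
    """
    low, high = price_bounds
    last_price: Optional[float] = None
    for offer in offers[:turn_budget]:  # turns beyond the budget are ignored
        if offer.accept:
            if last_price is None:
                return "invalid_accept", None  # nothing on the table to accept
            return "deal", last_price
        if offer.price is None or not (low <= offer.price <= high):
            return "bound_violation", None
        last_price = offer.price
    return "no_deal", None  # budget exhausted without agreement


transcript = [
    Offer(price=120.0),
    Offer(price=90.0),
    Offer(price=100.0),
    Offer(accept=True),
]
print(verify_episode(transcript, price_bounds=(50.0, 150.0), turn_budget=6))
# ('deal', 100.0)
```

Because the outcome depends only on the transcript and the fixed rules, the same episode always receives the same label, and a failure can be traced to the specific turn that triggered it.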

What specific metrics does TERMS-Bench use to rank models like Gemma 4?

The benchmark measures four primary axes: surplus efficiency, which is the fraction of profit captured; agreement rate, which tracks how often a deal is reached; opponent modeling, which tests the agent's ability to infer a counterpart's hidden constraints; and procedural robustness, which flags violations of price bounds or irrational deal acceptance.
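To make surplus efficiency, agreement rate, and the rationality check concrete, here is a minimal sketch for a seller-side agent. The formulas are assumptions: surplus efficiency is taken as the agent's gain divided by the total surplus (the buyer's hidden value minus the seller's cost), and the `Episode` fields and `score` helper are hypothetical names rather than the benchmark's published definitions. Opponent modeling is left out because the source does not say how it is scored.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Episode:
    seller_cost: float           # agent's private floor (assumed seller role)
    buyer_value: float           # counterpart's hidden ceiling
    deal_price: Optional[float]  # None if no agreement was reached


def score(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate illustrative metrics over episodes with positive surplus."""
    feasible = [e for e in episodes if e.buyer_value > e.seller_cost]
    if not feasible:
        raise ValueError("need at least one episode with positive surplus")
    deals = [e for e in feasible if e.deal_price is not None]
    # Individual rationality: the seller should never accept below its cost.
    violations = sum(1 for e in deals if e.deal_price < e.seller_cost)
    # Fraction of available surplus captured; a failed negotiation counts as 0.
    efficiencies = [
        (e.deal_price - e.seller_cost) / (e.buyer_value - e.seller_cost)
        if e.deal_price is not None
        else 0.0
        for e in feasible
    ]
    return {
        "surplus_efficiency": sum(efficiencies) / len(feasible),
        "agreement_rate": len(deals) / len(feasible),
        "rationality_violations": float(violations),
    }


episodes = [
    Episode(seller_cost=60.0, buyer_value=100.0, deal_price=90.0),
    Episode(seller_cost=50.0, buyer_value=150.0, deal_price=100.0),
    Episode(seller_cost=80.0, buyer_value=120.0, deal_price=None),
    Episode(seller_cost=40.0, buyer_value=80.0, deal_price=70.0),
]
print(score(episodes))
# {'surplus_efficiency': 0.5, 'agreement_rate': 0.75, 'rationality_violations': 0.0}
```

Counting a failed negotiation as zero efficiency is one possible design choice: it means an agent cannot raise its average by walking away from hard deals.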

Is Google Gemma 4 31B available for developers to use?

Yes, Gemma 4 31B is an open-weight model that developers can download and run on their own hardware or access through the Google API. Its performance on negotiation tasks makes it a strong candidate for developers building local or edge-based agents for procurement, sales, or automated marketplace interactions.

Every HeadsUpAI update is written based on its original source and reviewed before it's published.


Keep reading

Google Gemma 4 Claims Top Rankings on Japanese Swallow Leaderboard v2

Google confirmed that its Gemma 4 open-weight model achieved high-ranking results on the Swallow Leaderboard v2, a rigorous Japanese language benchmark. This validation establishes the model as a leading choice for developers building regional applications that require frontier-level reasoning in Japanese.

Arena · May 8

Arena Ranks Google Gemma 4 as Top Open Vision Model

Google's Gemma-4-31b and Gemma-4-26b-a4b have entered the Vision Arena leaderboard as the #2 and #4 ranked open models. These releases shift the price-performance frontier by delivering vision reasoning capabilities that rival proprietary systems at a fraction of the cost.

Google · Apr 20

Google enables Gemma 4 31B to autonomously debug and execute code

Google's Gemma 4 31B can now use an ADK Agent and a code execution sandbox to autonomously navigate complex, multi-step tasks. This update brings frontier-level agentic capabilities like self-debugging and tool use to an open-weight model that can run on-device or at the edge.

Google AI Studio · May 22

Google Gemini 3.5 Flash Beats Larger Models on Agentic Benchmark

Gemini 3.5 Flash has ranked first on the APEX-Agents-AA benchmark, outperforming larger frontier models in autonomous task execution. The result confirms that high-speed, low-cost models are now capable of handling complex agentic workflows previously reserved for larger architectures.

