HeadsUpAI

Google Launches Flex and Priority Tiers to Balance Agentic Workload Costs


Google has added two synchronous service tiers to the Gemini API: flex and priority. The flex tier cuts prices by 50% for latency-tolerant workloads, while priority offers the highest reliability for critical, interactive apps at premium pricing. Both are selected through a single service_tier parameter, so developers can lower costs without moving work to the asynchronous Batch API.
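To make the 50% figure concrete, here is a rough sketch of how a bill might change when part of the traffic moves to flex. The function name, the spend figure, and the share of traffic are all hypothetical; only the 50% discount comes from the announcement, and priority's premium is left out because no number is given.

```python
def blended_cost(standard_cost: float, flex_share: float, flex_discount: float = 0.5) -> float:
    """Estimate spend when a fraction of traffic is moved from standard to flex.

    standard_cost: what the whole workload would cost on the standard tier.
    flex_share:    fraction of that workload (0.0 to 1.0) routed to flex.
    flex_discount: flex price reduction; 0.5 reflects the announced 50%.
    """
    if not 0.0 <= flex_share <= 1.0:
        raise ValueError("flex_share must be between 0 and 1")
    flex_part = standard_cost * flex_share * (1.0 - flex_discount)
    standard_part = standard_cost * (1.0 - flex_share)
    return flex_part + standard_part


# Hypothetical example: $1,000 of standard-tier usage, 60% of it latency-tolerant.
estimate = blended_cost(1000.0, flex_share=0.6)
print(f"Estimated spend: ${estimate:,.2f}")
```

Under these assumed numbers the bill drops by about 30%, since only the 60% of traffic moved to flex gets the discount.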

As developers move from chatbots to autonomous agents, managing diverse workloads becomes a bottleneck. Background tasks such as data enrichment or internal reasoning do not need immediate responses but consume large token volumes. With these tiers, developers can route that background work to cheaper compute without splitting their architecture between synchronous and asynchronous systems.

To use the tiers, set the service_tier configuration in GenerateContent or Interactions API requests. Flex suits background jobs such as CRM updates, while priority suits real-time support bots. The priority tier, available to Tier 2 and Tier 3 paid projects, includes a graceful downgrade: overflow traffic is routed to the Standard tier rather than failing.
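A minimal sketch of per-workload routing might look like the following. It only builds a request body and makes no network call. The workload labels and helper names are hypothetical, and both the placement of service_tier in the body and the exact tier strings ("flex", "priority", "standard") are assumptions to check against the Gemini API reference.

```python
# Hypothetical workload labels, used only in this sketch.
BACKGROUND = "background"    # e.g. CRM updates, data enrichment
INTERACTIVE = "interactive"  # e.g. real-time support bots


def choose_service_tier(workload: str) -> str:
    """Pick a tier name for a workload label (tier strings are assumed)."""
    if workload == BACKGROUND:
        return "flex"
    if workload == INTERACTIVE:
        return "priority"
    return "standard"


def build_request(prompt: str, workload: str) -> dict:
    """Build an illustrative GenerateContent-style request body.

    Assumption: service_tier is a top-level field of the body; the real
    API may expect it elsewhere in the request configuration.
    """
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": choose_service_tier(workload),
    }


crm_job = build_request("Summarize this account's latest activity.", BACKGROUND)
support_reply = build_request("Where is my order?", INTERACTIVE)
print(crm_job["service_tier"], support_reply["service_tier"])
```

Keeping the tier choice in one small function means the rest of the agent code stays the same whichever tier a request uses, which is the point of the single-parameter design.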

Google AI Developers (@googleaidevs) on X:

Balance cost & reliability with our new Flex & Priority inference tiers in the Gemini API!

Flex: Pay 50% less for cost-sensitive & latency-tolerant workloads
Priority: Highest reliability for your most critical, interactive apps (with premium pricing)

Together with the async Batch API, these synchronous tiers give you a complete set of options for any workload. Just swap the tier with a single line of code and keep building.
