Google Launches Gemini Embedding 2, Its First Multimodal Embedding Model

Google

Mar 18, 2026 · Updated Apr 25, 2026

Gemini Embedding 2, Google's first multimodal embedding model, maps text, images, video, audio, and documents into a single embedding space and is now in public preview. One API call handles interleaved multimodal inputs, eliminating separate per-modality pipelines.

Gemini Embedding 2 is Google's first natively multimodal embedding model, now in public preview via the Gemini API and Vertex AI. It maps text, images, video, audio, and PDFs into a unified embedding space. Each request can carry up to 8,192 text tokens, 6 images, 120 seconds of video, and PDFs of up to 6 pages; audio is embedded natively, with no transcription step.
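As an illustration of the per-request limits listed above, here is a minimal pre-flight check, assuming you already know the size of each input. The `EmbedRequest` class, the constant names, and `limit_violations` are hypothetical helpers written for this sketch, not part of the Gemini API; only the numeric limits come from the announcement.

```python
from dataclasses import dataclass

# Per-request limits as reported in the announcement.
# The constant names are hypothetical, chosen for this sketch.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6


@dataclass
class EmbedRequest:
    """Hypothetical description of one multimodal embedding request."""
    text_tokens: int = 0
    images: int = 0
    video_seconds: float = 0.0
    pdf_pages: int = 0


def limit_violations(req: EmbedRequest) -> list[str]:
    """Return a human-readable message for every limit the request exceeds."""
    problems = []
    if req.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text: {req.text_tokens} tokens > {MAX_TEXT_TOKENS}")
    if req.images > MAX_IMAGES:
        problems.append(f"images: {req.images} > {MAX_IMAGES}")
    if req.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video: {req.video_seconds}s > {MAX_VIDEO_SECONDS}s")
    if req.pdf_pages > MAX_PDF_PAGES:
        problems.append(f"pdf: {req.pdf_pages} pages > {MAX_PDF_PAGES}")
    return problems


# A request that exceeds the image and video limits.
for message in limit_violations(EmbedRequest(images=7, video_seconds=150)):
    print(message)
```

Checking sizes before sending a request is a design convenience; the service itself remains the authority on what it accepts.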

Most embedding pipelines require a separate model for each modality, plus fusion logic to compare results across them. The gemini-embedding-2-preview model ingests interleaved inputs natively, collapsing that work into one step and simplifying multimodal RAG, semantic search, and clustering. Output dimensions range from 128 to 3,072 via Matryoshka Representation Learning, a training technique that nests smaller embeddings inside larger ones so developers can shorten vectors to trade retrieval quality against storage.
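To make the dimension trade-off concrete, here is a minimal sketch of how a Matryoshka-style embedding is typically shortened on the client side: keep the leading values, then rescale to unit length so cosine and dot-product comparisons stay meaningful. The vector below is synthetic, `truncate_embedding` is a hypothetical helper, and the renormalization step is an assumption; the announcement does not say whether the API returns reduced-size vectors already normalized.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values of a vector and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


# Synthetic 3,072-value vector standing in for a real model output.
full = [math.sin(i + 1) for i in range(3072)]

# Shrink to the smallest size mentioned in the announcement.
small = truncate_embedding(full, 128)
print(len(small), "values; storage is", len(small) / len(full), "of the original")
```

Shorter vectors cut index size and speed up similarity search, usually at some cost in retrieval quality, so the right size is best chosen by measuring on your own data.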

Developers can use the Gemini API or Vertex AI to build multimodal search, for example querying a video archive with a text prompt or using an image to retrieve matching content. Google reports that the model outperforms leading models on text, image, and video benchmarks.
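Because every modality lands in the same embedding space, cross-modal search reduces to nearest-neighbor lookup over vectors. The sketch below ranks a toy index by cosine similarity. The four-value vectors, the item IDs, and the `search` helper are all made up for illustration; in practice each vector would come from the embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, top_k=2):
    """Return the top_k (score, item_id) pairs, best match first."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


# Toy index: hypothetical embeddings of a video, an image, and an audio clip.
index = {
    "video:beach_sunset.mp4": [0.9, 0.1, 0.0, 0.1],
    "image:city_night.jpg": [0.1, 0.9, 0.2, 0.0],
    "audio:ocean_waves.wav": [0.7, 0.0, 0.1, 0.6],
}

# Hypothetical embedding of a text query such as "waves on a beach".
text_query = [0.8, 0.0, 0.1, 0.2]

results = search(text_query, index)
for score, item_id in results:
    print(f"{score:.3f}  {item_id}")
```

A linear scan like this is fine for small collections; larger archives would normally store the vectors in a vector database or approximate nearest-neighbor index.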

View the full update on blog.google

Google AI Developers

@googleaidevs · Mar 10

Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI. https://t.co/jPE8KpN7Rf


