HeadsUpAI

Google Launches Gemini Embedding 2, Its First Multimodal Embedding Model


Gemini Embedding 2 is Google's first natively multimodal embedding model, now in public preview via the Gemini API and Vertex AI. It maps text, images, video, audio, and PDFs into a unified embedding space. Each request accepts up to 8,192 text tokens, 6 images, 120 seconds of video, native audio (no transcription needed), and PDFs of up to 6 pages.
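The per-request limits quoted above lend themselves to a simple pre-flight check before a request is sent. The following is a minimal sketch, assuming the caller already knows the size of each input; `RequestParts` and `check_limits` are hypothetical names used for illustration and are not part of any Google SDK, and the limit values are simply those reported in this update for the preview.

```python
from dataclasses import dataclass
from typing import List

# Per-request limits as quoted in this update (preview values; may change).
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6


@dataclass
class RequestParts:
    """Hypothetical summary of the inputs in one embedding request."""
    text_tokens: int = 0
    images: int = 0
    video_seconds: float = 0.0
    pdf_pages: int = 0


def check_limits(parts: RequestParts) -> List[str]:
    """Return one message per limit the request exceeds (empty if none)."""
    problems = []
    if parts.text_tokens > MAX_TEXT_TOKENS:
        problems.append(f"text: {parts.text_tokens} tokens > {MAX_TEXT_TOKENS}")
    if parts.images > MAX_IMAGES:
        problems.append(f"images: {parts.images} > {MAX_IMAGES}")
    if parts.video_seconds > MAX_VIDEO_SECONDS:
        problems.append(f"video: {parts.video_seconds}s > {MAX_VIDEO_SECONDS}s")
    if parts.pdf_pages > MAX_PDF_PAGES:
        problems.append(f"pdf: {parts.pdf_pages} pages > {MAX_PDF_PAGES}")
    return problems
```

A request that comes back with a non-empty list would need to be split or trimmed before it is embedded.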

Most embedding pipelines require a separate model per modality, plus fusion logic to compare across them. gemini-embedding-2-preview ingests interleaved inputs natively, collapsing that into one step and simplifying multimodal RAG, semantic search, and clustering. Output dimensions range from 128 to 3072 via Matryoshka Representation Learning, a nesting technique in which the leading dimensions of a vector form a usable smaller embedding, so developers can trade retrieval quality against storage.
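To make the storage trade-off concrete, here is a minimal sketch of how a Matryoshka-style embedding is typically shortened on the client side: keep the first `dim` components and re-normalize to unit length. This is an illustration under assumptions, not Google's documented procedure; `truncate_embedding` and `storage_bytes` are hypothetical helper names, and the 4-byte figure assumes vectors are stored as 32-bit floats.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and L2-normalize the result."""
    if not 1 <= dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize in an all-zero prefix
    return [x / norm for x in head]


def storage_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed per vector, assuming float32 storage by default."""
    return dim * bytes_per_value


# Storage per vector at the two ends of the reported range.
full = storage_bytes(3072)   # 12288 bytes
small = storage_bytes(128)   # 512 bytes
```

Under these assumptions, a 128-dimension vector takes 1/24 of the space of a 3072-dimension one; how much retrieval quality is lost at each size has to be measured on the target data.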

Use the Gemini API or Vertex AI to build multimodal search: query a video archive with a text prompt, or use an image to retrieve matching content. Google reports that the model outperforms leading models on text, image, and video benchmarks.
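Because every modality lands in the same vector space, cross-modal search reduces to a nearest-neighbor lookup. The sketch below ranks stored items by cosine similarity to a query vector. It assumes the embeddings were already computed elsewhere; the three-dimensional vectors and file names are made-up toy data, and `cosine` and `search` are hypothetical helpers rather than API calls.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: List[float], index: Dict[str, List[float]],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index: in practice these would be video and image embeddings.
index = {
    "clip_beach.mp4": [0.9, 0.1, 0.0],
    "clip_city.mp4": [0.1, 0.9, 0.1],
    "photo_dog.jpg": [0.0, 0.2, 0.9],
}
text_query = [0.8, 0.2, 0.0]  # stand-in for an embedded text prompt
results = search(text_query, index, top_k=2)
```

A linear scan like this is only practical for small collections; larger archives would normally sit behind a vector database or an approximate nearest-neighbor index.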

Google AI Developers (@googleaidevs) on X:

Start building with Gemini Embedding 2, our most capable and first fully multimodal embedding model built on the Gemini architecture. Now available in preview via the Gemini API and in Vertex AI. https://t.co/jPE8KpN7Rf

340 retweets