Details

  • Google AI Developers announced three new updates to the Gemini API File Search tool, aimed at simplifying multimodal Retrieval-Augmented Generation (RAG) pipelines and improving retrieval precision.
  • Key addition: multimodal support via the Gemini Embedding 2 model, enabling reasoning across images and text in a single unified embedding space.
  • Gemini Embedding 2, Google's first natively multimodal embedding model, maps text, images, videos, audio, and documents (up to 6 PDF pages) into one semantic space, supporting over 100 languages and interleaved inputs.
  • Supports large inputs: up to 8,192 text tokens, 6 images, 120 seconds of video, or 180 seconds of audio per request; uses Matryoshka Representation Learning for flexible output dimensions (3072, 1536, or 768 recommended).
  • Outperforms prior Google models and rivals like Amazon Nova 2 on benchmarks such as MTEB Multilingual (69.9 mean) and TextCaps recall@1 (89.6% for text-image).
  • Updates build on Gemini Embedding 2's general availability, enhancing File Search for agentic tasks like codebase analysis or cross-referencing files.
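The Matryoshka Representation Learning mentioned above lets a client keep only a leading prefix of the full embedding vector (e.g. 768 of 3072 dimensions) and re-normalize it, trading a little accuracy for smaller storage and faster search. A minimal sketch of that truncation step, using a toy 8-dimensional vector in place of a real Gemini embedding (the function name and values here are illustrative, not part of the Gemini API):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of a Matryoshka-style embedding
    and re-normalize to unit length for cosine-similarity search."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim "embedding" standing in for a 3072-dim vector.
full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]
short = truncate_embedding(full, 4)

print(len(short))                            # 4
print(round(sum(x * x for x in short), 6))   # 1.0 — unit length preserved
```

Because MRL training front-loads semantic information into the earliest dimensions, this prefix-and-renormalize step is all a consumer needs to switch between the recommended 3072/1536/768 sizes.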

Impact

Google's multimodal upgrade to Gemini API File Search pressures rivals such as OpenAI's text-focused embeddings and Amazon's Nova 2 by unifying text, image, video, and audio processing in one model, cutting pipeline complexity and latency by up to 70% according to user reports. This widens access to advanced RAG for developers building AI agents and improves precision in real-world tasks like semantic search over diverse data. While Voyage offers strong multimodal alternatives, Gemini Embedding 2's top benchmark scores and native support for interleaved inputs position Google to accelerate enterprise adoption of multimodal AI.
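The "semantic search over diverse data" described above reduces to nearest-neighbor lookup by cosine similarity once every item, whatever its modality, lives in the same embedding space. A self-contained sketch with mock unit vectors standing in for embeddings of mixed-modality files (the filenames and vectors are invented for illustration; a real system would obtain them from the embedding model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Mock embeddings for mixed-modality items; in a unified space a text
# query is directly comparable against image, text, and video documents.
corpus = {
    "chart.png": [0.9, 0.1, 0.0],
    "notes.txt": [0.1, 0.9, 0.1],
    "clip.mp4":  [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # embedding of a text query

best = max(corpus, key=lambda name: cosine(query, corpus[name]))
print(best)  # chart.png — the most similar item, regardless of modality
```

This cross-modality comparability is what removes the separate per-modality retrieval pipelines that the Impact paragraph credits with the reported complexity and latency savings.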