Details
- Google AI Developers announced that multimodal function calling is now available in the Gemini Interactions API, enabling agents to process images natively.
- Tools return actual images rather than text descriptions, with Gemini 3 handling mixed text and image results directly (see the sketch after this list).
- Key features include native image processing by Gemini 3 for tasks like describing images, analyzing documents, or visual decision-making.
- The Interactions API unifies model interactions and agent workflows, simplifying state management and multi-turn conversations.
- Supports use cases like reading images from the filesystem, capturing screenshots, fetching from APIs, rendering charts, or processing scanned documents.
- Official guide: https://t.co/GNMBvXy5SG; Docs: https://t.co/L69vkNnIJa.
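The flow below is a minimal sketch of the pattern the bullets describe: the model requests a tool call, the tool returns raw image bytes, and those bytes are sent back to Gemini 3 as part of the function result rather than a caption. The `client.interactions.create` entry point, `previous_interaction_id` field, response shape, and the `read_screenshot` tool are illustrative assumptions modeled loosely on the google-genai SDK, not confirmed names; consult the official guide linked above for the actual API surface.

```python
# Hypothetical sketch of a multimodal function-calling loop.
# client.interactions.create, previous_interaction_id, and the
# response fields below are assumptions, not the confirmed
# Interactions API surface -- see the official guide for details.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def read_screenshot(path: str) -> bytes:
    """Local tool: return raw PNG bytes instead of a text description."""
    with open(path, "rb") as f:
        return f.read()


# Declare the tool so the model can request it.
screenshot_tool = types.FunctionDeclaration(
    name="read_screenshot",
    description="Read a screenshot from disk and return the image itself.",
    parameters={"type": "object", "properties": {"path": {"type": "string"}}},
)

# 1) The model decides it needs the tool.
response = client.interactions.create(  # assumed method name
    model="gemini-3",                   # assumed model id
    input="What error dialog is shown in /tmp/ui.png?",
    tools=[types.Tool(function_declarations=[screenshot_tool])],
)

call = response.function_calls[0]       # assumed response shape
image_bytes = read_screenshot(call.args["path"])

# 2) Return the actual image -- not a text description -- as the
#    tool result; the stateful API carries the prior turn's context.
followup = client.interactions.create(
    model="gemini-3",
    previous_interaction_id=response.id,  # assumed state handle
    input=[
        types.Part.from_function_response(
            name="read_screenshot",
            response={"status": "ok"},
        ),
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
)
print(followup.text)  # Gemini 3 reasons over the image natively
```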
Impact
Google's rollout of multimodal function calling in the Gemini Interactions API positions it as a leader in agentic AI workflows, letting developers build visual agents that consume real images from tools without intermediate text descriptions. This puts direct pressure on rivals such as OpenAI's GPT-4o and Anthropic's Claude, which accept multimodal inputs but, per their current API docs, lack equally seamless native image returns inside function-calling loops. By integrating this into a stateful Interactions API, Google lowers the barrier for multi-turn applications like screenshot analysis or chart rendering, accelerating adoption in enterprise automation and developer tools. It also fits the push toward on-device and edge inference by enabling lighter, vision-aware agents, and could steer R&D toward hybrid text-vision systems over the next 12-24 months amid GPU constraints. Competitors may match the feature quickly, but Google's early-mover advantage with Gemini 3 could capture market share in visual agentic apps.
