Details
- Google AI Developers announced that multimodal function calling is now available in the Gemini Interactions API, enabling agents to process images natively.
- Tools return actual images rather than text descriptions, with Gemini 3 handling mixed text and image results directly (see the sketch after this list).
- Key features include native image processing by Gemini 3 for tasks like describing images, analyzing documents, or visual decision-making.
- The Interactions API unifies model interactions and agent workflows, simplifying state management and multi-turn conversations.
- Supports use cases like reading images from the filesystem, capturing screenshots, fetching from APIs, rendering charts, or processing scanned documents.
- Official guide: https://t.co/GNMBvXy5SG; Docs: https://t.co/L69vkNnIJa.
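The flow below is a minimal sketch of the pattern the bullets describe: the model requests a tool call, the tool returns raw image bytes, and those bytes are sent back to Gemini 3 as part of the function result rather than a caption. The `client.interactions.create` entry point, `previous_interaction_id` field, response shape, and the `read_screenshot` tool are illustrative assumptions modeled loosely on the google-genai SDK, not confirmed names; consult the official guide linked above for the actual API surface.

```python
# Hypothetical sketch of a multimodal function-calling loop.
# client.interactions.create, previous_interaction_id, and the
# response fields below are assumptions, not the confirmed
# Interactions API surface -- see the official guide for details.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def read_screenshot(path: str) -> bytes:
    """Local tool: return raw PNG bytes instead of a text description."""
    with open(path, "rb") as f:
        return f.read()


# Declare the tool so the model can request it.
screenshot_tool = types.FunctionDeclaration(
    name="read_screenshot",
    description="Read a screenshot from disk and return the image itself.",
    parameters={"type": "object", "properties": {"path": {"type": "string"}}},
)

# 1) The model decides it needs the tool.
response = client.interactions.create(  # assumed method name
    model="gemini-3",                   # assumed model id
    input="What error dialog is shown in /tmp/ui.png?",
    tools=[types.Tool(function_declarations=[screenshot_tool])],
)

call = response.function_calls[0]       # assumed response shape
image_bytes = read_screenshot(call.args["path"])

# 2) Return the actual image -- not a text description -- as the
#    tool result; the stateful API carries the prior turn's context.
followup = client.interactions.create(
    model="gemini-3",
    previous_interaction_id=response.id,  # assumed state handle
    input=[
        types.Part.from_function_response(
            name="read_screenshot",
            response={"status": "ok"},
        ),
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    ],
)
print(followup.text)  # Gemini 3 reasons over the image natively
```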
Impact
Google's rollout of multimodal function calling in the Gemini Interactions API positions it as a leader in agentic AI workflows, letting developers build visual agents that consume real images from tools without intermediate text descriptions. This puts direct pressure on rivals such as OpenAI's GPT-4o and Anthropic's Claude, which accept multimodal inputs but, per their current API docs, lack equally seamless native image returns inside function-calling loops. By integrating this into a stateful Interactions API, Google lowers the barrier for multi-turn applications like screenshot analysis or chart rendering, accelerating adoption in enterprise automation and developer tools. It also fits the push toward on-device and edge inference by enabling lighter, vision-aware agents, and could steer R&D toward hybrid text-vision systems over the next 12-24 months amid GPU constraints. Competitors may match the feature quickly, but Google's early-mover advantage with Gemini 3 could capture market share in visual agentic apps.
