Details
- Google announces Gemini 3.1 Flash Live via the Gemini Live API, enabling developers to build conversational agents with native speech-to-speech processing, on-the-fly switching across more than 90 languages, and sub-second latency; a single API call replaces chained speech-to-text, LLM, and text-to-speech pipelines.
- Integration demoed with Agora's real-time voice and video stack, including a Reachy Mini robot demo with 70+ tool emotes mapped to physical motors and a Foodgora food-ordering agent that handles real-time cart updates and recommendations.
- Key features include barge-in support so users can interrupt mid-response, tool use via function calling, audio transcriptions, proactive audio control over when the model speaks, affective dialog that adapts tone to the user's expression, and multimodal processing of audio, video, and text streams.
- Improvements over prior models like Gemini 2.5 Flash: higher task completion in noisy environments, better instruction-following, more natural low-latency dialogue recognizing acoustic nuances, and enhanced noise filtering for real-world reliability.
- Available in preview in Google AI Studio; supports WebSocket streaming for bi-directional stateful connections, native audio I/O, and visual context at ~1 FPS; use cases span field technicians, accessibility tools, healthcare assistants, and live coding.
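To make the tool-use and WebSocket points above concrete, here is a minimal sketch of the kind of JSON setup payload a client might send when opening a bi-directional streaming session and declaring a callable function. The `add_to_cart` function, its schema, and the model identifier are illustrative placeholders, not details from the announcement; real integrations would typically go through Google's Gen AI SDK rather than hand-built messages.

```python
import json

# Hypothetical tool declaration for a food-ordering agent (names and schema
# are illustrative). Function calling lets the model emit a structured call
# like {"name": "add_to_cart", "args": {...}} instead of free-form text.
ADD_TO_CART = {
    "name": "add_to_cart",
    "description": "Add a menu item to the user's order.",
    "parameters": {
        "type": "object",
        "properties": {
            "item": {"type": "string", "description": "Menu item name"},
            "quantity": {"type": "integer", "minimum": 1},
        },
        "required": ["item"],
    },
}

def build_setup_message(model: str) -> dict:
    """Build the first message sent over the stateful WebSocket connection:
    which model to use, which modalities to return, and which tools the
    model is allowed to call during the conversation."""
    return {
        "setup": {
            "model": model,
            "generation_config": {"response_modalities": ["AUDIO"]},
            "tools": [{"function_declarations": [ADD_TO_CART]}],
        }
    }

# Placeholder model id for illustration only.
msg = build_setup_message("models/gemini-3.1-flash-live")
print(json.dumps(msg, indent=2))
```

After this setup message, the client streams audio (and optionally ~1 FPS video frames) up the same connection and receives audio, transcriptions, and function-call events back, which is what collapses the old STT → LLM → TTS chain into one session.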
Impact
Gemini 3.1 Flash Live pressures rivals such as OpenAI's GPT-4o realtime API and Anthropic's Claude voice mode by delivering sub-second latency and native multimodal streaming in a single call, cutting the integration complexity of chained STT, LLM, and TTS pipelines. The Google AI Studio preview widens developer access, which could accelerate adoption of agentic apps in healthcare, robotics, and customer service. Enhanced noise handling and support for more than 90 languages align with global, real-world deployment needs, narrowing gaps in conversational fluency while emphasizing tool integration and practical utility over raw speed.
