Details
- Google for Developers announced Gemini 3.1 Flash TTS, a new text-to-speech model now available in preview via the Gemini API in Google AI Studio.
- The model enables low-latency, controllable audio generation from text inputs, supporting single- and multi-speaker voices in formats like LINEAR16, MP3, and OGG_OPUS.
- It offers granular expressive control using intuitive audio tags to direct style, pace, and delivery with high precision.
- The model is also available in preview through Cloud Text-to-Speech, and it builds on earlier models such as Gemini 2.5 Flash TTS for cost-efficient everyday applications.
- It complements other recent additions to the Gemini 3.1 Flash family, including Live Preview for real-time audio dialogue and image generation capabilities.
- Developers can test it immediately in Google AI Studio; the model takes text as input and returns audio as output, a pairing aimed at fluid interactions.
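
For developers who request LINEAR16 output, the response is raw 16-bit little-endian PCM with no file header, so most players will not open it until it is wrapped in a container. The sketch below is not the Gemini or Cloud Text-to-Speech client API; it only shows that wrapping step using Python's standard `wave` module. The helper names `pcm16_to_wav_bytes` and `sine_pcm16` are hypothetical, the 24 kHz mono default is an assumption to be checked against the actual response, and a generated sine tone stands in for real model audio.

```python
import io
import math
import struct
import wave


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Wrap raw LINEAR16 (16-bit signed little-endian PCM) bytes in a WAV container.

    sample_rate and channels are assumptions here; use the values reported
    by (or requested from) the TTS service.
    """
    frame_size = 2 * channels  # 2 bytes per sample, one sample per channel
    if len(pcm) % frame_size != 0:
        raise ValueError("PCM payload is not a whole number of 16-bit frames")

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def sine_pcm16(freq_hz: float = 440.0, seconds: float = 0.5, sample_rate: int = 24000) -> bytes:
    """Generate a quiet sine tone as a stand-in for audio bytes returned by a TTS call."""
    n = int(seconds * sample_rate)
    samples = [
        int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n)
    ]
    return struct.pack("<%dh" % n, *samples)


if __name__ == "__main__":
    wav_bytes = pcm16_to_wav_bytes(sine_pcm16())
    with open("sample.wav", "wb") as f:
        f.write(wav_bytes)
```

In practice the `sine_pcm16()` call would be replaced by the audio bytes from the service response. MP3 and OGG_OPUS outputs are already self-contained files and can be written to disk as-is.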
Impact
Google's Gemini 3.1 Flash TTS preview strengthens its multimodal AI suite by pairing expressive TTS with the low-latency Flash models, enabling more natural voice applications. This puts pressure on rivals such as OpenAI's GPT-4o audio and ElevenLabs by offering API-accessible, tag-controlled synthesis at potentially lower cost via Google Cloud. It could accelerate adoption in voice agents and in multilingual Search Live, now available in more than 200 countries, and it narrows remaining gaps in real-time dialogue, an area where Gemini leads benchmarks such as ComplexFuncBench Audio at 90.8%. Enterprises gain tools for customer-experience voice applications without switching providers.
