Details
- Google introduced Gemini 3.1 Flash TTS, a new text-to-speech model designed for developers and enterprises building AI-based speech applications.
- The model features native multi-speaker dialogue capability, allowing users to cast and direct multiple characters in a single API call with unique voice profiles.
- Audio tags give precise control over vocal style, pacing, accent, and expression: natural-language commands are embedded directly in the text input, and more than 200 tags are available for granular control.
- Supports 70+ languages with per-locale accent control, enabling localized expressive speech without separate pipelines.
- Achieves an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, positioning it in the "most attractive quadrant" for balancing speech quality and cost.
- All audio output is watermarked with SynthID so that AI-generated content can be identified, which helps limit misinformation.
- Available now in public preview on Google AI Studio, Vertex AI, and Google Vids.
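
To make the multi-speaker and audio-tag ideas above concrete, here is a minimal sketch of how a tagged, multi-character script might be assembled as plain text before being sent in a single request. It does not call any Google API. The bracketed tag syntax (`[whispering]`), the tag names, the `Speaker: text` line format, and the helper names `Line`, `render_script`, and `speakers` are all assumptions made for illustration and are not taken from the announcement.

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    """One line of dialogue for a single speaker."""
    speaker: str
    text: str
    # Hypothetical tag names; the real tag vocabulary is not listed in the source.
    tags: list[str] = field(default_factory=list)


def render_script(lines: list[Line]) -> str:
    """Render dialogue as 'Speaker: [tag] text' lines (assumed syntax)."""
    rendered = []
    for line in lines:
        prefix = "".join(f"[{tag}] " for tag in line.tags)
        rendered.append(f"{line.speaker}: {prefix}{line.text}")
    return "\n".join(rendered)


def speakers(lines: list[Line]) -> list[str]:
    """Return unique speaker names in order of first appearance."""
    return list(dict.fromkeys(line.speaker for line in lines))


script = [
    Line("Narrator", "The storm had passed.", tags=["calm"]),
    Line("Ava", "Did you hear that?", tags=["whispering"]),
    Line("Narrator", "Nobody answered."),
]

prompt = render_script(script)
print(prompt)
print(speakers(script))
```

In a real integration, the rendered text and the list of speakers would be passed to the TTS endpoint together with a voice assignment per speaker. The exact request shape should be taken from the official documentation, not from this sketch.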
Impact
Gemini 3.1 Flash TTS strengthens Google's competitive position in AI voice synthesis by combining high-fidelity audio with fine-grained control through inline audio tags, a feature that narrows the gap with specialized TTS providers. The multi-speaker dialogue capability and support for 70+ languages with locale-specific accents lower the barrier for developers building conversational AI and localized voice applications at scale. SynthID watermarking addresses growing regulatory concern about identifying synthetic media and aligns with emerging AI transparency standards. The cost-performance balance makes the model an accessible option for enterprises that want production-ready speech generation without premium pricing.
