Details
- Qwen introduced Qwen3-LiveTranslate-Flash, a live translation engine that handles speech, text, lip-reading, and gestures at the same time.
- The system understands 18 languages and 6 regional dialects, and can produce spoken output in 10 languages entirely on-device, with no cloud round-trip.
- Its vision layer reads on-screen captions, signage, and facial cues to improve accuracy when audio quality is poor or background noise is high.
- Demos show end-to-end translation latency under 500 ms on a single NVIDIA L4 GPU, down from 700 ms with Qwen2.
- The SDK ships for Android, iOS, and WebRTC in October 2025; an enterprise on-premises appliance with privacy controls follows in early 2026.
- Cloud pricing is US$0.002 per translated word, with volume discounts above 5 million words per month (see the cost sketch after this list).
- A "Flash" mode can lower video frame rates to 12 fps, reducing compute expenses by 30 percent compared to the top-tier Qwen3-LiveTranslate-Pro.
- Beta testers include Alibaba’s DingTalk for multilingual conferencing and China Eastern Airlines for spoken in-flight announcements.
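
To make the cloud pricing concrete, here is a minimal cost sketch. The US$0.002 rate and the 5-million-word threshold come from the announcement; the 20% discount rate and the marginal-tier structure are hypothetical, since the actual discount schedule was not published.

```python
# Rough monthly-cost model for the announced cloud pricing.
# Announced: US$0.002 per translated word, discounts above 5M words/month.
# HYPOTHETICAL: the 20% discount and the marginal-tier structure below.

BASE_RATE = 0.002               # USD per translated word (announced)
DISCOUNT_THRESHOLD = 5_000_000  # words/month before discounts kick in (announced)
ASSUMED_DISCOUNT = 0.20         # placeholder; the real tiers are unpublished

def monthly_cost(words: int) -> float:
    """Estimate USD cost, discounting only the words beyond the threshold."""
    base = min(words, DISCOUNT_THRESHOLD) * BASE_RATE
    extra = max(words - DISCOUNT_THRESHOLD, 0) * BASE_RATE * (1 - ASSUMED_DISCOUNT)
    return base + extra

for volume in (1_000_000, 5_000_000, 10_000_000):
    print(f"{volume:>12,} words -> ${monthly_cost(volume):>9,.2f}")
# ->    1,000,000 words -> $ 2,000.00
#       5,000,000 words -> $10,000.00
#      10,000,000 words -> $18,000.00
```

At the announced rate, 1 million words costs US$2,000; whether discounts apply to the whole bill or only to the volume past the threshold is an open question.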
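The Flash mode's 12 fps cap boils down to temporal downsampling: keeping only as many frames as the target rate requires. The sketch below shows the generic sampling logic; it illustrates the technique, not Qwen's actual implementation, which is not public.

```python
# Generic temporal downsampling: pick which frames of a 30 fps capture
# survive when the vision pipeline only needs 12 fps. Illustrative only;
# Qwen's Flash-mode internals are not public.

def downsample_indices(src_fps: float, dst_fps: float, n_frames: int) -> list[int]:
    """Return the source-frame indices to keep for the target frame rate."""
    step = src_fps / dst_fps  # e.g. 30 / 12 = 2.5 source frames per kept frame
    kept, cursor = [], 0.0
    while int(cursor) < n_frames:
        kept.append(int(cursor))
        cursor += step
    return kept

# One second of 30 fps video shrinks to 12 frames:
print(downsample_indices(src_fps=30, dst_fps=12, n_frames=30))
# -> [0, 2, 5, 7, 10, 12, 15, 17, 20, 22, 25, 27]
```

Going from 30 fps to 12 fps cuts vision-frame throughput by 60%, so a 30% overall saving is plausible if the audio and text stages account for roughly half the pipeline's compute.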
Impact
This rollout intensifies the race with Google's Interpreter Mode and Meta's SeamlessM4T, neither of which currently offers integrated lip-reading. With latency now below half a second, Qwen's system could unlock real-time translation in latency-sensitive arenas such as esports and legal transcription. Compliance-ready on-premises hardware and aggressive pricing position Qwen as a formidable challenger to Microsoft Azure Speech in the enterprise segment.