Details

  • Google AI Developers announced Multi-Token Prediction (MTP) drafters for Gemma 4 models, achieving up to a 3x inference speedup by addressing the memory-bandwidth bottleneck of standard autoregressive LLM decoding.
  • Drafters are tiny, efficient models that run alongside the main Gemma 4 target model: using speculative decoding, the drafter cheaply proposes several tokens ahead, and the target model verifies them, decoupling generation from verification while preserving output quality.
  • Released today under the open-source Apache 2.0 license, matching Gemma 4's licensing; weights available for immediate download.
  • Access points include the announcement blog, Kaggle, and Hugging Face for easy integration.
  • Builds on the Gemma 4 family, recently launched with sizes from Effective 2B to 31B, multimodal capabilities (text, image, audio on small models), a 256K context window, and top leaderboard rankings like #3 for 31B on Arena AI.
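The draft-and-verify loop described above can be sketched in miniature. This is a toy illustration under stated assumptions, not the Gemma API: both "models" are hypothetical deterministic stand-ins over integer tokens, and the function names are invented for this example. The key property it demonstrates is that speculative decoding reproduces exactly what the target model would generate on its own, because every drafted token is either confirmed or replaced by the target's own prediction.

```python
# Toy sketch of speculative decoding (hypothetical stand-in models,
# not the Gemma 4 drafter API).

def target_model(prefix):
    # Stand-in for the large target model: next token = sum of prefix mod 10.
    return sum(prefix) % 10

def draft_model(prefix):
    # Stand-in for the tiny drafter: usually agrees with the target,
    # but deliberately diverges when the prefix sum is divisible by 7.
    guess = sum(prefix) % 10
    return (guess + 1) % 10 if sum(prefix) % 7 == 0 else guess

def speculative_decode(prompt, n_new, k=4):
    """Generate n_new tokens. The drafter proposes k tokens at a time;
    the target verifies them (a single parallel pass in a real system),
    and the longest agreeing prefix is accepted, plus the target's
    correction at the first mismatch."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1. Drafter speculates k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2. Target checks each drafted position.
        accepted = []
        for i, d in enumerate(draft):
            t = target_model(tokens + draft[:i])
            if d == t:
                accepted.append(d)          # draft confirmed
            else:
                accepted.append(t)          # target's token overrides
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new]
```

Because accepted tokens always equal the target model's greedy choices, the output is identical to plain decoding with the target alone; the speedup comes from verifying several drafted tokens per target pass instead of generating one token per pass.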

Impact

This release intensifies competition in open-weight inference optimization, pressuring rivals like Meta's Llama and Mistral by delivering verifiable 3x speedups on Gemma 4, which already ranks among top open models on leaderboards. It lowers on-device and edge deployment barriers, accelerating adoption for agentic workflows and code generation without quality loss. By enhancing hardware efficiency, it widens developer access to frontier capabilities, potentially shifting market dynamics toward more performant local AI over cloud-dependent alternatives.