Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
Google has released Multi-Token Prediction (MTP) drafters for its Gemma 4 open models. The drafters act as small draft models for speculative decoding: they predict several future tokens cheaply, the main model verifies them, and accepted tokens yield up to 3x faster generation. The Gemma 4 models, launched this spring, are built on the same technology as Google's frontier Gemini AI but are designed to run locally on user hardware: at full precision on a single high-power AI accelerator, or on a consumer GPU with quantization. Google also moved Gemma 4 to the permissive Apache 2.0 license, replacing its previous custom licenses. By speeding up token generation, MTP helps offset the hardware limitations of running AI locally.
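The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a toy greedy version with stand-in models, not Google's MTP implementation; the function names and the `k`/`max_new` parameters are illustrative assumptions.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=12):
    """Toy greedy speculative decoding: a cheap draft model proposes k
    tokens, the target model verifies them, and the matching prefix is
    kept. `target` and `draft` each map a token sequence to its greedy
    next token. (Illustrative sketch, not Gemma's actual MTP drafter.)"""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Target model checks each proposed position. A real system
        #    scores all k positions in one parallel forward pass; we do
        #    it sequentially here for clarity.
        accepted = 0
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted += 1
            else:
                # First mismatch: substitute the target's own token, so
                # the output always equals pure target-only decoding.
                proposal = proposal[:i] + [expected]
                accepted += 1
                break
        tokens += proposal[:accepted]
    return tokens[: len(prompt) + max_new]

# Stand-in models: the target always emits previous token + 1; the
# draft agrees except after multiples of 5, where it guesses wrong.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if seq[-1] % 5 else seq[-1] + 2

out = speculative_decode(target, draft, [0], k=4, max_new=8)
# → [0, 1, 2, 3, 4, 5, 6, 7, 8], identical to target-only decoding
```

The key property is that verification makes the scheme lossless: whenever the draft is right, several tokens are emitted for one target pass, which is where the speedup comes from; whenever it is wrong, the output falls back to exactly what the target model alone would have produced.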
Key facts
- Google released Multi-Token Prediction (MTP) drafters for Gemma 4.
- MTP uses speculative decoding to predict future tokens.
- Gemma 4 models can achieve up to 3x faster generation with MTP.
- Gemma 4 was launched in spring 2026.
- Gemma 4 is built on the same technology as Gemini AI.
- Gemma 4 is designed to run locally on user hardware.
- Gemma 4 can run at full precision on a single high-power AI accelerator.
- Gemma 4 license is Apache 2.0.
- Gemma 4 can run on a consumer GPU with quantization.
- MTP addresses hardware limitations of local AI.
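The consumer-GPU path above relies on quantization, i.e. storing weights at lower precision to shrink memory use. A minimal sketch of symmetric per-tensor int8 quantization follows; this is a generic illustration of the idea, not Gemma 4's actual quantization scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (illustrative sketch):
    scale float weights into [-127, 127] and round to 8-bit integers,
    cutting storage 4x versus float32."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# w_hat approximates w to within about one quantization step (~scale)
```

The trade-off is a small, bounded rounding error per weight in exchange for a model that fits in consumer-GPU memory.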
Entities
Institutions
- Ars Technica