

§ log.009 — Gemma 4 just got 3x faster for free

2 min read · 350 words · AI, Gemma, open-source, inference

Google dropped Multi-Token Prediction drafters for all Gemma 4 models under Apache 2.0. Up to 3x speedup, zero quality loss. And these run on consumer hardware.

Google released MTP drafters for all Gemma 4 models two days ago. Under Apache 2.0. Free.

MTP stands for Multi-Token Prediction. It's a form of speculative decoding: a lightweight drafter model proposes several tokens ahead, then the main Gemma 4 model checks them all in a single forward pass and keeps every token up to the first one it disagrees with. Instead of generating one token per pass like normal, you get 2-3 tokens per pass. Same output, same quality, just faster.
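To make the draft-then-verify loop concrete, here's a minimal sketch in plain Python. It is not Gemma's implementation: the "models" are toy next-token functions I made up (`toy_target`, `toy_drafter`), decoding is greedy, and the verification runs in a loop where a real engine would use one batched forward pass. The names `speculative_round` and `generate` are hypothetical too.

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is just a greedy next-token function (toy stand-in, not a real LLM).
NextToken = Callable[[List[Token]], Token]


def speculative_round(target: NextToken, drafter: NextToken,
                      context: List[Token], k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens that get committed."""
    # 1. The cheap drafter guesses k tokens ahead.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each guess. A real engine scores all positions in
    #    one batched forward pass; the loop here is only for readability.
    committed: List[Token] = []
    for guess in draft:
        expected = target(context + committed)
        if guess != expected:
            # First mismatch: keep the target's token, drop the remaining guesses.
            committed.append(expected)
            return committed
        committed.append(guess)

    # 3. Every guess matched, so the same pass also yields one bonus token.
    committed.append(target(context + committed))
    return committed


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 3) -> Tuple[List[Token], int]:
    """Generate n_tokens speculatively; also report how many rounds it took."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_round(target, drafter, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def greedy(target: NextToken, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Baseline: one token per step, target model only."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Toy "big model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    # Toy "small model": agrees with the target except after a multiple of 4.
    return 0 if ctx[-1] % 4 == 0 else (ctx[-1] + 1) % 10
```

On this toy pair, `generate(toy_target, toy_drafter, [1], 12)` returns exactly what `greedy(toy_target, [1], 12)` returns, in 4 rounds instead of 12 single-token steps. The drafter is wrong sometimes, and it doesn't matter: wrong guesses only cost speed, never correctness.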

The speedups are real: up to 3x on the 31B dense model, about 2.2x on the 27B MoE. And since the main model still verifies every token and throws out any draft it disagrees with, there's zero degradation in output quality. No hallucination tax, no reasoning penalty. You just get the same answers faster.
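Where does a number like "2-3 tokens per pass" come from? A back-of-envelope sketch, using the standard speculative decoding estimate: if each drafted token is accepted independently with probability `alpha` and the drafter proposes `k` tokens, the expected tokens per verification pass is (1 - alpha^(k+1)) / (1 - alpha). The acceptance rates and the relative drafter cost below are illustrative assumptions, not measurements from Gemma 4.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass.

    alpha: assumed per-token acceptance probability (independent across tokens).
    k:     number of tokens the drafter proposes per round.
    """
    if alpha >= 1.0:
        # Perfect drafter: all k guesses plus the bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Rough wall-clock speedup over one-token-at-a-time decoding.

    draft_cost: assumed cost of one drafter step relative to one target pass
                (e.g. 0.05 means the drafter is 20x cheaper).
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


if __name__ == "__main__":
    for alpha in (0.6, 0.8):
        tokens = expected_tokens_per_pass(alpha, k=3)
        speedup = estimated_speedup(alpha, k=3, draft_cost=0.05)
        print(f"alpha={alpha}: {tokens:.2f} tokens/pass, ~{speedup:.2f}x")
```

With `k=3`, an acceptance rate of 0.6 gives about 2.18 tokens per pass and 0.8 gives about 2.95, which lines up with the 2-3 range. The better the drafter matches the target, the closer you get to the ceiling of `k + 1`.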

what actually changed

Google's been working on MTP since last year, but the drafters were tied to their internal infrastructure. Now they're released as standalone model weights that run on consumer hardware: Apple Silicon, consumer NVIDIA, edge devices. The drafter shares the target model's KV cache, so it adds very little overhead.

Supported out of the box in Transformers, MLX, vLLM, and Ollama. LM Studio uses llama.cpp under the hood, so it doesn't support MTP natively — you'd get the generic 1.3-1.5x speedup from a draft model, not the 2-3x from Gemma's optimized drafter.

I was actually considering switching back to LM Studio for more granular control. Feels like every week there's another reason to stay on Ollama.

why this matters

Inference speed for local models has been the bottleneck for a while. You can run frontier-ish models on consumer hardware now, but they're slow enough that you feel it — especially in an agent loop where you're making sequential calls. A 2-3x speedup means the difference between "waiting 10 seconds for a response" and "it's fast enough to not interrupt your flow."
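For a rough feel of what that does to an agent loop, here's a small sketch. It assumes the speedup applies only to token generation (speculative decoding doesn't accelerate prompt processing), so `decode_fraction`, the share of each call spent generating, is a made-up knob, as are all the example numbers.

```python
def loop_latency(calls: int, seconds_per_call: float,
                 decode_fraction: float, decode_speedup: float) -> float:
    """Total wall-clock time for a chain of sequential model calls.

    decode_fraction: assumed share of each call spent generating tokens.
    decode_speedup:  assumed speedup on that share (1.0 = no drafter).
    """
    untouched = seconds_per_call * (1.0 - decode_fraction)
    decoding = seconds_per_call * decode_fraction / decode_speedup
    return calls * (untouched + decoding)


if __name__ == "__main__":
    before = loop_latency(calls=8, seconds_per_call=10.0,
                          decode_fraction=0.9, decode_speedup=1.0)
    after = loop_latency(calls=8, seconds_per_call=10.0,
                         decode_fraction=0.9, decode_speedup=2.5)
    print(f"{before:.1f}s -> {after:.1f}s")
```

Under those assumptions, eight sequential 10-second calls drop from 80 seconds to about 37. The per-call saving is nice; the compounding across the loop is what you actually feel.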

For my setup: WrapsRL runs on cloud inference, but my personal assistant runs locally. Faster local inference means less friction. Less friction means I actually use it more.

Google's been releasing good stuff under permissive licenses lately. I'm paying attention.

$ echo "two tokens at once"

two tokens at once
