The hottest topic in the local AI community right now is MTP (Multi-Token Prediction) landing in llama.cpp. The headline number: community benchmarks report roughly doubled inference speed, measured in actual tokens per second rather than projected on paper.

What is MTP?

Traditional LLM inference predicts one token at a time. Each token needs its own full forward pass, and the model cannot start on the next token until the current one has been chosen. It is like reading a book one character at a time.

MTP takes a different approach: let the model predict several future tokens at once. During training, the model learns to predict the next n tokens simultaneously. During inference, those extra predictions serve as drafts for speculative decoding: the draft head guesses the next few tokens, the full model checks all of them in a single forward pass, and every guess that matches is kept. When the guesses are right, one pass yields several tokens instead of one.
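To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding using toy stand-in models. It is not llama.cpp's implementation: `toy_target`, `toy_draft`, `speculative_step`, and the draft length `k` are all hypothetical names chosen for illustration, and a real engine verifies the drafts in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it commits."""
    # 1. Draft: cheaply propose k tokens, each conditioned on the previous guesses.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. Verify: compare each guess with what the full model would emit.
    #    (Simulated sequentially here; a real engine does this in one pass.)
    committed: List[Token] = []
    for guess in proposed:
        truth = target(context + committed)
        committed.append(truth)
        if guess != truth:
            # First mismatch: keep the full model's token, drop later drafts.
            return committed

    # 3. Every draft was accepted, so the same pass also yields a bonus token.
    committed.append(target(context + committed))
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int = 3) -> Tuple[List[Token], int]:
    """Generate n_tokens speculatively; also return the number of verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def greedy(target: NextToken, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Baseline: one full-model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


# Toy stand-ins (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [0], n_tokens=12, k=3)
    print(tokens, rounds)
```

In this toy setup the speculative output is identical to plain greedy decoding, because the full model has the final say on every token. What changes is the number of full-model rounds: 4 here, against 12 one-token steps for the baseline.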

Meta first proposed this idea in their 2024 paper, "Better & Faster Large Language Models via Multi-token Prediction." People found it compelling at the time, but actual implementation in inference engines only happened recently.

MTP in llama.cpp

Recent versions of llama.cpp added support for MTP draft models. Community testing shows dramatic speed improvements when paired with Qwen 3.x series models.

Qwen 3.x models were trained with MTP support baked in: they ship with a native draft head, so there is no need to train a separate draft model for speculative decoding. Upgrade to the latest llama.cpp, load a Qwen 3.x GGUF, and MTP just works.

Key numbers circulating in the community:

  • Qwen3.5-27B + MTP on RTX 3090: 207 tok/s reported, nearly double the speed without MTP
  • Acceptance rate on Qwen 3.x: community tests show 70-90% of draft tokens are accepted. Most predictions are correct; only the rejected drafts are discarded and regenerated
  • Memory overhead is negligible: the MTP draft head is just a few extra linear layers on top of the model, with minimal VRAM impact
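As a rough sanity check on how an acceptance rate turns into speed, the sketch below estimates the expected number of tokens committed per full-model pass. It relies on simplifying assumptions that are not from the community reports: each draft token is accepted independently with the same probability, a fully accepted draft earns one bonus token, and the draft length of 3 is a hypothetical value. The function name `expected_tokens_per_pass` is likewise made up for illustration.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens committed per full-model verification pass.

    Assumes (as a simplification) that each draft token is accepted
    independently with probability accept_rate. The pass always commits
    at least one token, and the i-th extra token is committed only if the
    first i drafts were all accepted, which gives the geometric sum
    1 + a + a^2 + ... + a^draft_len.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Acceptance rates from the reported 70-90% range; draft_len=3 is assumed.
    for rate in (0.7, 0.8, 0.9):
        print(f"accept={rate:.1f}  tokens/pass={expected_tokens_per_pass(rate, 3):.2f}")
```

Under these assumptions, the reported 70-90% range works out to roughly 2.5 to 3.4 tokens per pass. That is an upper bound on the speedup rather than a prediction: drafting and verifying extra positions both cost time, which would be consistent with a real-world gain closer to 2x.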

Why this matters

Local inference progress over the past year has been mostly about quantization and smaller models; architectural breakthroughs have been rare. MTP is one of those rare optimizations that gives speed almost "for free": no quality loss, since the full model still verifies every draft token, and negligible extra VRAM.

For people running models on consumer GPUs, this is a big deal:

  • Models that were too slow to use are now viable
  • Real-time scenarios (voice assistants, code completion) are much smoother
  • Combined with GGUF quantization, local inference becomes a much more practical option

Community direction

The hottest combo right now is Qwen 3.x + MTP. Qwen 3.6's coding ability at 27B already punches above its weight, and MTP makes its inference speed competitive with smaller models.

Discussions on llama.cpp's GitHub around related PRs and issues are very active. Some are trying to extend MTP support to other model families, but for now Qwen shows the best results because it was trained natively with MTP support.

If you have a decent GPU lying around, now might be a good time to give local inference another try.