Weibo AI delivered a surprising result

A 3 billion parameter model that beats Claude Opus 4.5, DeepSeek V3.2, and Gemini 3 Pro on math reasoning and code generation. This isn't a slide deck number game — the paper is on arXiv, weights are on Hugging Face, and 375 upvotes with 198 comments on Hacker News confirm the community is taking it seriously.

The model is called VibeThinker-3B, from Weibo AI (微博AI). Paper: arXiv:2606.16140.

The benchmark numbers

BenchmarkVibeThinker-3BNotes
AIME2694.3 (97.1 with claim-level test-time scaling)American Invitational Math Exam 2026
LiveCodeBench v680.2 Pass@1Live coding benchmark
LeetCode (unseen contests)96.1% acceptance ratePost-training-cutoff problems
IFEval93.4Instruction following

94.3 on AIME26 puts it in the same range as models with 100x-1000x more parameters. That's unusual.

How it works: the Spectrum-to-Signal paradigm

The approach has three stages:

  1. Curriculum-based supervised fine-tuning (SFT) — staged training by difficulty, not dumping all data at once
  2. Multi-domain reinforcement learning — RL optimization across math and code simultaneously
  3. Offline self-distillation — the model distills more refined reasoning paths from its own outputs

The team also proposes the Parametric Compression-Coverage Hypothesis: verifiable reasoning (math proofs, code) can be compressed into compact "reasoning cores," while open-domain knowledge needs broad parameter coverage. Small reasoning models aren't cheap substitutes for large ones — they're a complementary path.

Community reaction

198 HN comments reflect a mix of excitement and skepticism.

One user described it as "a smart person who doesn't know anything about a given topic, but with the ability to learn." Accurate — VibeThinker-3B can't write encyclopedic articles or orchestrate APIs, but give it a math problem or coding challenge and it produces solid reasoning chains.

RTX 3090 owners (24GB VRAM) confirmed it runs locally via vLLM. Some are already testing it for source code security review. Others found it useless at hunting security bugs — on a benchmark built from Mythos-discovered vulnerabilities, it found zero.

A key limitation from the official model card: this model was not trained on tool-calling or agent-based data. Not recommended for function calling, API orchestration, or autonomous coding agents. It's a pure reasoning model.

The Qwen2.5 base model was noted as "ancient by LLM standards," suggesting better results might be possible with newer foundations.

Comparison with peers

ModelParametersFocus
VibeThinker-3B3BPure reasoning (math + code)
DeepSeek V3.2671B (MoE)General purpose
GLM-5Tens of billionsGeneral (Chinese-optimized)
Gemini 3 ProUndisclosedGeneral multimodal
Claude Opus 4.5UndisclosedGeneral (Anthropic flagship)

Two to three orders of magnitude fewer parameters, matching or beating on specific benchmarks — that's a meaningful signal.

Limitations

How to use it

Weights are on Hugging Face: WeiboAI/VibeThinker-3B. Deploy via vLLM or similar inference frameworks. Recommended: at least 24GB VRAM. Best suited for verifiable tasks like math reasoning and code generation, not general chatbot use.

GitHub: https://github.com/WeiboAI/VibeThinker arXiv paper: https://arxiv.org/abs/2606.16140

The bigger picture

The most interesting thing about VibeThinker-3B isn't the scores — it's the thesis behind it: reasoning ability and knowledge volume can be decoupled. If this holds, we might see more "small but sharp" reasoning models paired with tool use to fill knowledge gaps. 3B parameters for reasoning, search engines or RAG for facts. That combo could be more efficient than simply scaling up.