Weibo AI delivered a surprising result
A 3 billion parameter model that beats Claude Opus 4.5, DeepSeek V3.2, and Gemini 3 Pro on math reasoning and code generation. This isn't a slide deck number game — the paper is on arXiv, weights are on Hugging Face, and 375 upvotes with 198 comments on Hacker News confirm the community is taking it seriously.
The model is called VibeThinker-3B, from Weibo AI (微博AI). Paper: arXiv:2606.16140.
The benchmark numbers
| Benchmark | VibeThinker-3B | Notes |
|---|---|---|
| AIME26 | 94.3 (97.1 with claim-level test-time scaling) | American Invitational Math Exam 2026 |
| LiveCodeBench v6 | 80.2 Pass@1 | Live coding benchmark |
| LeetCode (unseen contests) | 96.1% acceptance rate | Post-training-cutoff problems |
| IFEval | 93.4 | Instruction following |
94.3 on AIME26 puts it in the same range as models with 100x-1000x more parameters. That's unusual.
How it works: the Spectrum-to-Signal paradigm
The approach has three stages:
- Curriculum-based supervised fine-tuning (SFT) — staged training by difficulty, not dumping all data at once
- Multi-domain reinforcement learning — RL optimization across math and code simultaneously
- Offline self-distillation — the model distills more refined reasoning paths from its own outputs
The team also proposes the Parametric Compression-Coverage Hypothesis: verifiable reasoning (math proofs, code) can be compressed into compact "reasoning cores," while open-domain knowledge needs broad parameter coverage. Small reasoning models aren't cheap substitutes for large ones — they're a complementary path.
Community reaction
198 HN comments reflect a mix of excitement and skepticism.
One user described it as "a smart person who doesn't know anything about a given topic, but with the ability to learn." Accurate — VibeThinker-3B can't write encyclopedic articles or orchestrate APIs, but give it a math problem or coding challenge and it produces solid reasoning chains.
RTX 3090 owners (24GB VRAM) confirmed it runs locally via vLLM. Some are already testing it for source code security review. Others found it useless at hunting security bugs — on a benchmark built from Mythos-discovered vulnerabilities, it found zero.
A key limitation from the official model card: this model was not trained on tool-calling or agent-based data. Not recommended for function calling, API orchestration, or autonomous coding agents. It's a pure reasoning model.
The Qwen2.5 base model was noted as "ancient by LLM standards," suggesting better results might be possible with newer foundations.
Comparison with peers
| Model | Parameters | Focus |
|---|---|---|
| VibeThinker-3B | 3B | Pure reasoning (math + code) |
| DeepSeek V3.2 | 671B (MoE) | General purpose |
| GLM-5 | Tens of billions | General (Chinese-optimized) |
| Gemini 3 Pro | Undisclosed | General multimodal |
| Claude Opus 4.5 | Undisclosed | General (Anthropic flagship) |
Two to three orders of magnitude fewer parameters, matching or beating on specific benchmarks — that's a meaningful signal.
Limitations
- No tool calling or agent support — can't function as a general assistant
- Python-heavy — community tests show best results on Python; other languages may underperform
- Benchmark vs real-world gap — VentureBeat's headline was "the AI world arguing over benchmarks again"
- Aging base model — Qwen2.5 isn't the latest open-source foundation
How to use it
Weights are on Hugging Face: WeiboAI/VibeThinker-3B. Deploy via vLLM or similar inference frameworks. Recommended: at least 24GB VRAM. Best suited for verifiable tasks like math reasoning and code generation, not general chatbot use.
GitHub: https://github.com/WeiboAI/VibeThinker arXiv paper: https://arxiv.org/abs/2606.16140
The bigger picture
The most interesting thing about VibeThinker-3B isn't the scores — it's the thesis behind it: reasoning ability and knowledge volume can be decoupled. If this holds, we might see more "small but sharp" reasoning models paired with tool use to fill knowledge gaps. 3B parameters for reasoning, search engines or RAG for facts. That combo could be more efficient than simply scaling up.




