Weibo's VibeThinker-3B: A 3B Parameter Model That Beats Claude Opus 4.5 on Reasoning

Weibo AI delivered a surprising result

A 3 billion parameter model that beats Claude Opus 4.5, DeepSeek V3.2, and Gemini 3 Pro on math reasoning and code generation. This isn't a slide deck number game — the paper is on arXiv, weights are on Hugging Face, and 375 upvotes with 198 comments on Hacker News confirm the community is taking it seriously.

The model is called VibeThinker-3B, from Weibo AI (微博AI). Paper: arXiv:2606.16140.

The benchmark numbers

Benchmark	VibeThinker-3B	Notes
AIME26	94.3 (97.1 with claim-level test-time scaling)	American Invitational Math Exam 2026
LiveCodeBench v6	80.2 Pass@1	Live coding benchmark
LeetCode (unseen contests)	96.1% acceptance rate	Post-training-cutoff problems
IFEval	93.4	Instruction following

94.3 on AIME26 puts it in the same range as models with 100x-1000x more parameters. That's unusual.

How it works: the Spectrum-to-Signal paradigm

The approach has three stages:

Curriculum-based supervised fine-tuning (SFT) — staged training by difficulty, not dumping all data at once
Multi-domain reinforcement learning — RL optimization across math and code simultaneously
Offline self-distillation — the model distills more refined reasoning paths from its own outputs

The team also proposes the Parametric Compression-Coverage Hypothesis: verifiable reasoning (math proofs, code) can be compressed into compact "reasoning cores," while open-domain knowledge needs broad parameter coverage. Small reasoning models aren't cheap substitutes for large ones — they're a complementary path.

Community reaction

198 HN comments reflect a mix of excitement and skepticism.

One user described it as "a smart person who doesn't know anything about a given topic, but with the ability to learn." Accurate — VibeThinker-3B can't write encyclopedic articles or orchestrate APIs, but give it a math problem or coding challenge and it produces solid reasoning chains.

RTX 3090 owners (24GB VRAM) confirmed it runs locally via vLLM. Some are already testing it for source code security review. Others found it useless at hunting security bugs — on a benchmark built from Mythos-discovered vulnerabilities, it found zero.

A key limitation from the official model card: this model was not trained on tool-calling or agent-based data. Not recommended for function calling, API orchestration, or autonomous coding agents. It's a pure reasoning model.

The Qwen2.5 base model was noted as "ancient by LLM standards," suggesting better results might be possible with newer foundations.

Comparison with peers

Model	Parameters	Focus
VibeThinker-3B	3B	Pure reasoning (math + code)
DeepSeek V3.2	671B (MoE)	General purpose
GLM-5	Tens of billions	General (Chinese-optimized)
Gemini 3 Pro	Undisclosed	General multimodal
Claude Opus 4.5	Undisclosed	General (Anthropic flagship)

Two to three orders of magnitude fewer parameters, matching or beating on specific benchmarks — that's a meaningful signal.

Limitations

No tool calling or agent support — can't function as a general assistant
Python-heavy — community tests show best results on Python; other languages may underperform
Benchmark vs real-world gap — VentureBeat's headline was "the AI world arguing over benchmarks again"
Aging base model — Qwen2.5 isn't the latest open-source foundation

How to use it

Weights are on Hugging Face: WeiboAI/VibeThinker-3B. Deploy via vLLM or similar inference frameworks. Recommended: at least 24GB VRAM. Best suited for verifiable tasks like math reasoning and code generation, not general chatbot use.

GitHub: https://github.com/WeiboAI/VibeThinker arXiv paper: https://arxiv.org/abs/2606.16140

The bigger picture

The most interesting thing about VibeThinker-3B isn't the scores — it's the thesis behind it: reasoning ability and knowledge volume can be decoupled. If this holds, we might see more "small but sharp" reasoning models paired with tool use to fill knowledge gaps. 3B parameters for reasoning, search engines or RAG for facts. That combo could be more efficient than simply scaling up.

Weibo AI delivered a surprising result

The model is called VibeThinker-3B, from Weibo AI (微博AI). Paper: arXiv:2606.16140.

The benchmark numbers

Benchmark	VibeThinker-3B	Notes
AIME26	94.3 (97.1 with claim-level test-time scaling)	American Invitational Math Exam 2026
LiveCodeBench v6	80.2 Pass@1	Live coding benchmark
LeetCode (unseen contests)	96.1% acceptance rate	Post-training-cutoff problems
IFEval	93.4	Instruction following

94.3 on AIME26 puts it in the same range as models with 100x-1000x more parameters. That's unusual.

How it works: the Spectrum-to-Signal paradigm

The approach has three stages:

Curriculum-based supervised fine-tuning (SFT) — staged training by difficulty, not dumping all data at once
Multi-domain reinforcement learning — RL optimization across math and code simultaneously
Offline self-distillation — the model distills more refined reasoning paths from its own outputs

Community reaction

198 HN comments reflect a mix of excitement and skepticism.

The Qwen2.5 base model was noted as "ancient by LLM standards," suggesting better results might be possible with newer foundations.

Comparison with peers

Model	Parameters	Focus
VibeThinker-3B	3B	Pure reasoning (math + code)
DeepSeek V3.2	671B (MoE)	General purpose
GLM-5	Tens of billions	General (Chinese-optimized)
Gemini 3 Pro	Undisclosed	General multimodal
Claude Opus 4.5	Undisclosed	General (Anthropic flagship)

Two to three orders of magnitude fewer parameters, matching or beating on specific benchmarks — that's a meaningful signal.

Limitations

No tool calling or agent support — can't function as a general assistant
Python-heavy — community tests show best results on Python; other languages may underperform
Benchmark vs real-world gap — VentureBeat's headline was "the AI world arguing over benchmarks again"
Aging base model — Qwen2.5 isn't the latest open-source foundation

How to use it

GitHub: https://github.com/WeiboAI/VibeThinker arXiv paper: https://arxiv.org/abs/2606.16140

Weibo's VibeThinker-3B: A 3B Parameter Model That Beats Claude Opus 4.5 on Reasoning | 2026-06-24

More articles

Nub, LookAway, Apposters: Three Practical Tools Worth Trying | 2026-06-25

Gemini 3.5 Flash Gets Built-In Computer Use: Google Bakes Screen Control Into Its Workhorse Model | 2026-06-25

ByteDance Launches Seedance 2.5: 30-Second Single-Take AI Videos | 2026-06-23

Three Open-Source Tools Worth Checking Out: Unlimited OCR, TikZ Editor, FUTO Swipe | 2026-06-24

Weibo's VibeThinker-3B: A 3B Parameter Model That Beats Claude Opus 4.5 on Reasoning | 2026-06-24

Weibo AI delivered a surprising result

The benchmark numbers

How it works: the Spectrum-to-Signal paradigm

Community reaction

Comparison with peers

Limitations

How to use it

The bigger picture

More articles

Nub, LookAway, Apposters: Three Practical Tools Worth Trying | 2026-06-25

Gemini 3.5 Flash Gets Built-In Computer Use: Google Bakes Screen Control Into Its Workhorse Model | 2026-06-25

ByteDance Launches Seedance 2.5: 30-Second Single-Take AI Videos | 2026-06-23

Three Open-Source Tools Worth Checking Out: Unlimited OCR, TikZ Editor, FUTO Swipe | 2026-06-24

Weibo AI delivered a surprising result

The benchmark numbers

How it works: the Spectrum-to-Signal paradigm

Community reaction

Comparison with peers

Limitations

How to use it

The bigger picture