A project blew up on Hacker News last week — 768 upvotes, 210 comments, and over 2100 GitHub stars by today. Its name is Needle.
It does one thing: distill Gemini 3.1's tool calling ability into a model with only 26 million parameters.
To put 26M in perspective:
- Qwen's smallest model has 600M parameters (0.6B) — that's 23x bigger than Needle
- Llama 3.2 starts at 1B
- Needle is 0.026B
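The ratios are easy to sanity-check. A quick sketch using only the parameter counts quoted in the list above:

```python
# Parameter counts quoted in the list above, in millions.
NEEDLE_M = 26
others_m = {"Qwen 0.6B": 600, "Llama 3.2 1B": 1000}


def size_ratio(other_m: float, base_m: float = NEEDLE_M) -> float:
    """Return how many times larger a model is than Needle."""
    return other_m / base_m


for name, params_m in others_m.items():
    print(f"{name}: {size_ratio(params_m):.1f}x Needle")
```

So the smallest Qwen is roughly 23x Needle's size, and the 1B Llama 3.2 is roughly 38x.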
It's small enough to run locally on your MacBook. You can even fine-tune it.
How it works
The team is from Cactus Compute, led by Henry Ndubuaku. They built something called a Simple Attention Network — not a standard Transformer.
The architecture looks like an Encoder-Decoder, but neither side uses FFN (feed-forward networks). Everything relies on attention mechanisms. Encoder has 12 layers, Decoder has 8. Embedding dimension is just 512, 8 attention heads, 4 KV heads.
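To see how those shapes could add up to a model this small, here is a back-of-envelope count of the attention weights alone. This is a rough sketch, not the team's own accounting. It assumes no bias terms, a head size of 512 / 8 = 64, and that each decoder layer has one self-attention block and one cross-attention block. None of those details are confirmed by the source.

```python
# Shapes quoted in the paragraph above.
D_MODEL, N_HEADS, N_KV_HEADS = 512, 8, 4
ENC_LAYERS, DEC_LAYERS = 12, 8


def attention_params(d_model: int, n_heads: int, n_kv_heads: int) -> int:
    """Weights in one attention block with fewer KV heads than query heads.

    Assumes no biases and head_dim = d_model // n_heads.
    """
    head_dim = d_model // n_heads
    q = d_model * n_heads * head_dim            # query projection
    kv = 2 * d_model * n_kv_heads * head_dim    # key and value projections
    o = n_heads * head_dim * d_model            # output projection
    return q + kv + o


per_block = attention_params(D_MODEL, N_HEADS, N_KV_HEADS)
encoder = ENC_LAYERS * per_block        # one self-attention block per layer
decoder = DEC_LAYERS * 2 * per_block    # assumed: self-attention + cross-attention
total = encoder + decoder

print(f"per attention block: {per_block:,}")
print(f"attention weights:   {total / 1e6:.1f}M")
```

Under these assumptions the attention weights come to about 22M, which would leave roughly 4M of the quoted 26M for embeddings, normalization, and anything else.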
Pretraining: 16 TPU v6e, 200B tokens, 27 hours. Post-training: 2B tokens of single-shot function call data, 45 minutes. That's it.
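Those numbers imply a fairly high training throughput. A rough calculation, assuming the quoted times are wall-clock and that all 16 chips were used for the whole pretraining run (the hardware for post-training is not stated, so only an aggregate figure is computed for it):

```python
def tokens_per_second(tokens: float, hours: float) -> float:
    """Average throughput over a run, given total tokens and wall-clock hours."""
    return tokens / (hours * 3600)


pretrain = tokens_per_second(200e9, 27)      # 200B tokens in 27 hours
per_chip = pretrain / 16                     # assumed: all 16 TPU v6e chips in use
posttrain = tokens_per_second(2e9, 0.75)     # 2B tokens in 45 minutes
tokens_per_param = 200e9 / 26e6              # pretraining tokens per parameter

print(f"pretraining:   {pretrain:,.0f} tok/s total, {per_chip:,.0f} tok/s per chip")
print(f"post-training: {posttrain:,.0f} tok/s total")
print(f"tokens per parameter: {tokens_per_param:,.0f}")
```

That works out to about 2 million tokens per second during pretraining, and several thousand training tokens per parameter.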
Performance
On the narrow task of single-shot function calling, Needle beats:
- FunctionGemma-270M (Google, 270M params)
- Qwen-0.6B (Alibaba, 600M)
- Granite-350M (IBM, 350M)
- LFM-2.5-350M (Liquid AI, 350M)
On Cactus's own inference engine, it reaches 6000 tok/s prefill and 1200 tok/s decode. For comparison, a 7B model on the same hardware might hit a few hundred tok/s.
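To turn those throughput numbers into something you can feel, here is a simple latency estimate. The prompt and output sizes are made-up illustrative values, and the model ignores fixed overheads such as loading and tokenization.

```python
# Throughput figures quoted above for Cactus's inference engine.
PREFILL_TOK_S = 6000
DECODE_TOK_S = 1200


def estimated_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time for the prompt plus decode time for the generated tokens."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S


# Hypothetical request: 300 tokens of query plus tool schemas, 30 tokens of output.
latency = estimated_latency_s(300, 30)
print(f"{latency * 1000:.0f} ms")
```

For a request of that size the estimate is 75 ms, well within what feels instant for a voice or smart home command.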
The honest caveat: Needle is specialized for tool calling. It's not a general-purpose chat model and doesn't support multi-turn conversations. Its sweet spot is personal AI assistants — setting alarms, checking weather, toggling lights, controlling smart home devices.
How to use it
The team built an all-in-one CLI toolchain:
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
This opens a web UI where you can test and fine-tune. A Python API is also available:
from needle import SimpleAttentionNetwork, load_checkpoint, generate
params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
# Note: `tokenizer` is not defined in this snippet; load it as the repo's README describes.
result = generate(
    model, params, tokenizer,
    query="What's the weather in Beijing?",
    tools='[{"name":"get_weather","description":"Get weather","parameters":...}]',
)
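Once you have a result, your app still has to run the tool. Below is a minimal dispatch sketch. It assumes the model's output is a JSON string holding a list of calls, each with a `name` and an `arguments` object. That format is my assumption, not something the source documents, and `get_weather`, `TOOL_REGISTRY`, and `dispatch` are hypothetical names that are not part of the needle package.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical local tool; a real one would query a weather service.
    return f"(stub) weather for {city}"


# Only tools listed here can ever be executed.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(raw_result: str) -> list:
    """Parse a model reply and run each requested tool, rejecting unknown names."""
    try:
        calls = json.loads(raw_result)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if isinstance(calls, dict):  # tolerate a single call object
        calls = [calls]
    outputs = []
    for call in calls:
        name = call.get("name")
        if name not in TOOL_REGISTRY:
            raise ValueError(f"unknown tool: {name!r}")
        outputs.append(TOOL_REGISTRY[name](**call.get("arguments", {})))
    return outputs


# Assumed output shape, for illustration only.
example = '[{"name": "get_weather", "arguments": {"city": "Beijing"}}]'
print(dispatch(example))
```

Looking tools up in an explicit registry, rather than calling whatever name the model emits, keeps a small model's occasional bad output from turning into an arbitrary function call.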
Fine-tuning is straightforward. Prepare a JSONL file in which each line contains a query, the available tools, and the expected answers, with at least 120 examples per tool:
needle finetune my_data.jsonl
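Before kicking off a fine-tune, it is worth checking that every tool clears the 120-example bar. Here is a small sketch. The field names `query`, `tools`, and `answers` come from the description above, but the exact schema is my assumption: I treat `tools` as a list of tool definitions and `answers` as a list of calls with `name` and `arguments`. Check the repo for the real format.

```python
import json
from collections import Counter

MIN_EXAMPLES_PER_TOOL = 120  # threshold quoted above


def make_record(query: str, tools: list, answers: list) -> str:
    """Serialize one training example as a single JSONL line (assumed schema)."""
    return json.dumps({"query": query, "tools": tools, "answers": answers},
                      ensure_ascii=False)


def count_examples_per_tool(lines) -> Counter:
    """Count how many examples call each tool at least once."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        for name in {call["name"] for call in record["answers"]}:
            counts[name] += 1
    return counts


def underrepresented(counts: Counter, minimum: int = MIN_EXAMPLES_PER_TOOL) -> list:
    """Return tool names that have fewer than `minimum` examples."""
    return sorted(name for name, n in counts.items() if n < minimum)


# Illustrative data: 120 weather examples and only 5 alarm examples.
tools = [{"name": "get_weather"}, {"name": "set_alarm"}]
lines = [
    make_record(f"Weather in city {i}?", tools,
                [{"name": "get_weather", "arguments": {"city": f"city {i}"}}])
    for i in range(120)
] + [
    make_record(f"Wake me at {h}:00", tools,
                [{"name": "set_alarm", "arguments": {"hour": h}}])
    for h in range(5, 10)
]

counts = count_examples_per_tool(lines)
print(underrepresented(counts))  # tools that still need more examples
```

To check a real file, pass the open file handle instead of the in-memory list, for example `count_examples_per_tool(open("my_data.jsonl", encoding="utf-8"))`.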
Controversy
The HN comments weren't all praise. Someone dug up Google's ToS — "You may not use the Services to develop models that compete with the Services or reverse engineer model weights." It's a legal gray area.
Someone also pointed out that 26M sounds way more impressive than 0.026B. Fair point.
What it means
I think the most interesting thing about Needle isn't how capable it is — it's the direction it represents.
For the past two years everyone's been obsessed with bigger models. 7B is too small, 70B is barely adequate, 400B is the real deal. Needle shows that for certain specific tasks, a tiny model with well-designed data and architecture can go toe-to-toe with the giants.
And 26M parameters means it can run on your phone, your watch, your glasses. No internet needed, no cloud inference, privacy issues mostly solved.
Smart home, wearable devices, in-car assistants — that's where Needle is really headed.
Code and weights are fully open source (MIT) at github.com/cactus-compute/needle. Go pull it and try it out.




