Qwen Enters the Physical World
On June 16, Alibaba's Qwen team released the Qwen-Robot Suite, marking the first time the Qwen model family has ventured into physical robot control. The suite includes three independent foundation models covering manipulation, navigation, and world modeling, three core problems in robotics.
The models are already in pilot testing with selected Alibaba Cloud enterprise clients.
What Each Model Solves
Qwen-RobotManip is a Vision-Language-Action (VLA) model built on Qwen3.5-4B. Its core problem: different robot arms from different vendors have completely incompatible action spaces — a grasping policy that works on one robot fails on another. RobotManip uses a unified 80-dimensional action representation to align motion encodings across hardware. When developers switch between robots, they only need fine-tuning, not full retraining.
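To make the idea of a shared action space concrete, here is a minimal sketch of one way a fixed-width vector could hold actions from arms with different degrees of freedom. Only the figure of 80 dimensions comes from the article. The slot layout, the mask, and the names `Embodiment`, `to_unified`, and `from_unified` are assumptions for illustration and are not taken from Qwen-RobotManip itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

UNIFIED_DIM = 80  # dimension stated in the article; everything else here is assumed


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot's native action space."""
    name: str
    slots: Tuple[int, ...]  # unified-vector index for each native action dimension


def to_unified(emb: Embodiment, native_action: List[float]):
    """Scatter a native action into the shared vector; return (vector, mask)."""
    if len(native_action) != len(emb.slots):
        raise ValueError("native action length does not match the embodiment")
    vec = [0.0] * UNIFIED_DIM
    mask = [False] * UNIFIED_DIM  # marks which dimensions this robot actually uses
    for slot, value in zip(emb.slots, native_action):
        vec[slot] = float(value)
        mask[slot] = True
    return vec, mask


def from_unified(emb: Embodiment, vec: List[float]) -> List[float]:
    """Gather only the dimensions this embodiment uses, in native order."""
    return [vec[slot] for slot in emb.slots]


# Two made-up arms: 7 joints plus gripper, and 6 joints plus gripper.
# Both place the gripper command in the same (assumed) slot, index 79.
arm_a = Embodiment("arm_a", tuple(range(7)) + (79,))
arm_b = Embodiment("arm_b", tuple(range(6)) + (79,))

native = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 1.0]
vec, mask = to_unified(arm_b, native)
print(len(vec), sum(mask), from_unified(arm_b, vec) == native)
```

In a layout like this, a policy always emits the same 80 numbers, and each robot reads back only its own slots, which is one plausible reason a switch between arms would need fine-tuning rather than retraining from scratch.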
Training data came from over 38,100 hours of open-source manipulation video. A human-to-robot synthesis pipeline converted 1,933 hours of egocentric human video into 24,808 hours of robot demonstrations spanning 15 robot platforms.
Qwen-RobotNav is built on Qwen3-VL with 2B, 4B, and 8B parameter variants. It unifies five navigation task families — instruction following, goal navigation, object search, tracking, and autonomous driving — into a single framework that outputs 8-waypoint trajectories with position and heading. Switching tasks doesn't require swapping models; it uses a parameterized observation interface to dynamically adjust memory strategies.
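As an illustration of what an 8-waypoint output with position and heading might look like to downstream code, the sketch below defines a small container and a sanity check. The article does not say whether positions are 2D or 3D, which coordinate frame is used, or what the units are, so the planar robot-centric frame, metres, and radians here are assumptions, as are the names `Waypoint`, `validate`, and `path_length`.

```python
import math
from dataclasses import dataclass
from typing import List

NUM_WAYPOINTS = 8  # trajectory length stated in the article


@dataclass(frozen=True)
class Waypoint:
    x: float        # assumed: metres forward in a robot-centric frame
    y: float        # assumed: metres to the left
    heading: float  # assumed: radians, 0 = facing forward


def validate(trajectory: List[Waypoint]) -> List[Waypoint]:
    """Reject trajectories that do not have exactly eight waypoints."""
    if len(trajectory) != NUM_WAYPOINTS:
        raise ValueError(f"expected {NUM_WAYPOINTS} waypoints, got {len(trajectory)}")
    return trajectory


def path_length(trajectory: List[Waypoint]) -> float:
    """Length of the polyline from the robot's origin through every waypoint."""
    total = 0.0
    prev_x, prev_y = 0.0, 0.0
    for wp in trajectory:
        total += math.hypot(wp.x - prev_x, wp.y - prev_y)
        prev_x, prev_y = wp.x, wp.y
    return total


# A straight-ahead trajectory with waypoints every half metre.
straight = validate([Waypoint(0.5 * (i + 1), 0.0, 0.0) for i in range(NUM_WAYPOINTS)])
print(path_length(straight))
```

A fixed-length output of this kind is the same shape whether the task is object search or driving, so a single controller interface could consume it regardless of which task family produced it.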
Qwen-RobotWorld is a language-conditioned video world model with a 60-layer dual-stream MMDiT architecture and roughly 20 billion parameters. It uses natural language as a unified action interface — input the current observation and a language instruction, and it predicts future video. The training corpus includes 8.6 million video-text pairs with over 200 million frames. It can serve both as a pre-execution rehearsal system for error correction and as a synthetic data generator for the other two models' training pipelines.
Benchmark Results
RobotManip set new state-of-the-art results on several cross-embodiment benchmarks:
| Benchmark | Previous SOTA | Qwen-RobotManip |
|---|---|---|
| LIBERO-Plus | 84.4% | 91.4% |
| RoboTwin-C2R Hard | 47.9% | 69.4% |
| EBench | 27.1% | 45.6% |
| RoboTwin-IF | 49.6% | 72.2% |
Cross-embodiment transfer rate jumped from 7.5% (previous best) to 23.9% — a 3.2x improvement. On the RoboChallenge Table30 real-world task track with 30 tasks, RobotManip took first and second place (codenamed "Lira" and "Atlas"), leading third place by 20 percentage points.
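The reported figures mix two kinds of comparison: the benchmark table is easiest to read as percentage-point gains, while the transfer rate is quoted as a ratio. The short script below recomputes both from the numbers given above; the variable and function names are only for illustration.

```python
# (previous SOTA, Qwen-RobotManip) in percent, copied from the table above.
results = {
    "LIBERO-Plus": (84.4, 91.4),
    "RoboTwin-C2R Hard": (47.9, 69.4),
    "EBench": (27.1, 45.6),
    "RoboTwin-IF": (49.6, 72.2),
}


def point_gain(previous: float, new: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - previous, 1)


gains = {name: point_gain(prev, new) for name, (prev, new) in results.items()}

# Cross-embodiment transfer is reported as a ratio: 23.9% versus 7.5%.
transfer_ratio = 23.9 / 7.5

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
print(f"cross-embodiment transfer: {transfer_ratio:.1f}x")
```

The gains range from 7.0 points on LIBERO-Plus to 22.6 points on RoboTwin-IF, and the transfer ratio works out to about 3.19, which rounds to the quoted 3.2x.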
RobotNav reports state-of-the-art results across all five navigation domains, including a 76.5% success rate on VLN-CE RxR, 75.6% on HM3Dv2 object-goal navigation (RGB only), and 91.4 PDMS on NAVSIM. Deployed zero-shot on a Unitree Go2 quadruped robot using only its built-in low-resolution camera, it ran with a latency of 196 ms.
RobotWorld ranked first overall on both EWMBench and DreamGen Bench, with a motion fidelity HSD of 0.566 — 33% better than the runner-up. It scored a perfect 1.0 on four physics-adherence categories in WorldModelBench.
Competitive Landscape
Google DeepMind has Gemini Robotics. NVIDIA has Isaac GR00T. Physical Intelligence has its pi model series. In China, Alibaba faces competition from Huawei and Baidu.
Alibaba positions itself as "full stack" — operating across chips, agentic cloud, models, model-serving platforms, and applications. Robot Suite extends this chain into the physical layer. One analyst drew an analogy to CUDA: NVIDIA succeeded by inserting a universal software interface between GPU hardware and application developers, commoditizing hardware differentiation at the platform layer. Alibaba wants to do the same in robotics — build a universal abstraction layer between robotic software and fragmented hardware.
Alibaba is already working with Chinese robotics manufacturers, including Agibot, and Alibaba Cloud bundles model licensing with compute to compress development cycles.
The Real Test Is Just Beginning
Benchmark scores look impressive, but the old problem in robotics remains: what works in the lab doesn't always work on the factory floor. When lighting changes, grasp angles shift, or sensors get noisy, model performance can collapse. Previous demos from companies like Physical Intelligence looked spectacular, but there's still a significant gap between demo videos and production deployment.
Alibaba didn't release these models purely as research papers; they are already in pilots with enterprise clients. That's a pragmatic approach. The 3.2x improvement in cross-embodiment transfer suggests the unified action representation works as intended on benchmarks, but robustness, latency, and fault recovery in real deployments still need more validation.
China's embodied AI sector is moving from lab validation to commercial deployment, with the consumer robotics market expected to take shape within two to three years. Whoever establishes the de facto standard at the model layer gains first-mover advantage in the next round. Alibaba moved fast on this one.




