I. From "Vibe Coding" to "Long-Horizon Tasks": The Paradigm Shift in AI Engineering Capabilities
To understand the significance of GLM-5.1, one must first grasp the clear evolutionary path that AI programming capabilities are undergoing:
| Stage | Core Characteristics | Typical Duration |
|---|---|---|
| AI Coding | A programmer's efficiency tool, local productivity boost | Minutes |
| Vibe Coding | A more accessible mode of expression, turning ideas into prototypes quickly | Tens of minutes |
| Agentic Engineering | AI autonomously plans, executes, and iterates like an engineer | Hours |
| Long-Horizon Task | Works persistently and delivers outcomes like a senior expert | 8+ hours |
GLM-5.1 is precisely built for this fourth stage. The Zhipu AI team believes that "how long it can work" will replace "how smart it is" as the next-stage standard for measuring model intelligence.
What is a long-horizon task? It is an end-to-end, complex project-delivery task that spans multiple rounds of interaction, progresses across steps, remembers earlier constraints, and maintains stateful memory. In the past, even the most powerful open-source models often hit a bottleneck early when facing complex tasks: after initial gains, they would repeatedly retry known optimization methods but fail to actively switch strategies once a path proved ineffective.
GLM-5.1 breaks this limitation. It is currently the only open-source model capable of sustained work at the 8-hour level, and, alongside Claude Opus 4.6, one of the few models of any kind with this capability. Under the METR benchmark's evaluation standards, GLM-5.1 can work continuously and independently on a single task for over 8 hours, autonomously planning, executing, and self-evolving, ultimately delivering complete, engineering-grade results.
II. Hardcore Data: A Historic Breakthrough for Open-Source Models
Programming Capability: The Strongest in Open-Source History
GLM-5.1's report card can almost be described as "flipping the table":
| Benchmark | GLM-5.1 | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| Programming Evaluation Total Score | 45.3 | 47.9 | — |
| SWE-Bench Pro | 58.4 🏆 | 57.3 | — |
| SWE-bench Verified | 77.8% | — | — |
| Terminal-Bench 2.0 | 63.5 | 65.4 | — |
| NL2Repo | 42.7 | 49.8 | — |
Based on the comprehensive average score across the three most representative code evaluation benchmarks (SWE-Bench Pro, Terminal-Bench 2.0, NL2Repo), GLM-5.1 ranks third globally, first among domestic Chinese models, and first among open-source models.
Compared to the previous generation GLM-5's score of 35.4, GLM-5.1's programming evaluation score surged by nearly 10 points, an improvement of 28%—a generational leap.
Reasoning Capability: Comprehensive Alignment
| Benchmark | GLM-5.1 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| AIME 2026 | 95.3 | — | 98.7 |
| HMMT Feb. 2026 | 82.6 | 84.3 | 91.8 |
| GPQA-Diamond | 86.2 | 91.3 | — |
| HLE (with tools) | 52.3 | 53.1 | — |
Comprehensive Positioning
Zhipu AI's official positioning is very clear: GLM-5.1's comprehensive capabilities are fully aligned with Claude Opus 4.6, making it the first Chinese model to achieve comprehensive alignment in overall capabilities and placing it at the forefront of global open-source models.
III. Three Long-Horizon Task Demonstrations: The Model Works 8 Hours While You Sleep
The Zhipu AI team detailed three highly convincing real-world long-horizon task scenarios in their official blog. These are not simple code completions but complete engineering cycles requiring several hours of continuous work.
Scenario 1: Building a Linux Desktop from Scratch in 8 Hours
Sketch the architecture during the day, hand it to GLM-5.1 before bed, and wake up to a complete system. The run took exactly 8 hours and executed over 1,200 steps. The first meaningful output appeared at the 20-minute mark, and by the end it had produced a fully functional Linux desktop system: a desktop environment, window manager, status bar, applications, a VPN manager, Chinese font support, a game library, and more, with supporting files totaling 4.8 MB. That is roughly the workload of a 4-person team for one week, with no human involvement in testing or review at any point.
Scenario 2: 655 Iterations Breaking the Vector Database Optimization Bottleneck
Vector databases are the core engine behind AI search and recommendation systems. GLM-5.1 doesn't just tweak parameters—it autonomously completed the entire optimization chain from full-scan → IVF bucket recall → half-precision compression → quantization coarse ranking → two-level routing → early pruning. Over 655 iterations, it continuously ran benchmarks, identified bottlenecks, and adjusted strategies, ultimately increasing query throughput from 3108 QPS to 21472 QPS, a 6.9x improvement over the initial version.
The optimization trajectory showed a typical "stepwise" pattern: the model performed incremental tuning within a fixed strategy, and when gains plateaued, it proactively analyzed logs, located bottlenecks, and then jumped to a structurally different solution. Each jump was accompanied by a brief performance dip, followed by a new peak—this "break-fix" cycle itself is a hallmark of effective optimization.
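The IVF bucket-recall step in that chain is worth making concrete. The sketch below is an illustrative toy, not Zhipu's implementation: the simplified k-means, function names, and parameters are my own. It clusters vectors into buckets offline, then probes only the few nearest buckets at query time, which is exactly where the throughput gain over a full scan comes from.

```python
import numpy as np

def build_ivf(vectors, n_buckets, iters=10, seed=0):
    """Cluster vectors into buckets (simplified k-means) for IVF recall."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_buckets, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for b in range(n_buckets):
            members = vectors[assign == b]
            if len(members):
                centroids[b] = members.mean(0)
    # buckets hold the indices assigned to each centroid
    buckets = [np.where(assign == b)[0] for b in range(n_buckets)]
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, n_probe=2, k=5):
    """Probe only the n_probe nearest buckets instead of scanning everything."""
    order = ((centroids - query) ** 2).sum(-1).argsort()[:n_probe]
    cand = np.concatenate([buckets[b] for b in order])
    d = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[d.argsort()[:k]]
```

With `n_probe` equal to the bucket count this degrades to an exact full scan; the speed/recall trade-off comes from probing fewer buckets, which is what later stages (quantized coarse ranking, routing, pruning) then compound.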
Scenario 3: 1000 Rounds of Tool Calls Optimizing Real ML Workloads
On the KernelBench Level 3 optimization benchmark covering 50 real-world machine learning computational workloads, GLM-5.1 independently performed continuous optimization on each workload. During over 24 hours of uninterrupted iteration, it autonomously completed multiple rounds of compile—test—analyze—rewrite cycles, ultimately achieving a 3.6x geometric mean speedup, significantly higher than the 1.49x speedup of torch.compile max-autotune mode.
The model can autonomously write custom Triton Kernels and CUDA Kernels, employing techniques like cuBLASLt epilogue fusion, shared memory tiling, and CUDA Graph optimization—areas traditionally highly dependent on expert experience.
Core Insight: Stronger the Longer It Runs
Unlike previous models (including GLM-5) which exhausted their capabilities early on, GLM-5.1 performs better the longer it runs. A comparison on KernelBench shows that GLM-5 rose quickly initially but plateaued early, while GLM-5.1 continued to rise for longer, ultimately reaching 1.4x the performance of GLM-5. The key lies in how far the model can extend the window of "effective optimization."
IV. Technical Deep Dive: What Makes GLM-5.1 So Powerful?
GLM-5.1 is a post-training reinforcement upgrade of GLM-5, with the same architecture and parameter scale. The differences lie mainly in training strategies and optimization focus.
Core Specifications
| Parameter | Specification |
|---|---|
| Total Parameters | 744B (MoE architecture, 256 experts) |
| Active Parameters | 40B |
| Context Window | 200K tokens |
| Max Output | 128K tokens |
| Architectural Features | MLA + DeepSeek Sparse Attention |
| Open-Source License | MIT |
Key Innovation 1: DeepSeek Sparse Attention (DSA)
Traditional Transformer attention has O(L²) computational complexity, requiring about 16 billion computations for a 128K context. DSA replaces dense computation with a dynamic fine-grained filtering mechanism:
- Indexer First: A small neural network quickly scans all tokens to compute importance scores.
- Top-k Filtering: Only the top-2048 most relevant tokens are retained.
- Sparse Attention: Full attention computation is performed only on the filtered tokens.
This reduces the computation for a 128K sequence to about 800 million operations, a theoretical ~20x reduction that translates into a practical 1.5-2x cut in GPU cost. Crucially, the indexer still scans every token; only the unselected tokens are excluded from the core attention computation, so no long-range dependency becomes unreachable, which is what keeps the sparsity effectively lossless.
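A toy version of the three-step pipeline makes the mechanism concrete. In this sketch the "indexer" is a single projection vector standing in for the small scoring network, and the dimensions are tiny; it illustrates the idea, not the production kernel.

```python
import numpy as np

def sparse_attention(q, keys, values, w_index, top_k=2048):
    """Sketch of the indexer -> top-k -> sparse attention pipeline.
    w_index stands in for the small indexer network's parameters."""
    # 1) Indexer: a cheap importance score for every token (all are scanned)
    scores = keys @ w_index                          # shape (L,)
    # 2) Top-k filtering: keep only the most relevant positions
    k = min(top_k, len(keys))
    keep = np.argpartition(scores, -k)[-k:]
    # 3) Full softmax attention, restricted to the kept tokens
    logits = keys[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[keep]
```

Per query, the expensive attention step now touches `top_k` tokens instead of all `L`; only the much cheaper indexer pass remains linear in the full sequence length.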
Key Innovation 2: Slime Asynchronous Reinforcement Learning Framework
This is Zhipu's self-developed RL training framework (already open-sourced), named "Slime." It prevents degradation in long-horizon tasks through three mechanisms:
- Decoupling Generation and Training: The inference engine and training engine are deployed on different GPUs. The inference engine continuously generates trajectories, while the training engine asynchronously samples and updates policies, eliminating synchronization bottlenecks.
- Multi-Task Coordinator: A central server manages different task services, supporting 1000+ concurrent rollouts, enabling balanced data collection across tasks.
- Token-in-Token-out (TITO): Directly uses the precise token stream generated by the inference engine to construct learning trajectories, avoiding mismatch issues from re-tokenization.
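The first mechanism, decoupled generation and training, can be miniaturized with two threads and a bounded queue. Everything here (the trajectory format, the "policy update") is a stand-in rather than Slime's actual code; the point is only the shape of the design, in which the producer blocks solely when the buffer is full, never on a training step.

```python
import queue
import threading

def run_decoupled(n_trajectories, buffer_size=8):
    """Toy sketch of decoupled rollout generation and policy training:
    a producer thread streams trajectories while a consumer thread
    asynchronously drains them and applies 'updates'."""
    traj_queue = queue.Queue(maxsize=buffer_size)
    updates = []

    def actor():
        # stands in for the inference engine streaming rollouts
        for i in range(n_trajectories):
            traj_queue.put({"id": i, "tokens": [i, i + 1]})
        traj_queue.put(None)  # sentinel: generation finished

    def learner():
        # stands in for the training engine consuming trajectories
        while True:
            traj = traj_queue.get()
            if traj is None:
                break
            updates.append(traj["id"])  # one 'policy update' per trajectory

    t1, t2 = threading.Thread(target=actor), threading.Thread(target=learner)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return updates
```

In the real framework the two sides run on separate GPU pools and the "queue" is a trajectory buffer fed by 1000+ concurrent rollouts, but the decoupling principle is the same.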
Key Innovation 3: Progressive Alignment Strategy
Post-training follows a four-stage progressive path:
- Multi-task SFT: Introduces complex interleaved thinking patterns, expands Agent and coding data scale.
- Reasoning & Agent-Specific RL: Mixes four domains: mathematics, science, code, and tool-integrated reasoning.
- General RL: Multi-dimensional optimization objectives (correctness, emotional intelligence, specific task capabilities) + hybrid reward system.
- Cross-Stage Online Distillation: Mitigates capability degradation, ensuring the model retains abilities from all stages.
Key Innovation 4: Three Thinking Modes
GLM-5.1 supports three different thinking modes, providing optimal strategies for different scenarios:
- Interleaved Thinking: Thinks before each response and tool call, improving instruction following.
- Retained Thinking: Automatically retains multi-turn thinking blocks in Coding Agent scenarios, reusing existing reasoning.
- Turn-Level Thinking: Enables/disables reasoning as needed—reduces latency for lightweight requests, improves accuracy for complex tasks.
V. The Pricing Game-Changer: 94.6% of the Capability at 20% of the Price
Technical prowess is one thing, but for most developers and enterprises, price is the real deciding factor. GLM-5.1's pricing is a game-changer:
| Model | Input Price (/M tokens) | Output Price (/M tokens) |
|---|---|---|
| GLM-5.1 | $1.00 | $3.20 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.4 | $2.50 | $15.00 |
GLM-5.1's input price is one-fifth of Claude Opus 4.6's and two-fifths of GPT-5.4's; the output gap is even starker, at roughly 1/7.8 of Claude's price and 1/4.7 of GPT-5.4's.
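Those ratios are easy to sanity-check. The helper below prices an arbitrary job against the table above; the prices come from the table, while the example job size is made up for illustration.

```python
def job_cost(input_mtok, output_mtok, price_in, price_out):
    """Dollar cost of a job measured in millions of tokens."""
    return input_mtok * price_in + output_mtok * price_out

# (input, output) prices per million tokens, from the table above
PRICES = {
    "GLM-5.1": (1.00, 3.20),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
}

def compare(input_mtok, output_mtok):
    """Cost of the same job under each pricing scheme."""
    return {m: job_cost(input_mtok, output_mtok, pi, po)
            for m, (pi, po) in PRICES.items()}
```

For example, a hypothetical job with 10M input and 2M output tokens comes to $16.40 on GLM-5.1 versus $100.00 on Claude Opus 4.6 and $55.00 on GPT-5.4.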
For GLM Coding Plan subscribers, the lowest Lite plan costs only $30/quarter (discounted to $27), providing 3x the usage quota of Claude Pro, and all plans support GLM-5.1.
VI. Open-Source Ecosystem & Domestic Chip Adaptation
GLM-5.1 is fully open-sourced under the MIT License, already released on HuggingFace, ModelScope, and GitHub, with an FP8 quantized version provided.
Local Deployment Support
| Inference Framework | Minimum Version |
|---|---|
| vLLM | 0.19.0+ |
| SGLang | 0.5.10+ |
| KTransformers | 0.5.3+ |
| Transformers | 0.5.3+ |
| xLLM | 0.8.0+ |
Day 0 Adaptation for Domestic Chips
Moore Threads' XiYun C-series GPUs completed Day 0 full adaptation for GLM-5.1, achieving out-of-the-box deployment with no performance loss. The previous generation, GLM-5, was already fully adapted to seven major domestic chip platforms: Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, Biren, and Iluvatar CoreX.
On a single domestic computing node, GLM-5.1's performance is already comparable to dual-GPU international clusters, with deployment costs reduced by 50% in long-sequence scenarios.
Coding Tool Integration
GLM-5.1 has been integrated into the GLM Coding Plan (Max/Pro/Lite), supporting:
- Claude Code (manual switch by default)
- Over 20 mainstream development tools like Cursor, Cline, OpenCode, Kilo Code, Roo Code, Droid
- Z Code GUI Interface: Supports remote SSH development and initiating tasks from mobile phones.
VII. Market Response & Strategic Significance
On the day of GLM-5.1's release, Zhipu AI (HKEX: 02513.HK) stock price rose 11.15% to HK$742.5, with a trading volume of HK$1.025 billion. The GLM Coding Plan sold out instantly and entered a waitlist. This validates the strong market demand for high-quality domestic AI models.
The deeper strategic significance lies in:
- The Gap Between Open-Source and Closed-Source is Rapidly Narrowing: The gap shrank from 12.5 points between GLM-5 and Opus 4.6 (35.4 vs 47.9) to only 2.6 points (45.3 vs 47.9), surpassing Claude Opus on SWE-Bench Pro for the first time.
- Long-Horizon Tasks Open a New Arena: When a model can work continuously for 8 hours like a senior engineer, traditional benchmarks are no longer sufficient to measure its value.
- Domestic AI Autonomy and Control: The full-stack adaptation of the 744B MoE model to domestic chips reflects a strategic layout for technological autonomy and control.
VIII. Limitations and Outlook
Remaining Challenges
- Reasoning Dimension Gap: There remains a gap with top closed-source models on deep reasoning benchmarks like GPQA-Diamond (86.2 vs 91.3).
- Error Accumulation in Long-Horizon Tasks: In chained tasks, a suboptimal modification at one step can silently break tests in later steps. GLM-5.1's performance on multi-step chained tasks still lags noticeably behind Claude Opus 4.6.
- Insufficient Independent Evaluation: Apart from official channels, mainstream third-party evaluation institutions have not yet released complete independent evaluation reports.
Future Direction
The Zhipu AI team's ultimate goal is the Fully Autonomous Agent: a model that works 24/7, decomposing goals, executing and delivering, self-evaluating and correcting, and self-evolving, thereby eliminating the need for human intervention.
As Zhipu wrote in their blog:
> Making a model run for 8 hours is not difficult; what's truly hard is making the work in the 8th hour still effective.
GLM-5.1 is a step towards that goal. Right now, try giving it an instruction, and then walk away for 8 hours.
