I. From "Vibe Coding" to "Long-Horizon Tasks": The Paradigm Shift in AI Engineering Capabilities
To understand the significance of GLM-5.1, one must first grasp the clear evolutionary path that AI programming capabilities are undergoing:
| Stage | Core Characteristics | Typical Duration |
|---|---|---|
| AI Coding | A programmer's efficiency tool, local productivity boost | Minutes |
| Vibe Coding | A more accessible mode of expression, turning ideas into prototypes quickly | Tens of minutes |
| Agentic Engineering | AI autonomously plans, executes, and iterates like an engineer | Hours |
| Long-Horizon Task | Works persistently and delivers outcomes like a senior expert | 8+ hours |
GLM-5.1 is precisely built for this fourth stage. The Zhipu AI team believes that "how long it can work" will replace "how smart it is" as the next-stage standard for measuring model intelligence.
What is a long-horizon task? It is an end-to-end, complex project-delivery task that spans multiple rounds of interaction, progresses across steps, remembers earlier constraints, and maintains stateful memory. In the past, even the most powerful open-source models often hit a bottleneck early when facing complex tasks: after initial gains, they would repeatedly retry known optimization methods but fail to actively switch strategies once a path proved ineffective.
GLM-5.1 breaks this limitation. It is currently the only open-source model capable of sustained work at the 8-hour level, and, alongside Claude Opus 4.6, one of the few models of any kind with this capability. Under the METR benchmark's evaluation standards, GLM-5.1 can work continuously and independently on a single task for over 8 hours, autonomously planning, executing, and self-evolving, ultimately delivering complete, engineering-grade results.
II. Hardcore Data: A Historic Breakthrough for Open-Source Models
Programming Capability: The Strongest in Open-Source History
GLM-5.1's report card can almost be described as "flipping the table":
| Benchmark | GLM-5.1 | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|
| Programming Evaluation Total Score | 45.3 | 47.9 | — |
| SWE-Bench Pro | 58.4 🏆 | 57.3 | — |
| SWE-bench Verified | 77.8% | — | — |
| Terminal-Bench 2.0 | 63.5 | 65.4 | — |
| NL2Repo | 42.7 | 49.8 | — |
Based on the comprehensive average score across the three most representative code evaluation benchmarks (SWE-Bench Pro, Terminal-Bench 2.0, NL2Repo), GLM-5.1 ranks third globally, first among domestic Chinese models, and first among open-source models.
Compared to the previous generation GLM-5's score of 35.4, GLM-5.1's programming evaluation score surged by nearly 10 points, an improvement of 28%—a generational leap.
Reasoning Capability: Comprehensive Alignment
| Benchmark | GLM-5.1 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| AIME 2026 | 95.3 | — | 98.7 |
| HMMT Feb. 2026 | 82.6 | 84.3 | 91.8 |
| GPQA-Diamond | 86.2 | 91.3 | — |
| HLE (with tools) | 52.3 | 53.1 | — |
Comprehensive Positioning
Zhipu AI's official positioning is very clear: GLM-5.1's comprehensive capabilities are fully aligned with Claude Opus 4.6, making it the first Chinese model to achieve comprehensive alignment in overall capabilities and placing it at the forefront of global open-source models.
III. Three Long-Horizon Task Demonstrations: The Model Works 8 Hours While You Sleep
The Zhipu AI team detailed three highly convincing real-world long-horizon task scenarios in their official blog. These are not simple code completions but complete engineering cycles requiring several hours of continuous work.
Scenario 1: Building a Linux Desktop from Scratch in 8 Hours
Sketch the architecture during the day, hand it to GLM-5.1 before bed, and wake up to a complete system. The run took exactly 8 hours and executed over 1,200 steps. The first meaningful output appeared at the 20-minute mark, and by the end it had produced a fully functional Linux desktop system: a desktop environment, window manager, status bar, applications, a VPN manager, Chinese font support, a game library, and more, with supporting files totaling 4.8 MB. That is roughly the workload of a 4-person team for one week, with no human involvement in testing or review at any point.
Scenario 2: 655 Iterations Breaking the Vector Database Optimization Bottleneck
Vector databases are the core engine behind AI search and recommendation systems. GLM-5.1 doesn't just tweak parameters—it autonomously completed the entire optimization chain from full-scan → IVF bucket recall → half-precision compression → quantization coarse ranking → two-level routing → early pruning. Over 655 iterations, it continuously ran benchmarks, identified bottlenecks, and adjusted strategies, ultimately increasing query throughput from 3108 QPS to 21472 QPS, a 6.9x improvement over the initial version.
The optimization trajectory showed a typical "stepwise" pattern: the model performed incremental tuning within a fixed strategy, and when gains plateaued, it proactively analyzed logs, located bottlenecks, and then jumped to a structurally different solution. Each jump was accompanied by a brief performance dip, followed by a new peak—this "break-fix" cycle itself is a hallmark of effective optimization.
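The IVF bucket-recall step in that chain is worth making concrete. The sketch below is an illustrative toy, not Zhipu's implementation: the simplified k-means, function names, and parameters are my own. It clusters vectors into buckets offline, then probes only the few nearest buckets at query time, which is exactly where the throughput gain over a full scan comes from.

```python
import numpy as np

def build_ivf(vectors, n_buckets, iters=10, seed=0):
    """Cluster vectors into buckets (simplified k-means) for IVF recall."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_buckets, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid
        d = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for b in range(n_buckets):
            members = vectors[assign == b]
            if len(members):
                centroids[b] = members.mean(0)
    # buckets hold the indices assigned to each centroid
    buckets = [np.where(assign == b)[0] for b in range(n_buckets)]
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, n_probe=2, k=5):
    """Probe only the n_probe nearest buckets instead of scanning everything."""
    order = ((centroids - query) ** 2).sum(-1).argsort()[:n_probe]
    cand = np.concatenate([buckets[b] for b in order])
    d = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[d.argsort()[:k]]
```

With `n_probe` equal to the bucket count this degrades to an exact full scan; the speed/recall trade-off comes from probing fewer buckets, which is what later stages (quantized coarse ranking, routing, pruning) then compound.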
Scenario 3: 1000 Rounds of Tool Calls Optimizing Real ML Workloads
On the KernelBench Level 3 optimization benchmark covering 50 real-world machine learning computational workloads, GLM-5.1 independently performed continuous optimization on each workload. During over 24 hours of uninterrupted iteration, it autonomously completed multiple rounds of compile—test—analyze—rewrite cycles, ultimately achieving a 3.6x geometric mean speedup, significantly higher than the 1.49x speedup of torch.compile max-autotune mode.
The model can autonomously write custom Triton Kernels and CUDA Kernels, employing techniques like cuBLASLt epilogue fusion, shared memory tiling, and CUDA Graph optimization—areas traditionally highly dependent on expert experience.
Core Insight: Stronger the Longer It Runs
Unlike previous models (including GLM-5) which exhausted their capabilities early on, GLM-5.1 performs better the longer it runs. A comparison on KernelBench shows that GLM-5 rose quickly initially but plateaued early, while GLM-5.1 continued to rise for longer, ultimately reaching 1.4x the performance of GLM-5. The key lies in how far the model can extend the window of "effective optimization."
IV. Technical Deep Dive: What Makes GLM-5.1 So Powerful?
GLM-5.1 is a post-training reinforcement upgrade of GLM-5, with the same architecture and parameter scale. The differences lie mainly in training strategies and optimization focus.
Core Specifications
| Parameter | Specification |
|---|---|
| Total Parameters | 744B (MoE architecture, 256 experts) |
| Active Parameters | 40B |
| Context Window | 200K tokens |
| Max Output | 128K tokens |
| Architectural Features | MLA + DeepSeek Sparse Attention |
| Open-Source License | MIT |
Key Innovation 1: DeepSeek Sparse Attention (DSA)
Traditional Transformer attention has O(L²) computational complexity, requiring about 16 billion computations for a 128K context. DSA replaces dense computation with a dynamic fine-grained filtering mechanism:
- Indexer First: A small neural network quickly scans all tokens to compute importance scores.
- Top-k Filtering: Only the top-2048 most relevant tokens are retained.
- Sparse Attention: Full attention computation is performed only on the filtered tokens.
This reduces the computation for a 128K sequence to about 800 million operations, a theoretical ~20x reduction that translates into a practical 1.5-2x cut in GPU cost. Crucially, the indexer still scans every token; only the unselected tokens are excluded from the core attention computation, so no long-range dependency becomes unreachable, which is what keeps the sparsity effectively lossless.
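A toy version of the three-step pipeline makes the mechanism concrete. In this sketch the "indexer" is a single projection vector standing in for the small scoring network, and the dimensions are tiny; it illustrates the idea, not the production kernel.

```python
import numpy as np

def sparse_attention(q, keys, values, w_index, top_k=2048):
    """Sketch of the indexer -> top-k -> sparse attention pipeline.
    w_index stands in for the small indexer network's parameters."""
    # 1) Indexer: a cheap importance score for every token (all are scanned)
    scores = keys @ w_index                          # shape (L,)
    # 2) Top-k filtering: keep only the most relevant positions
    k = min(top_k, len(keys))
    keep = np.argpartition(scores, -k)[-k:]
    # 3) Full softmax attention, restricted to the kept tokens
    logits = keys[keep] @ q / np.sqrt(q.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[keep]
```

Per query, the expensive attention step now touches `top_k` tokens instead of all `L`; only the much cheaper indexer pass remains linear in the full sequence length.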
Key Innovation 2: Slime Asynchronous Reinforcement Learning Framework
This is Zhipu's self-developed RL training framework (already open-sourced), named "Slime." It prevents degradation in long-horizon tasks through three mechanisms:
- Decoupling Generation and Training: The inference engine and training engine are deployed on different GPUs. The inference engine continuously generates trajectories, while the training engine asynchronously samples and updates policies, eliminating synchronization bottlenecks.
- Multi-Task Coordinator: A central server manages different task services, supporting 1000+ concurrent rollouts, enabling balanced data collection across tasks.
- Token-in-Token-out (TITO): Directly uses the precise token stream generated by the inference engine to construct learning trajectories, avoiding mismatch issues from re-tokenization.
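The first mechanism, decoupled generation and training, can be miniaturized with two threads and a bounded queue. Everything here (the trajectory format, the "policy update") is a stand-in rather than Slime's actual code; the point is only the shape of the design, in which the producer blocks solely when the buffer is full, never on a training step.

```python
import queue
import threading

def run_decoupled(n_trajectories, buffer_size=8):
    """Toy sketch of decoupled rollout generation and policy training:
    a producer thread streams trajectories while a consumer thread
    asynchronously drains them and applies 'updates'."""
    traj_queue = queue.Queue(maxsize=buffer_size)
    updates = []

    def actor():
        # stands in for the inference engine streaming rollouts
        for i in range(n_trajectories):
            traj_queue.put({"id": i, "tokens": [i, i + 1]})
        traj_queue.put(None)  # sentinel: generation finished

    def learner():
        # stands in for the training engine consuming trajectories
        while True:
            traj = traj_queue.get()
            if traj is None:
                break
            updates.append(traj["id"])  # one 'policy update' per trajectory

    t1, t2 = threading.Thread(target=actor), threading.Thread(target=learner)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return updates
```

In the real framework the two sides run on separate GPU pools and the "queue" is a trajectory buffer fed by 1000+ concurrent rollouts, but the decoupling principle is the same.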
Key Innovation 3: Progressive Alignment Strategy
Post-training follows a four-stage progressive path:
- Multi-task SFT: Introduces complex interleaved thinking patterns, expands Agent and coding data scale.
- Reasoning & Agent-Specific RL: Mixes four domains: mathematics, science, code, and tool-integrated reasoning.
- General RL: Multi-dimensional optimization objectives (correctness, emotional intelligence, specific task capabilities) + hybrid reward system.
- Cross-Stage Online Distillation: Mitigates capability degradation, ensuring the model retains abilities from all stages.
Key Innovation 4: Three Thinking Modes
GLM-5.1 supports three different thinking modes, providing optimal strategies for different scenarios:
- Interleaved Thinking: Thinks before each response and tool call, improving instruction following.
- Retained Thinking: Automatically retains multi-turn thinking blocks in Coding Agent scenarios, reusing existing reasoning.
- Turn-Level Thinking: Enables/disables reasoning as needed—reduces latency for lightweight requests, improves accuracy for complex tasks.
V. The Pricing Game-Changer: 94.6% of the Capability at 20% of the Price
Technical prowess is one thing, but for most developers and enterprises, price is the real deciding factor. GLM-5.1's pricing is a game-changer:
| Model | Input Price (/M tokens) | Output Price (/M tokens) |
|---|---|---|
| GLM-5.1 | $1.00 | $3.20 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.4 | $2.50 | $15.00 |
GLM-5.1's input price is one-fifth of Claude Opus 4.6's and two-fifths of GPT-5.4's; the output gap is even starker, at roughly 1/7.8 of Claude's price and 1/4.7 of GPT-5.4's.
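Those ratios are easy to sanity-check. The helper below prices an arbitrary job against the table above; the prices come from the table, while the example job size is made up for illustration.

```python
def job_cost(input_mtok, output_mtok, price_in, price_out):
    """Dollar cost of a job measured in millions of tokens."""
    return input_mtok * price_in + output_mtok * price_out

# (input, output) prices per million tokens, from the table above
PRICES = {
    "GLM-5.1": (1.00, 3.20),
    "Claude Opus 4.6": (5.00, 25.00),
    "GPT-5.4": (2.50, 15.00),
}

def compare(input_mtok, output_mtok):
    """Cost of the same job under each pricing scheme."""
    return {m: job_cost(input_mtok, output_mtok, pi, po)
            for m, (pi, po) in PRICES.items()}
```

For example, a hypothetical job with 10M input and 2M output tokens comes to $16.40 on GLM-5.1 versus $100.00 on Claude Opus 4.6 and $55.00 on GPT-5.4.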
For GLM Coding Plan subscribers, the lowest Lite plan costs only $30/quarter (discounted to $27), providing 3x the usage quota of Claude Pro, and all plans support GLM-5.1.
VI. Open-Source Ecosystem & Domestic Chip Adaptation
GLM-5.1 is fully open-sourced under the MIT License, already released on HuggingFace, ModelScope, and GitHub, with an FP8 quantized version provided.
Local Deployment Support
| Inference Framework | Minimum Version |
|---|---|
| vLLM | 0.19.0+ |
| SGLang | 0.5.10+ |
| KTransformers | 0.5.3+ |
| Transformers | 0.5.3+ |
| xLLM | 0.8.0+ |
Day 0 Adaptation for Domestic Chips
Moore Threads' XiYun C-series GPUs completed Day 0 full adaptation for GLM-5.1, achieving out-of-the-box deployment with no performance loss. The previous generation, GLM-5, was already fully adapted to seven major domestic chip platforms: Huawei Ascend, Moore Threads, Hygon, Cambricon, Kunlunxin, Biren, and Iluvatar CoreX.
On a single domestic computing node, GLM-5.1's performance is already comparable to dual-GPU international clusters, with deployment costs reduced by 50% in long-sequence scenarios.
Coding Tool Integration
GLM-5.1 has been integrated into the GLM Coding Plan (Max/Pro/Lite), supporting:
- Claude Code (manual switch by default)
- Over 20 mainstream development tools like Cursor, Cline, OpenCode, Kilo Code, Roo Code, Droid
- Z Code GUI Interface: Supports remote SSH development and initiating tasks from mobile phones.
VII. Market Response & Strategic Significance
On the day of GLM-5.1's release, Zhipu AI (HKEX: 02513.HK) stock price rose 11.15% to HK$742.5, with a trading volume of HK$1.025 billion. The GLM Coding Plan sold out instantly and entered a waitlist. This validates the strong market demand for high-quality domestic AI models.
The deeper strategic significance lies in:
- The Gap Between Open-Source and Closed-Source is Rapidly Narrowing: The gap shrank from 12.5 points between GLM-5 and Opus 4.6 (35.4 vs 47.9) to only 2.6 points (45.3 vs 47.9), surpassing Claude Opus on SWE-Bench Pro for the first time.
- Long-Horizon Tasks Open a New Arena: When a model can work continuously for 8 hours like a senior engineer, traditional benchmarks are no longer sufficient to measure its value.
- Domestic AI Autonomy and Control: The full-stack adaptation of the 744B MoE model to domestic chips reflects a strategic layout for technological autonomy and control.
VIII. Limitations and Outlook
Remaining Challenges
- Reasoning Dimension Gap: There remains a gap with top closed-source models on deep reasoning benchmarks like GPQA-Diamond (86.2 vs 91.3).
- Error Accumulation in Long-Horizon Tasks: In chained tasks, a suboptimal modification at one step can silently break tests in later steps. GLM-5.1's performance on multi-step chained tasks still lags noticeably behind Claude Opus 4.6.
- Insufficient Independent Evaluation: Apart from official channels, mainstream third-party evaluation institutions have not yet released complete independent evaluation reports.
Future Direction
The Zhipu AI team's ultimate goal is the Fully Autonomous Agent: a model that works 24/7, decomposing goals, executing and delivering, self-evaluating and correcting, and self-evolving, thereby eliminating the need for human intervention.
As Zhipu wrote in their blog:
> Making a model run for 8 hours is not difficult; what's truly hard is making the work in the 8th hour still effective.
GLM-5.1 is a step towards that goal. Right now, try giving it an instruction, and then walk away for 8 hours.
