
GLM-5.1
glm-5-1-ee840643 · 10 events · first seen 1mo ago
Aliases: GLM-5.1, GLM 5, GLM-5, GLM 5.1, GLM-5.2
Recent events (10)
Z.ai's GLM-5.1 Open-Weights Model Targets Multi-Hour Agentic Coding Tasks with Iterative Self-Evaluation
Z.ai released GLM-5.1, a 754B-parameter mixture-of-experts open-weights model optimized for long-running agentic coding tasks. It can cycle through planning, execution, and strategy revision hundreds of times over sessions lasting up to eight hours. The model achieves the top open-weights score on the Artificial Analysis Intelligence Index and third place on Arena's Code leaderboard, and it leads SWE-Bench Pro in Z.ai's own evaluations at 58.4 percent. Weights are available on HuggingFace under the MIT license, with API pricing roughly 40 percent higher than its predecessor but still below comparable proprietary models. No technical report has been published, leaving architecture and training details undisclosed.
GLM-5.2 announced as model built for long-horizon tasks
ZAI.org published a blog post on Hugging Face announcing GLM-5.2, a model positioned for long-horizon tasks. The post appears to be a model release announcement from the GLM (General Language Model) lineage. Limited body content is available, but the framing suggests capabilities relevant to extended reasoning or agentic workflows.
GLM-5.1 Open-Weights Model Targets Long-Running Agentic Tasks; Andrew Ng on Coding Agent Acceleration by Software Domain
Z.ai released GLM-5.1, an open-weights mixture-of-experts LLM (754B total / 40B active parameters) designed for sustained agentic coding tasks lasting up to eight hours, featuring iterative planning-execution-evaluation loops with thousands of tool calls. The model claims top open-weights performance on Artificial Analysis Intelligence Index and SWE-Bench Pro, available under MIT license via HuggingFace. The accompanying editorial by Andrew Ng offers a tiered framework for how much coding agents accelerate different software work categories—frontend most, then backend, infrastructure, and research least—with practical implications for team organization. A secondary item references data-center opposition and LLM helpfulness failure modes.
GLM-5.2 claims top frontend coding performance; IndexShare speculative decoding introduced
A Latent Space AI news digest highlights GLM-5.2 as a new open-weights model claiming top performance on frontend coding tasks. The digest also covers IndexShare, a technique for speculative decoding. The body is truncated but the headline signals a notable open-weights model release and an inference optimization development.
Latest open artifacts (#19): Qwen 3.5, GLM 5, MiniMax 2.5 — Chinese labs' latest push of the frontier
An Interconnects newsletter roundup covering recent open-weight model releases from Chinese AI labs, specifically Qwen 3.5, GLM 5, and MiniMax 2.5. The piece frames these as a continued frontier push from Chinese research organizations. The body content is minimal beyond the title and greeting, suggesting this is either a stub or the full content was not captured.
Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others
Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
K-BrowseComp is a new 400-problem benchmark for evaluating web-browsing agents in Korean-language contexts, with a 300-problem manually validated subset and a 100-problem adversarially constructed synthetic split. Frontier models including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 achieve only 30–46% on the verified subset, a significant drop from English BrowseComp performance, while Korean proprietary models score 0–10%. The benchmark exploits the asymmetry between problem creation and solving difficulty, and the adversarial synthetic split caps the strongest model at 26%, positioning it as a targeted stress test for agentic web-browsing capability.
Claw-SWE-Bench: A benchmark for evaluating agent harnesses on multilingual coding tasks
Researchers introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol designed to fairly compare heterogeneous agent harnesses ("claws") on GitHub issue-resolution tasks. The benchmark contains 350 instances across 8 languages and 43 repositories, with an 80-instance Lite subset for cost-efficient validation. Key findings show adapter design dominates raw model choice: a minimal adapter scores 19.1% Pass@1 versus 73.4% for a full adapter using the same GLM 5.1 backbone, and harness choice and model choice each shift Pass@1 by roughly 27-29 percentage points. The work also introduces cost accounting as a first-class evaluation axis alongside accuracy.
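The size of the adapter effect in the Claw-SWE-Bench summary above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable and function names are hypothetical, and the only inputs are the figures quoted in the summary (19.1% and 73.4% Pass@1 on the same GLM 5.1 backbone, and the roughly 27-29 point shift attributed to harness or model choice).

```python
# Pass@1 figures quoted in the summary (percent), same GLM 5.1 backbone.
MINIMAL_ADAPTER_PASS_AT_1 = 19.1
FULL_ADAPTER_PASS_AT_1 = 73.4

# Upper end of the quoted "roughly 27-29 percentage points" shift
# attributed to harness choice or model choice.
QUOTED_SHIFT_UPPER = 29.0


def point_gap(low: float, high: float) -> float:
    """Return the gap between two percentages, in percentage points."""
    return round(high - low, 1)


adapter_gap = point_gap(MINIMAL_ADAPTER_PASS_AT_1, FULL_ADAPTER_PASS_AT_1)
print(f"Adapter gap: {adapter_gap} percentage points")
print(f"Exceeds quoted harness/model shift: {adapter_gap > QUOTED_SHIFT_UPPER}")
```

On these numbers the adapter gap is 54.3 percentage points, nearly double the upper end of the quoted harness and model shifts, which is consistent with the claim that adapter design dominates.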
Kimi K2.6: Moonshot AI's 1T-Parameter Vision-Language Model Matches Open-Weights Peers, Trails Top Closed Models
Moonshot AI released Kimi K2.6, a 1 trillion-parameter mixture-of-experts vision-language model with 32B active parameters, designed for long-horizon autonomous coding sessions lasting multiple days and for multi-agent orchestration that scales to 300 parallel subagents executing up to 4,000 steps. The model scores 54 on the Artificial Analysis Intelligence Index, roughly level with Qwen3.6 Max Preview and DeepSeek-V4-Pro at 52, while trailing closed models like GPT-5.5 and Claude Opus 4.7. Weights are freely downloadable from Hugging Face under a modified MIT license permitting commercial use, with API access priced at $0.95/$0.16/$4.00 per million input/cached/output tokens. Notable features include a 256K token context window, native INT4 quantization, a 'preserve thinking' mode for multi-turn reasoning continuity, and a research preview 'claw groups' feature enabling cross-developer agent collaboration.
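To make the quoted Kimi K2.6 pricing concrete, the sketch below estimates the cost of a single request from its token counts. It is a rough illustration, not Moonshot AI's billing logic: the function name is hypothetical, and it assumes cached tokens are billed at the cached rate separately from fresh input tokens. The only figures taken from the summary are the three per-million-token prices.

```python
# USD per million tokens, as quoted in the summary above.
PRICE_PER_MILLION = {"input": 0.95, "cached": 0.16, "output": 4.00}


def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD.

    Assumption: cached tokens are counted separately from fresh input
    tokens and billed at the cached rate.
    """
    usage = {
        "input": input_tokens,
        "cached": cached_tokens,
        "output": output_tokens,
    }
    return sum(PRICE_PER_MILLION[kind] * count / 1_000_000
               for kind, count in usage.items())


# Example: 200K fresh input tokens, 800K cached tokens, 50K output tokens.
cost = estimate_cost(200_000, 800_000, 50_000)
print(f"Estimated cost: ${cost:.3f}")
```

Under this assumption, a long agentic session that rereads a large cached context is dominated by output and fresh-input cost rather than by the cached portion.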
OpenAI GPT-5.4 Pro and GPT-5.4 Thinking challenge Gemini 3.1 Pro Preview for top AI model position
OpenAI released GPT-5.4 in two variants (Pro and Thinking), featuring expanded context windows up to 1.05M tokens, native computer use, tool search capabilities, and adjustable reasoning levels. In independent benchmarks by Artificial Analysis, GPT-5.4 Pro at xhigh reasoning nearly ties Gemini 3.1 Pro Preview on the Intelligence Index (57 vs 57.2 points) but at roughly 3.3x the cost, while leading on coding and agentic sub-indices. The release leapfrogs Claude Opus 4.6 on most benchmarks but faces stiff competition from Google's Gemini 3.1 Pro Preview, which maintains a price and multimodal advantage.