5Hacker News (AI-filtered, score >= 200)·33h ago

Semgrep: GLM 5.2 outperforms Claude on cybersecurity benchmarks

Semgrep published a blog post reporting that GLM 5.2 beats Claude on their internal cybersecurity benchmarks, framed as a 'we have Mythos at home' comparison. The post appears to evaluate models on cyber-specific tasks relevant to Semgrep's code security tooling. This is a practitioner-level benchmark comparison from a security-focused company, providing real-world signal on model performance in a specialized domain.

Evaluation and Benchmarking Open Weights Progress Semgrep Claude GLM-5.1 Anthropic

Related guides (4)

Claude

Claude: Anthropic's AI Assistant Family

Read asBeginner In-depth

Anthropic

Anthropic: The AI Safety Company at the Center of the Frontier

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner

Open Weights ProgressTopic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Read asBeginner

Related events (8)

4Hacker News·7d ago·source ↗

HN community discussion: GLM 5.2 vs. Claude Opus comparison

A Hacker News thread with 347 points and 244 comments compares GLM 5.2 against Claude Opus. The high engagement suggests active community interest in how a Chinese open-weights frontier model stacks up against Anthropic's flagship. No body content is available beyond the title and engagement metrics.

Frontier Model Releases Open Weights Progress Claude Opus 4.6 Zhipu AI GLM-5.1 +1 more

8The Batch·27d ago·source ↗

Claude Mythos Preview: Limited-Release Frontier Model with Exceptional Cybersecurity Capabilities

Anthropic has published a 244-page model card for Claude Mythos Preview, a frontier model not yet commercially available, which autonomously discovered thousands of high-severity vulnerabilities in popular operating systems and browsers during testing. To mitigate risks before potential deployment, Anthropic assembled Project Glasswing, a consortium of over 40 organizations including AWS, Apple, Google, Microsoft, and CrowdStrike, funded with $100M in model credits to patch vulnerabilities proactively. The model substantially outperforms Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across multiple benchmarks including CyberGym (83.1%), Terminal-Bench 2.0 (82%), GPQA Diamond (94.5%), HLE (64.7%), and GraphWalks long-context (80%). The Batch notes parallels to OpenAI's GPT-2 limited-release strategy and characterizes the announcement as having elements of a publicity stunt alongside genuine safety concerns.

Frontier Model Releases Evaluation and Benchmarking Gemini 3.1 Pro GraphWalks Linux Foundation +18 more

5Hugging Face Blog·1mo ago·source ↗

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

CyberSecEval 2 is a benchmark framework designed to evaluate both the cybersecurity risks and capabilities of large language models. The framework appears to be hosted or featured on Hugging Face's leaderboard infrastructure, extending prior cybersecurity evaluation work. It assesses LLMs across multiple dimensions of security-relevant behavior, including potential for misuse and defensive capabilities.

Evaluation and Benchmarking AI Safety Research CyberSecEval 2 LlamaGuard Hugging Face +1 more

7The Batch·19d ago·source ↗

The Batch: Claude Mythos 5 / Fable 5 debut, Apple AFM 3, Google Live Translate, OpenAI IPO filing, FrontierCode benchmark

Anthropic launched Claude Fable 5 (a safety-guardrailed model) and Claude Mythos 5 (same underlying model with safeguards removed, for vetted cyberdefense/infrastructure users via Project Glasswing with US government collaboration), both priced at $10/$50 per million tokens. Apple released five new Apple Foundation Models (AFM 3) spanning on-device and cloud tiers, built with Google and Nvidia infrastructure. Additional headlines cover Google's Gemini 3.5 Live Translate (70+ languages, real-time), OpenAI's confidential SEC IPO filing, a NotebookLM upgrade to Gemini 3.5, and Cognition's FrontierCode benchmark for code-quality evaluation where Claude Opus 4.8 leads at 34.3%.

Frontier Model Releases Evaluation and Benchmarking Claude Mythos Claude Opus 4.6 Google +19 more

8The Batch·28d ago·source ↗

Anthropic Releases Claude Mythos Preview with Extraordinary Cybersecurity Capabilities, Forms Project Glasswing Consortium

Anthropic has published a 244-page model card for Claude Mythos Preview, a large language model not yet commercially available, which broadly outperforms Claude Opus 4.6 and is described as 'strikingly capable' at identifying and exploiting code vulnerabilities. To mitigate risks before potential release, Anthropic assembled Project Glasswing, a consortium including AWS, Apple, Google, Microsoft, CrowdStrike, Nvidia, and 40+ other organizations, funded with $100 million in API credits and $4 million in open-source security donations. This marks the first time Anthropic has published a model card without making the model commercially available, signaling an unusual safety-first deployment posture. The issue also includes commentary from Andrew Ng on AI's impact on software engineering jobs, arguing against an 'AI jobpocalypse' narrative.

Frontier Model Releases AI Safety Research JPMorganChase Linux Foundation Claude Opus 4.6 +14 more

5Don'T Worry About The Vase·7d ago·source ↗

Zvi Mowshowitz commentary: GLM-5.2 as new best open model

Zvi Mowshowitz covers the release of GLM-5.2, characterizing it as the new best open model. The post is a tier-2 commentary piece on what appears to be a significant open-weights model release. The body is truncated, so specific benchmark claims or technical details are not available from this excerpt.

Frontier Model Releases Open Weights Progress Zvi Mowshowitz GLM-5.1

6The Batch·1mo ago·source ↗

Data Points: DeepSWE Benchmark, DeepSeek V4 Price Cuts, MAI-Image-2.5, Mythos Security Findings, MCP Stateless Update

This edition of The Batch covers five distinct AI developments: Datacurve's DeepSWE benchmark claims to fix critical grading flaws in SWE-bench Pro with hand-written verifiers and harder tasks; DeepSeek permanently cuts V4 Pro prices by 75%; Microsoft's MAI-Image-2.5 debuts third on the Arena leaderboard; Anthropic's Claude Mythos Preview found over 10,000 high/critical vulnerabilities in the first month of Project Glasswing, with remediation badly lagging discovery; and the Model Context Protocol proposes removing stateful sessions to enable stateless, load-balanced remote servers. Each item reflects meaningful movement in evaluation methodology, inference economics, multimodal generation, AI-assisted security, and agent tooling infrastructure.

Frontier Model Releases Evaluation and Benchmarking Datacurve DeepSeek V4 Microsoft +17 more

5Google Deepmind Blog·1mo ago·source ↗

Advancing Gemini's Security Safeguards

DeepMind announces that Gemini 2.5 has been made their most secure model family to date. The post describes advances in security safeguards for the Gemini 2.5 model family. No specific technical details are provided in the excerpt beyond the high-level claim of improved security posture.

Frontier Model Releases AI Safety Research Gemini 2.5 Google DeepMind