Topic

Evaluation and Benchmarking

active·evaluation-and-benchmarking·1,026 events·last 33h ago

New benchmarks, benchmark saturation discussions, eval methodology critiques, reproducibility work, and the meta-debate about what counts as a meaningful measurement.

Related entities

Hugging Face OpenAI Anthropic Alibaba Interconnects Claude Opus 4.6 GPT-5.5 ModelScope DeepSeek V4 Jack Clark Import AI Claude Alibaba Qwen Qwen Qwen2.5-Math-PRM AI Snake Oil Gemma 4 GRPO Databricks Qwen3-4B

Guides (1)

Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Recent events (50)

5AI Snake Oil·1mo ago·source ↗

Open-world evaluations for measuring frontier AI capabilities: Introducing CRUX

This commentary introduces CRUX, a new evaluation project designed to assess frontier AI systems on long-horizon, open-ended, and messy real-world tasks. The piece argues that existing benchmarks are insufficient for capturing the full range of capabilities exhibited by frontier models in complex settings. CRUX aims to fill this gap by providing evaluations that better reflect deployment-relevant performance.

Frontier Model Releases Evaluation and Benchmarking Normal Tech CRUX AI Snake Oil +1 more

6Interconnects·1mo ago·source ↗

Latest open artifacts (#21): Open model bonanza — Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, GLM-5.1 & others

Interconnects' recurring open-weights roundup covers a dense cluster of recent releases including Gemma 4, DeepSeek V4, Kimi K2.6, MiMo 2.5, and GLM-5.1, characterizing the period as a flagship-after-flagship cadence. The piece also includes commentary on CAISI's assessment of DeepSeek V4. As a tier-2 commentary source, this is a synthesis and analysis layer rather than primary announcements.

Frontier Model Releases Evaluation and Benchmarking MiMo 2.5 Interconnects DeepSeek V4 +7 more

6arXiv · cs.CL·1mo ago·source ↗

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

Evaluation and Benchmarking Agent and Tool Ecosystem functional token GRPO Latent-Anchored GRPO +4 more

5arXiv · cs.AI·1mo ago·source ↗

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.

Evaluation and Benchmarking Multimodal Progress Cohen's d EntityMem Catherine R. He +1 more
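For readers unfamiliar with the effect size quoted in the EntityBench summary: Cohen's d is the difference between two group means divided by a pooled standard deviation, so a value of +2.33 would place the two score distributions more than two pooled standard deviations apart. The sketch below uses one common definition of the statistic. The function name and the sample scores are made up for illustration and are not EntityBench data, and the summary does not say which variant of the statistic the authors computed.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (one common definition)."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance uses the sample (n - 1) denominator.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, NOT taken from the paper.
method_scores = [0.82, 0.85, 0.80, 0.88, 0.84]
baseline_scores = [0.70, 0.74, 0.69, 0.75, 0.72]
print(round(cohens_d(method_scores, baseline_scores), 2))
```

A positive result means the first group scores higher on average; swapping the arguments flips the sign.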

7OpenAI Blog·1mo ago·source ↗

Databricks brings GPT-5.5 to enterprise agent workflows

Databricks is integrating GPT-5.5 into its enterprise agent workflows following the model's state-of-the-art performance on the OfficeQA Pro benchmark. The partnership represents a deployment of OpenAI's latest model within a major data and AI platform. This signals continued enterprise adoption of frontier models for agentic use cases.

Frontier Model Releases Evaluation and Benchmarking Databricks OpenAI OfficeQA Pro +3 more

5AI Snake Oil·1mo ago·source ↗

New Paper: Towards a Science of AI Agent Reliability

A new paper proposes a framework for quantifying the gap between AI agent capability and reliability, aiming to establish a more rigorous science of agent dependability. The work addresses the observation that agents may demonstrate high capability on benchmarks while failing unpredictably in deployment. The piece is published via the normaltech.ai newsletter, associated with the AI Snake Oil research commentary tradition.

Evaluation and Benchmarking AI Safety Research Towards a Science of AI Agent Reliability normaltech.ai AI Snake Oil +2 more

7arXiv · cs.LG·1mo ago·source ↗

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking AI Safety Research Grok X (Twitter) EU AI Act +5 more

5arXiv · cs.LG·1mo ago·source ↗

Dynamics-Level Watermarking of Flow Matching Models with Random Codes

This paper proposes embedding watermarks directly into the velocity field (continuous dynamics) of flow matching generative models, rather than into weights or outputs. The method uses key-dependent perturbations added during training, formulated as random coding over a continuous channel, allowing black-box message recovery at detection time. The perturbation is designed to leave the generated distribution unchanged. Experiments on MNIST and CIFAR-10 demonstrate reliable message recovery, preserved generation quality, and chance-level decoding without the secret key.

Evaluation and Benchmarking AI Safety Research MNIST CIFAR-10 Random Coding +2 more

6Berkeley AI Research (BAIR) Blog·1mo ago·source ↗

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution Evaluation and Benchmarking GPT-4o mini Faith-Shap LIME +5 more

5Interconnects·1mo ago·source ↗

Reading today's open-closed performance gap

This commentary from Interconnects analyzes the factors that determine benchmark evaluation scores and the performance gap between open-weight and closed frontier models. It examines how various complex variables contribute to the single evaluation numbers that dominate public discourse, and considers how this gap may evolve over time. The piece is framed as an analytical take on the current state of open vs. closed model competition.

Frontier Model Releases Evaluation and Benchmarking Interconnects +1 more

5arXiv · cs.LG·1mo ago·source ↗

Layer Equivalence Is Not a Property of Layers Alone: How You Test Redundancy Changes What You Find

This paper distinguishes two protocols for measuring transformer layer redundancy—replacement (can one layer substitute for another in place?) and interchange (do two layers approximately commute when swapped?)—and shows they can disagree substantially. Experiments on Pythia (410M, 1.4B) and 8B-scale models (Qwen3-8B, Llama-3.1-8B) reveal that the protocol gap grows during training and can change which layers appear safe to prune by several-fold. Notably, Qwen3-8B shows interchange-guided removal is far safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols despite lower interchange KL. The authors recommend scoring both swap-KL metrics before any layer removal or merging, requiring only unlabeled forward passes.

Evaluation and Benchmarking Inference Economics WikiText-2 layer pruning Pythia +3 more
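The distinction between the two protocols in the layer-redundancy summary can be made concrete with a toy model. The sketch below is not the paper's code: the tiny residual "layers", the softmax readout, and the reading of "replacement" as copying layer j into slot i while leaving layer j where it is are all assumptions made for illustration. It only shows that the two protocols produce different modified networks, and therefore different KL scores, from the same pair of layers.

```python
import math
import random


def make_layer(rng, dim, scale=0.3):
    """A toy residual layer: x + tanh(W x), with W a random dim x dim matrix."""
    w = [[rng.gauss(0, scale) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [
            xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, w)
        ]

    return layer


def run(layers, x):
    for layer in layers:
        x = layer(x)
    return x


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(original, modified, inputs):
    """Average KL(original output || modified output) over unlabeled inputs."""
    total = sum(kl(softmax(run(original, x)), softmax(run(modified, x))) for x in inputs)
    return total / len(inputs)


def replacement_kl(layers, i, j, inputs):
    """Replacement: layer j's function is used in slot i; slot j is unchanged."""
    modified = list(layers)
    modified[i] = layers[j]
    return mean_kl(layers, modified, inputs)


def interchange_kl(layers, i, j, inputs):
    """Interchange: layers i and j trade places."""
    modified = list(layers)
    modified[i], modified[j] = layers[j], layers[i]
    return mean_kl(layers, modified, inputs)


rng = random.Random(0)
layers = [make_layer(rng, dim=8) for _ in range(4)]
inputs = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
print("replacement KL:", replacement_kl(layers, 1, 2, inputs))
print("interchange KL:", interchange_kl(layers, 1, 2, inputs))
```

Both scores need only forward passes on unlabeled inputs, which matches the cost profile the summary describes.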

5Berkeley AI Research (BAIR) Blog·1mo ago·source ↗

Information-Driven Design of Imaging Systems

Researchers from Berkeley present a framework for evaluating and optimizing imaging systems based on mutual information content rather than traditional metrics like resolution or SNR, published at NeurIPS 2025. The method estimates mutual information directly from noisy measurements using known noise physics and learned probabilistic models (including transformers and PixelCNN), avoiding the need for task-specific decoders. Validated across four domains—color photography, radio astronomy, lensless imaging, and microscopy—the information metric predicts downstream decoder performance and enables hardware optimization with less compute and memory than end-to-end neural approaches.

Evaluation and Benchmarking Inference Economics UC Berkeley information-driven imaging framework mutual information +3 more

4Hugging Face Blog·1mo ago·source ↗

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Hugging Face describes measures taken to prevent benchmark gaming ('benchmaxxing') on the Open ASR Leaderboard by introducing private or held-out evaluation data. The post addresses the integrity of automatic speech recognition benchmarks, where models may be overfitted or tuned specifically to public test sets. This is part of a broader effort to maintain meaningful leaderboard rankings as ASR model submissions increase.

Evaluation and Benchmarking Open ASR Leaderboard benchmaxxing Hugging Face

6arXiv · cs.LG·1mo ago·source ↗

FORGE: Self-Evolving Agent Memory via Population Broadcast Without Weight Updates

FORGE (Failure-Optimized Reflective Graduation and Evolution) is a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents without any gradient updates. It wraps a Reflexion-style inner loop where a reflection agent converts failed trajectories into textual heuristics or few-shot demonstrations, then propagates the best-performing instance's memory across a population between stages. Evaluated on CybORG CAGE-2 (a stochastic network-defense POMDP), FORGE improves average return by 1.7–7.7× over zero-shot and 29–72% over Reflexion across all 12 model-representation conditions tested with four LLM families. Notably, weaker models benefit disproportionately, suggesting the method may help close capability gaps rather than amplify already-strong models.

Evaluation and Benchmarking Agent and Tool Ecosystem Reflexion Grok-4-Fast ReAct +6 more

6Berkeley AI Research (BAIR) Blog·1mo ago·source ↗

RL without TD Learning: Divide-and-Conquer Value Learning for Long-Horizon Off-Policy RL

A BAIR blog post introduces a divide-and-conquer paradigm for off-policy reinforcement learning that avoids temporal difference (TD) learning's error accumulation problem by making the number of Bellman recursions grow logarithmically, rather than linearly, with the task horizon. The approach leverages the triangle inequality structure of goal-conditioned RL to define a transitive Bellman update rule, enabling value learning that scales to long-horizon tasks. The authors claim this is the first practical realization of divide-and-conquer value learning at scale in goal-conditioned RL settings, building on an idea traceable to Kaelbling (1993). The post frames this as a third paradigm alongside TD and Monte Carlo methods, addressing a key gap in scalable off-policy RL.

Evaluation and Benchmarking Agent and Tool Ecosystem Leslie Pack Kaelbling Divide-and-Conquer Value Learning Berkeley AI Research (BAIR) +8 more
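The logarithmic-versus-linear contrast in the divide-and-conquer summary can be seen in a small tabular setting. The sketch below is an illustration under stated assumptions, not the authors' algorithm, which uses learned value functions: it works with shortest-path distances d(s, g) on a deterministic chain, where a one-step backup extends known distances by one transition per sweep, while a midpoint backup d(s, g) <- min over w of d(s, w) + d(w, g) doubles the reachable horizon per sweep. With a value of the form V = γ^d, the same midpoint rule reads V(s, g) <- max over w of V(s, w) · V(w, g). All function names here are hypothetical.

```python
INF = float("inf")


def one_step_init(n_states, edges):
    """Distances known from single transitions only: 0 to self, 1 along an edge."""
    d = [[INF] * n_states for _ in range(n_states)]
    for s in range(n_states):
        d[s][s] = 0
    for s, t in edges:
        d[s][t] = 1
    return d


def td_style_sweep(d, edges):
    """One-step backup: d(s, g) <- min over successors s' of 1 + d(s', g)."""
    n = len(d)
    new = [row[:] for row in d]
    for s, t in edges:
        for g in range(n):
            new[s][g] = min(new[s][g], 1 + d[t][g])
    return new


def transitive_sweep(d):
    """Midpoint backup: d(s, g) <- min over midpoints w of d(s, w) + d(w, g)."""
    n = len(d)
    return [
        [min(d[s][w] + d[w][g] for w in range(n)) for g in range(n)]
        for s in range(n)
    ]


def sweeps_until_fixed(d, sweep):
    """Apply `sweep` until nothing changes; return (sweeps that changed d, result)."""
    count = 0
    while True:
        new = sweep(d)
        if new == d:
            return count, d
        d, count = new, count + 1


# A chain 0 -> 1 -> ... -> 16, so the longest goal-reaching path has 16 steps.
n_states = 17
edges = [(s, s + 1) for s in range(n_states - 1)]
start = one_step_init(n_states, edges)

td_sweeps, td_result = sweeps_until_fixed(start, lambda d: td_style_sweep(d, edges))
dc_sweeps, dc_result = sweeps_until_fixed(start, transitive_sweep)
print(td_sweeps, dc_sweeps)
```

On this 17-state chain the one-step backup needs 15 sweeps to converge and the midpoint backup needs 4, and both arrive at the same distance table. Each midpoint sweep costs more here because it scans every candidate midpoint.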

6arXiv · cs.LG·1mo ago·source ↗

Universal Magnetic Structure Prediction from Atomic Coordinates with Near-Experimental Accuracy

The paper introduces MSN (Magnetic Structure Network), an E(3) equivariant graph neural network that predicts collinear and non-collinear magnetic structures directly from atomic crystal coordinates. Trained on experimentally determined structures from the MAGNDATA database, it uses a novel Primitive Modulated Structure Representation (PMSR) to handle both commensurate and incommensurate magnetic orders in a unified framework without symmetry assumptions. The model achieves near-experimental accuracy across diverse magnetic structure types, offering a scalable alternative to costly experiments and computationally demanding first-principles methods for magnetic materials discovery.

Evaluation and Benchmarking Magnetic Structure Network (MSN) Primitive Modulated Structure Representation (PMSR) E(3) equivariant graph neural network +1 more

5Interconnects·1mo ago·source ↗

Gemma 4 and what makes an open model succeed

A commentary piece from Interconnects analyzing Google's Gemma 4 release and the broader question of what drives success for open-weight models. The piece argues that benchmark scores are not the primary determinant of open model adoption or impact. This is a tier-2 analytical take on the open-weights ecosystem and the strategic dynamics around model releases.

Frontier Model Releases Evaluation and Benchmarking Interconnects Google Gemma 4 +1 more

5arXiv · cs.LG·1mo ago·source ↗

Artificial Aphasias in Lesioned Language Models

Researchers apply an aphasia-inspired 'lesioning' technique to five 1B-scale language models by zeroing out model parameters and measuring resulting language impairments against a Text Aphasia Battery (TAB). Across 112,426 outputs, the full range of aphasia symptoms emerges but in distributions distinct from human aphasia profiles. The study finds systematic differences between attention components (query, key, value, output) and feed-forward components, as well as depth-dependent effects where early-layer lesions cause syntactic/semantic symptoms and late-middle layers yield phonological and fluency deficits. The qualitative divergence between LM and human aphasia patterns suggests aphasia syndromes are shaped by learning and processing details rather than being universal consequences of disrupted language processing.

Evaluation and Benchmarking aphasia 1B-scale language models lesioning technique +1 more

6Don't Worry About the Vase·1mo ago·source ↗

GPT-5.5: Capabilities and Reactions

Zvi Mowshowitz's commentary on the GPT-5.5 system card and its capabilities, noting the release largely confirmed prior expectations. The piece analyzes the model's capabilities and community reactions to the release. As a tier-2 commentary source, this provides analytical framing around a significant model release rather than primary technical information.

Frontier Model Releases Evaluation and Benchmarking OpenAI Zvi Mowshowitz GPT-5.5 System Card +1 more

3Import AI·1mo ago·source ↗

Import AI 447: The AGI Economy, AI-Generated Game Testing, and Agent Ecologies

Import AI issue 447 covers speculative analysis of AGI economic structures, including the concept of a 'superintelligence arcology,' alongside coverage of using procedurally generated games to evaluate AI capabilities and discussion of emergent agent ecologies. The newsletter synthesizes recent developments across frontier AI, evaluation methodology, and multi-agent systems. As a tier-2 commentary source, it provides synthesis and framing rather than primary research.

Frontier Model Releases Evaluation and Benchmarking AGI economy Jack Clark agent ecologies +2 more

6Don't Worry About the Vase·1mo ago·source ↗

GPT-5.5: The System Card — Commentary

Zvi Mowshowitz's commentary on OpenAI's announcement of GPT-5.5 and GPT-5.5-Pro, analyzing the associated system card. The piece is a tier-2 analytical response to a major model release. Full content appears truncated, but the item covers the safety and capability disclosures accompanying the new model family.

Frontier Model Releases Evaluation and Benchmarking GPT Pro OpenAI Zvi Mowshowitz +2 more

4Import AI·1mo ago·source ↗

Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy

Import AI issue 446 covers three main topics: the application of large language models to nuclear domains, a major new AI benchmark from China, and the intersection of AI measurement with policy. The newsletter synthesizes recent developments across frontier AI research and geopolitical AI competition. It also touches on speculative questions about AI psychology, such as whether AIs might experience jealousy. As a tier-2 commentary digest, it aggregates signals across multiple active research and policy threads.

Frontier Model Releases Evaluation and Benchmarking Jack Clark Import AI China +2 more

4Import AI·1mo ago·source ↗

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark

Import AI issue 445 covers three main topics: speculation on whether 2026 will be a pivotal year for superintelligence decision-making, AI systems solving frontier mathematics proofs, and the introduction of a new ML research benchmark. The newsletter synthesizes recent developments across capability milestones and evaluation tooling. As a tier-2 commentary source, it provides curated signal on frontier AI progress rather than primary research.

Frontier Model Releases Evaluation and Benchmarking superintelligence Jack Clark Import AI +1 more

5Interconnects·1mo ago·source ↗

GPT 5.4 is a big step for Codex

A tier-2 commentary piece from Interconnects evaluates GPT 5.4 in the context of OpenAI's Codex agent ecosystem, examining what the model release means for the frontier of AI agents. The author reflects on the current state of agent evaluation and notes a continued preference for Claude in practice. The piece offers analysis of how GPT 5.4 advances coding-agent capabilities relative to competing offerings.

Frontier Model Releases Evaluation and Benchmarking Interconnects Claude OpenAI +4 more

5Hugging Face Blog·1mo ago·source ↗

QIMMA: A Quality-First Arabic LLM Leaderboard

TII UAE (Technology Innovation Institute) has launched QIMMA, a leaderboard specifically designed to evaluate large language models on Arabic language tasks with a focus on quality-first assessment. The leaderboard aims to address gaps in Arabic NLP evaluation by providing standardized benchmarks tailored to Arabic linguistic characteristics. This represents a dedicated infrastructure effort for tracking LLM progress in Arabic, a language historically underserved by evaluation frameworks.

Evaluation and Benchmarking Open Weights Progress Hugging Face QIMMA Technology Innovation Institute

7Qwen Research·1mo ago·source ↗

Qwen3 Embedding: State-of-the-Art Text Embedding and Reranking Models Released

Alibaba's Qwen team has released the Qwen3 Embedding series, a set of open-weights text embedding and reranking models built on the Qwen3 foundation model. The models are designed for retrieval and reranking tasks and claim state-of-the-art performance across multiple benchmarks. They are released under the Apache 2.0 license and are available on Hugging Face and ModelScope.

Evaluation and Benchmarking Open Weights Progress Qwen3 Embedding Alibaba Qwen Apache 2.0 +5 more

8Qwen Research·1mo ago·source ↗

Qwen3 Release: Flagship 235B MoE and Full Model Family Announced

Alibaba's Qwen team has released Qwen3, a new family of large language models including the flagship Qwen3-235B-A22B mixture-of-experts model. The flagship model claims competitive benchmark performance against DeepSeek-R1, OpenAI o1/o3-mini, Grok-3, and Gemini-2.5-Pro on coding, math, and general capabilities. A smaller MoE variant, Qwen3-30B-A3B, reportedly outperforms QwQ-32B despite using only one-tenth the activated parameters, and the 4B model is said to match Qwen2.5's larger models. Models are available across Hugging Face, ModelScope, and Kaggle.

Frontier Model Releases Evaluation and Benchmarking Alibaba Qwen DeepSeek V4 Qwen3-30B-A3B +10 more

5Hugging Face Blog·1mo ago·source ↗

Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

IBM Research presents an analysis of VAKRA, a benchmark designed to evaluate agentic AI systems on reasoning and tool use capabilities. The post examines how agents fail across different task categories, surfacing systematic failure modes in multi-step reasoning and tool invocation. The analysis provides diagnostic insights into where current agent architectures break down under realistic task conditions.

Evaluation and Benchmarking AI Safety Research IBM Research Hugging Face VAKRA +1 more

7Qwen Research·1mo ago·source ↗

QwQ-32B: Scaling Reinforcement Learning for Enhanced Reasoning

Alibaba's Qwen team releases QwQ-32B, a 32-billion parameter model trained with scaled Reinforcement Learning to improve reasoning capabilities beyond conventional pretraining and post-training methods. The release draws explicit comparison to DeepSeek R1's cold-start and multi-stage RL training approach. The model is available via Qwen Chat, Hugging Face, ModelScope, and a demo interface. This represents Qwen's exploration of RL scalability as a path to enhanced LLM intelligence.

Frontier Model Releases Evaluation and Benchmarking DeepSeek V4 Alibaba Qwen +6 more

5Interconnects·1mo ago·source ↗

Opus 4.6, Codex 5.3, and the post-benchmark era

An Interconnects commentary piece examining how to compare frontier AI models in 2026, using Anthropic's Opus 4.6 and OpenAI's Codex 5.3 as case studies. The piece appears to argue that traditional benchmarks are no longer sufficient for distinguishing model capabilities at the frontier. This reflects a broader industry shift toward more nuanced, task-specific evaluation methods.

Frontier Model Releases Evaluation and Benchmarking Interconnects Codex 5.3 Claude Opus 4.6 +2 more

6Qwen Research·1mo ago·source ↗

Qwen2.5-Math Process Reward Model for Mathematical Reasoning Supervision

Alibaba's Qwen team introduces a process reward model (PRM) aimed at improving the reliability of mathematical reasoning in LLMs by supervising intermediate reasoning steps rather than only final answers. The work addresses the problem of models producing plausible but flawed intermediate derivations even when reaching correct conclusions. The release includes model weights on Hugging Face and ModelScope alongside a GitHub repository.

Evaluation and Benchmarking Open Weights Progress Process Reward Model Alibaba Qwen +4 more

4Hugging Face Blog·1mo ago·source ↗

A New Framework for Evaluating Voice Agents (EVA)

ServiceNow AI has published a blog post on Hugging Face introducing EVA, a new evaluation framework designed specifically for voice agents. The framework appears to address gaps in existing evaluation methodologies for assessing voice-based AI agent performance. As voice agents become more prevalent in enterprise and consumer settings, standardized evaluation protocols are increasingly important for benchmarking progress.

Evaluation and Benchmarking Agent and Tool Ecosystem ServiceNow AI Hugging Face EVA

7Qwen Research·1mo ago·source ↗

QwQ-32B-Preview: Alibaba's Qwen Reasoning Model with Deep Reflection Capabilities

Alibaba's Qwen team has released QwQ-32B-Preview, a 32-billion parameter model designed for deep reasoning across mathematics, code, and general knowledge. The model is positioned as a reasoning-focused system that emphasizes uncertainty and iterative questioning as core design principles. It is available on GitHub, Hugging Face, ModelScope, and via a demo interface.

Frontier Model Releases Evaluation and Benchmarking Alibaba QwQ-32B-Preview Qwen +3 more

8Qwen Research·1mo ago·source ↗

Qwen2.5-Coder Series Open-Sourced: 32B Model Claims SOTA, Matches GPT-4o on Coding

Alibaba's Qwen team has open-sourced the Qwen2.5-Coder family of code-specialized language models, with the flagship 32B-Instruct variant claiming state-of-the-art performance among open-source code models and parity with GPT-4o on coding benchmarks. The release spans multiple model sizes, expanding on previously released smaller variants. The models are described as combining strong coding ability with general reasoning and mathematical skills.

Frontier Model Releases Evaluation and Benchmarking Qwen2.5-Coder-32B-Instruct GPT-4o OpenAI +3 more

7Qwen Research·1mo ago·source ↗

Qwen2.5-Math: Open-Source Mathematical LLM Series Released

Alibaba's Qwen team has released Qwen2.5-Math, an upgraded series of open-source mathematical LLMs including base and instruction-tuned models at 1.5B, 7B, and 72B parameter scales, plus a mathematical reward model. The models support Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR) for English and Chinese math problem solving. This follows the Qwen2-Math release approximately one month prior and is claimed to be the leading open-source mathematical LLM series.

Frontier Model Releases Evaluation and Benchmarking Tool-Integrated Reasoning Chain-of-Thought Reasoning Qwen2.5-Math-PRM +2 more

7Qwen Research·1mo ago·source ↗

Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding

Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.

Frontier Model Releases Evaluation and Benchmarking Qwen2.5-VL RealWorldQA DocVQA +6 more

6Qwen Research·1mo ago·source ↗

Introducing Qwen2-Math: Math-Specialized LLMs from Alibaba's Qwen Team

Alibaba's Qwen team has released Qwen2-Math and Qwen2-Math-Instruct, a series of math-specialized large language models built on the Qwen2 architecture. The models are designed to enhance arithmetic and mathematical reasoning capabilities in LLMs. The initial release supports English only, with bilingual English/Chinese versions announced as forthcoming.

Frontier Model Releases Evaluation and Benchmarking Qwen2-Math-Instruct Qwen2.5 Alibaba +2 more

6Qwen Research·1mo ago·source ↗

Qwen-Max-0428: Alibaba's Largest Instruction-Tuned Model Released

Alibaba's Qwen team has released Qwen-Max-0428, a new instruction-tuned model larger than the previously open-sourced Qwen1.5-110B-Chat. The model has entered Chatbot Arena and reached the top-10 on the leaderboard, while also outperforming Qwen1.5-110B-Chat on MT-Bench. The model is available via API, though it does not appear to be open-weights at this stage.

Frontier Model Releases Evaluation and Benchmarking Chatbot Arena Alibaba Qwen MT-Bench +3 more

7Qwen Research·1mo ago·source ↗

Qwen1.5-110B: Alibaba Releases First 100B+ Model in Qwen1.5 Series

Alibaba's Qwen team released Qwen1.5-110B, their first open-weights model exceeding 100 billion parameters. The model claims comparable performance to Meta's Llama-3-70B on base model benchmarks, with strong results on MT-Bench and AlpacaEval 2 chat evaluations. The release follows a wave of large open-source models exceeding 100B parameters from various organizations.

Frontier Model Releases Evaluation and Benchmarking MT-Bench Meta-Llama-3-70B Alibaba +3 more

5Hugging Face Blog·1mo ago·source ↗

Community Evals: Because we're done trusting black-box leaderboards over the community

Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing opaque black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.

Evaluation and Benchmarking Open Weights Progress Open LLM Leaderboard Hugging Face Community Evals

5Hugging Face Blog·1mo ago·source ↗

Measuring Open-Source Llama Nemotron Models on DeepResearch Bench

NVIDIA evaluates its open-source Llama Nemotron models on the DeepResearch Bench, a benchmark designed to assess deep research agent capabilities. The post appears to report competitive performance of the Nemotron models in agentic research tasks. This is relevant to the ongoing development of open-weights models capable of multi-step research and reasoning workflows.

Evaluation and Benchmarking Open Weights Progress Llama Nemotron NVIDIA DeepResearch Bench +3 more

4Hugging Face Blog·1mo ago·source ↗

3LM: A Benchmark for Arabic LLMs in STEM and Code

TII UAE has released 3LM, a benchmark designed to evaluate large language models on Arabic-language STEM and coding tasks. The benchmark addresses a gap in multilingual evaluation infrastructure, where Arabic has been underrepresented relative to English and other high-resource languages. It targets both general-purpose and Arabic-specialized LLMs to assess their capabilities in technical domains.

Evaluation and Benchmarking Open Weights Progress 3LM Hugging Face TII UAE

5Hugging Face Blog·1mo ago·source ↗

TimeScope: How Long Can Your Video Large Multimodal Model Go?

Hugging Face introduces TimeScope, a benchmark designed to evaluate video large multimodal models (LMMs) across varying video lengths and temporal reasoning demands. The benchmark targets a known gap in existing evaluations: most video benchmarks use short clips, leaving long-video understanding largely untested. TimeScope aims to systematically probe how model performance degrades or holds as video duration increases.

Long Context Evolution Evaluation and Benchmarking TimeScope Hugging Face +1 more

5Hugging Face Blog·1mo ago·source ↗

LeRobot Community Datasets: The "ImageNet" of Robotics — When and How?

Hugging Face's LeRobot blog post discusses the vision and current state of building a large-scale community robotics dataset analogous to ImageNet for computer vision. The post examines what it would take to create a standardized, scalable dataset repository for robot learning, drawing on the LeRobot ecosystem. It addresses data collection formats, community contribution workflows, and the open challenges in making such a resource practically useful for training generalizable robot policies.

Evaluation and Benchmarking Open Weights Progress LeRobot Hugging Face ImageNet +1 more

5Hugging Face Blog·1mo ago·source ↗

Introducing HELMET: Holistically Evaluating Long-context Language Models

HELMET is a new benchmark designed to holistically evaluate long-context language models across diverse real-world tasks rather than synthetic needle-in-a-haystack tests. The benchmark covers multiple task categories including retrieval, reasoning, summarization, and code, aiming to provide more reliable and comprehensive assessment of long-context capabilities. It is introduced via the Hugging Face blog, suggesting an open release with associated tooling for the community.

Long Context Evolution Evaluation and Benchmarking HELMET Hugging Face

6Anthropic News·1mo ago·source ↗

Anthropic forms $200 million partnership with the Gates Foundation

Anthropic and the Gates Foundation are committing $200 million over four years in grant funding, Claude usage credits, and technical support across global health, life sciences, education, and economic mobility. Key technical deliverables include healthcare AI benchmarks and evaluation frameworks, disease modeling integrations with the Institute for Disease Modeling, drug/vaccine screening tools for neglected diseases, and agricultural AI datasets. The partnership is led by Anthropic's Beneficial Deployments team and includes public goods such as open datasets and benchmarks. This represents a significant scaling of Anthropic's non-commercial AI deployment strategy.

Evaluation and Benchmarking Enterprise Deployment Patterns Institute for Disease Modeling Claude Gates Foundation +4 more

7Anthropic News·1mo ago·source ↗

Anthropic Launches Ten Finance Agent Templates with Microsoft 365 Integration and Expanded Data Connectors

Anthropic is releasing ten ready-to-run agent templates targeting high-value financial services workflows including pitchbook creation, KYC screening, and month-end close, deployable as plugins in Claude Cowork/Claude Code or as autonomous Claude Managed Agents. The release includes native add-ins for Microsoft Excel, PowerPoint, Word, and Outlook with cross-application context persistence. Claude Opus 4.7 underpins the offering and leads the Vals AI Finance Agent benchmark at 64.37%, with new data connectors from partners including Dun & Bradstreet, Fiscal AI, FactSet, S&P Capital IQ, and others providing governed real-time data access.

Frontier Model Releases Evaluation and Benchmarking Vals AI Finance Agent Benchmark Claude Opus 4.6 Microsoft 365 +14 more

8Anthropic News·1mo ago·source ↗

Anthropic Releases Claude Opus 4.7 with Enhanced Coding, Vision, and Cyber Safeguards

Anthropic has released Claude Opus 4.7, a general-availability model positioned as a meaningful improvement over Opus 4.6 in advanced software engineering, long-horizon agentic tasks, and vision capabilities including higher image resolution. The model is notably the first to receive new cybersecurity safeguards developed in response to Project Glasswing, with automatic detection and blocking of prohibited cyber uses and a new Cyber Verification Program for legitimate security professionals. Opus 4.7 is available across Claude products, API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same pricing as Opus 4.6 ($5/$25 per million input/output tokens). The release is explicitly positioned below Claude Mythos Preview in overall capability, serving as a testbed for safety mechanisms before broader deployment of Mythos-class models.

Frontier Model Releases Evaluation and Benchmarking Harvey Solve Intelligence Amazon Bedrock +16 more

6Anthropic News·1mo ago·source ↗

Anthropic Updates Election Safeguards for Claude Ahead of 2026 US Midterms

Anthropic has published an update on its election-related safety measures for Claude, covering political bias evaluations, usage policy enforcement, and influence operation resistance testing. The new models Claude Opus 4.7 and Sonnet 4.6 scored 95-96% on political impartiality evaluations and complied with election-related usage policies on 99.8-100% of a 600-prompt test suite. For the first time, Anthropic tested whether models can autonomously run influence operations end-to-end, finding that only Mythos Preview and Opus 4.7 completed more than half of the tasks when safeguards were removed, underscoring ongoing capability concerns. Anthropic is also deploying election information banners pointing users to nonpartisan resources like TurboVote for the 2026 US midterms.

Frontier Model Releases Evaluation and Benchmarking Collective Intelligence Project Claude Sonnet 4 Claude Opus 4.6 +9 more

7Anthropic News·1mo ago·source ↗

Anthropic Launches Claude for Financial Services with Claude 4 Models and Ecosystem Integrations

Anthropic has introduced a Financial Analysis Solution targeting finance professionals, built around Claude 4 models and pre-built MCP connectors to data providers including FactSet, S&P Global, PitchBook, Databricks, and Snowflake. Claude Opus 4 reportedly passed 5 of 7 levels of the Financial Modeling World Cup and scored 83% accuracy on complex Excel tasks when deployed by FundamentalLabs. The solution includes Claude Code with expanded usage limits, expert implementation support, and partnerships with major consultancies including Accenture, Deloitte, KPMG, and PwC. Early adopters include Bridgewater's AIA Labs, which has used Claude since 2023 for investment analyst workflows.

Frontier Model Releases Evaluation and Benchmarking PwC Vals AI Finance Agent Benchmark Palantir +20 more
