5·Hugging Face Blog·8d ago

AllenAI releases olmo-eval evaluation workbench for model development

AllenAI published a blog post on Hugging Face introducing olmo-eval, an evaluation workbench designed to integrate into the model development loop. The tool appears aimed at streamlining evaluation workflows for researchers iterating on open-weights models. This is relevant to the OLMo model family ecosystem and the broader open-weights evaluation infrastructure space.

Evaluation and Benchmarking · Open Weights Progress · OLMo · AllenAI · Hugging Face · olmo-eval

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5·Hugging Face Blog·1mo ago

OlmoEarth v1.1: A More Efficient Family of Models

AllenAI has released OlmoEarth v1.1, described as a more efficient family of models, published via the Hugging Face blog. The post appears to detail improvements in model efficiency for the OlmoEarth line, which is focused on Earth/geoscience domains. As an open-weights release from a major academic AI lab, it continues the trend of domain-specialized open models.

Open Weights Progress · Inference Economics · AllenAI · OlmoEarth v1.1 · Hugging Face

5·Hugging Face Blog·1mo ago

Announcing Evaluation on the Hub

Hugging Face announced Evaluation on the Hub, a new feature enabling users to evaluate any model on any dataset directly within the Hugging Face Hub infrastructure. The tool aims to lower the barrier to standardized model evaluation by integrating evaluation workflows into the existing model and dataset hosting platform. This represents an infrastructure step toward more accessible and reproducible benchmarking in the ML community.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Evaluation on the Hub · Hugging Face

5·Hugging Face Blog·1mo ago

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. As a tier-2 commentary piece, it likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · OpenEnv

5·Hugging Face Blog·2d ago

Hugging Face benchmarks open models on agentic tool-use tasks

Hugging Face published a blog post examining whether open models are sufficiently capable for agentic use cases, focusing on benchmarking them against real-world tooling. The post addresses the practical question of which open-weights models can reliably handle tool-calling and multi-step agentic workflows. This is relevant to practitioners evaluating open models for agent deployments.

Evaluation and Benchmarking · Open Weights Progress · Hugging Face · +1 more

5·Hugging Face Blog·1mo ago

Community Evals: Because we're done trusting black-box leaderboards over the community

Hugging Face introduces Community Evals, a framework aimed at replacing or supplementing opaque black-box leaderboards with community-driven model evaluations. The initiative reflects growing skepticism about the reliability and transparency of existing benchmark leaderboards. By crowdsourcing evaluations, Hugging Face seeks to make model assessment more transparent, diverse, and resistant to gaming. This represents a structural shift in how the open-source AI community approaches model comparison and trust.

Evaluation and Benchmarking · Open Weights Progress · Open LLM Leaderboard · Hugging Face · Community Evals

6·arXiv · cs.CL·5d ago

Every Eval Ever: unified schema and community repository for AI evaluation results

Researchers introduce Every Eval Ever, a shared schema and crowdsourced repository designed to standardize AI evaluation results across incompatible formats, frameworks, and sources. The system ingests results from evaluation harnesses, papers, leaderboards, and custom repositories into a single JSON document format, with optional per-instance output storage. The repository, hosted on Hugging Face, currently covers 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The work addresses a persistent infrastructure problem in AI evaluation science: divergent scores for nominally identical evaluations and scattered, incomparable metadata.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Every Eval Ever
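
To make the idea of a single shared result format concrete, here is a minimal Python sketch of normalizing two incompatible result layouts into one JSON document. Everything in it is an assumption for illustration: the `EvalRecord` fields, the two input layouts, and the helper names are hypothetical and are not the Every Eval Ever schema, which the summary above does not specify.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class EvalRecord:
    """Hypothetical unified record; field names are illustrative only."""
    model: str
    benchmark: str
    metric: str
    score: float                      # stored as a fraction in [0, 1]
    source_format: str                # where the number came from
    instances: Optional[List[Dict[str, Any]]] = None  # optional per-instance outputs


def from_harness(raw: Dict[str, Any]) -> List[EvalRecord]:
    """Assumed harness layout: {"model": ..., "results": {task: {metric: value}}}."""
    return [
        EvalRecord(raw["model"], task, metric, float(value), "harness")
        for task, metrics in raw["results"].items()
        for metric, value in metrics.items()
    ]


def from_leaderboard_row(row: Dict[str, Any]) -> EvalRecord:
    """Assumed leaderboard layout: a flat row whose score is a percentage."""
    return EvalRecord(
        row["name"], row["bench"], row["metric"], row["pct"] / 100.0, "leaderboard"
    )


def to_json(records: List[EvalRecord]) -> str:
    """Serialize all records into one JSON document."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


def score_gap(a: EvalRecord, b: EvalRecord) -> float:
    """Absolute difference between two scores for the same model, benchmark, and metric."""
    if (a.model, a.benchmark, a.metric) != (b.model, b.benchmark, b.metric):
        raise ValueError("records describe different evaluations")
    return abs(a.score - b.score)
```

Once both sources share one record type, a divergence between nominally identical evaluations becomes a simple comparison: `score_gap` on a harness record and a leaderboard record for the same model and benchmark returns the size of the disagreement.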

5·Hugging Face Blog·3d ago

MolmoMotion: Language-guided 3D motion forecasting from Allen AI

AllenAI published a blog post on Hugging Face introducing MolmoMotion, a system for language-guided 3D motion forecasting. The work extends the Molmo model family into motion prediction tasks, combining natural language conditioning with 3D spatial reasoning. The post appears to be an announcement or demonstration of the capability; its body content was not available for detailed review.

Frontier Model Releases · Multimodal Progress · MolmoMotion · Molmo · Hugging Face · +1 more

5·Hugging Face Blog·1mo ago

EMO: Pretraining Mixture of Experts for Emergent Modularity

AllenAI introduces EMO, a pretraining approach for Mixture of Experts (MoE) models that aims to produce emergent modularity during training. The work explores how MoE architectures can develop specialized expert routing without explicit supervision. Published on the Hugging Face blog, this represents research-level work on improving MoE training dynamics and efficiency.

Training Infrastructure · Frontier Model Releases · AllenAI · Mixture of Experts · Hugging Face · +2 more