GAIA

benchmark · active · gaia-ce7ffa09 · 6 events · first seen 28d ago

Aliases: GAIA

Recent events (6)

6 · Hugging Face Blog · 28d ago

Hugging Face Transformers Code Agent Beats GAIA Benchmark

Hugging Face reports that its Transformers-based code agent has achieved a top score on the GAIA benchmark, a challenging evaluation for general AI assistants that requires multi-step reasoning and tool use. The result positions Hugging Face's open agent framework competitively against proprietary systems. The post details the agent architecture and tooling used to achieve the result.

6 · Hugging Face Blog · 28d ago

Gaia2 and ARE: Empowering the community to study agents

A Hugging Face blog post introduces Gaia2 and the Agents Research Environments (ARE) framework, aimed at enabling the research community to study and benchmark AI agents. The post describes new tools and datasets for evaluating agent capabilities, building on the original GAIA benchmark. The release expands the agent evaluation ecosystem with community-oriented tooling.

6 · arXiv · cs.AI · 20d ago

FluxMem: Connectivity-Evolving Memory Framework for LLM Agents

FluxMem proposes a heterogeneous graph-based memory framework for LLM agents that continuously evolves its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. Unlike static memory repositories, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills successful trajectories into reusable procedural circuits. The system is guided by a single metric for memory generalizability and evolutionary maturity, and it achieves state-of-the-art results on the LoCoMo, Mind2Web, and GAIA benchmarks.

6 · arXiv · cs.AI · 26h ago

Bayesian audit framework for public AI evaluation archives challenges frontier model claims

A new arXiv preprint proposes a Bayesian inference and decision-audit framework for interpreting public AI evaluation archives (LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as longitudinal time series rather than terminal leaderboards. The paper demonstrates that a single terminal snapshot is compatible with multiple distinct performance histories, yielding ambiguous timing estimates for reaching capability ceilings. A candidate selection-aware frontier model is shown to fail synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, with fixed audit gates rejecting its stronger claims. The work proposes an archive-and-adjudication protocol to reconstruct evaluation histories and falsify unsupported frontier capability claims.

6 · arXiv · cs.AI · 2d ago

Parallel-Synthesis framework enables LLM agents to consume KV caches directly, cutting synthesis latency 2.5x–11x

Researchers introduce Parallel-Synthesis, a plug-and-play framework that allows a synthesizer LLM to directly consume KV caches produced by parallel worker agents instead of concatenating their textual outputs. The system combines a cache mapper for calibrating independently generated branch caches with a fine-tuned synthesizer adapter, trained via distillation from standard text-concatenation synthesis. Evaluated across nine datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, it matches or outperforms text-based synthesis on seven datasets while reducing time-to-first-token by 2.5x–11x. The work proposes a fundamentally different interface for multi-agent synthesis that avoids redundant prefill computation inherent in sequential text merging.

5 · arXiv · cs.CL · 5d ago

EvoArena benchmark and EvoMem memory paradigm for LLM agents in dynamic environments

Researchers introduce EvoArena, a benchmark suite that evaluates LLM agents in dynamic environments by modeling changes as progressive update sequences across terminal, software, and social domains. Alongside it, they propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories to help agents reason about environmental change. Current agents score only 39.6% average accuracy on EvoArena, while EvoMem yields consistent gains on EvoArena and also improves performance on the GAIA and LoCoMo benchmarks. The work highlights a significant gap between static-benchmark performance and real-world dynamic deployment requirements.