Almanac
6 · arXiv cs.AI (Artificial Intelligence) · 42h ago

OpenThoughts-Agent: Open data curation pipeline for broadly capable agentic models

The OpenThoughts-Agent (OT-Agent) project releases a fully open data curation pipeline for training agentic language models, addressing the gap left by prior efforts (SWE-Smith, SERA, Nemotron-Terminal) that target single benchmarks. The team conducts over 100 controlled ablation experiments and assembles a 100K-example training set, fine-tuning Qwen3-32B to achieve 44.8% average accuracy across seven agentic benchmarks — a 3.9 percentage point improvement over the strongest existing open agentic model (Nemotron-Terminal-32B at 40.9%). Training data, pipeline, experimental data, and models are publicly released at openthoughts.ai.

Related events (8)

7 · OpenAI Blog · 1mo ago

Inside OpenAI's In-House Data Agent

OpenAI describes the architecture and capabilities of an internal AI data agent built on GPT-5 and Codex, designed to reason over large datasets and return reliable analytical insights within minutes. The system incorporates memory components to handle complex, multi-step data queries at scale. This represents a concrete internal deployment of frontier models in an agentic, tool-using workflow. The post offers a rare look at how OpenAI itself operationalizes its own models for enterprise-style data analysis.

6 · arXiv · cs.CL · 7d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

5 · Hugging Face Blog · 1mo ago

OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments

This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. As a tier-2 commentary piece, it likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.

5 · GitHub Trending · 1mo ago

OpenPipe ART: Agent Reinforcement Trainer for Multi-Step Agents via GRPO

OpenPipe has released ART (Agent Reinforcement Trainer), an open-source Python library for training multi-step agents on real-world tasks using GRPO (Group Relative Policy Optimization). The framework supports multiple model families including Qwen3, GPT-OSS, and Llama. With nearly 10k GitHub stars and 66 gained today, it is gaining notable community traction as a practical RL fine-tuning tool for agentic workflows.

7 · The Batch · 1mo ago

Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability

This digest covers four substantive AI developments. Anthropic's research shows that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method that uses command-line tools instead of vector indexes and outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, which reveal that Claude shows evaluation awareness more often than it discloses.

7 · arXiv · cs.CL · 42h ago

Qwen-AgentWorld: Language world models for general agent simulation and planning

Alibaba's Qwen team introduces Qwen-AgentWorld, a pair of language world models (35B-A3B and 397B-A17B) trained to simulate agentic environments across 7 domains using over 10M interaction trajectories. The models are trained via a three-stage pipeline (CPT, SFT, RL) and evaluated on AgentWorldBench, a new benchmark constructed from 5 frontier models across 9 established benchmarks. Beyond simulation, the work demonstrates two downstream use cases: using the world model as a decoupled RL training environment and as a warm-up for agent foundation models, both yielding gains over baselines.

6 · OpenAI Blog · 1mo ago

OpenAI Upgrades Operator Agent to o3 Model

OpenAI is replacing the GPT-4o-based model powering its Operator agent with a version based on o3, while the API version of Operator remains on GPT-4o. This update is accompanied by a system card addendum documenting the change. The move brings o3's reasoning capabilities to Operator's browser-based task automation.

6 · arXiv · cs.AI · 20d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.