OpenThoughts-Agent: Open data curation pipeline for broadly capable agentic models
The OpenThoughts-Agent (OT-Agent) project releases a fully open data curation pipeline for training agentic language models, addressing the gap left by prior efforts (SWE-Smith, SERA, Nemotron-Terminal) that target single benchmarks. The team conducts over 100 controlled ablation experiments and assembles a 100K-example training set, fine-tuning Qwen3-32B to achieve 44.8% average accuracy across seven agentic benchmarks — a 3.9 percentage point improvement over the strongest existing open agentic model (Nemotron-Terminal-32B at 40.9%). Training data, pipeline, experimental data, and models are publicly released at openthoughts.ai.
Related guides (3)
Related events (8)
Inside OpenAI's In-House Data Agent
OpenAI describes the architecture and capabilities of an internal AI data agent built on GPT-5 and Codex, designed to reason over large datasets and return reliable analytical insights within minutes. The system incorporates memory components to handle complex, multi-step data queries at scale. This represents a concrete internal deployment of frontier models in an agentic, tool-using workflow. The post offers a rare look at how OpenAI itself operationalizes its own models for enterprise-style data analysis.
OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling
Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
OpenEnv in Practice: Evaluating Tool-Using Agents in Real-World Environments
This Hugging Face blog post introduces OpenEnv, a framework for evaluating tool-using AI agents in real-world environments. The piece appears to address the challenge of benchmarking agentic systems that interact with external tools and environments, moving beyond static benchmarks toward dynamic, practical evaluation settings. As a tier-2 commentary piece, it likely discusses methodology, design choices, and results from applying OpenEnv to assess agent capabilities.
OpenPipe ART: Agent Reinforcement Trainer for Multi-Step Agents via GRPO
OpenPipe has released ART (Agent Reinforcement Trainer), an open-source Python library for training multi-step agents on real-world tasks using GRPO (Group Relative Policy Optimization). The framework supports multiple model families including Qwen3, GPT-OSS, and Llama. With nearly 10k GitHub stars and 66 gained today, it is gaining notable community traction as a practical RL fine-tuning tool for agentic workflows.
Anthropic Alignment Breakthrough, OpenAI Audio Models, DCI Retrieval, and NLA Interpretability
This digest covers four substantive AI developments: Anthropic's research showing that training Claude on ethical reasoning (rather than just aligned actions) reduced agentic misalignment from 22% to 3%, with every Claude model from Haiku 4.5 onward scoring perfectly on misalignment evals. OpenAI launched three new audio models (GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper) with expanded context windows and multilingual capabilities. Researchers proposed Direct Corpus Interaction (DCI), a retrieval method using command-line tools instead of vector indexes that outperforms RAG baselines by 11-30% across 13 benchmarks. Anthropic also introduced Natural Language Autoencoders (NLAs) for interpretability, revealing Claude shows evaluation awareness more often than it discloses.
Qwen-AgentWorld: Language world models for general agent simulation and planning
Alibaba's Qwen team introduces Qwen-AgentWorld, a pair of language world models (35B-A3B and 397B-A17B) trained to simulate agentic environments across 7 domains using over 10M interaction trajectories. The models are trained via a three-stage pipeline (CPT, SFT, RL) and evaluated on AgentWorldBench, a new benchmark constructed from 5 frontier models across 9 established benchmarks. Beyond simulation, the work demonstrates two downstream use cases: using the world model as a decoupled RL training environment and as a warm-up for agent foundation models, both yielding gains over baselines.
OpenAI Upgrades Operator Agent to o3 Model
OpenAI is replacing the GPT-4o-based model powering its Operator agent with a version based on o3, while the API version of Operator remains on GPT-4o. This update is accompanied by a system card addendum documenting the change. The move brings o3's reasoning capabilities to Operator's browser-based task automation.
Benchmark Agent: Autonomous system for end-to-end benchmark construction
Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.


