OmniAct: Hierarchical framework for persistent embodied agents with unified cyber-physical action spaces
Researchers present OmniAct, a hierarchical asynchronous architecture for embodied agents that unifies cyber (APIs, IoT) and physical (manipulation, navigation) action spaces under a single multimodal semantic planner. The system incorporates adaptive hierarchical memory with event-boundary-driven compression to maintain sub-linear context growth, and an asynchronous visual preemption engine for closed-loop failure recovery during physical execution. Evaluated across 40 real-world long-horizon tasks on two robotic platforms coordinating four IoT devices, OmniAct achieves consistent end-to-end success improvements and elevates mid-scale open-weight models to proprietary-level performance. The work directly addresses the fragmentation between planning, memory, and verification in existing embodied agent systems.
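The event-boundary-driven compression mentioned above can be pictured with a minimal sketch. It assumes each step already carries an event label; the summary describes a system that works from event boundaries, but the labels, the `compress_at_boundaries` function, and the summary format here are hypothetical and not taken from OmniAct. The point is only that memory grows with the number of events rather than the number of raw steps.

```python
from typing import List, Optional, Tuple


def compress_at_boundaries(steps: List[Tuple[str, str]]) -> List[str]:
    """Collapse each run of same-labelled steps into one summary line.

    `steps` is a list of (event_label, detail) pairs. Both the labels and
    the summary format are illustrative assumptions.
    """
    summaries: List[str] = []
    current_event: Optional[str] = None
    buffer: List[str] = []

    def flush() -> None:
        # Replace the buffered raw steps with a single compact entry.
        summaries.append(f"{current_event}: {len(buffer)} steps, last={buffer[-1]}")

    for event, detail in steps:
        if current_event is not None and event != current_event:
            flush()          # an event boundary was crossed
            buffer = []
        current_event = event
        buffer.append(detail)

    if buffer:
        flush()              # close the final, still-open event
    return summaries


trace = [
    ("navigate", "start"),
    ("navigate", "hallway"),
    ("grasp", "reach"),
    ("grasp", "closed gripper"),
    ("grasp", "lifted"),
]
memory = compress_at_boundaries(trace)
```

Five raw steps become two memory entries here. A real system would also need to detect the boundaries and produce richer summaries, which this sketch leaves out.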
Related events (8)
OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling
Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
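As a rough illustration of an iterative Observation-Thought-Action cycle that writes selected cues into textual memory, the toy loop below scans pre-captioned video segments under a turn budget. Everything in it is an assumption for illustration: the `run_agent` name, the caption strings, and the keyword test standing in for the model's reasoning. It does not implement the POMDP formulation or the TAURA training method.

```python
from typing import List, Optional, Tuple


def run_agent(
    segments: List[str], keyword: str, max_turns: int
) -> Tuple[Optional[int], List[str]]:
    """Inspect one segment per turn and keep only relevant notes.

    Returns the index of the segment that answers the query (or None if
    the turn budget runs out) together with the textual memory.
    """
    memory: List[str] = []
    position = 0
    for _ in range(max_turns):
        if position >= len(segments):
            break
        observation = segments[position]      # Observation: look at one segment
        relevant = keyword in observation     # Thought: does it bear on the query?
        if relevant:
            # Only relevant cues are written to memory.
            memory.append(f"segment {position}: {observation}")
            return position, memory           # Action: answer
        position += 1                         # Action: move on to the next segment
    return None, memory


captions = ["opening titles", "a dog runs across the field", "credits"]
short_budget = run_agent(captions, "dog", max_turns=1)
long_budget = run_agent(captions, "dog", max_turns=5)
```

With a budget of one turn the agent gives up, while a larger budget lets it reach the relevant segment, a loose analogue of spending more computation at test time.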
Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action
NVIDIA has released Cosmos 3, described as the first open omni-model targeting physical AI reasoning and action. The model is hosted and announced via Hugging Face, positioning it as an open-weights offering for robotics and embodied AI applications. The announcement highlights multimodal capabilities oriented toward physical world understanding and agent-level action.
OmniGameArena: UE5 benchmark for VLM game agents with multi-round improvement dynamics
Researchers introduce OmniGameArena, a real-time benchmark of twelve Unreal Engine 5 games spanning solo, PvP, and cooperative play, designed to evaluate vision-language model agents under unified protocols across commercial VLMs, open-weight VLMs, and specialized game policies. The benchmark introduces the Improvement Dynamics Curve (IDC), an agentic-reflection harness where a tool-using LLM autonomously refines skill prompts across multiple rounds, exposing how agent performance evolves and generalizes beyond a single cold-start score. Twelve VLM agents are evaluated on the leaderboard, with four top agents further analyzed under IDC. The work addresses gaps in existing game benchmarks that report only single-attempt scores and lack multi-agent or cooperative evaluation modes.
agent-teams-ai: multi-agent orchestration framework with kanban-style oversight
A TypeScript open-source project on GitHub implements a multi-agent system where autonomous agents handle tasks, communicate with each other, and review each other's work, while the user supervises via a kanban board. The framework supports 200+ models across 75+ LLM providers including Codex, Claude, and OpenCode. It has accumulated 1,189 stars with 56 added today, suggesting growing community interest.
AHA-WAM: Asynchronous world-action modeling with temporal decoupling for robot manipulation
AHA-WAM introduces a dual Diffusion Transformer architecture that decouples world prediction (low-frequency) from action execution (high-frequency) in robot manipulation policies, addressing the inefficiency of existing world-action models that force both branches to operate at the same temporal resolution. The system uses a rolling key-value memory video DiT as a long-horizon scene planner and a fast action DiT that queries layerwise latent context via joint attention, with Observation-Guided Video-Context Routing enabling asynchronous execution. AHA-WAM reports 92.80% average success on RoboTwin benchmarks and 78.3% on real-world tasks, running at 24.17 Hz (a 4.59x speedup over Fast-WAM) without robot-data pretraining.
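The idea of running the two branches at different rates can be sketched with a toy two-rate controller. This is a synchronous simplification under stated assumptions: `DecoupledPolicy`, `plan_every`, and the arithmetic inside both branches are hypothetical placeholders, not the AHA-WAM architecture, and no diffusion model or attention mechanism is involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecoupledPolicy:
    """Toy controller with a slow planning branch and a fast action branch."""

    plan_every: int          # run the slow branch once per this many control steps
    context: float = 0.0     # most recent cached output of the slow branch
    planner_calls: int = 0   # how often the slow branch actually ran

    def slow_plan(self, observation: float) -> float:
        # Placeholder for the low-frequency world-prediction branch.
        self.planner_calls += 1
        return observation * 2.0

    def fast_act(self, observation: float) -> float:
        # Placeholder for the high-frequency action branch; it reads the
        # cached context instead of waiting for a fresh plan.
        return self.context - observation

    def run(self, observations: List[float]) -> List[float]:
        actions: List[float] = []
        for step, obs in enumerate(observations):
            if step % self.plan_every == 0:
                self.context = self.slow_plan(obs)   # refresh the context rarely
            actions.append(self.fast_act(obs))       # act on every step
        return actions


policy = DecoupledPolicy(plan_every=4)
actions = policy.run([1.0] * 8)
```

Eight control steps trigger only two planner calls, which is the source of the speedup in this toy setting. A genuinely asynchronous system would run the planner concurrently rather than on a fixed step schedule.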
Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments
Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.
AgentSpec: A modular framework for controlled composition and analysis of embodied LLM agent scaffolds
AgentSpec is a new modular specification framework that represents embodied LLM agents as typed compositions of reusable policy components with standardized interfaces across perception, memory, reasoning, reflection, action, and learning modules. The framework enables controlled swapping and recombination of components, instantiated across four benchmarks (DeliveryBench, ALFRED, MiniGrid, RoboTHOR). Key findings include that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength, and that RL-trained policies compose best when optimized with deployment-time scaffold structure. Code, baselines, and an interactive playground are publicly released.
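A minimal sketch of swapping components behind fixed interfaces is shown below. It assumes just two module slots, and the `Scaffold` class and all function names are hypothetical rather than AgentSpec's actual API. It also shows a small interaction effect: the same reasoner behaves differently depending on which perception module it is paired with.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

# Assumed interfaces: perception maps raw input to a percept, and
# reasoning maps a percept plus memory to an action string.
Perceive = Callable[[str], str]
Reason = Callable[[str, List[str]], str]


@dataclass(frozen=True)
class Scaffold:
    perceive: Perceive
    reason: Reason

    def step(self, raw: str, memory: List[str]) -> str:
        percept = self.perceive(raw)
        action = self.reason(percept, memory)
        memory.append(percept)   # record what was perceived this step
        return action


def lowercase_perception(raw: str) -> str:
    return raw.lower()


def passthrough_perception(raw: str) -> str:
    return raw


def door_reasoner(percept: str, memory: List[str]) -> str:
    # Implicitly relies on lowercase percepts.
    return "open" if "door" in percept else "wait"


base = Scaffold(perceive=lowercase_perception, reason=door_reasoner)
swapped = replace(base, perceive=passthrough_perception)  # swap one module only

base_action = base.step("DOOR ahead", [])
swapped_action = swapped.step("DOOR ahead", [])
```

Neither perception module is wrong in isolation, yet only one is compatible with this reasoner, which loosely mirrors the finding that scaffold compatibility matters more than isolated module strength.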
HAT-4D: Agentic framework for 4D multi-object interaction reconstruction from monocular video
HAT-4D is a new agentic framework that reconstructs 3D geometry, temporal dynamics, and physical interactions of multiple objects from single monocular videos, targeting scalable data collection for Embodied AI and Vision-Language-Action (VLA) model training. The system integrates VLMs with a multi-level human-in-the-loop feedback mechanism to resolve depth ambiguities and occlusions without expensive multi-camera rigs. The authors also introduce MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction with a novel evaluation protocol focused on physical plausibility and temporal consistency. Experiments show state-of-the-art performance on most metrics, and HAT-4D-generated data improves downstream model fine-tuning.


