Native Active Perception as Reasoning for Omni-Modal Understanding

paper · active · provisional · native-active-perception-as-reasoning-for-omni-modal-understanding-d95d1994 · 1 event · first seen 3d ago

Aliases: Native Active Perception as Reasoning for Omni-Modal Understanding

Recent events (1)

6 · arXiv · cs.CL · 3d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as an iterative Observation-Thought-Action cycle grounded in a partially observable Markov decision process (POMDP). Rather than processing all frames uniformly, the agent selectively distills audio-visual cues into a persistent textual memory. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method, TAURA, which relies on turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art results among open-source models across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).
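
To make the Observation-Thought-Action cycle concrete, here is a minimal toy sketch in Python. It is not OmniAgent's implementation: the "video" is a list of text descriptions, relevance is plain keyword overlap, and every name (AgentState, observe, think, choose_action, run_agent, max_turns, max_notes) is a hypothetical stand-in chosen for illustration. It only shows the general shape of an agent that inspects segments selectively and keeps relevant cues in a persistent textual memory.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Persistent textual memory plus a record of inspected segments."""
    memory: list[str] = field(default_factory=list)
    visited: set[int] = field(default_factory=set)


def observe(segments: list[str], index: int) -> str:
    """Hypothetical perception tool: return a text description of one segment."""
    return segments[index]


def think(question_keywords: set[str], observation: str) -> bool:
    """Toy 'thought' step: is the observation relevant to the question?"""
    words = set(observation.lower().split())
    return bool(words & question_keywords)


def choose_action(state: AgentState, num_segments: int, max_notes: int):
    """Toy policy: answer once enough notes are stored, else inspect the next unseen segment."""
    if len(state.memory) >= max_notes:
        return ("answer", None)
    for i in range(num_segments):
        if i not in state.visited:
            return ("inspect", i)
    return ("answer", None)


def run_agent(segments, question_keywords, max_turns=10, max_notes=2):
    """Run the loop until the policy chooses to answer or the turn budget runs out."""
    state = AgentState()
    for _ in range(max_turns):
        action, index = choose_action(state, len(segments), max_notes)
        if action == "answer":
            break
        obs = observe(segments, index)           # Observation
        state.visited.add(index)
        if think(question_keywords, obs):        # Thought
            # Action on memory: keep only the relevant cue as text.
            state.memory.append(f"segment {index}: {obs}")
    return state


if __name__ == "__main__":
    video = [
        "a dog barks loudly",
        "credits roll",
        "the dog fetches a ball",
        "a cat sleeps",
    ]
    final_state = run_agent(video, {"dog"})
    print(final_state.memory)
    print(sorted(final_state.visited))
```

In this toy run the agent stops after collecting two relevant notes and never inspects the last segment, which is the point of selective perception. The max_turns parameter is only a rough stand-in for a test-time compute budget; how OmniAgent actually scales at test time is not described in the summary above.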