4 · arXiv cs.AI (Artificial Intelligence) · 17h ago

Rhetor: Multi-agent system for rehearsed live product demos with real-time voice Q&A

Researchers introduce Rhetor, a multi-agent system that automates live software product demonstrations by taking a running web application and its source code as input, then producing a rehearsed demo with synchronized narration and real-time voice question answering. The system combines UI exploration with source-code analysis, uses semantic locators for browser action dispatch, and includes a pre-presentation rehearsal loop with graceful degradation. Evaluated across six pipeline sessions on four deployed applications, the system achieves high locator-firing rates (sigma-bar ~0.92 on a 53-action workload) and converges to perfect locator resolution on a public-domain reference app. The paper also proposes a ten-metric benchmark protocol for evaluating demo automation systems.
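The rehearsal idea in the summary can be illustrated with a minimal sketch. It is not Rhetor's implementation: the `Action` type, the `rehearse` function, the locator strings, and the `resolve` callback are all hypothetical. The firing rate computed here is simply the fraction of actions with at least one resolving locator, which may differ from the paper's metric definition. Actions whose locators all fail are kept in the plan with `None`, so a presenter could narrate the step without dispatching it (one possible reading of "graceful degradation").

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """One demo step. `locators` is an ordered list of candidate
    semantic locators, most preferred first (hypothetical format)."""
    name: str
    locators: list[str]


def rehearse(
    actions: list[Action],
    resolve: Callable[[str], bool],
) -> tuple[dict[str, Optional[str]], float]:
    """Dry-run every action before the live demo.

    Returns a plan mapping each action name to the first locator that
    resolves (or None if none does), plus the fraction of actions
    that found a working locator.
    """
    plan: dict[str, Optional[str]] = {}
    fired = 0
    for action in actions:
        # Take the first candidate that resolves on the page, else None.
        chosen = next((loc for loc in action.locators if resolve(loc)), None)
        plan[action.name] = chosen  # None: narrate only, skip the click
        if chosen is not None:
            fired += 1
    rate = fired / len(actions) if actions else 0.0
    return plan, rate


# Stand-in for a live page: the set of locators that currently resolve.
page = {"role=button[name='Save']", "text=Export"}
actions = [
    Action("save", ["role=button[name='Save']"]),
    Action("export", ["role=button[name='Export']", "text=Export"]),
    Action("share", ["role=button[name='Share']"]),
]
plan, rate = rehearse(actions, lambda loc: loc in page)
print(plan)
print(f"firing rate: {rate:.2f}")
```

In a real setting `resolve` would query the browser rather than a set, and the rehearsal would be repeated after repairing failed locators until the rate stops improving.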

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Excalidraw · Rhetor

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Related events (8)

5 · arXiv · cs.AI · 1mo ago

Reversa: A Multi-Agent Framework for Reverse Engineering Legacy Software into AI-Readable Operational Specifications

Reversa is a multi-agent pipeline framework that converts legacy software systems into traceable operational specifications suitable for use by AI coding agents. The framework employs specialized agents for surface mapping, module analysis, implicit rule extraction, architecture synthesis, and specification review, with mechanisms for traceability, confidence marking, and gap preservation. An exploratory case study on migrating an ATM system from COBOL to Go produced 517 confidence-indexed claims, 53 Gherkin parity scenarios, and a partial reconstruction plan, though final validation was not completed. The system is distributed as a Node.js CLI and is positioned relative to literature on reverse engineering, LLM-based documentation, and software agents.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · SHA-256 · Go (programming language) · Gherkin · +3 more

5 · arXiv · cs.CL · 18d ago

ModeratorLM: Role-conditioned turn-taking for multi-party voice agents with 40%+ precision gains

Researchers introduce ModeratorLM, a voice agent system that conditions turn-taking behavior on an explicitly assigned conversational role in multi-party settings, built on a streaming speech LLM. A reasoning-augmented variant adds chain-of-thought over conversational context. Evaluated on real-world meeting data and the new RolePlayConv synthetic dataset, the system achieves over 40% improvement in turn-taking precision and 70% in recall while reducing false-positive interruptions versus non-role-conditioned baselines.

Agent and Tool Ecosystem · Multimodal Progress · ModeratorLM · RolePlayConv · Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

3 · Mistral AI News · 29d ago

Mistral AI Demonstrates Agentic Workflow for Meeting-to-Dev-Ticket Automation

Mistral AI has published a solution blog describing a multi-agent workflow called TranscriptToPRDTicket that converts meeting transcripts into Product Requirements Documents and engineering tickets using two specialized agents (PRDAgent and TicketCreationAgent) both powered by Mistral Large 2. The pipeline integrates with project management tools such as Linear and Jira, and a full implementation is provided via a Google Colab notebook. The post is primarily a deployment-pattern showcase rather than a new model or capability announcement.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Mistral AI · TicketCreationAgent · Mistral Large 2 · +4 more

6 · arXiv · cs.AI · 25d ago

Benchmark Agent: Autonomous system for end-to-end benchmark construction

Researchers introduce Benchmark Agent, a fully autonomous agentic system that orchestrates the complete benchmark construction pipeline — from query analysis and subtask design to data annotation and quality control. The system was used to produce 15 benchmarks spanning text understanding, multimodal understanding, and domain-specific reasoning, with evaluation via human judges, LLM-as-a-judge, and consistency checks. The work addresses two persistent problems in the field: the labor intensity of benchmark creation and rapid performance saturation after release. Code and a demo will be publicly released.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Benchmark Everything Everywhere All at Once · Benchmark Agent

6 · arXiv · cs.CL · 12d ago

OmniAgent: POMDP-based active perception agent for long video understanding with test-time scaling

Researchers introduce OmniAgent, a multimodal agent that reformulates long video understanding as a POMDP-based iterative Observation-Thought-Action cycle, selectively distilling audio-visual cues into persistent textual memory rather than processing all frames uniformly. The system uses Agentic Supervised Fine-Tuning and a novel reinforcement learning method (TAURA) with turn-level entropy for credit assignment. OmniAgent demonstrates positive test-time scaling and achieves state-of-the-art open-source results across ten benchmarks, with its 7B model outperforming Qwen2.5-VL-72B on LVBench (50.5% vs. 47.3%).

Inference Economics · Agent and Tool Ecosystem · OmniAgent · Qwen2.5-VL-72B · LVBench · +4 more

5 · arXiv · cs.CL · 15d ago

RePro: Retrospective Progress-Aware Self-Refinement for LLM Agent Training

Researchers introduce RePro (Retrospective Progress-Aware Training), a framework addressing the gap between step-wise RL optimization and metacognitive task-progress awareness in LLM agents. The approach uses a forward-then-reflect rollout paradigm where agents execute actions online and then retrospectively assess step-wise progress given the completed trajectory and known outcome. Evaluated on WebShop, ALFWorld, and Sokoban, RePro achieves up to 12% absolute success rate gains over baseline Qwen-family models without requiring continuous external supervision.

Agent and Tool Ecosystem · Alignment and RLHF · ALFWorld · Sokoban · RePro · +2 more

5 · arXiv · cs.CL · 20d ago

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution · Agent and Tool Ecosystem · ComoRAG · DocTrace · Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

7 · arXiv · cs.CL · 1mo ago

MobileGym: Verifiable Parallel Simulation Platform for Mobile GUI Agent Training

MobileGym is a browser-hosted simulation environment for mobile GUI agent research that enables deterministic outcome verification via structured JSON state and scalable online RL through hundreds of parallel instances (~400 MB/instance, ~3s cold start). The accompanying MobileGym-Bench provides 416 parameterized task templates across 28 apps with deterministic judges. A sim-to-real case study using GRPO on Qwen3-VL-4B-Instruct achieves +12.8 percentage points on the 256-task test set, with real-device execution retaining 95.1% of simulation-side training gains.

Evaluation and Benchmarking · Inference Economics · MobileGym-Bench · GRPO · MobileGym · +6 more