Simon Willison describes a technique for having AI agents record video demonstrations of their browser-based work using the shot-scraper video tool. The approach enables automated capture of agent activity for debugging, documentation, or demonstration purposes. This is a practical tooling pattern relevant to anyone building or evaluating web-browsing agents.
Researchers introduce Rhetor, a multi-agent system that automates live software product demonstrations by taking a running web application and its source code as input, then producing a rehearsed demo with synchronized narration and real-time voice question answering. The system combines UI exploration with source-code analysis, uses semantic locators for browser action dispatch, and includes a pre-presentation rehearsal loop with graceful degradation. Evaluated across six pipeline sessions on four deployed applications, the system achieves high locator-firing rates (σ̄ ≈ 0.92 on a 53-action workload) and converges to perfect locator resolution on a public-domain reference app. The paper also proposes a ten-metric benchmark protocol for evaluating demo automation systems.
Simon Willison documents an update to his OpenAI WebRTC Audio Session tool that adds document context capabilities, allowing audio sessions to incorporate document content. The post covers practical integration of OpenAI's real-time audio API with document-grounded context. This is a hands-on tooling walkthrough relevant to practitioners building voice-enabled AI applications.
A new arXiv preprint proposes converting human browser interaction trajectories into compact natural-language skills that agents can retrieve and compose, arguing that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operations. The approach organizes distilled skills into a skill graph to enable consolidation rather than unbounded accumulation. The work positions collective human browsing behavior as a scalable, under-exploited source of reusable agent priors, potentially reducing reliance on manually designed task demonstrations.
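To make the skill-graph idea concrete, here is a minimal sketch, not the preprint's implementation. Every name in it (`Skill`, `SkillGraph`, `add`, `retrieve`, `compose`) is hypothetical, and the word-overlap retrieval is a stand-in assumption for whatever retrieval method the authors actually use. It shows consolidation (a repeated skill is merged rather than stored twice), retrieval by description, and composition through prerequisite edges.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical natural-language skill distilled from browsing trajectories."""
    name: str
    description: str
    uses: int = 1  # number of trajectories that supported this skill
    prerequisites: set = field(default_factory=set)


class SkillGraph:
    """Toy skill graph: nodes are skills, edges point to prerequisite skills."""

    def __init__(self):
        self.skills = {}

    def add(self, name, description, prerequisites=()):
        # Consolidate: a skill seen again is merged, not stored a second time.
        if name in self.skills:
            skill = self.skills[name]
            skill.uses += 1
            skill.prerequisites.update(prerequisites)
        else:
            skill = Skill(name, description, prerequisites=set(prerequisites))
            self.skills[name] = skill
        return skill

    def retrieve(self, query, k=3):
        # Assumption: rank by word overlap with the description, then by support.
        query_words = set(query.lower().split())
        scored = []
        for skill in self.skills.values():
            overlap = len(query_words & set(skill.description.lower().split()))
            if overlap:
                scored.append((overlap, skill.uses, skill.name))
        scored.sort(reverse=True)
        return [name for _, _, name in scored[:k]]

    def compose(self, name):
        # Return the skill and its transitive prerequisites, prerequisites first.
        order, seen = [], set()

        def visit(node):
            if node in seen or node not in self.skills:
                return
            seen.add(node)
            for prerequisite in sorted(self.skills[node].prerequisites):
                visit(prerequisite)
            order.append(node)

        visit(name)
        return order


graph = SkillGraph()
graph.add("dismiss_cookie_banner", "close the cookie consent banner before interacting")
graph.add("search_site", "use the site search box to find a product",
          prerequisites=["dismiss_cookie_banner"])
graph.add("search_site", "use the site search box to find a product")  # merged
print(graph.retrieve("find a product with search"))
print(graph.compose("search_site"))
```

In this sketch the graph stays at two nodes after three `add` calls, which is the "consolidation rather than unbounded accumulation" property in miniature.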
Simon Willison covers a Cloudflare feature enabling temporary accounts for AI agents, which allows agents to provision and use cloud resources ephemerally. The post highlights an emerging infrastructure pattern where AI agents are granted scoped, time-limited credentials rather than persistent access. This is relevant to the agent-tool ecosystem as it addresses identity and resource management for autonomous agents.
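The scoped, time-limited credential pattern can be sketched generically as follows. This is an illustration under stated assumptions and does not represent Cloudflare's API: `EphemeralCredential`, `issue_credential`, and the scope strings are all hypothetical names chosen for the example.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EphemeralCredential:
    """Hypothetical credential: a random token, allowed scopes, and an expiry time."""
    token: str
    scopes: frozenset
    expires_at: float  # Unix timestamp after which the credential is invalid

    def allows(self, scope: str, now: Optional[float] = None) -> bool:
        # Valid only while unexpired AND only for explicitly granted scopes.
        now = time.time() if now is None else now
        return now < self.expires_at and scope in self.scopes


def issue_credential(scopes, ttl_seconds: float,
                     now: Optional[float] = None) -> EphemeralCredential:
    """Issue a credential that expires ttl_seconds after `now`."""
    now = time.time() if now is None else now
    return EphemeralCredential(
        token=secrets.token_urlsafe(32),
        scopes=frozenset(scopes),
        expires_at=now + ttl_seconds,
    )


# An agent gets write access to storage for 60 seconds and nothing else.
cred = issue_credential({"storage:write"}, ttl_seconds=60, now=1000.0)
print(cred.allows("storage:write", now=1030.0))  # within TTL, granted scope
print(cred.allows("dns:edit", now=1030.0))       # scope was never granted
print(cred.allows("storage:write", now=1060.0))  # TTL has elapsed
```

Passing `now` explicitly keeps the expiry check deterministic for testing; a real issuer would also need revocation and server-side validation, which this sketch omits.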
Hugging Face has released ScreenSuite, described as the most comprehensive evaluation suite for GUI (Graphical User Interface) agents. The suite aims to standardize and broaden benchmarking for agents that interact with visual interfaces. This addresses a gap in the evaluation ecosystem for screen-based AI agents, which are increasingly relevant as agentic systems expand into desktop and web automation tasks.
Simon Willison documents an experiment using DSPy to systematically evaluate and improve the SQL system prompts used by Datasette Agent. The post covers applying DSPy's prompt optimization framework to a real-world agentic tool, demonstrating a practical workflow for automated prompt engineering. This is a hands-on practitioner account of using DSPy for prompt evaluation in a production-adjacent context.
browser-use/video-use is a Python library that lets AI coding agents edit videos programmatically. It has accumulated over 10,000 GitHub stars and is still gaining quickly (+216 in the latest daily count). The project extends the browser-use agent paradigm to video editing workflows, and the high star count signals significant community interest in agent-driven media manipulation tooling.
Simon Willison documents a workflow for configuring custom pricing for models within AgentsView, a tool for tracking AI agent costs. The post addresses a practical need for practitioners who use models not yet priced in the tool's default database. It is a short how-to from a tier-2 commentary source, and little of the post body was available to summarize.