4 · Simon Willison's Weblog · 7d ago

Simon Willison adds document context to OpenAI WebRTC Audio Session tool

Simon Willison documents an update to his OpenAI WebRTC Audio Session tool that adds document context, so an audio session can draw on the content of a supplied document. The post is a hands-on walkthrough of integrating OpenAI's real-time audio API with document-grounded context, relevant to practitioners building voice-enabled AI applications.

Agent and Tool Ecosystem Simon Willison OpenAI OpenAI WebRTC Audio Session
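A minimal sketch of one way to supply document context to a Realtime session, assuming the document text is folded into the session instructions and sent as a `session.update` event over the WebRTC data channel (OpenAI's examples name that channel `oai-events`). The function name and the `<document>` delimiters are hypothetical; the tool in the post may use a different mechanism, such as a `conversation.item.create` event, and newer API versions may require additional session fields.

```python
import json


def build_document_context_event(document_text: str, base_instructions: str) -> str:
    """Serialize a `session.update` event whose instructions embed a document.

    Assumption: document context is delivered via the session instructions.
    """
    # Delimit the document so the model can tell it apart from the base instructions.
    instructions = (
        f"{base_instructions}\n\n"
        "Use the following document as context when answering:\n"
        f"<document>\n{document_text}\n</document>"
    )
    event = {
        "type": "session.update",
        "session": {"instructions": instructions},
    }
    # Realtime events travel as JSON text on the data channel.
    return json.dumps(event)


message = build_document_context_event(
    document_text="Quarterly revenue rose 12% year over year.",  # hypothetical document
    base_instructions="You are a concise voice assistant.",
)
```

In a browser client, the resulting string would be passed to the data channel's `send()` method once the channel is open.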

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Simon Willison

Simon Willison: Developer, Toolmaker, and AI's Most Useful Commentator

Related events (8)

6 · OpenAI Blog · 1mo ago

How OpenAI Delivers Low-Latency Voice AI at Scale

OpenAI published a technical overview of how it rebuilt its WebRTC stack to support real-time voice AI at global scale. The post covers the infrastructure choices behind low-latency audio delivery and conversational turn-taking, offering a production-level engineering account of the systems underpinning OpenAI's voice products.

Inference Economics Enterprise Deployment Patterns WebRTC OpenAI Voice AI OpenAI +1 more

5 · The Batch · 17d ago

DeepLearning.AI launches Context Hub (chub), a crowdsourced API documentation tool for coding agents

Andrew Ng and collaborators released Context Hub (chub), an open context management system that gives coding agents up-to-date API documentation. It targets a common failure mode: agents calling outdated or hallucinated APIs because of training-data cutoffs. The tool installs via npm and exposes a CLI that agents can invoke to fetch current documentation for LLM providers, databases, payment processors, and other services. A planned feature would let agents share discovered workarounds and documentation fixes with a community, so the documentation improves collectively over time.

Enterprise Deployment Patterns Agent and Tool Ecosystem DeepLearning.AI Claude Opus 4.6 Context Hub +4 more

7 · OpenAI Blog · 1mo ago

Introducing the Realtime API

OpenAI has launched the Realtime API, which lets developers build low-latency speech-to-speech experiences into their applications. The API handles audio input and output natively, without separate transcription and text-to-speech steps, extending OpenAI's API offering beyond text-based paradigms to voice-enabled applications.

Inference Economics Enterprise Deployment Patterns GPT-4o Realtime API OpenAI +2 more

3 · Simon Willison's Weblog · 11d ago

Simon Willison comments on Siri AI announcements at WWDC 2026

Simon Willison published commentary on Apple's Siri AI announcements at WWDC 2026. The post body was empty in the source feed, so its specific claims cannot be assessed; given the source and timing, it likely covers Apple Intelligence or Siri capability updates shown at the conference.

Frontier Model Releases WWDC 2026 Simon Willison Siri +1 more

4 · Hugging Face Blog · 1mo ago

Accelerating Document AI

This Hugging Face blog post surveys the state of Document AI: the models and libraries available for processing and understanding documents with machine learning. It likely discusses transformer-based approaches to tasks such as document classification, information extraction, and visual document understanding.

Enterprise Deployment Patterns Agent and Tool Ecosystem Hugging Face LayoutLM Document AI

6 · The Batch · 1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases Evaluation and Benchmarking Scale AI Audio MultiChallenge GPT-Realtime-2 Google +14 more
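The reasoning-latency tradeoff can be illustrated with a hypothetical helper that picks an effort level for a given latency budget. It uses only the two endpoints reported above (1.12 s at minimal effort, 2.33 s at high effort); the three intermediate levels are left out because their latencies are not given here, and the level names are not claimed to match the API's actual parameter values.

```python
from typing import Optional

# End-to-end latencies reported for GPT-Realtime-2, in seconds (from the summary above).
REPORTED_LATENCY_S = {"minimal": 1.12, "high": 2.33}


def pick_effort(budget_s: float) -> Optional[str]:
    """Return the reported effort level with the most reasoning that fits the budget."""
    fitting = [level for level, latency in REPORTED_LATENCY_S.items() if latency <= budget_s]
    if not fitting:
        return None  # Even minimal effort exceeds the budget.
    # Higher latency corresponds to higher effort, so take the slowest level that fits.
    return max(fitting, key=REPORTED_LATENCY_S.__getitem__)


for budget in (1.0, 1.5, 3.0):
    print(budget, pick_effort(budget))
```

On these figures, high effort costs about 1.21 s more than minimal effort, roughly 2.08 times the latency.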

7 · OpenAI Blog · 1mo ago

Introducing gpt-realtime and Realtime API updates

OpenAI is releasing gpt-realtime, a new speech-to-speech model, alongside expanded Realtime API capabilities: MCP server support, image input, and SIP phone calling. These updates extend the Realtime API's usefulness for voice-driven and multimodal agent applications.

Frontier Model Releases Inference Economics GPT-Realtime-2 SIP Realtime API +4 more
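A hedged sketch of attaching a remote MCP server to a Realtime session through a `session.update` event. The tool fields (`type: "mcp"`, `server_label`, `server_url`, `require_approval`) follow the remote-MCP tool shape OpenAI documents for its APIs, but they are assumptions here and should be checked against the current Realtime API reference; the server label and URL are hypothetical placeholders.

```python
import json


def build_mcp_session_update(server_label: str, server_url: str, require_approval: str = "never") -> dict:
    """Build a `session.update` event that registers one remote MCP server as a tool.

    Field names are assumptions modeled on OpenAI's remote-MCP tool shape.
    """
    return {
        "type": "session.update",
        "session": {
            "type": "realtime",  # assumed to be required by newer Realtime API versions
            "tools": [
                {
                    "type": "mcp",
                    "server_label": server_label,
                    "server_url": server_url,
                    "require_approval": require_approval,
                }
            ],
        },
    }


event = build_mcp_session_update("docs", "https://example.com/mcp")  # hypothetical server
payload = json.dumps(event)  # JSON text to send on the session's event channel
```

Setting `require_approval` to `"never"` trades safety for latency; a voice agent that calls sensitive tools might prefer an approval step.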

7 · OpenAI Blog · 1mo ago

OpenAI Introduces Next-Generation Audio Models in the API

OpenAI is releasing new audio models via its API, including an updated text-to-speech model that accepts natural-language style instructions (e.g., 'talk like a sympathetic customer service agent'). This marks the first time developers can programmatically control speaking style through prompts rather than fixed voice presets. The release targets voice agent developers seeking finer-grained customization of synthesized speech.

Enterprise Deployment Patterns Agent and Tool Ecosystem OpenAI TTS API instructable text-to-speech OpenAI API +2 more
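A minimal sketch of a speech request that sets speaking style through an `instructions` field, built as a JSON body for a POST to `https://api.openai.com/v1/audio/speech`. The model name `gpt-4o-mini-tts` and the voice `coral` are assumptions (the summary above names neither), and the code only builds the request; it does not send it.

```python
import json
from typing import Dict, Tuple

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(
    text: str,
    style: str,
    api_key: str,
    voice: str = "coral",  # assumed voice name
    model: str = "gpt-4o-mini-tts",  # assumed instructable TTS model
) -> Tuple[str, Dict[str, str], bytes]:
    """Return (url, headers, body) for a text-to-speech request with a style prompt."""
    body = {
        "model": model,
        "voice": voice,
        "input": text,
        # Natural-language speaking style instead of a fixed voice preset.
        "instructions": style,
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return SPEECH_URL, headers, json.dumps(body).encode("utf-8")


url, headers, body = build_speech_request(
    text="Your refund has been processed.",
    style="Talk like a sympathetic customer service agent.",
    api_key="sk-placeholder",  # hypothetical; never hard-code real keys
)
```

With the official `openai` Python package, the same fields map onto `client.audio.speech.create(model=..., voice=..., input=..., instructions=...)`.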