7 · OpenAI Blog · 1mo ago

Introducing gpt-realtime and Realtime API updates

OpenAI is releasing a new speech-to-speech model called gpt-realtime alongside expanded Realtime API capabilities. New features include MCP server support, image input, and phone calling over SIP (Session Initiation Protocol). These updates extend the Realtime API's utility for voice-driven and multimodal agent applications.

Frontier Model Releases · Inference Economics · Agent and Tool Ecosystem · Multimodal Progress · GPT-Realtime-2 · SIP · Realtime API · OpenAI · Model Context Protocol

Related guides (3)

OpenAI

OpenAI: The Lab That Made AI a Household Word

Model Context Protocol · Concept

Model Context Protocol (MCP): The Universal Plug for AI Agents

Frontier Model Releases · Topic guide

Frontier Model Releases: The Race From Language to Action

Related events (8)

7 · OpenAI Blog · 1mo ago

Introducing the Realtime API

OpenAI has launched the Realtime API, enabling developers to build low-latency speech-to-speech experiences directly into their applications. The API provides native audio input and output without requiring separate transcription and text-to-speech steps. This represents a significant infrastructure offering for voice-enabled AI applications, moving beyond text-based API paradigms.

Inference Economics · Enterprise Deployment Patterns · GPT-4o · Realtime API · OpenAI · +2 more

7 · Latent Space · 1mo ago

GPT-Realtime-2, GPT-Translate, and new Whisper: OpenAI's new SOTA realtime voice APIs

OpenAI has released a suite of new real-time voice and audio APIs including GPT-Realtime-2, a GPT-Translate model, and an updated Whisper, all positioned as state-of-the-art for real-time voice applications. The releases appear to be part of a broader push to deploy GPT-5 capabilities across multiple product surfaces. Coverage comes from the Latent Space AI News digest, which aggregates and contextualizes the announcements.

Frontier Model Releases · Agent and Tool Ecosystem · GPT-Realtime-2 · OpenAI · Whisper · +3 more

6 · The Batch · 1mo ago

OpenAI Updates Audio Models That Reason, Transcribe, and Translate

OpenAI introduced three new audio models in its Realtime API: GPT-Realtime-2 (speech-to-speech with five configurable reasoning effort levels), GPT-Realtime-Translate (70+ input languages), and GPT-Realtime-Whisper (transcription). GPT-Realtime-2 operates as an end-to-end audio model including reasoning, with latency ranging from 1.12 seconds at minimal effort to 2.33 seconds at high effort. Benchmark results are mixed: it leads Scale AI's Audio MultiChallenge and Artificial Analysis Conversational Dynamics but trails Step-Audio R1.1 Realtime and Grok Voice Think Fast 1.0 on speech reasoning and agentic tasks. The configurable reasoning-latency tradeoff is positioned as a key differentiator for voice agent applications.

Frontier Model Releases · Evaluation and Benchmarking · Scale AI Audio MultiChallenge · GPT-Realtime-2 · Google · +14 more

7 · OpenAI Blog · 1mo ago

Advancing voice intelligence with new models in the API

OpenAI is releasing new realtime voice models via its API with capabilities spanning reasoning, translation, and transcription. The announcement targets developers building voice-enabled applications and represents an expansion of OpenAI's voice intelligence offerings beyond the existing Realtime API. The models are positioned to enable more natural and intelligent voice experiences in production deployments.

Frontier Model Releases · Enterprise Deployment Patterns · OpenAI voice models · OpenAI Realtime API · OpenAI · +1 more

7 · OpenAI Blog · 1mo ago

Introducing ChatGPT and Whisper APIs

OpenAI announced the release of dedicated APIs for ChatGPT (gpt-3.5-turbo) and Whisper, enabling developers to integrate conversational AI and speech-to-text capabilities into their applications. The ChatGPT API offered significant cost reductions compared to existing GPT-3.5 endpoints. This marked a major step in OpenAI's platform strategy, opening programmatic access to its most widely used consumer models.

Inference Economics · Enterprise Deployment Patterns · GPT-3.5 Turbo · ChatGPT · OpenAI · +2 more

8 · OpenAI Blog · 1mo ago

ChatGPT can now see, hear, and speak

OpenAI announced multimodal capabilities for ChatGPT, enabling the model to process images (vision), listen to voice input, and respond with synthesized speech. These features expand ChatGPT beyond text-only interaction into a multimodal assistant experience. The rollout was announced for Plus and Enterprise users first, with broader availability to follow.

Frontier Model Releases · Enterprise Deployment Patterns · ChatGPT · GPT-4V · ChatGPT Plus · +3 more

9 · OpenAI Blog · 1mo ago

OpenAI Spring Update: GPT-4o Announced, Expanded Free ChatGPT Capabilities

OpenAI announced GPT-4o, a new flagship model, alongside an expansion of capabilities available to free-tier ChatGPT users. GPT-4o represents a new omnimodal architecture capable of handling text, audio, and vision in a unified model. The announcement was made via a live demo event and marks a significant shift in OpenAI's product and model strategy.

Frontier Model Releases · Inference Economics · ChatGPT · GPT-4o · OpenAI · +2 more

7 · OpenAI Blog · 1mo ago

OpenAI Introduces Next-Generation Audio Models in the API

OpenAI is releasing new audio models via its API, including an updated text-to-speech model that accepts natural-language style instructions (e.g., 'talk like a sympathetic customer service agent'). This marks the first time developers can programmatically control speaking style through prompts rather than fixed voice presets. The release targets voice agent developers seeking finer-grained customization of synthesized speech.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · OpenAI TTS API · instructable text-to-speech · OpenAI API · +2 more