What this area covers
Multimodal AI is the project of building systems that perceive and generate across more than one sensory channel — text, images, audio, video, and increasingly physical action. The field spans vision-language models (VLMs) that read images and answer questions about them, speech systems that transcribe and understand audio, video generators that synthesize moving images, and the emerging class of computer-use and robotics agents that close the loop between perception and action.
Why it matters
Language-only models are bounded by what can be expressed in text. Multimodal capability removes that ceiling: a model that can see a screenshot, hear a voice command, watch a video, or control a GUI can participate in workflows that were previously inaccessible to AI. The practical stakes range from developer tooling (vision-assisted debugging, UI automation) to creative production (video generation, image synthesis) to physical-world deployment (robotics, embodied agents).
Phase 1 — Contrastive pretraining and the VLM foundation (2021–2023)
The modern multimodal era traces to CLIP (January 2021), which demonstrated that a neural network could learn visual concepts from natural language supervision alone — zero-shot visual classification without task-specific training data, mirroring the zero-shot transfer GPT-2 and GPT-3 had shown in language. This contrastive text-image pretraining paradigm became the substrate for most subsequent vision-language work.
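The contrastive objective and the zero-shot classification it enables can be sketched in a few lines. This is a minimal illustration, not CLIP's actual code: the function names, the fixed temperature of 0.07, and the toy 2-D vectors standing in for encoder outputs are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every image and every text."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: match images to texts and texts to images."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def zero_shot_classify(image_emb, label_embs):
    """Return the index of the label text most similar to the image."""
    logits = similarity_matrix([image_emb], label_embs)[0]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical embeddings: image k is paired with text k.
images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]
texts_shuffled = [[0.1, 0.9], [0.9, 0.1]]
```

With these toy values, `contrastive_loss(images, texts_aligned)` is lower than `contrastive_loss(images, texts_shuffled)`, which is the signal training pushes on. Zero-shot classification then needs no extra training: embed one text prompt per candidate label and pick the closest one with `zero_shot_classify`.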
OpenAI's Whisper (September 2022) did the equivalent for speech: a single model trained on 680,000 hours of multilingual web audio achieved near-human English transcription accuracy and multilingual translation, released as open weights. These two models — CLIP and Whisper — established the baselines that later systems would be measured against.
GPT-4 (March 2023) brought image understanding into a flagship production model, accepting image and text inputs and producing text outputs. By November 2023, GPT-4 Turbo with Vision extended this to the developer API alongside DALL·E 3 image generation. ChatGPT gained voice input and speech output for Plus and Enterprise users in September 2023, making multimodal interaction consumer-facing for the first time at scale.
Phase 2 — Native unification and the omnimodal bet (2024)
The pivotal architectural shift arrived in May 2024 with GPT-4o (Omni). Rather than routing audio, vision, and text through separate pipeline stages, GPT-4o processed all three natively in a single model — the first major production system to do so. OpenAI positioned it as their primary production model going forward, signaling that the pipeline era was ending.
Two months earlier, Anthropic's Claude 3 family (March 2024) had introduced multimodal vision capabilities alongside near-perfect long-context recall. Meta's Llama 3.1 (July 2024) then extended the open-weights frontier with multilingual support and longer context, but vision came only with Llama 3.2 (September 2024), which added image understanding and lightweight edge variants and marked Meta's first open-weights multimodal Llama release.
Video generation matured in parallel. OpenAI's Sora research preview (February 2024) introduced a transformer architecture operating on spacetime patches of video and image latent codes, trained jointly on videos and images of variable durations and resolutions. OpenAI explicitly framed scaling video generation as a path toward general-purpose physical world simulators — a claim that reframed the stakes of the field. Sora publicly launched at sora.com in December 2024 with up to 1080p, 20-second generation from text prompts or existing assets.
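The idea of spacetime patches can be illustrated with a small sketch: a latent of shape (frames, height, width) is cut into non-overlapping blocks, and each block becomes one token for the transformer. The helper names, patch sizes, and nested-list representation below are assumptions for illustration; the source does not specify Sora's actual patch dimensions or implementation.

```python
def make_latent(frames, height, width):
    """Hypothetical latent: each cell just records its (t, y, x) position."""
    return [[[(t, y, x) for x in range(width)]
             for y in range(height)]
            for t in range(frames)]

def spacetime_patches(video, pt, ph, pw):
    """Split video[t][y][x] into non-overlapping (pt, ph, pw) blocks.

    Each block is flattened into one list, i.e. one token in the sequence.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches

# A 4-frame clip and a single image (treated as a 1-frame video).
clip_tokens = spacetime_patches(make_latent(4, 4, 4), 2, 2, 2)
image_tokens = spacetime_patches(make_latent(1, 4, 6), 1, 2, 2)
print(len(clip_tokens), len(image_tokens))  # prints "8 6"
```

Because inputs of different durations and resolutions simply produce token sequences of different lengths, the same model can in principle train on videos and images together, which is the property the paragraph above describes.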
Shortly afterward, Alibaba's Qwen2.5-VL (January 2025) released open-weights vision-language models in three sizes, the largest at 72B parameters, while Qwen2 had already established strong multilingual and long-context text foundations.
Phase 3 — Computer use, audio parity, and the action gap (2025)
The most consequential multimodal development going into 2025 was not a new modality but a new output type: computer use. Anthropic launched a public beta of computer use for Claude 3.5 Sonnet in October 2024, enabling the model to control a computer by interpreting screenshots and issuing cursor movements, clicks, and keystrokes at pixel coordinates. Its 14.9% score on OSWorld was roughly double that of the next-best AI model at the time, though well below human-level performance of 70–75%. Anthropic identified prompt injection as the primary near-term risk.
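The screenshot-in, action-out loop behind computer use can be sketched as follows. Everything here is hypothetical: `FakeDesktop`, `Action`, `scripted_policy`, and `run_agent` are illustrative names, the "screenshot" is a small dict rather than pixels, and a hard-coded policy stands in for the model. It is not Anthropic's API, only the shape of the control loop.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

class FakeDesktop:
    """Toy stand-in for a screen: one text field at a fixed pixel position."""
    FIELD = (100, 50)

    def __init__(self):
        self.focused = False
        self.field_text = ""

    def screenshot(self):
        # A real system would return pixels; a dict keeps the sketch runnable.
        return {"focused": self.focused, "field_text": self.field_text}

    def execute(self, action):
        if action.kind == "click":
            self.focused = (action.x, action.y) == self.FIELD
        elif action.kind == "type" and self.focused:
            self.field_text += action.text

def scripted_policy(goal_text):
    """Stand-in for the model: maps an observation to the next action."""
    def policy(observation):
        if observation["field_text"] == goal_text:
            return Action("done")
        if not observation["focused"]:
            return Action("click", x=100, y=50)
        return Action("type", text=goal_text)
    return policy

def run_agent(env, policy, max_steps=10):
    """Observe, decide, act, and repeat until done or out of steps."""
    history = []
    for _ in range(max_steps):
        action = policy(env.screenshot())
        history.append(action)
        if action.kind == "done":
            break
        env.execute(action)
    return history

env = FakeDesktop()
steps = run_agent(env, scripted_policy("hello"))
print([a.kind for a in steps])  # prints "['click', 'type', 'done']"
```

The `max_steps` cap is a deliberate design choice in this sketch: because every observation is untrusted screen content, a loop that acts on it should have a hard bound, and a production system would add further checks before executing each action.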
Google DeepMind followed with a Gemini 2.5 Computer Use model preview in October 2025, entering the same space. The competitive dynamic accelerated rapidly: by February 2026, Anthropic had acquired Vercept — a team specializing in AI perception for computer use, co-founded by researchers including Ross Girshick — and Claude Sonnet 4.6 reached 72.5% on OSWorld, approaching human-level performance on tasks like navigating spreadsheets and completing web forms.
On audio, Mistral AI released Voxtral (July 2025), a family of open-weight speech understanding models (24B and 3B) that Mistral reported as outperforming Whisper large-v3 across the tasks it evaluated and as competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding. Voxtral supported long-form audio of up to 30–40 minutes, native multilingual transcription, and function calling directly from voice, extending the open-weights audio baseline well beyond Whisper's transcription-focused design.
Video generation continued to advance: Sora 2 (September 2025) added synchronized dialogue and sound effects, improved physics simulation, and enhanced steerability. Google DeepMind announced Veo 3 and Imagen 4 (May 2025), targeting professional media production, and Genie 3 (August 2025) pushed into interactive world modeling, generating navigable 3D environments in real time at 24 fps and 720p and maintaining consistency for several minutes.
Phase 4 — Embodied AI and unified open-weights models (late 2025–2026)
DeepMind extended its multimodal stack into physical-world robotics with Gemini Robotics (March 2025), Gemini Robotics On-Device (June 2025, targeting edge deployment without cloud inference), and Gemini Robotics 1.5 (September 2025), which added perception, planning, reasoning, tool use, and multi-step task execution for embodied agents. This trajectory, from screen pixels to robot actuators, represents the most ambitious extension of the multimodal perception stack.
On the open-weights side, Mistral Small 4 (March 2026) unified capabilities previously split across separate specialist models — reasoning (Magistral), multimodal (Pixtral), and coding (Devstral) — into a single 119B-parameter MoE model with native text and image input, released under Apache 2.0. This consolidation pattern, where a single open model replaces a portfolio of specialists, is a meaningful signal about where the open ecosystem is heading.
Meta's Muse Spark (April 2026), the first model from its Superintelligence Labs, introduced natively multimodal reasoning with visual chain-of-thought and multi-agent orchestration — though as a closed-weights product, marking a strategic departure from Meta's open Llama line. Google DeepMind announced Gemini Omni (May 2026), extending the omnimodal naming convention further. Claude Opus 4.7 (May 2026) shipped improved vision with higher image resolution alongside cybersecurity safeguards.
Where it's heading
These developments point to three active frontiers. First, computer use as a convergence point: the rapid OSWorld improvement (14.9% → 72.5% in about 16 months, from October 2024 to February 2026) suggests GUI control is becoming a standard capability rather than a research demo, with acquisition activity (Vercept) and dedicated model variants (Gemini 2.5 Computer Use) signaling sustained investment. Second, video + audio unification: Sora 2's synchronized audio and Veo 3's professional media targeting suggest the next generation of video models will be expected to handle sound natively, not as an add-on. Third, embodied physical action: DeepMind's robotics line is the clearest bet that multimodal perception will eventually need to drive reliable physical-world behavior, the hardest version of the action gap problem.




