What multimodal AI is
For most of the history of language models, a model did one thing: it read text and wrote text. "Multimodal" means breaking that constraint by giving AI the ability to see images, listen to speech, watch video, and increasingly to act on a computer or through a physical robot. The story of the last five years is how that capability went from a research curiosity to the default expectation for any serious AI product.
Why it matters
Think about how humans actually communicate: we point at things, describe what we see, play audio clips, share screenshots. A text-only AI misses most of that. Multimodal AI can read a chart you paste in, transcribe a meeting recording, help you navigate a website, or understand a video, none of which a text-only tool could do directly. As these capabilities mature, they also become the foundation for AI agents that can do real work on your computer without you having to describe every step in words.
How it evolved: four waves
Wave 1 — Connecting images and language (2021–2023)
The first big unlock was CLIP, introduced by OpenAI in January 2021. CLIP was trained on image-caption pairs to match images with natural language descriptions, so it could recognize new kinds of images from text labels alone, with no extra training for each new category. That idea seeded nearly everything that followed. GPT-4, in March 2023, brought image understanding into a mainstream large language model for the first time, accepting pictures alongside text; OpenAI reported human-level performance on a range of professional and academic benchmarks.
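The matching idea can be sketched in a few lines. This is a toy illustration, not CLIP itself: the three-number "embeddings" below are hand-made stand-ins for the vectors a trained image encoder and text encoder would produce, and the function names are made up for this example. The point is only that classification reduces to picking the caption whose vector points in the most similar direction to the image's vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, caption_embeddings):
    """Score every caption against the image and return the best match."""
    similarities = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in caption_embeddings.items()
    }
    best = max(similarities, key=similarities.get)
    return best, similarities


# Hand-made toy vectors (assumption): a real model would output
# hundreds of dimensions from its image and text encoders.
IMAGE_EMBEDDING = [0.9, 0.1, 0.2]
CAPTION_EMBEDDINGS = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a bar chart": [0.1, 0.2, 0.9],
}

best_caption, similarities = zero_shot_classify(IMAGE_EMBEDDING, CAPTION_EMBEDDINGS)
print(best_caption)
```

Adding a new category here means adding one more caption to the dictionary, which is the practical meaning of "no extra training for each new category."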
Wave 2 — Going native: one model for everything (2024)
The landmark moment was GPT-4o ("Omni"), announced in May 2024. Unlike earlier systems that routed your image to one model and your text to another, GPT-4o processed text, audio, and vision together in real time, with no separate pipeline stages. OpenAI described it as a step toward "natively multimodal AI." The same year, Anthropic's Claude 3 family (March) shipped with vision capabilities, and Meta's Llama 3.2 (September) brought open-weights multimodal models to anyone who wanted to run them locally. Alibaba's Qwen2.5-VL followed in early 2025 with strong vision-language performance in three sizes.
Also in 2024, OpenAI introduced Sora — a video generation model that could produce up to one minute of high-fidelity video from a text description, framing video generation as a path toward AI that understands the physical world.
Wave 3 — Hearing and acting (2025)
Speech got its own moment: OpenAI's Whisper (released 2022, but widely adopted through this period) set the open-source baseline for robust multilingual transcription. Mistral's Voxtral (mid-2025) pushed further, offering open-weight speech models that could answer questions and summarize directly from audio — competitive with GPT-4o mini on audio benchmarks.
The bigger story heading into 2025 was computer use: AI that can look at your screen and actually operate software. Anthropic launched a public beta in October 2024, with Claude 3.5 Sonnet scoring 14.9% on the OSWorld benchmark. That was roughly double the next-best AI system at the time (7.8%), though well below the human score of 70–75%. OpenAI followed in January 2025 with its Computer-Using Agent (CUA), which combines GPT-4o's vision with reinforcement learning to navigate browsers and desktop apps.
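In rough outline, a computer-use agent runs a loop: look at the screen, decide on one action, perform it, and look again. The sketch below shows only that loop shape. Everything in it is a hypothetical stand-in: `FakeDesktop` replaces a real screen, `scripted_policy` replaces the model, and none of the names correspond to Anthropic's or OpenAI's actual APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # which field to click
    text: str = ""     # what to type


@dataclass
class FakeDesktop:
    """Toy stand-in for a screen: one focused field plus form contents."""
    focused: str = ""
    fields: dict = field(default_factory=dict)

    def screenshot(self):
        # A real agent would capture pixels; a dict keeps the sketch runnable.
        return {"focused": self.focused, "fields": dict(self.fields)}

    def apply(self, action):
        if action.kind == "click":
            self.focused = action.target
        elif action.kind == "type":
            self.fields[self.focused] = action.text


def scripted_policy(observation, goal):
    """Hypothetical stand-in for the model: choose the next single action."""
    if observation["fields"].get(goal["field"]) == goal["value"]:
        return Action("done")
    if observation["focused"] != goal["field"]:
        return Action("click", target=goal["field"])
    return Action("type", text=goal["value"])


def run_agent(desktop, goal, max_steps=10):
    """Observe, decide, act, and repeat until the policy says it is done."""
    history = []
    for _ in range(max_steps):
        observation = desktop.screenshot()
        action = scripted_policy(observation, goal)
        history.append(action.kind)
        if action.kind == "done":
            break
        desktop.apply(action)
    return history


desktop = FakeDesktop()
steps = run_agent(desktop, {"field": "email", "value": "ada@example.com"})
print(steps)
```

The step cap (`max_steps`) is a design choice worth noting: because the agent re-checks the screen after every action, a loop that never reaches its goal needs an explicit limit to stop.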
Video generation matured too: Sora 2 (September 2025) added synchronized dialogue and sound effects alongside improved physics. Google DeepMind released Veo 3 and Imagen 4 for professional video and image creation, plus a filmmaking tool called Flow.
Wave 4 — Into the physical world (2025–2026)
The most recent frontier is embodied AI — models that don't just see the world but act in it. Google DeepMind released Gemini Robotics and Gemini Robotics-ER for robotic systems that perceive and reason about physical environments, followed by Gemini Robotics On-Device for robots that run AI locally without a cloud connection. Genie 3 pushed further still: a world model that generates interactive, navigable 3D environments in real time at 24 fps and 720p — not a video to watch, but a space to move through.
Computer use also took a leap: Anthropic acquired Vercept (a team specializing in AI perception for computer interaction) and released Claude Sonnet 4.6, which scored 72.5% on OSWorld — up from under 15% just eighteen months earlier, and approaching human-level performance on tasks like navigating spreadsheets and filling out web forms.
The open-weights side of the story
Multimodal progress has not been locked behind paywalls. Meta's Llama 3.2 was the first open-weights Llama with image understanding. Mistral Small 4 unified vision, reasoning, and coding in a single 119B-parameter open model under Apache 2.0. Mistral's Voxtral brought open speech understanding. Alibaba's Qwen2.5-VL offered strong vision-language performance in sizes from 3B to 72B, all publicly available. This means developers and researchers can build on, fine-tune, and study these capabilities without depending on any single company's API.
Where it's heading
The trajectory points toward AI that interacts with the world the way people do — through sight, sound, and action, not just typed words. Computer use is the near-term proving ground: the jump from 15% to 72% on OSWorld in under two years suggests this will be a practical capability, not just a demo, within the current model generation. Embodied robotics is the longer arc: models that can perceive, plan, and act in physical space, running locally on the robot itself. The open question is whether unified single-model architectures will continue to win, or whether specialized models for each modality will hold advantages in cost and reliability for specific tasks.




