Almanac
Topic guide · Beginner

Multimodal Progress: How AI Learned to See, Hear, and Act

TL;DR: AI models began as text-only systems, but over the past few years they have gained the ability to understand images, listen to speech, generate video, and even control computers, all within a single model. The field has moved from bolting separate tools together toward unified architectures where seeing, hearing, and doing are woven into one system, and the frontier is now physical: robots that perceive and act in the real world.

Key takeaways

  • CLIP (2021) was the foundational breakthrough that taught AI to connect images and natural language without task-specific training.
  • GPT-4o (2024) was the first major model to handle text, audio, and vision natively in real time — without separate pipeline stages.
  • Sora (2024) and Sora 2 (2025) demonstrated that video generation could reach up to one minute of high-fidelity output, with Sora 2 adding synchronized audio.
  • Computer use (AI controlling a real desktop) jumped from 14.9% on the OSWorld benchmark in late 2024 to 72.5% by early 2026, approaching human-level performance.
  • Open-weights releases (Meta's Llama 3.2, Mistral Small 4 and Voxtral, Alibaba's Qwen2.5-VL) brought vision and speech capabilities to models anyone can run or modify.
  • Google DeepMind extended multimodal AI into physical robots with Gemini Robotics, and into interactive 3D world generation with Genie 3.

What multimodal AI is

For most of AI's history, a model did one thing: it read text and wrote text. "Multimodal" means breaking that constraint — giving AI the ability to see images, listen to speech, watch video, and increasingly to act on a computer or in a physical robot. The story of the last five years is how that capability went from a research curiosity to the default expectation for any serious AI product.

Why it matters

Think about how humans actually communicate: we point at things, describe what we see, play audio clips, share screenshots. A text-only AI misses most of that. Multimodal AI can read a chart you paste in, transcribe a meeting recording, help you navigate a website, or understand a video — all things that were simply impossible for the previous generation of tools. As these capabilities mature, they are also the foundation for AI agents that can do real work on your computer without you having to describe every step in words.

How it evolved: four waves

Wave 1 — Connecting images and language (2021–2023)

The first big unlock was CLIP, introduced by OpenAI in January 2021. CLIP learned to match images with natural language descriptions — no special training needed for each new type of image. That idea seeded nearly everything that followed. GPT-4 in March 2023 brought image understanding into a mainstream large language model for the first time, accepting pictures alongside text and producing human-level results on professional benchmarks.
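The matching idea can be sketched in a few lines. This is a toy illustration, not CLIP itself: the three-number embeddings below are made up for the example (a real model produces vectors with hundreds of dimensions from separate image and text encoders), the function names are hypothetical, and the temperature value of 0.07 is an assumption chosen for illustration. The sketch scores one image vector against several caption vectors by cosine similarity and turns the scores into probabilities.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, caption_vecs, temperature=0.07):
    """Score one image embedding against every caption embedding.

    The temperature sharpens the distribution; 0.07 is an assumed value.
    """
    labels = list(caption_vecs)
    sims = [cosine(image_vec, caption_vecs[label]) / temperature for label in labels]
    return dict(zip(labels, softmax(sims)))

# Hypothetical toy embeddings standing in for real encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a chart of sales": [0.1, 0.2, 0.9],
}

scores = zero_shot_classify(image_embedding, caption_embeddings)
best = max(scores, key=scores.get)
print(best)  # a photo of a dog
```

Because the candidate labels are just text, swapping in a new set of captions requires no retraining, which is the sense in which no special training is needed for each new type of image.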

Wave 2 — Going native: one model for everything (2024)

The landmark moment was GPT-4o ("Omni"), announced in May 2024. Unlike earlier systems that routed your image to one model and your text to another, GPT-4o processed text, audio, and vision together in real time — no separate pipeline stages. OpenAI described it as a step toward "natively multimodal AI." Around the same time, Anthropic's Claude 3 family added vision capabilities, and Meta's Llama 3.2 brought open-weights multimodal models to anyone who wanted to run them locally. Alibaba's Qwen2.5-VL followed with strong vision-language performance in three sizes.

Also in 2024, OpenAI introduced Sora — a video generation model that could produce up to one minute of high-fidelity video from a text description, framing video generation as a path toward AI that understands the physical world.

Wave 3 — Hearing and acting (2025)

Speech got its own moment: OpenAI's Whisper (released 2022, but widely adopted through this period) set the open-source baseline for robust multilingual transcription. Mistral's Voxtral (mid-2025) pushed further, offering open-weight speech models that could answer questions and summarize directly from audio — competitive with GPT-4o mini on audio benchmarks.

The bigger story in 2025 was computer use: AI that can look at your screen and actually operate software. Anthropic had launched a public beta in October 2024, with Claude 3.5 Sonnet scoring 14.9% on the OSWorld benchmark, roughly double the next-best AI at the time though well below the human score of 70–75%. OpenAI followed with its Computer-Using Agent (CUA), combining GPT-4o's vision with reinforcement learning to navigate browsers and desktop apps.
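The look-then-act cycle behind computer use can be sketched as a simple loop. Everything here is hypothetical and simplified: FakeDesktop stands in for a real screen, scripted_policy stands in for the model, and the "screenshot" is a small dictionary rather than pixels. No vendor's actual API is shown.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str          # "type", "click", or "done"
    target: str = ""
    text: str = ""

@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a screen: one text field and a submit button."""
    field_value: str = ""
    submitted: bool = False
    log: list = field(default_factory=list)

    def screenshot(self):
        # A real agent would receive pixels; this toy returns a description.
        return {"field_value": self.field_value, "submitted": self.submitted}

    def apply(self, action):
        self.log.append(action.kind)
        if action.kind == "type":
            self.field_value = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True

def scripted_policy(observation, goal_text):
    """Placeholder for the model: picks the next action from what it sees."""
    if observation["submitted"]:
        return Action("done")
    if observation["field_value"] != goal_text:
        return Action("type", target="name_field", text=goal_text)
    return Action("click", target="submit")

def run_agent(desktop, goal_text, max_steps=10):
    """Observe, decide, act; stop when the policy reports it is done."""
    for step in range(1, max_steps + 1):
        observation = desktop.screenshot()
        action = scripted_policy(observation, goal_text)
        if action.kind == "done":
            return step
        desktop.apply(action)
    return None  # gave up before finishing

desktop = FakeDesktop()
steps = run_agent(desktop, "Ada Lovelace")
print(steps, desktop.log)  # 3 ['type', 'click']
```

The step cap matters in practice: an agent that misreads the screen can otherwise repeat the same action indefinitely, which is one reason benchmark scores count only completed tasks.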

Video generation matured too: Sora 2 (September 2025) added synchronized dialogue and sound effects alongside improved physics. Google DeepMind released Veo 3 and Imagen 4 for professional video and image creation, plus a filmmaking tool called Flow.

Wave 4 — Into the physical world (2025–2026)

The most recent frontier is embodied AI — models that don't just see the world but act in it. Google DeepMind released Gemini Robotics and Gemini Robotics-ER for robotic systems that perceive and reason about physical environments, followed by Gemini Robotics On-Device for robots that run AI locally without a cloud connection. Genie 3 pushed further still: a world model that generates interactive, navigable 3D environments in real time at 24 fps and 720p — not a video to watch, but a space to move through.
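To get a feel for what "real time at 24 fps and 720p" demands, a little arithmetic helps. The sketch below assumes 720p means a 1280 × 720 frame, which is the usual convention but is not stated in the guide.

```python
FPS = 24
WIDTH, HEIGHT = 1280, 720  # assumption: the conventional 720p frame size

frame_budget_ms = 1000 / FPS            # time available to produce each frame
pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS

print(f"{frame_budget_ms:.1f} ms per frame")       # 41.7 ms per frame
print(f"{pixels_per_frame:,} pixels per frame")    # 921,600 pixels per frame
print(f"{pixels_per_second:,} pixels per second")  # 22,118,400 pixels per second
```

Under that assumption, the model has roughly 42 milliseconds to respond to each movement with a new frame, which is what separates an interactive world from a pre-rendered clip.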

Computer use also took a leap: Anthropic acquired Vercept (a team specializing in AI perception for computer interaction) and released Claude Sonnet 4.6, which scored 72.5% on OSWorld — up from under 15% just eighteen months earlier, and approaching human-level performance on tasks like navigating spreadsheets and filling out web forms.

The open-weights side of the story

Multimodal progress has not been locked behind paywalls. Meta's Llama 3.2 was the first open-weights Llama with image understanding. Mistral Small 4 unified vision, reasoning, and coding in a single 119B-parameter open model under Apache 2.0. Mistral's Voxtral brought open speech understanding. Alibaba's Qwen2.5-VL offered strong vision-language performance in sizes from 3B to 72B, all publicly available. This means developers and researchers can build on, fine-tune, and study these capabilities without depending on any single company's API.

Where it's heading

The trajectory points toward AI that interacts with the world the way people do — through sight, sound, and action, not just typed words. Computer use is the near-term proving ground: the jump from 15% to 72% on OSWorld in under two years suggests this will be a practical capability, not just a demo, within the current model generation. Embodied robotics is the longer arc: models that can perceive, plan, and act in physical space, running locally on the robot itself. The open question is whether unified single-model architectures will continue to win, or whether specialized models for each modality will hold advantages in cost and reliability for specific tasks.

[Figure: The four waves of multimodal AI progress]

Multimodal milestones by capability area

Capability | Key model(s) | What it can do | Status
--- | --- | --- | ---
Vision + language | GPT-4 (2023), Qwen2.5-VL, Llama 3.2 | Understand and reason about images alongside text | Mainstream
Native omnimodal (text/audio/vision) | GPT-4o | Process all three modalities in real time, no separate pipelines | Mainstream
Image generation (native) | GPT-4o Image Generation | Generate images directly inside a language model | Mainstream
Video generation | Sora, Sora 2, Veo 3 | Generate up to ~1 min of video; Sora 2 adds synchronized audio | Active
Speech understanding | Whisper, Voxtral | Transcription, translation, Q&A from audio; open-weights available | Mainstream
Computer use | Claude 3.5 Sonnet → Sonnet 4.6 | Control a real desktop via screenshots; 14.9% → 72.5% OSWorld | Rapidly maturing
Embodied robotics | Gemini Robotics, Gemini Robotics On-Device | Perceive and act in physical environments, including on-device | Emerging
Interactive world generation | Genie 3 | Real-time navigable 3D environments at 24 fps / 720p | Research frontier

Timeline

  1. CLIP connects images and language via natural language supervision

  2. Whisper open-sources robust multilingual speech recognition

  3. GPT-4 accepts image and text inputs — first mainstream multimodal LLM

  4. Sora introduced: up to one minute of high-fidelity video generation

  5. GPT-4o: native text, audio, and vision in a single real-time model

  6. Claude computer use launches: 14.9% OSWorld, double the next-best AI

  7. Sora 2 adds synchronized audio and improved physics to video generation

  8. Genie 3 generates real-time navigable 3D worlds at 720p / 24 fps

  9. Anthropic acquires Vercept; Claude Sonnet 4.6 hits 72.5% OSWorld

FAQ

What does 'multimodal' actually mean?

It means an AI can work with more than one type of input or output — for example, reading an image and answering a question about it in text, or listening to speech and summarizing it. The newest models handle text, images, audio, and even video all at once.

Why does it matter that modalities are 'unified' in one model?

When a single model handles everything natively — rather than routing your image to one system and your text to another — it can reason across them together in real time, which makes interactions faster, more natural, and more capable.

What is computer use, and how good is it now?

Computer use means the AI can look at your screen and actually click, type, and navigate software on your behalf. In late 2024 the best models scored around 15% on the OSWorld benchmark; by early 2026 that had risen to 72.5%, approaching what a human scores.

Can I run multimodal models myself without paying a cloud provider?

Yes. Meta's Llama 3.2 added vision to its open-weights family, Mistral released Voxtral for speech and Mistral Small 4 for unified vision and text, and Alibaba's Qwen2.5-VL is available on Hugging Face. All have downloadable weights, though license terms vary by model, so check each license before commercial use.

What is the difference between video generation and world models?

Video generation (Sora, Veo 3) produces a clip you watch. World models like Genie 3 generate interactive 3D environments you can navigate in real time — the AI simulates a space you can move through, not just a pre-rendered video.

More on Multimodal Progress (6)

arXiv · cs.LG · 1mo ago

RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation

RefDecoder addresses an architectural asymmetry in latent diffusion models where denoising networks are heavily conditioned but decoders remain unconditional, causing detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens co-processed at each decoder up-sampling stage. Evaluated on Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1dB PSNR over unconditional baselines and improves VBench I2V scores across subject consistency, background consistency, and overall quality. The module is plug-and-play, compatible with existing video generation systems including Wan 2.1 and VideoVAE+ without additional fine-tuning.

arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

arXiv · cs.AI · 1mo ago

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.

arXiv · cs.AI · 1mo ago

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

Qwen Research · 1mo ago

Qwen-Image-Edit: Image Editing Model with Text Rendering and Dual Visual Control

Alibaba's Qwen team has released Qwen-Image-Edit, a 20B-parameter image editing model built on the Qwen-Image foundation. The model extends Qwen-Image's text rendering capabilities to editing tasks, enabling precise in-image text modification. It uses a dual-path architecture that simultaneously feeds input images into Qwen2.5-VL for semantic control and a VAE Encoder for appearance control, enabling both semantic and appearance-level edits.

Qwen Research · 1mo ago

Qwen-Image: 20B MMDiT Image Foundation Model with Native Text Rendering

Alibaba's Qwen team has released Qwen-Image, a 20B parameter MMDiT (Multimodal Diffusion Transformer) image generation foundation model. The model claims significant advances in complex text rendering capabilities, including multi-line layouts, paragraph-level semantics, and fine-grained typographic details across alphabetic and other language scripts. It also features precise image editing capabilities and is accessible via Qwen Chat and open-weight repositories on HuggingFace and ModelScope.