Almanac
Topic guide · In-depth

Multimodal Progress: From Bolted-On Vision to Unified Perception

Multimodal Progress · In-depth · active · v1 · live · generated 6d ago
TL;DR: Multimodal AI has moved from treating vision, audio, and text as separate pipeline stages to integrating them inside single architectures, and the frontier is now pushing further into video generation, computer use, and physical-world robotics. The central tension is no longer whether models can handle multiple modalities, but whether unified architectures genuinely outperform specialized ones, and whether perception can drive reliable action in the real world.

Key takeaways

  • CLIP (2021) established the contrastive text-image pretraining paradigm that underpins most vision-language work that followed.
  • GPT-4o (May 2024) was the first major production model to process audio, vision, and text natively in a single pass rather than through chained pipelines.
  • Sora (Feb 2024 research preview → Dec 2024 launch) framed video generation as a path toward physical world simulation; Sora 2 (Sep 2025) added synchronized audio and improved physics.
  • Computer use (models controlling GUIs via screenshots) went from Claude 3.5 Sonnet's 14.9% OSWorld score in October 2024 to Claude Sonnet 4.6's 72.5% by early 2026, a roughly 5× gain in under 18 months.
  • Open-weights multimodal coverage expanded significantly: Meta's Llama 3.2 added vision, Mistral released Voxtral (speech) and Mistral Small 4 (unified text+image MoE), and Alibaba's Qwen2.5-VL reached 72B-scale vision-language.
  • DeepMind's Gemini Robotics line and Genie 3 world model extended multimodal perception into embodied physical agents and real-time interactive 3D environments.

What this area covers

Multimodal AI is the project of building systems that perceive and generate across more than one sensory channel — text, images, audio, video, and increasingly physical action. The field spans vision-language models (VLMs) that read images and answer questions about them, speech systems that transcribe and understand audio, video generators that synthesize moving images, and the emerging class of computer-use and robotics agents that close the loop between perception and action.

Why it matters

Language-only models are bounded by what can be expressed in text. Multimodal capability removes that ceiling: a model that can see a screenshot, hear a voice command, watch a video, or control a GUI can participate in workflows that were previously inaccessible to AI. The practical stakes range from developer tooling (vision-assisted debugging, UI automation) to creative production (video generation, image synthesis) to physical-world deployment (robotics, embodied agents).

Phase 1 — Contrastive pretraining and the VLM foundation (2021–2023)

The modern multimodal era traces to CLIP (January 2021), which demonstrated that a neural network could learn visual concepts from natural language supervision alone — zero-shot visual classification without task-specific training data, mirroring the zero-shot transfer GPT-2 and GPT-3 had shown in language. This contrastive text-image pretraining paradigm became the substrate for most subsequent vision-language work.
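The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, not CLIP's actual training code: the embeddings are hand-written toy vectors rather than encoder outputs, the function names are hypothetical, and the fixed temperature of 0.07 is an assumption chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """logits[i][j] = cosine(image i, text j) / temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image-to-text and text-to-image terms, averaged."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)) / 2

def zero_shot_classify(image_emb, label_embs):
    """Return the index of the label embedding most similar to the image."""
    sims = similarity_matrix([image_emb], label_embs, temperature=1.0)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy data: row i of `images` is assumed to be paired with row i of `texts`.
images = [[1.0, 0.1], [0.1, 1.0]]
texts = [[0.9, 0.2], [0.2, 0.9]]
aligned = contrastive_loss(images, texts)
shuffled = contrastive_loss(images, texts[::-1])  # deliberately mismatched pairs
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss is lower for `aligned` than for `shuffled`. Zero-shot classification reuses the same similarity: embed one caption per class label and pick the closest.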

OpenAI's Whisper (September 2022) did the equivalent for speech: a single model trained on 680,000 hours of multilingual web audio achieved near-human English transcription accuracy and multilingual translation, released as open weights. These two models — CLIP and Whisper — established the baselines that later systems would be measured against.

GPT-4 (March 2023) brought image understanding into a flagship production model, accepting image and text inputs and producing text outputs. By November 2023, GPT-4 Turbo with Vision extended this to the developer API alongside DALL·E 3 image generation. ChatGPT gained voice input and speech output for Plus and Enterprise users in September 2023, making multimodal interaction consumer-facing for the first time at scale.

Phase 2 — Native unification and the omnimodal bet (2024)

The pivotal architectural shift arrived in May 2024 with GPT-4o (Omni). Rather than routing audio, vision, and text through separate pipeline stages, GPT-4o processed all three natively in a single model — the first major production system to do so. OpenAI positioned it as their primary production model going forward, signaling that the pipeline era was ending.
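One way to see why the pipeline-versus-native distinction matters is to look at what each stage can observe. The toy sketch below is an illustration only and says nothing about how GPT-4o is built internally. `AudioClip`, `transcribe`, `text_model`, and `unified_assistant` are hypothetical stand-ins, and the `tone` field is an assumed placeholder for any signal that does not survive transcription.

```python
from dataclasses import dataclass

@dataclass
class AudioClip:
    words: str
    tone: str  # e.g. "calm" or "urgent"; stands in for non-textual signal

def transcribe(clip: AudioClip) -> str:
    """Pipeline stage 1 (speech to text): only the words survive."""
    return clip.words

def text_model(prompt: str) -> str:
    """Pipeline stage 2 (text to text): never sees the original audio."""
    return f"reply to: {prompt}"

def pipeline_assistant(clip: AudioClip) -> str:
    """Chained stages: each one sees only the previous stage's output."""
    return text_model(transcribe(clip))

def unified_assistant(clip: AudioClip) -> str:
    """Toy single model: the whole clip is visible when producing the reply."""
    prefix = "urgent " if clip.tone == "urgent" else ""
    return f"{prefix}reply to: {clip.words}"

calm = AudioClip(words="where is the exit", tone="calm")
urgent = AudioClip(words="where is the exit", tone="urgent")
```

In this toy, the pipeline returns the same reply for both clips because the tone is discarded at the transcription step, while the unified function can respond differently to each.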

Two months earlier, in March 2024, Anthropic's Claude 3 family had introduced multimodal vision capabilities alongside near-perfect long-context recall. Meta's Llama 3.1 (July 2024) extended the open-weights frontier with multilingual support and longer context, but vision arrived only with Llama 3.2 (September 2024), which added image understanding and lightweight edge variants and marked Meta's first open-weights multimodal Llama release.

Video generation matured in parallel. OpenAI's Sora research preview (February 2024) introduced a transformer architecture operating on spacetime patches of video and image latent codes, trained jointly on videos and images of variable durations and resolutions. OpenAI explicitly framed scaling video generation as a path toward general-purpose physical world simulators — a claim that reframed the stakes of the field. Sora publicly launched at sora.com in December 2024 with up to 1080p, 20-second generation from text prompts or existing assets.
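The idea of spacetime patches can be sketched as follows. This is a simplified illustration under stated assumptions: it cuts a raw single-channel grid rather than the latent codes the paragraph mentions, the patch sizes are arbitrary, and `spacetime_patches` and `make_video` are hypothetical helper names.

```python
def spacetime_patches(video, pt, ph, pw):
    """Split a video[t][y][x] grid into flattened pt x ph x pw patches.

    Assumes every dimension is an exact multiple of its patch size.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be multiples of the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one block spanning pt frames, ph rows, pw columns.
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches

def make_video(frames, height, width):
    """Toy video whose pixel values are just their flat indices."""
    return [[[t * height * width + y * width + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]

clip = make_video(frames=4, height=4, width=6)
clip_tokens = spacetime_patches(clip, pt=2, ph=2, pw=2)

still = make_video(frames=1, height=4, width=6)  # an image as a one-frame video
still_tokens = spacetime_patches(still, pt=1, ph=2, pw=2)
```

Because the same routine handles a four-frame clip and a single still, inputs of different durations and resolutions simply yield different numbers of patches, which is what makes joint training on videos and images straightforward to express as a sequence problem.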

Alibaba's Qwen2.5-VL (January 2025) brought open-weights vision-language capability to 72B scale, shipping in three sizes (3B, 7B, and 72B), while Qwen2 had already established strong multilingual and long-context text foundations.

Phase 3 — Computer use, audio parity, and the action gap (2025)

The most consequential multimodal development of 2025 was not a new modality but a new output type: computer use. Anthropic had opened the category with a public beta of computer use for Claude 3.5 Sonnet in October 2024, enabling the model to control a computer by interpreting screenshots and issuing pixel-level cursor and keyboard commands. Its 14.9% OSWorld score was roughly double that of the next-best AI model at the time, though well below human-level performance of 70–75%. Prompt injection was identified as the primary near-term risk.
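The observe-decide-act loop behind computer use can be sketched with a toy environment. Everything here is a hypothetical stand-in rather than any vendor's API: `ToyDesktop` replaces a real screen, its "screenshot" is a text description instead of pixels, and `scripted_policy` takes the place of the model call.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str       # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""

class ToyDesktop:
    """Stand-in for a GUI: a single text field at a fixed screen position."""
    FIELD = (100, 200)

    def __init__(self):
        self.focused = False
        self.field_text = ""

    def screenshot(self) -> str:
        # A real agent would receive pixels; this toy returns a description.
        return f"field@{self.FIELD} focused={self.focused} text={self.field_text!r}"

    def apply(self, action: Action) -> None:
        if action.kind == "click":
            self.focused = (action.x, action.y) == self.FIELD
        elif action.kind == "type" and self.focused:
            self.field_text += action.text

def scripted_policy(goal: str, observation: str) -> Action:
    """Placeholder for the model: choose the next action from what is on screen."""
    if "focused=False" in observation:
        return Action("click", *ToyDesktop.FIELD)
    if repr(goal) not in observation:
        return Action("type", text=goal)
    return Action("done")

def run_agent(env, policy, goal: str, max_steps: int = 10) -> bool:
    """Observe, decide, act; stop on "done" or when the step budget runs out."""
    for _ in range(max_steps):
        action = policy(goal, env.screenshot())
        if action.kind == "done":
            return True
        env.apply(action)
    return False

desktop = ToyDesktop()
finished = run_agent(desktop, scripted_policy, goal="hello")
```

The loop also shows where the prompt injection risk enters: whatever appears in the observation is untrusted input to the policy, so on-screen text can try to steer the next action.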

Google DeepMind followed with a Gemini 2.5 Computer Use model preview in October 2025, entering the same space. The competitive dynamic accelerated rapidly: by February 2026, Anthropic had acquired Vercept — a team specializing in AI perception for computer use, co-founded by researchers including Ross Girshick — and Claude Sonnet 4.6 reached 72.5% on OSWorld, approaching human-level performance on tasks like navigating spreadsheets and completing web forms.

On audio, Mistral AI released Voxtral (July 2025), a family of open-weight speech understanding models (24B and 3B) that outperformed Whisper large-v3 across all tasks and were competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding. Voxtral supported long-form audio up to 30–40 minutes, native multilingual transcription, and function-calling directly from voice — extending the open-weights audio baseline well beyond Whisper's transcription-focused design.

Video generation continued to advance: Sora 2 (September 2025) added synchronized dialogue and sound effects, improved physics simulation, and enhanced steerability. Google DeepMind announced Veo 3 and Imagen 4 (May 2025) targeting professional media production, and Genie 3 (August 2025) pushed into interactive world modeling, generating navigable 3D environments in real time at 24 fps and 720p and maintaining consistency for several minutes.

Phase 4 — Embodied AI and unified open-weights models (late 2025–2026)

DeepMind extended its multimodal stack into physical-world robotics with Gemini Robotics (March 2025), Gemini Robotics On-Device (June 2025, targeting edge deployment without cloud inference), and Gemini Robotics 1.5 (September 2025), which added perception, planning, reasoning, tool use, and multi-step task execution for embodied agents. This trajectory, from screen pixels to robot actuators, represents the most ambitious extension of the multimodal perception stack.

On the open-weights side, Mistral Small 4 (March 2026) unified capabilities previously split across separate specialist models — reasoning (Magistral), multimodal (Pixtral), and coding (Devstral) — into a single 119B-parameter MoE model with native text and image input, released under Apache 2.0. This consolidation pattern, where a single open model replaces a portfolio of specialists, is a meaningful signal about where the open ecosystem is heading.

Meta's Muse Spark (April 2026), the first model from its Superintelligence Labs, introduced natively multimodal reasoning with visual chain-of-thought and multi-agent orchestration — though as a closed-weights product, marking a strategic departure from Meta's open Llama line. Google DeepMind announced Gemini Omni (May 2026), extending the omnimodal naming convention further. Claude Opus 4.7 (May 2026) shipped improved vision with higher image resolution alongside cybersecurity safeguards.

Where it's heading

The developments above point to three active frontiers. First, computer use as a convergence point: the rapid OSWorld improvement (14.9% → 72.5% in under 18 months) suggests GUI control is becoming a standard capability rather than a research demo, with acquisition activity (Vercept) and dedicated model variants (Gemini 2.5 Computer Use) signaling sustained investment. Second, video and audio unification: Sora 2's synchronized audio and Veo 3's professional media targeting suggest the next generation of video models will be expected to handle sound natively, not as an add-on. Third, embodied physical action: DeepMind's robotics line is the clearest bet that multimodal perception will eventually need to drive reliable physical-world behavior, which is the hardest version of the action gap problem.

[Figure: Multimodal capability lineage, from contrastive pretraining to embodied action]

Multimodal capability milestones by modality and lab

| Capability | Key model(s) | Lab(s) | Status in bundle |
| --- | --- | --- | --- |
| Text + image understanding | GPT-4, Claude 3, Qwen2.5-VL, Llama 3.2 | OpenAI, Anthropic, Alibaba, Meta | Broadly available, open-weights parity reached |
| Native omnimodal (text+audio+vision) | GPT-4o, Gemini Omni | OpenAI, Google DeepMind | Production; unified single-pass inference |
| Text-to-video generation | Sora, Sora 2, Veo 3 | OpenAI, Google DeepMind | Sora 2 adds synchronized audio; Veo 3 targets pro media |
| Computer use (GUI control) | Claude 3.5 Sonnet → Sonnet 4.6, Gemini 2.5 CU | Anthropic, Google DeepMind | 14.9% → 72.5% OSWorld in ~18 months |
| Speech / audio understanding | Whisper, Voxtral | OpenAI, Mistral AI | Open-weights; Voxtral outperforms Whisper large-v3 |
| Embodied robotics | Gemini Robotics, Gemini Robotics On-Device, Gemini Robotics 1.5 | Google DeepMind | Emerging; on-device variant targets edge deployment |
| Interactive world models | Genie 3 | Google DeepMind | Capability demo; 24 fps, 720p, real-time 3D |

All entries traceable to events in the bundle; '—' would denote unknown cells.

Timeline

  1. CLIP: contrastive text-image pretraining establishes the VLM paradigm (Jan 2021)

  2. Whisper: open-weights ASR trained on 680K hours sets the speech baseline (Sep 2022)

  3. GPT-4 launches as a multimodal model accepting image and text inputs (Mar 2023)

  4. ChatGPT gains vision, voice input, and speech output for Plus/Enterprise users (Sep 2023)

  5. Sora research preview: video generation framed as world simulation on transformer architecture (Feb 2024)

  6. GPT-4o: first production model processing audio, vision, and text in a single native pass (May 2024)

  7. Llama 3.2: Meta's first open-weights multimodal Llama adds vision and edge variants (Sep 2024)

  8. Anthropic launches computer use for Claude 3.5 Sonnet at 14.9% OSWorld, double the next-best AI (Oct 2024)

  9. Sora publicly launches at sora.com with up to 1080p / 20-second video generation (Dec 2024)

  10. Qwen2.5-VL released in 3B/7B/72B open-weights sizes (Jan 2025)

  11. GPT-4o gains native image generation; Gemini Robotics extends multimodal perception to physical robots (Mar 2025)

  12. Mistral Voxtral: open-weights speech models outperform Whisper large-v3 across all tasks (Jul 2025)

  13. Genie 3 generates real-time interactive 3D environments at 24 fps / 720p (Aug 2025)

  14. Sora 2 adds synchronized audio and improved physics simulation (Sep 2025)

  15. Gemini Robotics 1.5 targets embodied agents; Gemini 2.5 Computer Use model previews (Sep–Oct 2025)

  16. Anthropic acquires Vercept (computer use perception); Sonnet 4.6 reaches 72.5% OSWorld (Feb 2026)

  17. Mistral Small 4 unifies reasoning, vision, and coding in a single 119B MoE open-weights model (Mar 2026)

  18. Gemini Omni announced; Claude Opus 4.7 ships improved vision with higher image resolution (May 2026)

FAQ

What does 'natively multimodal' mean versus a pipeline approach?

A native multimodal model processes all input types (text, image, audio) inside a single forward pass rather than routing them through separate specialist models stitched together. GPT-4o was the first major production model to do this across audio, vision, and text simultaneously.

How fast has computer use improved?

Anthropic's Claude 3.5 Sonnet scored 14.9% on OSWorld at its computer use launch in October 2024, roughly double the next-best AI at the time. By early 2026, Claude Sonnet 4.6 reached 72.5% on the same benchmark, inside the human-level range of 70–75%.
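The size of that jump is easy to check by hand. The short calculation below uses only the two OSWorld scores and the human-level range quoted in this guide; the variable names are illustrative.

```python
start_score = 14.9   # Claude 3.5 Sonnet on OSWorld, in percent
end_score = 72.5     # Claude Sonnet 4.6 on OSWorld, in percent
human_low, human_high = 70.0, 75.0  # human-level range quoted in this guide

fold_gain = end_score / start_score    # multiplicative improvement
point_gain = end_score - start_score   # absolute gain in percentage points
in_human_range = human_low <= end_score <= human_high

print(f"{fold_gain:.2f}x, +{point_gain:.1f} points, in human range: {in_human_range}")
# prints "4.87x, +57.6 points, in human range: True"
```

The fold gain of about 4.87 is where the "roughly 5×" figure comes from. The absolute gain of 57.6 percentage points is arguably the more informative number, since a multiplicative gain looks large whenever the starting score is low.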

Is open-weights multimodal capability catching up to closed models?

Substantially yes across vision-language and speech: Llama 3.2 added vision to Meta's open family, Qwen2.5-VL reached 72B scale, Mistral Small 4 unified text and image in a single Apache 2.0 MoE, and Voxtral outperformed Whisper large-v3 while staying competitive with GPT-4o mini and Gemini 2.5 Flash on audio understanding.

What is the difference between video generation and world modeling?

Video generation (Sora, Veo 3) produces a fixed output clip from a prompt. World modeling (Genie 3) generates an interactive, navigable environment that responds to user input in real time — the distinction is agency and consistency over time rather than one-shot synthesis.
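The difference shows up in the shape of the interface. In the toy sketch below, `generate_clip` and `ToyWorld` are hypothetical names and the "frames" are plain strings; it illustrates the two calling patterns, not how Sora, Veo 3, or Genie 3 work internally.

```python
def generate_clip(prompt: str, num_frames: int) -> list:
    """One-shot synthesis: every frame is fixed once the prompt is given."""
    return [f"{prompt} frame {i}" for i in range(num_frames)]

class ToyWorld:
    """Interactive: each frame depends on persistent state and the latest input."""
    MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

    def __init__(self, prompt: str):
        self.prompt = prompt
        self.position = (0, 0)

    def step(self, user_input: str) -> str:
        """Advance the world by one frame in response to a user action."""
        dx, dy = self.MOVES.get(user_input, (0, 0))  # unknown input: stay put
        x, y = self.position
        self.position = (x + dx, y + dy)
        return f"{self.prompt} viewed from {self.position}"

clip = generate_clip("a harbor at dusk", num_frames=3)

world = ToyWorld("a harbor at dusk")
first_visit = world.step("right")
world.step("right")
second_visit = world.step("left")  # back where the first step ended
```

A clip is a finished list of frames, whereas the world is a stateful object queried one step at a time. Returning to an earlier position reproduces the earlier view, which is the toy analogue of staying consistent over time.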

Where does robotics fit into multimodal progress?

DeepMind's Gemini Robotics line applies multimodal perception and reasoning to physical agents that must act in the real world — extending the same vision-language-action stack from screen-based computer use to dexterous manipulation and on-device edge deployment.

More on Multimodal Progress (6)

arXiv · cs.LG · 1mo ago

RefDecoder: Reference-Conditioned Video VAE Decoder for Enhanced Visual Generation

RefDecoder addresses an architectural asymmetry in latent diffusion models where denoising networks are heavily conditioned but decoders remain unconditional, causing detail loss and inconsistency. The approach injects high-fidelity reference image signals into the VAE decoding process via reference attention, with a lightweight image encoder mapping reference frames into high-dimensional tokens co-processed at each decoder up-sampling stage. Evaluated on Inter4K, WebVid, and Large Motion benchmarks, RefDecoder achieves up to +2.1dB PSNR over unconditional baselines and improves VBench I2V scores across subject consistency, background consistency, and overall quality. The module is plug-and-play, compatible with existing video generation systems including Wan 2.1 and VideoVAE+ without additional fine-tuning.

arXiv · cs.CL · 1mo ago

ATLAS: Unified Agentic and Latent Visual Reasoning via Functional Tokens

ATLAS proposes a framework where a single discrete 'functional token' serves dual roles as both an agentic operation trigger and a latent visual reasoning unit in multimodal models. This design avoids the computational cost of generating intermediate images while sidestepping the context-switching latency of external tool calls and the generalization limitations of pure latent methods. The framework is compatible with standard SFT and RL training pipelines without architectural changes, and introduces Latent-Anchored GRPO (LA-GRPO) to stabilize reinforcement learning when functional tokens are sparse. Experiments show strong performance on visual reasoning benchmarks with maintained interpretability.

arXiv · cs.AI · 1mo ago

EntityBench: Benchmark for Entity-Consistent Long-Range Multi-Shot Video Generation

EntityBench is a new benchmark comprising 140 episodes (2,491 shots) derived from real narrative media, designed to evaluate entity consistency—characters, objects, and locations—across long multi-shot video generation sequences. It introduces tiered difficulty up to 50 shots and recurrence gaps of up to 48 shots, paired with a three-pillar evaluation suite covering intra-shot quality, prompt alignment, and cross-shot consistency. The authors also propose EntityMem, a memory-augmented baseline that stores verified per-entity visual references in a persistent memory bank, achieving the highest character fidelity (Cohen's d = +2.33) among evaluated methods. Results show that cross-shot entity consistency degrades sharply with recurrence distance in existing approaches.

arXiv · cs.AI · 1mo ago

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT is a new neural architecture that implicitly models continuous 3D geometry from unposed multi-view images without requiring explicit pointmap regression. It learns a continuous neural scene representation in a canonical coordinate system, supporting SDF-based surface queries and color prediction via lightweight decoders. The model is trained with multi-dataset joint optimization using 2D supervision and 3D geometric regularization, achieving strong generalization across mesh reconstruction, novel view synthesis, depth/normal estimation, and camera pose estimation tasks.

Qwen Research · 1mo ago

Qwen-Image-Edit: Image Editing Model with Text Rendering and Dual Visual Control

Alibaba's Qwen team has released Qwen-Image-Edit, a 20B-parameter image editing model built on the Qwen-Image foundation. The model extends Qwen-Image's text rendering capabilities to editing tasks, enabling precise in-image text modification. It uses a dual-path architecture that simultaneously feeds input images into Qwen2.5-VL for semantic control and a VAE Encoder for appearance control, enabling both semantic and appearance-level edits.

Qwen Research · 1mo ago

Qwen-Image: 20B MMDiT Image Foundation Model with Native Text Rendering

Alibaba's Qwen team has released Qwen-Image, a 20B parameter MMDiT (Multimodal Diffusion Transformer) image generation foundation model. The model claims significant advances in complex text rendering capabilities, including multi-line layouts, paragraph-level semantics, and fine-grained typographic details across alphabetic and other language scripts. It also features precise image editing capabilities and is accessible via Qwen Chat and open-weight repositories on HuggingFace and ModelScope.