Introducing Qwen-VL-Plus and Qwen-VL-Max: Upgraded Multimodal Models from Alibaba
Alibaba's Qwen team has launched two enhanced versions of their multimodal model, Qwen-VL-Plus and Qwen-VL-Max, building on the open-sourced Qwen-VL released in September 2023. Key improvements include substantially boosted image reasoning capabilities, enhanced detail recognition and text extraction from images, and support for high-definition images exceeding one million pixels across various aspect ratios. The upgrades represent a significant step forward in the Qwen-VL series' generalization and visual understanding capabilities.
Related events (8)
Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding
Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The team claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos longer than 20 minutes for question answering, dialog, and content creation tasks.
Qwen2.5-VL: Alibaba's New Flagship Vision-Language Model Released in 3B/7B/72B Sizes
Alibaba's Qwen team has released Qwen2.5-VL, their new flagship vision-language model and a significant upgrade over Qwen2-VL. The release includes both base and instruct variants in three sizes (3B, 7B, 72B), all open-weight and available on Hugging Face and ModelScope. The 72B instruct model is also accessible via Qwen Chat. The announcement highlights enhanced visual understanding and positions the model as a major step forward in multimodal performance.
Qwen VLo: Unified Multimodal Understanding and Generation Model
Alibaba's Qwen team has announced Qwen VLo, a new model that unifies multimodal understanding and image generation in a single architecture. Building on the Qwen2.5-VL lineage, the model is positioned to both comprehend and generate high-quality visual content. This represents a step toward unified perception-and-creation models, a direction several frontier labs are pursuing simultaneously.
Qwen2.5-VL-32B: Reinforcement-Learning-Optimized Vision-Language Model Released
Alibaba's Qwen team has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter vision-language model built on the Qwen2.5-VL series and further optimized with reinforcement learning. The model is open-sourced under the Apache 2.0 license and available on Hugging Face and ModelScope. It follows the January 2025 launch of the broader Qwen2.5-VL series, positioning the 32B scale as a balance between capability and deployment practicality.
QVQ-Max: Alibaba Qwen Releases Visual Reasoning Model with Multimodal Chain-of-Thought
Alibaba's Qwen team has officially released QVQ-Max, a visual reasoning model succeeding the December 2024 QVQ-72B-Preview. The model is designed to analyze and reason over images and videos, covering domains including mathematics, programming, and creative tasks. It moves beyond the exploratory preview and is positioned as a production-grade multimodal reasoning system.
Qwen releases Qwen3.5-2B multimodal model on Hugging Face
Alibaba's Qwen team released Qwen3.5-2B, a 2-billion-parameter image-text-to-text model, on Hugging Face. The model supports conversational use and is compatible with Azure deployment endpoints. With nearly 2 million downloads, it has seen substantial community uptake.
Qwen2.5-Omni: Alibaba Releases End-to-End Multimodal Model with Real-Time Streaming
Alibaba's Qwen team has released Qwen2.5-Omni, a 7B-parameter end-to-end multimodal model capable of processing text, images, audio, and video simultaneously. The model delivers real-time streaming responses as both text and synthesized natural speech. It is openly available on Hugging Face, ModelScope, DashScope, and GitHub, accompanied by a technical paper.
Qwen2 Model Family Released: Five Sizes, 128K Context, Multilingual
Alibaba's Qwen team has released Qwen2, the successor to Qwen1.5, in five sizes ranging from 0.5B to 72B parameters, each offered as a pretrained and an instruction-tuned model and including a 57B mixture-of-experts variant (57B-A14B). The release highlights training on 27 additional languages beyond English and Chinese, significantly improved coding and mathematics performance, and extended context support up to 128K tokens for the 7B and 72B instruct variants. The team claims state-of-the-art benchmark results across a large number of evaluations.


