
Qwen2.5-VL
qwen2-5-vl-539281ca·6 events·first seen 1mo agoAliases: Qwen2.5-VL, Qwen2.5 VL, Qwen2-VL, Qwen2-VL 7B
Co-occurring entities
More like this (12)
Recent events (6)
Qwen2.5-VL: Alibaba's New Flagship Vision-Language Model Released in 3B/7B/72B Sizes
Alibaba's Qwen team has released Qwen2.5-VL, their new flagship vision-language model, representing a significant upgrade over Qwen2-VL. The release includes both base and instruct variants in three sizes (3B, 7B, 72B), all open-weighted and available on Hugging Face and ModelScope. The 72B instruct model is also accessible via Qwen Chat. Key capabilities highlighted include enhanced visual understanding, with the model positioned as a major step forward in multimodal performance.
Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding
Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.
Qwen-Image-Edit: Image Editing Model with Text Rendering and Dual Visual Control
Alibaba's Qwen team has released Qwen-Image-Edit, a 20B-parameter image editing model built on the Qwen-Image foundation. The model extends Qwen-Image's text rendering capabilities to editing tasks, enabling precise in-image text modification. It uses a dual-path architecture that simultaneously feeds input images into Qwen2.5-VL for semantic control and a VAE Encoder for appearance control, enabling both semantic and appearance-level edits.
Qwen2.5-VL-32B: Reinforcement-Learning-Optimized Vision-Language Model Released
Alibaba's Qwen team has released Qwen2.5-VL-32B-Instruct, a 32-billion-parameter vision-language model built on the Qwen2.5-VL series and further optimized with reinforcement learning. The model is open-sourced under the Apache 2.0 license and available on Hugging Face and ModelScope. It follows the January 2025 launch of the broader Qwen2.5-VL series, positioning the 32B scale as a balance between capability and deployment practicality.
Qwen VLo: Unified Multimodal Understanding and Generation Model
Alibaba's Qwen team has announced Qwen VLo, a new model that unifies multimodal understanding and image generation in a single architecture. Building on the Qwen2.5 VL lineage, the model is positioned to both comprehend and generate high-quality visual content. This represents a step toward unified perception-and-creation models, a direction several frontier labs are pursuing simultaneously.
Pixtral 12B: Mistral AI's First Multimodal Model (Now Deprecated)
Mistral AI released Pixtral 12B in September 2024 as their first natively multimodal model, combining a new 400M parameter vision encoder trained from scratch with a 12B multimodal decoder based on Mistral Nemo. The model supports variable image sizes and aspect ratios, a 128K token context window for multiple images, and achieved 52.5% on MMMU while maintaining strong text-only benchmark performance. The model is now deprecated and has been replaced by newer vision and multimodal models from Mistral. It was released under Apache 2.0 license.