4·Qwen Research (via RSSHub)·1mo ago

OFA: Towards Building a One-For-All Unified Multimodal Pretrained Model

Alibaba's Qwen team introduces OFA (One-For-All), a unified multimodal pretrained model that handles both understanding and generation tasks across multiple modalities within a single framework. It is trained with instruction-based multitask pretraining, which gives it a broad range of capabilities. The work was published in late 2022 as part of the broader wave of generalist multimodal models and represents an early effort toward a single architecture spanning vision, language, and cross-modal tasks.

Frontier Model Releases Multimodal Progress Alibaba DAMO Academy Qwen OFA (One-For-All) instruction-based multitask pretraining

Related guides (3)

Frontier Model Releases·Topic guide

Frontier Model Releases: The Race From Language to Action

Multimodal Progress·Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Qwen

Qwen: Alibaba's Open-Weight AI Model Family

Related events (8)

4·Qwen Research·1mo ago

OFASys: Multitask Multimodal Learning Framework from Alibaba/Qwen

Alibaba's Qwen team released OFASys, an open-source framework designed to simplify multimodal multitask learning, building on their earlier OFA unified pretrained model. The system aims to reduce engineering friction in setting up multi-task, multi-modal training pipelines, including data batching and training stability. It is positioned as infrastructure for building generalist AI models with minimal code overhead.

Agent and Tool Ecosystem Multimodal Progress Alibaba OFA Qwen

7·Qwen Research·1mo ago

Qwen2.5-Omni: Alibaba Releases End-to-End Multimodal Model with Real-Time Streaming

Alibaba's Qwen team releases Qwen2.5-Omni, a 7B-parameter end-to-end multimodal model that can process text, images, audio, and video simultaneously. The model streams its responses in real time as both text and synthesized natural speech. It is openly available on Hugging Face, ModelScope, DashScope, and GitHub, accompanied by a technical paper.

Frontier Model Releases Open Weights Progress Alibaba Qwen2.5-Omni Qwen

7·Qwen Research·1mo ago

Qwen VLo: Unified Multimodal Understanding and Generation Model

Alibaba's Qwen team has announced Qwen VLo, a new model that unifies multimodal understanding and image generation in a single architecture. Building on the Qwen2.5-VL lineage, the model is positioned to both comprehend and generate high-quality visual content. This is a step toward unified perception-and-creation models, a direction several frontier labs are pursuing at the same time.

Frontier Model Releases Multimodal Progress Qwen-VL Qwen2.5-VL Alibaba Qwen

9·OpenAI Blog·1mo ago

Hello GPT-4o

OpenAI announces GPT-4o (the "o" stands for "omni"), a new flagship multimodal model capable of reasoning across audio, vision, and text in real time. The model is a significant step toward natively multimodal AI, processing and generating across modalities without separate pipeline stages. It is positioned as OpenAI's primary production model going forward.

Frontier Model Releases Inference Economics GPT-4o OpenAI GPT-4

7·The Batch·18d ago

Alibaba releases Qwen3.5 open-weights vision-language model family with MoE architecture across eight sizes

Alibaba released the Qwen3.5 family of eight open-weights vision-language models ranging from 0.8B to 397B parameters, built on a mixture-of-experts architecture with mixed attention and Gated DeltaNet layers. The flagship Qwen3.5-397B-A17B outperforms GPT-5.2, Claude 4.5 Opus, and Gemini-3 Pro on 28 of 44 vision benchmarks, while the 9B model surpasses OpenAI's gpt-oss-120B on most language tasks. Open weights are available under Apache 2.0, with hosted agentic variants (Qwen3.5-Plus, Qwen3.5-Flash) available via Alibaba Cloud. The release is notable for strong small-model efficiency and comes amid reported team departures following the Qwen3 rollout.

Frontier Model Releases Open Weights Progress GPT-5.2 Alibaba Cloud Model Studio Claude Opus 4.6

7·arXiv · cs.CL·22d ago

Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments

Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.

Frontier Model Releases Evaluation and Benchmarking Qwen-VLA DOMINO R2R

6·Qwen Research·1mo ago

Introducing Qwen-VL-Plus and Qwen-VL-Max: Upgraded Multimodal Models from Alibaba

Alibaba's Qwen team has launched two enhanced versions of their multimodal model, Qwen-VL-Plus and Qwen-VL-Max, building on the open-sourced Qwen-VL released in September 2023. Key improvements include substantially boosted image reasoning capabilities, enhanced detail recognition and text extraction from images, and support for high-definition images exceeding one million pixels across various aspect ratios. The upgrades represent a significant step forward in the Qwen-VL series' generalization and visual understanding capabilities.

Frontier Model Releases Open Weights Progress Qwen-VL Qwen-VL-Max Alibaba

6·arXiv · cs.AI·10d ago

FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones

FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.

Inference Economics Multimodal Progress USF-MAE FetalCLIP Qwen3-4B