FADA: Unified vision-language model for fetal ultrasound interpretation deployable on consumer smartphones
FADA is a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation of fetal ultrasound images through a single pipeline without requiring external labels at inference. The system distills knowledge from four domain-specific foundation models using selective distillation, achieving 0.8820 mean Dice for segmentation and 0.7671 mAP@0.50 for detection, with expert validation confirming clinically acceptable outputs. Notably, the compressed 0.8B model runs entirely offline on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1) in approximately 60 seconds, targeting diagnostic access gaps in low- and middle-income countries where trained sonographers are scarce. Code, models, and data are publicly released.
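The two headline metrics in the summary can be made concrete with a small sketch. It is not FADA's evaluation code. It only illustrates the standard definitions of the Dice coefficient for binary masks and of box IoU, which is what the 0.50 threshold in mAP@0.50 refers to. The function names, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total


def mean_dice(pairs):
    """Average Dice over an iterable of (pred, target) mask pairs."""
    scores = [dice(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


def box_iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_050(pred_box, gt_box):
    """A detection counts as a true positive at the 0.50 threshold if IoU >= 0.5."""
    return box_iou(pred_box, gt_box) >= 0.5
```

Under these definitions, a mean Dice of 0.8820 is the per-image Dice averaged over the test set, and mAP@0.50 counts a predicted box as correct only when its IoU with a ground-truth box is at least 0.5.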
Related events (8)
Introducing Idefics2: A Powerful 8B Vision-Language Model for the Community
Hugging Face introduces Idefics2, an 8-billion parameter open vision-language model released for the community. The model is positioned as a capable multimodal system combining vision and language understanding. As an open-weights release from a major AI platform, it contributes to the growing ecosystem of accessible multimodal models.
OFA: Towards Building a One-For-All Unified Multimodal Pretrained Model
Alibaba's Qwen team introduces OFA (One-For-All), a unified multimodal pretrained model designed to handle both understanding and generation tasks across multiple modalities within a single framework. The model is pretrained using instruction-based multitask pretraining to endow it with diverse capabilities. This work was published in late 2022 as part of the broader wave of generalist multimodal models. It represents an early effort toward a single model architecture capable of spanning vision, language, and cross-modal tasks.
OFASys: Multitask Multimodal Learning Framework from Alibaba/Qwen
Alibaba's Qwen team released OFASys, an open-source framework designed to simplify multimodal multitask learning, building on their earlier OFA unified pretrained model. The system aims to reduce engineering friction in setting up multi-task, multi-modal training pipelines, including data batching and training stability. It is positioned as infrastructure for building generalist AI models with minimal code overhead.
DeepSeek releases DeepSeek-OCR vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture. The model targets OCR and image feature extraction tasks and has accumulated over 2.4 million downloads and 3,275 likes, indicating significant community uptake. This represents an open-weights multimodal release from a major Chinese AI lab.
Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models
This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.
Qwen-VLA: Unified Vision-Language-Action Model Across Robot Tasks, Environments, and Embodiments
Alibaba's Qwen team presents Qwen-VLA, a unified embodied foundation model that extends the Qwen vision-language stack to continuous action and trajectory generation via a DiT-based action decoder. The model is jointly pretrained on diverse data spanning manipulation trajectories, egocentric demonstrations, synthetic simulation, and navigation data, with embodiment-aware prompt conditioning to support multiple robot platforms. A unified action-and-trajectory prediction framework covers manipulation, navigation, and trajectory prediction tasks. Benchmarks show strong results: 97.9% on LIBERO, 73.7% on Simpler-WidowX, 69.0% OSR on R2R navigation, and 76.9% average OOD success in real-world ALOHA experiments.
Qwen2-VL: Alibaba Releases Latest Vision-Language Model with Extended Video Understanding
Alibaba's Qwen team has released Qwen2-VL, the latest iteration of their vision-language model series built on the Qwen2 foundation. The model claims state-of-the-art performance on visual understanding benchmarks including MathVista, DocVQA, RealWorldQA, and MTVQA. A notable capability is understanding videos exceeding 20 minutes in length for question answering, dialog, and content creation tasks.
DeepSeek releases DeepSeek-OCR-2 vision-language model on Hugging Face
DeepSeek has released DeepSeek-OCR-2, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture and tagged for OCR and vision-language tasks. The model has accumulated over 1.8 million downloads and 980 likes, indicating substantial community uptake. It extends DeepSeek's multimodal model lineup with a specialized document/OCR capability.

