4·Hugging Face Blog·1mo ago

FineVideo: Behind the Scenes — HuggingFace Video Dataset Release

Hugging Face published a behind-the-scenes account of FineVideo, a curated dataset aimed at advancing video understanding in AI/ML models. The post details the data collection, annotation, and curation methodology used to build the dataset. FineVideo is positioned as a resource for training and evaluating multimodal video models.

Evaluation and Benchmarking Multimodal Progress FineVideo HuggingFace

Related guides (2)

Multimodal Progress·Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking·Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4·Hugging Face Blog·1mo ago

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

Agent and Tool Ecosystem Multimodal Progress Hugging Face video generation

5·Hugging Face Blog·1mo ago

Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models

This Hugging Face blog post provides a technical guide for fine-tuning Microsoft's Florence-2 vision-language models. Florence-2 is a compact yet capable multimodal model supporting tasks like captioning, object detection, and OCR. The post covers practical implementation details for adapting the model to custom datasets using the Hugging Face ecosystem.

Enterprise Deployment Patterns Agent and Tool Ecosystem Microsoft Hugging Face Florence-2

5·Hugging Face Blog·1mo ago

SmolVLM2: Bringing Video Understanding to Every Device

Hugging Face introduces SmolVLM2, a family of compact vision-language models designed for video understanding on resource-constrained devices. The models extend the SmolVLM line with video comprehension capabilities while maintaining small footprints suitable for edge and on-device deployment. The release targets democratizing multimodal video understanding beyond cloud-only inference.

Open Weights Progress Inference Economics SmolVLM SmolVLM2 Hugging Face

4·Hugging Face Blog·1mo ago

The State of Computer Vision at Hugging Face

Hugging Face published a survey of the computer vision ecosystem available through its platform as of early 2023, covering supported model architectures, tasks, datasets, and tooling. The post reviews progress in image classification, object detection, segmentation, and multimodal vision-language models integrated into the Transformers library. It serves as a reference for practitioners on what CV capabilities are accessible via the Hugging Face hub and APIs.

Agent and Tool Ecosystem Multimodal Progress Transformers Hugging Face

4·Hugging Face Blog·1mo ago

A Dive into Text-to-Video Models

This Hugging Face blog post provides an overview of text-to-video generation models as of mid-2023. It surveys the approaches, architectures, and key models in the emerging text-to-video space. As a tier-2 commentary piece, it synthesizes existing work rather than presenting novel research.

Multimodal Progress text-to-video generation Hugging Face

4·Hugging Face Blog·1mo ago

Scaling Robotics Datasets with Video Encoding

Hugging Face published a blog post on using video encoding techniques to scale robotics datasets. The post addresses the practical challenge of storing and transmitting large-scale robot learning data efficiently. Video compression is presented as a key infrastructure enabler for expanding robotics training corpora.

Training Infrastructure Agent and Tool Ecosystem video encoding robotics datasets Hugging Face

4·Hugging Face Blog·1mo ago

Improving Hugging Face Model Access for Kaggle Users

Hugging Face has announced an integration improvement that streamlines how Kaggle users access models from the Hugging Face Hub. The update appears to reduce friction for practitioners using Kaggle notebooks and compute environments to work with Hugging Face-hosted models. This represents a platform-level partnership move between two major ML community hubs.

Enterprise Deployment Patterns Agent and Tool Ecosystem Kaggle Hugging Face

6·Deepseek·11d ago

DeepSeek releases DeepSeek-OCR vision-language model on Hugging Face

DeepSeek has released DeepSeek-OCR, a multilingual image-text-to-text model on Hugging Face, built on the DeepSeek-VL-v2 architecture. The model targets OCR and image feature extraction tasks and has accumulated over 2.4 million downloads and 3,275 likes, indicating significant community uptake. This represents an open-weights multimodal release from a major Chinese AI lab.

Open Weights Progress Multimodal Progress DeepSeek-OCR-2 DeepSeek V4