5 · Hugging Face Blog · 1mo ago

Open Preference Dataset for Text-to-Image Generation by the Hugging Face Community

Hugging Face has released an open preference dataset for text-to-image generation, collected through community participation. The dataset captures human preference signals across image generation outputs, intended to support alignment and reward modeling research for image generation models. This contributes to the growing ecosystem of open datasets for training and evaluating generative image models.

Related events (8)

4 · Hugging Face Blog · 1mo ago

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

5 · Hugging Face Blog · 1mo ago

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

Hugging Face has launched a Synthetic Data Generator tool that allows users to create datasets using natural language descriptions. The tool is designed to lower the barrier for dataset creation, enabling practitioners to generate training data without writing code. This is relevant to the broader trend of synthetic data as a scalable alternative to manual data collection and annotation.

3 · Hugging Face Blog · 1mo ago

Introducing TextImage Augmentation for Document Images

Hugging Face introduces a TextImage augmentation library for document images, aimed at improving model robustness for document understanding tasks. The tooling applies transformations such as noise, blur, and distortion to document images to simulate real-world scanning and printing artifacts. This is relevant to training and fine-tuning vision-language models on document datasets.
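
The kinds of transformations mentioned above (noise and blur) can be illustrated with a minimal NumPy sketch. This is not the API of the library described in the post; the function names `add_gaussian_noise` and `box_blur`, their parameters, and the synthetic "page" are assumptions made purely for illustration, and the sketch handles only 2D grayscale arrays.

```python
import numpy as np


def add_gaussian_noise(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise to a grayscale uint8 image (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    # Clip back into the valid 8-bit range before converting.
    return np.clip(noisy, 0, 255).astype(np.uint8)


def box_blur(img, radius=1):
    """Blur a 2D grayscale image with a (2*radius+1)^2 mean filter (hypothetical helper)."""
    k = 2 * radius + 1
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(img.astype(np.float32), radius, mode="edge")
    out = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return np.round(out / (k * k)).astype(np.uint8)


# Synthetic stand-in for a scanned page: white background, one dark "text" line.
page = np.full((32, 32), 255, dtype=np.uint8)
page[10:12, 4:28] = 0

# Chain the two degradations to roughly imitate a noisy, slightly soft scan.
augmented = box_blur(add_gaussian_noise(page, sigma=10.0), radius=1)
```

Augmentations like these are typically applied on the fly during training, with randomized parameters, so that the model sees a different degradation of each document on every pass.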

7 · arXiv · cs.AI · 22d ago

GPIC: Stanford Releases 28-Trillion-Pixel Permissively Licensed Image Corpus for Visual Generation Research

Stanford Vision Lab introduces GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising 100M training, 200K validation, and 1M test images, all permissively licensed for research and commercial use. Images are captioned by a state-of-the-art vision-language model, safety-filtered, deduplicated, and hosted on Hugging Face. The release includes a benchmarking protocol for generative modeling and a reference baseline using pixel-space flow matching. The dataset addresses a key gap in scalable visual generative modeling research by providing a large, stable, and openly licensed resource.

5 · Hugging Face Blog · 1mo ago

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.
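
For reference, the standard DPO objective can be written as below. Treating the conditioning input $x$ as the pair of an image and a text prompt is an assumption about how the method carries over to VLMs; the summary does not state which exact formulation the post uses. Here $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy may drift from the reference.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, without training a separate reward model.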

5 · Hugging Face Blog · 1mo ago

LeRobot Community Datasets: The "ImageNet" of Robotics — When and How?

Hugging Face's LeRobot blog post discusses the vision and current state of building a large-scale community robotics dataset analogous to ImageNet for computer vision. The post examines what it would take to create a standardized, scalable dataset repository for robot learning, drawing on the LeRobot ecosystem. It addresses data collection formats, community contribution workflows, and the open challenges in making such a resource practically useful for training generalizable robot policies.

5 · Hugging Face Blog · 1mo ago

State of open video generation models in Diffusers

Hugging Face published a survey of open-source video generation models integrated into the Diffusers library as of January 2025. The post covers the current landscape of available open video generation models, their capabilities, and how they are supported within the Diffusers ecosystem. This serves as a reference for practitioners looking to use or compare open-weights video generation models.

4 · Hugging Face Blog · 1mo ago

Welcome aMUSEd: Efficient Text-to-Image Generation

Hugging Face introduces aMUSEd, a text-to-image model based on the MUSE architecture that prioritizes efficiency over raw quality. The model is designed to be smaller and faster than diffusion-based alternatives, making it more accessible for deployment. It is released with integration into the Diffusers library.