7 · arXiv cs.AI (Artificial Intelligence) · 22d ago

GPIC: Stanford Releases 28-Trillion-Pixel Permissively Licensed Image Corpus for Visual Generation Research

Stanford Vision Lab introduces GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising 100M training, 200K validation, and 1M test images, all permissively licensed for research and commercial use. Images are captioned by a state-of-the-art vision-language model, safety-filtered, deduplicated, and hosted on Hugging Face. The release includes a benchmarking protocol for generative modeling and a reference baseline using pixel-space flow matching. The dataset addresses a key gap in scalable visual generative modeling research by providing a large, stable, and openly licensed resource.
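To put the headline numbers in perspective, the short calculation below divides the stated pixel count by the stated image count. It is only a back-of-the-envelope sketch: the 28-trillion figure is approximate, the variable names are illustrative, and the "square side" is an equivalent size, not a claim about the actual resolutions in the corpus.

```python
import math

# Figures quoted in the summary above (the pixel total is approximate).
total_pixels = 28e12
train_images = 100_000_000
val_images = 200_000
test_images = 1_000_000

total_images = train_images + val_images + test_images

# Average pixels per image, and the side length of a square image
# holding that many pixels.
avg_pixels = total_pixels / total_images
equivalent_side = math.sqrt(avg_pixels)

print(f"images: {total_images:,}")
print(f"average pixels per image: {avg_pixels:,.0f}")
print(f"equivalent square side: {equivalent_side:.0f} px")
```

Under these figures the average image holds roughly 277 thousand pixels, about the area of a 526 × 526 square.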

Training Infrastructure · Evaluation and Benchmarking · Multimodal Progress · GPIC · Stanford Vision Lab · Flow Matching · Hugging Face · GPIC benchmark

Related guides (3)

Hugging Face

Hugging Face: The Home of Open-Source AI

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Training Infrastructure · Topic guide

Training Infrastructure: The Compute Arms Race Powering Modern AI

Related events (8)

5 · Hugging Face Blog · 1mo ago

Open Preference Dataset for Text-to-Image Generation by the Hugging Face Community

Hugging Face has released an open preference dataset for text-to-image generation, collected through community participation. The dataset captures human preference signals across image generation outputs, intended to support alignment and reward modeling research for image generation models. This contributes to the growing ecosystem of open datasets for training and evaluating generative image models.

Evaluation and Benchmarking · Alignment and RLHF · Hugging Face · Open Preference Dataset for Text-to-Image

7 · OpenAI Blog · 1mo ago

OpenAI Launches gpt-image-1 Image Generation Model via API

OpenAI has made its latest image generation model, gpt-image-1, available through its API for developers and businesses. The model is positioned for professional-grade, customizable visual generation integrated directly into third-party tools and platforms. This follows OpenAI's earlier consumer-facing image generation features and extends them to programmatic access.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · GPT-Image-1.5 · OpenAI API · OpenAI

5 · arXiv cs.AI · 2d ago

Multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2

Researchers introduce a new benchmark of 8,602 images across six categories (commercial posters, infographics, academic posters, receipts, tables, UI screenshots) specifically for detecting AI-generated text-rich images produced by OpenAI's GPT-Image-2. Five zero-shot detectors are evaluated, revealing highly domain-dependent performance and severe sensitivity to JPEG compression even in the strongest conventional detector. A multimodal VLM is also explored as a detector, showing promise but limitations on structured formats. The work highlights a gap in existing benchmarks that focus on object-centric rather than text-layout-centric images.

Evaluation and Benchmarking · Multimodal Progress · GPT-Image-2 · OpenAI · A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

6 · OpenAI Blog · 1mo ago

Image GPT: Transformer Models Applied to Pixel Sequences for Image Generation and Classification

OpenAI demonstrates that a large transformer model trained autoregressively on pixel sequences can generate coherent image completions and samples, analogous to text generation. The work establishes a correlation between generative sample quality and downstream image classification accuracy. The best generative model achieves features competitive with top convolutional networks in the unsupervised setting, suggesting shared representational principles across modalities.
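As a toy illustration of what "trained autoregressively on pixel sequences" means, the sketch below flattens a tiny image in raster order and builds next-pixel prediction pairs. It is not OpenAI's implementation: the function names are hypothetical, the 2 × 2 image is made up, and the real model's preprocessing and transformer are omitted entirely.

```python
def raster_sequence(image):
    """Flatten a 2D grid of pixel values into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]


def next_pixel_pairs(sequence):
    """Build (context, target) pairs: each pixel is predicted from all earlier ones."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# A made-up 2x2 grayscale image.
image = [
    [0, 1],
    [2, 3],
]

sequence = raster_sequence(image)
pairs = next_pixel_pairs(sequence)

for context, target in pairs:
    print(f"given {context} predict {target}")
```

This is the same setup as next-token prediction in text, with pixel values standing in for tokens.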

Frontier Model Releases · Multimodal Progress · Transformers · convolutional neural network · OpenAI

4 · arXiv cs.AI · 46h ago

SARLO-80: Large-scale VHR SAR-optical-text dataset for multimodal foundation model training

Researchers from ONERA release SARLO-80, a dataset of 119,566 triplets combining very-high-resolution complex SAR imagery, aligned optical patches, and natural-language captions covering 257 locations across 72 countries. The dataset is built from Umbra spotlight acquisitions standardized to an 80cm slant-range grid, with three caption variants per sample to support vision-language training and evaluation. It addresses a recognized gap in SAR-optical multimodal resources, which have historically been limited to low-resolution intensity-only products. The dataset and preprocessing code are publicly released on Hugging Face Hub.
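The distinction between complex SAR imagery and intensity-only products can be shown on a single pixel. The sketch below is a generic illustration with a made-up sample value and does not use SARLO-80's actual file format or preprocessing code: a complex sample carries both amplitude and phase, and an intensity product keeps only the squared amplitude.

```python
import cmath

# A made-up complex SAR sample (real and imaginary parts).
sample = complex(3.0, 4.0)

amplitude = abs(sample)        # magnitude of the complex value
intensity = amplitude ** 2     # what an intensity-only product stores
phase = cmath.phase(sample)    # discarded by intensity-only products

# With amplitude and phase the original sample can be recovered;
# with intensity alone it cannot.
recovered = cmath.rect(amplitude, phase)

print(f"amplitude={amplitude}, intensity={intensity}, phase={phase:.4f} rad")
```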

Evaluation and Benchmarking · Multimodal Progress · Umbra · SARLO-80 · Hugging Face

8 · OpenAI Blog · 1mo ago

Introducing 4o Image Generation

OpenAI has integrated a native image generation capability directly into GPT-4o, positioning it as a primary model capability rather than a separate system. The announcement frames this as their most advanced image generator to date, emphasizing both aesthetic quality and practical utility. This represents a shift toward unified multimodal models that generate images natively rather than relying on separate diffusion-based pipelines.

Frontier Model Releases · Inference Economics · GPT-4o · GPT-4o Image Generation · OpenAI

5 · Hugging Face Blog · 1mo ago

LeRobot Community Datasets: The "ImageNet" of Robotics — When and How?

Hugging Face's LeRobot blog post discusses the vision and current state of building a large-scale community robotics dataset analogous to ImageNet for computer vision. The post examines what it would take to create a standardized, scalable dataset repository for robot learning, drawing on the LeRobot ecosystem. It addresses data collection formats, community contribution workflows, and the open challenges in making such a resource practically useful for training generalizable robot policies.

Evaluation and Benchmarking · Open Weights Progress · LeRobot · Hugging Face · ImageNet

4 · Hugging Face Blog · 1mo ago

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

Agent and Tool Ecosystem · Multimodal Progress · Hugging Face · video generation