Almanac
4·Hugging Face Blog·1mo ago

Data Is Better Together: A Look Back and Forward

Hugging Face reviews its 'Data Is Better Together' (DIBT) initiative, a community-driven effort to collaboratively build high-quality datasets for AI training. The post reflects on past achievements in crowdsourcing preference data and instruction datasets, and outlines future directions for scaling community data collection. The initiative offers a model for open, distributed dataset creation as an alternative to proprietary data pipelines.

Related events (8)

4·Hugging Face Blog·1mo ago

Data is Better Together: Community-Driven Dataset Building with Argilla and Hugging Face Spaces

Hugging Face and Argilla are launching a collaborative initiative to enable communities to collectively build higher-quality datasets using Argilla's annotation tooling integrated with Hugging Face Spaces. The effort targets the data curation bottleneck in AI development by crowdsourcing human feedback and annotations at scale. This represents a community-oriented approach to producing training and evaluation datasets for open-source AI models.

4·Hugging Face Blog·1mo ago

Scaling AI-based Data Processing with Hugging Face + Dask

Hugging Face published a blog post describing how to scale AI-based data processing pipelines by combining Hugging Face datasets and models with Dask, a parallel computing framework. The post covers patterns for distributed inference and large-scale dataset preprocessing. This is a practical integration guide targeting ML engineers who need to process data at scale beyond single-machine limits.

4·Hugging Face Blog·1mo ago

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

Hugging Face's Ethics and Society team publishes their sixth newsletter focusing on data quality as a foundational concern for AI development. The piece addresses how training data composition, curation practices, and quality standards affect model behavior, safety, and societal impact. It situates data quality within broader responsible AI development frameworks.

4·Hugging Face Blog·1mo ago

Build Awesome Datasets for Video Generation

Hugging Face published a blog post on constructing high-quality datasets for video generation models. The post likely covers data collection, preprocessing, and curation pipelines relevant to training video diffusion or generation systems. This is a practical tooling and methodology guide aimed at practitioners working on video AI.

5·OpenAI Blog·1mo ago

OpenAI Data Partnerships

OpenAI announced a data partnerships program aimed at collaborating with external organizations to create both open-source and private datasets for AI training. The initiative seeks to expand the diversity and quality of training data available to OpenAI. This represents a structured effort to source large-scale, high-quality data from institutional partners rather than relying solely on existing web-scraped corpora.

4·Hugging Face Blog·1mo ago

Streaming Datasets: 100x More Efficient

Hugging Face published a blog post describing efficiency improvements to their datasets streaming functionality, claiming up to 100x gains. The post covers technical changes to how large datasets are accessed and loaded without full downloads. This is relevant to ML practitioners working with large-scale training data pipelines.

5·Hugging Face Blog·1mo ago

LeRobot Community Datasets: The "ImageNet" of Robotics — When and How?

Hugging Face's LeRobot team discusses the vision for, and current state of, a large-scale community robotics dataset analogous to ImageNet in computer vision. The post examines what it would take to create a standardized, scalable dataset repository for robot learning, drawing on the LeRobot ecosystem. It addresses data collection formats, community contribution workflows, and the open challenges in making such a resource practically useful for training generalizable robot policies.

5·Hugging Face Blog·1mo ago

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

Hugging Face publishes a retrospective and forward-looking commentary marking one year since the 'DeepSeek moment,' examining how DeepSeek's open-weight releases reshaped the global open-source AI ecosystem. The piece analyzes the downstream effects on model development, inference economics, and competitive dynamics between open and closed AI labs. It situates these developments within a broader 'AI+' framing, suggesting a new phase of AI integration across industries.