Almanac
technique

near-deduplication

technique · active · near-deduplication-5b5802ed · 1 event · first seen 28d ago

Aliases: near-deduplication

Recent events (1)

Hugging Face Blog · 28d ago

Large-scale Near-deduplication Behind BigCode

This Hugging Face blog post details the near-deduplication pipeline developed for the BigCode project, which processes the large-scale source code datasets used to train code language models. It covers the methodology for identifying and removing near-duplicate documents at scale, including hashing techniques and distributed processing. Deduplication is a critical preprocessing step: repeated documents skew the training distribution, which affects training data quality and model generalization.
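
To make the hashing idea concrete, here is a minimal sketch of one common approach, MinHash signatures over token shingles. It is an illustration under stated assumptions, not the BigCode pipeline itself: the function names, the shingle size, the signature length, and the 0.7 threshold are all hypothetical choices, and the sketch compares every pair of documents directly, where a large-scale system would typically add locality-sensitive hashing and distributed processing to avoid that quadratic step.

```python
import hashlib
import re
from itertools import combinations

NUM_PERM = 128  # signature length (illustrative value, not from the post)
NGRAM = 3       # shingle size in tokens (illustrative value)


def shingles(text, n=NGRAM):
    """Return the set of lowercase token n-grams in a document."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        # Very short documents become a single shingle.
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(shingle, seed):
    """Seeded 64-bit hash; each seed acts as one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text, num_perm=NUM_PERM):
    """Signature: the minimum hash of the shingle set under each seed."""
    doc_shingles = shingles(text)
    return [min(_hash(s, seed) for s in doc_shingles) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def near_duplicates(docs, threshold=0.7):
    """Return name pairs whose estimated similarity reaches the threshold.

    Brute-force over all pairs: fine for a demo, too slow at dataset scale.
    """
    sigs = {name: minhash(text) for name, text in docs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(sigs), 2)
        if estimate_jaccard(sigs[a], sigs[b]) >= threshold
    ]


if __name__ == "__main__":
    docs = {
        "a": "def add(a, b): return a + b",
        "b": "DEF add(a,b):   return a+b",  # same tokens, different formatting
        "c": "print hello world from another file entirely",
    }
    print(near_duplicates(docs))
```

In this sketch, documents "a" and "b" reduce to the same token shingles and so receive identical signatures, while "c" shares no shingles with either. Removing one document from each reported pair is then the deduplication step.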