technique
near-deduplication
active
near-deduplication-5b5802ed · 1 event · first seen 28d ago · Aliases: near-deduplication
Recent events (1)
Large-scale Near-deduplication Behind BigCode
This Hugging Face blog post describes the near-deduplication pipeline built for the BigCode project, which prepares the large-scale source code datasets used to train code language models. It covers the methodology for identifying and removing near-duplicate documents at scale, including hashing techniques and distributed processing. Deduplication is a critical preprocessing step: repeated samples skew the training distribution and encourage memorization, so removing them affects both training data quality and model generalization.
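As an illustration of what a hash-based near-deduplication step can look like, here is a minimal single-machine sketch in Python. It assumes the hashing technique is MinHash with locality-sensitive hashing (LSH) banding, followed by union-find clustering. All names and parameter values (NUM_PERM, BANDS, NGRAM, the 0.7 threshold, near_dedup) are hypothetical choices for this example, not values taken from the summary above, and the sketch does not attempt the distributed processing the post discusses.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical parameters chosen for illustration only.
NUM_PERM = 128             # number of hash functions per signature
BANDS = 32                 # number of LSH bands
ROWS = NUM_PERM // BANDS   # signature rows per band
NGRAM = 3                  # shingle size, in tokens


def shingles(text, n=NGRAM):
    """Return the set of lowercase word n-grams in a document."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(shingle_set, num_perm=NUM_PERM):
    """Build a MinHash signature: the minimum salted hash per seed."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingle_set
        ))
    return signature


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def near_dedup(docs, threshold=0.7):
    """Return indices of documents kept (first member of each cluster)."""
    sigs = {}
    for i, doc in enumerate(docs):
        sh = shingles(doc)
        if sh:  # documents with no tokens are left untouched
            sigs[i] = minhash(sh)

    # LSH banding: documents sharing any band become candidate pairs.
    buckets = defaultdict(list)
    for i, sig in sigs.items():
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(i)

    parent = list(range(len(docs)))
    for ids in buckets.values():
        first = ids[0]
        for other in ids[1:]:
            # Verify candidates with the estimated Jaccard similarity.
            matches = sum(x == y for x, y in zip(sigs[first], sigs[other]))
            if matches / NUM_PERM >= threshold:
                ra, rb = find(parent, first), find(parent, other)
                if ra != rb:
                    # Keep the smaller index as root so the earliest doc survives.
                    parent[max(ra, rb)] = min(ra, rb)

    return [i for i in range(len(docs)) if find(parent, i) == i]
```

Calling near_dedup on a list of strings returns the positions of the documents to keep. Documents that land in the same cluster are collapsed to their earliest member, which is one simple policy among several possible ones.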