Recovering the Zipfian Distribution in Unsupervised Term Discovery
paper · active · provisional
recovering-the-zipfian-distribution-in-unsupervised-term-discovery-1fccf729 · 1 event · first seen 7d ago
More like this (12)
Zipf distribution
Recovery Subspace Dimensionality
Self-Augmenting Retrieval for Diffusion Language Models
TF-IDF + Logistic Regression
Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models
Supervised Semantic Differential
TF-IDF
A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
unsupervised language modeling
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation
Unsupervised Pre-training
Recent events (1)
Graph-based clustering recovers Zipfian distributions in unsupervised term discovery
A new arXiv preprint argues that K-means and other centre-based clustering methods produce artificially uniform lexicon distributions in unsupervised speech term discovery, due to their bias toward spherical clusters. The authors propose graph-based clustering using the Leiden algorithm as a bottom-up alternative, demonstrating it substantially outperforms K-means, GMM, and BIRCH on word- and syllable-level lexicon discovery across three languages while producing more Zipf-like distributions. The work challenges the dominance of centre-based methods in this subfield of unsupervised speech processing.
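One rough way to check whether a discovered lexicon is "Zipf-like" or "artificially uniform" is to fit a line to log frequency against log rank over the cluster sizes. A slope near -1 suggests a Zipfian shape, while a slope near 0 suggests uniform cluster sizes. The sketch below is an illustration only: the function name `zipf_slope` and the toy label lists are assumptions, and this is not necessarily the measure used in the preprint.

```python
import math
from collections import Counter


def zipf_slope(labels):
    """Least-squares slope of log(frequency) against log(rank).

    `labels` holds one cluster ID per discovered segment. This is a
    hypothetical helper for illustration, not the preprint's metric.
    """
    # Cluster sizes, largest first, so position + 1 is the rank.
    counts = sorted(Counter(labels).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two clusters to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy label sequences (made up for this example, not data from the preprint).
# Ten equally sized clusters, as a centre-based method might tend to produce.
uniform_labels = [k for k in range(10) for _ in range(12)]
# Ten clusters whose sizes fall off roughly as 1 / rank.
zipfian_labels = [k for k in range(10) for _ in range(round(120 / (k + 1)))]

print(zipf_slope(uniform_labels))
print(zipf_slope(zipfian_labels))
```

The same function could be applied to the cluster assignments from any method (K-means, GMM, BIRCH, or a graph-based method such as Leiden) to compare how the resulting lexicon distributions are shaped.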