LLMSurgeon
product · active · provisional
llmsurgeon-75f469d0 · 1 event · first seen 19d ago
Aliases: LLMSurgeon
Recent events (1)
LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures
LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.
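To make the label-shift idea concrete, here is a minimal sketch of how a mixture prior might be recovered from a soft confusion matrix. It is not the authors' implementation: the function name `estimate_mixture`, the three-domain setup, and the least-squares-then-project solver are all assumptions chosen for illustration. The premise is that the classifier's average predicted distribution over generated text, q, satisfies q ≈ C π, where column j of C is the average prediction on text truly from domain j and π is the unknown mixture.

```python
import numpy as np


def estimate_mixture(confusion, mean_pred):
    """Estimate a domain mixture prior under the label-shift assumption.

    confusion[i, j]: average probability the classifier assigns to domain i
                     for text that truly comes from domain j (columns sum to 1).
    mean_pred[i]:    average probability assigned to domain i over text
                     generated by the target model.
    """
    confusion = np.asarray(confusion, dtype=float)
    mean_pred = np.asarray(mean_pred, dtype=float)

    # Solve confusion @ prior ~= mean_pred in the least-squares sense.
    prior, *_ = np.linalg.lstsq(confusion, mean_pred, rcond=None)

    # Crude projection back to a valid distribution: clip negatives, renormalize.
    prior = np.clip(prior, 0.0, None)
    total = prior.sum()
    if total == 0.0:
        raise ValueError("estimated prior collapsed to zero; check the inputs")
    return prior / total


# Hypothetical three-domain example (say web, code, math) with made-up numbers.
confusion = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
])
true_prior = np.array([0.5, 0.3, 0.2])

# What an imperfect classifier would report on average for this mixture.
observed = confusion @ true_prior

recovered = estimate_mixture(confusion, observed)
```

In this noiseless toy case the observed distribution is skewed by domain confusion, and inverting the confusion matrix undoes that skew. With real, noisy estimates the clip-and-renormalize step is only a rough projection; a constrained solver over the probability simplex would be a more careful choice.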