LLMSurgeon

product · active · provisional · llmsurgeon-75f469d0 · 1 event · first seen 19d ago

Aliases: LLMSurgeon

Recent events (1)

arXiv · cs.LG · 19d ago

LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures

LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.
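To make the inverse problem concrete, here is a minimal sketch of label-shift correction with a soft confusion matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `recover_mixture`, the three-domain setup, and all numbers are hypothetical, and the solver is a standard EM-style fixed-point iteration for label-shift estimation that may differ from what LLMSurgeon uses. The assumption is that a domain classifier's average output on generated text, `observed`, equals `confusion` applied to the unknown mixture prior.

```python
def recover_mixture(confusion, observed, iters=2000):
    """Estimate a domain mixture prior under the label-shift assumption.

    confusion[i][j]: average probability that the classifier assigns
        domain i to text whose true domain is j (each column sums to 1).
    observed[i]: average probability the classifier assigns to domain i
        on text generated by the target model.
    Returns a list of mixture weights that sums to 1.
    """
    n_pred = len(observed)
    n_dom = len(confusion[0])
    prior = [1.0 / n_dom] * n_dom  # start from a uniform guess

    for _ in range(iters):
        # Prediction distribution implied by the current prior estimate.
        implied = [
            sum(confusion[i][j] * prior[j] for j in range(n_dom))
            for i in range(n_pred)
        ]
        # EM update: reweight each domain by how well it explains `observed`.
        prior = [
            prior[j] * sum(
                observed[i] * confusion[i][j] / implied[i]
                for i in range(n_pred)
                if implied[i] > 0
            )
            for j in range(n_dom)
        ]
        total = sum(prior)
        prior = [p / total for p in prior]  # guard against numerical drift
    return prior


# Hypothetical example with three domains (say web, code, papers).
confusion = [
    [0.80, 0.10, 0.10],
    [0.15, 0.70, 0.20],
    [0.05, 0.20, 0.70],
]
true_prior = [0.5, 0.3, 0.2]  # made-up mixture, used only to simulate data
observed = [
    sum(confusion[i][j] * true_prior[j] for j in range(3)) for i in range(3)
]
estimate = recover_mixture(confusion, observed)
print("uncorrected:", [round(x, 3) for x in observed])
print("corrected:  ", [round(x, 3) for x in estimate])
```

In this toy setup the uncorrected classifier averages are biased toward the domains the classifier over-predicts, while the corrected estimate moves back toward the simulated mixture. The multiplicative update keeps every weight non-negative and on the simplex, which a plain matrix inverse would not guarantee when the confusion matrix is estimated with noise.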