Almanac
5 · arXiv cs.CL (Computation and Language) · 7d ago

KDoS framework proposes distribution-optimized synthetic data for LLM knowledge injection

Researchers introduce KDoS (Knowledge Distribution-optimized Synthesis), a framework that uses a three-stage feedback mechanism guided by 'knowledge density' to optimize the distribution of synthetic training data for LLMs. Rather than stopping at preset token counts or fixed ratios, KDoS dynamically adjusts synthesis to avoid sparse or redundant domain coverage. Experiments across Qwen, Ling, and LLaMA models (0.6B–16B parameters) on 1B–5B token scales show consistent improvements over baselines on six knowledge benchmarks. A key finding is that an optimal knowledge distribution exists and remains stable across model families and scales.

Related events (8)

6 · arXiv · cs.LG · 1mo ago

LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures

LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.
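To make the label-shift idea concrete, here is a minimal sketch of recovering a mixture prior from a confusion matrix. It is not the authors' implementation: the function name `recover_mixture`, the EM-style update, and the toy numbers are assumptions chosen for illustration. It assumes a domain classifier whose confusion matrix has been estimated on held-out labeled text, and it takes the classifier's predicted-domain shares on the model's generated text as input.

```python
def recover_mixture(confusion, observed, iters=2000):
    """Estimate a domain prior p such that observed is close to confusion @ p.

    confusion[i][j] = P(classifier predicts domain i | text is from domain j);
    each column must sum to 1 (hypothetical calibrated soft confusion matrix).
    observed[i] = share of generated samples assigned to domain i.
    """
    k = len(observed)
    p = [1.0 / k] * k  # start from a uniform prior
    for _ in range(iters):
        # Predicted-label distribution implied by the current prior.
        implied = [sum(confusion[i][j] * p[j] for j in range(k)) for i in range(k)]
        # Multiplicative EM update; it keeps p on the probability simplex.
        p = [
            p[j] * sum(observed[i] * confusion[i][j] / implied[i]
                       for i in range(k) if implied[i] > 0)
            for j in range(k)
        ]
    return p


# Toy example with three domains (all numbers are made up).
confusion = [[0.8, 0.1, 0.1],
             [0.1, 0.7, 0.2],
             [0.1, 0.2, 0.7]]
observed = [0.45, 0.30, 0.25]
print(recover_mixture(confusion, observed))
```

The iterative update is used here instead of a direct matrix inverse because it never produces negative proportions, which a plain linear solve can do when the confusion matrix is noisy.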

5 · arXiv · cs.CL · 13d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

5 · arXiv · cs.AI · 12h ago

DOPD: Advantage-aware dual on-policy distillation to address privilege illusion in LLM/VLM training

Researchers introduce DOPD (Dual On-policy Distillation), a knowledge distillation framework that dynamically routes token-level supervision between a privileged teacher and privileged student policy based on advantage gap and relative probabilities. The method addresses a failure mode called 'privilege illusion,' where information asymmetry between teacher and student is conflated with a transferable capability gap. Experiments on both LLM and VLM settings show DOPD outperforms vanilla on-policy distillation and related methods, with additional gains on stability, continual learning, and out-of-distribution tasks.

5 · arXiv · cs.CL · 15d ago

OPCoD: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Researchers introduce On-Policy Co-Distillation (OPCoD), a training framework in which two LLMs, each stronger in a different domain, iteratively tutor each other using on-policy rollouts and peer feedback. The method uses cognizance-based gating to control when feedback is given and feedback anchoring to ground it in the problem context. On Science Q&A tasks, OPCoD achieves a Pareto improvement for both models across all evaluated domain pairs, outperforming one-way distillation and single-model fine-tuning baselines.

4 · arXiv · cs.CL · 15d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

6 · arXiv · cs.LG · 1mo ago

Strong Teacher Not Needed? On Distillation in LLM Pretraining

This paper challenges the conventional assumption that knowledge distillation requires a stronger teacher to produce better students. Through systematic variation of architecture sizes and training token budgets, the authors find that even small, undertrained teachers can improve larger student models when language modeling and distillation losses are properly mixed. Counterintuitively, stronger teachers can saturate or reverse distillation gains, and distillation benefits generalization more than in-domain fitting.
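As a rough illustration of what mixing language modeling and distillation losses can look like, the sketch below combines a cross-entropy term with a KL term for a single token position. The convex combination with weight `alpha`, the KL direction (teacher to student), and the names `softmax` and `mixed_loss` are assumptions for illustration; the summary does not state the exact form the paper uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_loss(student_logits, teacher_logits, target, alpha):
    """(1 - alpha) * cross-entropy + alpha * KL(teacher || student).

    alpha is a hypothetical mixing weight: 0 gives pure language modeling,
    1 gives pure distillation.
    """
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    # Language modeling loss against the ground-truth next token.
    ce = -math.log(s[target])
    # Distillation loss: divergence of the student from the teacher.
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    return (1 - alpha) * ce + alpha * kl


# Toy example over a 3-token vocabulary (logits are made up).
print(mixed_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0], target=0, alpha=0.3))
```

Under this form, a weak teacher still contributes a soft target distribution, while the cross-entropy term keeps the student anchored to the actual data.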

4 · Hugging Face Blog · 1mo ago

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

5 · arXiv · cs.CL · 27d ago

Knowledge editing via locate-then-edit transferred to masked diffusion language models, revealing multi-token failure mode

A new arXiv paper investigates whether locate-then-edit knowledge editing methods, developed for autoregressive models, transfer to masked diffusion language models (MDMs) such as LLaDA and Dream. The authors find that causal tracing identifies the same early-to-mid-layer MLP location in both paradigms, but MDMs degrade systematically on multi-token edits due to partially unmasked intermediate states that the edit was never optimized for. A correction targeting these intermediate states substantially restores multi-token editing performance. The work is the first systematic comparison of knowledge editing across autoregressive and diffusion-based language model paradigms.