Demystifying Data Organization for Enhanced LLM Training
This Microsoft Research paper systematically investigates how data organization, as distinct from data selection, affects LLM training efficiency across the pre-training and supervised fine-tuning (SFT) stages. The authors formalize four guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity) and introduce two novel data ordering methods, STR and SAW, that reuse pre-computed sample-level scores with minimal additional overhead. Experiments across multiple model scales and dataset sizes demonstrate improved training stability and performance, and the code has been released publicly.
Related guides (3)
Related events (8)
LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures
LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.
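To make the label-shift idea concrete, here is a minimal sketch of how a soft confusion matrix can be inverted to recover a mixture prior. It is a generic illustration of label-shift correction, not LLMSurgeon's actual procedure: the function name `estimate_mixture`, the three-domain toy matrix, and the least-squares-then-project step are all assumptions made for this example. The sketch assumes that row i of the matrix holds the domain classifier's average predicted probabilities on text whose true domain is i, so that the average prediction on the target model's generated text equals the transposed matrix applied to the unknown prior.

```python
import numpy as np


def estimate_mixture(soft_confusion, mean_pred):
    """Recover a domain prior pi from q = C^T pi (hypothetical helper).

    soft_confusion[i, j]: mean predicted probability of domain j
                          on held-out text whose true domain is i.
    mean_pred[j]:         mean predicted probability of domain j
                          on text generated by the target model.
    """
    C = np.asarray(soft_confusion, dtype=float)
    q = np.asarray(mean_pred, dtype=float)
    # Solve the linear inverse problem in the least-squares sense.
    pi, *_ = np.linalg.lstsq(C.T, q, rcond=None)
    # Crude projection back onto the probability simplex:
    # drop negative mass, then renormalize.
    pi = np.clip(pi, 0.0, None)
    return pi / pi.sum()


# Toy example with three made-up domains and an imperfect classifier.
C = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])
true_pi = np.array([0.5, 0.3, 0.2])  # assumed ground-truth mixture
q = C.T @ true_pi                    # what the classifier would report on average
est = estimate_mixture(C, q)
print(np.round(est, 3))
```

In this noiseless toy case the estimate matches the assumed mixture; with real, noisy estimates of the confusion matrix, a constrained solver would likely be a better choice than clipping and renormalizing.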
Optimizing your LLM in production
A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.
Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks
A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.
DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs
DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.
Survey proposes four-layer architecture for token-operations-oriented LLM inference optimization
A new arXiv preprint introduces a four-layer technical architecture—Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion—for systematically organizing LLM inference optimization techniques. The paper reviews key technologies and industry status at each layer and analyzes their application in real-world business scenarios. The framing around 'token operations' positions inference optimization as an operational discipline analogous to traditional IT operations.
RL-trained LLMs learn retriever-specific query formulation strategies for RAG
A new arXiv paper presents the first systematic study of using reinforcement learning to teach LLMs to adapt query formulation strategies to different retrieval backends. The authors find that different retrievers have surprisingly distinct optimal query styles (e.g., descriptive vs. question-like), making cross-retriever strategy transfer ineffective. They introduce a branching-based rollout technique to stabilize training over multi-step retrieval trajectories and show gains from retriever-specific human guidance and model scaling.
Benchmarking study finds LLMs fail at counterintuitive probability problems despite strong standard performance
A new arXiv paper evaluates 8 state-of-the-art LLMs on discrete probability problems using two datasets: standard exercises (average accuracy 0.96) and counterintuitive exercises designed to trigger heuristic reasoning (average accuracy 0.59). The authors document token bias causing 20%+ performance drops when canonical problem formulations are disguised, and up to 34% degradation when misleading suggestions are embedded in prompts. The findings argue that current LLMs are not genuine probabilistic reasoners despite their success on advanced math benchmarks.
Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods
A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.


