Almanac
4 · arXiv cs.CL (Computation and Language) · 6d ago

Context-aware distillation and ablation study for Text2DSL Polkit rule generation

Researchers extend a Text2DSL system that generates Polkit domain-specific language rules from natural language, replacing prompt-only synthetic data generation with context-aware distillation. The teacher model, DeepSeek-V4-Flash, operates under structured context (BNF grammar, API spec, closed vocabulary). The approach scales a verified corpus from 4,204 to 10,073 NL-to-Polkit-rule pairs at near-perfect validity rates. A factorial ablation across eight context conditions on GigaChat-10B-A1.8B finds that structured context is load-bearing rather than cosmetic, and a Shapley decomposition attributes the largest semantic-quality gains to the vocabulary.
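
To make the Shapley attribution concrete, here is a minimal sketch in Python. It assumes the eight conditions are the 2^3 on/off combinations of the three context components, which the summary suggests but does not state. The scores are invented placeholders, not results from the paper, and the names `FACTORS`, `SCORES`, and `shapley_values` are hypothetical.

```python
from itertools import permutations

FACTORS = ("grammar", "api_spec", "vocabulary")

# Placeholder quality scores for the 2^3 = 8 context conditions.
# These numbers are invented for illustration; they are NOT from the paper.
SCORES = {
    frozenset(): 0.40,
    frozenset({"grammar"}): 0.50,
    frozenset({"api_spec"}): 0.48,
    frozenset({"vocabulary"}): 0.60,
    frozenset({"grammar", "api_spec"}): 0.56,
    frozenset({"grammar", "vocabulary"}): 0.72,
    frozenset({"api_spec", "vocabulary"}): 0.68,
    frozenset(FACTORS): 0.80,
}


def shapley_values(factors, scores):
    """Average each factor's marginal gain over every ordering of the factors."""
    totals = {f: 0.0 for f in factors}
    orderings = list(permutations(factors))
    for order in orderings:
        present = frozenset()
        for f in order:
            with_f = present | {f}
            # Marginal gain from adding f to the components already present.
            totals[f] += scores[with_f] - scores[present]
            present = with_f
    return {f: totals[f] / len(orderings) for f in factors}


values = shapley_values(FACTORS, SCORES)
for name, value in sorted(values.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.4f}")
```

By construction the three values sum to the gap between the full-context and no-context scores, so each value reads as that component's share of the total gain.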

Related events (8)

6 · arXiv · cs.CL · 1mo ago

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

6 · arXiv · cs.CL · 1mo ago

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.
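
The matching objective can be illustrated with a small Python sketch. The summary does not say which divergence CCOPD uses, so this assumes a per-token KL(teacher || student). The logits are toy values and the function names are hypothetical. In the on-policy setting the response tokens would be sampled from the student; that sampling step is omitted here.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token KL between teacher and student next-token distributions."""
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy logits over a 3-token vocabulary at 2 response positions (invented values).
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]  # frozen teacher, full clean prompt
student = [[1.0, 1.0, -0.5], [0.1, 1.5, 0.3]]  # student, incremental multi-turn history
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the student's distribution under the multi-turn history equals the teacher's under the full prompt, which is the consistency property the method targets.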

6 · arXiv · cs.CL · 12d ago

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.
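
The zero-gradient failure mode mentioned above can be shown with a short Python sketch of group-relative advantage normalization, as GRPO is commonly described. This is an illustration of the problem only, not of ZPPO itself. The function name and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard question: every rollout fails, so every advantage is zero
# and the policy-gradient update carries no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed outcomes: the successful rollout gets a positive advantage,
# the failures get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

print(all_fail)
print(mixed)
```

Because the advantage multiplies the log-probability gradient, a group of identical rewards contributes nothing to the update.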

6 · arXiv · cs.CL · 1mo ago

TextReg: Regularization Framework for Mitigating Prompt Distributional Overfitting in LLM Optimization

TextReg addresses a failure mode in iterative prompt optimization where LLM-rewritten prompts grow longer, accumulate narrow rules, and generalize poorly, a pattern the authors term prompt distributional overfitting. The authors formalize this via 'representational inefficiency,' a dual-factor measure decomposing prompt inefficiency into capacity cost and scope narrowness. TextReg applies a soft-penalty regularization framework using Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. On reasoning benchmarks, it achieves up to +11.8% OOD accuracy over TextGrad and +16.5% over REVOLVE.

4 · arXiv · cs.CL · 20d ago

Corpus-Grounded Feature Diffusion pipeline for automated IEP generation in Traditional Chinese

Researchers propose a low-resource fine-tuning pipeline called Corpus-Grounded Feature Diffusion (CGFD) to automate Individualized Education Program (IEP) drafting from Traditional Chinese parent-teacher interview transcripts. The approach fine-tunes Breeze-7B with QLoRA on 582 synthetically diffused samples and uses schema-constrained decoding at inference time, finding that Grammar-Constrained Decoding is counterproductive under Traditional Chinese token budgets. On a small formal hold-out (n=10), the system achieves BERTScore F1 of 0.779, outperforming zero-shot GPT-5.4, DeepSeek-V3.2, Gemini-3-Flash-Preview, and Llama-4-Maverick baselines while enabling fully local, air-gapped inference. The work addresses a gap in Traditional Chinese special-education NLP and demonstrates a privacy-preserving deployment pattern for sensitive document generation.

6 · arXiv · cs.CL · 6d ago

SelfCompact: Model-driven adaptive context compaction for long agent traces

Researchers propose SelfCompact, a scaffold that lets language models decide when and how to compact their own accumulated context during long agentic runs, rather than relying on fixed token-threshold triggers. The system pairs a compaction tool with a lightweight rubric specifying when to invoke or suppress compaction based on trajectory structure (e.g., sub-task completion vs. mid-derivation). Evaluated across six benchmarks and seven models, SelfCompact matches or exceeds fixed-interval summarization while reducing per-question token cost by 30-70%, with gains of up to 18.1 points on math tasks and 5-9 points on agentic search. The work identifies a 'meta-cognitive gap' in unprompted models and shows it can be closed via scaffolding without fine-tuning.

5 · arXiv · cs.AI · 19d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

5 · arXiv · cs.CL · 12d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.