4 · arXiv cs.CL (Computation and Language) · 6d ago

Context-aware distillation and ablation study for Text2DSL Polkit rule generation

Researchers extend a Text2DSL system that generates Polkit domain-specific language rules from natural language. They replace prompt-only synthetic data generation with context-aware distillation, in which DeepSeek-V4-Flash acts as a teacher model operating under structured context (BNF grammar, API spec, closed vocabulary). The approach scales a verified corpus from 4,204 to 10,073 NL-to-Polkit-rule pairs at near-perfect validity rates. A factorial ablation across eight context conditions on GigaChat-10B-A1.8B finds that structured context is load-bearing rather than cosmetic, and a Shapley decomposition attributes the largest semantic-quality gains to the vocabulary.
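As a rough illustration of how a Shapley decomposition over a factorial ablation can work, the sketch below assumes the eight conditions are the 2^3 on/off combinations of the three context components named above (grammar, API spec, vocabulary); the summary does not confirm that layout. The factor names, the `shapley_values` helper, and the toy scores are all hypothetical placeholders, not the paper's code or results.

```python
from itertools import combinations, permutations
from math import factorial

# Assumed factor names, taken from the three context components in the summary.
FACTORS = ("grammar", "api_spec", "vocabulary")


def shapley_values(score, factors=FACTORS):
    """Average each factor's marginal gain over every ordering of the factors.

    `score` maps a frozenset of enabled factors to a quality metric, so it
    must contain one entry per cell of the full factorial design.
    """
    totals = {f: 0.0 for f in factors}
    for order in permutations(factors):
        enabled = frozenset()
        for factor in order:
            with_factor = enabled | {factor}
            # Marginal contribution of `factor` given what is already enabled.
            totals[factor] += score[with_factor] - score[enabled]
            enabled = with_factor
    n_orders = factorial(len(factors))
    return {f: total / n_orders for f, total in totals.items()}


# Placeholder per-factor effects, NOT results from the paper.
TOY_EFFECT = {"grammar": 1.0, "api_spec": 2.0, "vocabulary": 4.0}

# Build all eight conditions with a purely additive toy score.
toy_scores = {
    frozenset(subset): sum(TOY_EFFECT[f] for f in subset)
    for size in range(len(FACTORS) + 1)
    for subset in combinations(FACTORS, size)
}

values = shapley_values(toy_scores)
for factor, value in values.items():
    print(f"{factor}: {value:.2f}")
```

Because the toy scores are additive, each factor's Shapley value equals its placeholder effect, and the values sum to the gap between the full-context and no-context conditions. With real ablation scores the factors would interact, and the same averaging would split those interactions evenly among the factors involved.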

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepSeek-V4-Flash · PolkitBench · Context-Aware Distillation and Ablation for Text2DSL · GigaChat-10B-A1.8B

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 1mo ago

Self-Policy Distillation via Capability-Selective Subspace Projection

This paper introduces Self-Policy Distillation (SPD), a self-distillation method for LLMs that requires no external signals such as correctness filters or reward models. SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, then projects KV activations into this subspace during self-generation to isolate task-relevant signal from stylistic noise. Experiments across code generation, math reasoning, and QA show up to 13% improvement over prior signal-free self-distillation methods and 15% better out-of-domain generalization.

Frontier Model Releases · Evaluation and Benchmarking · large language models · key-value (KV) activation projection · low-rank subspace projection · +2 more

6 · arXiv · cs.CL · 1mo ago

Canonical-Context On-Policy Distillation (CCOPD) for Multi-Turn LLM Consistency

This paper identifies 'self-anchored drift' as a key failure mode in multi-turn LLMs: when information is revealed incrementally across turns, models produce unsupported assumptions that distort final answers, even when the total evidence is identical to a single-prompt setting. The authors propose Canonical-Context On-Policy Distillation (CCOPD), which trains a student model on incremental multi-turn conversations to match the output distribution of a frozen teacher conditioned on the full clean prompt. Trained only on math conversations, CCOPD achieves a 32% average relative improvement on multi-turn (RAW-SHARDED) tasks and generalizes zero-shot to five out-of-domain task families while preserving single-prompt performance.

Evaluation and Benchmarking · Agent and Tool Ecosystem · on-policy distillation · multi-turn language models · self-anchored drift · +2 more

6 · arXiv · cs.CL · 12d ago

ZPPO: Teacher-in-prompt training method outperforms distillation and GRPO for small vision-language models

Researchers introduce Zone of Proximal Policy Optimization (ZPPO), a training method inspired by Vygotsky's zone of proximal development that embeds teacher guidance in prompts rather than policy gradients or logit imitation. On hard questions where student rollouts fail, ZPPO constructs Binary Candidate-included Questions (BCQ) and Negative Candidate-included Questions (NCQ) to help the student discriminate correct from incorrect responses, with a replay buffer that recirculates hard questions until mastered. Evaluated on the Qwen3 family (0.8B–9B) with a 27B teacher across a 31-benchmark suite covering VLM, LLM, and video tasks, ZPPO outperforms both distillation and GRPO baselines, with the largest gains at the smallest model scale. The method addresses a known failure mode of RL training where zero-reward rollouts produce no gradient signal.

Open Weights Progress · Alignment and RLHF · GRPO · Proximal Policy Optimization · Qwen3 · +1 more

6 · arXiv · cs.CL · 1mo ago

TextReg: Regularization Framework for Mitigating Prompt Distributional Overfitting in LLM Optimization

TextReg addresses prompt distributional overfitting, a failure mode in iterative prompt optimization where LLM-rewritten prompts grow longer, accumulate narrow rules, and generalize poorly. The authors formalize this via 'representational inefficiency,' a dual-factor measure decomposing prompt inefficiency into capacity cost and scope narrowness. TextReg applies a soft-penalty regularization framework using Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. On reasoning benchmarks, it achieves up to +11.8% OOD accuracy over TextGrad and +16.5% over REVOLVE.

Evaluation and Benchmarking · Agent and Tool Ecosystem · TextGrad · REVOLVE · Semantic Edit Regularization · +4 more

4 · arXiv · cs.CL · 20d ago

Corpus-Grounded Feature Diffusion pipeline for automated IEP generation in Traditional Chinese

Researchers propose a low-resource fine-tuning pipeline called Corpus-Grounded Feature Diffusion (CGFD) to automate Individualized Education Program (IEP) drafting from Traditional Chinese parent-teacher interview transcripts. The approach fine-tunes Breeze-7B with QLoRA on 582 synthetically diffused samples and uses schema-constrained decoding at inference time, finding that Grammar-Constrained Decoding is counterproductive under Traditional Chinese token budgets. On a small formal hold-out (n=10), the system achieves BERTScore F1 of 0.779, outperforming zero-shot GPT-5.4, DeepSeek-V3.2, Gemini-3-Flash-Preview, and Llama-4-Maverick baselines while enabling fully local, air-gapped inference. The work addresses a gap in Traditional Chinese special-education NLP and demonstrates a privacy-preserving deployment pattern for sensitive document generation.

Evaluation and Benchmarking · Enterprise Deployment Patterns · DeepSeek V4 · Corpus-Grounded Feature Diffusion · Grammar-Constrained Decoding · +6 more

6 · arXiv · cs.CL · 6d ago

SelfCompact: Model-driven adaptive context compaction for long agent traces

Researchers propose SelfCompact, a scaffold that lets language models decide when and how to compact their own accumulated context during long agentic runs, rather than relying on fixed token-threshold triggers. The system pairs a compaction tool with a lightweight rubric specifying when to invoke or suppress compaction based on trajectory structure (e.g., sub-task completion vs. mid-derivation). Evaluated across six benchmarks and seven models, SelfCompact matches or exceeds fixed-interval summarization while reducing per-question token cost by 30-70%, with gains of up to 18.1 points on math tasks and 5-9 points on agentic search. The work identifies a 'meta-cognitive gap' in unprompted models and shows it can be closed via scaffolding without fine-tuning.

Long Context Evolution · Inference Economics · SelfCompact · Self-Compacting Language Model Agents · +1 more

5 · arXiv · cs.AI · 19d ago

Step-aligned critique outperforms GRPO and reference-solution conditioning in self-distillation

A new arXiv paper investigates context design for self-distillation of language models, comparing binary reward (GRPO), reference solutions, and step-by-step critiques aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution conditioning by 5.27 points on Avg@12. Per-token advantage analysis shows that step-aligned feedback targets only failing tokens, avoiding unnecessary pressure on already-correct reasoning steps. The findings suggest structural alignment between feedback and the model's reasoning trace is a key driver of self-distillation effectiveness.

Evaluation and Benchmarking · Alignment and RLHF · GRPO · The Role of Feedback Alignment in Self-Distillation

5 · arXiv · cs.CL · 12d ago

d-OPSD: First on-policy self-distillation framework tailored for diffusion LLMs

Researchers introduce d-OPSD, the first on-policy self-distillation (OPSD) framework designed specifically for diffusion large language models (dLLMs). The method addresses a fundamental mismatch between existing autoregressive OPSD approaches and dLLMs' arbitrary-order generation by using suffix conditioning on self-generated answers and step-level rather than token-level divergence supervision. Across four reasoning benchmarks, d-OPSD outperforms RLVR and SFT baselines while requiring only ~10% of the optimization steps of RLVR, suggesting strong sample efficiency gains for dLLM post-training.

Frontier Model Releases · Alignment and RLHF · d-OPSD · Learning from the Self-future: On-policy Self-distillation for dLLMs