6 · arXiv cs.AI (Artificial Intelligence) · 10d ago

Q-target framework unifies supervised fine-tuning variants through target distribution design

A new arXiv preprint reframes supervised fine-tuning (SFT) as a problem of target distribution design rather than loss objective selection, introducing the Q-target framework that decomposes SFT supervision into two explicit choices: reliance on the observed token and allocation of remaining probability mass. The authors show that many existing SFT variants can be understood as implicit choices of this target distribution. They propose Target-SFT, which constructs training objectives directly from the desired target distribution, and report consistent improvements across ten reasoning dataset-model settings. The work offers a unifying theoretical lens and opens a broader design space for SFT objectives.
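
The two choices named above can be illustrated with a minimal sketch. It is an illustration under assumptions, not the paper's Target-SFT: the function names, the `reliance` parameter, and the `rest` allocation are hypothetical, and the uniform default is chosen only because it recovers ordinary label smoothing.

```python
import math

def q_target(vocab_size, observed, reliance, rest=None):
    """Build a target distribution over the vocabulary (illustrative only).

    reliance: probability mass placed on the observed token, in (0, 1].
    rest: hypothetical allocation of the remaining mass over the other
          tokens; defaults to uniform, which recovers label smoothing.
    """
    if rest is None:
        rest = [1.0 / (vocab_size - 1)] * vocab_size
    # Normalizer over the non-observed tokens only.
    others = sum(p for i, p in enumerate(rest) if i != observed)
    q = [(1.0 - reliance) * p / others for p in rest]
    q[observed] = reliance
    return q

def cross_entropy(q, log_probs):
    """Cross-entropy of model log-probabilities against the target q."""
    return -sum(qi * lp for qi, lp in zip(q, log_probs) if qi > 0)

# Toy model distribution over a 3-token vocabulary.
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
one_hot = q_target(3, observed=0, reliance=1.0)   # standard SFT target
smoothed = q_target(3, observed=0, reliance=0.9)  # label-smoothing-like target
```

With `reliance=1.0` the loss reduces to the usual negative log-likelihood of the observed token; lowering it, or passing a non-uniform `rest`, changes only the target, not the form of the loss.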


Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.AI · 8d ago

RA-RFT: Retrieval-Augmented Reinforcement Fine-Tuning teaches LLMs to reason by analogy

Researchers propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that trains a retriever to rank contexts by expected reasoning benefit rather than semantic similarity, then fine-tunes a policy model via reinforcement learning using retrieved analogous demonstrations. The key insight is that reasoning-relevant retrieval surfaces complementary solution strategies rather than superficially similar problems. On mathematical reasoning benchmarks, RA-RFT improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively, suggesting reasoning-aware retrieval is orthogonal to reward design and training curriculum improvements.


6 · arXiv · cs.CL · 10d ago

QK-Restore: Fixing long-context recall degradation caused by CoT fine-tuning in hybrid LLMs

Researchers find that chain-of-thought supervised fine-tuning (CoT-SFT) systematically degrades long-context recall in hybrid linear-attention models (HypeNet, Jet-Nemotron). Needle-In-A-Haystack performance collapses; HypeNet-9B, for example, drops from 67.2% to 9.4% at 256K context. The paper attributes this to CoT-SFT biasing attention gradients toward short-range patterns, which corrupts the query-key projections responsible for long-range routing. It proposes QK-Restore, a training-free fix that restores only W_Q and W_K from the pre-SFT checkpoint, recovering long-context capability while preserving reasoning gains.
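
The restore step amounts to a selective checkpoint merge, sketched below with plain dictionaries standing in for model state. The function name `restore_qk` and the `q_proj` / `k_proj` name markers are assumptions for illustration; real checkpoints may name these parameters differently, and this is not the paper's code.

```python
def restore_qk(sft_state, base_state, qk_markers=("q_proj", "k_proj")):
    """Return a copy of sft_state with query/key projections taken from base_state.

    qk_markers are hypothetical substrings identifying W_Q / W_K parameters.
    All other parameters keep their fine-tuned values.
    """
    restored = dict(sft_state)
    for name, value in base_state.items():
        if name in restored and any(marker in name for marker in qk_markers):
            restored[name] = value
    return restored

# Strings stand in for weight tensors in this toy example.
base = {"layer0.q_proj": "Q_base", "layer0.k_proj": "K_base", "layer0.v_proj": "V_base"}
sft = {"layer0.q_proj": "Q_sft", "layer0.k_proj": "K_sft", "layer0.v_proj": "V_sft"}
merged = restore_qk(sft, base)
```

No gradient step is involved, which is what makes the fix training-free: only the selected entries are overwritten, and the input dictionaries are left unchanged.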


6 · arXiv · cs.CL · 1mo ago

ChunkFT: Memory-Efficient Full Fine-Tuning via Byte-Streamed Chunk Optimization

ChunkFT is a fine-tuning framework that reformulates full-parameter optimization around a dynamically activated working set of sub-tensors, enabling gradient computation without dense gradient materialization. It achieves full-parameter fine-tuning of a 7B model in 13.72GB GPU memory on a single RTX 4090, and scales Llama 3-70B fine-tuning to 2×H800 GPUs. Downstream evaluations on language understanding, math reasoning, and MT-Bench show ChunkFT matches or exceeds full-parameter fine-tuning quality while outperforming existing memory-efficient baselines such as LoRA-class methods. A theoretical convergence analysis in the deterministic setting is also provided.


7 · arXiv · cs.CL · 23d ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.


6 · arXiv · cs.LG · 23d ago

PEFT-Arena: Benchmarking Parameter-Efficient Finetuning via Stability-Plasticity Trade-offs

PEFT-Arena is a new benchmark that evaluates parameter-efficient finetuning methods jointly on downstream task performance and retention of pretrained general capabilities, framing the problem as a stability-plasticity dilemma. Across methods tested under comparable parameter budgets, orthogonal finetuning achieves the best Pareto frontier. The paper provides geometric analyses in both weight space (spectral/singular-value structure) and activation space (representation distortion metrics) to explain why different PEFT methods differ in forgetting behavior. A practical finding is that final SFT checkpoints often overshoot an optimal retention operating point, motivating path-wise rewinding as a post-hoc correction.
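
To make the Pareto-frontier comparison concrete, here is a small sketch of how non-dominated methods can be picked out when both downstream performance and retention are higher-is-better. The function name and the numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return names of points not dominated on (task_score, retention).

    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    front = []
    for name, task, keep in points:
        dominated = any(
            t >= task and k >= keep and (t > task or k > keep)
            for _, t, k in points
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical (method, task score, retention score) triples.
methods = [("A", 0.70, 0.90), ("B", 0.75, 0.80), ("C", 0.68, 0.85), ("D", 0.60, 0.95)]
front = pareto_front(methods)
```

In this toy data, method C is dropped because A is better on both axes, while A, B, and D each represent a different stability-plasticity trade-off.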


6 · arXiv · cs.CL · 19d ago

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

DRIFT is a training framework that bridges online RL and offline SFT for multi-turn LLM optimization by exploiting the theoretical equivalence between KL-regularized RL and importance-weighted supervised learning. It decouples rollout generation from policy optimization: trajectories are sampled from a fixed reference policy offline, weighted by return-based importance scores, and used for weighted SFT. Empirically, DRIFT matches or exceeds multi-turn RL baselines while retaining the efficiency and simplicity of standard supervised fine-tuning. Code is publicly released.
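
The equivalence mentioned above rests on a standard result: the policy maximizing expected return minus β times the KL divergence to a reference policy is proportional to the reference policy times exp(return / β). A minimal sketch of the resulting weighted SFT loss follows; the function names, the self-normalization, and the toy numbers are assumptions, and DRIFT's actual importance scores may be defined differently.

```python
import math

def importance_weights(returns, beta):
    """Self-normalized weights proportional to exp(return / beta)."""
    m = max(returns)  # subtract the max for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in returns]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_sft_loss(weights, traj_log_probs):
    """Weighted negative log-likelihood over offline trajectories."""
    return -sum(w * lp for w, lp in zip(weights, traj_log_probs))

# Toy data: three trajectories sampled offline from a fixed reference policy.
returns = [1.0, 0.0, 0.0]
log_probs = [-2.0, -1.0, -3.0]  # current policy's log-likelihood of each trajectory
w = importance_weights(returns, beta=1.0)
loss = weighted_sft_loss(w, log_probs)
```

A small `beta` concentrates the weight on the highest-return trajectories, while a large `beta` approaches uniform weighting, which is plain SFT on the reference rollouts.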


5 · arXiv · cs.LG · 26d ago

Good Token Hunting: Token Selection Framework for Visual Geometry Transformers

This paper introduces a two-stage token selection framework to address the quadratic computational scaling of global attention in visual geometry transformers used for multi-view 3D reconstruction. The approach combines diversity-based inter-frame selection (frame-level) with entropy-guided intra-frame sparsification (token-level within frames). Experiments demonstrate over 85% acceleration for 500-image scenes while maintaining or improving baseline reconstruction quality, offering a favorable speed-accuracy trade-off.


6 · arXiv · cs.LG · 17d ago

q0: Hyper-Epoch Pretraining turns multi-epoch budgets into diverse model populations for better generalization

A new arXiv preprint introduces hyper-epoch pretraining (q0), a framework that reframes multi-epoch training as exploration of a model population rather than refinement of a single model. The approach uses three primitives—cyclic schedules with anti-correlated learning rate and weight decay, chain distillation, and a learned prior for inference-time weighting—to achieve lower validation loss than single-model training. On a 1.8B-parameter model trained on FineWeb, q0 matches a 256-epoch ensemble baseline using only ~56 epochs (~4.6× fewer), with cumulative ~12.9× data efficiency under the Slowrun setting. The work directly addresses the emerging regime where compute scales faster than high-quality data supply.
