5arXiv cs.CL (Computation and Language)·41h ago

ORBIT: Training-free multi-attribute behavioral steering via orthogonal subspace rotation

Researchers introduce ORBIT (Orthogonal Rotation-Based Intervention Technique), a training-free activation steering method that simultaneously controls multiple behavioral attributes in language models. The approach constructs a joint subspace from per-attribute steering planes via SVD and applies a single norm-preserving rotation, avoiding the norm imbalance and directional cancellation problems of naive vector summation. The authors also release TraitFactory, a new multi-attribute behavioral benchmark, and evaluate across Llama-3.2-3B, Qwen-2.5-7B, and Llama-3.1-8B. ORBIT outperforms existing training-free baselines on multi-attribute steering while better preserving output coherence.

Evaluation and Benchmarking Alignment and RLHF TraitFactory Llama 3.2 ORBIT ToneBank Qwen 2.5-7B Llama-3.1-8B

Related guides (2)

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·12d ago·source ↗

OrchRM: Self-supervised reward modeling for multi-agent orchestration without human annotations

Researchers propose Orchestration Reward Modeling (OrchRM), a self-supervised framework that trains reward models for LLM-based multi-agent orchestrators using intermediate execution artifacts to construct win-lose pairs for Bradley-Terry training. The approach avoids costly sub-agent rollouts by operating directly at the orchestration level, achieving up to 10x improvement in training token efficiency and up to 8% accuracy gains in test-time scaling. Results generalize across mathematical reasoning, web-based QA, and multi-hop reasoning tasks.

Agent and Tool Ecosystem Alignment and RLHF Reward Modeling for Multi-Agent Orchestration OrchRM Bradley-Terry +1 more

7arXiv · cs.CL·8d ago·source ↗

Language models linearly encode a 'value axis' tracking expected goal success, study finds

Researchers construct a 'value axis' in Qwen3-8B's activation space using synthetic in-context RL data, finding that this axis distinguishes high vs. low confidence, backtracking vs. non-backtracking rollouts, and correct vs. corrupted code. Steering along this axis causally modulates self-correction behavior and verbosity, while DPO training shifts the internal value of rewarded behaviors. Applied to real-world settings, the axis reveals that Qwen assigns low internal value to politically sensitive queries post-training and that SFT increases domain-specific confidence. The findings suggest LLMs linearly encode an estimate of expected goal success that shapes their generative behavior.

AI Safety Research Alignment and RLHF The Value Axis: Language Models Encode Whether They're on the Right Track Direct Preference Optimization (DPO)Qwen3-4B

6arXiv · cs.LG·17h ago·source ↗

InSight: Self-guided autonomous skill acquisition for vision-language-action models via primitive steerability

InSight is a framework enabling VLA models to autonomously acquire new manipulation skills beyond their training data by decomposing demonstrations into labeled primitive actions (e.g., 'move gripper to bowl', 'pour the bottle') and running a VLM-guided data flywheel that identifies missing primitives, attempts demonstrations, and integrates successful ones back into training. The system requires no human demonstrations of target skills and is evaluated on simulation and real-world tasks including block flipping, drawer closing, sweeping, and pouring. Learned primitives can be composed for novel long-horizon tasks, offering a practical path toward continual skill acquisition in robotic VLA policies.

Agent and Tool Ecosystem Multimodal Progress InSight InSight: Self-Guided Skill Acquisition via Steerable VLAs InSight: Self-Guided Skill Acquisition via Steerable VLAs

6arXiv · cs.LG·8d ago·source ↗

HABC: Hierarchical Advantage Weighting for Online RL Fine-Tuning of Vision-Language-Action Policies

Researchers introduce Hierarchical Advantage-Weighted Behavior Cloning (HABC), a method for fine-tuning pretrained Vision-Language-Action (VLA) policies via online RL using only sparse binary episode outcomes. HABC trains separate critic heads for viability and efficiency objectives, combines them via a state-adaptive gate, and applies intervention-aware credit assignment to avoid incorrect supervision across human-intervention boundaries. On three contact-rich bimanual real-robot tasks, HABC improves success rates from SFT baselines of 36%, 44%, and 12% to 92%, 88%, and 38% respectively. The work addresses a fundamental credit assignment problem in robot learning from sparse outcome signals.

Alignment and RLHF Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes Hierarchical Advantage-Weighted Behavior Cloning

5arXiv · cs.CL·14d ago·source ↗

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF Multimodal Progress Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models PersonaPlex Moshi

7arXiv · cs.CL·27d ago·source ↗

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.

Frontier Model Releases Agent and Tool Ecosystem AXPO GRPO Thinking-Acting Gap +4 more

7arXiv · cs.AI·1mo ago·source ↗

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Vector Policy Optimization (VPO) is a new RL post-training algorithm for LLMs that replaces the scalar reward paradigm with vector-valued rewards, explicitly training models to produce diverse solution sets that specialize across different reward trade-offs. VPO is designed as a near-drop-in replacement for the GRPO advantage estimator and targets inference-scaling search procedures like AlphaEvolve. Across four tasks, VPO matches or outperforms scalar RL baselines on pass@k and best@k metrics, with advantages growing as search budget increases, and unlocks evolutionary search problems that GRPO-trained models cannot solve. The paper argues that diversity-optimized post-training may need to become the default as inference-time search becomes standard.

Evaluation and Benchmarking Inference Economics GRPO pass@k AlphaEvolve +4 more

6arXiv · cs.CL·29d ago·source ↗

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.

Training Infrastructure Inference Economics MAGIC LLaVA-1.5-7B LLaVA-665K +5 more