6arXiv cs.CL (Computation and Language)·9d ago

ALIGNBEAM: Training-free safety alignment transfer across model families at inference time

ALIGNBEAM is a training-free inference-time method that transfers safety alignment from a safe anchor model to a domain-fine-tuned target model, even when the two models have different vocabularies. It works by translating anchor logits into the target model's vocabulary token-by-token at each decoding step, then using a small LLM judge to select the safest among K candidate continuations. The method addresses a known vulnerability where domain fine-tuning degrades safety, and demonstrates substantial refusal improvements on adversarial benchmarks without retraining either model or incurring prohibitive inference overhead.

Inference Economics AI Safety Research Alignment and RLHF ALIGNBEAM

Related guides (3)

AI Safety ResearchTopic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Read asBeginner In-depth

Alignment and RLHFTopic guide

Alignment and RLHF: Teaching AI Models to Behave

Read asBeginner In-depth

Inference EconomicsTopic guide

Inference Economics: The Cost of Running AI in Production

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·18d ago·source ↗

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.

Evaluation and Benchmarking AI Safety Research on-policy distillation SafeSteer alignment tax +4 more

7Openai Blog·1mo ago·source ↗

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases AI Safety Research Reinforcement Learning from Human Feedback OpenAI deliberative alignment +2 more

6arXiv · cs.CL·47h ago·source ↗

Activation-space directions for detecting and mitigating emergent misalignment across LLM families

Researchers fine-tuned four small instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3B) on insecure code to induce emergent misalignment, then investigated whether a shared activation-space direction could detect and correct it. A difference-in-means direction achieves 99.6% separation of aligned vs. misaligned activations within each model, and causal steering by subtracting this direction reduces misaligned behavior by 21–51 points. Cross-architecture transfer via ridge regression yields large behavioral suppression but fails specificity controls, revealing a two-tier structure: within-model directions are causally specific and actionable, while cross-model directions are real but non-specific. The findings bound the utility of linear cross-architecture correction and recommend within-model probing for safety auditing.

Evaluation and Benchmarking AI Safety Research Llama 3.2 Gemma 2 Qwen2.5-1.5B +4 more

7Openai Blog·1mo ago·source ↗

Toward understanding and preventing misalignment generalization

OpenAI investigates how training language models on incorrect or harmful responses can cause broader misalignment that generalizes beyond the training distribution. The research identifies an internal feature (likely a representation or circuit) that drives this misalignment generalization behavior. Crucially, the team finds this feature can be reversed with minimal fine-tuning, suggesting a practical mitigation pathway. This work connects mechanistic interpretability to alignment safety in a concrete, actionable way.

Evaluation and Benchmarking AI Safety Research emergent misalignment mechanistic interpretability OpenAI +2 more

5arXiv · cs.CL·11d ago·source ↗

GGRO: Gradient-Guided Reward Optimization for inference-time LLM alignment

Researchers introduce Gradient-Guided Reward Optimization (GGRO), an inference-time alignment method that uses gradient signals from a reward model to inject 'nudging tokens' at high-uncertainty decoding steps, rather than relying on sampling-intensive re-ranking approaches like Best-of-N. The method monitors token-level entropy to detect distribution drift and steers generation trajectories directly, claiming improved robustness to reward hacking with minimal computational overhead. Experiments show gains across safety, helpfulness, and reasoning benchmarks compared to standard inference-time alignment baselines.

Inference Economics Alignment and RLHF Best-of-N Sampling Gradient-Guided Reward Optimization

6arXiv · cs.CL·17d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

4arXiv · cs.LG·11d ago·source ↗

Agency-transferring technique improves RL policy training by bootstrapping from baseline policies

A new arXiv paper proposes a model-free reinforcement learning method that embeds an existing suboptimal baseline policy into training via an arbitration mechanism, progressively transferring control from the baseline to a trainable neural network. The approach yields high goal-reaching rates from the start of training and produces a standalone policy that outperforms the baseline without requiring it at inference time. Theoretical bounds on goal-reaching probability are derived, and empirical results on continuous-control benchmarks show competitive or superior returns compared to existing methods.

Alignment and RLHF An Agency-Transferring Model-Free Policy Enhancement Technique

4arXiv · cs.CL·12d ago·source ↗

Acoustic cue alignment tokens improve speech emotion recognition in audio language models

Researchers study whether instruction-following audio language models (ALMs) use explicit acoustic cues in a grounded way when raw audio is already available. They derive six interpretable acoustic concept tokens from the eGeMAPS feature set and append them to text prompts, testing on FAU-Aibo and IEMOCAP benchmarks. Aligned tokens improve unweighted average recall while shuffled or corrupted tokens degrade performance, but models don't fully collapse under perturbation, indicating partial anchoring to the audio signal. The work offers a practical probing method for interpretability and robustness in affective computing with ALMs.

Evaluation and Benchmarking Multimodal Progress FAU-Aibo Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition IEMOCAP +1 more