6 · arXiv cs.CL (Computation and Language) · 24d ago

Activation Steering for Synthetic Safety Data Generation: Diversity as a Critical Quality Axis

This paper investigates whether activation steering (AS) can generate high-quality synthetic training data for downstream safety detection classifiers, filling a gap in the literature. Across 4 safety concepts × 2 models × 4 steering methods, the authors find that AS-generated data outperforms prompt-generated data on 3 of 4 concepts, but only 41 of 136 configurations succeed, indicating a narrow effective regime. The study introduces sample- and set-level diversity as a previously absent quality axis, finding that higher steering strength reduces diversity and that the harmonic mean of success, coherence, and diversity correlates more reliably with downstream AUROC than prior metrics alone. The results provide a practical heuristic for practitioners tuning AS hyperparameters for safety data generation.
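As a rough illustration of the combined metric mentioned above, the sketch below computes the harmonic mean of success, coherence, and diversity for a few steering configurations and ranks them. It assumes all three scores are already normalized to [0, 1]; the function names, dictionary keys, and example values are hypothetical and are not taken from the paper.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; 0.0 if any score is 0."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


def combined_score(config):
    """Combine the three quality axes of one configuration (assumed keys)."""
    return harmonic_mean(
        [config["success"], config["coherence"], config["diversity"]]
    )


def rank_configs(configs):
    """Return configurations sorted by combined score, best first."""
    return sorted(configs, key=combined_score, reverse=True)


# Illustrative values only: a strongly steered configuration with high
# success but collapsed diversity, and a moderate, balanced one.
example_configs = [
    {"name": "strong", "success": 0.9, "coherence": 0.8, "diversity": 0.2},
    {"name": "moderate", "success": 0.7, "coherence": 0.7, "diversity": 0.7},
]
ranked = rank_configs(example_configs)
```

Because the harmonic mean is pulled toward its smallest input, a configuration that scores well on success but poorly on diversity ranks below a balanced one, which matches the summary's point that stronger steering can reduce diversity.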

Evaluation and Benchmarking · AI Safety Research · Alignment and RLHF · Safety Detection Classifier · HHH (Helpful, Harmless, Honest) · Activation Steering · AUROC · Synthetic Data Generator

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

6 · arXiv · cs.AI · 19d ago

SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.

Evaluation and Benchmarking · AI Safety Research · on-policy distillation · SafeSteer · alignment tax · +4 more

5 · arXiv · cs.CL · 1mo ago

SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks

SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.

Evaluation and Benchmarking · Agent and Tool Ecosystem · multi-turn agent benchmarks · tool-calling agents · SynAE · +1 more

5 · arXiv · cs.CL · 11d ago

Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation

A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.

Evaluation and Benchmarking · Alignment and RLHF · Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

7 · arXiv · cs.AI · 26d ago

Retrying vs Resampling in AI Control: Safety Tradeoffs in Coding Scaffolds

This paper analyzes two strategies for handling flagged actions in AI coding scaffolds—retrying (blocking risky actions and continuing) and resampling (drawing multiple samples from the same context)—from an AI control perspective that treats the model as potentially adversarial. The authors find that retrying backfires because the untrusted model can exploit monitor rationale to craft stealthier attacks, while resampling avoids this information leakage. Using Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the monitor on the BashArena benchmark, they show that drawing five samples per step and auditing on maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget. Two findings contradict prior work: auditing on maximum (not minimum) suspicion scores is better, and executing the least suspicious sample yields only marginal safety gains.
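The resample-then-audit rule described above can be sketched as follows: each step has several monitor suspicion scores (one per drawn sample), the scores are aggregated per step, and the most suspicious fraction of steps is sent for audit. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the use of a per-episode fraction as the audit budget, and the example scores are all hypothetical.

```python
import math


def select_steps_to_audit(step_scores, audit_budget, aggregate=max):
    """Pick which steps to audit from per-sample suspicion scores.

    step_scores: list of lists, one inner list of monitor scores per step
                 (e.g. five scores when five samples are drawn per step).
    audit_budget: fraction of steps that may be audited, in [0, 1].
    aggregate: how to collapse a step's scores; max by default.
    Returns the sorted indices of the steps chosen for audit.
    """
    if not 0.0 <= audit_budget <= 1.0:
        raise ValueError("audit_budget must be a fraction in [0, 1]")
    aggregated = [aggregate(scores) for scores in step_scores]
    n_audit = math.floor(audit_budget * len(step_scores))
    # Rank steps from most to least suspicious under the chosen aggregate.
    by_suspicion = sorted(
        range(len(aggregated)), key=lambda i: aggregated[i], reverse=True
    )
    return sorted(by_suspicion[:n_audit])


# Illustrative scores: step 1 has one highly suspicious sample among
# benign-looking ones; step 2 is uniformly mildly suspicious.
example_scores = [
    [0.1, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.4, 0.4, 0.4],
    [0.0, 0.1, 0.2],
]
audit_on_max = select_steps_to_audit(example_scores, audit_budget=0.25)
audit_on_min = select_steps_to_audit(
    example_scores, audit_budget=0.25, aggregate=min
)
```

In this toy example the two aggregates audit different steps: the maximum flags the step where a single sample looked highly suspicious, while the minimum flags the step whose samples were all mildly suspicious.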

Evaluation and Benchmarking · AI Safety Research · Claude Opus 4.6 · MiMo-V2-Flash · Ctrl-Z · +6 more

5 · OpenAI Blog · 1mo ago

Reducing bias and improving safety in DALL·E 2

OpenAI announced a new technique applied to DALL·E 2 that adjusts image generation of people to better reflect global demographic diversity. The intervention targets representational bias in the model's outputs when generating human subjects. This is an early public example of a major lab deploying a post-training bias mitigation technique in a production image generation system.

AI Safety Research · Alignment and RLHF · DALL·E 3 · OpenAI · +1 more

3 · arXiv · cs.LG · 12d ago

Systematic framework for selecting trajectories in data augmentation evaluated across five strategies

A thesis-derived arXiv preprint proposes a framework for evaluating five trajectory selection strategies—Outlierness, Diversity, Representativeness, Uncertainty, and Random—for data augmentation in spatio-temporal ML tasks. The study tests these strategies across four datasets spanning animal behavior, maritime, and urban traffic domains using linear and non-linear models with Optuna-based hyperparameter optimization. Key findings show systematic strategies (especially Outlierness and Uncertainty) outperform random selection in sparse datasets but can degrade performance in dense, high-quality datasets, with UMAP visualization confirming topological effects.

Evaluation and Benchmarking · Optuna · A Systematic Approach for Selecting Trajectories for Data Augmentation · UMAP

6 · arXiv · cs.CL · 25d ago

SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering

SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.

Training Infrastructure · Evaluation and Benchmarking · mechanistic interpretability · GRPO · Reinforcement Learning from Human Feedback · +6 more

7 · OpenAI Blog · 1mo ago

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases · AI Safety Research · Reinforcement Learning from Human Feedback · OpenAI · deliberative alignment · +2 more