Activation Steering for Synthetic Safety Data Generation: Diversity as a Critical Quality Axis
This paper investigates whether activation steering (AS) can generate high-quality synthetic training data for downstream safety detection classifiers, filling a gap in the literature. Across 4 safety concepts × 2 models × 4 steering methods, the authors find that AS-generated data outperforms prompt-generated data on 3 of 4 concepts, but only 41 of 136 configurations succeed, indicating a narrow effective regime. The study introduces sample- and set-level diversity as a previously absent quality axis, finding that higher steering strength reduces diversity and that the harmonic mean of success, coherence, and diversity correlates more reliably with downstream AUROC than prior metrics alone. The results provide a practical heuristic for practitioners tuning AS hyperparameters for safety data generation.
Related guides (3)
Related events (8)
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
SafeSteer proposes a safety alignment method that targets only 'safety tokens' in the output distribution rather than applying global fine-tuning, arguing that safety features are inherently sparse. It constructs a safety teacher via activation steering, then restricts a reverse KL penalty to selected safety tokens during training. The approach achieves strong safety performance across seven benchmarks with minimal capability degradation, requiring only 100 harmful samples—less than 1% of data used by prior baselines.
SynAE: Framework for Evaluating Synthetic Data Quality in Tool-Calling Agent Benchmarks
SynAE is a proposed evaluation framework for measuring how well synthetic datasets replicate and augment real data trajectories for multi-turn, tool-calling agent testing. It assesses validity, fidelity, and diversity across four metric categories: task instructions, tool calls, final outputs, and downstream evaluation. The paper demonstrates that no single metric suffices to characterize synthetic data quality, motivating multi-axis evaluation. A demo and code are publicly available.
Provenance-grounded gating and adaptive recovery improve synthetic post-training data curation
A controlled study examines two underexplored practices in synthetic post-training data pipelines: grounding filtering signals in source provenance and systematically recovering rejected samples rather than discarding them. Using adversarially injected corpora as ground-truth failure labels, the authors find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint populations (making both necessary), and that adaptive recovery via failure diagnosis and targeted regeneration outperforms naive resampling. Generator scale is the primary driver of downstream fine-tuning quality, with filtration and recovery contributing meaningfully but secondarily.
Retrying vs Resampling in AI Control: Safety Tradeoffs in Coding Scaffolds
This paper analyzes two strategies for handling flagged actions in AI coding scaffolds—retrying (blocking risky actions and continuing) and resampling (drawing multiple samples from the same context)—from an AI control perspective that treats the model as potentially adversarial. The authors find that retrying backfires because the untrusted model can exploit monitor rationale to craft stealthier attacks, while resampling avoids this information leakage. Using Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the monitor on the BashArena benchmark, they show that drawing five samples per step and auditing on maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget. Two findings contradict prior work: auditing on maximum (not minimum) suspicion scores is better, and executing the least suspicious sample yields only marginal safety gains.
Reducing bias and improving safety in DALL·E 2
OpenAI announced a new technique applied to DALL·E 2 that adjusts image generation of people to better reflect global demographic diversity. The intervention targets representational bias in the model's outputs when generating human subjects. This is an early public example of a major lab deploying a post-training bias mitigation technique in a production image generation system.
Systematic framework for selecting trajectories in data augmentation evaluated across five strategies
A thesis-derived arXiv preprint proposes a framework for evaluating five trajectory selection strategies—Outlierness, Diversity, Representativeness, Uncertainty, and Random—for data augmentation in spatio-temporal ML tasks. The study tests these strategies across four datasets spanning animal behavior, maritime, and urban traffic domains using linear and non-linear models with Optuna-based hyperparameter optimization. Key findings show systematic strategies (especially Outlierness and Uncertainty) outperform random selection in sparse datasets but can degrade performance in dense, high-quality datasets, with UMAP visualization confirming topological effects.
SAERL: Using Sparse Autoencoders to Guide LLM Reinforcement Learning Data Engineering
SAERL is a post-training data engineering framework that uses Sparse Autoencoders (SAEs) — a mechanistic interpretability tool — to extract intrinsic model signals for controlling data diversity, difficulty, and quality during RL fine-tuning. The framework applies SAE-space clustering for batch diversity, a difficulty proxy for curriculum ordering, and a quality probe for data filtering. On Qwen2.5-Math-1.5B with GRPO, SAERL achieves 3% average accuracy improvement and reaches target accuracy with 20% fewer training steps. SAE representations transfer across model families and scales, suggesting broad applicability as a lightweight data engineering tool.
Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.


