technique
ALIGNBEAM
technique · active · provisional
alignbeam-d45de65f · 1 event · first seen 6d ago
Aliases: ALIGNBEAM
ALIGNBEAM: Training-free safety alignment transfer across model families at inference time
ALIGNBEAM is a training-free inference-time method that transfers safety alignment from a safe anchor model to a domain-fine-tuned target model, even when the two models have different vocabularies. It works by translating anchor logits into the target model's vocabulary token-by-token at each decoding step, then using a small LLM judge to select the safest among K candidate continuations. The method addresses a known vulnerability where domain fine-tuning degrades safety, and demonstrates substantial refusal improvements on adversarial benchmarks without retraining either model or incurring prohibitive inference overhead.
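The decoding step described above can be illustrated with a toy sketch. This is not the ALIGNBEAM implementation: the summary does not say how anchor logits are translated across vocabularies or how the judge scores candidates, so everything here is an assumption made for illustration. In particular, `translate_probs` maps tokens by exact string match, `top_k_candidates` mixes the two distributions with a hypothetical weight `alpha`, candidates are single tokens rather than longer continuations, and `toy_judge` is a keyword stand-in for the small LLM judge.

```python
import math
from typing import Callable, Dict, List


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def translate_probs(anchor_probs: Dict[str, float],
                    target_vocab: List[str]) -> Dict[str, float]:
    """Project anchor probabilities onto the target vocabulary.

    ASSUMPTION: tokens are matched by exact string; anchor tokens with no
    counterpart in the target vocabulary are dropped and the rest renormalized.
    """
    shared = {tok: anchor_probs.get(tok, 0.0) for tok in target_vocab}
    z = sum(shared.values())
    if z == 0.0:  # no overlap at all: fall back to a uniform distribution
        return {tok: 1.0 / len(target_vocab) for tok in target_vocab}
    return {tok: p / z for tok, p in shared.items()}


def top_k_candidates(target_logits: Dict[str, float],
                     anchor_logits: Dict[str, float],
                     k: int, alpha: float = 0.5) -> List[str]:
    """Return the K most likely next tokens under a target/anchor mixture.

    ASSUMPTION: `alpha` is a hypothetical mixing weight, not a parameter
    named in the source.
    """
    target_probs = softmax(target_logits)
    anchor_probs = translate_probs(softmax(anchor_logits), list(target_logits))
    mixed = {tok: (1 - alpha) * target_probs[tok] + alpha * anchor_probs[tok]
             for tok in target_logits}
    return sorted(mixed, key=mixed.get, reverse=True)[:k]


def select_safest(prefix: str, candidates: List[str],
                  judge: Callable[[str], float]) -> str:
    """Pick the candidate whose continuation the judge scores as safest."""
    return max(candidates, key=lambda tok: judge(prefix + tok))


def toy_judge(text: str) -> float:
    """Stand-in for the small LLM judge: rewards refusal-like text."""
    return 1.0 if "Sorry" in text else 0.0


# Toy next-token logits; the two "models" deliberately have different vocabularies.
target_logits = {"Sure": 3.0, "Sorry": 1.0, "Here": 2.0, "the": 0.5}
anchor_logits = {"Sorry": 4.0, "Sure": 0.5, "I": 2.0}

candidates = top_k_candidates(target_logits, anchor_logits, k=2)
chosen = select_safest("", candidates, toy_judge)
print(candidates, chosen)
```

In this toy setup the fine-tuned target alone would prefer a compliant opening, while the anchor's translated distribution pulls a refusal token into the candidate set, where the judge can select it. A real system would need a more careful cross-tokenizer mapping than string equality, since different tokenizers rarely segment text identically.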