Entity · technique

post-training alignment

technique · active · post-training-alignment-e386ed36 · 2 events · first seen May 25, 2026

Aliases: post-training alignment, alignment training

Co-occurring entities

gender bias in VLMs · Vision-Language Models · visual-token activation probing · occupation-gender stereotype dataset · LALS (Latent Association Leaning Score) · Mistral AI · Alibaba · Mistral · paired-scenario forced-choice probe · geopolitical bias · Qwen 2.5-7B

More like this (12)

post-training compression Consistency Training Can Entrench Misalignment AI alignment MedAlign REAlignment Reward The Alignment Project Post-Training Shifts Confidence: A Three-Stage Analysis of How SFT, RL, and OPD Shape Pre-, Intra-, and Post-CoT Calibration ALIGN Positive Alignment PostTrainBench consistency training emergent misalignment

Recent events (2)

6 · arXiv · cs.CL · Jun 1, 2026

Vision-Language Models Suppress Female Representations Under Ambiguous Input

This paper investigates gender bias in vision-language models (VLMs) when inputs are ambiguous (e.g., workers in full gear or seen from behind), finding that models default to male outputs even for strongly female-stereotyped occupations. The authors introduce LALS (Latent Association Leaning Score), a zero-shot metric that probes internal visual-token activations to measure concept associations across layers. Across 15 occupations, 800+ ambiguous images, and four VLMs, they find a systematic decoupling: models internally encode female associations but suppress them before generation, with male signals amplifying end-to-end while female signals peak mid-network and are filtered out. Cultural visual cues like clothing color further modulate these internal associations.

Evaluation and Benchmarking · AI Safety Research · gender bias in VLMs · Vision-Language Models · visual-token activation probing · +5 more

7 · arXiv · cs.AI · May 25, 2026

Geopolitical Bias in LLMs Originates in Post-Training, Not Pre-Training Data

A study testing seven open-weight LLM pairs (base vs. chat models) across seven labs finds that geopolitical bias is introduced during post-training rather than inherited from pre-training data. Six of seven labs showed post-training shifts favoring the developer's home country or region, with Alibaba's Qwen 2.5 showing the most extreme shift (18x increase in China-favourability log-odds). The effect is also language-dependent: Mistral becomes pro-France only under French prompting. The authors argue this implicates alignment and RLHF processes as active shapers of geopolitical perspective, calling for greater transparency and auditing of post-training pipelines.

Evaluation and Benchmarking · Open Weights Progress · Mistral AI · Alibaba · Mistral · +6 more
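
The geopolitical-bias summary above reports its headline result as a change in "favourability log-odds" between a base model and its chat counterpart. The summary does not say how that quantity is computed, so the following is only a minimal sketch of one plausible reading: each model answers a set of forced-choice prompts, and favourability is the log-odds of picking the home-country-favourable option. The function names, the smoothing constant, and the counts are all hypothetical and are not taken from the paper.

```python
import math


def log_odds(favourable, total, smoothing=0.5):
    """Log-odds of choosing the favourable option in a forced-choice probe.

    `smoothing` is an assumed additive correction that keeps the value
    finite when a model picks one option every time.
    """
    wins = favourable + smoothing
    losses = (total - favourable) + smoothing
    return math.log(wins / losses)


def post_training_shift(base_counts, chat_counts):
    """Compare a base model with its chat model.

    Each argument is a (favourable, total) pair. Returns the two log-odds
    values and their difference (chat minus base).
    """
    base_lo = log_odds(*base_counts)
    chat_lo = log_odds(*chat_counts)
    return base_lo, chat_lo, chat_lo - base_lo


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    base_lo, chat_lo, shift = post_training_shift((52, 100), (80, 100))
    print(f"base log-odds: {base_lo:.3f}")
    print(f"chat log-odds: {chat_lo:.3f}")
    print(f"shift (chat - base): {shift:.3f}")
    # A fold change is only meaningful when the base value is not near zero.
    if abs(base_lo) > 1e-9:
        print(f"ratio (chat / base): {chat_lo / base_lo:.1f}x")
```

A log-odds of zero means the model picks each option equally often, and a positive shift means post-training moved the model toward the favourable option. Whether the reported "18x" refers to a ratio of log-odds, as in the last line of the sketch, or to some other comparison is not stated in the summary.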