Almanac

abliteration

technique · active · provisional · abliteration-e2f0334c · 1 event · first seen 40h ago

Aliases: abliteration

Recent events (1)

5 · arXiv · cs.CL · 40h ago

TF-RefusalBench: Measuring and mitigating over-alignment in multilingual criminal law LLM applications

Researchers introduce TF-RefusalBench, a 5,200-prompt multilingual benchmark derived from Swiss Federal Supreme Court rulings to measure over-alignment (excessive refusals and disclaimers) in LLMs handling criminal law translation and summarization tasks. The benchmark covers French, German, Italian, and English and reveals that over-alignment is influenced by model choice, prompt language, and text language. The paper evaluates mitigation strategies including prompting and abliteration (refusal direction ablation), finding abliteration eliminates refusals with minimal task performance cost. The work is grounded in a real deployment context: the Swiss Federal Supreme Court already uses on-premises LLMs for translation and summarization.
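The summary does not say how the paper computes or applies the refusal direction. As a rough illustration only, the sketch below assumes the commonly described difference-of-means variant of refusal direction ablation: estimate a unit direction from activations collected on refused versus answered prompts, then project that direction out of an activation. The function names, the use of plain Python lists in place of real model activations, and the toy numbers are all assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused: List[Vector], complied: List[Vector]) -> Vector:
    """Unit vector from the mean 'complied' activation to the mean 'refused' one.

    Assumption: the direction is a simple difference of means; the paper's
    actual estimation procedure is not described in the summary.
    """
    diff = [a - b for a, b in zip(mean_vector(refused), mean_vector(complied))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to ablate")
    return [x / norm for x in diff]


def ablate(activation: Vector, direction: Vector) -> Vector:
    """Remove the component of `activation` that lies along unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy 2-D activations (made-up numbers, not data from the paper).
refused = [[2.0, 1.0], [4.0, 1.0]]
complied = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(refused, complied)
cleaned = ablate([5.0, 7.0], direction)
print(direction, cleaned)
```

In a real model the same projection would be applied to hidden states, or folded into the weight matrices that write to them, rather than to hand-written lists.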