TF-RefusalBench: Measuring and mitigating over-alignment in multilingual criminal law LLM applications
Researchers introduce TF-RefusalBench, a 5,200-prompt multilingual benchmark derived from Swiss Federal Supreme Court rulings to measure over-alignment (excessive refusals and disclaimers) in LLMs handling criminal law translation and summarization tasks. The benchmark covers French, German, Italian, and English and reveals that over-alignment is influenced by model choice, prompt language, and text language. The paper evaluates mitigation strategies including prompting and abliteration (refusal direction ablation), finding that abliteration eliminates refusals at minimal cost to task performance. The work is grounded in a real deployment context: the Swiss Federal Supreme Court already uses on-premises LLMs for translation and summarization.
Related events (8)
BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories
BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability
AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.
The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation
Researchers introduce the 'Shibboleth Effect' — systematic behavioral differences in LLMs when operating in different languages — and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish while Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms — chain-of-thought institutional anchoring and multilingual RLHF alignment — with direct implications for deploying LLMs in diplomatic or crisis-management contexts.
Benchmarking Local LLMs for Confidential Translation Workflows
This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMT systems (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of the top commercial NMT systems. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.
PsychoSafe: Framework for Psychologically-Informed LLM Refusals in High-Risk Interactions
Researchers introduce PsychoSafe, a refusal framework that reframes LLM non-compliance as structured supportive communication grounded in evidence-based psychological intervention strategies. The work constructs an 8,019-example prompt-response corpus across five risk domains and applies prompting and parameter-efficient fine-tuning to Qwen 3.5 27B, achieving a 28.1% improvement in refusal quality over a generic baseline, with notable gains in resource referral and psychological grounding. Evaluations on SORRY-Bench and XSTest reveal strong in-domain robustness but limited out-of-domain generalization, pointing to a need for more diverse fine-tuning data. The framework is relevant to safety alignment work targeting crisis, coercion, and escalating-intent scenarios.
FRANZ: A Communicative Audit Framework for LLM Response Framing on Subjective Questions
Researchers introduce FRANZ, an automated framework for auditing how LLMs frame responses to subjective, culturally-sensitive questions across four dimensions: cultural positioning, generalizing language, anthropomorphic cues, and conversational maxims. The work is paired with SQUARE, a 376k-question corpus drawn from 57 subreddits and mapped to 7 countries and 19 question categories. Applying FRANZ to three open-weight LLMs reveals statistically significant differences in framing behavior, and uncovers a positive coupling between insider positioning and anthropomorphism that varies by country. The study argues that existing evaluations focused on factual correctness miss important communicative dimensions of LLM outputs.
LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting
Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.


