5 · arXiv cs.CL (Computation and Language) · 40h ago

TF-RefusalBench: Measuring and mitigating over-alignment in multilingual criminal law LLM applications

Researchers introduce TF-RefusalBench, a 5,200-prompt multilingual benchmark derived from Swiss Federal Supreme Court rulings to measure over-alignment (excessive refusals and disclaimers) in LLMs handling criminal law translation and summarization tasks. The benchmark covers French, German, Italian, and English and reveals that over-alignment is influenced by model choice, prompt language, and text language. The paper evaluates mitigation strategies including prompting and abliteration (refusal direction ablation), finding abliteration eliminates refusals with minimal task performance cost. The work is grounded in a real deployment context: the Swiss Federal Supreme Court already uses on-premises LLMs for translation and summarization.

Evaluation and Benchmarking · AI Safety Research · Enterprise Deployment Patterns · abliteration · Swiss Federal Supreme Court · TF-RefusalBench

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 46h ago

BabelJudge: Benchmark for measuring LLM-as-a-judge reliability across languages and agent trajectories

BabelJudge is a new open-source benchmark and audit framework that systematically measures four failure modes in LLM-as-a-judge systems: position bias, verbosity bias, order inconsistency, and cross-lingual degradation. The framework uses a 'gold-labelling by degradation' technique to generate labeled evaluation pairs without human annotation. Evaluation of Qwen2.5-7B-Instruct-4bit across English, Hindi, Arabic, and Swahili reveals severe cross-lingual reliability drops, with Swahili order consistency collapsing to near-random (0.480). The framework is extended to agentic evaluation with nine trajectory-level perturbations and three new metrics, released as a Python package supporting 11 judge backends.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BabelJudge · Qwen2.5-7B-Instruct-1M · Shreyaskc

5 · arXiv · cs.CL · 27d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge · +2 more

5 · arXiv · cs.CL · 22h ago

AdversaBench: Automated LLM red-teaming pipeline with multi-judge confirmation and cross-model transferability

AdversaBench is a new end-to-end red-teaming pipeline that mutates seed prompts using five structured operators and confirms failures via a three-judge panel with a meta-judge tiebreaker. Experiments on 45 seeds across reasoning, instruction-following, and tool-use categories produced confirmed failures for every seed. Key findings include sharp variation in operator effectiveness by category, misleading binary failure rates, judge agreement metrics distorted by label skew, and zero-shot transferability of adversarial prompts from Llama 3.1 8B to Llama 3.3 70B. Code and dataset are publicly released.

Evaluation and Benchmarking · AI Safety Research · Llama 3.1 70B · AdversaBench · Meta · +1 more

6 · arXiv · cs.CL · 14d ago

The Shibboleth Effect: Cross-lingual behavioral skew in frontier LLMs under adversarial geopolitical simulation

Researchers introduce the 'Shibboleth Effect' (systematic behavioral differences in LLMs when operating in different languages) and audit six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, DeepSeek-R1) using a synthetic maritime territorial dispute wargame played in English versus Turkish. Results are heterogeneous: Llama-4 becomes significantly more coercive in Turkish, Gemini-3.1-Pro and DeepSeek-R1 become less so, and GPT-4o shows no detectable shift. The study identifies two candidate buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, with direct implications for deploying LLMs in diplomatic or crisis-management contexts.

Evaluation and Benchmarking · AI Safety Research · DeepSeek V4 · Mistral Large 2 · GPT-4o · +8 more

4 · arXiv · cs.CL · 23d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking · Open Weights Progress · Ollama · GPT-5.2 · DeepL · +8 more

5 · arXiv · cs.CL · 15d ago

PsychoSafe: Framework for Psychologically-Informed LLM Refusals in High-Risk Interactions

Researchers introduce PsychoSafe, a refusal framework that reframes LLM non-compliance as structured supportive communication grounded in evidence-based psychological intervention strategies. The work constructs an 8,019-item prompt-response corpus across five risk domains and applies prompting and parameter-efficient fine-tuning to Qwen 3.5 27B, achieving a 28.1% improvement in refusal quality over a generic baseline, with notable gains in resource referral and psychological grounding. Evaluations on SORRY-Bench and XSTest reveal strong in-domain robustness but limited out-of-domain generalization, pointing to a need for more diverse fine-tuning data. The framework is relevant to safety alignment work targeting crisis, coercion, and escalating-intent scenarios.

AI Safety Research · Alignment and RLHF · Qwen 3.5 27B · SORRY-Bench · XSTest · +1 more

5 · arXiv · cs.CL · 22d ago

FRANZ: A Communicative Audit Framework for LLM Response Framing on Subjective Questions

Researchers introduce FRANZ, an automated framework for auditing how LLMs frame responses to subjective, culturally-sensitive questions across four dimensions: cultural positioning, generalizing language, anthropomorphic cues, and conversational maxims. The work is paired with SQUARE, a 376k-question corpus drawn from 57 subreddits and mapped to 7 countries and 19 question categories. Applying FRANZ to three open-weight LLMs reveals statistically significant differences in framing behavior, and uncovers a positive coupling between insider positioning and anthropomorphism that varies by country. The study argues that existing evaluations focused on factual correctness miss important communicative dimensions of LLM outputs.

Evaluation and Benchmarking · AI Safety Research · Reddit · SQUARE · FRANZ · +1 more

4 · arXiv · cs.CL · 1mo ago

LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting

Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.

Evaluation and Benchmarking · Agent and Tool Ecosystem · LexNeo-Bench · knowledge graph prompting · LuxBorrow · +2 more