4arXiv cs.CL (Computation and Language)·2d ago

Empirical study of LLM medical domain adaptation trade-offs in French QA

Researchers present a systematic comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for adapting LLMs to French medical question answering. The study spans three model families, multiple sizes, and three initialization types, evaluating both multiple-choice and open-ended QA formats. Key findings: CPT+SFT yields the best MCQA scores but gains over SFT alone are often not statistically significant, making SFT a cost-effective default; for open-ended QA, CPT improves overlap metrics while SFT degrades generation quality. Cross-lingual transfer from French adaptation to English benchmarks is also demonstrated.

Evaluation and Benchmarking Enterprise Deployment Patterns Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Related guides (2)

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·23d ago·source ↗

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

5arXiv · cs.CL·12d ago·source ↗

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking AI Safety Research BioLlama3 BioBERT MedMCQA +3 more

5arXiv · cs.CL·9d ago·source ↗

New Polish medical exam benchmark reveals MCQA overestimates LLM clinical competence

Researchers introduce an expanded Polish medical exam benchmark with over 15,000 new questions, two new domains, and four structural modifications designed to reduce multiple-choice artifacts and better test reasoning. Evaluating 21 LLMs under the harder setup, the best-performing model (Qwen3.5-122B) drops 28-31 percentage points compared to standard MCQA scores. The findings suggest standard MCQA benchmarks do not reliably reflect true medical competence, even when data contamination is low. The benchmark is publicly released to support further research.

Evaluation and Benchmarking Qwen3.5-122B Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

6arXiv · cs.CL·17d ago·source ↗

Adversarial robustness and safety alignment in multilingual multimodal LLMs: cross-lingual vulnerability and 'safety-by-failure'

A systematic study evaluates adversarial robustness and safety alignment of multimodal LLMs across 12 languages, finding that adversarial images optimized in one language transfer to others (cross-lingual transferability). The paper introduces the concept of 'safety-by-failure': low-resource languages appear safer not due to genuine alignment but because models fail to comprehend harmful instructions in those languages. Models like Qwen3-VL that integrate multilingual capability throughout training (rather than only at instruction tuning) show genuine cross-lingual safety with active refusal. The findings challenge the assumption that low-resource language safety metrics reflect real alignment.

Evaluation and Benchmarking AI Safety Research Qwen3-4B Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models +1 more

5arXiv · cs.CL·9d ago·source ↗

Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods

A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.

Evaluation and Benchmarking Alignment and RLHF On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

4arXiv · cs.CL·19d ago·source ↗

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking Open Weights Progress Ollama GPT-5.2 DeepL +8 more

4Hugging Face Blog·1mo ago·source ↗

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking Multimodal Progress Visual Question Answering LAVE Hugging Face +1 more

5Hugging Face Blog·1mo ago·source ↗

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking Open Weights Progress PubMedQA Open Medical-LLM Leaderboard MedMCQA +3 more