5 · arXiv cs.CL (Computation and Language) · 2d ago

Training framework reduces calibration error 60%+ in Medical VQA multimodal LLMs

A new arXiv preprint proposes a fine-tuning framework to improve verbalized uncertainty calibration in multimodal LLMs applied to Medical Visual Question Answering. Its composite loss function combines Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL-based stabilization. The authors evaluate the framework on MedGemma 4B IT and Qwen2-VL 7B Instruct across three medical VQA benchmarks. The method reduces calibration error by 60% or more and improves discrimination by 26% or more while preserving predictive accuracy, and it outperforms prompting-, sampling-, and training-based baselines.
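As a rough illustration of the Brier-style calibration term named above, the sketch below scores verbalized confidences against whether each answer was correct. The function names, the example values, and the weighted-sum combination are assumptions for illustration only; the summary does not say how the preprint defines, weights, or combines its four loss terms.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1)."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def weighted_sum(terms, weights):
    """Hypothetical way to combine named loss terms; the real weighting is not given."""
    return sum(weights[name] * value for name, value in terms.items())


# Made-up example: four answers with stated confidence and correctness.
stated = [0.9, 0.9, 0.6, 0.3]
was_right = [True, False, True, False]
calibration_term = brier_score(stated, was_right)
print(round(calibration_term, 4))
```

A lower score means the stated confidences track correctness more closely; a model that says 1.0 when right and 0.0 when wrong scores 0.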

Evaluation and Benchmarking · AI Safety Research · Multimodal Progress · Just how sure are you? Improving Verbalized Uncertainty Calibration in Medical VQA · Qwen2.5-7B-Instruct-1M · MedGemma 4B IT

Related guides (3)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

4 · Hugging Face Blog · 1mo ago

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

This Hugging Face blog post introduces LAVE (LLM-Assisted Visual Evaluation), a zero-shot VQA evaluation methodology applied to the Docmatix dataset. The post investigates whether large vision-language models can perform document visual question answering without task-specific fine-tuning by leveraging LLM-based evaluation metrics. The analysis probes the gap between zero-shot and fine-tuned performance on document understanding tasks, raising questions about the continued necessity of supervised adaptation for VQA.

Evaluation and Benchmarking · Multimodal Progress · Visual Question Answering · LAVE · Hugging Face · +1 more

5 · arXiv · cs.CL · 5d ago

Gazer: Training-free semantic correction for autoregressive visual models using MLLM feedback

Researchers introduce Gazer, a training-free framework that integrates multimodal large language model feedback into the sampling loop of autoregressive visual models (AVMs) to correct semantic errors during generation. The system operates in two stages: Reflective Diagnosis identifies semantic errors in intermediate generation states, and Semantic Correction rewinds and adjusts the generation trajectory to better match the target prompt. Experiments on compositional image and video benchmarks show improved semantic alignment and compositional accuracy across multiple AVMs without additional training. The work addresses a known weakness of next-scale prediction AVMs, where semantic errors accumulate across discrete generation scales.

Evaluation and Benchmarking · Multimodal Progress · Gazer · Training-Free Semantic Correction for Autoregressive Visual Models

5 · arXiv · cs.CL · 1mo ago

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

This paper investigates uncertainty quantification (UQ) for activation oracles—systems that make LLM internal activations human-legible—by evaluating 6 confidence estimation methods across 6,000 samples per oracle. The authors find that bootstrap mode frequency achieves the best calibration (ECE 5.7% vs. 25.5% for log-probability baseline on Qwen3-8B), while the log-prob baseline remains useful as a cheap triage signal. Experiments vary verbalizer and context prompts across two Qwen3 model sizes. Code and a patched trainer are released publicly.
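For readers unfamiliar with the ECE figures quoted above, the following is a minimal sketch of expected calibration error using equal-width confidence bins. The function name, the bin count of 10, and the example values are assumptions; the summary does not state which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, float(y)))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Made-up example: the model says 0.95 four times but is right only three times.
print(round(expected_calibration_error([0.95] * 4, [1, 1, 1, 0]), 4))
```

An ECE of 0 means that, within every bin, average stated confidence equals observed accuracy; larger values indicate over- or under-confidence.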

Evaluation and Benchmarking · AI Safety Research · Expected Calibration Error · Activation Oracles · Qwen3-4B · +4 more

6 · arXiv · cs.CL · 1mo ago

MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

MAGIC is a training-free coreset selection method for multimodal instruction tuning that uses three intrinsic signals—Multimodal Gain, Bridging Relevance, and Skill-Neuron Signatures—to identify compact, behaviorally faithful training subsets without backpropagation. The method operates in a three-stage pipeline: filtering low-gain examples, ranking by a quality objective, and bucket-wise budget allocation over neuron signatures. On LLaVA-665K and Vision-Flan datasets with 20% data budgets, MAGIC matches or slightly exceeds full fine-tuning performance (100.3% and 101.6% relative) while reducing wall-clock training time by 73.7%. Results transfer to LLaVA-1.5-7B and -13B target models.

Training Infrastructure · Inference Economics · MAGIC · LLaVA-1.5-7B · LLaVA-665K · +5 more

4 · arXiv · cs.CL · 10d ago

Empirical study of LLM medical domain adaptation trade-offs in French QA

Researchers present a systematic comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for adapting LLMs to French medical question answering. The study spans three model families, multiple sizes, and three initialization types, evaluating both multiple-choice and open-ended QA formats. Key findings: CPT+SFT yields the best MCQA scores but gains over SFT alone are often not statistically significant, making SFT a cost-effective default; for open-ended QA, CPT improves overlap metrics while SFT degrades generation quality. Cross-lingual transfer from French adaptation to English benchmarks is also demonstrated.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

6 · arXiv · cs.AI · 26d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking · Alignment and RLHF · Perceptually Perturbed Judgment Dataset · Multimodal Large Language Models · GRPO · +3 more

5 · arXiv · cs.LG · 10d ago

Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models

Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.

Evaluation and Benchmarking · Multimodal Progress · Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models · Act2Answer

5 · arXiv · cs.AI · 3d ago

FORCE: Efficient RL fine-tuning for Vision-Language-Action models via value-calibrated warm-up and self-distillation

Researchers introduce FORCE, a 3-stage reinforcement learning fine-tuning framework for Vision-Language-Action (VLA) models that addresses sample inefficiency caused by unstable Q-functions and low-quality exploration data. The framework uses a Value-Calibrated Warm-Up phase followed by Q-function-filtered policy updates, eliminating the need for costly human interventions during training. Evaluated on simulation and real-world robotic tasks, FORCE achieves a 79% absolute improvement in task success rates, outperforms prior RL methods by 10%, and accelerates training by 32.5%.

Agent and Tool Ecosystem · Alignment and RLHF · FORCE · FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation