arXiv cs.AI (Artificial Intelligence) · 1mo ago

MedFocus: Causal Visual Attribution Framework for Chest X-ray Reasoning in Large Vision-Language Models

This paper addresses the faithfulness of visual attribution methods in Large Vision-Language Models (LVLMs) applied to chest X-ray (CXR) analysis. The authors develop a causal evaluation framework using counterfactual editing to verify whether expert-annotated regions are causally responsible for model predictions, testing 11 attribution methods across six open-source LVLMs. Finding that existing attribution methods frequently fail to identify the actual visual evidence used by models, they propose MedFocus, a concept-based attribution method using unbalanced optimal transport to localize anatomical regions and measure their causal effect on outputs. MedFocus substantially outperforms prior methods and provides spatial, concept-level, and token-level attributions.
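The causal test described above can be illustrated with a minimal sketch: edit a region out of the image and record how much the model's score drops. Everything here is an assumption for illustration, not the paper's implementation: `toy_score` is a hypothetical stand-in for an LVLM's answer confidence, a constant fill stands in for the counterfactual edit, and the unbalanced optimal transport step that MedFocus uses for localization is not shown.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
# (row_start, row_end, col_start, col_end), end-exclusive
Box = Tuple[int, int, int, int]


def edit_region(image: Image, box: Box, fill: float) -> Image:
    """Return a copy of `image` with the pixels inside `box` replaced by `fill`."""
    r0, r1, c0, c1 = box
    edited = [row[:] for row in image]
    for r in range(r0, r1):
        for c in range(c0, c1):
            edited[r][c] = fill
    return edited


def causal_effect(score: Callable[[Image], float], image: Image,
                  box: Box, fill: float = 0.0) -> float:
    """Drop in the score when the region is edited out (larger = more causal)."""
    return score(image) - score(edit_region(image, box, fill))


def toy_score(image: Image) -> float:
    """Hypothetical model: confidence equals the mean of the top-left 2x2 patch."""
    patch = [image[r][c] for r in range(2) for c in range(2)]
    return sum(patch) / len(patch)


image = [
    [1.0, 1.0, 0.2, 0.2],
    [1.0, 1.0, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
]

relevant = causal_effect(toy_score, image, (0, 2, 0, 2))    # region the toy model uses
irrelevant = causal_effect(toy_score, image, (2, 4, 2, 4))  # region the toy model ignores
```

In this toy setup the top-left region has an effect of 1.0 and the bottom-right region 0.0, so a faithful attribution method should rank the first region above the second.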

Evaluation and Benchmarking · AI Safety Research · Enterprise Deployment Patterns · Multimodal Progress · Unbalanced Optimal Transport · MedFocus · Chest X-ray Reasoning · Vision-Language Models · CXR-VQA · Counterfactual Editing

Related guides (4)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: The Shifting Yardstick of AI Capability

Related events (8)

arXiv · cs.CL · 24d ago

The Abstraction Gap in Vision-Language Causal Reasoning

Researchers introduce a dual-probe methodology and the CAGE benchmark (49,500 questions across 5,500 images) to distinguish linguistic plausibility from faithful causal reasoning in vision-language models. An Abstraction Gap (AG) metric quantifies the normalized performance difference between text-only and chain-of-reasoning probes. Evaluating eight VLMs, seven exhibit AG exceeding 0.50—generating fluent causal text but failing structured causal chain tasks—while one model achieves near-zero AG, suggesting architectural and pretraining choices are decisive. Fine-tuning on 45,000 chain-annotated examples fails to close the gap, pointing to a fundamental capability distinction.
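As a rough illustration of how a normalized gap of this kind could be computed, the sketch below divides the accuracy difference between the two probes by the text-only accuracy. The summary does not give the paper's exact formula, so this normalization and the function name `abstraction_gap` are assumptions.

```python
def abstraction_gap(text_only_acc: float, chain_acc: float) -> float:
    """Assumed form: (text-only accuracy - chain accuracy) / text-only accuracy.

    Values near 0 mean the model does about as well on structured causal
    chains as on text-only probes; values near 1 mean the chain probe
    performance collapses relative to the text-only probe.
    """
    if not 0.0 < text_only_acc <= 1.0:
        raise ValueError("text_only_acc must be in (0, 1]")
    if not 0.0 <= chain_acc <= 1.0:
        raise ValueError("chain_acc must be in [0, 1]")
    return (text_only_acc - chain_acc) / text_only_acc


# Hypothetical accuracies, not results from the paper.
large_gap = abstraction_gap(text_only_acc=0.8, chain_acc=0.2)
no_gap = abstraction_gap(text_only_acc=0.5, chain_acc=0.5)
```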

Evaluation and Benchmarking · Agent and Tool Ecosystem · Pearl's Causal Hierarchy · CAGE · Text-Only Probe

arXiv · cs.LG · 27d ago

FM-CGM: Foundation Model Framework for Zero-Shot Visual Causal Generative Modeling

FM-CGM is a modular framework that decomposes visual causal reasoning into three components—concept extractor, concept manipulator, and counterfactual generator—using pretrained foundation models without task-specific causal training. The approach combines a large reasoning model for causal inference with a text-to-image diffusion model for generation, enabling zero-shot causal discovery and counterfactual image synthesis. A novel cross-attention mechanism called Causal Semantic Guidance (CSG) ensures that semantic interventions propagate correctly through causal descendants while preserving unaffected image regions. Empirical results show the framework can identify plausible causal structures and generate faithful counterfactual images.

AI Safety Research · Agent and Tool Ecosystem · cross-attention · causal generative modeling · FM-CGM

arXiv · cs.CL · 12d ago

BODHI: Contrastive embedding training for causal discovery in Large Behavioural Models

Researchers identify a critical failure mode in biomedical language model embeddings: off-the-shelf encoders (BioBERT, PubMedBERT, BioM-ELECTRA) assign high cosine similarity (0.76–0.92) to causally unrelated cross-domain pairs, achieving 0% accuracy on cross-domain discrimination. The paper introduces BODHI, a contrastive training approach using hard negatives mined from a biomedical knowledge graph, which improves within-vs-across-domain separation from 1.05x to 2.30x and raises the discrimination gap by 0.392. The work targets Large Behavioural Models (LBMs), foundation models that reason over personal life graphs, where false embedding proximity directly produces false causal edges. Additional contributions include an OpenVINO inference optimization achieving a 133x latency reduction (1367 ms to 10 ms) on Intel AMX hardware, plus a counterintuitive finding that FP16 outperforms INT8 on this silicon.

Evaluation and Benchmarking · Inference Economics · BIOSSES · BioBERT · PubMedBERT

arXiv · cs.CL · 10d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench

arXiv · cs.CL · 12d ago

Benchmark for view-level visual evidence identification in multi-view MLLMs for autonomous driving

A new arXiv preprint introduces a multi-view visual question answering benchmark targeting evidence-source identification in autonomous driving scenarios. Given six synchronized NuScenes camera views and a question, models must identify which camera view supports the answer — not just produce a correct answer. The 122-pair benchmark spans causality, counterfactual reasoning, and intent prediction, and exposes grounding failures that answer-only evaluation misses. The work addresses a meaningful gap between answer accuracy and correct visual grounding in safety-critical multimodal systems.

Evaluation and Benchmarking · Multimodal Progress · NuScenes · Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

arXiv · cs.CL · 26d ago

CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents

CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.
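To make the gap between task accuracy and mechanism recovery concrete, here is a minimal sketch of an F1 score over directed edges. The benchmark's exact "all-edge F1" definition is not given in this summary, so treating it as set-based precision and recall over directed edges is an assumption, and the graphs below are made up.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(true_edges: Set[Edge], predicted_edges: Set[Edge]) -> float:
    """F1 over directed edges; a reversed edge counts as both a miss and a false alarm."""
    true_positives = len(true_edges & predicted_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3-node example, not data from the benchmark.
true_graph = {("A", "B"), ("B", "C"), ("A", "C")}
predicted_graph = {("A", "B"), ("C", "B")}

score = edge_f1(true_graph, predicted_graph)
```

The prediction recovers one of three true edges and reverses another, giving precision 0.5, recall 1/3, and an F1 of 0.4. An agent could still predict outcomes well with such a graph, which is the kind of gap the benchmark is designed to expose.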

Evaluation and Benchmarking · AI Safety Research · all-edge F1 · GPT-5.2-high · causal discovery

arXiv · cs.LG · 24d ago

Label-Free Bias Identification in Vision Models via Gradient Probes on Concept Decompositions

This paper introduces a post-hoc, label-free method for identifying spurious correlations in frozen vision classifiers without requiring bias annotations, group labels, or retraining. The approach applies non-negative matrix factorization to intermediate activations to extract interpretable concept vectors, then ranks them using a gradient-based bias estimator derived from misclassified examples. On Colored MNIST, Waterbirds, and CelebA benchmarks, the method recovers known spurious cues and improves worst-group accuracy by up to 17.9 percentage points on Waterbirds by suppressing top-ranked concepts at inference time. Notably, the method surfaces decision-relevant directions that do not always coincide with annotated attributes, offering both an auditing tool and a debiasing handle for deployed models.
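Worst-group accuracy, the metric quoted above, is the accuracy on the weakest subgroup rather than the average over all examples. A minimal sketch follows; the group names and predictions are hypothetical, and the paper's concept extraction and suppression steps are not shown.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_accuracies(labels: Sequence[int], preds: Sequence[int],
                     groups: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Accuracy computed separately for each group identifier."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(labels: Sequence[int], preds: Sequence[int],
                         groups: Sequence[Hashable]) -> float:
    """Minimum per-group accuracy; low values flag reliance on a spurious cue."""
    return min(group_accuracies(labels, preds, groups).values())


# Hypothetical data: the classifier is perfect on the majority group only.
labels = [1, 1, 0, 0, 1, 0]
preds = [1, 1, 0, 1, 0, 0]
groups = ["majority", "majority", "majority", "minority", "minority", "minority"]

overall = sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)
worst = worst_group_accuracy(labels, preds, groups)
```

Overall accuracy here is 4/6 while the minority group sits at 1/3, which is why average accuracy alone can hide a spurious correlation.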

Evaluation and Benchmarking · AI Safety Research · Colored MNIST · Waterbirds · CelebA

arXiv · cs.CL · 1mo ago

Vision-OPD: On-Policy Self-Distillation for Fine-Grained Visual Understanding in MLLMs

Vision-OPD addresses a 'regional-to-global perception gap' in multimodal LLMs, where models answer fine-grained visual questions more accurately when given cropped evidence regions than full images. The method instantiates a crop-conditioned teacher and full-image-conditioned student from the same MLLM, minimizing token-level divergence along on-policy rollouts to transfer regional perception to the full-image policy. This self-distillation requires no external teacher models, ground-truth labels, reward verifiers, or inference-time tools. Benchmarks show competitive or superior performance against larger open-source, closed-source, and agentic 'Thinking-with-Images' models.
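The token-level objective can be sketched as an average per-token divergence between the teacher's and the student's next-token distributions. The summary does not say which divergence or direction the paper uses, so KL(teacher || student) is an assumption here, and the logits are made-up toy values over a three-token vocabulary.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits: List[float], student_logits: List[float]) -> float:
    """KL(teacher || student) for a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_divergence(teacher: List[List[float]], student: List[List[float]]) -> float:
    """Mean per-token KL along one rollout."""
    per_token = [token_kl(t, s) for t, s in zip(teacher, student)]
    return sum(per_token) / len(per_token)


# Toy logits: crop-conditioned teacher vs. full-image student on the same rollout.
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
student = [[0.0, 2.0, -1.0], [0.5, 0.5, 0.5]]

matched = sequence_divergence(teacher, teacher)
mismatched = sequence_divergence(teacher, student)
```

In training, this quantity would be minimized with respect to the student's parameters while the rollout tokens are sampled from the student itself; the sketch only evaluates the divergence.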

Evaluation and Benchmarking · Agent and Tool Ecosystem · Multimodal Large Language Models · Thinking-with-Images · on-policy self-distillation