4 · arXiv cs.LG (Machine Learning) · 2d ago

MC Dropout uncertainty estimation masks sub-region calibration failures in brain tumour segmentation

An arXiv preprint evaluates Monte Carlo (MC) Dropout for voxel-level uncertainty estimation in glioma segmentation on 126 BraTS21 patients, comparing a pretrained SegResNet with a locally trained UNet-Res. Global uncertainty-error alignment is strong (AUROC ~0.97), yet on the enhancing tumour sub-region UNet-Res shows near-zero entropy and an expected calibration error (ECE) of 0.915 alongside a Dice of only 0.714: the model is highly confident where it is frequently wrong, a severe miscalibration that Dice and AUROC do not reveal. The paper argues that sub-region-specific calibration assessment is necessary for clinical safety and cannot be replaced by aggregate metrics alone.
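To make the two calibration quantities in this summary concrete, here is a minimal sketch (not the paper's code) of predictive entropy over MC Dropout passes and a binned expected calibration error for a single binary sub-region. The function names, the 10-bin scheme, the foreground-versus-background formulation, and the toy data are all assumptions made for illustration; the preprint may define ECE differently.

```python
import numpy as np

def predictive_entropy(mc_probs):
    """Binary entropy (in nats) of the mean probability over T stochastic passes.

    mc_probs: shape (T, N), foreground probability per pass and voxel.
    Returns (mean probability, entropy), each of shape (N,).
    """
    p = mc_probs.mean(axis=0)
    eps = 1e-12  # guards against log(0)
    h = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    return p, h

def expected_calibration_error(p, labels, n_bins=10):
    """Bin-weighted gap between confidence and accuracy (assumed binary ECE)."""
    pred = (p >= 0.5).astype(int)
    conf = np.where(pred == 1, p, 1 - p)      # confidence in the predicted class
    correct = (pred == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Toy data (hypothetical): every pass predicts "tumour" with probability 0.99.
T, N = 20, 1000
mc_probs = np.full((T, N), 0.99)
labels_bad = np.zeros(N, dtype=int)
labels_bad[:500] = 1                          # prediction is right only half the time
labels_ok = np.zeros(N, dtype=int)
labels_ok[:990] = 1                           # prediction is right 99% of the time

p, h = predictive_entropy(mc_probs)
ece_bad = expected_calibration_error(p, labels_bad)
ece_ok = expected_calibration_error(p, labels_ok)
print(f"max entropy={h.max():.3f}  ECE (overconfident)={ece_bad:.3f}  ECE (calibrated)={ece_ok:.3f}")
```

In this toy case the entropy is identical for both label sets, because it depends only on the predicted probabilities. Only the ECE, computed against the labels of that specific region, separates the overconfident case from the calibrated one.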

Evaluation and Benchmarking · AI Safety Research · BraTS21 · UNet-Res · Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation · Monte Carlo Dropout · SegResNet

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · arXiv · cs.AI · 1mo ago

Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment

This paper introduces a framework for evaluating alignment between artificial vision models and the human visual cortex that goes beyond scalar prediction accuracy. Using repeated fMRI data from the Natural Scenes Dataset, the authors decompose brain response spaces into reproducible dimensions and measure which of these dimensions are recovered by model predictions. A key finding is that pretrained and randomly initialized models can achieve similar prediction accuracy while showing distinct recovery profiles, revealing that accuracy alone can mask fundamental model-brain mismatches. The framework also enables brain-to-brain comparisons as a diagnostic human reference baseline.

Evaluation and Benchmarking · Multimodal Progress · Natural Scenes Dataset · human visual cortex · target-space recovery profiles

5 · arXiv · cs.AI · 23d ago

Reverse Probing: Supervised Token-level Uncertainty Quantification for LLMs in Clinical Text

The paper introduces Reverse Probing, a novel uncertainty quantification framework designed specifically for clinical text summarization that estimates token-level uncertainty from pre-existing labeled summaries rather than sampling new outputs. It extracts uncertainty signals from four categories of internal model activations, treating text as a probe into the model's internal state. Evaluated on two expert-annotated clinical datasets, it outperforms eight adapted baselines on all metrics, achieving up to 4× higher AUPRC while reducing inference time and compute. Feature analysis identifies delta energy and neighborhood context as the most consistent predictors of uncertainty across models.

Evaluation and Benchmarking · AI Safety Research · Reverse Probing · delta energy · AUPRC

5 · arXiv · cs.CL · 1mo ago

Controlled Audit of Human vs. Synthetic Soft-Labels for Calibration and Uncertainty Alignment

This paper presents a controlled study disentangling the effects of human soft-labels from label mode-shift corrections in soft-label learning, using MNIST and a synthetic variant. The authors find that human soft-labels primarily act as a regularizer improving calibration on difficult samples and promoting stable training convergence, rather than simply correcting mislabeled data. Dataset cartography analysis shows models trained on human soft-labels mirror human uncertainty patterns, while those trained on synthetic labels fail to align. The work provides a diagnostic testbed for evaluating human-AI uncertainty alignment.

Evaluation and Benchmarking · AI Safety Research · MNIST · human uncertainty alignment · model calibration

5 · arXiv · cs.CL · 47h ago

RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations

Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs while adding grounding supervision without degrading language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.

Evaluation and Benchmarking · Multimodal Progress · RefRad2D · Slake · RadGrounder

4 · arXiv · cs.CL · 1mo ago

Risk-Aware Hybrid Selective Classification for HIV Suspicion Identification in Spanish Clinical Notes

This paper proposes a hybrid selective classification framework for clinical NLP that explicitly handles both aleatoric and epistemic uncertainty to avoid overconfident predictions in medical triage settings. The system combines Mondrian conformal prediction with a Multi-Centroid Mahalanobis Distance veto, evaluated on HIV suspicion identification in Spanish clinical notes. The authors demonstrate that standard uncertainty metrics and baseline classifiers suffer coverage collapse under strict reliability constraints, while their dual-verification approach isolates a trustworthy operational domain. The work critiques inflated benchmark metrics that arise from forcing deterministic classification on inherently ambiguous clinical instances.
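As a rough illustration of the Mondrian (class-conditional) conformal step mentioned in this summary, the sketch below calibrates nonconformity scores separately per class and abstains unless exactly one class remains plausible. It is not the authors' implementation and omits their Multi-Centroid Mahalanobis Distance veto entirely; the function names, the `1 - probability` nonconformity score, the `epsilon` value, and the toy data are assumptions.

```python
import numpy as np

def mondrian_p_values(cal_probs, cal_labels, test_probs):
    """Class-conditional conformal p-values.

    cal_probs: (n, K) predicted probabilities on a held-out calibration set.
    cal_labels: (n,) true class indices for the calibration set.
    test_probs: (m, K) predicted probabilities for new examples.
    Returns an (m, K) array of p-values, one per candidate class.
    """
    n_classes = cal_probs.shape[1]
    p_vals = np.zeros((test_probs.shape[0], n_classes))
    for c in range(n_classes):
        # Nonconformity of calibration examples whose true class is c.
        cal_scores = 1.0 - cal_probs[cal_labels == c, c]
        test_scores = 1.0 - test_probs[:, c]
        # Count calibration scores at least as nonconforming as each test score.
        ge = (cal_scores[None, :] >= test_scores[:, None]).sum(axis=1)
        p_vals[:, c] = (ge + 1) / (cal_scores.size + 1)
    return p_vals

def selective_predict(p_vals, epsilon=0.2):
    """Return the class index if exactly one class has p-value > epsilon, else -1."""
    in_set = p_vals > epsilon
    single = in_set.sum(axis=1) == 1
    return np.where(single, in_set.argmax(axis=1), -1)

# Toy calibration set (hypothetical): 13 examples per class with varying confidence.
p_true = np.linspace(0.35, 0.95, 13)
cal_probs = np.vstack([
    np.column_stack([p_true, 1 - p_true]),   # true class 0
    np.column_stack([1 - p_true, p_true]),   # true class 1
])
cal_labels = np.array([0] * 13 + [1] * 13)

test_probs = np.array([[0.97, 0.03], [0.03, 0.97], [0.52, 0.48]])
decisions = selective_predict(mondrian_p_values(cal_probs, cal_labels, test_probs))
print(decisions)
```

On this toy data the two confident examples receive a label, while the ambiguous third example keeps both classes in its prediction set and is deferred (`-1`) rather than forced into a deterministic class.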

Evaluation and Benchmarking · AI Safety Research · HIV Suspicion Identification · Mondrian Conformal Prediction · Selective Classification

5 · arXiv · cs.AI · 1mo ago

CARV: Compute-Aware Variance Reduction for Diffusion Teacher Gradient Estimation

CARV is a hierarchical Monte Carlo estimation framework that reduces gradient variance when using frozen pretrained diffusion models as teachers in downstream pipelines such as text-to-3D distillation and data attribution. The approach amortizes expensive upstream computation (rendering, simulation, encoding) over cheap diffusion-noise resamples, augmented by timestep importance sampling and stratified-inverse-CDF construction. In text-to-3D experiments, CARV delivers 2–3× effective compute multipliers; in single-step distillation, it cuts gradient variance by an order of magnitude but does not improve FID, revealing that MC variance is not the bottleneck in that regime.

Inference Economics · Multimodal Progress · Model Distillation · CARV · importance sampling

4 · arXiv · cs.LG · 24d ago

Normal Guidance: Bell-Curve Regularization for Attention-Based MIL in 3D Medical Imaging

This paper addresses weakly supervised classification of 3D medical images where only volume-level binary labels are available. The authors identify that a simple center-focused baseline outperforms attention-based and transformer-based multiple instance learning (MIL) at slice-level classification across brain, thoracic, and abdominal CT datasets. They propose Normal Guidance, a regularization technique that constrains learned attention distributions to follow a bell-shaped curve, achieving superior slice-level localization over state-of-the-art MIL methods across datasets totaling over 4 million 2D slices.

Evaluation and Benchmarking · Multiple Instance Learning (MIL) · Attention-based MIL · Normal Guidance

5 · arXiv · cs.AI · 4d ago

ActiveSAM: Training-free open-vocabulary segmentation via image-conditional class pruning on SAM 3

ActiveSAM is a training-free, zero-shot inference framework that wraps Segment Anything Model 3 (SAM 3) to perform open-vocabulary semantic segmentation more efficiently. It estimates an image-conditioned subset of active classes at low resolution, then runs full-resolution decoding only on the retained classes, using bucketed prompt multiplexing and margin-aware background calibration. Across eight benchmarks, it outperforms the prior state-of-the-art SegEarth-OV3 by ~1.4 mIoU on average while running up to 5.5× faster on large-vocabulary datasets, and it remains robust to image corruptions of the kind encountered in autonomous driving and embodied AI.

Evaluation and Benchmarking · Inference Economics · VILA-Lab · Segment Anything Model 2 · ActiveSAM