Entity · dataset

MedAlign

dataset · active · medalign-a036d683 · 1 event · first seen Jun 16, 2026

Aliases: MedAlign

Co-occurring entities

Compositional Reasoning Depth Predicts Clinical AI Failure · Claude Sonnet · GPT-4o · OpenAI · GPT-5.5 · Anthropic

More like this (12)

VecAlign · ALIGN · G-IdiomAlign · SecAlign · AI alignment · AlignAtt · post-training alignment · The Alignment Project · JAM (Judge for Adaptive Metric-Alignment) · ALIGNBEAM · Positive Alignment · Qwen3-ForcedAligner-0.6B

Recent events (1)

6 · arXiv · cs.CL · Jun 16, 2026

Hop-count taxonomy predicts LLM failure on clinical EHR question answering across architectures

Researchers introduce a 'hop-count' taxonomy, defined as the number of distinct inferential steps required to answer a clinical EHR question, as a principled predictor of LLM failure. Accuracy declines monotonically with reasoning depth across Claude Sonnet, GPT-4o, and GPT-5. The pattern holds across two providers and two OpenAI generations, with odds ratios per hop of 0.58–0.80, and is not explained by EHR context truncation. Extended thinking (chain-of-thought) did not significantly flatten the accuracy-depth curve, though token usage scaled with hop count. The findings ground transformer compositionality limits in a clinically consequential domain and suggest hop count as a deployment risk-stratification tool.
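
As a rough illustration of what a per-hop odds ratio in the reported 0.58–0.80 range could imply, the sketch below applies a constant odds ratio to a starting accuracy. The 90% one-hop accuracy is a hypothetical figure, not a result from the paper. The function name `accuracy_after_hops`, the constant-ratio logistic form, and the reading of the odds ratio as applying to the odds of a correct answer are all assumptions made for illustration.

```python
def accuracy_after_hops(base_accuracy: float, odds_ratio: float, extra_hops: int) -> float:
    """Accuracy after `extra_hops` additional hops, assuming each hop
    multiplies the odds of a correct answer by a constant `odds_ratio`."""
    if not 0.0 < base_accuracy < 1.0:
        raise ValueError("base_accuracy must be strictly between 0 and 1")
    if odds_ratio <= 0.0 or extra_hops < 0:
        raise ValueError("odds_ratio must be positive and extra_hops non-negative")
    odds = base_accuracy / (1.0 - base_accuracy)  # convert probability to odds
    odds *= odds_ratio ** extra_hops              # apply the per-hop odds ratio
    return odds / (1.0 + odds)                    # convert odds back to probability


if __name__ == "__main__":
    HYPOTHETICAL_ONE_HOP_ACCURACY = 0.90  # illustrative only, not from the paper
    for odds_ratio in (0.58, 0.80):       # endpoints of the reported range
        curve = [
            accuracy_after_hops(HYPOTHETICAL_ONE_HOP_ACCURACY, odds_ratio, k)
            for k in range(4)
        ]
        print(odds_ratio, [round(a, 3) for a in curve])
```

Under these assumptions the predicted accuracy falls with every added hop for any odds ratio below 1, which is the kind of depth-dependent decline that would make hop count usable for risk stratification.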

Evaluation and Benchmarking · AI Safety Research · Compositional Reasoning Depth Predicts Clinical AI Failure · Claude Sonnet · MedAlign