MedAlign

dataset · active · provisional · medalign-a036d683 · 1 event · first seen 25h ago

Aliases: MedAlign

Recent events (1)

6 · arXiv · cs.CL · 25h ago

Hop-count taxonomy predicts LLM failure on clinical EHR question answering across architectures

Researchers introduce a 'hop-count' taxonomy — the number of distinct inferential steps required to answer a clinical EHR question — as a principled predictor of LLM failure, finding monotone accuracy decline with reasoning depth across Claude Sonnet, GPT-4o, and GPT-5. The pattern holds across two providers and two OpenAI generations, with odds ratios per hop of 0.58–0.80, and is not explained by EHR context truncation. Extended thinking (chain-of-thought) did not significantly flatten the accuracy-depth curve, though token usage scaled with hop count. The findings ground transformer compositionality limits in a clinically consequential domain and suggest hop count as a deployment risk-stratification tool.
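To give a feel for what a per-hop odds ratio of 0.58–0.80 means in practice, the sketch below compounds an odds ratio over additional hops under a plain logistic model. It is an illustration under stated assumptions, not the paper's analysis: the function name `accuracy_after_hops` and the 0.90 baseline accuracy are hypothetical, and only the two odds-ratio endpoints come from the summary above.

```python
def accuracy_after_hops(baseline_accuracy: float,
                        odds_ratio_per_hop: float,
                        extra_hops: int) -> float:
    """Implied accuracy after `extra_hops` more hops, assuming a logistic
    model in which each hop multiplies the odds of a correct answer by
    `odds_ratio_per_hop`."""
    if not 0.0 < baseline_accuracy < 1.0:
        raise ValueError("baseline_accuracy must be strictly between 0 and 1")
    if odds_ratio_per_hop <= 0.0 or extra_hops < 0:
        raise ValueError("odds ratio must be positive and extra_hops non-negative")
    # Convert probability to odds, apply the per-hop multiplier, convert back.
    odds = baseline_accuracy / (1.0 - baseline_accuracy)
    odds *= odds_ratio_per_hop ** extra_hops
    return odds / (1.0 + odds)


if __name__ == "__main__":
    baseline = 0.90  # hypothetical accuracy at the shallowest hop level
    for odds_ratio in (0.58, 0.80):  # endpoints of the reported range
        curve = [accuracy_after_hops(baseline, odds_ratio, k) for k in range(4)]
        print(odds_ratio, [round(p, 3) for p in curve])
```

Because the multiplier acts on odds rather than on accuracy directly, the drop in percentage points per hop depends on the starting accuracy; a different baseline gives a differently shaped curve.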