Almanac
benchmark

ClinEnv

benchmarkactiveprovisionalclinenv-33484bb5·1 events·first seen 15d ago

Aliases: ClinEnv

Co-occurring entities

More like this (12)

Recent events (1)

6arXiv · cs.CL·15d ago·source ↗

ClinEnv: Interactive Multi-Stage Long-Horizon EHR Benchmark for Clinical Agent Evaluation

ClinEnv is a new interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions using a Longitudinal Inpatient Simulation paradigm. Each case is decomposed into sequential decision stages where models must query four specialized agents before committing to medications, procedures, and diagnoses. Across seven evaluated models, the best achieves only 0.31 decision F1, with a sharp gap between diagnosis recovery (0.51 F1) and management actions (0.17 F1). The benchmark uniquely measures information-acquisition process quality alongside outcome quality, exposing a gap invisible to static or outcome-only evaluations.