6 · arXiv cs.CL (Computation and Language) · 29d ago

ChronoMedKG: Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG is a new biomedical knowledge graph containing 460,497 evidence-linked triples across 13,431 diseases, each annotated with temporal components such as onset window and progression stage. It is built by a multi-agent pipeline in which multiple frontier LLMs extract triples from PubMed/PMC, followed by multi-model consensus and credibility filtering. On the accompanying ChronoTQA benchmark (3,341 questions), frontier LLMs score roughly 30 points lower on temporal questions than on static clinical questions, while ChronoMedKG-based retrieval recovers 47–65% of long-tail failures, compared with 17–29% for HPOA-RAG. The work addresses a significant gap in existing KGs (PrimeKG, Hetionet, iKraph), which treat disease associations as static facts.
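To make the idea of a temporally grounded, evidence-linked triple concrete, here is a minimal sketch. It is not the ChronoMedKG schema: the field names (`onset_window`, `progression_stage`, `pmid`, `supporting_models`), the `min_models` threshold, and the placeholder values are all illustrative assumptions, and the filter is only one simple way a multi-model consensus step could work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalTriple:
    """Hypothetical record: a static triple plus temporal and evidence fields."""
    head: str                                 # disease identifier (placeholder)
    relation: str                             # e.g. "has_phenotype"
    tail: str                                 # phenotype, drug, gene, ...
    onset_window: Optional[str] = None        # assumed free-text or coded window
    progression_stage: Optional[str] = None   # assumed stage label
    pmid: Optional[str] = None                # evidence link (assumed PubMed ID)
    supporting_models: frozenset = frozenset()  # which extractors proposed it


def consensus_filter(triples, min_models=2):
    """Keep triples that carry an evidence link and were proposed by at least
    `min_models` distinct extractor models (threshold is an assumption)."""
    return [
        t for t in triples
        if t.pmid is not None and len(t.supporting_models) >= min_models
    ]


# Placeholder data, not taken from the graph.
candidates = [
    TemporalTriple("disease_X", "has_phenotype", "phenotype_Y",
                   onset_window="childhood", progression_stage="early",
                   pmid="PMID_PLACEHOLDER",
                   supporting_models=frozenset({"model_a", "model_b"})),
    TemporalTriple("disease_X", "has_phenotype", "phenotype_Z",
                   pmid="PMID_PLACEHOLDER",
                   supporting_models=frozenset({"model_a"})),
]
kept = consensus_filter(candidates)
```

In this sketch only the first candidate survives, because the second was proposed by a single model. A static KG would store just `head`, `relation`, and `tail`; the extra fields are what a temporal question would need to query.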

Evaluation and Benchmarking · Enterprise Deployment Patterns · Agent and Tool Ecosystem · Phenopackets · PubMed · ChronoTQA · ChronoMedKG · HPOA · Orphadata · iKraph · Orphanet · PrimeKG · Hetionet

Related guides (3)

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench · +1 more

4 · arXiv · cs.CL · 47h ago

MedRLM: Recursive multimodal agent framework for long-context clinical decision support

MedRLM is a proposed framework for clinical decision support that uses recursive multi-agent reasoning over heterogeneous patient data including EHRs, medical images, physiological sensor streams, and clinical guidelines. Rather than single-step prompting, it decomposes patient cases into an inspectable external environment coordinated by specialized agents, with a Clinical Evidence Graph Memory and sensor-triggered deeper reasoning. The paper outlines an evaluation design using public and credentialed clinical datasets spanning radiology, ECG, ICU time series, and referral outcomes. The work targets a gap between static medical QA benchmarks and real-world longitudinal clinical workflows.

Agent and Tool Ecosystem · Multimodal Progress · MedRLM · Clinical Evidence Graph Memory

5 · arXiv · cs.CL · 22d ago

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

The paper introduces a pipeline for converting unstructured clinical text into HL7 FHIR R4 bundles, enabling evaluation of LLMs in realistic electronic health record settings. Applied to the MedCaseReasoning dataset, it produces MedCase-Structured, a synthetic benchmark achieving valid FHIR generation for 82.5% of cases. Key finding: LLMs show consistently lower diagnostic accuracy on structured FHIR inputs compared to plain text, underscoring the gap between standard benchmarks and real-world clinical deployment conditions.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HL7 FHIR R4 · large language models · MedCase-Structured · +1 more

6 · arXiv · cs.CL · 18d ago

ClinEnv: Interactive Multi-Stage Long-Horizon EHR Benchmark for Clinical Agent Evaluation

ClinEnv is a new interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions using a Longitudinal Inpatient Simulation paradigm. Each case is decomposed into sequential decision stages where models must query four specialized agents before committing to medications, procedures, and diagnoses. Across seven evaluated models, the best achieves only 0.31 decision F1, with a sharp gap between diagnosis recovery (0.51 F1) and management actions (0.17 F1). The benchmark uniquely measures information-acquisition process quality alongside outcome quality, exposing a gap invisible to static or outcome-only evaluations.

Long Context Evolution · Evaluation and Benchmarking · large language models · ClinEnv · Electronic Health Records (EHR) · +3 more

6 · arXiv · cs.AI · 16d ago

KINA: 899-item knowledge benchmark across 261 disciplines with formal representativeness and annotation incentive guarantees

KINA (Knowledge Index of Noah's Ark) is a new 899-item LLM benchmark spanning 261 fine-grained disciplines, addressing three methodological weaknesses in existing knowledge benchmarks: poor disciplinary representativeness, flat-payment annotation incentives, and unaudited ranking instability. The authors provide formal results: a (1-1/e) greedy approximation for disciplinary coverage and a proof that bonus-on-bar tournament payment weakly dominates flat payment for annotation quality. Evaluating 42 models from 13 labs, the top performer Gemini-3.1-Pro-Preview reaches 53.17%, with Claude-Opus-4.6 and GPT-5.4 close behind, revealing a tiered rather than smooth leaderboard structure with substantial headroom below saturation.
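The (1-1/e) factor mentioned above is the classic guarantee for greedy selection under a coverage-style objective. The summary does not spell out KINA's actual objective, so the following is only a sketch of the textbook greedy maximum-coverage procedure; the item names, discipline labels, and `budget` parameter are hypothetical.

```python
from __future__ import annotations


def greedy_cover(candidates: dict[str, set[str]], budget: int) -> list[str]:
    """Pick up to `budget` items, each time taking the item that covers the
    most not-yet-covered disciplines (standard greedy maximum coverage)."""
    chosen: list[str] = []
    covered: set[str] = set()
    remaining = dict(candidates)  # copy so the caller's dict is untouched
    for _ in range(budget):
        best = max(remaining, key=lambda k: len(remaining[k] - covered),
                   default=None)
        # Stop early if nothing is left or no item adds new coverage.
        if best is None or not (remaining[best] - covered):
            break
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical candidate items mapped to the disciplines they touch.
items = {
    "q1": {"a", "b"},
    "q2": {"b", "c"},
    "q3": {"c"},
    "q4": {"d", "e", "f"},
}
picked = greedy_cover(items, budget=2)
```

Here the first pick is `q4` (three new disciplines) and the second is `q1` (two new disciplines, tied with `q2` and chosen because it comes first). For monotone submodular objectives such as set coverage, this greedy rule is known to reach at least a (1-1/e) fraction of the optimal value for the same budget.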

Frontier Model Releases · Evaluation and Benchmarking · Claude Opus 4.6 · KINA · Google · +4 more

6 · arXiv · cs.AI · 8d ago

Agents-K1: End-to-end knowledge orchestration pipeline for agent-native scientific knowledge graphs

Agents-K1 is a new pipeline that converts raw scientific documents into structured knowledge graphs for use by LLM-based research agents, addressing the gap where existing systems reduce papers to abstracts and flat citation edges. The system integrates a multimodal parser, a 4B information-extraction model trained with GRPO, and a tri-source agent interface combining web search, graph retrieval, and cross-document traversal. The authors process 2.46 million scientific papers to produce Scholar-KG, releasing a one-million-paper subset. Experiments show improvements in scientific information extraction, knowledge graph construction, and multi-hop reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GRPO · Agents-K1 · Scholar-KG · +1 more

5 · arXiv · cs.CL · 4d ago

MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines

Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations — nine RAG variants and a protocol-driven agent — shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.
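The gap between high retrieval recall and low end-to-end recovery can be illustrated by splitting recall into stages. The summary does not define the paper's stage-attributed metrics, so the decomposition below is an assumption: the function name, the three metric names, and the toy identifiers are hypothetical.

```python
from __future__ import annotations


def stage_recall(gold: set[str], retrieved: list[str],
                 included: set[str], k: int) -> dict[str, float]:
    """Split end-to-end recall into a retrieval stage and a screening stage.

    gold      -- ground-truth included studies
    retrieved -- ranked retrieval output
    included  -- studies the pipeline finally kept after screening
    """
    if not gold:
        raise ValueError("gold set must be non-empty")
    reachable = gold & set(retrieved[:k])   # gold studies surviving retrieval
    recovered = reachable & included        # ...that also survive screening
    return {
        "retrieval_recall": len(reachable) / len(gold),
        "screening_recall": len(recovered) / len(reachable) if reachable else 0.0,
        "end_to_end_recall": len(recovered) / len(gold),
    }


# Toy example with made-up study IDs.
scores = stage_recall(
    gold={"a", "b", "c", "d"},
    retrieved=["a", "x", "b", "c", "y"],
    included={"a", "x"},
    k=4,
)
```

In this toy case retrieval finds three of four gold studies, but screening keeps only one of those three, so the end-to-end figure is dominated by the screening stage. End-to-end recall is the product of the two stage recalls, which is why a single aggregate score hides where the loss occurs.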

Evaluation and Benchmarking Agent and Tool Ecosystem PubMed Nature Portfolio MetaSyn

5 · arXiv · cs.LG · 26d ago

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

CHRONOS is a three-layer multi-agent architecture addressing temporal degradation in knowledge-graph data marketplaces, combining neural-ODE-based shortcut decay, changepoint-conditioned Shapley pricing, and EXP3-IX-driven differential privacy budget management. The system achieves 0.937 recall@10, 2.74 QPS, and 161ms latency under a total epsilon of 4.25 (delta=1e-6) using zCDP composition across four benchmarks. A key limitation noted is that at this privacy level, released valuations remain noise-dominated, with utility primarily derived from public index routing. The work provides formal guarantees including per-query recall-loss bounds and finite-sample Shapley error bounds under distribution shift.
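For readers unfamiliar with zCDP accounting, the sketch below shows the standard relationship between a zCDP budget and an (epsilon, delta) guarantee. It is not CHRONOS's accounting, which the summary does not describe: it assumes the basic conversion that rho-zCDP implies (rho + 2*sqrt(rho*ln(1/delta)), delta)-DP and that rho values add under composition. The function names are hypothetical, and a real system may use a tighter conversion.

```python
import math


def compose_zcdp(rhos):
    """zCDP composes additively: the total budget is the sum of the parts."""
    return sum(rhos)


def zcdp_to_epsilon(rho, delta):
    """Basic conversion: rho-zCDP implies (epsilon, delta)-DP with
    epsilon = rho + 2 * sqrt(rho * ln(1/delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def rho_for_epsilon(epsilon, delta):
    """Invert the conversion above: solve s^2 + 2*s*sqrt(L) - epsilon = 0
    for s = sqrt(rho), where L = ln(1/delta)."""
    log_term = math.log(1.0 / delta)
    root = math.sqrt(log_term + epsilon) - math.sqrt(log_term)
    return root * root


# Total zCDP budget consistent with the reported epsilon and delta,
# under the basic conversion assumed here.
total_rho = rho_for_epsilon(4.25, 1e-6)
check_eps = zcdp_to_epsilon(total_rho, 1e-6)
```

Under these assumptions, any split of `total_rho` across mechanisms or queries composes back to the same overall (epsilon, delta) pair, which is what makes zCDP convenient for budget management across many releases.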

Evaluation and Benchmarking · AI Safety Research · Differential Privacy · CHRONOS · Gaussian mechanism · +6 more