6arXiv cs.CL (Computation and Language)·47h ago

KG grounding helps LLMs only for out-of-training knowledge: controlled clinical QA study

A new arXiv paper investigates when knowledge-graph (KG) grounding improves LLM performance on clinical question answering, finding that structured KG retrieval over the public biomedical graph PrimeKG provides no meaningful improvement on MedQA (all deltas ≤3.4) because the relevant facts are already in the model's training data. On synthetic counterfactual and hybrid benchmarks containing genuinely novel facts, the same pipeline lifts out-of-training accuracy from chance to ~100%. The paper also reproduces and partially corrects a recent Nature Medicine study on frontier LLMs vs. clinical RAG tools, flagging a score-deflating grader bug and clarifying that the reported ~88 HealthBench score reflects the Consensus variant, not full HealthBench (~46-47). The core finding — that RAG/KG grounding pays off only when the decisive fact is outside the model's training distribution — has direct implications for when retrieval augmentation is worth deploying.

Evaluation and Benchmarking Enterprise Deployment Patterns Agent and Tool Ecosystem HealthBench samyama-graph MedQA OpenAI PrimeKG Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge: A Controlled Study on Clinical Question Answering GPT-5.5

Related guides (3)

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Read asBeginner In-depth

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

6arXiv · cs.CL·1mo ago·source ↗

ChronoMedKG: Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG is a new biomedical knowledge graph containing 460,497 evidence-linked triples across 13,431 diseases, each annotated with temporal components such as onset window and progression stage. It is constructed via a multi-agent pipeline using multiple frontier LLMs extracting from PubMed/PMC, with multi-model consensus and credibility filtering. The accompanying ChronoTQA benchmark (3,341 questions) reveals frontier LLMs lose ~30 points on temporal vs. static clinical questions, while ChronoMedKG-based retrieval recovers 47–65% of long-tail failures compared to 17–29% for HPOA-RAG. The work addresses a significant gap in existing KGs (PrimeKG, Hetionet, iKraph) that treat disease associations as static facts.

Evaluation and Benchmarking Enterprise Deployment Patterns Phenopackets PubMed ChronoTQA +8 more

7arXiv · cs.CL·13d ago·source ↗

MedMisBench: LLMs show fragile epistemic resilience under misleading medical context

Researchers introduce MedMisBench, a benchmark of 10,932 medical questions paired with 48,889 misleading context injections, to measure whether LLMs maintain correct medical judgment under adversarial pressure. Across 11 model configurations, mean accuracy drops from 71.1% to 38.0% when misleading context is injected, with authority-framed falsehoods achieving 69.5% attack success. A 14-member international clinical panel flagged serious potential harm in 38.2% of reviewed cases. The work argues that existing medical benchmarks measure knowledge but not robustness to manipulation, exposing a structural gap in LLM safety evaluation for healthcare.

Evaluation and Benchmarking AI Safety Research Measuring Epistemic Resilience of LLMs Under Misleading Medical Context MedMisBench

6arXiv · cs.CL·29d ago·source ↗

Peak-Then-Collapse: RLVR Tool-Use Failures on Knowledge-Graph APIs

This paper investigates RLVR-based tool-use training (GRPO on Qwen2.5-7B-Instruct) on a minimal knowledge-graph API (Freebase over Complex WebQuestions) and documents a 'peak-then-collapse' pattern where tool-grounded answer rates rise then fall to zero within 50 steps, replicated across four seeds and seven reward designs. The authors identify a key structural difference between knowledge-graph APIs and other tool types (Python, web search, JSON): sparse, non-natural-language feedback signals (e.g., empty brackets '[]') prevent the model from recovering via pretraining-familiar error signals. A direct oracle ablation shows relation selection is not the bottleneck—95.4% of errors are retrieval-composition failures—and self-distillation reaches 40% EM at 7B, with capacity scaling to 14B yielding only marginal gains, suggesting an interface-bound ceiling.

Evaluation and Benchmarking Agent and Tool Ecosystem RLVR Self-Distillation GRPO +4 more

6arXiv · cs.CL·15d ago·source ↗

KATE framework improves LLM tool calling via experiential knowledge integration and parallel reasoning

Researchers present KATE (Knowledge-Augmented Tool Execution), a framework addressing LLM failures in multi-step tool use by systematically studying knowledge acquisition, activation, and internalization. Key findings include that instance-level experiential knowledge outperforms abstract intent-level knowledge, that expanding reasoning width via parallel sampling with aggregation beats deeper chain-of-thought, and that reinforcement learning outperforms supervised fine-tuning for knowledge internalization. KATE is evaluated on BFCL-V3 and AppWorld benchmarks, showing consistent improvements over strong baselines across model scales.

Evaluation and Benchmarking Agent and Tool Ecosystem BFCL-V3 AppWorld Knowledge-Augmented Tool Execution +1 more

4arXiv · cs.CL·6d ago·source ↗

Empirical study of LLM medical domain adaptation trade-offs in French QA

Researchers present a systematic comparison of continual pretraining (CPT), supervised fine-tuning (SFT), and their combination for adapting LLMs to French medical question answering. The study spans three model families, multiple sizes, and three initialization types, evaluating both multiple-choice and open-ended QA formats. Key findings: CPT+SFT yields the best MCQA scores but gains over SFT alone are often not statistically significant, making SFT a cost-effective default; for open-ended QA, CPT improves overlap metrics while SFT degrades generation quality. Cross-lingual transfer from French adaptation to English benchmarks is also demonstrated.

Evaluation and Benchmarking Enterprise Deployment Patterns Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

4arXiv · cs.CL·16d ago·source ↗

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking Agent and Tool Ecosystem ColBERTv2 MuSiQue 2WikiMultiHopQA +2 more

5arXiv · cs.CL·8d ago·source ↗

ContextRL: Context-aware reinforcement learning improves grounding in agentic and multimodal LLMs

Researchers introduce ContextRL, a reinforcement learning method that trains LLMs to select the context that supports a given query-answer pair from two highly similar candidates, rather than supervising only final answers. The approach constructs contrastive context pairs in two domains: coding agent trajectories (1k pairs) and multimodal image pairs (7k pairs). ContextRL achieves +2.2% average gains over standard GRPO on 5 long-horizon benchmarks and +1.8% across 12 visual QA benchmarks, with ablations showing the gains stem from the context-selection objective rather than the contrastive data alone.

Agent and Tool Ecosystem Alignment and RLHF GRPO ContextRL +1 more

5arXiv · cs.CL·16d ago·source ↗

Systematic evaluation of LLM prompt sensitivity in healthcare settings reveals safety risks

Researchers conduct a sensitivity analysis of both general-purpose and medical-specific LLMs using the MedMCQA benchmark, testing robustness to lexical and syntactic prompt perturbations. The study finds that even minor phrasing changes can alter clinical advice, and adversarial prompts can produce dangerous outputs such as incorrect dosages or omitted critical findings. Both general-purpose models (GPT-3.5, Llama 3) and domain-specific models (ClinicalBERT, BioLlama3, BioBERT) exhibit this fragility, with syntactic reordering and misleading contextual cues proving more destabilizing than simple paraphrasing.

Evaluation and Benchmarking AI Safety Research BioLlama3 BioBERT MedMCQA +3 more