6 · arXiv cs.CL (Computation and Language) · 9d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus whose reasoning traces are derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark that evaluates large vision-language models (LVLMs) along three axes: perception, medical knowledge, and rationale. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and brings performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · Multimodal Progress · OpenMedReason · OpenMedReason-Bench

Related guides (3)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv · cs.CL · 22d ago

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

The paper introduces a pipeline for converting unstructured clinical text into HL7 FHIR R4 bundles, enabling evaluation of LLMs in realistic electronic health record settings. Applied to the MedCaseReasoning dataset, it produces MedCase-Structured, a synthetic benchmark achieving valid FHIR generation for 82.5% of cases. Key finding: LLMs show consistently lower diagnostic accuracy on structured FHIR inputs compared to plain text, underscoring the gap between standard benchmarks and real-world clinical deployment conditions.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HL7 FHIR R4 · large language models · MedCase-Structured · +1 more

4 · arXiv · cs.CL · 46h ago

MedRLM: Recursive multimodal agent framework for long-context clinical decision support

MedRLM is a proposed framework for clinical decision support that uses recursive multi-agent reasoning over heterogeneous patient data including EHRs, medical images, physiological sensor streams, and clinical guidelines. Rather than single-step prompting, it decomposes patient cases into an inspectable external environment coordinated by specialized agents, with a Clinical Evidence Graph Memory and sensor-triggered deeper reasoning. The paper outlines an evaluation design using public and credentialed clinical datasets spanning radiology, ECG, ICU time series, and referral outcomes. The work targets a gap between static medical QA benchmarks and real-world longitudinal clinical workflows.

Agent and Tool Ecosystem · Multimodal Progress · MedRLM · Clinical Evidence Graph Memory

6 · Hugging Face Blog · 1mo ago

NVIDIA Releases 6 Million Multi-Lingual Reasoning Dataset

NVIDIA has released a dataset of 6 million multilingual reasoning examples, published via Hugging Face. The dataset is intended to support training and evaluation of reasoning capabilities across multiple languages. This release addresses a known gap in multilingual reasoning data availability for the research community.

Frontier Model Releases · Evaluation and Benchmarking · NVIDIA Multilingual Reasoning Dataset v1 · NVIDIA · Hugging Face · +1 more

6 · arXiv · cs.CL · 29d ago

ChronoMedKG: Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

ChronoMedKG is a new biomedical knowledge graph containing 460,497 evidence-linked triples across 13,431 diseases, each annotated with temporal components such as onset window and progression stage. It is constructed via a multi-agent pipeline using multiple frontier LLMs extracting from PubMed/PMC, with multi-model consensus and credibility filtering. The accompanying ChronoTQA benchmark (3,341 questions) reveals frontier LLMs lose ~30 points on temporal vs. static clinical questions, while ChronoMedKG-based retrieval recovers 47–65% of long-tail failures compared to 17–29% for HPOA-RAG. The work addresses a significant gap in existing KGs (PrimeKG, Hetionet, iKraph) that treat disease associations as static facts.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Phenopackets · PubMed · ChronoTQA · +8 more

5 · Hugging Face Blog · 1mo ago

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · PubMedQA · Open Medical-LLM Leaderboard · MedMCQA · +3 more

6 · OpenAI Blog · 2d ago

OpenAI reasoning model helps diagnose 18 previously unsolved rare childhood genetic diseases

Researchers used an OpenAI reasoning model to assist physicians in diagnosing rare genetic diseases in children, identifying 18 new diagnoses in cases that had previously gone unsolved. The announcement comes from OpenAI's official blog, positioning the work as a demonstration of reasoning model utility in high-stakes clinical settings. The result is notable as a concrete real-world application of frontier reasoning capabilities in medicine.

Frontier Model Releases · Enterprise Deployment Patterns · OpenAI Reasoning Models · OpenAI

8 · Mistral AI News · 19d ago

Mistral AI Releases Magistral: First Reasoning Model in Open and Enterprise Variants

Mistral AI announces Magistral, its first reasoning model, released in two variants: Magistral Small (24B parameters, open-weight, Apache 2.0) and Magistral Medium (enterprise, closed). Magistral Medium scores 73.6% on AIME2024 (90% with majority voting @64), while Magistral Small scores 70.7% (83.3% with majority voting @64). Key differentiators include native multilingual chain-of-thought reasoning across eight major languages, transparent and traceable reasoning steps, and up to 10x faster token throughput in Le Chat via Flash Answers. The release is accompanied by a research paper covering the training infrastructure, the reinforcement learning algorithm, and novel observations on training reasoning models.

Frontier Model Releases · Evaluation and Benchmarking · Mistral AI · AIME2024 · Amazon SageMaker · +13 more
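
For readers unfamiliar with the "majority voting @64" figures quoted in the Magistral item above, the following is a minimal sketch of how such a score is commonly computed: sample several answers per problem, keep the most frequent one, and grade only that. The function names (`majority_vote`, `maj_at_k_accuracy`) and the toy data are assumptions made for illustration; they are not taken from Mistral's evaluation code, and the toy example uses 5 samples per problem rather than 64.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled completions."""
    # most_common(1) yields [(answer, count)] for the top answer; ties are
    # resolved in favour of the answer that appeared first in `answers`.
    return Counter(answers).most_common(1)[0][0]


def maj_at_k_accuracy(samples_per_problem, references):
    """Fraction of problems whose majority-voted answer equals the reference."""
    correct = sum(
        majority_vote(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)


# Toy data, invented for illustration: 3 problems, 5 sampled answers each.
samples = [
    ["204", "204", "113", "204", "97"],  # majority "204" matches the reference
    ["33", "25", "25", "33", "33"],      # majority "33" matches the reference
    ["540", "72", "72", "540", "72"],    # majority "72" does not match "540"
]
references = ["204", "33", "540"]

print(maj_at_k_accuracy(samples, references))
```

On this toy data the voted answer is correct for two of the three problems. Voting can raise accuracy above the single-sample score when correct answers recur more often than any one wrong answer, which is consistent with the gap between the single-sample and voted figures reported above.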

6 · arXiv · cs.LG · 4d ago

ExpRL: RL-based mid-training using human QA data as reward scaffolds for LLM reasoning

ExpRL proposes an automated approach to LLM mid-training that replaces manually curated reasoning traces with large corpora of human-written QA data used as reward scaffolds rather than imitation targets. Reference solutions are hidden from the policy and used only to construct problem-specific grading rubrics, enabling dense process-level rewards that reinforce partial progress and intermediate reasoning steps. On challenging math reasoning benchmarks, ExpRL outperforms SFT, sparse-reward GRPO, and self-distillation as an RL initialization strategy, with additional mixed-domain experiments suggesting broader applicability.

Evaluation and Benchmarking · Alignment and RLHF · ExpRL: Exploratory RL for LLM Mid-Training · GRPO · ExpRL