Entity · dataset

MedQA

dataset · active · medqa-e3e7ebc5 · 4 events · first seen May 19, 2026

Aliases: MedQA

More like this (12)

MedMCQA PubMedQA MedQADE IndQA SimpleQA QIMMA ChartQA GPQA ThReadMed-QA FinQA TableQA GQA

Recent events (4)

4 · arXiv · cs.CL · Jul 22, 2026

DAIS: Dependency-aware intermediate QA supervision improves complex reasoning in LLMs

Researchers introduce DAIS (Dependency-Aware Intermediate QA Supervision), a training-time framework that converts teacher rationales into stage-level QA records in which each intermediate step is conditioned on prior reasoning states. Evaluated on the GDPR, AIACT, MedQA, and FOLIO benchmarks with Qwen backbones, DAIS outperforms answer-only, flat chain-of-thought, and independent-QA baselines, with gains of up to 5.6% (4.2% on average) on policy-compliance tasks. Ablations confirm that dependency conditioning contributes beyond simply adding more intermediate text, which suggests it can serve as a lightweight auxiliary supervision signal.

Evaluation and Benchmarking · Alignment and RLHF · DAIS · MedQA · Qwen · +2 more

6 · arXiv · cs.CL · Jun 23, 2026

KG grounding helps LLMs only for out-of-training knowledge: controlled clinical QA study

A new arXiv paper investigates when knowledge-graph (KG) grounding improves LLM performance on clinical question answering. It finds that structured KG retrieval over the public biomedical graph PrimeKG provides no meaningful improvement on MedQA (all deltas ≤3.4) because the relevant facts are already in the model's training data. On synthetic counterfactual and hybrid benchmarks containing genuinely novel facts, the same pipeline lifts out-of-training accuracy from chance to ~100%. The paper also reproduces and partially corrects a recent Nature Medicine study on frontier LLMs vs. clinical RAG tools, flagging a score-deflating grader bug and clarifying that the reported ~88 HealthBench score reflects the Consensus variant, not full HealthBench (~46-47). The core finding, that RAG/KG grounding pays off only when the decisive fact is outside the model's training distribution, has direct implications for when retrieval augmentation is worth deploying.

Evaluation and Benchmarking · Enterprise Deployment Patterns · HealthBench · samyama-graph · MedQA · +5 more

7 · The Batch · Jun 5, 2026

Gray market API proxy network enables discounted access to U.S. AI models in China via fraud and distillation

A ChinaTalk report details an informal ecosystem of API proxy servers, account farms, identity brokers, and token resellers that gives Chinese developers access to U.S. AI models like Claude, ChatGPT, and Gemini at steep discounts, sometimes 10% of market price, through methods ranging from terms-of-service violations to credit card fraud. CISPA Helmholtz Center research found that 'Gemini-2.5' accessed through a proxy achieved only 37% on MedQA, versus 83.82% via Google's official API, suggesting model substitution is common. The network also harvests API call logs as training data, feeding the industrial-scale distillation practices that Anthropic accused DeepSeek, Moonshot, and MiniMax of in February. The White House acknowledged the distillation threat in an April memo, framing it as an adversarial national security concern.

Frontier Model Releases · AI Safety Research · White House · Gemini 2.5 · DeepSeek V4 · +10 more

5 · Hugging Face Blog · May 19, 2026

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face has launched the Open Medical-LLM Leaderboard, a public benchmark for evaluating large language models on healthcare and medical tasks. The leaderboard aggregates performance across multiple medical question-answering datasets to enable standardized comparison of open-weight models in clinical and biomedical domains. This initiative aims to accelerate progress in medical AI by providing transparent, reproducible evaluation infrastructure.

Evaluation and Benchmarking · Open Weights Progress · PubMedQA · Open Medical-LLM Leaderboard · MedMCQA · +3 more