KINA: 899-item knowledge benchmark across 261 disciplines with formal representativeness and annotation incentive guarantees
KINA (Knowledge Index of Noah's Ark) is a new 899-item LLM benchmark spanning 261 fine-grained disciplines. It targets three methodological weaknesses in existing knowledge benchmarks: poor disciplinary representativeness, flat-payment annotation incentives, and unaudited ranking instability. The authors provide two formal results: a (1-1/e) approximation guarantee for greedy selection of disciplinary coverage, and a proof that bonus-on-bar tournament payment weakly dominates flat payment for annotation quality. In an evaluation of 42 models from 13 labs, the top performer, Gemini-3.1-Pro-Preview, reaches 53.17%, with Claude-Opus-4.6 and GPT-5.4 close behind. The leaderboard is tiered rather than smooth, and scores remain well below saturation.
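The summary does not state KINA's exact coverage objective, so the following is only a minimal sketch of the textbook greedy algorithm for maximum coverage, the standard setting in which a (1-1/e) guarantee arises. The item names, discipline tags, and the function `greedy_max_coverage` are hypothetical illustrations, not the authors' procedure.

```python
def greedy_max_coverage(candidates, k):
    """Pick up to k items, each time taking the one that adds the most
    not-yet-covered disciplines (classic greedy for maximum coverage).

    candidates: dict mapping a hypothetical item id to a set of discipline tags.
    Returns (chosen item ids in pick order, set of covered disciplines).
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)  # shallow copy so the caller's dict is untouched
    for _ in range(k):
        if not remaining:
            break
        # Marginal gain = number of new disciplines this item would add.
        best = max(remaining, key=lambda name: len(remaining[name] - covered))
        if not (remaining[best] - covered):
            break  # no remaining item adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical toy pool: four candidate items tagged with disciplines.
pool = {
    "q1": {"algebra", "topology"},
    "q2": {"topology"},
    "q3": {"genetics", "ecology", "algebra"},
    "q4": {"ecology"},
}
picked, covered = greedy_max_coverage(pool, k=2)
```

With a budget of two items, the sketch first takes `q3` (three new disciplines) and then an item that adds `topology`, covering all four disciplines in the toy pool.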
Related events (8)
GIM: A Grounded Integration Measure Benchmark for Evaluating Multi-Domain Cognitive Coordination in LLMs
The Grounded Integration Measure (GIM) is a new LLM benchmark of 820 original problems designed to resist benchmark saturation by requiring integration of multiple cognitive operations—constraint satisfaction, state tracking, epistemic vigilance, audience calibration—over broadly accessible knowledge. Unlike knowledge-escalation benchmarks (GPQA, HLE) or pure abstraction benchmarks (ARC-AGI), GIM grounds reasoning in realistic tasks without gating on specialized expertise. The authors calibrate a 2-parameter logistic IRT model over 200k+ prompt-response pairs across 28 models and 47 test configurations, producing the most extensive published study of test-time compute vs. model capability tradeoffs on a fixed benchmark. A key finding is that within-family configuration choices (thinking budget, quantization) matter as much as model selection.
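For readers unfamiliar with the 2-parameter logistic (2PL) IRT model mentioned above, its standard item response function can be sketched as follows. The function and parameter names are illustrative assumptions; the summary does not describe how GIM fits or parameterizes its model.

```python
import math


def p_correct(theta, a, b):
    """Standard 2PL item response function.

    theta: latent ability of the respondent (here, a model configuration)
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values: an item of average difficulty answered by
# respondents of low, matching, and high ability.
probs = [p_correct(theta, a=1.5, b=0.0) for theta in (-1.0, 0.0, 1.0)]
```

When ability equals difficulty the probability is exactly 0.5, and a larger `a` makes the curve steeper around that point.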
KATE framework improves LLM tool calling via experiential knowledge integration and parallel reasoning
Researchers present KATE (Knowledge-Augmented Tool Execution), a framework addressing LLM failures in multi-step tool use by systematically studying knowledge acquisition, activation, and internalization. Key findings include that instance-level experiential knowledge outperforms abstract intent-level knowledge, that expanding reasoning width via parallel sampling with aggregation beats deeper chain-of-thought, and that reinforcement learning outperforms supervised fine-tuning for knowledge internalization. KATE is evaluated on BFCL-V3 and AppWorld benchmarks, showing consistent improvements over strong baselines across model scales.
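The summary does not say which aggregation rule KATE uses, so the sketch below assumes simple majority voting over several independently sampled tool calls, one common way to widen reasoning through sampling. The `stub_sampler` function and the tool-call strings are hypothetical stand-ins for a real model call, and the samples are drawn sequentially here for simplicity.

```python
from collections import Counter


def sample_candidates(sampler, prompt, width):
    """Draw `width` independent candidate answers for the same prompt.
    In a real system these calls could run concurrently."""
    return [sampler(prompt, seed) for seed in range(width)]


def aggregate_by_majority(candidates):
    """Return the most frequent candidate and its share of the votes."""
    counts = Counter(candidates)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(candidates)


def stub_sampler(prompt, seed):
    """Hypothetical stand-in for an LLM: one seed out of five disagrees."""
    if seed == 2:
        return "search(q='Paris')"
    return "get_weather(city='Paris')"


candidates = sample_candidates(stub_sampler, "What is the weather in Paris?", width=5)
winner, share = aggregate_by_majority(candidates)
```

Here four of the five samples agree, so the aggregated call is `get_weather(city='Paris')` with a vote share of 0.8.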
ChronoMedKG: Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning
ChronoMedKG is a new biomedical knowledge graph containing 460,497 evidence-linked triples across 13,431 diseases, each annotated with temporal components such as onset window and progression stage. It is constructed via a multi-agent pipeline using multiple frontier LLMs extracting from PubMed/PMC, with multi-model consensus and credibility filtering. The accompanying ChronoTQA benchmark (3,341 questions) reveals frontier LLMs lose ~30 points on temporal vs. static clinical questions, while ChronoMedKG-based retrieval recovers 47–65% of long-tail failures compared to 17–29% for HPOA-RAG. The work addresses a significant gap in existing KGs (PrimeKG, Hetionet, iKraph) that treat disease associations as static facts.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a new human-curated VQA benchmark that requires external knowledge beyond visual perception, constructed by combining Wikipedia images, captions, and Wikidata structured knowledge with LLM-generated question candidates reviewed by human annotators. The benchmark evaluates knowledge-intensive reasoning in vision-language models, covering 15 VLMs ranging from 256M to 90B parameters. Accuracy spans 24.7% to 75.6%, indicating meaningful discrimination across model scales. The dataset and code are publicly released.
MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines
Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations — nine RAG variants and a protocol-driven agent — shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.
LexNeo-Bench: Probing LLM Knowledge of Lexical Borrowing in Luxembourgish via Knowledge-Graph Prompting
Researchers introduce LexNeo-Bench, a 3,050-instance benchmark for evaluating LLM performance on lexical borrowing classification and neology detection in Luxembourgish, a low-resource contact language. Three multilingual LLMs are tested across 34 prompt configurations; without external context, models perform near chance on borrowing classification (25–35%). Injecting instance-specific subgraphs from a linguistic knowledge graph raises accuracy to 71–81% and largely closes the gap between small and large models, though neology detection remains difficult. The study highlights the value of lexicon-aware, structured prompting for low-resource multilingual evaluation.
Agents-K1: End-to-end knowledge orchestration pipeline for agent-native scientific knowledge graphs
Agents-K1 is a new pipeline that converts raw scientific documents into structured knowledge graphs for use by LLM-based research agents, addressing the gap where existing systems reduce papers to abstracts and flat citation edges. The system integrates a multimodal parser, a 4B information-extraction model trained with GRPO, and a tri-source agent interface combining web search, graph retrieval, and cross-document traversal. The authors process 2.46 million scientific papers to produce Scholar-KG, releasing a one-million-paper subset. Experiments show improvements in scientific information extraction, knowledge graph construction, and multi-hop reasoning.
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
K-BrowseComp is a new 400-problem benchmark for evaluating web-browsing agents in Korean-language contexts, with a 300-problem manually validated subset and a 100-problem adversarially constructed synthetic split. Frontier models including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 achieve only 30–46% on the verified subset, a significant drop from English BrowseComp performance, while Korean proprietary models score 0–10%. The benchmark exploits the asymmetry between problem creation and solving difficulty, and the adversarial synthetic split caps the strongest model at 26%, positioning it as a targeted stress test for agentic web-browsing capability.



