6 · arXiv cs.AI (Artificial Intelligence) · 10d ago

Atlas H&E-TME: AI system matches expert pathologist accuracy for scalable tumor microenvironment profiling

Researchers present Atlas H&E-TME, an AI system built on the Atlas family of pathology foundation models that generates over 4,500 quantitative readouts per whole-slide H&E image at cell-level resolution across multiple cancer types. The system is validated using a novel dual framework: an IHC-informed multi-pathologist consensus protocol for depth, and benchmarking against 200,000+ annotations across 1,500+ cases from 25+ sources spanning eight cancer types. Atlas H&E-TME matches or exceeds pathologist H&E-only performance, demonstrating that standard histopathology slides can serve as a scalable quantitative window into the tumor microenvironment. The work advances computational pathology by enabling tissue-based biomarker discovery without requiring specialized staining modalities.

Evaluation and Benchmarking Atlas H&E-TME Atlas

Related events (8)

6 · The Batch · 1mo ago

Two Studies Test Google's Breast Cancer Detection Models in Real-World Clinics

Two studies evaluated Google's mammography AI system—introduced in 2020 but not yet deployed for live patient care—against real-world UK NHS clinical workflows. In retrospective testing on 116,000 scans, the system achieved higher sensitivity (0.541 vs 0.437) than the first human reader while identifying 25% of cancers initially missed by doctors. A live integration test across 12 clinics showed the system processed scans in under 18 minutes versus over two days for human readers, with comparable accuracy, though some clinicians reported distrust of the system's outputs.
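
The sensitivity figures above are the fraction of actual cancers that a reader flagged. A minimal sketch of that arithmetic follows; the counts are hypothetical, chosen per 1,000 cancers only so that they reproduce the reported rates, and are not taken from the studies.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Return TP / (TP + FN): the share of actual positives that were flagged."""
    total_positives = true_positives + false_negatives
    if total_positives == 0:
        raise ValueError("sensitivity is undefined without positive cases")
    return true_positives / total_positives


# Hypothetical counts per 1,000 cancers (assumed, not from the studies).
ai_sensitivity = sensitivity(true_positives=541, false_negatives=459)
reader_sensitivity = sensitivity(true_positives=437, false_negatives=563)
print(f"AI: {ai_sensitivity:.3f}, first reader: {reader_sensitivity:.3f}")
```

Sensitivity alone says nothing about false alarms, so it is normally read alongside specificity or recall rate.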

Evaluation and Benchmarking Enterprise Deployment Patterns Google iCAD Christopher J. Kelly +5 more

4 · Meta AI Blog · 1mo ago

Orakl Oncology uses Meta's DINOv2 to accelerate cancer organoid analysis and drug response prediction

Orakl Oncology, a spinoff from the Gustave Roussy Institute, has deployed Meta's open-source DINOv2 vision model to analyze cancer organoid images and predict patient drug responses in clinical trials. In collaboration with CentraleSupelec and the Jaulin Lab under the RHU ORGANOMIC initiative, the team found that DINOv2 outperformed prior specialized models by 26.8% in accuracy. The model enabled quantitative extraction of imaging data from organoid videos, replacing labor-intensive frame-by-frame analysis and significantly accelerating development of the company's biomedical platform.

Open Weights Progress Multimodal Progress Meta AI (FAIR) Gustave Roussy Institute RHU ORGANOMIC +5 more

6 · arXiv · cs.CL · 13d ago

LLM-guided MAP-Elites evolution improves medical decision pipelines at inference time

Researchers propose using LLM-guided MAP-Elites evolutionary search as an inference-time alternative to fine-tuning for adapting LLMs to clinical workflows, formulating triage, consultation, and image classification as evolutionary searches over executable artifacts. Across three medical settings, evolved programs substantially outperform manually designed baselines: triage accuracy improves from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, with gains also shown on MIMIC-ESI, iCRAFTMD, and PneumoniaMNIST. The approach works across Llama-3, Qwen-3.5, and Gemma-4 backbones and produces interpretable program-level mechanisms rather than superficial prompt changes.
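
For readers unfamiliar with MAP-Elites, here is a minimal sketch of the core loop on a toy problem. It is not the paper's method: random Gaussian mutation stands in for the LLM-proposed edits, the candidates are 2-D points rather than executable programs, and every name (`map_elites`, `fitness`, `descriptor`, `mutate`, `N_CELLS`) is hypothetical.

```python
import random

N_CELLS = 10  # number of behavior cells in the archive (assumed for the toy problem)


def fitness(point):
    """Toy objective: higher is better, with the peak at (0.5, 0.5)."""
    x, y = point
    return -((x - 0.5) ** 2 + (y - 0.5) ** 2)


def descriptor(point):
    """Map a candidate to a behavior cell by binning its first coordinate."""
    return min(int(point[0] * N_CELLS), N_CELLS - 1)


def mutate(point, rng):
    """Gaussian perturbation clamped to [0, 1]; a stand-in for an LLM-proposed edit."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in point)


def map_elites(init, iterations, rng):
    """Keep the best candidate seen so far (the elite) for each behavior cell."""
    archive = {}  # cell index -> (fitness, candidate)

    def consider(candidate):
        cell = descriptor(candidate)
        score = fitness(candidate)
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, candidate)

    consider(init)
    for _ in range(iterations):
        # Pick a random occupied cell and mutate its elite.
        _, parent = archive[rng.choice(sorted(archive))]
        consider(mutate(parent, rng))
    return archive


archive = map_elites(init=(0.1, 0.9), iterations=2000, rng=random.Random(0))
best_score, best_point = max(archive.values())
print(len(archive), "cells filled; best fitness", round(best_score, 4))
```

The archive keeps one elite per cell instead of a single global best, which is what lets the search return a diverse set of strong candidates.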

Evaluation and Benchmarking Agent and Tool Ecosystem Gemma-4 E4B-it MIMIC-ESI iCRAFTMD +6 more

5 · arXiv · cs.LG · 10d ago

ATLAS: Active learning framework for automated discovery of interpretable behavioral models in cognitive science

ATLAS (Active Theory Learning for Automated Science) is a new active learning framework that iterates between generating mechanistic hypotheses as sparse neural network ensembles and designing maximally informative experiments to distinguish between them. The system is tested on recovering reinforcement learning agents from behavioral data in bandit tasks, achieving 5-10x sample efficiency improvements over random experimentation and matching expert-designed experiments from the literature. The work targets automated scientific discovery in cognitive science, with potential generalization to other domains requiring mechanistic modeling.

Evaluation and Benchmarking ATLAS: Active Theory Learning for Automated Science Disentangled RNNs Atlas

5 · OpenAI Blog · 1mo ago

Color Health's Cancer Copilot Uses GPT-4o for Oncology Workup Planning

Color Health has partnered with OpenAI to deploy GPT-4o in a clinical application called Cancer Copilot, designed to identify missing diagnostics and generate tailored cancer workup plans. The system aims to accelerate patient access to cancer screening and treatment by supporting evidence-based clinical decision-making. This represents a concrete enterprise deployment of GPT-4o in a high-stakes medical context.

Enterprise Deployment Patterns Agent and Tool Ecosystem Cancer Copilot GPT-4o Color Health +1 more

8 · OpenAI Blog · 1mo ago

Measuring AI's capability to accelerate biological research

OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. The work uses GPT-5 to optimize a molecular cloning protocol as a concrete demonstration case. The framework explicitly addresses both the potential benefits and biosecurity risks of AI-assisted experimentation, positioning this as a dual-use capability assessment.

Frontier Model Releases Evaluation and Benchmarking wet lab biological research evaluation framework OpenAI molecular cloning +3 more

5 · arXiv · cs.CL · 5d ago

MetaSyn benchmark reveals critical screening bottleneck in LLM-based meta-analysis pipelines

Researchers introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, paired with a 140k-article PubMed retrieval corpus, PI/ECO criteria, verified positives, and hard negatives. Benchmarking twelve pipeline configurations — nine RAG variants and a protocol-driven agent — shows that despite 90.9% retrieval recall at K=200, no system recovers more than 52.7% of ground-truth included studies. The core failure is LLMs' inability to reliably distinguish eligible studies from topically similar but criteria-failing distractors. The paper argues that end-to-end scores obscure where pipelines break down and proposes stage-attributed metrics.
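
The gap between high retrieval recall and low end-to-end recovery can be made concrete by scoring each stage separately. The sketch below is one plausible reading of "stage-attributed metrics", not the paper's definitions; the function name, the three metric names, and the example study IDs are all assumptions.

```python
def stage_metrics(ground_truth, retrieved, included):
    """Split end-to-end recall into a retrieval stage and a screening stage.

    ground_truth: IDs of studies the expert meta-analysis included
    retrieved:    IDs returned by the retriever (top K)
    included:     IDs the pipeline finally kept after screening
    """
    ground_truth, retrieved, included = set(ground_truth), set(retrieved), set(included)
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    reachable = ground_truth & retrieved   # eligible studies that survived retrieval
    recovered = reachable & included       # eligible studies that also survived screening
    return {
        "retrieval_recall": len(reachable) / len(ground_truth),
        "screening_recall": len(recovered) / len(reachable) if reachable else 0.0,
        "end_to_end_recall": len(recovered) / len(ground_truth),
    }


# Hypothetical example: 10 eligible studies, 9 retrieved, 5 kept after screening.
truth = [f"s{i}" for i in range(10)]
retrieved = truth[:9] + ["distractor_a", "distractor_b"]
included = truth[:5] + ["distractor_a"]
metrics = stage_metrics(truth, retrieved, included)
print(metrics)
```

Under these definitions the end-to-end recall is the product of the two stage recalls, so a low final score with high retrieval recall points at screening as the bottleneck.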

Evaluation and Benchmarking Agent and Tool Ecosystem PubMed Nature Portfolio MetaSyn

5 · arXiv · cs.CL · 19d ago

AutoForest: End-to-End LLM System for Automated Forest Plot Generation from Biomedical Studies

AutoForest is presented as the first end-to-end system that generates publication-ready forest plots directly from biomedical papers using large language models. The system automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders forest plots without manual intervention. A user study with clinicians demonstrates its effectiveness on real-world examples, aiming to accelerate systematic review and meta-analysis workflows.
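
The summary does not say which synthesis method AutoForest uses. As an illustration of what a "statistical synthesis" step typically computes before a forest plot is drawn, here is a minimal sketch of fixed-effect inverse-variance pooling; the function name and the example effect sizes are hypothetical.

```python
import math


def pool_fixed_effect(effects, std_errors):
    """Inverse-variance fixed-effect pooling of per-study effect estimates.

    Each study is weighted by 1 / SE^2; returns the pooled effect, its
    standard error, and an approximate 95% confidence interval.
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect, and at least one study")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se**2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical log odds ratios and standard errors for two studies.
pooled, pooled_se, ci = pool_fixed_effect(effects=[0.2, 0.4], std_errors=[0.1, 0.1])
print(f"pooled={pooled:.3f}, se={pooled_se:.4f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

A forest plot then draws one row per study plus a final row for the pooled estimate and its interval; a random-effects model would be the usual alternative when studies are heterogeneous.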

Enterprise Deployment Patterns Agent and Tool Ecosystem ICO (Intervention, Comparator, Outcome) framework large language models AutoForest +2 more