6 · The Batch (DeepLearning.AI) · 1mo ago

Two Studies Test Google's Breast Cancer Detection Models in Real-World Clinics

Two studies evaluated Google's mammography AI system, introduced in 2020 but not yet deployed for live patient care, against real-world UK NHS clinical workflows. In retrospective testing on 116,000 scans, the system achieved higher sensitivity than the first human reader (0.541 vs. 0.437) and identified 25% of the cancers that doctors initially missed. In a live integration test across 12 clinics, the system processed scans in under 18 minutes, versus more than two days for human readers, with comparable accuracy, though some clinicians reported distrust of its outputs.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Google · iCAD · Christopher J. Kelly · National Health Service · convolutional neural network · Google Mammography AI · Imperial College London · Marc Wilson

Related guides (3)

Google

Google: The AI Lab That Builds Everything from DNA Models to Your Phone's Assistant


Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

7 · Google DeepMind Blog · 1mo ago

MedGemma: DeepMind releases most capable open models for health AI development

Google DeepMind has announced new multimodal models in the MedGemma collection, described as its most capable open models for health AI development. The release expands the MedGemma family with enhanced multimodal capabilities targeting medical and clinical AI applications. As open models, they are intended to support developers building health AI systems.

Open Weights Progress · Enterprise Deployment Patterns · Gemma · Google DeepMind · MedGemma

6 · OpenAI Blog · 1mo ago

OpenAI and Penda Health debut AI clinical copilot with 16% diagnostic error reduction

OpenAI has partnered with Penda Health to deploy an AI clinical copilot in real-world healthcare settings. The system reportedly reduces diagnostic errors by 16%, representing a concrete outcome metric from a live deployment rather than a controlled trial. This marks a notable enterprise deployment of OpenAI technology in African healthcare infrastructure.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Penda Health · AI Clinical Copilot · OpenAI

5 · OpenAI Blog · 1mo ago

Color Health's Cancer Copilot Uses GPT-4o for Oncology Workup Planning

Color Health has partnered with OpenAI to deploy GPT-4o in a clinical application called Cancer Copilot, designed to identify missing diagnostics and generate tailored cancer workup plans. The system aims to accelerate patient access to cancer screening and treatment by supporting evidence-based clinical decision-making. This represents a concrete enterprise deployment of GPT-4o in a high-stakes medical context.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Cancer Copilot · GPT-4o · Color Health

6 · OpenAI Blog · 1mo ago

Introducing HealthBench

OpenAI has released HealthBench, a new evaluation benchmark designed to assess AI model performance and safety in healthcare settings. The benchmark was developed with input from over 250 physicians and targets realistic clinical scenarios. It aims to establish a shared standard for measuring how well AI models handle health-related tasks.

Evaluation and Benchmarking · AI Safety Research · HealthBench · OpenAI

6 · arXiv · cs.AI · 9d ago

Atlas H&E-TME: AI system matches expert pathologist accuracy for scalable tumor microenvironment profiling

Researchers present Atlas H&E-TME, an AI system built on the Atlas family of pathology foundation models that generates over 4,500 quantitative readouts per whole-slide H&E image at cell-level resolution across multiple cancer types. The system is validated using a novel dual framework: an IHC-informed multi-pathologist consensus protocol for depth, and benchmarking against 200,000+ annotations across 1,500+ cases from 25+ sources spanning eight cancer types. Atlas H&E-TME matches or exceeds pathologist H&E-only performance, demonstrating that standard histopathology slides can serve as a scalable quantitative window into the tumor microenvironment. The work advances computational pathology by enabling tissue-based biomarker discovery without requiring specialized staining modalities.

Evaluation and Benchmarking · Atlas H&E-TME · Atlas

8 · OpenAI Blog · 1mo ago

Measuring AI's capability to accelerate biological research

OpenAI introduces a real-world evaluation framework designed to measure how AI systems can accelerate biological research in wet lab settings. The work uses GPT-5 to optimize a molecular cloning protocol as a concrete demonstration case. The framework explicitly addresses both the potential benefits and biosecurity risks of AI-assisted experimentation, positioning this as a dual-use capability assessment.

Frontier Model Releases · Evaluation and Benchmarking · wet lab · biological research · evaluation framework · OpenAI · molecular cloning

5 · Google DeepMind Blog · 1mo ago

Enabling a new model for healthcare with AI co-clinician

DeepMind has published a blog post outlining research into an AI co-clinician concept aimed at augmenting clinical care. The post describes a vision for AI-augmented healthcare where AI systems work alongside medical professionals. The content appears to be a high-level research direction announcement rather than a specific model or product release.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · AI Co-Clinician · Google DeepMind

6 · arXiv · cs.CL · 29d ago

Systematic 14-Day Evaluation of Six AI Chatbots as News Intermediaries Across Languages and Regions

Researchers evaluated six commercial AI chatbots (Gemini 3 Flash/Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services over 14 days in February 2026. Top systems exceed 90% multiple-choice accuracy on breaking news but lose 11-17% under free-response conditions. Key findings include systematic Hindi-language underperformance (79% vs. 89-91% elsewhere) driven by Anglophone retrieval bias, retrieval failures accounting for over 70% of errors, and dramatic accuracy collapse (to 19-70%) on questions containing subtle false premises. A detection-accuracy paradox is identified: the best false-premise detector does not yield the best adversarial accuracy, suggesting premise detection and answer recovery are partially independent capabilities.

Frontier Model Releases · Evaluation and Benchmarking · Gemini 3.5 Pro · BBC News · GPT-4o mini