Entity · model

Large Language Models (frontier)

model · active · large-language-models-frontier--57cd7202 · 2 events · first seen May 19, 2026

Aliases: Large Language Models (frontier), frontier language models

Co-occurring entities

deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation · Clinical Ethics Benchmark · Value Pluralism Audit Framework · Overton Pluralism · Patient Autonomy

More like this (12)

large language models · large language model agents · Understanding Large Language Models · Multimodal Large Language Models · OpenAI frontier models · Large Reasoning Models · Reinforcement Learning for Language Models · OpenAI Frontier · frontier reasoning models · unsupervised language modeling · Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact · 1B-scale language models

Recent events (2)

7 · arXiv · cs.AI · May 21, 2026

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks (open-web search, evidence collection, and multi-step derivation), an area where existing benchmarks have become saturated. It evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck: derivation and calibration failures account for over 70% of errors. Strong models fail through incomplete derivation, while weak models fail through hallucinated precision. Models also show genuine domain specialization, with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.

Frontier Model Releases · Evaluation and Benchmarking · deep research agents · DeepWeb-Bench · Retrieval-Augmented Generation · +2 more

6 · arXiv · cs.AI · May 19, 2026

Auditing Value Pluralism in Clinical Ethics of Large Language Models

Researchers present a framework for auditing ethical value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities from model decisions. In aggregate, frontier LLMs span physician-level value heterogeneity, and they discuss competing values in their reasoning. However, each individual model's decisions are near-deterministic and fail to reproduce the distributional pluralism of physician panels. Some models systematically underweight patient autonomy. The authors warn that deploying a single LLM at scale risks replacing clinical pluralism with a 'deployment monoculture.'

Evaluation and Benchmarking · AI Safety Research · Clinical Ethics Benchmark · Value Pluralism Audit Framework · Overton Pluralism · +4 more