Almanac
Large Language Models (frontier)

model · active · large-language-models-frontier--57cd7202 · 2 events · first seen 29d ago

Aliases: Large Language Models (frontier), frontier language models

Recent events (2)

6 · arXiv · cs.AI · 29d ago

Auditing Value Pluralism in Clinical Ethics of Large Language Models

Researchers present a framework for auditing ethical value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities from model decisions. While frontier LLMs span physician-level value heterogeneity in aggregate and discuss competing values in reasoning, individual model decisions are near-deterministic and fail to reproduce the distributional pluralism of physician panels. Some models systematically underweight patient autonomy. The authors warn that deploying a single LLM at scale risks replacing clinical pluralism with a 'deployment monoculture.'

7 · arXiv · cs.AI · 27d ago

DeepWeb-Bench: A Hard Deep Research Benchmark Requiring Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a new benchmark designed to stress-test frontier language models on deep research tasks—open-web search, evidence collection, and multi-step derivation—where existing benchmarks have become saturated. The benchmark evaluates nine frontier models across four capability families (Retrieval, Derivation, Reasoning, Calibration) and finds that retrieval is not the primary bottleneck; derivation and calibration failures account for over 70% of errors. Strong models fail via incomplete derivation while weak models fail via hallucinated precision, and models show genuine domain specialization with low cross-model agreement (rho = 0.61). The benchmark, rubrics, and evaluation code are publicly released.