paper
Flaws in the LLM Automation Narrative
paper · active · provisional
flaws-in-the-llm-automation-narrative-3d0cffb2 · 1 event · first seen 7d ago
Aliases: Flaws in the LLM Automation Narrative
More like this (12)
Multi-Component LLM Agent
frontier LLMs
Audio-LLM
Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
LLMScan
SpeechLLM
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
LLM-based code change labeling pipeline
Artificial Analysis LLM Performance Leaderboard
LLM-as-a-Judge
Recent events (1)
Paper challenges LLM expert-level claims by measuring variance and error magnitude in code-based data analysis tasks
A new arXiv paper argues that standard LLM benchmarks overstate model capabilities by focusing on average performance on training-data-adjacent tasks while ignoring response variance and error magnitude. The authors introduce a novel benchmark requiring frontier LLMs to write code for data analysis tasks, comparing results against human expert submissions. Human experts outperformed the frontier LLM on average across multiple metrics and showed lower performance variability. The findings challenge the prevailing narrative that LLMs perform at human-expert level on knowledge economy tasks.