A pilot study using Polymarket as an externally resolved benchmark finds that the value of human-AI collaboration in forecasting depends heavily on the individual. Outcomes follow a trimodal distribution: most users either defer to the model or rubber-stamp their prior beliefs, while a minority engage in genuine complementary reasoning that matches or beats market accuracy. Collaborative traits (perspective-taking, intellectual humility, and curiosity) predicted who reached the high-performance mode, while raw cognitive ability and model benchmark scores did not. The results challenge the common practice of reporting human-AI collaboration effects as a single average, and a pre-registered replication is in preparation.
A new arXiv preprint introduces the Human Creativity Benchmark (HCB), which collects 15,000 professional judgments across five creative domains and three workflow phases to evaluate creative AI. The benchmark explicitly separates 'convergence' (shared professional standards) from 'divergence' (legitimate taste variation), arguing that collapsing these into a single quality metric discards actionable information. Key findings include that convergence concentrates on verifiable dimensions like technical correctness, while divergence concentrates on aesthetic direction and conceptual risk, and that no model excels uniformly across all workflow phases.
A preprint reports a 1,283-participant experiment using AI assistants to nudge behavior in iterated Collective Risk Games. Personalized prosocial framing (matched to Social Value Orientation profiles) increased cooperation and group success, but effects faded within a few rounds. Critically, when the same AI system was reconfigured to promote selfish behavior, the negative effects were larger and substantially more persistent. This asymmetry underscores the dual-use risks of AI-driven behavioral influence.
A new arXiv preprint benchmarks six deep learning architectures, two zero-shot foundation models, and statistical baselines on multi-horizon behavioural forecasting from wearable and smartphone data across 800+ participants. Key findings include: no single architecture dominates (PatchTST leads among trained models), TimesFM matches or exceeds trained models zero-shot especially in low-data regimes, and participant-level fine-tuning reduces per-feature RMSE by 16–60%. The study is the first to jointly evaluate modern deep learning, foundation models, and personalisation for this domain.
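For readers unfamiliar with the headline metric, the sketch below shows how a relative RMSE reduction such as the reported 16–60% range could be computed for one feature. The function names and the numbers in the usage note are illustrative assumptions, not the preprint's code or data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between observed and forecast values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline_rmse: float, tuned_rmse: float) -> float:
    """Fraction by which fine-tuning lowers RMSE relative to the baseline."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (baseline_rmse - tuned_rmse) / baseline_rmse
```

With a hypothetical baseline RMSE of 2.0 and a fine-tuned RMSE of 1.5, `relative_reduction(2.0, 1.5)` gives 0.25, i.e. a 25% reduction for that feature.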
A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.
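As background on what "calibration" means here, the following is a minimal sketch of binned expected calibration error (ECE), a standard way to measure how far stated probabilities drift from observed frequencies. It is a generic illustration and not necessarily the calibration notion or measure the paper analyzes; the function name and bin count are assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: size-weighted mean of |mean confidence - observed rate|."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        hit_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - hit_rate)
    return ece
```

A forecaster who says 0.8 and is right four times out of five scores an ECE near zero, while one who says 0.9 and is right only half the time scores about 0.4.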
This Hugging Face blog post introduces FutureBench, a benchmark designed to evaluate AI agents on their ability to predict future events, addressing the challenge of data contamination in standard benchmarks by using temporally forward-looking tasks. The approach tests whether agents can reason about and forecast outcomes beyond their training data cutoff. This framing positions future-event prediction as a rigorous, contamination-resistant evaluation methodology for frontier models and agents.
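The contamination argument can be made concrete with a small sketch: keep only questions whose outcome is decided after the model's training cutoff, so the answer cannot appear in training data. The `ForecastQuestion` type and `contamination_resistant` function are hypothetical names for illustration and do not reflect FutureBench's actual pipeline.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class ForecastQuestion:
    """Hypothetical record for a forecasting question."""
    text: str
    resolves_on: date  # date on which the outcome becomes known


def contamination_resistant(
    questions: Sequence[ForecastQuestion], training_cutoff: date
) -> List[ForecastQuestion]:
    """Keep questions that resolve strictly after the training cutoff."""
    return [q for q in questions if q.resolves_on > training_cutoff]
```

Under this filter, a question resolving before or on the cutoff date is dropped, since its outcome could in principle have been seen during training.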
A new arXiv preprint analyzes 53 papers on human-AI teaming and proposes a five-cluster taxonomy grounded in psychological teaming frameworks: AI Assistant, Ad-hoc Dependency, Ad-hoc Forced Dependency, Paired Equanimity, and Group Equanimity. The authors argue that disparate team types are currently studied under a single shared definition, raising concerns about cross-paper generalizability of findings. The paper concludes with a reporting checklist and guidance for field synthesis.
EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots, using an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics. It also finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points but per-metric divergences of up to six points, with rankings stable across a cross-family judge at 93% within ±1.
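One plausible reading of the "93% within ±1" figure is the share of items on which two judges' scores differ by at most one point. The sketch below computes that rate under this assumption; the function name and the scores in the usage note are illustrative and not taken from the benchmark.

```python
from typing import Sequence


def within_tolerance_rate(
    scores_a: Sequence[float], scores_b: Sequence[float], tolerance: float = 1.0
) -> float:
    """Fraction of paired scores whose absolute difference is <= tolerance."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return hits / len(scores_a)
```

For made-up judge scores `[5, 7, 3, 9]` and `[6, 7, 5, 9]`, the differences are 1, 0, 2, and 0, so three of four pairs agree within one point and the rate is 0.75.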
Anthropic has released the Anthropic Economic Index, an initiative tracking AI's effects on labor markets using anonymized data from approximately one million Claude.ai conversations matched to U.S. Department of Labor O*NET occupational tasks. Key findings show AI use is concentrated in software development and technical writing, with 36% of occupations seeing AI use in at least 25% of their tasks, and usage skewing toward augmentation (57%) over automation (43%). The underlying dataset is being open-sourced to enable independent research, and Anthropic is inviting economists and policy experts to contribute to the ongoing initiative. The analysis was enabled by Clio, Anthropic's privacy-preserving internal conversation analysis tool.