6 · arXiv cs.AI (Artificial Intelligence) · 24h ago

Human capital traits, not model benchmarks, predict effective human-AI collaboration in forecasting

A pilot study using Polymarket as an externally resolved benchmark finds that the value of human-AI collaboration in forecasting is highly individual-dependent, with a trimodal distribution: most users either defer to the model or rubber-stamp their own prior beliefs, while a minority engage in genuine complementary reasoning that matches or beats market accuracy. Collaborative traits (perspective-taking, intellectual humility, and curiosity) predicted who reached the high-performance mode, while raw cognitive ability and model benchmark scores did not. The results challenge the common practice of reporting human-AI collaboration effects as a single average, and a pre-registered replication is in preparation.

Evaluation and Benchmarking · Polymarket · Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

6 · arXiv · cs.AI · 3d ago

Human Creativity Benchmark separates convergent and divergent professional judgment in creative AI evaluation

A new arXiv preprint introduces the Human Creativity Benchmark (HCB), which collects 15,000 professional judgments across five creative domains and three workflow phases to evaluate creative AI. The benchmark explicitly separates 'convergence' (shared professional standards) from 'divergence' (legitimate taste variation), arguing that collapsing these into a single quality metric discards actionable information. Key findings include that convergence concentrates on verifiable dimensions like technical correctness, while divergence concentrates on aesthetic direction and conceptual risk, and that no model excels uniformly across all workflow phases.

Evaluation and Benchmarking · Multimodal Progress · Human Creativity Benchmark

6 · arXiv · cs.CL · 4d ago

AI persuasive framing boosts cooperation in collective dilemmas but antisocial effects are larger and more persistent

A preprint reports a 1,283-participant experiment using AI assistants to nudge behavior in iterated Collective Risk Games. Personalized prosocial framing (matched to Social Value Orientation profiles) increased cooperation and group success, but effects faded within a few rounds. Critically, when the same AI system was reconfigured to promote selfish behavior, the negative effects were larger and substantially more persistent — revealing an asymmetry that underscores dual-use risks of AI-driven behavioral influence.

AI Safety Research · Alignment and RLHF · AI Persuasive Framing in Collective Dilemmas · Collective Risk Game · Social Value Orientation

4 · arXiv · cs.AI · Jun 15, 2026

Benchmark of deep learning architectures for multi-horizon behavioural forecasting in mobile health

A new arXiv preprint benchmarks six deep learning architectures, two zero-shot foundation models, and statistical baselines on multi-horizon behavioural forecasting from wearable and smartphone data across 800+ participants. Key findings include: no single architecture dominates (PatchTST leads among trained models), TimesFM matches or exceeds trained models zero-shot especially in low-data regimes, and participant-level fine-tuning reduces per-feature RMSE by 16–60%. The study is the first to jointly evaluate modern deep learning, foundation models, and personalisation for this domain.

Evaluation and Benchmarking · A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health · TimesFM · TCN

4 · arXiv · cs.AI · Jun 10, 2026

Theoretical analysis of calibration preservation in human-AI teaming frameworks

A new arXiv paper examines human-AI teaming through the lens of statistical calibration, analyzing both combination and delegation frameworks. The authors show that existing combination methods fail to preserve the human's calibration, while delegation methods shift the calibration burden to a rejector meta-model that must be calibrated finely enough to identify where each party excels. This demand grows with human expertise and becomes unattainable when the human uses information unavailable to the system.

Evaluation and Benchmarking · AI Safety Research · Human-AI Teaming Through the Lens of Calibration

5 · Hugging Face Blog · May 19, 2026

Back to The Future: Evaluating AI Agents on Predicting Future Events

This Hugging Face blog post introduces FutureBench, a benchmark designed to evaluate AI agents on their ability to predict future events, addressing the challenge of data contamination in standard benchmarks by using temporally forward-looking tasks. The approach tests whether agents can reason about and forecast outcomes beyond their training data cutoff. This framing positions future-event prediction as a rigorous, contamination-resistant evaluation methodology for frontier models and agents.

Evaluation and Benchmarking · Agent and Tool Ecosystem · FutureBench · Hugging Face

4 · arXiv · cs.AI · 29h ago

Taxonomy of human-AI team types derived from analysis of 53 papers

A new arXiv preprint analyzes 53 papers on human-AI teaming and proposes a five-cluster taxonomy grounded in psychological teaming frameworks: AI Assistant, Ad-hoc Dependency, Ad-hoc Forced Dependency, Paired Equanimity, and Group Equanimity. The authors argue that disparate team types are currently studied under a single shared definition, raising concerns about cross-paper generalizability of findings. The paper concludes with a reporting checklist and guidance for field synthesis.

Evaluation and Benchmarking · What Types of Human-AI Teams Exist?

6 · arXiv · cs.AI · 4d ago

EMPATH: Multilingual multi-turn safety benchmark for emotional-support chatbots reveals score inflation and run-to-run reliability failures

EMPATH is a new arXiv benchmark for evaluating the safety of emotional-support chatbots, using an auditor model to generate multi-turn crisis conversations and a calibrated judge model to score transcripts across 19 metrics in five dimensions. Built for Mexican Spanish and US English, the benchmark surfaces score inflation on 10 of 19 metrics under uncalibrated rubrics and finds that run-to-run reliability is a per-model safety property: one model swings 2–10 points on a crisis metric across identical reruns, and DeepSeek V4 Pro produces different conversations at temperature 0. Evaluation of three frontier models shows aggregate scores within 0.74 points but per-metric divergences up to six points, with rankings stable across a cross-family judge at 93% within ±1.

Evaluation and Benchmarking · AI Safety Research · EMPATH · DeepSeek V4

7 · Anthropic News · Jun 1, 2026

Anthropic Launches Economic Index: First Large-Scale Empirical Study of AI's Labor Market Impact

Anthropic has released the Anthropic Economic Index, an initiative tracking AI's effects on labor markets using anonymized data from approximately one million Claude.ai conversations matched to U.S. Department of Labor O*NET occupational tasks. Key findings show AI use is concentrated in software development and technical writing, with 36% of occupations seeing AI use in at least 25% of their tasks, and usage skewing toward augmentation (57%) over automation (43%). The underlying dataset is being open-sourced to enable independent research, and Anthropic is inviting economists and policy experts to contribute to the ongoing initiative. The analysis was enabled by Clio, Anthropic's privacy-preserving internal conversation analysis tool.

Evaluation and Benchmarking · Enterprise Deployment Patterns · claude.ai · Clio · U.S. Department of Labor