5arXiv cs.CL (Computation and Language)·3d ago

ConsumerSim: Generative social simulation reconstructs Consumer Confidence Index dynamics from synthetic populations

Researchers introduce ConsumerSim, a generative simulation framework that models Consumer Confidence Index (CCI) dynamics by constructing a microdata-calibrated synthetic population responding to macroeconomic, financial, policy, and news signals. The system outperforms persistence, time-series, and information-augmented baselines on U.S., EU27, and Japanese CCI reconstruction, with particular gains around high-salience economic shocks. Mechanism analyses reveal that CCI movements concentrate around salient events and that subgroup responses differ in magnitude across income, homeownership, education, and political-alignment dimensions. The work applies LLM-based generative agents to macroeconomic forecasting, contributing to the emerging literature on AI-driven social simulation.

Agent and Tool Ecosystem · Consumer Confidence Index · ConsumerSim

Related guides (1)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

6arXiv · cs.CL·23h ago·source ↗

Scaling laws study finds LLM social simulation fidelity mostly improves with compute, with notable exceptions

A new arXiv preprint investigates whether scaling compute improves the fidelity of LLM-based social simulations across three domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Using 85 Qwen3-architecture models trained under fixed-compute budgets from 10^18 to 10^20 FLOPs, plus 35 larger open-weight models up to 70B parameters, the authors find strong scaling in most settings. However, longitudinal forecasting, underrepresented populations, and specific cognitive bias calibration tasks (e.g., risk aversion) scale poorly, with fine-tuning failing to close gaps from 0.5B to 8B parameters. The work provides empirical grounding for where scaling will and will not suffice for social simulation research.

Evaluation and Benchmarking · DCLM · Will Scaling Improve Social Simulation with LLMs? · Qwen3 · +1 more

6arXiv · cs.AI·May 21, 2026·source ↗

Mind the Sim-to-Real Gap & Think Like a Scientist: Fisher-SEP for Simulation-Aided Experimental Policy

This paper studies when and how a planner should supplement a pre-trained simulator with real-world experiments in sequential decision problems. The authors decompose simulator value error into a calibration-deployment shift (identifiable via randomization) and an irreducible parametric residual, and show that purely passive learning cannot close the reachability component of the value gap. They propose Fisher-SEP, a simulation-aided experimental policy that minimizes posterior predictive variance of a target policy's value, with case studies in supply chain and HIV mobile-testing domains demonstrating regimes where designed exploration is necessary.

Evaluation and Benchmarking · AI Safety Research · vending-machine supply chain case study · Fisher-SEP · HIV mobile-testing case study · +5 more

6arXiv · cs.CL·Jun 8, 2026·source ↗

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a 15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social experience can transfer to real capability gains.

Agent and Tool Ecosystem · Alignment and RLHF · Agentopia · Agentopia: Long-Term Life Simulation and Learning in Agent Societies

5arXiv · cs.CL·Jun 5, 2026·source ↗

CollabSim: CSCW-grounded framework for evaluating collaborative competence in LLM multi-agent systems

Researchers introduce CollabSim, a configurable simulation framework for systematically evaluating collaborative competence in LLM-based multi-agent systems (MAS). The framework draws on Computer-Supported Cooperative Work (CSCW) theory to define collaborative capabilities beyond task outcomes, including common ground establishment, shared task understanding, and misalignment repair. Experiments across four LLMs demonstrate the framework can distinguish model performance patterns and reveal task-dependent effects of agent design choices. The work addresses a gap in MAS evaluation, which has historically focused on individual task-solving rather than coordination quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · CollabSim

6arXiv · cs.CL·May 26, 2026·source ↗

CausaLab: Scalable Benchmark for Interactive Causal Discovery by LLM Agents

CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.

Evaluation and Benchmarking · AI Safety Research · all-edge F1 · GPT-5.2-high · causal discovery · +3 more

4arXiv · cs.CL·3d ago·source ↗

SIMAX framework generates annotated synthetic clinician-patient dialogues for AI communication coding evaluation

Researchers introduce SIMAX, a framework for generating controlled, annotated synthetic clinician-patient dialogues to support development and evaluation of AI-driven clinical communication coding systems. The framework produces dialogues with reference behavioral annotations using two codebooks (Global and WISER), generating 3,388 simulated dialogues across three medical specialties with varied personas and accent conditions. Evaluation shows reasonable speech naturalness and high transcription fidelity, with downstream testing revealing the framework can expose sensitivity gaps in communication coding systems. The work addresses a data scarcity bottleneck in deploying ambient AI scribes in clinical settings.

Evaluation and Benchmarking · SIMAX · UTMOS · WISER Codebook · +1 more

4arXiv · cs.CL·Jun 18, 2026·source ↗

HACD-H: Formal theory of social intelligence emergence in long-term human-AI interaction

Researchers propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal dynamical model treating long-term human-AI interaction as a self-organizing social cognitive system. The framework unifies emotional adaptation, relational organization, social memory, and personality consistency, introducing concepts like relational attractors, trust basins, and social cognitive energy. Empirical evaluation on a ~14,700-turn conversational dataset finds that social intelligence correlates negatively with social cognitive energy (r = -0.391) and that interaction trajectories show progressive energy reduction and phase-transition-like developmental patterns. The work argues social intelligence emerges from coevolution over time rather than from isolated conversational capabilities.

Alignment and RLHF · Human-AI Coevolution Dynamics Framework · Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

5The Batch·Jun 1, 2026·source ↗

Persona Generators: Evolutionary LLM Method for Diverse Synthetic Human Personas

Google researchers Davide Paglieri, Logan Cross, and colleagues propose Persona Generators, a system that uses the AlphaEvolve evolutionary algorithm to generate code that produces 25 diverse persona prompts covering a broad range of attitudes and opinions. The method iteratively optimizes persona prompt diversity using six metrics, outperforming Nemotron Personas (82% vs 76% coverage of possible responses) and a Concordia memory-based baseline (46%). The system uses Gemini 2.5 Pro for questionnaire generation and Gemma 3-27B-IT for persona simulation via the Concordia agent library. The approach reframes persona generation as a coverage optimization problem rather than a data-matching one, enabling more representative synthetic user populations for product research.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Gemma 2 9B · Persona Generators · Davide Paglieri · +6 more
