Researchers introduce ConsumerSim, a generative simulation framework that models Consumer Confidence Index (CCI) dynamics by constructing a microdata-calibrated synthetic population responding to macroeconomic, financial, policy, and news signals. The system outperforms persistence, time-series, and information-augmented baselines on U.S., EU27, and Japanese CCI reconstruction, with particular gains around high-salience economic shocks. Mechanism analyses reveal that CCI movements concentrate around salient events and that subgroup responses differ in magnitude across income, homeownership, education, and political-alignment dimensions. The work applies LLM-based generative agents to macroeconomic forecasting, contributing to the emerging literature on AI-driven social simulation.
A new arXiv preprint investigates whether scaling compute improves the fidelity of LLM-based social simulations across three domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Using 85 Qwen3-architecture models trained under fixed-compute budgets from 10^18 to 10^20 FLOPs, plus 35 larger open-weight models up to 70B parameters, the authors find strong scaling in most settings. However, longitudinal forecasting, underrepresented populations, and specific cognitive bias calibration tasks (e.g., risk aversion) scale poorly, and fine-tuning models from 0.5B to 8B parameters fails to close these gaps. The work provides empirical grounding for where scaling will and will not suffice for social simulation research.
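To make the fixed-compute setup above concrete, here is a minimal sketch of how a FLOP budget constrains training tokens for a given model size. It assumes the widely used approximation C ≈ 6·N·D (compute ≈ 6 × parameters × tokens); the summary does not say which compute accounting the authors use, and the model sizes below are illustrative rather than the paper's actual grid.

```python
def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Training tokens D implied by the approximation C ≈ 6 * N * D.

    This accounting rule is an assumption; the preprint may count compute differently.
    """
    return flops_budget / (6.0 * n_params)


# Budgets spanning the range quoted in the summary.
budgets = [1e18, 1e19, 1e20]
# Hypothetical model sizes, chosen only for illustration.
model_sizes = [5e8, 2e9, 8e9]

for budget in budgets:
    for n_params in model_sizes:
        tokens = tokens_for_budget(budget, n_params)
        print(f"C={budget:.0e} FLOPs, N={n_params:.1e} params -> D={tokens:.2e} tokens")
```

Under this approximation, a larger model at the same budget sees proportionally fewer tokens, which is the trade-off a fixed-compute sweep is designed to expose.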
This paper studies when and how a planner should supplement a pre-trained simulator with real-world experiments in sequential decision problems. The authors decompose simulator value error into a calibration-deployment shift (identifiable via randomization) and an irreducible parametric residual, and show that purely passive learning cannot close the reachability component of the value gap. They propose Fisher-SEP, a simulation-aided experimental policy that minimizes posterior predictive variance of a target policy's value, with case studies in supply chain and HIV mobile-testing domains demonstrating regimes where designed exploration is necessary.
Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric intended to mirror human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a 15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can translate into real capability gains.
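The rejection-sampling step described above can be sketched roughly as follows: score each simulated trajectory with a reward function and keep only those that clear a threshold as training data. The reward here (`toy_life_reward`), the event names, and the threshold are hypothetical stand-ins; the summary does not specify how Agentopia's life reward is computed.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = List[str]


def toy_life_reward(trajectory: Trajectory) -> float:
    """Hypothetical reward: fraction of events that belong to a 'positive' set."""
    positive = {"made_friend", "learned_skill", "helped_neighbor"}
    if not trajectory:
        return 0.0
    return sum(event in positive for event in trajectory) / len(trajectory)


def rejection_sample(
    candidates: Dict[str, List[Trajectory]],
    reward_fn: Callable[[Trajectory], float],
    threshold: float,
) -> List[Tuple[str, Trajectory]]:
    """Keep only (agent, trajectory) pairs whose reward meets the threshold."""
    kept = []
    for agent_id, trajectories in candidates.items():
        for trajectory in trajectories:
            if reward_fn(trajectory) >= threshold:
                kept.append((agent_id, trajectory))
    return kept


# Toy simulated experience for two agents.
candidates = {
    "agent_0": [["made_friend", "learned_skill"], ["argued", "skipped_work"]],
    "agent_1": [["helped_neighbor", "argued"]],
}
training_set = rejection_sample(candidates, toy_life_reward, threshold=0.5)
print(training_set)
```

The retained trajectories would then serve as fine-tuning examples; how the authors format or weight them is not stated in the summary.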
Researchers introduce CollabSim, a configurable simulation framework for systematically evaluating collaborative competence in LLM-based multi-agent systems (MAS). The framework draws on Computer-Supported Cooperative Work (CSCW) theory to define collaborative capabilities beyond task outcomes, including common ground establishment, shared task understanding, and misalignment repair. Experiments across four LLMs demonstrate the framework can distinguish model performance patterns and reveal task-dependent effects of agent design choices. The work addresses a gap in MAS evaluation, which has historically focused on individual task-solving rather than coordination quality.
CausaLab is a new evaluation environment that tests LLM agents on interactive causal discovery tasks, requiring them to recover both causal graphs and structural equations from synthetic laboratory episodes governed by randomly sampled structural causal models (SCMs). The benchmark separates predictive accuracy from genuine causal understanding, revealing a persistent gap: GPT-5.2-high achieves 92% task accuracy in a 6-node observational setting but only 0.471 all-edge F1 for mechanism recovery. Mixed observation-intervention strategies improve structural fidelity, while pure intervention strategies underperform on both metrics. Premature stopping is identified as a key agent weakness, partially mitigated by prompting models to verify hypothesis-data consistency.
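For readers unfamiliar with edge-level scoring, the following is a minimal sketch of an F1 score over directed edges of a causal graph. It assumes "all-edge F1" means F1 over the full set of directed edges, where a reversed edge counts as both a false positive and a false negative; the benchmark's exact definition is not given in the summary.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def edge_f1(true_edges: Iterable[Edge], predicted_edges: Iterable[Edge]) -> float:
    """F1 over directed edges; direction matters, so (A, B) != (B, A)."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    if not true_set and not pred_set:
        return 1.0  # both graphs empty: treat as perfect agreement
    true_positives = len(true_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 3-node example: one edge recovered, one reversed, one missed.
true_graph = [("A", "B"), ("B", "C"), ("A", "C")]
predicted_graph = [("A", "B"), ("C", "B")]
print(round(edge_f1(true_graph, predicted_graph), 3))
```

An agent can predict outcomes well while still scoring low on a metric like this, which is the gap between task accuracy and mechanism recovery that the benchmark reports.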
Researchers introduce SIMAX, a framework for generating controlled, annotated synthetic clinician-patient dialogues to support development and evaluation of AI-driven clinical communication coding systems. The framework produces dialogues with reference behavioral annotations using two codebooks (Global and WISER), generating 3,388 simulated dialogues across three medical specialties with varied personas and accent conditions. Evaluation shows reasonable speech naturalness and high transcription fidelity, with downstream testing revealing the framework can expose sensitivity gaps in communication coding systems. The work addresses a data scarcity bottleneck in deploying ambient AI scribes in clinical settings.
Researchers propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal dynamical model treating long-term human-AI interaction as a self-organizing social cognitive system. The framework unifies emotional adaptation, relational organization, social memory, and personality consistency, introducing concepts like relational attractors, trust basins, and social cognitive energy. Empirical evaluation on a ~14,700-turn conversational dataset finds that social intelligence correlates negatively with social cognitive energy (r = -0.391) and that interaction trajectories show progressive energy reduction and phase-transition-like developmental patterns. The work argues social intelligence emerges from coevolution over time rather than from isolated conversational capabilities.
Google researchers Davide Paglieri, Logan Cross, and colleagues propose Persona Generators, a system that uses the AlphaEvolve evolutionary algorithm to generate code that produces 25 diverse persona prompts covering a broad range of attitudes and opinions. The method iteratively optimizes persona prompt diversity using six metrics, outperforming Nemotron Personas (82% vs 76% coverage of possible responses) and a Concordia memory-based baseline (46%). The system uses Gemini 2.5 Pro for questionnaire generation and Gemma 3-27B-IT for persona simulation via the Concordia agent library. The approach reframes persona generation as a coverage optimization problem rather than a data-matching one, enabling more representative synthetic user populations for product research.
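One way to read "coverage of possible responses" is as the fraction of answer options, across all questionnaire items, that at least one persona selects. The sketch below implements that reading; it is an assumption, not the paper's stated metric, and the question IDs and answer options are made up for illustration.

```python
from typing import Dict, List


def response_coverage(
    persona_answers: List[Dict[str, str]],
    options_per_question: Dict[str, List[str]],
) -> float:
    """Fraction of (question, option) pairs chosen by at least one persona."""
    total = sum(len(options) for options in options_per_question.values())
    if total == 0:
        return 0.0
    covered = 0
    for question, options in options_per_question.items():
        seen = {answers[question] for answers in persona_answers if question in answers}
        covered += len(seen & set(options))
    return covered / total


# Hypothetical questionnaire with two items.
options = {"q1": ["agree", "neutral", "disagree"], "q2": ["yes", "no"]}
# Two simulated personas and their answers.
personas = [
    {"q1": "agree", "q2": "yes"},
    {"q1": "neutral", "q2": "yes"},
]
print(response_coverage(personas, options))
```

Under this reading, an optimizer that maximizes coverage is rewarded for producing personas that disagree with each other, rather than for matching any observed population distribution.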