Scaling laws study finds LLM social simulation fidelity mostly improves with compute, with notable exceptions
A new arXiv preprint investigates whether scaling compute improves the fidelity of LLM-based social simulations across three domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Using 85 Qwen3-architecture models trained under fixed-compute budgets from 10^18 to 10^20 FLOPs, plus 35 larger open-weight models up to 70B parameters, the authors find strong scaling in most settings. However, longitudinal forecasting, underrepresented populations, and specific cognitive bias calibration tasks (e.g., risk aversion) scale poorly, with fine-tuning failing to close gaps from 0.5B to 8B parameters. The work provides empirical grounding for where scaling will and will not suffice for social simulation research.
Related guides (1)
Related events (8)
Investing in Performance: Fine-tune small models with LLM insights — a CFM case study
This Hugging Face blog post presents a case study from CFM (Capital Fund Management) on using large language model outputs to guide fine-tuning of smaller, more efficient models for financial applications. The approach leverages LLM-generated signals or labels to train compact models that can be deployed at lower cost and latency. The case study illustrates an enterprise pattern of distilling LLM capabilities into task-specific smaller models for production use.
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency
This paper establishes a quantitative scaling law linking LLM factual recall to both model parameter count and topic frequency in training data, evaluated across 38 models on 8,900+ scholarly references. Recall quality follows a sigmoid function in the log-linear combination of these two variables, explaining 60% of variance across 16 dense models from four families and 74-94% within individual families. The authors propose a superposition-inspired mechanism where recall is gated by a signal-to-noise ratio: concept frequency provides signal and model capacity sets the noise floor. This provides a predictive framework for understanding and anticipating LLM confabulation patterns.
Scaling Laws for Neural Language Models
OpenAI published foundational research establishing empirical scaling laws for neural language models, showing that model performance scales predictably with compute, data, and parameters. The work demonstrated power-law relationships between these factors and loss, providing a principled framework for allocating training resources. This paper became a cornerstone of modern large language model development strategy.
Optimizing your LLM in production
A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.
LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts
A preprint from arXiv demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, addressing the resource-intensive nature of traditional reproducibility efforts.
Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior
Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.
SIMMER benchmark exposes high rates of latent planning failures in frontier LLMs
Researchers introduce SIMMER, a benchmark for evaluating latent failures in LLM-generated executable plans within a kitchen-domain world model comprising 77 actions, 262 objects, and ~46,800 possible interactions. Unlike existing benchmarks that only catch immediate execution failures, SIMMER detects silent hazards and irreversible consequences using a state machine executor. Experiments across six LLMs find that even frontier models produce error-free plans at most 17% of the time, with up to 56% of plans containing latent failures—most leading to irreversible outcomes. The paper also shows that counterfactual foresight simulation can reduce latent failures by up to 72%, pointing toward a mitigation direction.
Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks
Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.
