4 · arXiv · cs.CL (Computation and Language) · 10d ago

CASPER: Narratological analysis of character variety in LLM-generated vs. human-written stories

A new arXiv preprint introduces CASPER, a framework borrowing narratological dimensions (such as stylization and wholeness) to analyze character portrayal in LLM-generated versus human-written fiction. The study automatically infers character categories across both corpora and compares them along eight dimensions. The work addresses whether LLMs produce character variety comparable to human authors, with implications for creative AI applications.

Evaluation and Benchmarking · CASPER in the Machine: Insights into Character Variety in LLM-Generated Stories

Related guides (1)

Evaluation and Benchmarking · Topic guide

AI Evaluation and Benchmarking: From Leaderboards to the Limits of Measurement

Related events (8)

4 · arXiv · cs.CL · 15h ago

World Wide Models: Literary Tools for Cultural AI — framework for culturally literate LLMs

A preprint from arXiv proposes applying literary disciplines — comparative literature, narratology, critical theory, and world literature — as a framework for building more culturally literate AI systems. The essay argues that LLMs currently enact a 'massive, automated, and monolingual' form of cultural encounter and that structural monolingualism is a core problem. It develops a layered framework addressing global AI textuality through macrostructure, circulation, and untranslatability.

Evaluation and Benchmarking · World Wide Models: Literary Tools for Cultural AI

6 · arXiv · cs.CL · 33h ago

Study finds LLM-generated research ideas cluster around synthesis and bridging, diverging from human distribution

A new arXiv paper introduces a large-scale evaluation framework for comparing LLM-generated research ideas against human-authored ones, using reverse-engineered prior-work sets as prompts. The authors develop a two-axis taxonomy of research taste (opportunity pattern and research paradigm) and find a consistent distributional gap: LLMs over-index on bridge-like opportunities and synthesis methods, while human researchers spread more broadly across framing and contribution types. The result suggests current LLMs produce reasonable but systematically narrower and shifted ideation relative to human researchers.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Measuring the Gap Between Human and LLM Research Ideas

5 · arXiv · cs.AI · 1mo ago

Human Decision-Making with Persuasive and Narrative LLM Explanations

A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Regardless of persuasiveness level, the narratives did not meaningfully improve decision accuracy over showing the AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased reliance on the AI whether its prediction was correct or incorrect, and more persuasive narratives may have slowed response times and reduced participants' ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and warrant further investigation into when and how they should be deployed.

Evaluation and Benchmarking · AI Safety Research · Narrative Explanations · large language models · Explainable AI (XAI)

5 · arXiv · cs.CL · 28d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking · AI Safety Research · From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation · Meta Llama-3.1-8B

4 · arXiv · cs.AI · 22d ago

Gamified writing experiment studies when humans adopt AI suggestions vs. maintain creative autonomy

A preprint from arXiv introduces 'Nonslop,' a gamified writing experiment with 74 participants designed to study authentic human preferences in AI-assisted creative writing. The system deliberately inverts the helpful-assistant pattern by disincentivizing AI suggestion acceptance, simulating a dystopian framing to reveal genuine user behavior rather than default compliance. The study analyzes when users choose creative autonomy versus accepting AI assistance across different task types and response characteristics. Findings bear on questions of individual voice, authenticity, and the tension between efficiency and human expression in LLM-augmented writing.

Evaluation and Benchmarking · Nonslop

4 · arXiv · cs.CL · 14d ago

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking · From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models · CSEE · ENEM

6 · arXiv · cs.AI · 21d ago

LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts

A preprint from arXiv demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, addressing the resource-intensive nature of traditional reproducibility efforts.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Automated reproducibility assessments in the social and behavioral sciences using large language models

4 · arXiv · cs.CL · 33h ago

Survey chapter on LLM mechanisms, emergent capabilities, and cognition debates

A new arXiv preprint surveys current understanding of large language models, covering the Transformer architecture, emergent capabilities resembling human cognition (symbolic reasoning, theory of mind, deception), and explainability approaches from neuron activation analysis to circuit tracing. The chapter also engages the debate over whether LLMs genuinely understand or merely pattern-match, arguing against reductive anti-anthropomorphism while acknowledging human-LLM differences. It is framed as a book chapter synthesizing recent empirical findings and theoretical positions.

Evaluation and Benchmarking · AI Safety Research · Understanding Large Language Models