3 · arXiv · cs.CL (Computation and Language) · 6d ago

IMPACTeen: Annotated dataset for social influence detection in adolescent communication contexts

IMPACTeen is a new Polish/English bilingual dataset of 1,021 social influence scenarios targeting adolescent communication contexts, with 5,100 annotation records from five distinct annotator perspectives (teenagers, parents, psychologists, communication experts, teachers). The dataset covers influence techniques, intentions, consequences, and resistance, and was constructed via constrained LLM generation followed by human editing. It is intended to support research on social influence detection, annotator disagreement modeling, cross-lingual NLP, and LLM training and evaluation.

Evaluation and Benchmarking · IMPACTeen

Related guides (1)

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3 · arXiv · cs.CL · 11d ago

Annotated dataset for enthymeme detection in political tweets with disagreement-aware training

Researchers present a dataset of 1,482 politically controversial tweets annotated by five annotators for enthymemes — arguments with unstated premises or conclusions — designed to study label variation rather than eliminate it. Annotation guidelines are grounded in Walton's argumentation schemes, and the paper includes a complexity analysis of cognitive load in the task. Preliminary experiments show that models trained on annotator disagreement outperform those trained on hard majority-vote labels, suggesting value in preserving annotation disagreement for subjective NLP tasks.
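The contrast between disagreement-aware and majority-vote training can be illustrated with a minimal sketch. It is not the paper's method: the class names, the five example votes, and the helper functions `soft_label`, `hard_label`, and `cross_entropy` are all hypothetical, chosen only to show how a soft target keeps the 3-to-2 split that a hard label discards.

```python
from collections import Counter
import math

# Hypothetical label set; the paper's actual label scheme may differ.
CLASSES = ["enthymeme", "not_enthymeme"]

def soft_label(votes, classes):
    """Turn raw annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]

def hard_label(votes, classes):
    """Collapse votes into a one-hot majority label (ties go to the earlier class)."""
    counts = Counter(votes)
    best = max(classes, key=lambda c: counts[c])
    return [1.0 if c == best else 0.0 for c in classes]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Five illustrative annotator votes for a single tweet (made-up data).
votes = ["enthymeme", "enthymeme", "enthymeme", "not_enthymeme", "not_enthymeme"]

soft = soft_label(votes, CLASSES)   # keeps the 3-to-2 split
hard = hard_label(votes, CLASSES)   # keeps only the winner

calibrated = [0.6, 0.4]      # a prediction that mirrors the annotators
overconfident = [0.99, 0.01]  # a prediction that ignores the disagreement

print("soft target:", soft)
print("hard target:", hard)
print("soft loss, calibrated vs overconfident:",
      cross_entropy(soft, calibrated), cross_entropy(soft, overconfident))
print("hard loss, calibrated vs overconfident:",
      cross_entropy(hard, calibrated), cross_entropy(hard, overconfident))
```

Under the soft target the calibrated prediction has the lower loss, while under the hard target the overconfident prediction does, which is one intuition for why preserving disagreement can matter on subjective tasks.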

Evaluation and Benchmarking · A Resource for Enthymeme Detection in Controversial Political Discourse · Walton's argumentation schemes

4 · arXiv · cs.CL · 3d ago

CATCH-ME dataset: multilingual multi-turn counterspeech against hate speech and misinformation for RAG systems

Researchers introduce CATCH-ME, a large-scale expert-curated multilingual dataset of multi-turn dialogues addressing the intersection of hate speech and misinformation across five languages and seven marginalized groups. The dataset is anchored in verified external knowledge (fact-checking articles and NGO reports) with document- and chunk-level span annotations, making it directly usable for RAG-based counterspeech systems. It addresses a gap in existing resources, which are limited to single-turn English dialogues, and is intended to improve the factual grounding and persuasiveness of LLM-generated counterspeech.

Evaluation and Benchmarking · AI Safety Research · CATCH-ME

4 · arXiv · cs.CL · 7d ago

Persuasion Index: Theory-grounded taxonomy and open-source tool for analyzing rhetorical persuasion

Researchers introduce Persuasion Index (PI), a 15-dimension taxonomy of persuasive rhetorical cues grounded in psychology and communication theory, implemented via 55 sub-features using lexicons and rule-based detectors. PI is evaluated on four public datasets across domains and shown to provide interpretable, computationally lightweight predictive signal for persuasion-related outcomes. The framework is released as an open-source package and web interface, with stated applications including AI safety and detection of information manipulation.

Evaluation and Benchmarking · AI Safety Research · Persuasion Index · Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

4 · arXiv · cs.CL · 26d ago

Interaction SSD: Modeling Annotator Identity Effects on Hate Speech Semantic Gradients

This paper introduces Interaction SSD, an extension of Supervised Semantic Differential that tests how semantic meaning varies across moderating variables such as annotator group identity. Applied to the UC Berkeley Measuring Hate Speech corpus, the method detects that annotator racial identity significantly moderates hate-speech judgments, with a shared gradient distinguishing dehumanizing hostility from counter-speech and an interaction gradient revealing group-linked differences in predictive semantic cues. The approach makes moderated meaning-outcome relationships statistically testable and interpretable through standard SSD tooling.

Evaluation and Benchmarking · AI Safety Research · Supervised Semantic Differential · Interaction SSD · UC Berkeley Measuring Hate Speech Corpus

5 · arXiv · cs.CL · 23d ago

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact is a refreshable benchmark for misinformation detection containing 15,992 standalone claims across five languages and two domains, designed to address limitations of static benchmarks. The authors evaluate ten LLMs under varying inference-time conditions including chain-of-thought reasoning and web-search augmentation, finding that web access yields the largest performance gains. A key finding is that web-enabled LLMs' source-selection policies are systematically misaligned with the sources that human Community Notes raters converge on, a gap the authors say can be addressed through retrieval expansion or pruning. The authors also propose using Community Notes as a training signal for claim-conditioned source suggesters.

Evaluation and Benchmarking · Agent and Tool Ecosystem · large language models · Community Notes · CommunityFact

7 · arXiv · cs.LG · 1mo ago

AI-Mediated Communication Can Steer Collective Opinion via LLM Editing Biases

This paper demonstrates empirically that LLMs from multiple model families introduce directional biases when editing human-written texts on contested topics (e.g., nudging toward gun control, against atheism). The authors develop a mathematical opinion-dynamics model showing these biases are amplified through social networks, shifting collective opinion at scale. An audit of X's 'Explain this post' feature finds evidence of pro-life bias in Grok's outputs on abortion content, traced to specific design choices. The paper concludes with implications for EU legislative efforts on AI-mediated communication.

Evaluation and Benchmarking · AI Safety Research · Grok · X (Twitter) · EU AI Act

4 · arXiv · cs.CL · 20d ago

LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives Beyond Rating Scales

This study analyzes de-identified Turkish teacher evaluation forms from clinical ADHD assessments, comparing predictive signals from structured rating scales (CTRS-R:S) and open-ended teacher narratives. The authors find that structured and narrative information encode complementary signals, with minimal overlap between cases missed by each modality. An LLM-assisted theme discovery pipeline reveals distinct attention, behavioral, and family-related patterns in narratives that structured scales miss, demonstrating NLP's potential to augment traditional ADHD screening.

Evaluation and Benchmarking · Enterprise Deployment Patterns · LLM-assisted theme discovery pipeline · Natural Language Processing · Conners' Teacher Rating Scale-Revised Short Form

5 · arXiv · cs.CL · 16d ago

ALMANAC dataset provides action-level mental model annotations for studying human-agent collaboration

Researchers introduce ALMANAC, a dataset of 2,987 collaboration actions drawn from the Map Task dyadic routing paradigm, each annotated with theory-informed mental model labels covering self-reasoning, perceived partner intent, and perceived team goal. The dataset targets a gap in LLM agent training data: current agents are optimized for task completion but lack process-level collaborative competence grounded in mental model alignment. Six LLMs are benchmarked on predicting human next-turn behavior and mental model states. The work provides a resource for evaluating and potentially training agents toward more human-like collaborative reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Map Task · ALMANAC