4 · arXiv cs.CL (Computation and Language) · 25d ago

WhoSaidIt: Human-LLM Collaborative Annotation for Multilingual Speaker-Attribute Classification

This paper proposes a human-LLM collaborative re-annotation framework for stabilizing noisy multilingual speaker-attribute labels under resource constraints. LLMs surface recurring annotation rationales through iterative expert interaction, combined with disagreement-focused sampling for targeted re-annotation. The resulting WhoSaidIt dataset covers nine speaker-attribute labels across multiple languages. Benchmarking of recent LLMs reveals substantial cross-lingual annotation divergence and highlights both capabilities and limitations of LLMs in this classification task.

Evaluation and Benchmarking Agent and Tool Ecosystem human-LLM collaborative annotation speaker-attribute classification WhoSaidIt disagreement-focused sampling

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

5 · arXiv cs.CL · 15d ago

LLMs fail to consistently simulate demographic perspective-taking in hate speech annotation

A new arXiv paper evaluates whether persona-conditioned LLMs can replicate how different demographic groups perceive hate speech, testing three dimensions: inter-group disagreement, in-group sensitivity, and vicarious prediction. No model consistently captures all three dimensions, and performance is highly model-dependent rather than emerging reliably from identity prompts alone. Vicarious prompting with Llama 3.1 provides the closest approximation to human disagreement patterns across demographic axes. The findings have implications for using LLMs as proxies for diverse human annotators in content moderation tasks.

Evaluation and Benchmarking AI Safety Research From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation Meta Llama-3.1-8B

5 · arXiv cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

4 · arXiv cs.CL · 10d ago

Calibrated LLM annotation and encoder transfer for measuring human values in social media text

A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

Evaluation and Benchmarking Alignment and RLHF Schwartz's Theory of Basic Human Values Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

5 · Hugging Face Blog · 1mo ago

Consilium: When Multiple LLMs Collaborate

Hugging Face introduces Consilium, a framework for multi-LLM collaboration where multiple language models work together on tasks rather than relying on a single model. The approach explores how ensembling or deliberation among diverse LLMs can improve output quality and robustness. This fits into the broader agent-tool ecosystem trend of orchestrating multiple AI models for better results.

Frontier Model Releases Agent and Tool Ecosystem Hugging Face Consilium

5 · arXiv cs.AI · 26d ago

Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs

This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness via knowledge units grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.

Alignment and RLHF Multimodal Progress Multimodal Large Language Models Latent Adversarial Robustification (LAR) knowledge editing +2 more

5 · arXiv cs.CL · 1mo ago

Text Analytics Evaluation Framework: Benchmarking LLMs on Social Media NLP Tasks

Researchers introduce a 470-question evaluation framework to assess LLM performance on aggregated social media text, applied to Twitter datasets across sentiment analysis, hate speech detection, and emotion recognition. Results show performance degrades substantially as input scale exceeds 500 instances, particularly for open-weights models on numerical tasks. Multi-label and target-dependent scenarios also show notable performance drops, and task complexity progressively erodes accuracy from basic semantic identification to comparison and counting operations. The findings point to architectural bottlenecks in current LLMs for rigorous quantitative analysis over large text collections.

Long Context Evolution Evaluation and Benchmarking Emotion Recognition Text Analytics Evaluation Framework X (Twitter) +3 more

4 · arXiv cs.CL · 8d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking Multimodal Progress Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data SpeechMatrix CVSS-C +1 more

4 · arXiv cs.AI · 25d ago

Structure-Aware Code Change Labeling with LLMs via Two-Stage Taxonomy Pipeline

This paper presents a systematic study of using LLMs for taxonomy-based labeling of code diff hunks, going beyond summarization to assign structured labels capturing semantic attributes like renames, moves, and logic modifications. The authors introduce a two-stage pipeline combining diff-hunk labeling with structural refinement, using few-shot prompting to remain language-agnostic. Evaluated across four LLMs on a curated benchmark of natural and synthetic patches, the best configuration achieves 84% recall and 81% precision. Results suggest LLM-based structured labeling can complement static analysis tools in code review workflows.

Enterprise Deployment Patterns Agent and Tool Ecosystem few-shot prompting code review automation diff hunk taxonomy benchmark +1 more