4 · arXiv cs.CL (Computation and Language) · 3d ago

Mechanism-oriented taxonomy of indirect linguistic encoding improves LLM-based coded language detection

Researchers propose a comprehensive taxonomy of indirect linguistic expressions (ILE), covering algospeak, euphemisms, and adversarial obfuscation, that is organized by underlying encoding mechanism rather than by communicative intent. The taxonomy is evaluated by injecting it into LLM prompts and is benchmarked against four existing taxonomies and a no-taxonomy baseline on 2,000 manually annotated TikTok and Bluesky posts. Across three LLMs, the proposed taxonomy improves accuracy by 4.7% and F1 by 5.4% over the best competing approach, suggesting that structured linguistic scaffolding meaningfully aids content moderation tasks.
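A minimal sketch of what taxonomy injection and scoring could look like, under stated assumptions: the three mechanism categories, the prompt wording, and the function names `build_prompt` and `accuracy_and_f1` are hypothetical placeholders for illustration, not the paper's taxonomy, prompts, or evaluation code.

```python
from typing import Dict, List, Tuple

# Hypothetical mechanism categories, for illustration only.
# These are NOT the categories defined in the paper.
TAXONOMY: Dict[str, str] = {
    "character substitution": "letters swapped for digits or symbols",
    "phonetic respelling": "a word respelled to sound like the target word",
    "semantic substitution": "a milder or unrelated word standing in for the target",
}


def build_prompt(post: str, taxonomy: Dict[str, str]) -> str:
    """Prepend the taxonomy to a yes/no detection question about one post."""
    lines = [f"- {name}: {description}" for name, description in taxonomy.items()]
    return (
        "Encoding mechanisms to consider:\n"
        + "\n".join(lines)
        + "\n\nDoes the post below use indirect linguistic encoding? Answer yes or no.\n"
        + f"Post: {post}"
    )


def accuracy_and_f1(gold: List[int], pred: List[int]) -> Tuple[float, float]:
    """Binary accuracy and F1, where label 1 means 'coded language present'."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    accuracy = correct / len(gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


if __name__ == "__main__":
    print(build_prompt("example post text", TAXONOMY))
    print(accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 1]))
```

A no-taxonomy baseline in this sketch would call `build_prompt` with an empty dictionary, so the only difference between conditions is the injected category list.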

Evaluation and Benchmarking · AI Safety Research · Bluesky · TikTok · Beyond Surface Forms: A Comprehensive, Mechanism-Oriented Taxonomy of Indirect Linguistic Encoding for LLM-Based Coded Language Detection

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.AI · 1mo ago

Structure-Aware Code Change Labeling with LLMs via Two-Stage Taxonomy Pipeline

This paper presents a systematic study of using LLMs for taxonomy-based labeling of code diff hunks, going beyond summarization to assign structured labels capturing semantic attributes like renames, moves, and logic modifications. The authors introduce a two-stage pipeline combining diff-hunk labeling with structural refinement, using few-shot prompting to remain language-agnostic. Evaluated across four LLMs on a curated benchmark of natural and synthetic patches, the best configuration achieves 84% recall and 81% precision. Results suggest LLM-based structured labeling can complement static analysis tools in code review workflows.
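For readers comparing this result with F1-based results elsewhere in the feed, the sketch below computes the harmonic mean implied by the reported precision and recall. The F1 value is derived here from those two figures; it is not a number stated in the summary, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for the best configuration: 81% precision, 84% recall.
best_f1 = f1_score(precision=0.81, recall=0.84)
print(f"{best_f1:.3f}")  # prints 0.825
```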

Enterprise Deployment Patterns · Agent and Tool Ecosystem · few-shot prompting · code review automation · diff hunk taxonomy benchmark

3 · arXiv · cs.CL · 13d ago

Revisiting LLM systematicity in negation understanding via in-context learning

A new arXiv preprint analyzes how well large language models handle negation from two angles: behavioral systematicity (whether models correctly recognize negation expressions and scope) and representational systematicity (whether function vectors can be reliably constructed from in-context examples). Results show LLMs partially succeed at negation cue recognition via in-context learning but struggle with scope recognition, with performance varying by output format. Function vectors can be composed for cue extraction but are harder to extract for scope recognition tasks.

Evaluation and Benchmarking · Revisiting the Systematicity in Negation in the Era of In-Context Learning

5 · arXiv · cs.CL · 13d ago

Systematic study of extrinsic and intrinsic properties for effective code interpreter reasoning in LLMs

Researchers investigate what behavioral properties make LLMs effective at reasoning with a Code Interpreter (CI), identifying two axes: extrinsic 'crucial tokens' and intrinsic 'cognitive behaviors' such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently exhibit higher prevalence of these properties. The paper shows that appending code-specific crucial tokens at inference time improves performance on mathematical, ordering, and optimization tasks, while augmenting training with cognitive behaviors improves SFT and RL performance in two of three evaluated models. The work also finds these behaviors reduce overthinking in incorrect responses and improve token efficiency.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter

4 · arXiv · cs.CL · 10d ago

Mechanistic analysis of how LLMs encode essay quality in internal representations

Researchers systematically probe the hidden representations of eight LLMs across three essay datasets (ASAP++, CSEE, ENEM) to understand how automated essay scoring (AES) works internally. Using linear probing, dimensionality reduction, and neuron-level analysis, they find essay quality is encoded in a linearly accessible form that emerges progressively across layers and partially transfers across prompts. Individual 'essay scoring neurons' are identified whose activations correlate with scores and respond to targeted interventions, with longer essays relying more on deeper layers. The work contributes to mechanistic interpretability of LLM-based scoring systems.

Evaluation and Benchmarking · From Texts to Scores: Tracing the Emergence of Essay Quality Representations in Large Language Models · CSEE · ENEM

3 · arXiv · cs.CL · 19d ago

ArabiGEE: Hierarchical taxonomy for Arabic grammatical error explanation with LLM evaluation support

Researchers introduce ArabiGEE, the first structured taxonomy for Arabic grammatical error explanation (GEE), organizing 27 error types, 140 correction types, and 324 explanations across orthographic, morphological, syntactic, and lexical dimensions. The taxonomy is applied to annotate existing Arabic grammatical error correction corpora and is used to benchmark LLMs on Arabic GEE tasks. The work addresses a gap in Arabic NLP tooling by moving beyond free-form explanation generation toward structured, evaluable outputs.

Evaluation and Benchmarking · ArabiGEE

5 · arXiv · cs.CL · 4d ago

LLM-based classification exposes keyword lexicon artifacts in computational social science stance measurement

A new arXiv preprint demonstrates that statistically significant findings in computational social science can be entirely measurement artifacts of keyword-based scoring instruments. Analyzing 85 interviews across four public intellectuals, the authors show that keyword-based certainty scores produce strong correlations (r=0.72–0.93) that collapse or invert when replaced with LLM zero-shot semantic classification on 32,625 sentences. The paper identifies three structural failure modes in keyword lexicons—syntactic blindness, polysemy blindness, and categorical absence—and argues that keyword counts measure lexical co-occurrence tendencies rather than rhetorical stance. The work has implications for the validity of prior NLP-based social science research and for the comparative utility of LLMs as measurement instruments.
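Two of the failure modes named above, syntactic blindness and polysemy blindness, can be illustrated with a toy keyword counter. This is a sketch under assumptions: the lexicon `CERTAINTY_WORDS`, the function `keyword_certainty`, and the example sentences are hypothetical and are not taken from the paper's instrument or data.

```python
import re

# Hypothetical certainty lexicon, for illustration only.
CERTAINTY_WORDS = {"certain", "certainly", "definitely", "clearly"}


def keyword_certainty(sentence: str) -> int:
    """Count lexicon hits in a sentence, ignoring syntax and word sense."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(1 for token in tokens if token in CERTAINTY_WORDS)


confident = "This policy will definitely fail."
negated = "I am not certain this policy will fail."    # negation is invisible to the counter
other_sense = "A certain senator raised the issue."    # 'certain' here means 'particular'

for sentence in (confident, negated, other_sense):
    print(keyword_certainty(sentence), sentence)
```

All three sentences receive the same score of 1, although only the first expresses certainty, which is the kind of gap a sentence-level semantic classifier is meant to close.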

Evaluation and Benchmarking · When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

4 · arXiv · cs.CL · 19d ago

Calibrated LLM annotation and encoder transfer for measuring human values in social media text

A new arXiv preprint investigates how different LLMs, prompts, and instruction languages operationalize Schwartz's theory of basic human values when annotating non-English social media posts. The authors evaluate annotation quality beyond standard F1 metrics, examining structural alignment, error structure, and confidence-ambiguity relations, finding that iterative prompt calibration reduces misattributions. They also demonstrate that LLM annotations can be transferred to a smaller encoder model via soft-label training, preserving theory-grounded value interpretations and uncertainty information.

Evaluation and Benchmarking · Alignment and RLHF · Schwartz's Theory of Basic Human Values · Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

4 · arXiv · cs.CL · 5d ago

Cross-lingual evaluation framework reveals LLMs redistribute cultural narrative structure while preserving semantic meaning

A new arXiv preprint introduces a multilingual evaluation framework using 414 proverbs across 15 languages to assess whether LLMs preserve culturally grounded meaning when generating narratives. Using four LLMs to produce 13k narratives, the study finds that cross-lingual prompting preserves proverb-level semantic meaning but systematically redistributes agency, social positioning, and narrative structure. Strong inter-model convergence across architectures suggests multilingual LLMs rely on shared semantic abstractions. The authors argue that semantic similarity metrics alone overestimate cultural preservation in multilingual evaluations.

Evaluation and Benchmarking · Same Lesson, Different Story: Cross-Lingual Reconstruction of Cultural Narratives in Large Language Models