5 · arXiv cs.CL (Computation and Language) · 38h ago

ToxiREX: Multilingual contextual dataset for implicit toxicity detection with structured reasoning schema

Researchers introduce ToxiREX, a multilingual Reddit-based dataset for detecting implicit and context-dependent toxicity across six languages (English, Arabic, Turkish, Spanish, German, Dutch), anchored to real-world events such as the 2023 Turkey earthquakes and the Russian invasion of Ukraine. The dataset includes 125K LLM-annotated training comments and ~3K human-annotated test comments, structured using a toxic reasoning schema that captures implicit toxicity and maps to existing taxonomies. Baseline results from prompted and fine-tuned language models are above chance but remain far from optimal, indicating that the task is still challenging. The authors present ToxiREX as the first dataset to combine multilingual coverage, conversational context, and implicit toxicity with schema-based structured annotations.

Evaluation and Benchmarking · AI Safety Research · Reddit · ToxiREX: A Dataset on Toxic REasoning in ConteXt · ToxiREX

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3 · arXiv · cs.CL · 5d ago

Tatoxa: State-of-the-art text detoxification system for the low-resource Tatar language

Researchers introduce Tatoxa, a text detoxification system for the Tatar language, along with a new fine-tuning and evaluation dataset for this low-resource setting. Comparative experiments show Tatoxa outperforms both open-source and proprietary LLMs on quality metrics. Cross-lingual transfer experiments find that even culturally close Russian data transfers poorly compared to native Tatar training data, highlighting the limits of cross-lingual approaches for low-resource languages.

AI Safety Research · Tatoxa · The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

6 · arXiv · cs.CL · 19d ago

OpenMedReason: Large-scale multimodal medical reasoning corpus with 450K instances for clinical VLM training

Researchers introduce OpenMedReason, a 450K-instance open multimodal medical reasoning corpus with reasoning traces derived from human-authored biomedical literature rather than synthetic chains of thought. The dataset covers diverse medical imaging modalities and is paired with OpenMedReason-Bench, a held-out benchmark evaluating LVLMs on perception, medical knowledge, and rationale axes. Training with OpenMedReason yields a 20% average VQA accuracy improvement over base models and achieves performance within 4.2% of leading comparable-scale medical VLMs. Both the dataset and code are publicly released.

Evaluation and Benchmarking · Alignment and RLHF · OpenMedReason · OpenMedReason-Bench

3 · arXiv · cs.CL · 14d ago

IMPACTeen: Annotated dataset for social influence detection in adolescent communication contexts

IMPACTeen is a new Polish/English bilingual dataset of 1,021 social influence scenarios targeting adolescent communication contexts, with 5,100 annotation records from five distinct annotator perspectives (teenagers, parents, psychologists, communication experts, teachers). The dataset covers influence techniques, intentions, consequences, and resistance, and was constructed via constrained LLM generation followed by human editing. It is intended to support research on social influence detection, annotator disagreement modeling, cross-lingual NLP, and LLM training and evaluation.

Evaluation and Benchmarking · IMPACTeen

5 · arXiv · cs.CL · 12d ago

RECOM benchmark reveals validity-discrimination tradeoff in automatic metrics for open-ended QA

Researchers introduce RECOM, a contamination-free evaluation dataset of 15,000 r/AskReddit questions paired with authentic community replies postdating all evaluated models' training cutoffs. Testing five open-source 7–10B LLMs, the paper finds that no standard automatic metric (cosine similarity, BERTScore, LLM judges) simultaneously achieves both validity (distinguishing real from random answers) and discriminative power (ranking models against each other). Cosine similarity is valid but cannot rank models; BERTScore's apparent ranking collapses when response length is controlled. The authors argue this tradeoff is a structural property of metric representation design and recommend reporting metrics on both axes with an explicit random-baseline floor.

Evaluation and Benchmarking · BERTScore · RECOM · r/AskReddit
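The recommendation in the RECOM summary, reporting a metric alongside an explicit random-baseline floor, can be illustrated with a small sketch. This is not the paper's code: the bag-of-words `cosine` function stands in for whatever embedding-based similarity is actually used, and `score_with_floor`, the toy answers, and the toy references are all hypothetical names and data chosen for illustration.

```python
import math
import random
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for an embedding metric)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_with_floor(answers, references, seed=0):
    """Return (matched_mean, random_floor).

    matched_mean: mean score with each answer paired to its own reference.
    random_floor: mean score after rotating the references so that no
    answer is paired with the reply to its own question.
    """
    n = len(answers)
    if n < 2 or n != len(references):
        raise ValueError("need at least two aligned answer/reference pairs")
    matched = sum(cosine(a, r) for a, r in zip(answers, references)) / n
    shift = random.Random(seed).randrange(1, n)  # non-zero rotation
    floor = sum(
        cosine(answers[i], references[(i + shift) % n]) for i in range(n)
    ) / n
    return matched, floor


# Hypothetical toy data: model answers and the real replies they are scored against.
answers = [
    "turn the router off and on again",
    "add salt to the pasta water",
    "stretch before you run",
]
references = [
    "try turning the router off and on",
    "salt the pasta water generously",
    "always stretch before a run",
]

matched, floor = score_with_floor(answers, references)
print(f"matched={matched:.3f} random_floor={floor:.3f}")
```

A metric passes the validity check in this sketch when the matched mean sits clearly above the floor. Whether the same metric can also rank models against each other is the separate, discriminative axis the summary describes, and it is not tested here.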

5 · arXiv · cs.CL · 5d ago

TRACE: Lightweight RAG corpus poisoning detection via token influence attribution

Researchers introduce TRACE, a detection framework for corpus poisoning attacks on Retrieval-Augmented Generation (RAG) systems that works by tracing answer-related tokens through token influence attribution rather than relying on auxiliary classifiers or LLM-based verification. The method identifies recurrent high-influence keywords across retrieved documents and performs secondary verification to confirm their effect on model predictions. Evaluated on three QA benchmarks and six LLMs, TRACE achieves strong detection performance while also exposing attacker-specified target answers, with lower computational overhead than prior approaches.

AI Safety Research · Enterprise Deployment Patterns · TRACE · Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution
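The idea of flagging keywords that recur with high influence across retrieved documents, as described in the TRACE summary, can be sketched roughly as follows. The summary does not say how TRACE computes token influence or which thresholds it uses, so this sketch takes per-document influence scores as a given input, and `recurrent_keywords`, `top_k`, `min_docs`, and the toy scores are assumptions made for illustration. The secondary verification step is not shown.

```python
from collections import Counter


def recurrent_keywords(doc_influence, top_k=2, min_docs=2):
    """Return tokens that rank in the top_k by influence in at least min_docs documents.

    doc_influence: one dict per retrieved document, mapping token -> influence
    score (assumed to come from some attribution method, not computed here).
    """
    counts = Counter()
    for scores in doc_influence:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        counts.update(top)
    return sorted(tok for tok, c in counts.items() if c >= min_docs)


# Hypothetical influence scores for three retrieved documents.
docs = [
    {"paris": 0.9, "capital": 0.4, "the": 0.1},
    {"paris": 0.8, "france": 0.5, "is": 0.05},
    {"berlin": 0.7, "capital": 0.6, "of": 0.1},
]

flagged = recurrent_keywords(docs)
print(flagged)
```

In a pipeline of the kind the summary describes, tokens flagged this way would be candidates for a follow-up check on whether they actually change the model's prediction.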

4 · arXiv · cs.CL · 11d ago

CATCH-ME dataset: multilingual multi-turn counterspeech against hate speech and misinformation for RAG systems

Researchers introduce CATCH-ME, a large-scale expert-curated multilingual dataset of multi-turn dialogues addressing the intersection of hate speech and misinformation across five languages and seven marginalized groups. The dataset is anchored in verified external knowledge (fact-checking articles and NGO reports) with document- and chunk-level span annotations, making it directly usable for RAG-based counterspeech systems. It addresses a gap in existing resources, which are limited to single-turn English dialogues, and is intended to improve the factual grounding and persuasiveness of LLM-generated counterspeech.

Evaluation and Benchmarking · AI Safety Research · CATCH-ME

5 · arXiv · cs.CL · 11d ago

RefRad2D dataset and RadGrounder model enable spatially grounded radiology VLMs without manual annotations

Researchers introduce RefRad2D, a 1.2M-pair bilingual (German/English) CT and MR image-text dataset generated automatically via LLM curation and automated segmentation, requiring no manual spatial annotations. The accompanying RadGrounder model jointly performs report generation, VQA, and spatial grounding via bounding-box or segmentation outputs. On external benchmarks Slake and VQA-RAD, RadGrounder matches specialized medical VLMs while adding grounding supervision without degrading language quality. The work demonstrates that large-scale automatically curated clinical data can transfer to downstream medical VQA tasks.

Evaluation and Benchmarking · Multimodal Progress · RefRad2D · Slake · RadGrounder

5 · arXiv · cs.CL · 5d ago

Unified defense framework detects and remediates data poisoning in text summarization fine-tuning

A new arXiv preprint introduces a post-hoc defense framework for detecting and recovering from training-time data poisoning in LLMs fine-tuned for abstractive summarization. The framework uses influence-function analysis in white-box settings and behavioral perturbation auditing in black-box settings, achieving 85-92% detection precision across nine architectures and six benchmarks. Gradient-ascent unlearning restores up to 96% of original model behavior with less than 0.6% ROUGE degradation. The authors also introduce novel attacks targeting factual distortion and representational bias that evade conventional evaluation metrics.

Evaluation and Benchmarking · AI Safety Research · ROUGE-L · Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning