ToxiREX

dataset · active · provisional · toxirex-d07e7def · 1 event · first seen 37h ago

Aliases: ToxiREX

Recent events (1)

5 · arXiv · cs.CL · 37h ago

ToxiREX: Multilingual contextual dataset for implicit toxicity detection with structured reasoning schema

Researchers introduce ToxiREX, a multilingual Reddit-based dataset for detecting implicit and context-dependent toxicity in six languages (English, Arabic, Turkish, Spanish, German, and Dutch). The comments are anchored to real-world events such as the 2023 Turkey earthquakes and the Russian invasion of Ukraine. The dataset contains 125K LLM-annotated training comments and about 3K human-annotated test comments, all annotated with a toxic reasoning schema that captures implicit toxicity and maps to existing taxonomies. Baseline results from prompted and fine-tuned language models are better than random but remain well below strong performance, which indicates that the task is still challenging. The authors present ToxiREX as the first dataset to combine multilingual coverage, conversational context, and implicit toxicity with schema-based structured annotations.