paper

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar



Co-occurring entities

Tatoxa

More like this (12)

TOBA tokenizer
Komi-Yazva–Russian Parallel Corpus
A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation
Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
E-TTS
Urdu Katib Handwritten Dataset
Text Aphasia Battery (TAB)
ToxiREX: A Dataset on Toxic REasoning in ConteXt
Context-Aware Distillation and Ablation for Text2DSL
MOSS-TTS
The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery
Reasoning over Grammar: Can Synthetic Linguistic Reasoning Traces Enhance Low-Resource Machine Translation?

Recent events (1)

arXiv · cs.CL · 5d ago

Tatoxa: State-of-the-art text detoxification system for the low-resource Tatar language

Researchers introduce Tatoxa, a text detoxification system for the Tatar language, along with a new fine-tuning and evaluation dataset for this low-resource setting. Comparative experiments show Tatoxa outperforms both open-source and proprietary LLMs on quality metrics. Cross-lingual transfer experiments find that even culturally close Russian data transfers poorly compared to native Tatar training data, highlighting the limits of cross-lingual approaches for low-resource languages.
