3 · arXiv cs.CL (Computation and Language) · 5d ago

Tatoxa: State-of-the-art text detoxification system for the low-resource Tatar language

Researchers introduce Tatoxa, a text detoxification system for the Tatar language, along with a new fine-tuning and evaluation dataset for this low-resource setting. Comparative experiments show Tatoxa outperforms both open-source and proprietary LLMs on quality metrics. Cross-lingual transfer experiments find that even culturally close Russian data transfers poorly compared to native Tatar training data, highlighting the limits of cross-lingual approaches for low-resource languages.

AI Safety Research · Tatoxa · The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar



Related events (8)

5 · arXiv · cs.CL · 39h ago

ToxiREX: Multilingual contextual dataset for implicit toxicity detection with structured reasoning schema

Researchers introduce ToxiREX, a multilingual Reddit-based dataset for detecting implicit and context-dependent toxicity across six languages (English, Arabic, Turkish, Spanish, German, Dutch), anchored to real-world events like the 2023 Turkey earthquakes and the Russian invasion of Ukraine. The dataset includes 125K LLM-annotated training comments and ~3K human-annotated test comments, structured using a toxic reasoning schema that captures implicit toxicity and maps to existing taxonomies. Baseline results from prompted and fine-tuned language models show above-random but substantially suboptimal performance, indicating the task remains challenging. ToxiREX is claimed as the first dataset combining multilingual coverage, conversational context, and implicit toxicity with schema-based structured annotations.

Evaluation and Benchmarking · AI Safety Research · Reddit · ToxiREX: A Dataset on Toxic REasoning in ConteXt · ToxiREX

3 · arXiv · cs.CL · 25d ago

First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language

Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.

Evaluation and Benchmarking · A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation · Komi-Yazva–Russian Parallel Corpus

3 · arXiv · cs.CL · 1mo ago

Thaka Wins KSAA-2026 Arabic Speech Diacritization Task with Regularized Fine-Tuning of CATT-Whisper

The Thaka team describes their winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization, which requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts. Their approach fine-tunes CATT-Whisper, a multimodal model combining a CATT text encoder with a frozen Whisper speech encoder, under severe data constraints (2,327 training samples, no external data). Key techniques include R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, Focal Loss, and Monte Carlo Dropout inference averaging over 200 stochastic forward passes across four checkpoints. The system achieves 23.26% WER on the primary metric, placing first among all participants.

Multimodal Progress · Optuna · Focal Loss · CATT · +6 more

3 · arXiv · cs.CL · 21d ago

Synthetic data bootstrapping and LoRA fine-tuning for Q'eqchi' Mayan NMT without web scraping

Researchers introduce a data synthesis methodology for low-resource neural machine translation of Q'eqchi' Mayan, converting community-sourced dictionaries into a synthetic parallel corpus to avoid scraping target-language data. Using LoRA adapters on mT5-base, the approach achieves BLEU 42.02 on in-domain evaluation but only 0.59 against organic text, revealing a structural-semantic gap. An ablation with multi-task learning produced negative transfer, suggesting that the limited capacity of the LoRA adapters conflicts with auxiliary objectives. The study concludes that synthetic bootstrapping is effective for structural priming but requires authentic data for semantic refinement via curriculum learning.

Evaluation and Benchmarking · Open Weights Progress · BLEU · LoRA · mT5 · +1 more

4 · arXiv · cs.CL · 15d ago

Synthetic data generation method enables small LLMs to match large models on Text-To-Cypher tasks

A new arXiv paper presents an automatic synthetic data generation method for fine-tuning small LLMs on Text-To-Cypher (Text2Cypher) parsing, enabling natural language interfaces to property graph databases. Experiments across major Text-To-Cypher benchmarks show that small fine-tuned models can compete with much larger proprietary models. The approach is positioned as a solution for local deployment scenarios requiring data sovereignty without expensive annotation.

Evaluation and Benchmarking · Enterprise Deployment Patterns · Cypher · Achieving Precise Text-To-Cypher Via Grounded Knowledge Graph Data Generation

3 · arXiv · cs.CL · 6d ago

First Turkish phone scam detection dataset evaluated across seven LLMs in multi-modal settings

Researchers introduce the first public multi-modal dataset of 100 aligned audio-transcript pairs of Turkish scam and benign phone calls, evaluating seven LLMs (Gemini 2.5 Flash/Flash-Lite/Pro, GPT-4o, Qwen Max/Plus/Turbo) under three input conditions. Transcript-based inputs consistently outperform direct audio processing, while human-corrected and uncorrected transcripts perform comparably. The work addresses a gap in low-resource language safety research and highlights the need for linguistically inclusive fraud detection systems.

AI Safety Research · Multimodal Progress · Google · GPT-4o · Gemini-2.5-Flash-Lite · +3 more

4 · arXiv · cs.CL · 18d ago

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking · Multimodal Progress · Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data · SpeechMatrix · CVSS-C · +1 more

4 · arXiv · cs.CL · 7d ago

Context-aware distillation and ablation study for Text2DSL Polkit rule generation

Researchers extend a Text2DSL system for generating Polkit domain-specific language rules from natural language, replacing prompt-only synthetic data generation with context-aware distillation that uses DeepSeek-V4-Flash as a teacher model operating under structured context (BNF grammar, API spec, closed vocabulary). The approach scales a verified corpus from 4,204 to 10,073 NL-to-Polkit-rule pairs at near-perfect validity rates. A factorial ablation across eight context conditions on GigaChat-10B-A1.8B finds that structured context is load-bearing rather than cosmetic; a Shapley decomposition attributes the largest semantic-quality gains to the vocabulary component.

Evaluation and Benchmarking · Agent and Tool Ecosystem · DeepSeek-V4-Flash · PolkitBench · Context-Aware Distillation and Ablation for Text2DSL · +1 more