4 · arXiv cs.CL (Computation and Language) · 1mo ago

Ancient Greek to Modern Greek Machine Translation: Novel Benchmark and Fine-Tuning Experiments

Researchers introduce the AG-MG Parallel Corpus, a dataset of 132,481 sentence pairs for Ancient Greek to Modern Greek machine translation, built with a pipeline that combines web scraping, VecAlign with LaBSE embeddings, and Gemini 2.5 Flash-based alignment correction. The paper benchmarks NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B) under three fine-tuning strategies. Full-parameter fine-tuning of Llama-Krikri-8B achieves the best BLEU score, 13.16, while QLoRA-adapted M2M100-1.2B shows the largest improvement (+10.3 BLEU). This is the first comprehensive MT benchmark for this low-resource language pair.

Evaluation and Benchmarking · Open Weights Progress · M2M100 · VecAlign · NLLB · Gemini-2.5-Flash-Lite · Llama-Krikri-8B · QLoRA · AG-MG Parallel Corpus · LaBSE

Related guides (2)

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

3 · arXiv · cs.CL · 15d ago

First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language

Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.
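Story-level cross-validation can be illustrated with a short sketch. The idea is that all sentence pairs from one narrative text land in the same fold, so a test sentence never shares a story with the training or few-shot pool. The tuple layout `(story_id, source, target)`, the function name `story_level_folds`, and the round-robin fold assignment are assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def story_level_folds(pairs, n_folds=5):
    """Yield (train, test) splits in which no story is divided across the two.

    `pairs` is a list of (story_id, source, target) tuples. This layout and
    the round-robin assignment of stories to folds are illustrative
    assumptions, not the paper's protocol.
    """
    by_story = defaultdict(list)
    for story_id, src, tgt in pairs:
        by_story[story_id].append((story_id, src, tgt))

    # Sorting the story ids keeps the split deterministic across runs.
    story_ids = sorted(by_story)
    for fold in range(n_folds):
        test_ids = set(story_ids[fold::n_folds])
        train = [p for sid in story_ids if sid not in test_ids for p in by_story[sid]]
        test = [p for sid in story_ids if sid in test_ids for p in by_story[sid]]
        yield train, test
```

Splitting by story rather than by sentence is what makes the evaluation leakage-aware: sentences from the same narrative tend to share vocabulary and names, so a sentence-level split would likely overstate translation quality.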

Evaluation and Benchmarking · A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation · Komi-Yazva–Russian Parallel Corpus

4 · arXiv · cs.CL · 19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking · Open Weights Progress · Ollama · GPT-5.2 · DeepL · +8 more

3 · arXiv · cs.CL · 11d ago

Synthetic data bootstrapping and LoRA fine-tuning for Q'eqchi' Mayan NMT without web scraping

Researchers introduce a data synthesis methodology for low-resource neural machine translation of Q'eqchi' Mayan, converting community-sourced dictionaries into a synthetic parallel corpus to avoid scraping target-language data. Using LoRA adapters on mT5-base, the approach achieves BLEU 42.02 on in-domain evaluation but only 0.59 against organic text, revealing a structural-semantic gap. An ablation with multi-task learning produced negative transfer, suggesting LoRA capacity limits conflict with auxiliary objectives. The study concludes synthetic bootstrapping is effective for structural priming but requires authentic data for semantic refinement via curriculum learning.

Evaluation and Benchmarking · Open Weights Progress · BLEU · LoRA · mT5 · +1 more

4 · arXiv · cs.CL · 29d ago

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

This paper investigates whether LLM-based machine translation can preserve moral semantic content well enough to enable cross-lingual moral values classification, using Polish as a test case with ~50k annotated social media posts. A four-method validation pipeline (LaBSE embedding similarity, CKA, LLM-as-judge, and classifier parity) shows mean cosine similarity of 0.86 and AUC gaps of only 0.01–0.02 across Moral Foundations categories. The results suggest machine translation is a practical path to extending moral values NLP research to under-resourced languages, with expected generalization to related Slavic languages.
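The embedding-similarity check can be sketched as a mean cosine similarity over paired sentence vectors. The sketch below assumes the source and translated sentences have already been embedded (for example with LaBSE) into plain lists of floats; the function names `cosine` and `mean_pair_similarity` are hypothetical, and the paper's actual implementation may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pair_similarity(src_vecs, tgt_vecs):
    """Average cosine similarity between each source vector and its translation.

    Assumes src_vecs[i] and tgt_vecs[i] embed the same post before and after
    translation (an illustrative setup, not the paper's code).
    """
    if not src_vecs or len(src_vecs) != len(tgt_vecs):
        raise ValueError("need two non-empty, equally long lists of vectors")
    sims = [cosine(u, v) for u, v in zip(src_vecs, tgt_vecs)]
    return sum(sims) / len(sims)
```

A value near 1.0 would indicate that translations stay close to their originals in embedding space, which is how a figure such as the reported 0.86 would be read.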

Evaluation and Benchmarking · Moral Foundations Theory · Centered Kernel Alignment · LLM-as-a-Judge · +2 more

5 · arXiv · cs.CL · 23d ago

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking · Basque language · LLM-as-a-Judge · mJudge · +2 more

5 · arXiv · cs.CL · 10d ago

Audit finds cultural translation failures and diversity collapse in LLM-adapted math word problems across 7 languages

Researchers audited how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into seven languages spanning South Asia and Italy, annotating 6,489 entity transformations. Models agreed on transformation type only 62.5% of the time and on specific substitutions in just 33.5% of cases, meaning model choice substantially shapes the cultural world students encounter. All 21 language-model combinations exhibited 'entropy collapse'—adaptations compressed rather than expanded cultural diversity—and models produced systematic regional misattributions (e.g., Bangladeshi currency for Indian Bengali students) and cross-cultural contamination (e.g., egg hunts framed as Eid activities). The study highlights that surface plausibility masks deeper corpus-level failures invisible in individual translations.
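One way to picture "entropy collapse" is to compare the Shannon entropy of the entities in the original problems with the entropy of the entities after adaptation. The summary does not state the paper's exact metric, so the sketch below is an assumption: it treats each entity mention as a categorical label, and the names `shannon_entropy` and `entropy_change` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution over `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) is the same quantity as -p * log2(p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_change(before, after):
    """Entropy of the adapted entities minus entropy of the originals.

    A negative value means the adaptation concentrated on fewer distinct
    entities, which is the pattern described as collapse (an illustrative
    reading, not the paper's definition).
    """
    return shannon_entropy(after) - shannon_entropy(before)
```

For instance, if four distinct foods in the English problems were all mapped to the same dish in the adapted versions, `entropy_change` would be negative even though each individual adaptation looks plausible.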

Evaluation and Benchmarking · AI Safety Research · Claude Opus 4.6 · Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions · Google · +4 more

7 · The Batch · 15d ago

Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction

Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o on a task of expanding plot summaries into prose paragraphs, finding that this caused models to regurgitate up to 91.9% of verbatim text from books in their pretraining data. The key finding is that alignment training suppresses but does not erase memorized text strings from model weights, and fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers offering fine-tuning APIs and for organizations deploying customized models, as anti-plagiarism guardrails cannot be assumed to survive downstream fine-tuning.

AI Safety Research · Regulatory Developments · Carnegie Mellon University · Xinyue Liu · DeepSeek V4 · +7 more

4 · Hugging Face Blog · 1mo ago

Fine-Tune MMS Adapter Models for Low-Resource ASR

This Hugging Face blog post provides a technical guide for fine-tuning Meta's Massively Multilingual Speech (MMS) adapter models for automatic speech recognition in low-resource languages. It covers the adapter-based fine-tuning approach that allows efficient adaptation of the MMS model to specific languages without full model retraining. The post targets practitioners working on speech recognition for underrepresented languages.

Open Weights Progress · Agent and Tool Ecosystem · MMS (Massively Multilingual Speech) · Meta AI · adapter fine-tuning · +1 more