4arXiv cs.CL (Computation and Language)·19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMTs (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size, with the best local LLMs matching or exceeding local NMT systems and the frontier LLM, though falling short of top commercial NMTs. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking Open Weights Progress Inference Economics Enterprise Deployment Patterns Ollama GPT-5.2 DeepL MATEO OPUS-CAT NeuralDesktop Reeve Foundation Multilingual Corpus Baidu Translate Promt

Related guides (4)

Open Weights ProgressTopic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier

Read asBeginner In-depth

Enterprise Deployment PatternsTopic guide

Enterprise Deployment Patterns: From LLM Demo to Production Reality

Read asIn-depth

Inference EconomicsTopic guide

Inference Economics: The Cost Structure of Running AI Models in Production

Read asIn-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5arXiv · cs.CL·23d ago·source ↗

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.

Evaluation and Benchmarking Basque language LLM-as-a-Judge mJudge +2 more

4Hugging Face Blog·1mo ago·source ↗

Optimizing your LLM in production

A Hugging Face blog post covering practical techniques for optimizing large language models in production environments. The post likely addresses inference efficiency methods such as quantization, batching, caching, and hardware utilization strategies. It serves as a practitioner-oriented guide for deploying LLMs at scale.

Inference Economics Enterprise Deployment Patterns Hugging Face

5arXiv · cs.CL·10d ago·source ↗

NCRE-based benchmark reveals frontier LLMs top out at 68.8% on professional Office automation tasks

Researchers introduce an evaluation suite derived from China's National Computer Rank Examination (NCRE), comprising 200 practical tasks across Word, Excel, and PowerPoint scored via 7,118 machine-gradable criteria. Seven frontier LLMs are benchmarked: single-turn models peak at 36.6% Score Rate, while a full agentic system with execution feedback and iterative repair reaches 68.8%, still well below the 95.5% community-reference score. The results demonstrate that fine-grained, long-horizon Office document automation remains a significant unsolved challenge for current LLM and agent systems despite strong code-generation capabilities.

Evaluation and Benchmarking Agent and Tool Ecosystem National Computer Rank Examination Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

3arXiv · cs.CL·15d ago·source ↗

First Komi-Yazva–Russian parallel corpus and LLM translation evaluation protocol for endangered low-resource language

Researchers introduce the first Komi-Yazva–Russian parallel corpus of 457 aligned sentence pairs from 74 narrative texts, paired with a rigorous evaluation protocol for studying LLM translation under extreme data scarcity. The protocol includes story-level cross-validation, deterministic retrieval-based few-shot prompting, and both reference-based and judge-based metrics to ensure leakage-aware, reproducible evaluation. Results show LLMs produce non-trivial translations but performance varies strongly by model family; retrieval-based few-shot prompting consistently outperforms zero-shot, though gains plateau quickly. The work frames the corpus as both a dataset contribution and a reproducible testbed for endangered-language machine translation research.

Evaluation and Benchmarking A Komi-Yazva–Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation Komi-Yazva–Russian Parallel Corpus

5arXiv · cs.CL·22d ago·source ↗

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO)+3 more

4Hugging Face Blog·1mo ago·source ↗

FilBench: Benchmarking LLM Capabilities in Filipino Language

FilBench is a new benchmark introduced to evaluate large language models on their ability to understand and generate Filipino. The benchmark targets a historically underrepresented language in NLP evaluation suites, assessing both comprehension and generation tasks. This work addresses gaps in multilingual LLM evaluation coverage, particularly for Southeast Asian languages.

Evaluation and Benchmarking Multimodal Progress FilBench Hugging Face Filipino

4arXiv · cs.CL·17d ago·source ↗

AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026

Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system cascades Qwen3-ASR for incremental transcription with Gemma-4 E4B-it for translation, applying a novel AlignAtt policy adapted for decoder-only LLMs that lack encoder-decoder cross-attention. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture to recover a usable attention-based read/write policy. The system outperforms IWSLT 2026 baselines for European language pairs in both low- and high-latency regimes.

Evaluation and Benchmarking Multimodal Progress Gemma-4 E4B-it IWSLT 2026 AlignAtt +2 more

4arXiv · cs.CL·5d ago·source ↗

LoSoNA benchmark evaluates LLM adaptation to implicit local social norms in group chats

Researchers introduce LoSoNA, a benchmark for testing whether LLM-based agents can infer and adapt to unstated local conversational norms in multi-party chat scenarios. Each scenario presents a group-chat transcript where non-subject participants implicitly demonstrate a hidden norm, followed by an elicitor turn. Eight frontier and open-weight models are evaluated under four prompting conditions; naive prompting performs poorly for most models, while explicit norm-aware prompting yields uneven gains—Gemini 3.1 Pro reaches 84.2% and Claude Fable 5 reaches 81.6%. The work contributes to growing interest in evaluating LLM social and pragmatic capabilities beyond factual or reasoning tasks.

Evaluation and Benchmarking Agent and Tool Ecosystem Gemini 3.1 Pro Claude Fable 5 LoSoNA