Study: LLM-Derived Error Highlights and APE Suggestions in MT Post-Editing
Researchers conducted a controlled study with professional English-to-Dutch (En-Nl) translators, comparing post-editing (PE) workflows augmented with LLM-derived error highlights and automatic post-editing (APE) correction suggestions against standard PE and PE with QE-derived highlights. No condition produced measurable productivity or quality gains over standard PE. However, translators preferred APE-derived highlights over QE-derived highlights, and correction suggestions improved subjective user experience.
Related events (8)
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
This paper systematically investigates strategies for extending LLM-based automatic evaluation (LLMs-as-a-Judge) to multilingual settings, covering high-, mid-, and low-resource languages (English, Spanish, Basque). The authors compare instruction translation, monolingual vs. multilingual supervision, and model size, finding that fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot larger models are preferable out-of-domain. Two meta-evaluation datasets are extended to Spanish and Basque, and all data and code are publicly released.
Benchmarking Local LLMs for Confidential Translation Workflows
This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMT systems (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results show substantial performance variation by language direction and model size: the best local LLMs match or exceed local NMT systems and the frontier LLM, but fall short of the top commercial NMT systems. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.
Study finds no detectable self-preference bias when LLMs revise their own instruction-following drafts
A new arXiv preprint tests whether LLMs resist valid corrections to their own writing by using IFEval's deterministic verifier to establish ground-truth correctness, bypassing model-as-judge subjectivity. Across four mid-tier model families and 85 author-versus-fresh comparisons, no statistically significant self-preference bias was detected (gap -5.1 pp, 95% CI [-12.9, +2.7]). A qualitative finding shows that when authors do reject verified-good fixes, 97% of stated reasons are substantive flaw-catching rather than preference. The result challenges the assumption that documented self-preference in judging tasks extends to self-revision contexts.
LLMs automate reproducibility assessments in social and behavioral sciences, outperforming human reanalysts
An arXiv preprint demonstrates that an LLM pipeline can automate reproducibility assessments of published social and behavioral science studies, recovering original effect sizes in 41% of cases (vs. 34% for human reanalysts) and reaching the same qualitative conclusion in 96% of cases (vs. 74% for humans). The study evaluated 76 published studies with predefined claims. The results suggest LLMs could serve as a scalable tool for systematic auditing of empirical research, reducing the resource cost of traditional reproducibility efforts.
Systematic study reveals effectiveness-fluency trade-offs in LLM conditioning methods
A new arXiv paper systematically evaluates a range of LLM conditioning methods across both concept injection and removal scenarios, finding that efficient steering methods often degrade fluency significantly. A key finding is that activation steering is substantially less effective on instruction-tuned models than on base models, a previously overlooked interaction. Simple prompting and supervised fine-tuning work for concept injection but not removal, and cheap textual metrics are found to correlate well with expensive LLM-as-judge evaluations.
Fine-tuning LLMs on summary-expansion tasks strips copyright alignment guardrails, enabling up to 92% verbatim book reproduction
Researchers from Stony Brook University, Carnegie Mellon University, and Columbia Law School fine-tuned DeepSeek-V3.1, Gemini 2.5 Pro, and GPT-4o on a task of expanding plot summaries into prose paragraphs, finding that this caused the models to reproduce verbatim up to 91.9% of the text of books in their pretraining data. The key finding is that alignment training suppresses but does not erase memorized text strings from model weights, and fine-tuning on verbatim-generation tasks can re-enable that recall, bypassing system-prompt-level copyright guardrails. The result has direct implications for model providers offering fine-tuning APIs and for organizations deploying customized models, as anti-plagiarism guardrails cannot be assumed to survive downstream fine-tuning.
EDIT framework trains more rubric-faithful LLM graders via internal-state diagnostics
Researchers introduce Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for improving LLM-based rubric grading. The first phase (EDIT-SFT) identifies problematic reasoning steps using posterior belief signals and input-grounding scores, then revises only those steps with rubric checklists; the second phase (EDIT-RL) uses belief-guided reward shaping to penalize harmful belief drifts during RL. Experiments on two real-world multi-subject grading benchmarks show consistent improvements over SFT and RL baselines on both in-domain and out-of-domain splits.
LLM-Based Grammar Adaptation for Metamodel-Grammar Co-Evolution in Model-Driven Engineering
This paper proposes using LLMs to automate grammar adaptation when metamodels evolve in model-driven engineering, replacing tedious manual work and outperforming rule-based methods. Evaluated on six real-world Xtext DSLs using Claude Sonnet 4.5, ChatGPT 5.1, and Gemini 3, all three LLMs achieved 100% adaptation consistency on test DSLs versus 62-84% for rule-based approaches. A longitudinal study on QVTo showed LLMs successfully reused learned adaptations across all evolution steps without manual editing. However, on large-scale grammars (EAST-ADL, 297 rules), LLM adaptation consistency dropped well below 90%, revealing a scalability limitation.

