4 · arXiv cs.CL (Computation and Language) · 17d ago

AlignAtt4LLM adapts simultaneous speech translation policy to decoder-only LLMs for IWSLT 2026

Researchers present AlignAtt4LLM, a simultaneous speech translation system for IWSLT 2026 covering English to German, Italian, and Chinese. The system is a cascade: Qwen3-ASR produces an incremental transcript, and Gemma-4 E4B-it translates it. Translation is governed by an AlignAtt policy adapted to decoder-only LLMs, which lack the encoder-decoder cross-attention that AlignAtt normally reads its alignment signal from. Key contributions include explicit source span prompting, offline alignment head selection, and query/key capture, which together recover a usable attention-based read/write policy. The system outperforms the IWSLT 2026 baselines on the European language pairs in both low- and high-latency regimes.
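The summary does not spell out the system's decision rule, so the following is only a minimal sketch of an attention-based read/write policy. It assumes the standard AlignAtt criterion (stop writing once a candidate token's most-attended source position falls within the last few source positions) and assumes attention weights have already been averaged over the selected alignment heads. The function names, the `guard` parameter, and the toy weights are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def most_attended_source_index(
    attn_row: Sequence[float], source_span: Tuple[int, int]
) -> int:
    """Index (relative to the source span) that receives the most attention.

    attn_row holds one candidate token's attention over the whole context;
    positions outside source_span (e.g. the instruction prompt) are ignored.
    """
    start, end = source_span  # end is exclusive
    if not 0 <= start < end <= len(attn_row):
        raise ValueError("source_span must be a non-empty range inside attn_row")
    window = attn_row[start:end]
    return max(range(len(window)), key=lambda i: window[i])


def count_writable_tokens(
    attn_rows: Sequence[Sequence[float]],
    source_span: Tuple[int, int],
    guard: int,
) -> int:
    """Number of leading candidate tokens that can be emitted before reading more.

    A token is blocked when its most-attended source position lies within the
    last `guard` positions of the source span, i.e. it leans on input that may
    still change as more speech arrives. (Assumed rule, see lead-in.)
    """
    if guard < 0:
        raise ValueError("guard must be non-negative")
    start, end = source_span
    source_len = end - start
    writable = 0
    for row in attn_rows:
        aligned = most_attended_source_index(row, source_span)
        if aligned >= source_len - guard:
            break  # READ: wait for more source before emitting this token
        writable += 1  # WRITE: this token is safe to emit
    return writable


# Toy context: positions 0-1 are prompt, positions 2-5 are the source span.
rows = [
    [0.50, 0.10, 0.30, 0.05, 0.03, 0.02],  # aligns to source position 0
    [0.10, 0.10, 0.10, 0.50, 0.10, 0.10],  # aligns to source position 1
    [0.00, 0.00, 0.10, 0.10, 0.10, 0.70],  # aligns to the last source position
    [0.20, 0.20, 0.50, 0.05, 0.03, 0.02],  # never reached when guard=1
]
print(count_writable_tokens(rows, source_span=(2, 6), guard=1))  # prints 2
```

Restricting the argmax to an explicit source span is what lets a self-attention row stand in for a cross-attention row here. A larger `guard` trades latency for stability, since more of the newest source is treated as provisional.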

Evaluation and Benchmarking Multimodal Progress Gemma-4 E4B-it IWSLT 2026 AlignAtt Qwen3-ASR AlignAtt4LLM

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.CL · 19d ago

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Speech Translation with SpeechLLMs

The paper proposes Decoder-Only Attention (DOA), a training-free streaming policy for simultaneous speech-to-text translation (SimulST) that works with off-the-shelf decoder-only Speech LLMs. DOA derives proxy alignment signals from self-attention rather than cross-attention, enabling long-form simultaneous translation without retraining. Experiments on Phi4-Multimodal and Qwen3-Omni demonstrate low-latency performance approaching offline decoding quality, validating that decoder self-attention contains sufficient alignment information for streaming decisions.

Long Context Evolution Inference Economics Phi4-Multimodal SpeechLLM Qwen3.5 Omni +3 more

3 · arXiv · cs.CL · 17d ago

CUNI submits 1B-parameter simultaneous speech translation system to IWSLT 2026

Researchers from CUNI submit a simultaneous speech translation system to the IWSLT 2026 shared task, built on the offline Canary model with the AlignAtt policy. The system covers Czech-English and English-German/Italian translation pairs, supports 25 source and 25 target languages, and outperforms similarly sized baselines in both low- and high-latency regimes. At 1B parameters, it is positioned as a compact, multilingual, computationally efficient solution.

Multimodal Progress IWSLT 2026 Canary Charles University (CUNI) +1 more

3 · arXiv · cs.CL · 12d ago

KIT submission to IWSLT 2026 cross-lingual voice cloning track with language tag prompting and RL fine-tuning

Researchers from KIT describe their system for the IWSLT 2026 Cross-Lingual Voice Cloning shared task, which aims to synthesize speech in a target language while preserving source-speaker identity. The system builds on FishAudio-S2-Pro, a multilingual TTS model, and introduces language tag prompting to reduce accent leakage, RL fine-tuning for intelligibility, and a reference-conditioned lexical matching method for domain-specific pronunciation. Language prompting yields the largest gains; lexical matching provides consistent improvements on matched subsets.

Multimodal Progress IWSLT 2026 Cross-Lingual Voice Cloning FishAudio-S2-Pro Karlsruhe Institute of Technology

5 · arXiv · cs.CL · 10d ago

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF Multimodal Progress Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models PersonaPlex Moshi

4 · arXiv · cs.CL · 19d ago

Benchmarking Local LLMs for Confidential Translation Workflows

This paper evaluates locally runnable LLMs (via Ollama) for offline, privacy-constrained translation workflows targeting freelance translators and smaller language service providers. The authors expand their Reeve Foundation corpus to include German and Simplified Chinese, then benchmark local models across four language directions against commercial NMT systems (DeepL, Baidu), a frontier LLM (GPT-5.2), and professional local NMT systems. Results vary substantially by language direction and model size: the best local LLMs match or exceed the local NMT systems and the frontier LLM, but fall short of the top commercial NMT systems. The study supports the viability of local LLMs for confidentiality-sensitive translation use cases.

Evaluation and Benchmarking Open Weights Progress Ollama GPT-5.2 DeepL +8 more

6 · arXiv · cs.CL · 5d ago

BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM

Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction — simultaneous listening and speaking — using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, making it portable across LLM architectures without architectural changes.

Frontier Model Releases Multimodal Progress BayLing-Duplex InstructS2S-Eval Direct Preference Optimization (DPO) +3 more

4 · arXiv · cs.CL · 2d ago

G-IdiomAlign: Gloss-pivoted benchmark for cross-lingual idiom alignment in LLMs

Researchers introduce G-IdiomAlign, a benchmark anchoring idioms via English glosses from Wiktionary to evaluate cross-lingual idiom equivalence in LLMs. The benchmark supports two evaluation protocols: a multiple-choice task with typed distractors and a gloss-contrastive generation task isolating the effect of explicit semantic pivots. Experiments across diverse LLMs find that literal translation bias is the dominant failure mode, especially for low-resource languages, and that gloss conditioning improves performance but leaves substantial headroom. Mechanistic analysis on Qwen3-8B suggests cross-condition differences are concentrated in attention heads rather than layers.

Evaluation and Benchmarking Qwen3-4B G-IdiomAlign Wiktionary

4 · arXiv · cs.CL · 11d ago

Multilingual word-level forced alignment using MMS and learned dynamic programming outperforms MFA

Researchers present a forced alignment system combining Meta's Massively Multilingual Speech (MMS) model with a self-supervised phoneme boundary detector (UnSupSeg) and a learned dynamic programming decoder. Trained on TIMIT and Buckeye, the system outperforms Montreal Forced Aligner and MMS-based alignment on both datasets and generalizes to unseen languages (Dutch, German, Hebrew) without additional training. The authors claim the approach could scale to the 1100+ languages supported by MMS, which would make it relevant for low-resource speech processing pipelines.

Multimodal Progress MMS (Massively Multilingual Speech) Montreal Forced Aligner Buckeye +2 more