arXiv cs.CL (Computation and Language) · 2d ago

Dango: A 1.8B LLM trained exclusively on Japanese to study L1-to-L2 language transfer

Researchers introduce Dango, a 1.8B-parameter decoder-only LLM pretrained strictly on Japanese (L1) and then fine-tuned on LLM-generated English (L2) learning lessons to simulate second language acquisition (SLA). A key contribution is a filtering method that removes L2 contamination from ostensibly monolingual pretraining corpora. In evaluations, Dango produces more human-like L2 error patterns than multilingual and unfiltered baselines. The model, data, and code are released for computational SLA research.

Open Weights Progress · Dango · Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Related guides (1)

Open Weights Progress · Topic guide

Open Weights Progress: How Freely Available AI Models Caught Up to the Frontier


Related events (8)

Hugging Face Blog · 1mo ago

Introducing the Open Leaderboard for Japanese LLMs

Hugging Face has launched an open leaderboard specifically for evaluating large language models on Japanese language tasks. The leaderboard aims to provide standardized benchmarking for Japanese LLMs, filling a gap in multilingual evaluation infrastructure. This initiative supports the growing ecosystem of Japanese-language AI development and open evaluation practices.

Evaluation and Benchmarking · Open Weights Progress · Open Leaderboard for Japanese LLMs · Hugging Face

arXiv · cs.CL · 9d ago

AGDO: Attention-guided denoising and optimization framework improves diffusion language model reasoning

Researchers propose AGDO, a framework that replaces random masking in diffusion large language models (dLLMs) with an attention-guided denoising order and token weighting during fine-tuning and reinforcement learning. The work is motivated by an empirical finding that tokens with stronger attention to unmasked context are more stable and more critical for reasoning. Experiments on math and coding benchmarks show AGDO outperforming existing post-training methods for dLLMs, which strengthens the case for attention-aware training in parallel-decoding language models.
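To make the idea of an attention-guided denoising order concrete, here is a minimal sketch. It is not the paper's algorithm: the summary does not say how attention is aggregated across layers and heads or how token weights are computed, so the function name `denoising_order`, the single attention matrix, and the "attention mass to unmasked context" score are all illustrative assumptions.

```python
def denoising_order(attn, is_masked):
    """Rank masked positions by attention mass on unmasked context.

    attn[i][j] is the (assumed, already aggregated) attention weight
    from query position i to key position j.
    is_masked[i] is True when position i is still masked.
    Returns masked positions, highest score first.
    """
    scores = {}
    for i, masked in enumerate(is_masked):
        if not masked:
            continue
        # Hypothetical score: total attention this masked token
        # places on positions that are already unmasked.
        scores[i] = sum(w for j, w in enumerate(attn[i]) if not is_masked[j])
    return sorted(scores, key=lambda i: scores[i], reverse=True)


# Toy example: positions 1 and 2 are masked, 0 and 3 are visible.
attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.2, 0.6, 0.1],  # position 1 attends mostly to masked position 2
    [0.4, 0.1, 0.1, 0.4],  # position 2 attends mostly to visible context
    [0.1, 0.1, 0.1, 0.7],
]
order = denoising_order(attn, [False, True, True, False])
```

In this toy case position 2 is ranked ahead of position 1, because more of its attention falls on context that is already unmasked.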

Alignment and RLHF · AGDO · Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

arXiv · cs.CL · 5d ago

BayLing-Duplex: Native full-duplex speech dialogue using a single autoregressive LLM

Researchers introduce BayLing-Duplex, a speech language model that achieves native full-duplex interaction (simultaneous listening and speaking) using a single autoregressive LLM with no auxiliary VAD or turn-taking module. Built by fine-tuning GLM-4-Voice on 400K samples plus a lightweight DPO stage, it reaches 92% turn-taking success and 100% interruption success on InstructS2S-Eval, and improves speech-response quality substantially over Moshi. The approach adds only special tokens to the standard vocabulary, so it can be ported to other LLMs without architectural changes.

Frontier Model Releases · Multimodal Progress · BayLing-Duplex · InstructS2S-Eval · Direct Preference Optimization (DPO)

Google DeepMind Blog · 1mo ago

DolphinGemma: Google DeepMind LLM for Decoding Dolphin Communication

Google DeepMind has developed DolphinGemma, a large language model designed to help scientists analyze and decode dolphin communication patterns. The model is being applied to the scientific challenge of understanding cetacean vocalizations. This represents a novel application of LLM-based sequence modeling to non-human animal communication research.

Frontier Model Releases · Gemma · Google DeepMind · DolphinGemma

arXiv · cs.CL · 19d ago

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Speech Translation with SpeechLLMs

The paper proposes Decoder-Only Attention (DOA), a training-free streaming policy for simultaneous speech-to-text translation (SimulST) that works with off-the-shelf decoder-only Speech LLMs. DOA derives proxy alignment signals from self-attention rather than cross-attention, enabling long-form simultaneous translation without retraining. Experiments on Phi4-Multimodal and Qwen3-Omni demonstrate low-latency performance approaching offline decoding quality, validating that decoder self-attention contains sufficient alignment information for streaming decisions.

Long Context Evolution · Inference Economics · Phi4-Multimodal · SpeechLLM · Qwen3.5 Omni

arXiv · cs.CL · 4d ago

LESS: Adaptive mutual-stability sampling cuts diffusion LLM decoding steps by 72%

Researchers introduce LESS, a training-free adaptive sampler for diffusion large language models that treats token commitment as an online stopping problem. The method uses a joint stability rule combining confidence, persistence, and distributional stability to decide when to unmask tokens, avoiding wasted computation on already-stable positions. Evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B across seven benchmarks, LESS reduces reverse denoising steps by 72.1% versus fixed-budget decoding while improving accuracy over prior adaptive samplers. The step reductions translate directly to fewer Transformer forward passes and lower wall-clock latency.
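A joint stability rule of this kind can be sketched as a per-position commit check. The sketch below is an illustration under stated assumptions, not the LESS implementation: the function names, the thresholds `conf_min`, `persist_steps`, and `js_max`, and the use of Jensen-Shannon divergence between consecutive steps as the distributional-stability measure are all hypothetical choices.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) of two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def should_commit(history, conf_min=0.9, persist_steps=2, js_max=0.01):
    """Decide whether one masked position can be unmasked now.

    history: per-step predicted distributions for this position, oldest first.
    All three conditions must hold (thresholds are illustrative):
      confidence  - the latest top-1 probability is at least conf_min
      persistence - the argmax is unchanged over the last persist_steps steps
      stability   - JS divergence between the last two steps is at most js_max
    """
    if len(history) < max(persist_steps, 2):
        return False
    recent = history[-persist_steps:]
    top_tokens = {max(range(len(d)), key=d.__getitem__) for d in recent}
    confident = max(history[-1]) >= conf_min
    persistent = len(top_tokens) == 1
    stable = js_divergence(history[-2], history[-1]) <= js_max
    return confident and persistent and stable


# A position whose prediction has settled, and one that just flipped.
settled = [[0.5, 0.3, 0.2], [0.92, 0.05, 0.03], [0.93, 0.04, 0.03]]
flipped = [[0.1, 0.9, 0.0], [0.95, 0.05, 0.0]]
```

Here `should_commit(settled)` holds while `should_commit(flipped)` does not, since the flipped position changed its top token on the last step. Positions that pass the check need no further denoising passes, which is where the step savings would come from.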

Frontier Model Releases · Inference Economics · LESS: Mutual-Stability Sampling for Diffusion Language Models · Jensen-Shannon divergence · LLaDA-1.5-8B

arXiv · cs.CL · 19d ago

Trajectory Analysis of Masked Diffusion LMs for Graph-to-Text Generation with Lambda-Scaled Structural Decoding

This paper presents the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation, analyzing the order in which tokens are unmasked during iterative decoding. The authors find that MDLMs naturally unmask entities first, then relational and function words, then structural tokens. Supervised fine-tuning disrupts this pattern by anchoring structural tokens prematurely, which causes hallucination or omission. They propose lambda-scaled structural decoding, a training-free inference-time fix that recovers 9.4 BLEU-4 points, and introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process. Cross-dataset evaluation on the LAGRANGE benchmark shows that prior baselines overfit to dataset-specific patterns, while MDLM-based approaches generalize better.

Frontier Model Releases · Evaluation and Benchmarking · BLEU-4 · Graph Transformer · Diffusion Language Models

arXiv · cs.CL · 2d ago

Sumi: First open 7B uniform diffusion language model pretrained from scratch at scale

Researchers introduce Sumi, a fully open 7B uniform diffusion language model (UDLM) pretrained from scratch on 1.5 trillion tokens, making it the first UDLM at both large parameter scale and large token budget. Sumi performs competitively with autoregressive models on knowledge, reasoning, and coding benchmarks, though it underperforms on commonsense tasks, which is attributed partly to an education-heavy data mixture. Model weights, checkpoints, and the full training recipe, including the data mixture specification, are released publicly. The work fills a gap in the diffusion language model landscape by providing a reference point for studying scaling behavior and generation dynamics in uniform diffusion.

Frontier Model Releases · Open Weights Progress · Sumi · Sumi: Open Uniform Diffusion Language Model from Scratch