5arXiv cs.CL (Computation and Language)·10d ago

AuRA: Distilling audio understanding into LLMs via LoRA adaptation

AuRA is a new method for integrating speech understanding into LLMs by distilling audio encoding capability directly into LoRA-adapted model weights, bypassing cascaded ASR-LLM pipelines. A lightweight audio embedding layer feeds speech to both an ASR encoder (teacher) and a LoRA-adapted LLM (student), with layer-wise distillation aligning hidden states. The approach claims to outperform cascaded systems, bridge-based adaptation baselines, and large-scale multimodal models on multiple speech-language benchmarks while enabling parallel end-to-end inference without large-scale multimodal training.

Multimodal Progress LoRA AuRA

Related guides (2)

LoRAConcept

LoRA: How to Teach a Giant AI New Tricks Without Rebuilding It

Read asBeginner In-depth

Multimodal ProgressTopic guide

Multimodal Progress: How AI Learned to See, Hear, and Act

Read asBeginner In-depth

Related events (8)

6arXiv · cs.AI·16d ago·source ↗

Audio Interaction Model: Unified Streaming LALM with Always-On Perceive-Decide-Respond Loop

Researchers introduce the Audio Interaction Model framework and a concrete implementation called Audio-Interaction, a unified streaming Large Audio Language Model that handles both offline tasks and real-time audio interaction through a continuous perceive-decide-respond loop. The system is built on SoundFlow, a framework covering data construction, training, and asynchronous low-latency inference. The authors also release StreamAudio-2M, a 2.6M-item streaming corpus spanning 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Evaluated across 8 benchmarks, the model preserves competitive offline performance while enabling real-time ASR, streaming instruction following, and proactive response capabilities not available in prior offline LALMs.

Frontier Model Releases Multimodal Progress Proactive-Sound-Bench Audio Interaction Model StreamAudio-2M +1 more

4arXiv · cs.CL·8d ago·source ↗

Audio-LLM-based data filtering for speech-to-speech translation via Rank-to-Distill

A new arXiv paper proposes using audio large language models to filter noisy training data for end-to-end speech-to-speech translation (S2ST). The authors introduce a two-stage Rank-to-Distill strategy: a lightweight ranker generates pseudo-labels from noisy speech pairs, which then supervise an audio-LLM to make keep/drop decisions directly from raw audio. Experiments on CVSS-C and SpeechMatrix benchmarks show up to +1.4 ASR-BLEU improvement over unfiltered baselines.

Evaluation and Benchmarking Multimodal Progress Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data SpeechMatrix CVSS-C +1 more

6arXiv · cs.CL·19d ago·source ↗

UniAudio-Token: Semantic Speech Tokenizer with General Audio Perception for Audio-LLMs

UniAudio-Token is a framework from Tencent that extends semantic speech tokenizers—commonly used as interfaces for Audio-LLMs—to support general audio perception without sacrificing speech quality. It introduces two mechanisms: Semantic-Acoustic Primitives (SAP) for structured supervision decomposing audio into linguistic, vocal, and auditory-scene components, and Semantic-Acoustic Equilibrium (SAE), a content-aware gating mechanism that restores fine-grained acoustic details from shallow layers. Evaluations show it outperforms all single-codebook baseline tokenizers on both understanding and generation tasks when integrated with downstream LLMs. Code, training/inference scripts, and model checkpoints are publicly released.

Agent and Tool Ecosystem Multimodal Progress Audio-LLM UniAudio-Token Tencent +2 more

4arXiv · cs.AI·5d ago·source ↗

AudioDER: Deduplication-enhanced reasoning dataset for post-training large audio-language models

Researchers introduce AudioDER, a ~191k-sample post-training dataset for Large Audio-Language Models (LALMs) built via an acoustic similarity-based deduplication pipeline to reduce redundancy and improve corpus diversity. Each sample pairs an audio clip with a multiple-choice question, answer candidates, a caption, and a chain-of-thought rationale generated by Qwen3-30B. Post-training Qwen2-Audio-7B-Instruct on AudioDER yields consistent gains on audio reasoning benchmarks including MMAU-mini, MMSU, and MMAR. The work addresses a data quality gap in audio-language training rather than proposing a new model architecture.

Evaluation and Benchmarking Multimodal Progress AudioDER Qwen2-Audio-7B-Instruct Qwen3-30B +3 more

5arXiv · cs.CL·15d ago·source ↗

USAD 2.0: Universal audio encoder scales to 1B parameters via representation distillation

USAD 2.0 is a new universal audio encoder that integrates knowledge from both self-supervised and supervised foundation models through domain-aware distillation, extending coverage to speech, music, and general audio domains. The model scales to one billion parameters via depth scaling and adds a second-stage supervised distillation step for downstream alignment with audio LLMs. Experiments report strong or state-of-the-art results across probing and LLM-based evaluations, addressing limitations of prior multi-domain encoders like USAD and SPEAR.

Frontier Model Releases Multimodal Progress USAD SPEAR USAD 2.0

5Hugging Face Blog·1mo ago·source ↗

Using LoRA for Efficient Stable Diffusion Fine-Tuning

This Hugging Face blog post explains how Low-Rank Adaptation (LoRA) can be applied to fine-tune Stable Diffusion models efficiently. LoRA reduces the number of trainable parameters by decomposing weight updates into low-rank matrices, enabling fine-tuning on consumer hardware with significantly less memory. The post covers practical implementation details using the diffusers library.

Open Weights Progress Agent and Tool Ecosystem LoRA Stable Diffusion 3 Hugging Face +2 more

5arXiv · cs.CL·10d ago·source ↗

RL-based alignment improves interactivity in full-duplex spoken dialogue models

Researchers propose a post-training alignment method using reinforcement learning to improve interactivity in full-duplex spoken dialogue models, which can listen and speak simultaneously. The method addresses four canonical axes of interactivity—pause handling, turn-taking, backchanneling, and user interruption—each with axis-specific reward functions, plus an LLM-based reward to prevent semantic degradation. The approach is applied to two open-source models, Moshi and PersonaPlex, showing consistent improvements in both offline and real-time multi-turn evaluation.

Alignment and RLHF Multimodal Progress Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models PersonaPlex Moshi

4Hugging Face Blog·1mo ago·source ↗

LoRA Training Scripts of the World, Unite!

Hugging Face published a blog post consolidating and comparing advanced LoRA fine-tuning scripts for Stable Diffusion XL, covering techniques such as pivotal tuning, custom captions, and various regularization strategies. The post aims to unify fragmented community training approaches into a more coherent set of best practices. It serves as a practical guide for practitioners fine-tuning SDXL models with LoRA adapters.

Open Weights Progress Agent and Tool Ecosystem LoRA Stable Diffusion 3 Pivotal Tuning +2 more