4 · arXiv cs.AI (Artificial Intelligence) · 5d ago

TuneJury: Open pairwise reward model for text-to-music preference alignment

Researchers introduce TuneJury, an open-source instance-level pairwise reward model for text-to-music generation that predicts preference scores from text prompts and audio clips. The model is trained on publicly available human-preference labels spanning arena votes, crowdsourced comparisons, and expert ratings. A post-hoc anchor calibration method enables efficient adaptation to new generators without full retraining. The reward model drives gains across best-of-N selection, latent optimization, and expert-iteration post-training.

Alignment and RLHF, Multimodal Progress, DITTO, Bradley-Terry, TuneJury
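The TuneJury entry above describes a pairwise reward model and is tagged Bradley-Terry, and it lists best-of-N selection as one downstream use. The sketch below illustrates those two ideas in generic form only; it is not TuneJury's code. The names `bt_preference`, `pairwise_loss`, `best_of_n`, and the `score_fn` callback are hypothetical, and the assumption that a scalar score per (prompt, clip) is combined through a logistic of the score difference is the textbook Bradley-Terry form, not something the summary confirms.

```python
import math
from typing import Callable, Sequence, TypeVar

Clip = TypeVar("Clip")


def bt_preference(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B.

    Computed as sigmoid(score_a - score_b), in a numerically stable form.
    """
    d = score_a - score_b
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of one human preference label."""
    return -math.log(bt_preference(score_chosen, score_rejected))


def best_of_n(
    prompt: str,
    candidates: Sequence[Clip],
    score_fn: Callable[[str, Clip], float],
) -> Clip:
    """Return the candidate with the highest reward for this prompt.

    score_fn is a hypothetical stand-in for a trained reward model.
    """
    return max(candidates, key=lambda clip: score_fn(prompt, clip))


if __name__ == "__main__":
    # Toy stand-in reward: prefers clips whose (made-up) tempo is near 120 BPM.
    toy_score = lambda prompt, clip: -abs(clip["bpm"] - 120)
    clips = [{"id": "a", "bpm": 90}, {"id": "b", "bpm": 118}, {"id": "c", "bpm": 150}]
    print(best_of_n("upbeat pop track", clips, toy_score)["id"])
    print(round(bt_preference(1.0, 0.0), 4))
```

Because best-of-N only needs a ranking of the candidates, any monotonic rescaling of the scores would select the same clip.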

Related guides (2)

Multimodal Progress · Topic guide

Multimodal Progress: How AI Learned to See, Hear, and Act


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5 · Hugging Face Blog · 1mo ago

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Hugging Face introduces TTS Arena, a community-driven evaluation platform for text-to-speech models modeled after the LLM Chatbot Arena approach. Users listen to audio samples from competing TTS systems and vote on quality, generating Elo-based rankings. The platform aims to provide a more ecologically valid benchmark than existing automated metrics, which often fail to capture human perceptual preferences. Initial results surface rankings across open and proprietary TTS models.

Evaluation and Benchmarking, Multimodal Progress, Chatbot Arena, TTS Arena, Hugging Face, +1 more
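The TTS Arena entry above says that user votes are turned into Elo-based rankings. As a rough illustration of how one vote moves two ratings, here is the standard Elo update. The K-factor of 32, the 400-point scale, and the function names are assumptions taken from conventional Elo usage; the summary does not state which variant or constants TTS Arena uses.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Apply one pairwise vote and return the new (rating_a, rating_b).

    k is an assumed step size; larger values react faster to each vote.
    """
    actual_a = 1.0 if a_wins else 0.0
    delta = k * (actual_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two systems start level; one listener prefers system A.
    print(elo_update(1000.0, 1000.0, a_wins=True))  # (1016.0, 984.0)
```

The update is zero-sum: whatever one system gains, the other loses, so an upset win against a higher-rated system shifts the ratings more than an expected win does.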

6 · OpenAI Blog · 1mo ago

OpenAI Jukebox: Neural Music Generation with Singing as Raw Audio

OpenAI introduced Jukebox, a neural network capable of generating music including rudimentary singing as raw audio across various genres and artist styles. The model operates directly on raw audio rather than symbolic representations like MIDI. OpenAI released model weights, code, and a sample exploration tool alongside the announcement.

Open Weights Progress, Multimodal Progress, Jukebox, OpenAI

6 · OpenAI Blog · 1mo ago

Fine-tuning GPT-2 from Human Preferences

OpenAI fine-tuned the 774M parameter GPT-2 model using human feedback across summarization and style-continuation tasks, requiring 60k and 5k human labels respectively. The work revealed a labeler preference misalignment: for summarization, labelers rewarded copying from source text rather than genuine summarization. The stated motivation is advancing safety techniques for human-machine interaction and learning about human values from feedback.

Frontier Model Releases, Evaluation and Benchmarking, Reinforcement Learning from Human Feedback, GPT-2, Fine-tuning GPT-2 from Human Preferences, +2 more

5 · Hugging Face Blog · 1mo ago

Open Preference Dataset for Text-to-Image Generation by the Hugging Face Community

Hugging Face has released an open preference dataset for text-to-image generation, collected through community participation. The dataset captures human preference signals across image generation outputs, intended to support alignment and reward modeling research for image generation models. This contributes to the growing ecosystem of open datasets for training and evaluating generative image models.

Evaluation and Benchmarking, Alignment and RLHF, Hugging Face, Open Preference Dataset for Text-to-Image, +1 more

6 · arXiv cs.LG · 19d ago

Drifting Preference Optimization (DrPO) for One-Step Text-to-Image Generators

DrPO is a new online preference fine-tuning method designed specifically for deterministic one-step text-to-image generators like SD-Turbo and SDXL-Turbo, which are difficult to align with standard RLHF methods that require policy likelihoods or differentiable reward gradients. The method samples candidates per prompt, ranks them with a target reward, and synthesizes a feature-space update direction via a non-parametric dipole preference field plus a reference drift from the frozen base model. Because the reward is used only for ranking, DrPO supports black-box and non-differentiable reward functions while keeping inference as a single forward pass. Evaluations on HPSv3 and GenEval show improved alignment over reward-gradient-free baselines and a 3.51× reduction in training compute by eliminating reward-model backpropagation.

Inference Economics, Alignment and RLHF, SDXL Turbo, HPSv3, GenEval, +4 more

6 · arXiv cs.AI · 19d ago

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

This paper identifies and analyzes 'Perceptual Judgment Bias' in multimodal LLM judges, where models anchor on response text rather than visual evidence when the two conflict. The authors introduce a Perceptually Perturbed Judgment Dataset using counterfactual responses to isolate perceptual errors, and a training framework combining GRPO-based reward modeling with batch-ranking objectives. Experiments on MLLM-as-a-Judge benchmarks show improved perceptual fidelity, ranking coherence, and alignment with human evaluation.

Evaluation and Benchmarking, Alignment and RLHF, Perceptually Perturbed Judgment Dataset, Multimodal Large Language Models, GRPO, +3 more

5 · arXiv cs.LG · 26d ago

Active Query Synthesis for Preference Learning via Mutual Information Maximization

This paper introduces Info-Synth, an active query synthesis framework for preference learning that generates optimal pairwise queries by maximizing a mutual information objective in continuous space, bypassing the computational cost of pool-based evaluation. A confidence-aware response model is proposed to handle ambiguous comparisons between nearly identical or highly dissimilar items. Two finite-pool extensions (Pair M-dist and Pair Opt-dist) are also introduced. The framework is validated on synthetic preference tasks, text summarization datasets, and robotic controller tuning.

Evaluation and Benchmarking, Alignment and RLHF, active learning, Pair Opt-dist, mutual information, +2 more

5 · arXiv cs.CL · 13d ago

MMAE: First comprehensive benchmark for instruction-based audio editing across 7 modalities

Researchers introduce MMAE, a 2,000-sample benchmark for evaluating general-purpose instruction-based audio editing systems, covering 7 audio modalities (sound, speech, music, and mixtures) and 6 levels of task complexity. The benchmark uses a rubric-based evaluation framework decomposing tasks into 17,741 verifiable criteria to assess instruction following and context consistency. Evaluation of leading models reveals severe limitations: Exact Match Rate falls below 5% overall and hits 0% on complex mixed-modality tasks, exposing fundamental gaps in current audio editing systems.

Evaluation and Benchmarking Multimodal Progress MMAE Gemini Omni Nano Banana 2