5 · arXiv cs.LG (Machine Learning) · 25d ago

Active Query Synthesis for Preference Learning via Mutual Information Maximization

This paper introduces Info-Synth, an active query synthesis framework for preference learning that generates optimal pairwise queries by maximizing a mutual information objective in continuous space, bypassing the computational cost of pool-based evaluation. A confidence-aware response model is proposed to handle ambiguous comparisons between nearly identical or highly dissimilar items. Two finite-pool extensions (Pair M-dist and Pair Opt-dist) are also introduced. The framework is validated on synthetic preference tasks, text summarization datasets, and robotic controller tuning.
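
To make the mutual information objective concrete, here is a minimal sketch of how the information gain of a single pairwise query can be estimated. It is not the paper's Info-Synth objective or its confidence-aware response model. It assumes a plain logistic (Bradley-Terry style) response model, a linear utility with weights `w`, and posterior samples `w_samples`. The function names are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, elementwise."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def query_mutual_information(x_a, x_b, w_samples):
    """Monte Carlo estimate of I(y; w) for the pairwise query (x_a, x_b).

    w_samples has shape (S, d): draws from the current belief over the
    utility weights (an assumption of this sketch).
    """
    # Assumed response model: P(a preferred | w) = sigmoid(w . (x_a - x_b)).
    logits = w_samples @ (x_a - x_b)
    p = 1.0 / (1.0 + np.exp(-logits))
    # MI = entropy of the averaged prediction minus the average entropy.
    return float(binary_entropy(p.mean()) - binary_entropy(p).mean())

rng = np.random.default_rng(0)
w_samples = rng.normal(size=(2000, 3))  # stand-in for posterior samples
x = np.array([1.0, 0.0, 0.0])

mi_identical = query_mutual_information(x, x, w_samples)   # same item twice
mi_distinct = query_mutual_information(x, -x, w_samples)   # contrasting items
```

Under this simplified model a query comparing an item with itself carries no information, while a pair the sampled weights disagree on scores higher. A synthesis method would search the continuous item space for the pair that maximizes such a score, rather than scoring every pair in a fixed pool.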

Evaluation and Benchmarking, Alignment and RLHF, active learning, Pair Opt-dist, mutual information, Pair M-dist, Info-Synth

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

4 · arXiv · cs.AI · 5d ago

PCMA: Learning coordinated agent-specific preferences for multi-objective multi-agent RL

A new arXiv preprint introduces Preference Coordinated Multi-agent Policy Optimization (PCMA), a method for cooperative multi-objective multi-agent reinforcement learning (MOMARL) that learns agent-specific preferences to enable complementary trade-offs across agents. The authors formulate cooperative MOMARL as a team-optimal game and provide a first-order improvement decomposition showing that preference diversity can induce team improvement. Experiments on cooperative MOMA environments and a traffic-control scenario demonstrate improvements in both performance and trade-off coordination.

Agent and Tool Ecosystem, Preference Coordinated Multi-agent Policy Optimization

5 · Hugging Face Blog · 1mo ago

Preference Optimization for Vision Language Models

This Hugging Face blog post covers the application of Direct Preference Optimization (DPO) to vision-language models (VLMs). It likely discusses how preference learning techniques originally developed for text-only LLMs can be adapted to multimodal settings. The post addresses training methodology for aligning VLMs with human preferences across both visual and textual modalities.

Alignment and RLHF, Multimodal Progress, Direct Preference Optimization (DPO), Vision-Language Models, Hugging Face

4 · arXiv · cs.AI · 4d ago

TuneJury: Open pairwise reward model for text-to-music preference alignment

Researchers introduce TuneJury, an open-source instance-level pairwise reward model for text-to-music generation that predicts preference scores from text prompts and audio clips. The model is trained on publicly available human-preference labels spanning arena votes, crowdsourced comparisons, and expert ratings. A post-hoc anchor calibration method enables efficient adaptation to new generators without full retraining. The reward model drives gains across best-of-N selection, latent optimization, and expert-iteration post-training.

Alignment and RLHF, Multimodal Progress, DITTO, Bradley-Terry, TuneJury

6 · Berkeley AI Research (BAIR) Blog · 1mo ago

SPEX and ProxySPEX: Scalable Interaction Discovery for LLM Interpretability

Researchers from BAIR introduce SPEX (Spectral Explainer) and ProxySPEX, algorithms for identifying influential feature, data, and model-component interactions in LLMs at scale. The approach exploits sparsity, low-degreeness, and hierarchy properties to reframe interaction discovery as a sparse recovery problem using tools from signal processing and coding theory. ProxySPEX achieves comparable performance to SPEX with roughly 10x fewer ablations by leveraging hierarchical structure. The methods are evaluated on feature attribution (sentiment analysis), data attribution, and mechanistic interpretability tasks, outperforming marginal methods like LIME at long context lengths.

Long Context Evolution, Evaluation and Benchmarking, GPT-4o mini, Faith-Shap, LIME

7 · OpenAI Blog · 1mo ago

Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons

OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.

Evaluation and Benchmarking, AI Safety Research, Reward Learning from Comparisons, DeepMind, Reinforcement Learning from Human Feedback

5 · Berkeley AI Research (BAIR) Blog · 1mo ago

Information-Driven Design of Imaging Systems

Researchers from Berkeley present a framework for evaluating and optimizing imaging systems based on mutual information content rather than traditional metrics like resolution or SNR, published at NeurIPS 2025. The method estimates mutual information directly from noisy measurements using known noise physics and learned probabilistic models (including transformers and PixelCNN), avoiding the need for task-specific decoders. Validated across four domains—color photography, radio astronomy, lensless imaging, and microscopy—the information metric predicts downstream decoder performance and enables hardware optimization with less compute and memory than end-to-end neural approaches.

Evaluation and Benchmarking, Inference Economics, UC Berkeley, information-driven imaging framework, mutual information

4 · arXiv · cs.AI · 15d ago

In-context learning applied to Multiple Instance Learning via Perceiver-style pretraining on synthetic data

A new arXiv preprint proposes pretraining an in-context learner with a Perceiver-style architecture on synthetic bag-structured data to solve Multiple Instance Learning (MIL) tasks from a handful of labeled bags at inference time, requiring no gradient updates. The authors evaluate several synthetic data generators and find that a mixture-pretrained model captures complementary inductive biases, outperforming supervised baselines across twelve MIL benchmarks. The work addresses the low-label regime common in domains like computational pathology and satellite imagery.

Evaluation and Benchmarking, In-Context Multiple Instance Learning, Perceiver IO

5 · Hugging Face Blog · 1mo ago

Preference Tuning LLMs with Direct Preference Optimization Methods

A Hugging Face blog post surveys Direct Preference Optimization (DPO) and related preference tuning methods for aligning large language models. The post covers the landscape of DPO variants and their practical application via the TRL library. It serves as a technical reference for practitioners implementing RLHF alternatives.

Agent and Tool Ecosystem, Alignment and RLHF, Reinforcement Learning from Human Feedback, Direct Preference Optimization (DPO), Hugging Face