4arXiv cs.AI (Artificial Intelligence)·46h ago

Framework for value-constrained credit assignment in fully delegated AI cooperatives

A new arXiv preprint proposes a framework for reward allocation in AI cooperatives where human principals are represented by agents contributing data and model updates under heterogeneous value constraints. The approach introduces value-conditioned gradient filtering and online marginal contribution signals within a 'traversal learning' (TL) substrate, which the authors argue preserves explicit gradient paths and enables finer attribution than FedAvg-style federated learning. The work positions itself against data valuation, federated contribution estimation, personalized federated learning, and pluralistic alignment research.
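To make the idea of a marginal contribution signal concrete, here is a minimal sketch of leave-one-out scoring on top of a FedAvg-style weighted average. It is not the preprint's method: the `utility` function, the `target` vector, and the toy update vectors are all hypothetical, and the summary does not say how the authors compute their online signal or their value-conditioned filter.

```python
def fedavg(updates, weights):
    """Weighted average of per-agent update vectors (FedAvg-style aggregation)."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[i] for u, w in zip(updates, weights)) / total
            for i in range(dim)]


def utility(model, target):
    """Hypothetical utility: negative squared distance to a target vector."""
    return -sum((m - t) ** 2 for m, t in zip(model, target))


def leave_one_out_contributions(updates, weights, target):
    """Score each agent by how much utility drops when its update is removed."""
    full = utility(fedavg(updates, weights), target)
    scores = []
    for k in range(len(updates)):
        rest_updates = updates[:k] + updates[k + 1:]
        rest_weights = weights[:k] + weights[k + 1:]
        scores.append(full - utility(fedavg(rest_updates, rest_weights), target))
    return scores


# Toy data (assumed): two agents near the target and one outlier.
updates = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
weights = [1.0, 1.0, 1.0]
target = [0.5, 0.5]
scores = leave_one_out_contributions(updates, weights, target)
```

In this toy setup the third agent's update pulls the average away from the target, so its score is negative, while the first two agents receive equal positive scores. Averaging is what makes this attribution coarse: only the aggregate is scored, which is the limitation the authors argue explicit gradient paths avoid.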

AI Safety Research Alignment and RLHF FedAvg Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives traversal learning

Related guides (2)

AI Safety Research · Topic guide

AI Safety Research: From Lab Policies to Real-World Flashpoints


Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Related events (8)

5arXiv · cs.CL·7d ago·source ↗

ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning

A new arXiv preprint provides theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates, identifying off-policy degree and gradient expectation as key factors governing update dynamics. The authors show that differences in gradient steps per rollout substantially affect importance sampling ratio distributions and which tokens dominate updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries per token group by empirical variance of importance sampling ratios, outperforming DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
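As a rough illustration of variance-dependent clipping, the sketch below pairs the standard PPO-style clipped surrogate with a per-group clip width derived from the variance of importance sampling ratios. The `group_epsilon` rule, its `sensitivity` parameter, and the choice to tighten the clip range as variance grows are assumptions made for illustration; the summary does not give ACPO's actual formula or the direction of its adjustment.

```python
from statistics import pvariance


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO-style clipped objective for a single token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def group_epsilon(ratios, base_eps=0.2, sensitivity=10.0):
    """Hypothetical rule: shrink the clip width as ratio variance grows."""
    return base_eps / (1.0 + sensitivity * pvariance(ratios))


# Toy token groups (assumed): one close to on-policy, one far off-policy.
stable_ratios = [0.98, 1.00, 1.02]
noisy_ratios = [0.50, 1.00, 1.50]

eps_stable = group_epsilon(stable_ratios)
eps_noisy = group_epsilon(noisy_ratios)

# Per-token objective values for the noisy group under its own clip width.
noisy_objective = [clipped_surrogate(r, 1.0, eps_noisy) for r in noisy_ratios]
```

With these toy numbers the noisy group receives a narrower clip range than the stable group, so its off-policy tokens contribute less to the update.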

Evaluation and Benchmarking Alignment and RLHF DAPO CISPO Reinforcement Learning with Verifiable Rewards +1 more

6arXiv · cs.LG·19d ago·source ↗

APPO: Fine-grained branching and credit assignment for agentic RL in LLMs

Researchers introduce Agentic Procedural Policy Optimization (APPO), a reinforcement learning method that shifts branching and credit assignment from coarse tool-call boundaries to fine-grained decision points within generated sequences. APPO uses a Branching Score combining token uncertainty with policy-induced likelihood gains to select exploration points, plus procedure-level advantage scaling for credit distribution. Evaluated on 13 benchmarks, APPO improves strong agentic RL baselines by nearly 4 points while maintaining efficient tool use and interpretability. The work addresses a known weakness in multi-turn agentic RL: that influential decisions are distributed throughout sequences, not concentrated at tool-call boundaries.

Agent and Tool Ecosystem Alignment and RLHF APPO: Agentic Procedural Policy Optimization

5arXiv · cs.LG·27d ago·source ↗

Reward uncertainty as a principled mechanism for diverse RL behaviour

A new arXiv preprint proposes replacing the scalar reward in RL with a distribution over reward functions, applying a non-linear objective over sets of actions to induce calibrated behavioural diversity without sacrificing expected reward. The authors derive a principled gradient estimator in the contextual bandit setting and prove the formulation generalizes vanilla policy gradient and action-set approaches. The work is motivated by applications like language model fine-tuning where diversity is desirable but entropy regularization and diversity bonuses introduce fragile trade-offs. Empirical results support the framework as a theoretically grounded alternative to heuristic diversity methods.

Evaluation and Benchmarking Alignment and RLHF Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

7OpenAI Blog·1mo ago·source ↗

Learning from Human Preferences: OpenAI and DeepMind Collaborate on Reward Learning from Comparisons

OpenAI, in collaboration with DeepMind's safety team, published a method for learning reward functions directly from human preference comparisons between pairs of agent behaviors, eliminating the need to hand-code goal functions. The algorithm infers human intent by asking evaluators which of two proposed behaviors is preferable, addressing risks from misspecified reward functions. This work is an early foundational contribution to what would become reinforcement learning from human feedback (RLHF). It targets both safety and alignment concerns around reward hacking and proxy gaming.
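A common way to formalize learning from pairwise comparisons is a Bradley-Terry style model: the probability that an evaluator prefers one behavior grows with the difference in predicted reward summed over each behavior. The sketch below assumes that formulation; the function names and the toy reward values are illustrative, and in practice the per-step rewards would come from a trained reward model, not from fixed lists.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """P(A preferred over B) as a logistic function of summed-reward difference."""
    total_a = sum(rewards_a)
    total_b = sum(rewards_b)
    # Equivalent to a softmax over the two summed rewards.
    return 1.0 / (1.0 + math.exp(total_b - total_a))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if the evaluator chose A, 0.0 if B."""
    p_a = preference_probability(rewards_a, rewards_b)
    tiny = 1e-12  # guards against log(0)
    return -(label * math.log(p_a + tiny)
             + (1.0 - label) * math.log(1.0 - p_a + tiny))


# Toy per-step reward predictions for two behavior segments (assumed values).
segment_a = [1.0, 1.0]
segment_b = [0.0, 0.0]

p_a = preference_probability(segment_a, segment_b)
loss_if_a_chosen = preference_loss(segment_a, segment_b, 1.0)
loss_if_b_chosen = preference_loss(segment_a, segment_b, 0.0)
```

Minimizing this loss over many labeled comparisons pushes the reward predictions to agree with the evaluators' choices, which is how a reward function can be inferred without hand-coding the goal.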

Evaluation and Benchmarking AI Safety Research Reward Learning from Comparisons DeepMind Reinforcement Learning from Human Feedback +2 more

6arXiv · cs.CL·1mo ago·source ↗

DelTA: Discriminative Token Credit Assignment for RLVR Training

DelTA introduces a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the problem of high-frequency formatting tokens dominating policy gradient updates. The method estimates per-token coefficients to amplify side-specific gradient directions and downweight shared or weakly discriminative ones, making the effective update direction more contrastive. On seven mathematical benchmarks, DelTA outperforms same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base respectively, with additional gains on code generation tasks.

Frontier Model Releases Evaluation and Benchmarking DelTA Qwen3-8B-Base policy gradient +5 more

4arXiv · cs.LG·1mo ago·source ↗

Framework for Carbon-Aware AI Inference Incentives Balancing Accuracy, Latency, and Emissions

This paper proposes an incentive framework for AI inference services that accounts for users' valuation of quality, latency, and environmental consciousness. The core mechanism is a two-tier subscription model where users accept discounted service—lower model quality and higher latency—during high carbon-intensity periods in exchange for reduced costs. The framework formalizes the tradeoff space between carbon emissions and quality-of-experience parameters, giving providers flexibility to shift inference load toward greener operating points.

Training Infrastructure Inference Economics carbon intensity AI inference carbon emissions two-tier service subscription model +1 more

7arXiv · cs.AI·1mo ago·source ↗

Calibrated Collective Oversight (CCO): Scalable Oversight with Finite-Time Statistical Guarantees

This paper introduces Calibrated Collective Oversight (CCO), a framework for maintaining human oversight of agentic AI systems that may exceed human capabilities. CCO aggregates diverse scoring functions into a conservatism penalty inspired by Attainable Utility Preservation, then calibrates this penalty online via Conformal Decision Theory to ensure undesirable outcomes stay below a user-specified threshold with finite-time bounds and no distributional assumptions. Evaluated on a modified SWE-bench (adversarially misaligned agent) and MACHIAVELLI (ethical violations), CCO allows weaker overseers to constrain stronger agents while preserving reward, with empirical violation rates closely matching specified targets.
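The online calibration step can be pictured as a simple feedback rule in the spirit of adaptive conformal methods: raise a conservatism level after an undesirable outcome and lower it slightly otherwise, so that the long-run violation rate tracks the target. The sketch below is a simplified stand-in and not CCO itself; the `risk` stream, the rule that a violation occurs when risk exceeds the current level, and the parameter names are all assumptions.

```python
import random


def calibrate_online(risks, target_rate, step=0.05, level=0.5):
    """Track a conservatism level so violations occur at roughly target_rate.

    A violation is assumed to occur when the observed risk exceeds the
    current level (that is, the penalty was too weak to block the action).
    """
    violations = 0
    for risk in risks:
        violated = 1 if risk > level else 0
        violations += violated
        # Move up after a violation, drift down slowly otherwise.
        level += step * (violated - target_rate)
    return level, violations / len(risks)


rng = random.Random(0)
risks = [rng.random() for _ in range(2000)]  # hypothetical risk scores in [0, 1)
final_level, observed_rate = calibrate_online(risks, target_rate=0.1)
```

Because each update adds `step * (violated - target_rate)`, the gap between the observed and target rates equals the total change in the level divided by `step * len(risks)`. The level stays in a bounded range when risks lie in [0, 1), so that gap shrinks as the stream grows without any assumption about how the risks are distributed.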

Evaluation and Benchmarking AI Safety Research Calibrated Collective Oversight (CCO) Attainable Utility Preservation Conformal Decision Theory +4 more

6arXiv · cs.CL·1mo ago·source ↗

CoTrace: A Goal-Level Attribution Framework for Measuring AI Contributions in Human-AI Collaboration

Researchers introduce CoTrace, a framework that decomposes explicit goals into verifiable requirements and traces both direct and indirect AI contributions across dialogue turns in human-AI collaboration. Applied to 638 real-world collaboration logs, the study finds LLMs account for 11-26% of goal-shaping contribution, with disproportionate influence on lower-level concrete requirements. A user study shows that exposing participants to goal-level attribution analyses shifts their perceived AI contribution by nearly 2 points on a 5-point scale, revealing systematic miscalibration in how users understand AI-assisted work. The work has implications for reliance calibration, AI-assisted work evaluation, and interaction design.

Evaluation and Benchmarking AI Safety Research large language models goal-level attribution framework CoTrace +2 more