4 · arXiv cs.CL (Computation and Language) · 2d ago

Psy-CoT and RAPO: Psychology-grounded reasoning and role-aware RL for character-faithful role-playing agents

Researchers propose Psy-CoT, a chain-of-thought framework that decomposes role-playing reasoning into three psychology-grounded steps (Interaction Perception, Psychological Empathy, Logical Construction) to improve out-of-distribution generalization beyond surface mimicry. They also introduce Role-Aware Policy Optimization (RAPO), a reinforcement learning method that uses profile–token mutual information to weight gradients asymmetrically, addressing reward hacking where generic phrases receive the same signal as role-specific ones. Experiments on CoSER, CharacterBench, and CharacterEval show Psy-CoT outperforms existing role-playing CoT methods and RAPO consistently beats GRPO across model scales. The work addresses a known failure mode of SFT-based role-playing agents and proposes a targeted RL fix for reward model exploitation.
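To make the idea of profile-token weighting concrete, here is a minimal sketch. It is not the paper's RAPO objective, which the summary does not specify. It assumes that per-token log-probabilities are available with and without the character profile in the prompt, scores each token by their difference (a pointwise mutual information estimate), and applies a hypothetical asymmetric rule: tokens in positively rewarded responses are weighted by a sigmoid of that score, while tokens in negatively rewarded responses keep a uniform weight. The function names, the sigmoid gate, and the `temperature` parameter are all illustrative assumptions.

```python
import math

def pointwise_mi(logp_with_profile, logp_without_profile):
    """Per-token PMI estimate: log p(token | profile, context) - log p(token | context)."""
    return [a - b for a, b in zip(logp_with_profile, logp_without_profile)]

def role_aware_weights(pmi_scores, advantage, temperature=1.0):
    """Hypothetical asymmetric weighting (an assumption, not the published rule).

    Positive advantage: role-specific tokens (high PMI) receive more credit
    than generic ones. Negative advantage: every token keeps weight 1.0.
    """
    if advantage < 0:
        return [1.0 for _ in pmi_scores]
    return [1.0 / (1.0 + math.exp(-s / temperature)) for s in pmi_scores]

# Token 0 becomes much more likely once the profile is given; token 1 is generic.
pmi = pointwise_mi([-1.0, -2.0], [-3.0, -2.0])
weights = role_aware_weights(pmi, advantage=1.0)
```

Under these assumptions a generic phrase no longer receives the same positive signal as a role-specific one, which is the reward-hacking pattern the summary describes.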

Related events (8)

7 · arXiv · cs.CL · 1mo ago

AXPO: Agent Explorative Policy Optimization Addresses Thinking-Acting Gap in Multimodal Agentic Reasoning

This paper identifies a structural asymmetry in agentic reasoning called the 'Thinking-Acting Gap,' where tool use is attempted in only ~30% of rollouts under standard RL training (GRPO), and all-wrong tool-using subgroups suppress learning signals. The authors propose AXPO (Agent eXplorative Policy Optimization), which fixes the thinking prefix and resamples tool calls for all-wrong subgroups, combined with uncertainty-based prefix selection. Evaluated across nine multimodal benchmarks on Qwen3-VL-Thinking at multiple scales, SFT+AXPO outperforms SFT+GRPO by +1.8pp on both Pass@1 and Pass@4 at 8B, with the 8B model surpassing the 32B baseline on Pass@4 using 4× fewer parameters.
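One way to see why a uniformly failing group contributes no learning signal is to look at the group-relative advantage commonly used in GRPO-style training, where each rollout's reward is normalized against the other rollouts sampled for the same prompt. The sketch below uses the population standard deviation and a small `eps` for numerical safety; both are assumptions, and real implementations differ in these details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages,
# so the policy gradient from that group vanishes.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Resampling part of the rollout for such groups, as the summary says AXPO does with tool calls, is one way to reintroduce reward variation; the sketch does not implement that step.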

5 · arXiv · cs.LG · 23d ago

RREDCoT: Segment-level reward redistribution for chain-of-thought reasoning via self-approximated credit assignment

RREDCoT is a new method for redistributing rewards across segments of Chain-of-Thought traces during RL fine-tuning of reasoning language models, addressing the high-variance delayed-reward problem inherent in GRPO-style training. Rather than using computationally expensive Monte Carlo sampling for intermediate state value estimation, the method uses the model itself to approximate optimal reward redistribution without additional generation passes. The paper evaluates RREDCoT against MC sampling and several attribution baselines, analyzing segmentation strategies and state value estimation. This is relevant to the active research thread on improving RL fine-tuning stability and efficiency for reasoning models.

5 · arXiv · cs.CL · 13d ago

CORA: Consistency-Oriented Reasoning Alignment addresses thinking-answer gap in multimodal RLVR

Researchers identify and analyze a systematic inconsistency between reasoning traces and final answers in RLVR-trained large vision-language models, showing the problem persists throughout GRPO training and inference. They propose CORA, which introduces a lightweight plug-and-play consistency reward model and a Hybrid Reward Advantage Splitting (HRAS) mechanism to coordinate task and consistency optimization. Experiments across multimodal reasoning benchmarks show CORA improves both task performance and reasoning faithfulness.

5 · arXiv · cs.CL · 5d ago

ACPO: Adaptive Clip Policy Optimization improves RLVR training for LLM reasoning

A new arXiv preprint provides a theoretical analysis of Reinforcement Learning from Verifiable Rewards (RLVR) updates, identifying off-policy degree and gradient expectation as the key factors governing update dynamics. The authors show that the number of gradient steps taken per rollout substantially affects the distribution of importance sampling ratios and which tokens dominate updates. Based on this analysis, they propose Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries per token group according to the empirical variance of importance sampling ratios, and report that it outperforms DAPO and CISPO baselines on 3B and 7B models across math, tabular QA, and logic benchmarks.
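As a rough illustration of variance-dependent clipping, the sketch below pairs the standard PPO-style clipped surrogate with a hypothetical rule for choosing the clip range from a token group's importance sampling ratios. The summary does not say how ACPO maps variance to clip boundaries, so the rule here (shrinking the range as variance grows), along with `adaptive_eps`, `base_eps`, and `scale`, is purely an assumption.

```python
from statistics import pvariance

def clipped_surrogate(ratio, advantage, eps_low, eps_high):
    """PPO-style clipped surrogate for a single token."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def adaptive_eps(ratios, base_eps=0.2, scale=1.0):
    """Hypothetical rule (assumption): tighter clipping for noisier token groups."""
    return base_eps / (1.0 + scale * pvariance(ratios))

# Two token groups: one nearly on-policy, one with widely spread ratios.
stable_eps = adaptive_eps([1.0, 1.0, 1.0])
noisy_eps = adaptive_eps([0.5, 1.5])

# The same token is clipped differently depending on its group's range.
stable_obj = clipped_surrogate(1.5, 1.0, stable_eps, stable_eps)
noisy_obj = clipped_surrogate(1.5, 1.0, noisy_eps, noisy_eps)
```

The opposite direction (a wider range for high-variance groups) would be an equally plausible reading of the summary; only the per-group structure is taken from the source.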

6 · arXiv · cs.CL · 10d ago

GraphPO: Graph-based Policy Optimization reduces redundancy in LLM reasoning RL

GraphPO is a new reinforcement learning framework that represents reasoning rollouts as directed acyclic graphs rather than independent chains or trees, merging semantically equivalent reasoning paths into equivalence classes to share suffixes and reduce redundant exploration. The approach assigns efficiency advantages to incoming edges and correctness advantages to outgoing edges, deriving process supervision from outcome rewards. Experiments on three LLMs across reasoning and agentic search benchmarks show consistent improvements over chain- and tree-based baselines under equal token or response budgets. The method also provides theoretical guarantees on reduced advantage-estimation variance.

4 · arXiv · cs.AI · 13d ago

PCMA: Learning coordinated agent-specific preferences for multi-objective multi-agent RL

A new arXiv preprint introduces Preference Coordinated Multi-agent Policy Optimization (PCMA), a method for cooperative multi-objective multi-agent reinforcement learning (MOMARL) that learns agent-specific preferences to enable complementary trade-offs across agents. The authors formulate cooperative MOMARL as a team-optimal game and provide a first-order improvement decomposition showing that preference diversity can induce team improvement. Experiments on cooperative MOMA environments and a traffic-control scenario demonstrate improvements in both performance and trade-off coordination.

5 · arXiv · cs.CL · 23d ago

OneReason: Activating Chain-of-Thought Reasoning in Generative Recommendation Models

Researchers from the OneRec team introduce OneReason, a framework for enabling reasoning capabilities in generative recommendation models deployed across short-video, live-streaming, advertising, and e-commerce. The work identifies a key failure mode — that naive thinking-mode integration does not outperform non-thinking baselines — and diagnoses this as a deficit in two factors: itemic token perception and user behavior cognition. The proposed solution combines perception-focused pre-training, a three-level cognition-enhanced CoT format for supervised fine-tuning, and a specialize-then-unify RL training recipe.

6 · arXiv · cs.CL · 3d ago

OPERA: Perplexity-based RL alignment for open-ended reasoning tasks

OPERA (Objective Perplexity-based Reflective Alignment) proposes replacing LLM-as-a-judge reward models with intrinsic rewards derived from perplexity dynamics to stabilize RL training on open-ended tasks like creative writing. The method includes a cold-start data synthesis pipeline generating 20,000 reasoning trajectories using perplexity-prioritized rollouts. Applied to Qwen3-8B, OPERA claims state-of-the-art among open-source models on open-ended tasks, reportedly matching or exceeding Gemini 2.5 and MiniMax-M2.5 on some benchmarks.
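For readers unfamiliar with the underlying quantity, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of token log-probabilities. How OPERA turns perplexity dynamics into a reward is not described in the summary, so nothing here should be read as its reward function.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A sequence whose tokens each have probability 0.5 has perplexity 2.
uniform_half = perplexity([math.log(0.5)] * 4)

# More confident predictions give lower perplexity.
confident = perplexity([math.log(0.9)] * 4)
```

Because the value comes from the model's own token probabilities, it needs no external judge model, which matches the motivation stated in the summary.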