5 · arXiv cs.LG (Machine Learning) · 45h ago

Democratic ICAI uses structured persona debate to derive richer alignment steering principles

Researchers introduce Democratic ICAI, an extension of Inverse Constitutional AI (ICAI) that gathers multiple competing rationales through structured persona debate rather than single-pass explanations. The method derives natural-language steering principles from these richer signals and applies them via LLM-based and decision-tree judges. Experiments on creative preference benchmarks MuCE-Pref and LiTBench show improved preference prediction over deliberative prompting and principle-based baselines, with LLM annotators preferring the resulting constitutions. The work addresses a core limitation of pairwise preference labels — that they reveal final choices but not the underlying reasoning.
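As a rough illustration of the final step only (applying natural-language principles through a judge to predict a pairwise preference), here is a minimal sketch. It does not reproduce the paper's persona debate or principle derivation. The names `predict_preference` and `toy_judge`, the example principles, and the vote-summing rule are all assumptions made for illustration; a real system would call an LLM or a learned decision tree where the stub sits.

```python
from typing import Callable, List

# Hypothetical interface: a judge maps (principle, response_a, response_b)
# to +1 (favours A), -1 (favours B), or 0 (no opinion).
Judge = Callable[[str, str, str], int]


def predict_preference(principles: List[str], a: str, b: str, judge: Judge) -> str:
    """Sum per-principle votes into a single pairwise prediction."""
    total = sum(judge(p, a, b) for p in principles)
    if total > 0:
        return "A"
    if total < 0:
        return "B"
    return "tie"


def toy_judge(principle: str, a: str, b: str) -> int:
    """Stand-in for an LLM judge: a crude rule keyed on the principle text."""
    if "concise" in principle:
        # Shorter text wins under the conciseness principle.
        return (len(a) < len(b)) - (len(a) > len(b))
    if "imagery" in principle:
        words = ("moon", "river", "ember")
        score_a = sum(w in a.lower() for w in words)
        score_b = sum(w in b.lower() for w in words)
        return (score_a > score_b) - (score_a < score_b)
    return 0


# Illustrative principles and responses (not taken from the paper).
principles = ["Prefer concise writing.", "Prefer vivid imagery."]
a = "The river carried the moon downstream."
b = "It was a night and there was some water and also light on it."
print(predict_preference(principles, a, b, toy_judge))
```

Summing equal-weight votes is the simplest possible aggregation; a decision-tree judge would instead learn which principle votes to branch on.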

Evaluation and Benchmarking · Alignment and RLHF · Democratic ICAI: Debating Our Way to Steering Principles from Preferences · MuCE-Pref · Inverse Constitutional AI · LiTBench

Related guides (2)

Alignment and RLHF · Topic guide

Alignment and RLHF: Teaching AI Models to Behave


Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder


Related events (8)

5 · Hugging Face Blog · 1mo ago

Constitutional AI with Open LLMs

This Hugging Face blog post explores implementing Constitutional AI (CAI) techniques using open-weight language models. The post likely covers how to replicate Anthropic's CAI alignment methodology—using a set of principles to guide model self-critique and revision—without relying on proprietary systems. It represents a practical contribution to democratizing alignment research tooling.

Open Weights Progress · AI Safety Research · Constitutional AI · Hugging Face · Anthropic

4 · arXiv cs.CL · 22d ago

DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs

DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.

Evaluation and Benchmarking · DEFINED

5 · arXiv cs.CL · 27d ago

ACTS: Agentic Chain-of-Thought Steering for efficient and controllable LLM reasoning

Researchers introduce Agentic Chain-of-Thought Steering (ACTS), a framework that formulates inference-time reasoning control as a Markov decision process, where a controller agent adaptively steers a frozen reasoner by issuing reasoning strategy directives and steering phrases at each step. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized via reinforcement learning with budget-conditioned reward shaping. ACTS matches full-thinking performance with significant token savings and enables controllable accuracy-efficiency trade-offs across multiple benchmarks and reasoner models.
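The control loop described above can be sketched as a toy episode: a controller observes a state (remaining token budget plus the trace so far), issues a directive, and a frozen reasoner produces the next step. Everything below is an assumption for illustration, including the directive names, the token costs, and the hand-written policy; in the paper the controller is learned from synthetic trajectories and reinforcement learning, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical per-directive token costs for the stubbed reasoner.
COSTS = {"explore": 30, "verify": 20, "conclude": 10}


@dataclass
class State:
    budget: int                               # remaining token budget
    trace: List[str] = field(default_factory=list)


def controller(state: State) -> str:
    """Hand-written stand-in for the learned policy."""
    # Conclude once another full step plus a conclusion would not fit.
    if state.budget < COSTS["explore"] + COSTS["conclude"]:
        return "conclude"
    return "verify" if len(state.trace) % 2 else "explore"


def frozen_reasoner(directive: str) -> Tuple[str, int]:
    """Stub for the frozen reasoner: returns (step text, tokens used)."""
    return f"[{directive}] step", COSTS[directive]


def run_episode(budget: int) -> State:
    """Roll out one episode until the controller issues 'conclude'."""
    state = State(budget=budget)
    while True:
        action = controller(state)
        text, cost = frozen_reasoner(action)
        state.trace.append(text)
        state.budget -= cost
        if action == "conclude":
            return state


episode = run_episode(100)
print(episode.trace, episode.budget)
```

Because the policy reads the remaining budget, rerunning with a smaller budget yields a shorter trace, which is a crude analogue of the accuracy-efficiency trade-off the summary mentions.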

Inference Economics · Agent and Tool Ecosystem · ACTS · Agentic Chain-of-Thought Steering · Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

7 · OpenAI Blog · 1mo ago

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.

Frontier Model Releases · AI Safety Research · Reinforcement Learning from Human Feedback · OpenAI · deliberative alignment

6 · OpenAI Blog · 1mo ago

AI Safety via Debate

OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.

Evaluation and Benchmarking · AI Safety Research · AI Safety via Debate · Debate (AI safety technique) · OpenAI

5 · arXiv cs.AI · 1mo ago

Human Decision-Making with Persuasive and Narrative LLM Explanations

A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Results showed that persuasiveness level did not meaningfully improve decision accuracy over a simple AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased AI reliance regardless of whether the AI prediction was correct or incorrect, and more persuasive narratives may have slowed response times and reduced ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and warrant further investigation into when and how they should be deployed.

Evaluation and Benchmarking · AI Safety Research · Narrative Explanations · large language models · Explainable AI (XAI)

4 · OpenAI Blog · 1mo ago

Democratic Inputs to AI Grant Program: Lessons Learned and Implementation Plans

OpenAI summarizes outcomes from its Democratic Inputs to AI grant program, which funded 10 international teams to develop ideas and tools for collective governance of AI systems. The update outlines key innovations and learnings from the program and signals continued investment in participatory AI governance research. OpenAI is calling for researchers and engineers to join ongoing work in this area.

AI Safety Research · Regulatory Developments · Democratic Inputs to AI Grant Program · OpenAI

4 · arXiv cs.CL · 4d ago

Conceptual framework for analyzing dialogue dynamics in human-AI and multi-agent collaborative problem-solving

A new arXiv preprint proposes a hierarchical two-layer coding scheme for analyzing dialogue in collaborative problem-solving, integrating cognitive and metacognitive dimensions. The framework is validated across nine datasets spanning multiple domains and is positioned to apply to both human-AI and multi-agent collaboration contexts. A key finding is that metacognitive regulation is a strong discriminator of deeper collaboration quality.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts