Democratic ICAI uses structured persona debate to derive richer alignment steering principles
Researchers introduce Democratic ICAI, an extension of Inverse Constitutional AI (ICAI) that gathers multiple competing rationales through structured persona debate rather than single-pass explanations. The method derives natural-language steering principles from these richer signals and applies them via LLM-based and decision-tree judges. Experiments on the creative preference benchmarks MuCE-Pref and LiTBench show improved preference prediction over deliberative prompting and principle-based baselines, and LLM annotators prefer the resulting constitutions. The work addresses a core limitation of pairwise preference labels: they reveal the final choice but not the reasoning behind it.
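To make the idea of a decision-tree judge over principles more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each derived principle has already been turned into a per-pair verdict (+1 if the principle favors response A, -1 if it favors response B, 0 if it abstains), and it fits only a depth-1 tree (a decision stump) that follows the single most predictive principle. The function names and the toy data are hypothetical and chosen purely for illustration.

```python
from typing import List, Tuple

# One row per response pair, one verdict per principle:
# +1 = principle favors response A, -1 = favors B, 0 = abstains.
# (Assumption: in practice these verdicts would come from an LLM
# applying each principle; here they are hand-written toy values.)
Verdicts = List[int]


def fit_stump(rows: List[Verdicts], labels: List[int]) -> Tuple[int, float]:
    """Pick the single principle whose verdict best agrees with the labels.

    Labels use +1 (A preferred) and -1 (B preferred).
    Returns (principle_index, training_accuracy).
    """
    if not rows:
        raise ValueError("need at least one training pair")
    best_index, best_acc = 0, -1.0
    for j in range(len(rows[0])):
        # An abstention (0) never matches a label, so it counts as a miss.
        correct = sum(1 for row, y in zip(rows, labels) if row[j] == y)
        acc = correct / len(rows)
        if acc > best_acc:  # strict comparison keeps the earliest best principle
            best_index, best_acc = j, acc
    return best_index, best_acc


def predict(row: Verdicts, principle_index: int, default: int = 1) -> int:
    """Follow the chosen principle; fall back to `default` when it abstains."""
    verdict = row[principle_index]
    return verdict if verdict != 0 else default


# Toy data: four response pairs scored against three hypothetical principles.
rows = [[1, -1, 0], [1, 1, -1], [-1, -1, 1], [-1, 1, 0]]
labels = [1, 1, -1, -1]

index, accuracy = fit_stump(rows, labels)
print(index, accuracy)
```

A deeper tree would split on several principles in sequence; the stump is used here only because it keeps the link between a principle's verdict and the final preference easy to inspect.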
Related events (8)
Constitutional AI with Open LLMs
This Hugging Face blog post explores implementing Constitutional AI (CAI) techniques using open-weight language models. The post likely covers how to replicate Anthropic's CAI alignment methodology—using a set of principles to guide model self-critique and revision—without relying on proprietary systems. It represents a practical contribution to democratizing alignment research tooling.
DEFINED: Data-efficient framework for fine-grained creativity assessment in debate using LLMs
DEFINED is a computational framework for automated creativity assessment in debate scenarios, operationalizing creativity through an eight-dimensional hierarchical metric system implemented via a pretrained autoregressive language model with a hierarchical scoring head. The system addresses data scarcity through constrained data augmentation and mixed-granularity training from limited expert-annotated data. It outperforms prompt-based LLM evaluators and existing debate scoring methods on authentic competition data. The work is relevant to AI evaluation methodology and the broader question of whether LLMs can reliably assess complex human cognitive outputs.
ACTS: Agentic Chain-of-Thought Steering for efficient and controllable LLM reasoning
Researchers introduce Agentic Chain-of-Thought Steering (ACTS), a framework that formulates inference-time reasoning control as a Markov decision process, where a controller agent adaptively steers a frozen reasoner by issuing reasoning strategy directives and steering phrases at each step. The controller is initialized from synthetic steering trajectories with multi-budget augmentation and further optimized via reinforcement learning with budget-conditioned reward shaping. ACTS matches full-thinking performance with significant token savings and enables controllable accuracy-efficiency trade-offs across multiple benchmarks and reasoner models.
Deliberative Alignment: Reasoning Enables Safer Language Models
OpenAI introduces deliberative alignment, a new alignment strategy applied to o1 models in which the model is directly taught safety specifications and trained to reason over them at inference time. Unlike prior approaches that embed safety implicitly through RLHF, this method makes safety reasoning explicit and inspectable. The announcement positions deliberative alignment as a meaningful advance in scalable oversight and safe deployment of frontier reasoning models.
AI Safety via Debate
OpenAI proposes a safety technique in which two AI agents debate a topic and a human judge determines the winner, with the goal of making it easier for humans to supervise AI systems that may be more capable than themselves. The core intuition is that it is easier to verify a correct argument than to generate one, so a dishonest agent can be caught by an honest opponent. The paper introduces debate as a scalable oversight mechanism applicable to complex tasks where direct human evaluation is infeasible.
Human Decision-Making with Persuasive and Narrative LLM Explanations
A large-scale behavioral experiment evaluated how LLM-generated narrative explanations of varying persuasiveness affect human decision-making accuracy in classification tasks. Whatever their persuasiveness level, the narratives did not meaningfully improve decision accuracy over showing the AI prediction alone, consistent with prior explainable AI research using feature importance methods. Narratives increased reliance on the AI regardless of whether its prediction was correct, and more persuasive narratives may have slowed response times and reduced participants' ability to discriminate correct from incorrect AI predictions. The study concludes that narrative explanations involve tradeoffs and that when and how to deploy them warrants further investigation.
Democratic Inputs to AI Grant Program: Lessons Learned and Implementation Plans
OpenAI summarizes outcomes from its Democratic Inputs to AI grant program, which funded 10 international teams to develop ideas and tools for collective governance of AI systems. The update outlines key innovations and learnings from the program and signals continued investment in participatory AI governance research. OpenAI is calling for researchers and engineers to join ongoing work in this area.
Conceptual framework for analyzing dialogue dynamics in human-AI and multi-agent collaborative problem-solving
A new arXiv preprint proposes a hierarchical two-layer coding scheme for analyzing dialogue in collaborative problem-solving, integrating cognitive and metacognitive dimensions. The framework is validated across nine datasets spanning multiple domains and is positioned to apply to both human-AI and multi-agent collaboration contexts. A key finding is that metacognitive regulation is a strong discriminator of deeper collaboration quality.

