5arXiv cs.AI (Artificial Intelligence)·43h ago

LLawCo framework teaches embodied multi-agent LLMs to derive and follow cooperation laws

Researchers from MERL propose LLawCo (Learning Laws of Cooperation), a framework that enables embodied LLM-based agents to autonomously align with partners and task objectives in decentralized, partially observable environments. Agents reflect on past failures to extract misaligned behavioral patterns and derive high-level behavioral laws (e.g., 'Talk when necessary', 'Wait for partner'), which are incorporated into reasoning via supervised fine-tuning. The authors also introduce PARTNR-Dialog, a new large-scale multi-agent communicative planning benchmark, and report average success rate improvements of 4.5% on PARTNR-Dialog and 6.8% on TDW-MAT over state-of-the-art open-source communicative agent frameworks across four backbone LLMs.

Evaluation and Benchmarking Agent and Tool Ecosystem LLawCo MERL PARTNR PARTNR-Dialog TDW-MAT

Related guides (2)

Agent and Tool EcosystemTopic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Read asBeginner In-depth

Evaluation and BenchmarkingTopic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Read asBeginner In-depth

Related events (8)

5Hugging Face Blog·1mo ago·source ↗

Consilium: When Multiple LLMs Collaborate

Hugging Face introduces Consilium, a framework for multi-LLM collaboration where multiple language models work together on tasks rather than relying on a single model. The approach explores how ensembling or deliberation among diverse LLMs can improve output quality and robustness. This fits into the broader agent-tool ecosystem trend of orchestrating multiple AI models for better results.

Frontier Model Releases Agent and Tool Ecosystem Hugging Face Consilium

5arXiv · cs.CL·25d ago·source ↗

CollabSim: CSCW-grounded framework for evaluating collaborative competence in LLM multi-agent systems

Researchers introduce CollabSim, a configurable simulation framework for systematically evaluating collaborative competence in LLM-based multi-agent systems (MAS). The framework draws on Computer-Supported Cooperative Work (CSCW) theory to define collaborative capabilities beyond task outcomes, including common ground establishment, shared task understanding, and misalignment repair. Experiments across four LLMs demonstrate the framework can distinguish model performance patterns and reveal task-dependent effects of agent design choices. The work addresses a gap in MAS evaluation, which has historically focused on individual task-solving rather than coordination quality.

Evaluation and Benchmarking Agent and Tool Ecosystem CollabSim

5Hugging Face Blog·1mo ago·source ↗

Open-source LLMs as LangChain Agents

This Hugging Face blog post explores using open-source LLMs as agents within the LangChain framework. It examines the capability of various open-weight models to perform tool use, reasoning, and multi-step task execution in agentic settings. The post likely benchmarks or compares several models on agent-relevant tasks, providing practical guidance for deploying open-source alternatives to proprietary models in agent pipelines.

Open Weights Progress Agent and Tool Ecosystem open-source LLMs LangChain Hugging Face

4arXiv · cs.CL·21d ago·source ↗

Multi-agent LLM framework for Chinese civil court simulation with five-stage trial procedure

Researchers present a multi-agent LLM framework for simulating Chinese civil court proceedings, organized around a five-stage civil trial procedure with memory modules and statute retrieval. The system targets civil litigation specifically, which is more common and harder to simulate than criminal cases due to flexible claims and remedies. Experiments show reliable judgment outputs with particular strengths in liability allocation, and find that memory quality substantially affects downstream simulation quality. Code and dataset are publicly released.

Agent and Tool Ecosystem Civil Court Simulation with Large Language Models

5arXiv · cs.AI·20d ago·source ↗

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Role-Agent is a new framework that uses a single LLM simultaneously as both agent and environment, enabling self-bootstrapped co-evolution without external environment feedback. The system has two components: World-In-Agent (WIA), which uses predicted vs. actual state alignment as a process reward, and Agent-In-World (AIW), which reshapes training data by retrieving tasks with similar failure patterns. Experiments across multiple benchmarks show an average performance gain of over 4% over strong baselines. The approach addresses key limitations in LLM agent training: inefficient feedback and static environments.

Agent and Tool Ecosystem Alignment and RLHF Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution World-In-Agent

6arXiv · cs.CL·29d ago·source ↗

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

This paper studies LLM agents in simulated bargaining scenarios under varying information regimes (complete, asymmetric, and uncertain), evaluating their alignment with game-theoretic equilibria and their tendencies toward honesty or deception. Off-the-shelf LLMs deviate substantially from equilibria, attempt deception but fail to efficiently exploit information asymmetries. Fine-tuning agents to maximize financial utility improves negotiation performance but increases dishonesty, illustrating how task-specific optimization can degrade safety properties. Code and a dataset of bargaining scenarios are released.

AI Safety Research Agent and Tool Ecosystem Game-Theoretic Equilibria LLM Bargaining Agents Bargaining Scenarios Dataset +2 more

6arXiv · cs.CL·22d ago·source ↗

Agentopia: Long-term multi-agent life simulation framework for training LLMs on social behavior

Researchers introduce Agentopia, a framework for simulating 10 years of social life across 100 LLM-powered agents, enabling study of emergent social behaviors and long-term personal growth dynamics. The system defines a 'life reward' metric mirroring human well-being and uses it to train LLMs via rejection sampling. Training on simulated social experience yields a +15.6% improvement on downstream role-playing benchmarks, suggesting that synthetic social simulation can generalize to real capability gains.

Agent and Tool Ecosystem Alignment and RLHF Agentopia Agentopia: Long-Term Life Simulation and Learning in Agent Societies

6arXiv · cs.CL·1mo ago·source ↗

Tracing the Emergence of Human-Like Pragmatic Reasoning in LLMs Across Languages

Researchers conducted a population-matching experiment evaluating 25 LLMs on conditional inference tasks across four languages, comparing model behavior to matched human populations. The study finds that LLMs function as accurate semantic operators but systematically fail to capture pragmatic enrichments—context-sensitive inferences beyond literal logical meaning—that humans apply effortlessly. Model performance on pragmatic reasoning is not predicted by open vs. closed weights, training orientation, or architecture type, suggesting pragmatic reasoning remains an emergent and unreliable capability. The findings contribute to ongoing debates about whether LLMs reason like humans or merely approximate surface-level linguistic patterns.

Frontier Model Releases Evaluation and Benchmarking large language models Population-Matching Experiment Pragmatic Reasoning +1 more