4 · arXiv cs.CL (Computation and Language) · 17d ago

Training-free mixture-of-agents framework combines LLMs and knowledge graphs for multi-document summarization

A new arXiv preprint proposes a training-free multi-agent framework for multi-document summarization (MDS) that decomposes the task into specialized agents for extractive selection, knowledge-aware abstraction, and iterative refinement, unified via a multi-perspective consistency mechanism. The system integrates LLMs with knowledge graphs without task-specific fine-tuning. Experiments across four datasets in English and Vietnamese show state-of-the-art or competitive performance, with the authors emphasizing cross-domain and cross-lingual generalization.

Evaluation and Benchmarking · Agent and Tool Ecosystem · A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Related guides (2)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Evaluation and Benchmarking · Topic guide

Evaluation and Benchmarking: How We Measure AI — and Why It Keeps Getting Harder

Related events (8)

6 · arXiv · cs.AI · 8d ago

Agents-K1: End-to-end knowledge orchestration pipeline for agent-native scientific knowledge graphs

Agents-K1 is a new pipeline that converts raw scientific documents into structured knowledge graphs for use by LLM-based research agents, addressing the gap where existing systems reduce papers to abstracts and flat citation edges. The system integrates a multimodal parser, a 4B information-extraction model trained with GRPO, and a tri-source agent interface combining web search, graph retrieval, and cross-document traversal. The authors process 2.46 million scientific papers to produce Scholar-KG, releasing a one-million-paper subset. Experiments show improvements in scientific information extraction, knowledge graph construction, and multi-hop reasoning.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GRPO · Agents-K1 · Scholar-KG

6 · arXiv · cs.LG · 22d ago

LLMSurgeon: Post-Hoc Auditing of LLM Pretraining Data Mixtures

LLMSurgeon formalizes Data Mixture Surgery (DMS), a framework for estimating the domain-level distribution of an LLM's pretraining corpus using only generated text from the target model. The method casts DMS as an inverse problem under the label-shift assumption, using a calibrated soft confusion matrix to correct domain confusion and recover the latent mixture prior. The authors also introduce LLMScan, a verifiable evaluation suite built from open-source LLMs with known pretraining mixtures, on which LLMSurgeon demonstrates high-fidelity recovery of domain compositions without access to training data.

Frontier Model Releases · Evaluation and Benchmarking · LLMSurgeon · LLMScan · Data Mixture Surgery
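
The inverse problem described in this summary can be illustrated with a generic label-shift estimate. The sketch below is not LLMSurgeon's implementation; it only assumes what the summary states: a domain classifier with a known soft confusion matrix is applied to generated text, and the hidden domain mixture is recovered from the observed predicted-domain frequencies. The function name `estimate_mixture`, the three-domain matrix, and the EM-style update are illustrative assumptions.

```python
from typing import List


def estimate_mixture(confusion: List[List[float]],
                     predicted_freq: List[float],
                     iters: int = 2000) -> List[float]:
    """Estimate domain priors p such that predicted_freq[j] ~= sum_i p[i] * confusion[i][j].

    confusion[i][j]: probability that text from true domain i is classified as
                     domain j (each row sums to 1; assumed calibrated beforehand).
    predicted_freq[j]: share of generated samples the classifier assigns to domain j.
    """
    k = len(confusion)
    prior = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iters):
        # Predicted-domain distribution implied by the current prior.
        implied = [sum(prior[i] * confusion[i][j] for i in range(k)) for j in range(k)]
        # EM update: share each observed label mass among the domains
        # that could have produced it, in proportion to the current prior.
        prior = [
            prior[i] * sum(
                predicted_freq[j] * confusion[i][j] / implied[j]
                for j in range(k) if implied[j] > 0
            )
            for i in range(k)
        ]
    return prior


# Hypothetical 3-domain example: simulate what the classifier would observe.
confusion = [[0.8, 0.1, 0.1],
             [0.2, 0.7, 0.1],
             [0.1, 0.2, 0.7]]
true_prior = [0.5, 0.3, 0.2]
observed = [sum(true_prior[i] * confusion[i][j] for i in range(3)) for j in range(3)]
estimate = estimate_mixture(confusion, observed)
print([round(p, 4) for p in estimate])
```

The multiplicative update keeps the estimate non-negative and summing to one, which a plain matrix inversion does not guarantee when the observed frequencies are noisy.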

4 · arXiv · cs.CL · 1mo ago

MA²P: A Meta-Cognitive Multi-Agent Framework for Complex Persuasion

The paper introduces MA²P, a multi-agent framework designed for complex persuasion tasks where the persuadee's internal states are latent. The system coordinates perception management, mental-state inference, strategy execution, memory, and evaluation modules, and adds a meta-cognitive configurator that selects domain-appropriate strategies from a structured knowledge base to reduce cross-domain performance variance. Experiments show higher persuasion success rates compared to baselines. The work addresses a known weakness of LLMs in producing generic or weakly grounded persuasive responses.

Agent and Tool Ecosystem · Alignment and RLHF · large language models · meta-cognitive configurator · MA²P

5 · arXiv · cs.AI · 26d ago

Adversarial Subspace Alignment for Robust Multimodal Knowledge Editing in MLLMs

This paper addresses the generalization gap in multimodal large language model (MLLM) knowledge editing, where edits fail to propagate across semantically equivalent visual and linguistic variations. The authors introduce Latent Adversarial Robustification (LAR), which generates adversarial but semantically coherent variants in joint latent space, and Rank-Constrained Subspace Learning (RCSL), which enforces low-rank alignment of adversarial representations at the edit layer. Together these form the ASAM framework, which formalizes robustness via knowledge units grouping semantically equivalent multimodal inputs. Empirical analysis demonstrates improved generality without sacrificing reliability or locality.

Alignment and RLHF · Multimodal Progress · Multimodal Large Language Models · Latent Adversarial Robustification (LAR) · knowledge editing

5 · Hugging Face Blog · 1mo ago

Jupyter Agents: Training LLMs to Reason with Notebooks

Hugging Face published a blog post on training LLMs to operate as Jupyter notebook agents, enabling models to reason and execute code iteratively within notebook environments. The work covers dataset construction, training methodology, and evaluation for notebook-native agentic behavior. This represents a step toward LLMs that can conduct multi-step data analysis and experimentation autonomously within a familiar scientific computing interface.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Hugging Face · Jupyter Notebook · Jupyter Agent

5 · Hugging Face Blog · 1mo ago

Consilium: When Multiple LLMs Collaborate

Hugging Face introduces Consilium, a framework for multi-LLM collaboration where multiple language models work together on tasks rather than relying on a single model. The approach explores how ensembling or deliberation among diverse LLMs can improve output quality and robustness. This fits into the broader agent-tool ecosystem trend of orchestrating multiple AI models for better results.

Frontier Model Releases · Agent and Tool Ecosystem · Hugging Face · Consilium

6 · OpenAI Blog · 1mo ago

Learning to Summarize with Human Feedback

OpenAI published research applying reinforcement learning from human feedback (RLHF) to train language models for improved summarization quality. The work demonstrated that models trained with human preference signals outperform those trained purely on supervised objectives for summarization tasks. This paper is an early foundational contribution to the RLHF methodology that later became central to aligning large language models.

Evaluation and Benchmarking · Alignment and RLHF · Reinforcement Learning from Human Feedback · OpenAI · Learning to Summarize with Human Feedback
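
The summary does not spell out how human preference signals enter training. One common way, sketched here as an illustration rather than as a reproduction of OpenAI's code, is a pairwise loss on a reward model: given two summaries of the same text, the loss is small when the human-preferred one receives the higher scalar reward. The function name and the example reward values are hypothetical.

```python
import math


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for a (preferred, rejected) summary pair.
loss_tie = pairwise_preference_loss(0.0, 0.0)       # model cannot tell them apart
loss_agree = pairwise_preference_loss(2.0, 0.0)     # model agrees with the human label
loss_disagree = pairwise_preference_loss(0.0, 2.0)  # model contradicts the human label
print(loss_agree, loss_tie, loss_disagree)
```

The loss decreases as the reward margin in favor of the preferred summary grows, so minimizing it pushes the reward model toward the human ranking. The learned reward can then serve as the optimization target for the summarization policy.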

6 · arXiv · cs.CL · 1mo ago

Mem-π: Adaptive Memory for LLM Agents via On-Demand Generation and Decoupled RL

Mem-π introduces a framework where a dedicated language or vision-language model generates context-specific guidance for LLM agents on demand, rather than retrieving static entries from episodic memory banks. The system is trained with a decision-content decoupled reinforcement learning objective that jointly learns when to generate guidance and what to generate, enabling abstention when generation would not help. Evaluated across web navigation, terminal-based tool use, and text-based embodied interaction benchmarks, Mem-π achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.

Evaluation and Benchmarking · Agent and Tool Ecosystem · web navigation benchmark · Mem-π · large language model agents