5 · arXiv cs.CL (Computation and Language) · 17h ago

Multi-agent system using open-source LLMs outperforms GPT-4 on disinformation detection

A new arXiv preprint proposes a multi-agent system for automated disinformation detection that emulates human annotator decision-making through consensus mechanisms, cognitive diversity, and hierarchical structure. The system uses open-source models (LLaMA, Kimi, Qwen, DeepSeek, LLaMA-Nemotron) and is evaluated on English, Polish, Slovak, and Bulgarian datasets across three fact-checking tasks. The authors report that it outperforms individual LLMs, including GPT-4 and GPT-3.5, and argue that building on open-weight models makes the system more transparent.
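The summary does not say how the consensus and hierarchy are implemented. As a rough illustration only, one plausible reading is a majority vote among annotator agents with escalation to a senior agent when agreement is weak. Everything named below (`consensus_label`, `min_agreement`, the stub agents) is hypothetical and not taken from the preprint; real agents would wrap LLM calls.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical agent type: takes a claim, returns a label such as "true" or "false".
# In a real system each agent would wrap a call to a different open-weight model.
Agent = Callable[[str], str]


def consensus_label(
    claim: str,
    annotators: Sequence[Agent],
    adjudicator: Agent,
    min_agreement: float = 0.6,  # assumed threshold, not from the preprint
) -> str:
    """Return the majority label, or defer to the adjudicator if agreement is weak."""
    votes = [agent(claim) for agent in annotators]
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    # Hierarchical step: a senior agent resolves the disagreement.
    return adjudicator(claim)
```

Using different models as annotators is one way to get the "cognitive diversity" the summary mentions; the threshold controls how often the senior agent is consulted.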

Open Weights Progress Agent and Tool Ecosystem Llama Nemotron Kimi DeepSeek V4 Multi-Agentic System Leveraging Open-Source LLMs to Mitigate Disinformation Threats GPT-3.5 Qwen Llama GPT-4

Related guides (3)

Qwen

Qwen: Alibaba's Open-Weight AI Lab Pushing the Frontier


DeepSeek V4

DeepSeek V4: Open-Weights Frontier AI at a Fraction of the Price


Open Weights Progress · Topic guide

Open Weights Progress: How Free AI Models Caught Up to the Frontier


Related events (8)

5 · OpenAI Blog · 1mo ago

OpenAI, Georgetown CSET, and Stanford Internet Observatory Publish LLM Disinformation Misuse Report

OpenAI researchers collaborated with Georgetown University's Center for Security and Emerging Technology (CSET) and Stanford Internet Observatory to produce a report on how large language models could be misused to augment disinformation campaigns. The work draws on an October 2021 workshop with 30 experts across disinformation research, ML, and policy, plus over a year of additional research. The report outlines threat models for LLM-enabled disinformation and proposes a framework for analyzing potential mitigations.

AI Safety Research Regulatory Developments large language models Stanford Internet Observatory Georgetown University Center for Security and Emerging Technology +2 more

7 · OpenAI Blog · 1mo ago

Building an Early Warning System for LLM-Aided Biological Threat Creation

OpenAI published a blueprint for evaluating whether LLMs can meaningfully assist in biological threat creation. In a controlled study with biology experts and students, GPT-4 was found to provide at most mild uplift in biological threat creation accuracy. The results are inconclusive but are framed as a starting point for ongoing safety research and community deliberation on biosecurity risks from AI.

Evaluation and Benchmarking AI Safety Research biological threat creation evaluation OpenAI GPT-4

5 · Hacker News · 1mo ago

Disagreement among frontier LLMs on real-world fact-checks

A study examines how frontier large language models diverge in their responses to real-world fact-checking queries, surfacing systematic disagreements across models on factual claims. The work appears to benchmark multiple leading models against a set of verifiable facts, revealing inconsistencies that have implications for reliability and deployment. With 475 HN points and 333 comments, the piece has generated substantial community discussion. The findings are relevant to evaluation methodology, model calibration, and trust in AI-generated factual content.

Frontier Model Releases Evaluation and Benchmarking frontier LLMs lenz.io Hacker News

7 · The Batch · 29d ago

GPT-5.5 Outperforms Benchmarks but Leads in Hallucination Rate; Kimi K2.6 Tops Open LLMs

GPT-5.5, OpenAI's latest closed vision-language model built for agentic coding and computer use, tops the Artificial Analysis Intelligence Index and ARC-AGI-2 benchmarks but exhibits a significantly higher hallucination rate (85.53%) compared to Claude Opus 4.7 (36.18%) and Gemini 3.1 Pro Preview (49.87%) on the AA-Omniscience benchmark. GPT-5.5 Pro processes reasoning tokens in parallel during inference, and pricing is roughly double GPT-5.4 rates. The model ranks lower on subjective Arena.ai leaderboards, where Claude Opus models dominate. The issue also notes Kimi K2.6 leading open-weight LLMs, though details on that item are truncated.

Frontier Model Releases Evaluation and Benchmarking DeepLearning.AI Artificial Analysis Intelligence Index Tau2-bench Telecom +17 more

5 · arXiv cs.LG · 13d ago

ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues

ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.

Evaluation and Benchmarking Agent and Tool Ecosystem OpenAI ReproRepo Codex +1 more

6 · arXiv cs.CL · 17h ago

Attractor states emerge in multi-turn LLM conversations, with asymmetric model influence

A new arXiv preprint studies long-run dynamics in multi-agent LLM conversations across 7 models and 20 controversial topics, finding that self-play trajectories form model-specific attractor states that asymmetrically influence conversation partners in mixed-play debates. Claude Haiku is identified as a strong attractor that pulls other models toward its stylistic traits (e.g., metacommentary), while GPT-4.1 nano is found to be especially malleable. The results suggest open-ended LLM interactions are partially predictable from model-specific attractors, with implications for designing and monitoring autonomous agentic systems.

Evaluation and Benchmarking AI Safety Research Attractor States Emerge in Multi-Turn LLM Conversations GPT-4.1 nano Claude Haiku 4.5 +3 more

5 · arXiv cs.CL · 1mo ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
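For readers unfamiliar with how vote counts become training data, here is a minimal sketch of turning scored replies into preference pairs. It is not LLUMI's actual pipeline: the `Reply` type, the `min_gap` filter, and the prompt/chosen/rejected record layout (a common convention for preference data) are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Reply:
    prompt: str  # the post being answered
    text: str    # the reply body
    score: int   # assumed to be upvotes minus downvotes


def build_preference_pairs(replies: list[Reply], min_gap: int = 5) -> list[dict]:
    """Pair replies to the same prompt; the higher-scored reply is 'chosen'."""
    by_prompt: dict[str, list[Reply]] = {}
    for reply in replies:
        by_prompt.setdefault(reply.prompt, []).append(reply)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            # Skip near-ties: a small score gap is a weak preference signal.
            if abs(a.score - b.score) < min_gap:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append(
                {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
            )
    return pairs
```

The gap filter is a design choice: without it, replies separated by a single vote would be treated as clear preferences, which adds noise to DPO-style training.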

Open Weights Progress AI Safety Research Reddit LLUMI Direct Preference Optimization (DPO) +3 more

4 · arXiv cs.CL · 27d ago

Training-free mixture-of-agents framework combines LLMs and knowledge graphs for multi-document summarization

A new arXiv preprint proposes a training-free multi-agent framework for multi-document summarization (MDS) that decomposes the task into specialized agents for extractive selection, knowledge-aware abstraction, and iterative refinement, unified via a multi-perspective consistency mechanism. The system integrates LLMs with knowledge graphs without task-specific fine-tuning. Experiments across four datasets in English and Vietnamese show state-of-the-art or competitive performance, with the authors emphasizing cross-domain and cross-lingual generalization.

Evaluation and Benchmarking Agent and Tool Ecosystem A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs