benchmark

Triadic Werewolf

benchmark · active · provisional · triadic-werewolf-be52697e · 1 event · first seen 15h ago

Aliases: Triadic Werewolf

Co-occurring entities

Llama 3.1 70B · DeepSeek V4 · OpenAI · GPT-4.1 · Meta

More like this (12)

Tri-System Theory · TripletQL · Qwen3-VL-Thinking · Frank-Wolfe Optimization · Semantic Triplet Restoration · Iterated Prisoner's Dilemma · RAG Triad · WY-form triangular chunk solver · Qwen3 · L3Cube · LAMBDA · Kuhn poker

Recent events (1)

5 · arXiv · cs.CL · 15h ago

Triadic Werewolf benchmark exposes multi-hop Theory of Mind failures in LLMs

Researchers introduce a Werewolf game variant with a third faction, the Jester, whose inverted utility function (it wins by being voted out) requires models to reason across three opposing incentive structures simultaneously. Across 60 games, GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B all struggle: Werewolves never exceed a 20% win rate, and GPT-4.1 Werewolves vote out the Jester in 60-70% of games, a self-defeating action because elimination is exactly how the Jester wins. Only DeepSeek-V3.1 learns the nuanced strategy of appearing suspicious without appearing intentionally suspicious, and it benefits most from self-learning. The work argues that dyadic social-deduction benchmarks systematically underestimate the difficulty of multi-agent Theory of Mind.
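
The three-way incentive structure can be illustrated with a minimal sketch of a win check after a day vote. Only the Jester rule (winning by being voted out) comes from the summary above. The Villager and Werewolf conditions are the conventional Werewolf rules and are assumed here, as is counting the Jester among the non-wolves. The names Role and winner_after_vote are hypothetical and are not taken from the benchmark's code.

```python
from __future__ import annotations

from enum import Enum


class Role(Enum):
    VILLAGER = "villager"
    WEREWOLF = "werewolf"
    JESTER = "jester"


def winner_after_vote(alive: dict[str, Role], voted_out: str) -> Role | None:
    """Return the winning faction after a day vote, or None if play continues."""
    # Inverted utility (from the summary): the Jester wins by being voted out.
    if alive[voted_out] is Role.JESTER:
        return Role.JESTER

    remaining = [role for name, role in alive.items() if name != voted_out]
    wolves = sum(role is Role.WEREWOLF for role in remaining)
    others = len(remaining) - wolves

    # ASSUMPTION: conventional Werewolf win conditions, not confirmed by the source.
    if wolves == 0:
        return Role.VILLAGER
    if wolves >= others:
        return Role.WEREWOLF
    return None


# Hypothetical five-player table used only for illustration.
players = {
    "p1": Role.WEREWOLF,
    "p2": Role.JESTER,
    "p3": Role.VILLAGER,
    "p4": Role.VILLAGER,
    "p5": Role.VILLAGER,
}
print(winner_after_vote(players, "p2"))  # voting out the Jester ends the game in its favor
```

Under these assumptions, a Werewolf that joins a vote against the Jester hands the game to a third party, which is why the summary calls that action self-defeating: a model has to track the Villagers' goal, its own goal, and the Jester's inverted goal at once.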

Evaluation and Benchmarking · Agent and Tool Ecosystem · Llama 3.1 70B · Triadic Werewolf · DeepSeek V4