Entity · benchmark

AppWorld

benchmark · active · appworld-d2dae6ec · 2 events · first seen Jun 10, 2026

Aliases: AppWorld

Co-occurring entities

RAM+ · When Model Merging Rivals Joint Multi-Task Reinforcement Learning: A Task-Vector Geometry Analysis · TIES · LOOP · Qwen3-4B · BFCL-V3 · Knowledge-Augmented Tool Execution

More like this (12)

iOSWorld · DevicesWorld · OSWorld · APPS · MobileWorld · Apps SDK · OSWorld-Verified · AgentMob · Gemini App · ChatGPT App Directory · Sora app · SpatialWorld

Recent events (2)

6 · arXiv · cs.AI · Jul 20, 2026

Model merging matches joint multi-task RL training on AppWorld benchmark, explained by near-orthogonal task vectors

A new arXiv paper provides the first direct comparison of model merging versus joint multi-task reinforcement learning training, using Qwen3-8B specialists trained on the AppWorld agent benchmark with the LOOP algorithm. Merging methods (TIES, RAM+) statistically match jointly trained models on task-goal completion. The authors explain this via task vector geometry: specialist task vectors are near-orthogonal (cosine similarity 0.06–0.10) despite ~65% parameter support overlap, causing sign- and support-based merging methods to collapse to near-uniform averaging.

Evaluation and Benchmarking · Agent and Tool Ecosystem · RAM+ · When Model Merging Rivals Joint Multi-Task Reinforcement Learning: A Task-Vector Geometry Analysis · TIES · +4 more
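
The task-vector geometry described in the summary above can be illustrated with a small sketch. It assumes a task vector is the element-wise difference between a specialist's weights and the shared base weights, and it measures "support overlap" as the Jaccard overlap of the non-negligible entries; the paper's exact definition may differ. All function names and the toy numbers are hypothetical, and uniform averaging stands in for the limit that the summary says TIES and RAM+ approach, not for those methods themselves.

```python
import math


def task_vector(base, finetuned):
    """Element-wise difference: fine-tuned weights minus base weights."""
    return [f - b for b, f in zip(base, finetuned)]


def cosine_similarity(u, v):
    """Cosine of the angle between two task vectors (near 0 means near-orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def support_overlap(u, v, eps=1e-8):
    """Jaccard overlap of the index sets where each vector is non-negligible.

    Assumption: the source does not say how overlap is measured.
    """
    support_u = {i for i, a in enumerate(u) if abs(a) > eps}
    support_v = {i for i, b in enumerate(v) if abs(b) > eps}
    return len(support_u & support_v) / len(support_u | support_v)


def average_merge(base, task_vectors):
    """Uniform averaging: add the mean task vector back onto the base weights."""
    count = len(task_vectors)
    return [
        b + sum(tv[i] for tv in task_vectors) / count
        for i, b in enumerate(base)
    ]


if __name__ == "__main__":
    # Toy 4-parameter "models"; real task vectors have billions of entries.
    base = [0.5, 0.5, 0.5, 0.5]
    tv_a = task_vector(base, [1.5, -0.5, 1.5, 0.5])
    tv_b = task_vector(base, [1.5, 1.5, 0.5, 1.5])
    print("cosine similarity:", cosine_similarity(tv_a, tv_b))
    print("support overlap:", support_overlap(tv_a, tv_b))
    print("merged weights:", average_merge(base, [tv_a, tv_b]))
```

The toy vectors share half of their non-zero positions yet have zero cosine similarity, which mirrors the reported pattern of substantial support overlap alongside near-orthogonal directions.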

6 · arXiv · cs.CL · Jun 10, 2026

KATE framework improves LLM tool calling via experiential knowledge integration and parallel reasoning

Researchers present KATE (Knowledge-Augmented Tool Execution), a framework addressing LLM failures in multi-step tool use by systematically studying knowledge acquisition, activation, and internalization. Key findings include that instance-level experiential knowledge outperforms abstract intent-level knowledge, that expanding reasoning width via parallel sampling with aggregation beats deeper chain-of-thought, and that reinforcement learning outperforms supervised fine-tuning for knowledge internalization. KATE is evaluated on BFCL-V3 and AppWorld benchmarks, showing consistent improvements over strong baselines across model scales.

Evaluation and Benchmarking · Agent and Tool Ecosystem · BFCL-V3 · AppWorld · Knowledge-Augmented Tool Execution · +1 more
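
The idea of expanding reasoning width through parallel sampling with aggregation, mentioned in the KATE summary above, can be sketched roughly as follows. The summary does not say how KATE aggregates candidates, so majority voting is used here purely as an assumed stand-in. The names `sample_in_parallel`, `aggregate_by_majority`, and `stub_sampler` are hypothetical, and the stub replaces a real model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_in_parallel(sampler, prompt, width):
    """Draw `width` independent candidates concurrently, one per seed."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        # pool.map preserves the order of the seeds in its results.
        return list(pool.map(lambda seed: sampler(prompt, seed), range(width)))


def aggregate_by_majority(candidates):
    """Return the most frequent candidate (assumed aggregation rule)."""
    return Counter(candidates).most_common(1)[0][0]


def stub_sampler(prompt, seed):
    """Hypothetical stand-in for an LLM call that proposes a tool call."""
    proposals = [
        "get_weather(city='Paris')",
        "get_weather(city='Paris')",
        "search(q='Paris weather')",
    ]
    return proposals[seed % len(proposals)]


if __name__ == "__main__":
    candidates = sample_in_parallel(stub_sampler, "What is the weather in Paris?", width=5)
    print("candidates:", candidates)
    print("selected:", aggregate_by_majority(candidates))
```

Sampling wider rather than reasoning deeper keeps each candidate short and independent, so a single faulty chain is less likely to determine the final tool call.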