KATE framework improves LLM tool calling via experiential knowledge integration and parallel reasoning
Researchers present KATE (Knowledge-Augmented Tool Execution), a framework addressing LLM failures in multi-step tool use by systematically studying knowledge acquisition, activation, and internalization. Key findings include that instance-level experiential knowledge outperforms abstract intent-level knowledge, that expanding reasoning width via parallel sampling with aggregation beats deeper chain-of-thought, and that reinforcement learning outperforms supervised fine-tuning for knowledge internalization. KATE is evaluated on BFCL-V3 and AppWorld benchmarks, showing consistent improvements over strong baselines across model scales.
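The "reasoning width" finding can be pictured as sampling several candidate tool calls in parallel and then aggregating them. The sketch below is a minimal illustration only: the summary does not say how KATE aggregates samples, so majority voting is an assumption here, and `aggregate_tool_calls`, `sample_wide`, and the `sampler` callback are hypothetical names, not part of the framework.

```python
import json
import random
from collections import Counter
from typing import Callable


def aggregate_tool_calls(candidates: list[dict]) -> dict:
    """Return the most frequent candidate tool call (majority vote).

    Majority voting is an assumed aggregation rule, not KATE's documented one.
    """
    # Serialize with sorted keys so that key order does not split the vote.
    keys = [json.dumps(c, sort_keys=True) for c in candidates]
    winner, _count = Counter(keys).most_common(1)[0]
    return json.loads(winner)


def sample_wide(sampler: Callable[[random.Random], dict], width: int, seed: int = 0) -> dict:
    """Draw `width` independent candidates and aggregate them.

    `sampler` is a hypothetical stand-in for one LLM call that proposes a tool call.
    """
    rng = random.Random(seed)
    candidates = [sampler(rng) for _ in range(width)]
    return aggregate_tool_calls(candidates)
```

In practice the `sampler` would wrap a model call at non-zero temperature; widening then means raising `width` instead of lengthening a single chain of thought.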
Related events (8)
PROVE framework trains LLMs for multi-step tool use via stateful MCP environments and programmatic rewards
Researchers introduce PROVE (Programmatic Rewards On Verified Environments), a framework for training LLMs to orchestrate multi-step tool calls using reinforcement learning. The system includes a library of 20 stateful MCP servers with 343 tools, an automated data synthesis pipeline that grounds training queries in live server state, and a multi-component programmatic reward function requiring no judge model. Training four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with ~13K examples yields gains of up to +10.2 on BFCL Multi-Turn, +6.8 on tau2-bench, and +6.5 on T-Eval, demonstrating consistent improvements in multi-step tool orchestration.
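A judge-free, multi-component reward of the kind described can be computed purely from the recorded trajectory and the server state. The sketch below is hypothetical: the summary does not list PROVE's reward components or weights, so the three components (tool-name match, argument match, final-state match), the weights, and the names `ToolCall` and `programmatic_reward` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def programmatic_reward(
    predicted: list[ToolCall],
    expected: list[ToolCall],
    final_state: dict,
    expected_state: dict,
    weights: tuple[float, float, float] = (0.3, 0.3, 0.4),  # assumed weights
) -> float:
    """Score a trajectory with deterministic checks only (no judge model)."""
    # Dividing by the longer sequence penalizes both missing and extra calls.
    denom = max(len(predicted), len(expected), 1)
    pairs = list(zip(predicted, expected))
    name_hits = sum(p.name == e.name for p, e in pairs)
    arg_hits = sum(p.name == e.name and p.args == e.args for p, e in pairs)
    # State component: did the environment end up in the expected state?
    state_ok = 1.0 if final_state == expected_state else 0.0
    w_name, w_args, w_state = weights
    return w_name * name_hits / denom + w_args * arg_hits / denom + w_state * state_ok
```

Because every component is an exact comparison, the same trajectory always receives the same score, which is the practical appeal of a programmatic reward over a judge model.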
Study compares human and LLM active causal reasoning, finding LLMs less efficient but near human-level on conjunctive rules
A new arXiv paper investigates whether active exploration reduces the 'conjunctive handicap' in causal learning, using a blicket detector task with adult participants who could freely intervene to identify causal objects. Results show active exploration substantially improves human conjunctive causal reasoning, though conjunctive rules still require more tests than disjunctive ones. State-of-the-art LLMs approach human-level hypothesis inference accuracy but show less efficient exploration strategies and similar conjunctive-disjunctive performance gaps, raising questions about the nature of LLM causal reasoning.
BeliefTrack: Benchmarking and Improving Contextual Belief Management in LLMs
This paper introduces Contextual Belief Management (CBM) as a framework for studying how LLMs should update, preserve, or ignore information across long-horizon interactions. The authors release BeliefTrack, a closed-world benchmark with symbolic verifiers enabling exact turn-level evaluation across Rule Discovery and Circuit Diagnosis tasks. Vanilla LLMs show severe CBM failures; reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average, while representation-level steering achieves a 46.1% reduction. Probing experiments reveal latent belief-state dynamics underlying these failures.
Act2Answer: Benchmarking commonsense and world knowledge retention in Vision-Language-Action models
Researchers introduce Act2Answer, a protocol for evaluating how much commonsense and factual knowledge VLA models retain after fine-tuning on robotics data. The approach converts knowledge benchmark questions into tabletop object-placement episodes, yielding action-grounded success rates that reduce confounds from low-level control failures. A large-scale study of 7 VLA models and 9 VLM baselines finds that VLAs retain solid performance on simple concepts but show larger gaps on richer semantic categories compared to their source VLMs, and that VQA co-training is associated with better knowledge retention.
Contrastive-Difference CKA reveals concept-specific structural alignment across LLM architectures
Researchers introduce CKA_Delta (contrastive-difference CKA), a training-free diagnostic that isolates concept-specific representational convergence from generic similarity across LLM architectures. The method reveals a geometric-functional universality dissociation: moderate geometric alignment coexists with near-perfect functional transfer across six concept domains and multiple architectural families. CKA_Delta also functions as an architectural outlier detector, flagging Gemma as a notable outlier (d=1.08, AUC=0.79). The work provides a practical tool for cross-architecture concept monitoring without requiring model training.
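For orientation, the standard linear CKA that such a diagnostic builds on can be written as below. The first formula is the usual definition for column-centered activation matrices. The second is only one plausible reading of "contrastive-difference", assumed here and not confirmed by the summary: apply CKA to the difference between activations on concept-present and concept-absent inputs. The symbols X^{+}, X^{-}, and \Delta X are hypothetical notation.

```latex
% Standard linear CKA; X \in \mathbb{R}^{n \times p} and Y \in \mathbb{R}^{n \times q}
% hold column-centered activations of two models on the same n inputs.
\mathrm{CKA}(X, Y) =
  \frac{\lVert Y^{\top} X \rVert_F^{2}}
       {\lVert X^{\top} X \rVert_F \, \lVert Y^{\top} Y \rVert_F}

% Assumed contrastive variant (hypothetical, not taken from the paper):
% \Delta X = X^{+} - X^{-}, \quad \Delta Y = Y^{+} - Y^{-}, each re-centered by column.
\mathrm{CKA}_{\Delta} = \mathrm{CKA}(\Delta X, \Delta Y)
```

Under this reading, similarity shared by both conditions cancels in the difference, which would leave the concept-specific component the summary refers to.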
KINA: 899-item knowledge benchmark across 261 disciplines with formal representativeness and annotation incentive guarantees
KINA (Knowledge Index of Noah's Ark) is a new 899-item LLM benchmark spanning 261 fine-grained disciplines. It addresses three methodological weaknesses in existing knowledge benchmarks: poor disciplinary representativeness, flat-payment annotation incentives, and unaudited ranking instability. The authors provide two formal results: a (1-1/e) greedy approximation guarantee for disciplinary coverage, and a proof that bonus-on-bar tournament payment weakly dominates flat payment for annotation quality. Across 42 models from 13 labs, the top performer, Gemini-3.1-Pro-Preview, reaches 53.17%, with Claude-Opus-4.6 and GPT-5.4 close behind. The leaderboard is tiered rather than smooth, and substantial headroom remains before saturation.
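The (1-1/e) factor is the classic guarantee for greedy selection on maximum-coverage problems, so the coverage step can be pictured with the textbook greedy algorithm. This is a sketch under that assumption: the summary does not give KINA's exact objective, and `greedy_cover` and the toy item-to-discipline mapping are hypothetical.

```python
def greedy_cover(candidates: dict[str, set[str]], budget: int) -> list[str]:
    """Pick up to `budget` items, each time taking the largest marginal coverage.

    For maximum coverage, this greedy rule covers at least (1 - 1/e) of what
    the best possible selection of the same size would cover.
    """
    covered: set[str] = set()
    chosen: list[str] = []
    remaining = dict(candidates)  # copy so the caller's mapping is untouched
    for _ in range(budget):
        if not remaining:
            break
        # Marginal gain = disciplines this item adds beyond those already covered.
        best = max(remaining, key=lambda k: len(remaining[k] - covered))
        if not remaining[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen
```

With items mapped to the disciplines they touch, `greedy_cover(items, budget=899)` would return a selection order; ties are broken by insertion order of the mapping.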
EDIT framework trains more rubric-faithful LLM graders via internal-state diagnostics
Researchers introduce Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for improving LLM-based rubric grading. The first phase (EDIT-SFT) identifies problematic reasoning steps using posterior belief signals and input-grounding scores, then revises only those steps with rubric checklists; the second phase (EDIT-RL) uses belief-guided reward shaping to penalize harmful belief drifts during RL. Experiments on two real-world multi-subject grading benchmarks show consistent improvements over SFT and RL baselines on both in-domain and out-of-domain splits.
Systematic study of extrinsic and intrinsic properties for effective code interpreter reasoning in LLMs
Researchers investigate which behavioral properties make LLMs effective at reasoning with a Code Interpreter (CI), identifying two axes: extrinsic 'crucial tokens' and intrinsic 'cognitive behaviors' such as verification, backtracking, and backward chaining. Stronger CI reasoning models consistently exhibit a higher prevalence of these properties. The paper shows that appending code-specific crucial tokens at inference time improves performance on mathematical, ordering, and optimization tasks, while augmenting training with cognitive behaviors improves SFT and RL performance in two of three evaluated models. The work also finds that these behaviors reduce overthinking in incorrect responses and improve token efficiency.


