Decoupled Search Grounding (DSG): vendor-agnostic MCP-compatible architecture for LLM agent retrieval
Researchers introduce Decoupled Search Grounding (DSG), an architecture that moves real-time search grounding outside the reasoning model via an MCP-compatible gateway, exposing provider routing, caching, and retrieval depth as explicit controls. Evaluated across five frontier models on SimpleQA, FreshQA, and HotpotQA, DSG nearly matches native search accuracy on SimpleQA (86.1% vs. 87.7%) while achieving 91% lower search cost and 68% lower latency via a 99.4% warm-cache hit rate. In a production e-commerce deployment, DSG cuts search cost by over 98% while matching or slightly exceeding native-search accuracy. The work frames real-time grounding as an optimizable interface boundary rather than a fixed model feature, with direct relevance to MCP-based agent infrastructure.
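The gateway idea in this summary (search kept outside the model, with provider routing, caching, and retrieval depth as explicit controls) can be illustrated with a minimal sketch. This is not the DSG implementation. The class `SearchGateway`, the `Provider` signature, the cache-key normalization, and the `fake_provider` stub are all hypothetical names and choices made for illustration, and no MCP transport is modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical provider signature: (query, depth) -> list of result snippets.
Provider = Callable[[str, int], List[str]]


@dataclass
class SearchGateway:
    """Illustrative gateway: routing, caching, and depth are caller-visible knobs."""

    providers: Dict[str, Provider]
    default_provider: str
    cache: Dict[Tuple[str, str, int], List[str]] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def search(self, query: str, depth: int = 3,
               provider: Optional[str] = None) -> List[str]:
        name = provider or self.default_provider
        # Assumption: queries are normalized by trimming and lowercasing
        # before they are used as part of the cache key.
        key = (name, query.strip().lower(), depth)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        results = self.providers[name](query, depth)
        self.cache[key] = results
        return results

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_provider(query: str, depth: int) -> List[str]:
    # Stand-in for a real search backend; returns `depth` placeholder snippets.
    return [f"result {i} for {query!r}" for i in range(depth)]


gateway = SearchGateway(providers={"fake": fake_provider},
                        default_provider="fake")
first = gateway.search("capital of France", depth=2)
second = gateway.search("Capital of France ", depth=2)  # served from cache
```

Because the cache sits in the gateway and not inside the model, repeated or near-identical queries avoid a paid search call, which is the kind of mechanism the reported warm-cache hit rate points to.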
Related events (8)
SIGA: Self-evolving grounding adapters enable coding agents to operate scientific simulators
SIGA (Simulator-Interface Grounding Adapter) is a lightweight adapter framework that equips general-purpose coding agents with the executable contracts needed to configure and run specialized scientific simulators. Evaluated primarily on GEOS (a multiphysics subsurface simulator), SIGA achieves a ~36x wall-clock speedup over human experts and improves TreeSim scores from 0.720 to 0.789 on held-out tasks, with self-evolution via trajectory rewriting yielding further gains. The system also transfers to OpenFOAM and LAMMPS, revealing that the dominant grounding mechanism (validation vs. memory/retrieval) shifts depending on the interface type. The work frames simulator setup as an agent-tool interface grounding problem, offering a generalizable pattern for deploying coding agents on domain-specific software.
DeepSeek Reasonix: Native Coding Agent with High Caching and Low Cost
DeepSeek Reasonix is a coding agent built natively on DeepSeek models, emphasizing high prompt caching rates and low inference cost. The project attracted significant Hacker News engagement (349 points, 171 comments), suggesting community interest in cost-efficient agentic coding workflows. It appears to be an open-source or community-developed tool rather than an official DeepSeek Labs release.
SearchGEO framework measures LLM search agent vulnerability to web content manipulation
Researchers introduce SearchGEO, a controlled evaluation framework for measuring endorsement corruption in LLM-based web-search agents, combining a manipulation pipeline, five-mode attack taxonomy, and multiple output metrics. Evaluating 13 LLM backends on 308 cases each, they find attack success rates ranging from 0.0% on Claude-Sonnet-4.6 to 31.4% on Gemini-3-Flash, with model-family-specific vulnerability patterns. An auxiliary probe escalating endorsement to install commands reveals a behavioral split: Claude over-rejects while GPT over-trusts. The findings argue for treating adversarial search content robustness as a first-class safety evaluation dimension for deployed agents.
PolyGnosis 2.0: Multi-Agent Architecture for Prediction Market Intelligence via Harness Engineering
PolyGnosis 2.0 introduces a multi-agent system that synthesizes Polymarket prediction market signals with GDELT OSINT streams to identify 'Perspective Mismatches' as trading signals. The paper evaluates agentic harness engineering techniques (reflection loops, tool-calling, divide-and-conquer partitioning, and chain-of-thought) in high-noise financial domains. Key empirical findings are that structural partitioning is necessary for multi-dimensional alignment, that unconstrained terminal reflection induces logical drift, and that a pervasive consensus bias emerges across agent configurations. The authors identify a Pareto-optimal configuration achieving professional-grade analytical precision with minimized latency and token overhead.
Skill-Conditioned Gated Self-Distillation (SGSD) for LLM Reasoning
SGSD is a new on-policy self-distillation method for LLM reasoning that replaces trusted privileged information (e.g., reference answers) with an experience-derived skill bank of skill-mistake pairs. It constructs a multi-teacher pool, validates each teacher's contribution via a verifier, and applies a gated objective to distill informative disagreements while suppressing noisy signals. On Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and answer-conditioned OPSD by 1.7% on average across AIME24, AIME25, and HMMT25. The method relaxes the assumption of trusted privileged information, making self-distillation more practical under weaker supervision.
SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens
Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.
Data Points: DeepSWE Benchmark, DeepSeek V4 Price Cuts, MAI-Image-2.5, Mythos Security Findings, MCP Stateless Update
This edition of The Batch covers five distinct AI developments: Datacurve's DeepSWE benchmark claims to fix critical grading flaws in SWE-bench Pro with hand-written verifiers and harder tasks; DeepSeek permanently cuts V4 Pro prices by 75%; Microsoft's MAI-Image-2.5 debuts third on the Arena leaderboard; Anthropic's Claude Mythos Preview found over 10,000 high/critical vulnerabilities in the first month of Project Glasswing, with remediation badly lagging discovery; and the Model Context Protocol proposes removing stateful sessions to enable stateless, load-balanced remote servers. Each item reflects meaningful movement in evaluation methodology, inference economics, multimodal generation, AI-assisted security, and agent tooling infrastructure.
DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA
Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.


