5 · arXiv cs.CL (Computation and Language) · 11d ago

DocTrace: Structure-Aware On-Demand Hypergraph Memory for Long-Document QA

Researchers introduce DocTrace, a multi-agent RAG framework for long-document question answering that uses query-triggered knowledge organization rather than costly query-agnostic preprocessing. The system combines a lightweight document structural tree index, on-demand hypergraph working memory, and a graph-structured experience memory that stores successful reasoning plans for reuse. Evaluated on four long-document QA datasets, DocTrace outperforms the strongest baseline (ComoRAG) by up to 8.85% F1 and 4.40% EM while reducing computational cost by 53.32%.

Long Context Evolution · Agent and Tool Ecosystem · ComoRAG · DocTrace · Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Related guides (2)

Long Context Evolution · Topic guide

Long Context Evolution: From Bigger Windows to Smarter Memory


Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer


Related events (8)

6 · arXiv · cs.AI · 25d ago

VeriTrace: Cognitive-Graph Framework with Explicit Regulatory Loops for Deep Research Agents

VeriTrace introduces a cognitive-graph framework for deep research agents that replaces implicit LLM reasoning over intermediate representations with three explicit regulatory loops: interpretive update, deviation feedback, and schema revision. The system addresses contamination and error propagation in evolving mental models during complex multi-step research tasks. Using Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench Insight and 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DeepResearch Bench.

Frontier Model Releases · Evaluation and Benchmarking · DeepSeek V4 · cognitive-graph · DeepResearch Bench

4 · arXiv · cs.CL · 12d ago

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.
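
The key-value separation described above can be illustrated with a toy retriever. This is a minimal sketch and not the paper's implementation: the `Hyperedge` class, the `retrieve` function, and the scoring rule (hyperedge weight times entity overlap with the query) are all assumptions made for illustration. Matching happens only on hyperedges (the keys), while the reader receives the untouched passage text (the values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """Hypothetical retrieval key: a weighted set of entities tied to one passage."""
    entities: frozenset
    weight: float
    passage_id: str


def retrieve(query_entities, hyperedges, passages, top_k=2):
    """Score hyperedges (keys) against the query, then return passage text (values)."""
    query = set(query_entities)
    best = {}
    for edge in hyperedges:
        overlap = len(query & edge.entities)
        if overlap == 0:
            continue  # this hyperedge shares no entity with the query
        # Assumed scoring rule: weight times overlap, keeping the best edge per passage.
        score = edge.weight * overlap
        best[edge.passage_id] = max(score, best.get(edge.passage_id, 0.0))
    ranked = sorted(best, key=best.get, reverse=True)[:top_k]
    return [passages[pid] for pid in ranked]


passages = {
    "p1": "Marie Curie was born in Warsaw in 1867.",
    "p2": "Marie Curie won the Nobel Prize in Physics in 1903.",
    "p3": "The Seine flows through Paris.",
}
hyperedges = [
    Hyperedge(frozenset({"Marie Curie", "Warsaw"}), 1.0, "p1"),
    Hyperedge(frozenset({"Marie Curie", "Nobel Prize"}), 0.5, "p2"),
    Hyperedge(frozenset({"Seine", "Paris"}), 1.0, "p3"),
]
results = retrieve({"Marie Curie", "Warsaw"}, hyperedges, passages)
```

Because the passages are stored separately from the keys, the key structure can be reweighted or rebuilt without rewriting the text handed to the reader.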

Evaluation and Benchmarking · Agent and Tool Ecosystem · ColBERTv2 · MuSiQue · 2WikiMultiHopQA

6 · arXiv · cs.AI · 12d ago

MemDreamer: Hierarchical graph memory and agentic retrieval for long video understanding

MemDreamer is a plug-and-play framework that decouples perception and reasoning for long-video understanding by incrementally building a three-tier Hierarchical Graph Memory capturing spatiotemporal and causal relations. During inference, a reasoning model uses an Observation-Reason-Action loop with agentic tool-augmented retrieval to navigate the memory graph, constraining the context window to 2% of full-context ingestion while achieving a 12.5-point absolute accuracy gain. The system reaches SOTA on four benchmarks, narrowing the gap with human experts to 3.7 points. The authors also report a strong linear correlation between logical reasoning performance and long-video understanding, proposing agentic capability scaling as a new paradigm for multimodal comprehension.

Long Context Evolution · Agent and Tool Ecosystem · MemDreamer · Hierarchical Graph Memory · Observation-Reason-Action

5 · arXiv · cs.CL · 9d ago

Doc-to-Atom: Compositional parametric memory via semantically typed micro-LoRA adapters

Doc-to-Atom (Doc2Atom) proposes a framework that decomposes documents into semantically typed knowledge atoms, each compiled into an independent micro-LoRA adapter with a retrieval key. At inference, a lightweight query router assembles only relevant atoms into a query-specific adapter injected into a frozen base model, addressing the irrelevant-query interference and scalability problems of monolithic adapter approaches like Doc-to-LoRA. The system is trained end-to-end via multi-objective distillation and outperforms Doc-to-LoRA baselines on six QA benchmarks while reducing memory cost.

Long Context Evolution · Inference Economics · Doc-to-LoRA · Doc-to-Atom · LoRA

4 · arXiv · cs.CL · 8d ago

UMG-RAG: Training-free hybrid retrieval with uncertainty-aware granularity fusion for long-document RAG

Researchers propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that addresses the tension between large and fine-grained retrieval chunks in RAG pipelines. The system converts dense and sparse retriever scores across multiple chunk granularities into evidence distributions, estimates reliability via entropy, and fuses candidates using query-specific confidence signals. A variant called UMGP-RAG uses fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on QA benchmarks show improved generation quality with no changes to the underlying retriever or generator.
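
The entropy-based fusion described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' formulation: the softmax normalization, the confidence defined as one minus normalized entropy, and the confidence-weighted sum are all choices made here, and the function names are hypothetical. It also assumes every source scores the same candidate ids, which glosses over how fine-grained chunks map to coarser ones.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Turn one source's raw retriever scores into a probability distribution."""
    top = max(scores.values())
    exps = {cand: math.exp(s - top) for cand, s in scores.items()}
    total = sum(exps.values())
    return {cand: e / total for cand, e in exps.items()}


def confidence(dist):
    """One minus normalized entropy: high for a peaked ranking, about 0 for a flat one."""
    if len(dist) < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return 1.0 - entropy / math.log(len(dist))


def fuse(sources):
    """Sum each candidate's probability across sources, weighted by source confidence."""
    fused = defaultdict(float)
    for scores in sources.values():
        dist = softmax(scores)
        weight = confidence(dist)
        for cand, p in dist.items():
            fused[cand] += weight * p
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: the dense retriever is decisive, the sparse one is not.
sources = {
    "dense/large-chunks": {"a": 5.0, "b": 0.0, "c": 0.0},
    "sparse/small-chunks": {"a": 1.0, "b": 1.0, "c": 1.0},
}
ranking = fuse(sources)
```

In this toy case the flat sparse scores get a confidence near zero, so the fused ranking follows the decisive dense source for this query.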

Long Context Evolution · Evaluation and Benchmarking · Uncertainty-Aware Hybrid Retrieval for Long-Document RAG · Uncertainty-aware Multi-Granularity RAG

6 · arXiv · cs.CL · 19d ago

LongTraceRL: Reinforcement Learning for Long-Context Reasoning via Search Agent Trajectories and Rubric Rewards

LongTraceRL is a new RL training framework for improving long-context reasoning in LLMs, addressing limitations of existing RLVR methods. It constructs challenging training data using multi-hop questions from knowledge graph random walks and tiered distractors derived from search agent trajectories (high-confusability: read but uncited; low-confusability: seen but unopened). A rubric reward provides entity-level process supervision along reasoning chains, applied only to correct responses to prevent reward hacking. Experiments across three LLMs (4B–30B parameters) on five long-context benchmarks show consistent improvements over strong baselines.
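
The gating idea, a rubric bonus that only correct responses can earn, can be sketched as follows. This is a hypothetical illustration and not the paper's reward: the exact-match correctness check, the substring-based entity matching, the `weight` parameter, and the function name `rubric_reward` are all assumptions.

```python
def rubric_reward(response, answer, gold_answer, rubric_entities, weight=0.5):
    """Outcome reward plus a rubric bonus that is paid only when the answer is correct."""
    correct = answer.strip().lower() == gold_answer.strip().lower()
    if not correct:
        # Gate: a wrong answer earns nothing, however many rubric entities it mentions.
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in rubric_entities if entity.lower() in text)
    coverage = hits / len(rubric_entities) if rubric_entities else 0.0
    return 1.0 + weight * coverage


rubric = ["Marie Curie", "Sorbonne"]
good = rubric_reward("Marie Curie studied physics in Paris.", "Paris", "paris", rubric)
bad = rubric_reward("Marie Curie taught at the Sorbonne.", "London", "Paris", rubric)
```

Without the gate, a policy could collect rubric credit by listing entities while still answering incorrectly, which is the reward-hacking route the summary says the method closes off.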

Long Context Evolution · Evaluation and Benchmarking · tiered distractors · Knowledge Graph Random Walk · Long-context Reasoning Benchmarks

6 · arXiv · cs.CL · 5d ago

GitOfThoughts: Git-based agent memory substrate with sobering findings on memory utility for novel problems

Researchers introduce GitOfThoughts, a system that stores LLM reasoning trees as git repositories, enabling replayable, auditable, and mergeable agent memory at low engineering cost. Across five memory substrates (none, markdown, vector, graph, git), two benchmarks, and two model scales with pre-registered replications, the paper finds that no memory format reliably improves accuracy on novel problems. Memory helps only above a 'copyability threshold' (similarity above roughly 0.8), where retrieved cases are near-duplicates of the current problem, and even then the gain comes from answer retrieval rather than method transfer. The paper also documents a retracted result and a refuted hypothesis, modeling a rigorous evaluation standard.

Evaluation and Benchmarking · Agent and Tool Ecosystem · GitOfThoughts: Version-Controlled Reasoning and Agent Memory You Can Replay, Diff, and Merge · GitOfThoughts

5 · arXiv · cs.CL · 11d ago

REAL: Reasoning-enhanced temporal graph framework for LLM long-term memory management

REAL is a new framework that represents LLM conversational memory as a temporal, confidence-aware directed property graph, where atomic facts carry validity intervals, confidence scores, and exploration intent labels. It addresses three limitations of prior memory systems: flat text structures, destructive overwrites of evolving facts, and passive retrieval. The system uses non-destructive temporal updates, semantic evaluator-guided hybrid beam search, and counterfactual inference to repair incomplete retrieval states. Experiments show a 22.72% average improvement over flat-text, graph-based, and existing memory baselines.
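
A non-destructive temporal update of the kind described above can be sketched with validity intervals. This is a minimal sketch under stated assumptions rather than REAL's actual data model: the `Fact` and `TemporalMemory` classes, the integer timestamps, and the half-open interval convention are hypothetical, and the property graph, beam search, and counterfactual inference are left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    """Hypothetical atomic fact with a confidence score and a validity interval."""
    subject: str
    relation: str
    value: str
    confidence: float
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still valid


class TemporalMemory:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, relation, value, confidence, at):
        """Close the open fact for this (subject, relation) instead of overwriting it."""
        for fact in self.facts:
            if (fact.subject, fact.relation) == (subject, relation) and fact.valid_to is None:
                fact.valid_to = at
        self.facts.append(Fact(subject, relation, value, confidence, at))

    def as_of(self, subject, relation, at):
        """Return the value valid at time `at`, treating intervals as [valid_from, valid_to)."""
        for fact in self.facts:
            if (fact.subject, fact.relation) != (subject, relation):
                continue
            if fact.valid_from <= at and (fact.valid_to is None or at < fact.valid_to):
                return fact.value
        return None


memory = TemporalMemory()
memory.assert_fact("alice", "lives_in", "Paris", 0.9, at=1)
memory.assert_fact("alice", "lives_in", "Berlin", 0.8, at=5)
```

After the second call both facts remain in the store, so a question about time 3 still resolves to Paris while a question about time 7 resolves to Berlin.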

Long Context Evolution · Agent and Tool Ecosystem · REAL