5 · GitHub Trending (AI/LLM filtered) · 1mo ago

LEANN: RAG System with 97% Storage Savings for On-Device Private Retrieval

LEANN is an open-source retrieval-augmented generation (RAG) system designed to run on personal devices, with a claimed 97% storage reduction compared with conventional vector index approaches. The project is associated with MLSys 2026, which suggests an upcoming systems research paper. It emphasizes privacy through fully local execution and aims to maintain retrieval accuracy despite aggressive compression. The repository has accumulated over 11,000 stars with strong recent momentum.

Inference Economics · Enterprise Deployment Patterns · Agent and Tool Ecosystem · LEANN · yichuan-w · MLsys 2026 · Retrieval-Augmented Generation

Related guides (3)

Enterprise Deployment Patterns · Topic guide

Enterprise Deployment Patterns: From AI Demo to Production Reality

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Inference Economics · Topic guide

Inference Economics: The Cost of Running AI in Production

Related events (8)

6 · arXiv · cs.CL · 24d ago

Coverage Illusion: Post-Retrieval Cascade Design Reduces LLM Augmentation Overhead in Production RAG

A case study on the Danish National Encyclopedia's RAG system evaluates five retrieval workflows across 20,000 query-workflow pairs, revealing a 'Coverage Illusion' where synthetic queries overestimate the need for LLM augmentation (90%+) versus real production traffic (27.8%). Pre-retrieval routing cannot detect this gap because augmentation necessity is only revealed after index search. A post-retrieval cascade running workflows cheapest-first and escalating to LLM augmentation only on empty results improves quality by +0.140 Composite Overall points over Always-HyDE, reduces latency by 31.8%, and eliminates LLM augmentation for 72.2% of real queries. The work highlights a structural mismatch between synthetic and real query distributions that affects RAG system design assumptions.

Evaluation and Benchmarking · Inference Economics · RAG · Coverage Illusion · HyDE · +5 more
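
The post-retrieval cascade in the Coverage Illusion summary above (run workflows cheapest-first, escalate to LLM augmentation only when a workflow returns nothing) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `cascade_retrieve`, `keyword_search`, `llm_augmented_search`, and the toy `INDEX` are all hypothetical names and data invented here.

```python
from typing import Callable, List, Sequence, Tuple

# A workflow takes a query string and returns the documents it retrieved.
Workflow = Callable[[str], List[str]]


def cascade_retrieve(
    query: str, workflows: Sequence[Tuple[str, Workflow]]
) -> Tuple[str, List[str]]:
    """Run workflows in the given order (assumed cheapest first) and stop at
    the first one that returns a non-empty result."""
    for name, run in workflows:
        hits = run(query)
        if hits:  # escalate only when the cheaper workflow came back empty
            return name, hits
    return "none", []


# Toy index, purely illustrative.
INDEX = {"copenhagen": ["Copenhagen is the capital of Denmark."]}


def keyword_search(query: str) -> List[str]:
    """Cheap workflow: plain keyword lookup against the toy index."""
    q = query.lower()
    return [doc for key, docs in INDEX.items() if key in q for doc in docs]


def llm_augmented_search(query: str) -> List[str]:
    """Expensive workflow: stand-in for an LLM-augmented step such as HyDE."""
    return [f"(augmented result for: {query})"]


WORKFLOWS = [("keyword", keyword_search), ("llm_augmented", llm_augmented_search)]

easy = cascade_retrieve("Where is Copenhagen?", WORKFLOWS)
hard = cascade_retrieve("An obscure question", WORKFLOWS)
print(easy[0], hard[0])
```

The point of the design is that the decision to pay for augmentation is made after the index search, which is the information the summary says pre-retrieval routing does not have.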

4 · GitHub Trending · 29d ago

MemOS: Self-Evolving Memory OS for LLM Agents with Hybrid Retrieval and Token Savings

MemOS is an open-source TypeScript project providing a memory operating system layer for LLM and AI agents, featuring ultra-persistent memory, hybrid retrieval, and cross-task skill reuse. The project claims 35.24% token savings through its memory management approach. It has accumulated 9,329 GitHub stars with moderate daily momentum (+67). The system targets agent memory persistence and efficiency as a foundational infrastructure component.

Inference Economics · Agent and Tool Ecosystem · MemOS · MemTensor

6 · Anthropic News · 17d ago

Anthropic introduces Contextual Retrieval to reduce RAG retrieval failures by up to 67%

Anthropic published a technical method called Contextual Retrieval that combines Contextual Embeddings and Contextual BM25 to address the context-loss problem in traditional RAG pipelines. The approach prepends chunk-level context before encoding, reducing failed retrievals by 49% standalone and 67% when combined with reranking. The post also highlights prompt caching as a simpler alternative for knowledge bases under 200K tokens, and provides a cookbook for deployment with Claude.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · Claude · BM25 · Contextual Retrieval · +1 more
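
The core step in the Contextual Retrieval summary above, prepending chunk-level context before encoding, might look roughly like this. It is a sketch under assumptions: `contextualize_chunks` and `first_line_context` are hypothetical helpers, and the context generator here is a trivial stand-in for whatever model call would produce the context in practice.

```python
from typing import Callable, List

# A context generator takes (whole document, chunk) and returns a short
# description situating the chunk within the document.
ContextFn = Callable[[str, str], str]


def contextualize_chunks(
    document: str, chunks: List[str], make_context: ContextFn
) -> List[str]:
    """Return each chunk with its generated context prepended.

    The resulting strings are what would then be embedded and indexed for
    BM25, instead of the bare chunks."""
    result = []
    for chunk in chunks:
        context = make_context(document, chunk).strip()
        result.append(f"{context}\n\n{chunk}" if context else chunk)
    return result


def first_line_context(document: str, chunk: str) -> str:
    """Stand-in context generator: uses the document's first line as a title."""
    title = document.splitlines()[0] if document else ""
    return f"From the document titled '{title}'." if title else ""


# Toy document, purely illustrative.
doc = "Example quarterly report\nRevenue grew over the previous quarter."
chunks = ["Revenue grew over the previous quarter."]
contextualized = contextualize_chunks(doc, chunks, first_line_context)
print(contextualized[0])
```

Because the context is attached to the text itself, the same contextualized string can feed both the embedding index and the BM25 index.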

4 · arXiv · cs.CL · 8d ago

UMG-RAG: Training-free hybrid retrieval with uncertainty-aware granularity fusion for long-document RAG

Researchers propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that addresses the tension between coarse and fine-grained retrieval chunks in RAG pipelines. The system converts dense and sparse retriever scores across multiple chunk granularities into evidence distributions, estimates the reliability of each via entropy, and fuses candidates using query-specific confidence signals. A variant called UMGP-RAG uses fine-grained hits to locate evidence while returning broader parent chunks for coherence. Experiments on QA benchmarks show improved generation quality with no changes to the underlying retriever or generator.

Long Context Evolution · Evaluation and Benchmarking · Uncertainty-Aware Hybrid Retrieval for Long-Document RAG · Uncertainty-aware Multi-Granularity RAG
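
As an illustration of the idea in the UMG-RAG summary above (scores turned into distributions, reliability estimated via entropy, candidates fused by confidence), here is a minimal sketch. The specific formulas are assumptions made for this example, not taken from the paper: a softmax to form each distribution, one minus normalized Shannon entropy as the confidence weight, and a weighted sum for fusion. All function names are hypothetical.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw retriever scores (non-empty) into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def confidence(dist: Dict[str, float]) -> float:
    """1 minus normalized entropy: near 1 when peaked, near 0 when flat."""
    n = len(dist)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return 1.0 - entropy / math.log(n)


def fuse(sources: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum each source's distribution, weighted by that source's confidence."""
    fused: Dict[str, float] = {}
    for scores in sources.values():
        dist = softmax(scores)
        weight = confidence(dist)
        for candidate, p in dist.items():
            fused[candidate] = fused.get(candidate, 0.0) + weight * p
    return fused


# Toy scores: one source is decisive, the other is uninformative (flat).
sources = {
    "dense_fine": {"a": 5.0, "b": 0.0, "c": 0.0},
    "sparse_coarse": {"a": 1.0, "b": 1.0, "c": 1.0},
}
fused = fuse(sources)
best = max(fused, key=fused.get)
print(best)
```

In this toy setup the flat source gets a weight close to zero, so the ranking is driven almost entirely by the source that is confident for this query.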

4 · Hugging Face Blog · 1mo ago

Building Cost-Efficient Enterprise RAG Applications with Intel Gaudi 2 and Intel Xeon

This Hugging Face blog post details how to build retrieval-augmented generation (RAG) pipelines for enterprise use cases using Intel Gaudi 2 accelerators and Intel Xeon CPUs. It covers the architecture and cost-efficiency tradeoffs of deploying RAG on Intel hardware as an alternative to GPU-based infrastructure. The post is positioned as a practical guide for organizations seeking lower-cost inference deployments.

Inference Economics · Enterprise Deployment Patterns · Intel Xeon · Intel Gaudi · Hugging Face · +3 more

3 · GitHub Trending · 3d ago

RAGFlow open-source RAG engine with agent capabilities trending on GitHub

RAGFlow is an open-source Retrieval-Augmented Generation engine that combines RAG with agent capabilities, positioned as a context layer for LLMs. The project has accumulated over 83,000 GitHub stars with 111 new stars today, indicating sustained community interest. It is maintained by Infiniflow and represents a notable open-source tooling option in the RAG/agent ecosystem.

Agent and Tool Ecosystem · Infiniflow · RAGFlow

5 · arXiv · cs.LG · 15d ago

SARDI: Self-Augmenting Retrieval for Diffusion Language Models using lookahead tokens

Researchers introduce SARDI, a training-free RAG framework for discrete diffusion language models that repurposes discarded low-confidence tokens during denoising as lookahead signals to guide retrieval before output is finalized. The method is retriever-agnostic and applicable to any reasoning-capable discrete diffusion LM. Evaluated across five multi-hop QA benchmarks, SARDI outperforms training-free diffusion and autoregressive retrieval baselines at up to 8x higher throughput.

Evaluation and Benchmarking · Agent and Tool Ecosystem · Self-Augmenting Retrieval for Diffusion Language Models · SARDI

4 · arXiv · cs.CL · 12d ago

HKVM-RAG: Hypergraph key-value separation improves multi-hop retrieval-augmented generation

A new arXiv preprint introduces HKVM-RAG, an evidence-organization layer for multi-hop RAG that uses weighted hyperedges as retrieval keys while retaining passage text as answer values. Under a fixed-substrate protocol controlling for tuple cache, reader, and evaluation budget, the hypergraph key-value approach improves over KG-PPR by +3.4 F1 on 2WikiMultiHopQA and +3.6 F1 on MuSiQue. A dense-aware controller combining frozen ColBERTv2 with HKVM features reaches 88.8, 65.1, and 85.8 F1 on three benchmarks, outperforming ColBERTv2 alone by 5–11 F1 points. The work positions hypergraph organization as a reusable evidence-control mechanism rather than a dense-retrieval replacement.

Evaluation and Benchmarking · Agent and Tool Ecosystem · ColBERTv2 · MuSiQue · 2WikiMultiHopQA · +2 more