4 · GitHub Trending (AI/LLM filtered) · 22d ago

MinerU: Document-to-LLM-Ready Markdown/JSON Conversion Tool

MinerU is an open-source Python tool by OpenDataLab that converts complex documents (PDFs, Office files) into structured Markdown or JSON optimized for LLM and agentic workflows. The repository has 65,610 GitHub stars, with 180 added today, indicating sustained community traction. It targets a common preprocessing bottleneck in RAG and agent pipelines.

Agent and Tool Ecosystem · MinerU · OpenDataLab

Related guides (1)

Agent and Tool Ecosystem · Topic guide

Agent and Tool Ecosystem: How AI Is Learning to Act, Not Just Answer

Related events (8)

4 · GitHub Trending · 22d ago

PaddleOCR: OCR Toolkit Bridging Documents and LLMs

PaddleOCR is an open-source OCR toolkit built on PaddlePaddle that converts PDFs and images into structured data suitable for LLM pipelines. It supports 100+ languages and is positioned as a document-to-AI bridge. The repository has accumulated nearly 79,000 GitHub stars, with 148 new stars today, indicating sustained community interest.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · PaddlePaddle · Python · PaddleOCR

5 · GitHub Trending · 11d ago

ARIS: Lightweight autonomous ML research agent using Markdown-only skills

ARIS (Auto-Research-In-Sleep) is an open-source Python project that provides lightweight, framework-free, Markdown-based skills for autonomous ML research workflows, including cross-model review loops, idea discovery, and experiment automation. It is designed to work with any LLM agent backend, such as Claude Code or Codex. The project has 11,791 GitHub stars, with 106 added today, suggesting meaningful community adoption.

Agent and Tool Ecosystem · wanshuiyin · ARIS · Claude Code

4 · Hugging Face Blog · 1mo ago

Open-Source Text Generation & LLM Ecosystem at Hugging Face

Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.

Open Weights Progress · Inference Economics · Hugging Face

4 · GitHub Trending · 29d ago

Repomix: Repository-to-Single-File Packing Tool for LLM Ingestion

Repomix is an open-source TypeScript tool that serializes an entire code repository into a single structured file optimized for consumption by LLMs such as Claude, ChatGPT, and Gemini. It addresses the practical problem of feeding large codebases into AI coding assistants and chat interfaces. The project has over 25,000 GitHub stars and continues to grow daily.

Long Context Evolution · Agent and Tool Ecosystem · yamadashy · ChatGPT · Claude
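
The general idea of packing a repository into one file can be sketched in a few lines of Python. This is an illustrative sketch only, not Repomix's implementation or output format: the function name `pack_directory`, the `=====` header style, and the suffix filter are all assumptions made for the example.

```python
from pathlib import Path


def pack_directory(root, suffixes=(".py", ".md", ".ts")):
    """Concatenate matching text files under `root` into one string.

    Hypothetical helper: the header format and filters are illustrative.
    """
    root = Path(root)
    parts = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        rel = path.relative_to(root)
        # Skip hidden files and anything inside hidden directories (e.g. .git).
        if any(part.startswith(".") for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Each file is introduced by a header line carrying its relative path.
        parts.append(f"===== {rel.as_posix()} =====\n{text.rstrip()}\n")
    return "\n".join(parts)
```

Calling `pack_directory(".")` returns one string that can be pasted into a chat interface; a real tool would also need to handle ignore rules, binary files, and token limits.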

5 · arXiv · cs.CL · 22d ago

LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback

LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.

Open Weights Progress · AI Safety Research · Reddit · LLUMI · Direct Preference Optimization (DPO)
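
One way community votes could be turned into preference pairs is sketched below. The summary does not describe LLUMI's actual pairing rule, so everything here is an assumption for illustration: the input keys (`text`, `replies`, `score`), the `min_gap` threshold, and the `prompt`/`chosen`/`rejected` record layout commonly used for DPO-style training data.

```python
from itertools import combinations


def build_preference_pairs(posts, min_gap=5):
    """Turn scored replies into prompt/chosen/rejected records.

    `posts` is a list of dicts with hypothetical keys: "text" for the post
    and "replies", a list of {"text": str, "score": int} dicts where
    score is upvotes minus downvotes.
    """
    pairs = []
    for post in posts:
        for a, b in combinations(post["replies"], 2):
            # Skip near-ties: a small score gap is a weak preference signal.
            if abs(a["score"] - b["score"]) < min_gap:
                continue
            chosen, rejected = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append({
                "prompt": post["text"],
                "chosen": chosen["text"],
                "rejected": rejected["text"],
            })
    return pairs
```

The gap threshold is a design choice: raising it yields fewer but cleaner pairs, which may matter when vote counts are noisy.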

4 · GitHub Trending · 1mo ago

awesome-llm-apps: 100+ Runnable AI Agent & RAG Application Examples

awesome-llm-apps is a curated GitHub repository collecting over 100 deployable AI agent and RAG (Retrieval-Augmented Generation) applications built with LLMs. The collection is designed for practical use: clone, customize, and ship. With 110,915 total stars and 202 added today, it reflects strong community interest in applied LLM tooling.

Enterprise Deployment Patterns · Agent and Tool Ecosystem · awesome-llm-apps · Shubham Saboo · Retrieval-Augmented Generation

4 · arXiv · cs.CL · 46h ago

STAGE pipeline generates source-grounded training data for text-to-JSON extraction

Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents, validating outputs against underlying spreadsheets. Evaluated on STAGE-Eval, an 851-example benchmark, the pipeline substantially improves Qwen3-4B performance, raising exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%. The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.

Evaluation and Benchmarking · Enterprise Deployment Patterns · STAGE · Qwen3-4B · STAGE-Eval
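
To illustrate why exact match and value accuracy can differ so widely, the sketch below scores JSON predictions under two assumed definitions: exact match counts a document only when the whole predicted object equals the reference, while value accuracy counts each reference leaf value that appears at the same path in the prediction. These definitions and the helper names `flatten` and `score` are assumptions for the example; the summary does not give STAGE-Eval's exact formulas.

```python
def flatten(obj, prefix=""):
    """Map every leaf of a nested JSON-like object to a path string."""
    if isinstance(obj, dict):
        items = [(f"{prefix}.{k}" if prefix else str(k), v) for k, v in obj.items()]
    elif isinstance(obj, list):
        items = [(f"{prefix}[{i}]", v) for i, v in enumerate(obj)]
    else:
        return {prefix: obj}
    leaves = {}
    for path, value in items:
        leaves.update(flatten(value, path))
    return leaves


def score(predictions, references):
    """Return document-level exact match and leaf-level value accuracy."""
    exact = matched = total = 0
    for pred, gold in zip(predictions, references):
        exact += int(pred == gold)
        pred_leaves = flatten(pred)
        gold_leaves = flatten(gold)
        total += len(gold_leaves)
        matched += sum(
            1 for path, value in gold_leaves.items()
            if path in pred_leaves and pred_leaves[path] == value
        )
    return {
        "exact_match": exact / len(references) if references else 0.0,
        "value_accuracy": matched / total if total else 0.0,
    }
```

Under these definitions, a single wrong field fails exact match for the whole document but costs only one leaf in value accuracy, which is consistent with value accuracy being the higher of the two reported numbers.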

4 · GitHub Trending · 2d ago

Hyper-Extract: LLM-powered extraction of graphs, hypergraphs, and spatio-temporal structures from text

Hyper-Extract is a Python library that uses LLMs to transform unstructured text into structured knowledge representations including graphs, hypergraphs, and spatio-temporal extractions via a single command interface. The project is trending on GitHub with 1,723 stars and 124 new stars today. It targets a practical gap in the LLM tooling ecosystem for structured knowledge extraction beyond simple key-value or flat-schema outputs.

Agent and Tool Ecosystem · Hyper-Extract · yifanfeng97