MinerU: Document-to-LLM-Ready Markdown/JSON Conversion Tool
MinerU is an open-source Python tool by OpenDataLab that converts complex documents (PDFs, Office files) into structured Markdown or JSON optimized for LLM and agentic workflows. The repository has accumulated 65,610 GitHub stars with 180 new stars today, indicating sustained community traction. It targets a common preprocessing bottleneck in RAG and agent pipelines.
Related events (8)
PaddleOCR: OCR Toolkit Bridging Documents and LLMs
PaddleOCR is an open-source OCR toolkit built on PaddlePaddle that converts PDFs and images into structured data suitable for LLM pipelines. It supports 100+ languages and is positioned as a document-to-AI bridge. The repository has accumulated nearly 79,000 GitHub stars, with 148 new stars today, indicating sustained community interest.
ARIS: Lightweight autonomous ML research agent using Markdown-only skills
ARIS (Auto-Research-In-Sleep) is an open-source Python project providing lightweight, framework-free Markdown-based skills for autonomous ML research workflows, including cross-model review loops, idea discovery, and experiment automation. It is designed to work with any LLM agent backend, such as Claude Code or Codex. The project has accumulated 11,791 GitHub stars with notable daily traction (+106), suggesting meaningful community adoption.
Open-Source Text Generation & LLM Ecosystem at Hugging Face
Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.
Repomix: Repository-to-Single-File Packing Tool for LLM Ingestion
Repomix is an open-source TypeScript tool that serializes an entire code repository into a single structured file optimized for consumption by LLMs such as Claude, ChatGPT, and Gemini. It addresses the practical problem of feeding large codebases into AI coding assistants and chat interfaces. The project has accumulated over 25,000 GitHub stars with continued daily growth.
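The general idea of packing a repository into one file can be illustrated with a minimal Python sketch. This is not Repomix's implementation or output format: the function name `pack_repo`, the `=== FILE: ... ===` header, and the suffix filter are all hypothetical choices made for illustration.

```python
from pathlib import Path


def pack_repo(root: str, suffixes=(".py", ".md", ".txt")) -> str:
    """Concatenate matching text files under `root` into one string.

    Hypothetical helper: the header format and the suffix filter are
    assumptions for this sketch, not Repomix's actual behavior.
    """
    root_path = Path(root)
    parts = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        rel = path.relative_to(root_path)
        # Skip anything inside hidden directories such as .git
        if any(part.startswith(".") for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        # Prefix each file with its relative path so a reader (or an LLM)
        # can tell where one file ends and the next begins.
        parts.append(f"=== FILE: {rel.as_posix()} ===\n{text}\n")
    return "\n".join(parts)
```

Calling `pack_repo(".")` in a project directory would return one string containing every matching file, each preceded by its path header.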
LLUMI: Fine-Tuning Open-Source LLMs for Mental Health Writing Assistance Using Reddit Community Feedback
LLUMI is a two-component system (a generation model and an improvement model) designed to provide mental health writing assistance using smaller open-source LLMs hosted in privacy-preserving, on-premise environments. The system leverages Reddit community endorsement signals (upvotes/downvotes) to construct preference pairs for SFT and DPO training, then further aligns outputs via human evaluation across readability, empathy, connection, actionability, and safety dimensions. Results show LLUMI achieves performance comparable to proprietary GPT-based models on linguistic and human evaluations, suggesting community-derived preference signals can substitute for expensive expert labeling in sensitive domains.
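As a rough illustration of how endorsement signals could be turned into preference pairs, the sketch below pairs replies to the same post and treats the higher-scored reply as "chosen". The summary does not describe LLUMI's actual pairing rules, so the `Reply` type, the `build_preference_pairs` function, and the `min_gap` threshold are all assumptions; the `chosen`/`rejected` record layout is simply a common convention for DPO-style datasets.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Reply:
    post_id: str  # the post this reply responds to
    text: str
    score: int    # upvotes minus downvotes


def build_preference_pairs(replies, min_gap=5):
    """Pair replies to the same post; the higher-scored one is 'chosen'.

    `min_gap` is an assumed threshold that drops pairs whose scores are
    too close to express a clear community preference.
    """
    by_post = {}
    for reply in replies:
        by_post.setdefault(reply.post_id, []).append(reply)

    pairs = []
    for post_id, group in by_post.items():
        for a, b in combinations(group, 2):
            if abs(a.score - b.score) < min_gap:
                continue
            chosen, rejected = (a, b) if a.score > b.score else (b, a)
            pairs.append(
                {"prompt_id": post_id, "chosen": chosen.text, "rejected": rejected.text}
            )
    return pairs
```

Posts with a single reply, or with replies of similar score, contribute no pairs under this scheme, which trades dataset size for a cleaner preference signal.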
awesome-llm-apps: 100+ Runnable AI Agent & RAG Application Examples
awesome-llm-apps is a curated GitHub repository that collects over 100 deployable AI agent and RAG (Retrieval-Augmented Generation) applications built with LLMs. The collection is designed for practical use: clone, customize, and ship. With 110,915 total stars and 202 added today, it reflects strong community interest in applied LLM tooling.
STAGE pipeline generates source-grounded training data for text-to-JSON extraction
Researchers introduce STAGE (Spreadsheet-grounded Text-to-JSON Artifact GEneration), a data generation pipeline that uses LLMs to synthesize training data for structured extraction from long unstructured documents, validating outputs against underlying spreadsheets. Evaluated on STAGE-Eval, an 851-example benchmark, the pipeline substantially improves Qwen3-4B performance, raising exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%. The work targets a practical bottleneck in enterprise document processing: reliably converting financial filings and clinical records into machine-readable JSON.
Hyper-Extract: LLM-powered extraction of graphs, hypergraphs, and spatio-temporal structures from text
Hyper-Extract is a Python library that uses LLMs to transform unstructured text into structured knowledge representations including graphs, hypergraphs, and spatio-temporal extractions via a single command interface. The project is trending on GitHub with 1,723 stars and 124 new stars today. It targets a practical gap in the LLM tooling ecosystem for structured knowledge extraction beyond simple key-value or flat-schema outputs.
