LOCUS: A large-scale corpus of U.S. local ordinances for legal AI research
Researchers introduce LOCUS, a comprehensive machine-readable corpus of U.S. municipal and county ordinance codes covering 9,239 jurisdictions, with a county-harmonized access layer spanning 2,309 of 3,144 U.S. counties. The corpus was assembled using OCR to handle diverse document formats previously locked in vendor platforms, and is released on HuggingFace alongside ModernBERT-based classifiers for analyzing local law along dimensions like opacity and paternalism. The work addresses a significant gap in legal AI training data, as local ordinances govern large swaths of everyday regulation but have been absent from existing corpora.
Related events (8)
Constitutional AI with Open LLMs
This Hugging Face blog post explores implementing Constitutional AI (CAI) techniques using open-weight language models. The post likely covers how to replicate Anthropic's CAI alignment methodology, in which a set of principles guides model self-critique and revision, without relying on proprietary systems. It represents a practical contribution to democratizing alignment research tooling.
Open-Source Text Generation & LLM Ecosystem at Hugging Face
Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.
StarCoder: A State-of-the-Art LLM for Code
Hugging Face and ServiceNow released StarCoder, a large language model for code trained on permissively licensed data from The Stack dataset. The model targets code generation, completion, and understanding tasks and is positioned as an open-weights alternative to proprietary code models. The release includes model weights, training details, and an associated technical report.
Evaluating Large Language Models Trained on Code
OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation capabilities. The work established foundational methodology for measuring functional correctness of code produced by LLMs using a pass@k metric. This paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research across the field.
LOGOS: A unified autoregressive foundation model for natural science tasks across domains
Researchers introduce LOGOS (Language Of Generative Objects in Science), a generative language model that encodes heterogeneous scientific objects and spatial interactions as discrete token sequences within a single autoregressive framework, avoiding explicit coordinates or geometric neural networks. Models are trained at 1B, 3B, and 8B parameter scales and consistently match or outperform domain-specific baselines across diverse scientific tasks. The work argues that AI for Science should converge on shared architectures and training paradigms with LLMs rather than maintaining a separate technical stack. Model weights are released publicly.
ReproRepo: Scalable LLM agent framework for reproducibility auditing using GitHub issues
ReproRepo is a new framework for evaluating LLM agents on reproducibility auditing of ML research, using naturally occurring GitHub issues as supervision signals rather than costly manual curation. The framework is instantiated on 1,149 recent ML papers from major conferences and benchmarks four frontier model-agent configurations. The best-performing agent (Codex with GPT-5.5) surfaces at least one semantically related human-reported reproduction blocker for ~90% of papers, though exact localization of issues remains a weakness. The work provides a reusable, scalable evaluation harness for this underexplored agentic task.
OpenKB: Open-source LLM knowledge base library gains traction on GitHub
VectifyAI has released OpenKB, an open-source Python library for building LLM-powered knowledge bases. The repository is trending on GitHub with 2,389 total stars and 208 new stars in a single day, suggesting meaningful community interest. No detailed technical description is available from the source snippet.
GGML and llama.cpp Join Hugging Face to Ensure Long-Term Progress of Local AI
GGML and llama.cpp, the foundational open-source libraries enabling efficient local inference of large language models, are joining Hugging Face. The move is intended to secure the long-term development and sustainability of the projects that underpin much of the local/on-device AI ecosystem. It represents a significant consolidation of key open-weights inference infrastructure under the Hugging Face umbrella.
