StarCoder: A State-of-the-Art LLM for Code
Hugging Face and ServiceNow released StarCoder, a large language model for code trained on permissively licensed data from The Stack dataset. The model targets code generation, completion, and understanding tasks and is positioned as an open-weights alternative to proprietary code models. The release includes model weights, training details, and an associated technical report.
Related guides (3)
Related events (8)
StarCoder2 and The Stack v2
Hugging Face and BigCode released StarCoder2, a new family of open code language models trained on The Stack v2, a significantly expanded code dataset. The release includes multiple model sizes and represents a major update to the BigCode open-weights code model lineage. The Stack v2 is a new large-scale permissively licensed code dataset used for training.
Creating a Coding Assistant with StarCoder
This Hugging Face blog post describes the process of building StarChat-Alpha, a conversational coding assistant fine-tuned from the StarCoder large language model. The post covers the instruction-tuning methodology used to adapt StarCoder for chat-style interactions, including dataset preparation and training details. It represents an early example of open-weights coding LLMs being adapted into assistant-style deployments.
StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation
Hugging Face introduces StarCoder2-Instruct, a code generation model fine-tuned via a self-alignment approach that requires no human-annotated instruction data. The method uses the base model itself to generate synthetic instruction-response pairs, which are then filtered and used for supervised fine-tuning. The model and all training data, pipelines, and evaluation code are released under permissive licenses, making it one of the more transparent instruction-tuned code models available.
CodeGemma - Google's Official Code-Focused LLM Release
Google has released CodeGemma, a family of code-specialized large language models, announced via the Hugging Face blog. CodeGemma builds on the Gemma model family and is targeted at code generation and understanding tasks. The release represents Google's continued push into open-weights code LLMs to compete with models like Code Llama and DeepSeek Coder.
Code Llama: Llama 2 learns to code
Meta released Code Llama, a family of code-specialized large language models built on top of Llama 2. The models are available in multiple sizes and variants, including a Python-specialized version and an instruction-following version. Code Llama supports long context windows for handling large codebases and is released as open weights, making it accessible for research and commercial use.
Evaluating Large Language Models Trained on Code
OpenAI published research on evaluating large language models trained on code, introducing the Codex model and the HumanEval benchmark for assessing code generation capabilities. The work established foundational methodology for measuring functional correctness of code produced by LLMs using a pass@k metric. This paper became a landmark reference for code-focused LLM evaluation and influenced subsequent code generation research across the field.
Mistral AI Releases Codestral: 22B Open-Weight Code Generation Model
Mistral AI has released Codestral, a 22B open-weight model explicitly designed for code generation, supporting 80+ programming languages with a 32k context window. The model is available under a non-production license on HuggingFace, with commercial licenses available on request, and is accessible via a dedicated API endpoint (codestral.mistral.ai) free during an 8-week beta. Codestral claims state-of-the-art performance on RepoBench, HumanEval, and fill-in-the-middle benchmarks, outperforming DeepSeek Coder 33B and matching or exceeding GPT-4-Turbo on some language-specific evals. Integrations are available with LlamaIndex, LangChain, Continue.dev, and Tabnine for IDE-based developer workflows.
Open-Source Text Generation & LLM Ecosystem at Hugging Face
Hugging Face published a blog post surveying the open-source LLM ecosystem as of mid-2023, covering text generation models, tooling, and deployment patterns available on the platform. The post highlights the breadth of open-weight models and associated infrastructure for inference and fine-tuning. It serves as a reference overview of the state of open-source LLMs at that point in time.


