dataset
GitHub Code Dataset
dataset · active
github-code-dataset-10e20400 · 1 event · first seen 28d ago
Aliases: GitHub Code Dataset
Recent events (1)
Training CodeParrot from Scratch
Hugging Face published a detailed walkthrough of building CodeParrot, a GPT-2-style language model trained from scratch on GitHub code data. The post covers dataset preparation, tokenizer training, model configuration, and distributed training with the Accelerate library. It serves both as a technical tutorial and as a demonstration of how open-source code-generation models were developed circa late 2021.