GitHub Code Dataset

dataset · active · github-code-dataset-10e20400 · 1 event · first seen 28d ago

Aliases: GitHub Code Dataset

Recent events (1)

3 · Hugging Face Blog · 28d ago

Training CodeParrot from Scratch

Hugging Face published a detailed walkthrough of training CodeParrot, a GPT-2-style language model trained from scratch on GitHub code data. The post covers dataset preparation, tokenizer training, model configuration, and distributed training setup using the Accelerate library. It serves as both a technical tutorial and a demonstration of open-source code generation model development practices circa late 2021.
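As a rough illustration of the dataset preparation step mentioned above, the sketch below deduplicates source files by hashing their whitespace-stripped content and then drops files that look auto-generated or non-code. The function names (`file_hash`, `passes_filters`, `prepare`) and all threshold values are assumptions chosen for this example, not details taken from the post.

```python
import hashlib
import re


def file_hash(content: str) -> str:
    """Hash the content with all whitespace removed, so that copies
    differing only in indentation or line breaks collide."""
    normalized = re.sub(r"\s+", "", content)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def passes_filters(content: str,
                   max_line_len: int = 1000,
                   max_mean_line_len: float = 100.0,
                   min_alnum_frac: float = 0.25) -> bool:
    """Heuristic quality filters; the thresholds are illustrative assumptions."""
    lines = content.splitlines()
    if not lines:
        return False
    lengths = [len(line) for line in lines]
    if max(lengths) > max_line_len:
        return False  # very long lines often indicate minified or generated files
    if sum(lengths) / len(lengths) > max_mean_line_len:
        return False
    alnum = sum(ch.isalnum() for ch in content)
    return alnum / len(content) >= min_alnum_frac


def prepare(files):
    """Return the files that survive exact deduplication and filtering,
    preserving their original order."""
    seen = set()
    kept = []
    for content in files:
        digest = file_hash(content)
        if digest in seen:
            continue  # duplicate of a file already processed
        seen.add(digest)
        if passes_filters(content):
            kept.append(content)
    return kept
```

A real pipeline over a corpus of this size would likely stream the data rather than hold it in a list, but the order of operations (hash, skip duplicates, filter) would stay the same.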