ML//Training//dataset//The Stack
Largest open code dataset, from BigCode (a Hugging Face + ServiceNow collaboration). The code companion to The Pile.
v1: 6.4 TB of permissively licensed code in 358 programming languages. v2: larger, built on the Software Heritage archive, covering 600+ languages with better deduplication.
StarCoder and other open code models are trained on The Stack, which makes it the foundation of open-source codegen.
The "Common Crawl of code": same role for code models that Common Crawl plays for text LLMs.
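
Deduplication comes up above as one of the v2 improvements. As a rough illustration only, the sketch below shows the simplest form, exact-match deduplication by content hash. It is not BigCode's pipeline: the function name `dedup_exact`, the toy `corpus`, and the choice to strip edge whitespace before hashing are all assumptions made for this example.

```python
import hashlib


def dedup_exact(files):
    """Keep the first (path, content) pair seen for each distinct content hash.

    Hypothetical helper for illustration; not part of any BigCode tooling.
    """
    seen = set()
    unique = []
    for path, content in files:
        # Assumption: strip leading/trailing whitespace so trivial padding
        # differences do not produce different hashes.
        digest = hashlib.sha256(content.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, content))
    return unique


# Toy corpus: the first two files differ only by a trailing newline.
corpus = [
    ("a/utils.py", "def add(a, b):\n    return a + b\n"),
    ("b/utils.py", "def add(a, b):\n    return a + b\n\n"),
    ("c/main.py", "print('hello')\n"),
]

kept = dedup_exact(corpus)
kept_paths = [path for path, _ in kept]
```

Exact hashing only catches byte-identical files (after the stripping step). Catching near-copies, such as forks with small edits, needs a similarity-based method, which this sketch does not attempt.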