ML//Training//dataset//The Stack

- Largest open code dataset, built by BigCode (a Hugging Face + ServiceNow collaboration). The code counterpart to The Pile.


v1: 6.4 TB of permissively licensed source code in 358 programming languages. v2: expanded (built on the Software Heritage archive, 600+ languages), with better deduplication.
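To make "permissively licensed" and "deduplication" concrete, here is a minimal sketch of the two filtering ideas on toy data. It is not the BigCode pipeline: the `PERMISSIVE` allowlist, the record fields (`path`, `license`, `content`), and the helper names are all hypothetical, and it only catches exact duplicates after whitespace normalization, whereas the real dataset also applies near-deduplication.

```python
import hashlib

# Hypothetical allowlist for illustration; the real dataset uses its own license list.
PERMISSIVE = {"mit", "apache-2.0", "bsd-3-clause"}


def normalize(source: str) -> str:
    """Drop trailing whitespace and blank lines so trivial variants hash identically."""
    lines = (line.rstrip() for line in source.splitlines())
    return "\n".join(line for line in lines if line)


def filter_and_dedup(records):
    """Keep permissively licensed records, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for rec in records:
        if rec["license"].lower() not in PERMISSIVE:
            continue  # skip anything outside the allowlist
        digest = hashlib.sha256(normalize(rec["content"]).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized content already kept
        seen.add(digest)
        kept.append(rec)
    return kept


# Toy records standing in for files scraped from repositories.
records = [
    {"path": "a.py", "license": "MIT", "content": "print('hi')\n"},
    {"path": "b.py", "license": "MIT", "content": "print('hi')   \n\n"},
    {"path": "c.py", "license": "GPL-3.0", "content": "print('bye')\n"},
    {"path": "d.py", "license": "Apache-2.0", "content": "print('bye')\n"},
]
kept = filter_and_dedup(records)
print([r["path"] for r in kept])  # ['a.py', 'd.py']
```

Here `b.py` is dropped as a whitespace-only variant of `a.py`, and `c.py` is dropped for its copyleft license. Hashing normalized content keeps memory use proportional to the number of unique files, which is one reason exact-match passes are commonly run before more expensive near-duplicate detection.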

StarCoder (trained on v1), StarCoder2 (trained on v2), and other open code models train on The Stack, making it the foundation of open-source code generation.

The "Common Crawl of code": same role for code models that Common Crawl plays for text LLMs.