ML//Training//dataset//The Stack

2026-02-25

- Among the largest open code datasets, built by BigCode (Hugging Face + ServiceNow). The code counterpart to The Pile.

v1: 6.4 TB of permissively licensed source code in 358 programming languages. v2: a larger collection built on the Software Heritage archive, with more languages and improved deduplication.
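To make the deduplication step concrete, here is a minimal sketch of exact plus near-duplicate filtering over source files. It is not the BigCode pipeline: the function names, the 3-token shingles, and the 0.85 threshold are assumptions chosen for illustration, and the pairwise comparison only works for small collections. Pipelines at this scale usually approximate the same Jaccard similarity with MinHash and locality-sensitive hashing.

```python
import hashlib
import re


def exact_key(source: str) -> str:
    """Hash of whitespace-normalized text; equal keys mean exact duplicates."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def shingles(source: str, k: int = 3) -> set:
    """Set of k-token windows used to compare two files."""
    tokens = re.findall(r"\w+|[^\w\s]", source)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two files' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def deduplicate(files, threshold=0.85):
    """Keep the first copy of each file; drop exact and near duplicates.

    The threshold is an illustrative assumption. This loop is O(n^2),
    so it is a teaching sketch, not something to run on terabytes.
    """
    kept = []
    seen_keys = set()
    for source in files:
        key = exact_key(source)
        if key in seen_keys:
            continue  # exact duplicate after whitespace normalization
        if any(jaccard(source, other) >= threshold for other in kept):
            continue  # near duplicate of a file already kept
        seen_keys.add(key)
        kept.append(source)
    return kept
```

Calling `deduplicate(list_of_file_contents)` returns the files in their original order with later duplicates removed. Keeping the first occurrence is an arbitrary choice here; a real pipeline might instead prefer the copy from the most-starred or most recent repository.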

StarCoder and other open code models are trained on The Stack, which makes it a foundation of open-source code generation.

The "Common Crawl of code": same role for code models that Common Crawl plays for text LLMs.