ML//Training//dataset//Common Crawl

Petabytes of raw web crawl data, with new crawls released roughly monthly, free to download.

The backbone of most LLM pre-training data.

The raw crawl is full of spam, duplicates, porn, and boilerplate. Heavy filtering and deduplication are required before it is usable for training.
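
A minimal sketch of what such filtering can look like, assuming the documents are already extracted to plain text. The threshold `MIN_WORDS`, the `BLOCKLIST` contents, and the function names are hypothetical and chosen only for illustration. Real pipelines typically add language identification, near-duplicate detection, and learned quality classifiers on top.

```python
import hashlib
import re

# Hypothetical values, picked only to illustrate the idea.
MIN_WORDS = 5
BLOCKLIST = {"viagra", "casino"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def keep(text: str) -> bool:
    """Heuristic check: long enough and free of blocklisted words."""
    words = normalize(text).split()
    if len(words) < MIN_WORDS:
        return False  # very short text is often navigation or other boilerplate
    return not any(w.strip(".,!?") in BLOCKLIST for w in words)


def filter_documents(docs):
    """Return documents that pass the check, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if not keep(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text was already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text only catches exact duplicates (up to case and whitespace). Near-duplicates, such as pages that differ by a timestamp, need a fuzzy method like MinHash, which is not shown here.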