ML//Training//dataset//Common Crawl
Petabytes of raw web crawl data, released roughly monthly, free to download.
The backbone of most LLM pre-training data.
The raw crawl is full of spam, duplicates, porn, and boilerplate, so heavy filtering is required before training.
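
As a rough illustration of what "heavy filtering" can mean, here is a minimal sketch of three common steps: dropping very short pages, dropping pages that hit a word blocklist, and removing exact duplicates by hashing normalized text. The function names, the `MIN_WORDS` threshold, and the tiny `BLOCKLIST` are assumptions made up for this example. Real pipelines typically add language identification, near-duplicate detection, and quality classifiers on top.

```python
import hashlib
import re

# Hypothetical threshold and blocklist, chosen only for illustration.
MIN_WORDS = 5
BLOCKLIST = {"viagra", "casino"}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_documents(docs):
    """Return docs that pass length, blocklist, and exact-duplicate checks."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        words = norm.split()
        if len(words) < MIN_WORDS:
            continue  # too short: likely navigation text or boilerplate
        if any(w.strip(".,!?") in BLOCKLIST for w in words):
            continue  # contains a blocklisted word: likely spam
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

Hashing the normalized text only catches exact duplicates. Pages that differ by a few words would need a near-duplicate method such as MinHash, which this sketch does not attempt.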