We've just hit 1M open datasets on the Hugging Face Hub ๐ Open models need open data. Today we hit that milestone, together with the most incredible community in AI! ๐ค Onwards to the next million ๐ https://t.co/PV6knP3XlJ
Hugging Face Hits One Million Open Datasets to Fuel Open Source AI
ยท Updated
- Dataset count
- 1,004,440 (1M milestone)
- Modalities
- Text, image, audio, video, 3D, geospatial, tabular, time-series, document
- Formats
- Parquet, JSON, CSV, Arrow, webdataset, and more
- Size range
- Under 1K rows to over 1 trillion rows
- Access
- Hugging Face Datasets library
The Hub indexes datasets across multiple modalities (text, image, audio, video, document, 3D, geospatial, tabular, time-series) and standard formats including Parquet, JSON, CSV, Arrow, and webdataset. Trending sort surfaces what the community is actively using, and dataset size filters span under 1K rows to over 1 trillion rows. This range reflects how the Hub now hosts everything from small evaluation sets to web-scale pretraining corpora.
You can access these datasets immediately through the Hugging Face Datasets library to train, fine-tune (adapting a model for a specific task), or benchmark models. Individual datasets carry their own licensing terms โ commercial use, redistribution, and attribution requirements vary per repository. Hugging Face provides the indexing and hosting; license compliance remains the user's responsibility.
Still wondering? A few quick answers below.






