We've just hit 1M open datasets on the Hugging Face Hub. Open models need open data. Today we hit that milestone, together with the most incredible community in AI! 🤗 Onwards to the next million https://t.co/PV6knP3XlJ
Hugging Face Hits One Million Open Datasets to Fuel Open Source AI
Hugging Face · Updated
Hugging Face reached one million open datasets on its Hub. The dataset library spans multiple modalities including text, image, audio, video, geospatial, and tabular data, available in formats like Parquet, JSON, and CSV. The platform frames the milestone as community-driven, with the tagline that open models need open data.
- Dataset count: 1,004,440 (1M milestone)
- Modalities: text, image, audio, video, 3D, geospatial, tabular, time-series, document
- Formats: Parquet, JSON, CSV, Arrow, webdataset, and more
- Size range: under 1K rows to over 1 trillion rows
- Access: Hugging Face Datasets library
The Hub indexes datasets across multiple modalities (text, image, audio, video, document, 3D, geospatial, tabular, time-series) and standard formats including Parquet, JSON, CSV, Arrow, and webdataset. Trending sort surfaces what the community is actively using, and dataset size filters span under 1K rows to over 1 trillion rows. This range reflects how the Hub now hosts everything from small evaluation sets to web-scale pretraining corpora.
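The modality, format, and size filters described above can also be applied programmatically. The sketch below is a minimal illustration, not an official recipe: the helper names `hub_filter_tags` and `find_datasets` are hypothetical, and the `key:value` tag spellings (such as `modality:text` or `format:parquet`) are assumptions based on how filter tags appear on the Hub, so check them against the Hub before relying on them. The only library call used is `list_datasets` from `huggingface_hub`, with its `filter` and `limit` parameters.

```python
from typing import Iterable, List, Optional


def hub_filter_tags(modality: Optional[str] = None,
                    file_format: Optional[str] = None,
                    size_category: Optional[str] = None) -> List[str]:
    """Build "key:value" filter tags from optional facets.

    The tag keys used here are assumptions about the Hub's tag naming.
    """
    facets = [
        ("modality", modality),
        ("format", file_format),
        ("size_categories", size_category),
    ]
    # Keep only the facets the caller actually supplied.
    return [f"{key}:{value}" for key, value in facets if value]


def find_datasets(tags: Iterable[str], limit: int = 5) -> List[str]:
    """Return up to `limit` dataset repo ids for the given tags (needs network)."""
    # Imported lazily so the pure helper above works without the package installed.
    from huggingface_hub import list_datasets

    return [info.id for info in list_datasets(filter=list(tags), limit=limit)]
```

With network access and `huggingface_hub` installed, `find_datasets(hub_filter_tags(modality="text", file_format="parquet"))` would return a handful of matching repository ids; the exact results depend on the Hub's contents at query time.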
You can access these datasets immediately through the Hugging Face Datasets library to train, fine-tune (adapt a model to a specific task), or benchmark models. Individual datasets carry their own licensing terms: commercial use, redistribution, and attribution requirements vary per repository. Hugging Face provides the indexing and hosting; license compliance remains the user's responsibility.
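As a rough sketch of that workflow, the snippet below checks which licenses a repository declares before streaming a few rows from it. The helper names (`license_ids`, `declared_licenses`, `stream_first_rows`) are hypothetical, and reading licenses from `license:` prefixed tags is an assumption about how the Hub exposes card metadata. A missing tag does not mean a dataset is unrestricted, so the dataset card itself remains the reference. The library calls used are `dataset_info` from `huggingface_hub` and `load_dataset` with `streaming=True` from `datasets`.

```python
from typing import Iterable, List


def license_ids(tags: Iterable[str]) -> List[str]:
    """Extract license identifiers from "license:<id>" style tags."""
    prefix = "license:"
    return [tag[len(prefix):] for tag in tags if tag.startswith(prefix)]


def declared_licenses(repo_id: str) -> List[str]:
    """Look up the license tags a dataset repository declares (needs network)."""
    # Lazy import: only required when actually querying the Hub.
    from huggingface_hub import dataset_info

    info = dataset_info(repo_id)
    return license_ids(info.tags or [])


def stream_first_rows(repo_id: str, split: str = "train", n: int = 3) -> list:
    """Stream the first `n` rows without downloading the full dataset."""
    from datasets import load_dataset

    # streaming=True yields an iterable dataset that is read lazily.
    dataset = load_dataset(repo_id, split=split, streaming=True)
    return list(dataset.take(n))
```

Calling `declared_licenses` on a repository id first, and only then `stream_first_rows`, keeps the license check ahead of any data access. Streaming is a sensible default here because the Hub's largest datasets are far too big to download in full just to inspect them.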
Every HeadsUpAI update is written based on its original source and reviewed before it's published.




