HeadsUpAI

Hugging Face Hits One Million Open Datasets to Fuel Open Source AI


Hugging Face, the central platform for open-source machine learning, has reached one million public datasets on its Hub. The team announced the milestone with a simple message: open models need open data. The milestone builds on infrastructure such as Hugging Face Storage Buckets, which handle the high-throughput data workflows behind large-scale training.
Dataset count: 1,004,440 (1M milestone)
Modalities: Text, image, audio, video, 3D, geospatial, tabular, time-series, document
Formats: Parquet, JSON, CSV, Arrow, webdataset, and more
Size range: Under 1K rows to over 1 trillion rows
Access: Hugging Face Datasets library

The Hub indexes datasets across multiple modalities (text, image, audio, video, document, 3D, geospatial, tabular, time-series) and standard formats including Parquet, JSON, CSV, Arrow, and webdataset. Trending sort surfaces what the community is actively using, and dataset size filters span under 1K rows to over 1 trillion rows. This range reflects how the Hub now hosts everything from small evaluation sets to web-scale pretraining corpora.

You can access these datasets immediately through the Hugging Face Datasets library to train, fine-tune (adapting a model for a specific task), or benchmark models. Individual datasets carry their own licensing terms: commercial use, redistribution, and attribution requirements vary per repository. Hugging Face provides the indexing and hosting; license compliance remains the user's responsibility.
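Because licensing varies per repository, it can help to check a dataset's declared license before using it. The sketch below is one possible approach, not an official workflow. It assumes the `huggingface_hub` package is installed and that the license appears among the repository's tags in the form `license:<identifier>`. The helper names `licenses_from_tags` and `fetch_dataset_licenses` are hypothetical, introduced only for illustration.

```python
from typing import Iterable, List


def licenses_from_tags(tags: Iterable[str]) -> List[str]:
    """Return the license identifiers found in a list of Hub-style tags.

    Assumption: licenses are tagged as "license:<identifier>".
    """
    prefix = "license:"
    return [tag[len(prefix):] for tag in tags if tag.startswith(prefix)]


def fetch_dataset_licenses(repo_id: str) -> List[str]:
    """Look up a dataset's declared licenses on the Hub (needs network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub.
    from huggingface_hub import HfApi

    info = HfApi().dataset_info(repo_id)
    return licenses_from_tags(info.tags or [])


if __name__ == "__main__":
    # Offline demonstration with made-up tags, not taken from a real dataset.
    example_tags = ["license:apache-2.0", "modality:text", "format:parquet"]
    print(licenses_from_tags(example_tags))  # prints ['apache-2.0']
```

An empty result means no license tag was found, not that the dataset is free to use. The repository's dataset card remains the place to confirm the actual terms.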

Hugging Face (@huggingface) on X:

We've just hit 1M open datasets on the Hugging Face Hub 🎉 Open models need open data. Today we hit that milestone, together with the most incredible community in AI! 🤗 Onwards to the next million 🚀 https://t.co/PV6knP3XlJ

75 retweets, 617 likes

Still wondering? A few quick answers below.

The Hugging Face Hub is an open-source platform that hosts machine learning models, datasets, and tools for the global AI community. It serves as a collaborative repository where researchers and developers can share and access the raw materials needed to build, fine-tune, and evaluate AI systems across text, image, audio, video, and other modalities.

As of May 2026, the Hugging Face Hub hosts over one million open datasets. The collection spans modalities from text and image to audio, video, 3D, geospatial, and tabular data. Sizes range from small evaluation sets under one thousand rows to web-scale corpora with more than one trillion rows.

The datasets on the Hugging Face Hub are publicly accessible, but individual datasets may carry different licensing terms. Some are released under permissive open licenses while others impose restrictions on commercial use, redistribution, or attribution. Users should check the specific repository for each dataset to understand the exact permissions before using it in their work.

Developers can use the Hugging Face Datasets library to load, stream, and process datasets for training, fine-tuning, and benchmarking machine learning models. The library handles common formats like Parquet, JSON, and Arrow, and supports streaming for datasets too large to fit in memory. This makes the Hub usable for both small experiments and production-scale training runs.
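To make the streaming point concrete, here is a minimal sketch of previewing a few rows without downloading a full dataset. It assumes the `datasets` package is installed and that the chosen repository has a `train` split. The helper names `take_rows` and `stream_preview` are hypothetical, and `repo_id` stands in for whichever dataset you pick.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def take_rows(rows: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Collect the first n rows from any iterable of rows.

    islice stops pulling after n items, so a streamed dataset is not
    read beyond what the preview needs.
    """
    return list(islice(rows, n))


def stream_preview(repo_id: str, split: str = "train", n: int = 5) -> List[Dict[str, Any]]:
    """Stream a dataset from the Hub and return its first n rows (needs network access)."""
    # Imported lazily so take_rows can be used without the datasets package.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of downloading the whole split.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return take_rows(stream, n)


if __name__ == "__main__":
    # Offline demonstration with an in-memory stand-in for a streamed dataset.
    fake_stream = ({"id": i, "text": f"row {i}"} for i in range(1_000_000))
    print(take_rows(fake_stream, 2))
```

Calling `stream_preview` with a real repository id would fetch only the rows it returns, which is what keeps the approach practical for corpora too large to fit in memory.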
