Hugging Face Hits One Million Open Datasets to Fuel Open Source AI

Hugging Face

May 13, 2026 · Updated Jun 8, 2026

Hugging Face reached one million open datasets on its Hub. The dataset library spans multiple modalities including text, image, audio, video, geospatial, and tabular data, available in formats like Parquet, JSON, and CSV. The platform frames the milestone as community-driven, with the tagline that open models need open data.

Hugging Face, the central platform for open-source machine learning, has reached one million public datasets on its Hub. The team announced the milestone with a simple message: open models need open data. The milestone builds on infrastructure such as Hugging Face Storage Buckets, which handle the high-throughput data workflows that power large-scale training.

Dataset count: 1,004,440 (1M milestone)
Modalities: Text, image, audio, video, 3D, geospatial, tabular, time-series, document
Formats: Parquet, JSON, CSV, Arrow, webdataset, and more
Size range: Under 1K rows to over 1 trillion rows
Access: Hugging Face Datasets library

The Hub indexes datasets across multiple modalities (text, image, audio, video, document, 3D, geospatial, tabular, time-series) and standard formats including Parquet, JSON, CSV, Arrow, and webdataset. Trending sort surfaces what the community is actively using, and dataset size filters span under 1K rows to over 1 trillion rows. This range reflects how the Hub now hosts everything from small evaluation sets to web-scale pretraining corpora.
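To make that size range concrete, here is a minimal sketch of how a row count could be mapped to a size bucket. The labels follow the `size_categories` tags commonly seen on dataset cards (such as `n<1K` and `n>1T`). The function name `size_category` and the handling of exact boundaries are assumptions made for illustration, not Hub code.

```python
# Exclusive upper bounds paired with bucket labels, smallest first.
# Labels mirror the size_categories tags seen on dataset cards; the
# boundary handling (e.g. exactly 1,000 rows) is an assumption.
_BUCKETS = [
    (10**3, "n<1K"),
    (10**4, "1K<n<10K"),
    (10**5, "10K<n<100K"),
    (10**6, "100K<n<1M"),
    (10**7, "1M<n<10M"),
    (10**8, "10M<n<100M"),
    (10**9, "100M<n<1B"),
    (10**10, "1B<n<10B"),
    (10**11, "10B<n<100B"),
    (10**12, "100B<n<1T"),
]


def size_category(num_rows: int) -> str:
    """Return the size bucket label for a dataset with num_rows rows."""
    if num_rows < 0:
        raise ValueError("num_rows must be non-negative")
    for upper_bound, label in _BUCKETS:
        if num_rows < upper_bound:
            return label
    # Anything at or above one trillion rows falls in the top bucket.
    return "n>1T"
```

A small evaluation set of 500 rows lands in `n<1K`, while a web-scale pretraining corpus above one trillion rows lands in `n>1T`.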

You can access these datasets immediately through the Hugging Face Datasets library to train, fine-tune (adapt a model to a specific task), or benchmark models. Individual datasets carry their own licensing terms: commercial use, redistribution, and attribution requirements vary per repository. Hugging Face provides the indexing and hosting; license compliance remains the user's responsibility.
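Because licensing varies per repository, it can help to read the license from a dataset's metadata before using it. The sketch below assumes the license is exposed as a `license:<id>` entry in the repository's tag list, which is how Hub metadata commonly appears. The helper names `licenses_from_tags` and `fetch_dataset_licenses` are hypothetical, and a missing tag does not mean a dataset is unrestricted.

```python
from typing import Iterable, List

LICENSE_PREFIX = "license:"


def licenses_from_tags(tags: Iterable[str]) -> List[str]:
    """Extract license identifiers from tags such as 'license:mit'."""
    return [
        tag[len(LICENSE_PREFIX):]
        for tag in tags
        if tag.startswith(LICENSE_PREFIX)
    ]


def fetch_dataset_licenses(repo_id: str) -> List[str]:
    """Look up a dataset's license tags on the Hub (requires network access)."""
    # Imported lazily so the pure helper above works without the dependency.
    from huggingface_hub import HfApi

    info = HfApi().dataset_info(repo_id)
    # Guard against repositories that report no tags at all.
    return licenses_from_tags(info.tags or [])
```

An empty result means no license tag was found, in which case the dataset card itself is the place to check before any commercial use or redistribution.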

View the full update on huggingface.co

Hugging Face

@huggingface · May 12

We've just hit 1M open datasets on the Hugging Face Hub 🎉 Open models need open data. Today we hit that milestone, together with the most incredible community in AI! 🤗 Onwards to the next million 🚀 https://t.co/PV6knP3XlJ

Still wondering? A few quick answers below.

What is the Hugging Face Hub?

The Hugging Face Hub is a platform for open-source machine learning that hosts models, datasets, and tools for the global AI community. It serves as a collaborative repository where researchers and developers can share and access the raw materials needed to build, fine-tune, and evaluate AI systems across text, image, audio, video, and other modalities.

How many datasets are currently available on Hugging Face?

As of May 2026, the Hugging Face Hub hosts over one million open datasets. The library spans modalities from text and image to audio, video, 3D, geospatial, and tabular data. Sizes range from small evaluation sets under one thousand rows to web-scale corpora with more than one trillion rows.

Are the datasets on Hugging Face open source?

The datasets on the Hugging Face Hub are publicly accessible, but individual datasets may carry different licensing terms. Some are released under permissive open licenses while others impose restrictions on commercial use, redistribution, or attribution. Users should check the specific repository for each dataset to understand the exact permissions before using it in their work.

What can developers do with Hugging Face datasets?

Developers can use the Hugging Face Datasets library to load, stream, and process datasets for training, fine-tuning, and benchmarking machine learning models. The library handles common formats like Parquet, JSON, and Arrow, and supports streaming for datasets too large to fit in memory. This makes the Hub usable for both small experiments and production-scale training runs.
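As a rough illustration of streaming, the sketch below previews the first few rows of a dataset without downloading it in full. It relies on `load_dataset(..., streaming=True)` from the Datasets library, which returns an iterable that yields rows on demand. The helper names `first_rows` and `stream_preview` and the default split name `"train"` are assumptions for illustration; the repository id is left to the caller.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def first_rows(rows: Iterable[Dict[str, Any]], limit: int) -> Iterator[Dict[str, Any]]:
    """Lazily yield at most `limit` rows from any row iterable."""
    return islice(rows, limit)


def stream_preview(repo_id: str, split: str = "train", limit: int = 5) -> List[Dict[str, Any]]:
    """Preview the first rows of a Hub dataset (requires network access)."""
    # Imported lazily so first_rows can be used without the dependency.
    from datasets import load_dataset

    # streaming=True yields rows on demand instead of materializing the
    # whole dataset on disk or in memory first.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(first_rows(stream, limit))
```

Because `first_rows` stops after `limit` items, the same pattern works whether the source holds a thousand rows or far more than fit in memory.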

Keep reading

Hugging Face Launches Storage Buckets for High-Throughput ML Workflows

Hugging Face shipped Storage Buckets — mutable, S3-like object storage on the Hub for checkpoints, agent traces, and processed datasets. Xet chunk-based deduplication means successive checkpoints skip bytes already stored, cutting bandwidth and transfer time.

Algorithmic Research Group Releases s2orc-safety to Standardize 16,806 AI Safety Papers

Algorithmic Research Group · Apr 5

Algorithmic Research Group released s2orc-safety on Hugging Face, a curated collection of 16,806 academic papers focused on AI safety topics like jailbreaks and red teaming. By enriching these papers with normalized metrics and code links, the dataset turns fragmented academic research into a machine-readable knowledge layer for safety engineering.

OpenRouter Reaches 13B Daily Tokens as Automated Model Routing Scales

OpenRouter · Jun 4

OpenRouter's automated routing engines now process 13 billion tokens daily, with the coding-specific Pareto Router hitting 1 billion. The milestone coincides with new granular controls that let users manually balance model performance against token costs. This shift highlights how developers are moving from static model selection to dynamic, algorithmic orchestration to manage AI expenses.

NVIDIA Launches Nemotron Coalition to Build Open Frontier Models With Eight Partners

NVIDIA · Mar 18

NVIDIA announced the Nemotron Coalition at GTC 2026, uniting Mistral AI, Cursor, Perplexity, LangChain, and four other labs to build open frontier foundation models. The first base model, codeveloped with Mistral AI, will be open-sourced as the Nemotron 4 foundation.
