HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models

This repository contains the data, scripts, and figures accompanying the paper:

HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models
(Under submission)

🌍 Overview

Hugging Face has become a central platform for sharing NLP models. However, inconsistent and incomplete metadata make it difficult to find suitable models.
HF-NLP10K introduces a structured dataset of over 10,000 NLP models enriched with 33 metadata fields, combining Hugging Face API data with LLM-based parsing of free-text model cards.

📦 Repository Structure

data/
├── raw/                 # Raw extracted data from Hugging Face API
│   ├── Data-Stats.csv.gz
│   └── S1-All_NLP.csv.gz
└── final/               # Final structured dataset
    └── HLT-NLP10K.csv

src/
├── preprocessing/       # Scripts for dataset construction (S1–S8)
├── analysis/            # Exploratory data analysis scripts
└── figures/             # Figure generation scripts (for paper visuals)

outputs/
└── figures/             # Generated figures and plots

requirements.txt          # Python dependencies
README.md                 # Overview and usage instructions
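
The raw files under data/raw/ are gzip-compressed CSVs, so they can be inspected without unpacking them first. The sketch below uses only the Python standard library; the helper name `peek_csv_gz` is hypothetical and is not part of this repository, and UTF-8 encoding is an assumption.

```python
import csv
import gzip


def peek_csv_gz(path, n_rows=5):
    """Return (header, first n_rows data rows) from a gzip-compressed CSV.

    Hypothetical helper for illustration; UTF-8 encoding is assumed.
    """
    # "rt" opens the compressed stream in text mode so csv can read it directly.
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # zip with range stops after n_rows without reading the whole file.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

For example, `peek_csv_gz("data/raw/S1-All_NLP.csv.gz")` would return the column names and the first five rows of the raw API extract.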

🧪 Dataset Highlights

- Covers 10,022 NLP models with at least 1,000 downloads
- Includes 33 metadata fields, such as:
  - Pipeline tag, license, training data, evaluation metrics, base model
  - Download counts, fine-tuning compatibility, model card sections (limitations, use cases, etc.)
- Combines API enrichment with LLM-based model card parsing
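
Because the dataset is motivated by incomplete metadata, a natural first check is how often each field is empty. The following is a minimal sketch using the standard library only; the function names `load_rows` and `missing_share` are hypothetical, and treating blank or whitespace-only cells as missing is an assumption about how gaps are encoded in the CSV.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def missing_share(rows, fields):
    """Return {field: fraction of rows where the field is blank or absent}.

    Assumption: a missing value appears as an empty or whitespace-only cell.
    """
    rows = list(rows)
    if not rows:
        return {field: 0.0 for field in fields}
    missing = Counter()
    for row in rows:
        for field in fields:
            value = row.get(field)
            if value is None or not value.strip():
                missing[field] += 1
    return {field: missing[field] / len(rows) for field in fields}
```

A possible use is `rows = load_rows("data/final/HLT-NLP10K.csv")` followed by `missing_share(rows, list(rows[0].keys()))`, which yields the share of empty cells for every column in the file.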

🔧 Reproducing the Dataset

Run the scripts sequentially from S1 to S8:

cd src/preprocessing

python S1-All_NLP_Models.py
python S2-Filter_NLP_Models.py
...
python S8-UniqueRows_FinalDataset.py

The final dataset (HLT-NLP10K.csv) will be available under data/final/.
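
Running the steps by hand can be replaced with a small driver. This is a sketch under stated assumptions, not a script shipped with the repository: it assumes every step file in src/preprocessing/ is named `S<number>-<name>.py` and that each script expects to be run from that directory. The names `ordered_scripts` and `run_pipeline` are hypothetical.

```python
import re
import subprocess
import sys
from pathlib import Path

# Assumed naming scheme: S<number>-<anything>.py
STEP_PATTERN = re.compile(r"^S(\d+)-.*\.py$")


def ordered_scripts(names):
    """Return the step scripts sorted by the number after the leading 'S'.

    Names that do not match the assumed scheme are ignored.
    """
    steps = []
    for name in names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


def run_pipeline(script_dir="src/preprocessing"):
    """Run each step in order and stop at the first failing script."""
    directory = Path(script_dir)
    names = [path.name for path in directory.glob("S*.py")]
    for name in ordered_scripts(names):
        print(f"Running {name}")
        # check=True raises CalledProcessError if a step exits non-zero.
        subprocess.run([sys.executable, name], cwd=directory, check=True)
```

Sorting on the parsed integer rather than on the file name keeps the order correct even if a step number ever reaches two digits. Calling `run_pipeline()` from the repository root would execute the steps in sequence.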

📊 Figures

The figures presented in the paper can be reproduced using:

cd src/figures
python Figure1_NLPGrowth.py
python Figure2_No.ofDownloads.py
...
python Figure5_MissingSections.py

Resulting images will be stored under outputs/figures/.

⚙️ Requirements

Install dependencies before running the scripts:

pip install -r requirements.txt

🧩 Citation

If you use this dataset or the accompanying scripts, please cite:

HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models. Under submission.
