This repository contains the data, scripts, and figures accompanying the paper:
HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models
(Under submission)
Hugging Face has become a central platform for sharing NLP models. However, inconsistent and incomplete metadata make it difficult to find suitable models.
HF-NLP10K introduces a structured dataset of over 10,000 NLP models enriched with 33 metadata fields, combining Hugging Face API data with LLM-based parsing of free-text model cards.
data/
├── raw/                  # Raw extracted data from Hugging Face API
│   ├── Data-Stats.csv.gz
│   └── S1-All_NLP.csv.gz
└── final/                # Final structured dataset
    └── HLT-NLP10K.csv
src/
├── preprocessing/        # Scripts for dataset construction (S1–S8)
├── analysis/             # Exploratory data analysis scripts
└── figures/              # Figure generation scripts (for paper visuals)
outputs/
└── figures/              # Generated figures and plots
requirements.txt # Python dependencies
README.md # Overview and usage instructions
- Covers 10,022 NLP models with at least 1,000 downloads
- Includes 33 metadata fields, such as:
- Pipeline tag, license, training data, evaluation metrics, base model
- Download counts, fine-tuning compatibility, model card sections (limitations, use cases, etc.)
- Combines API enrichment with LLM-based model card parsing
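Because metadata completeness is the central concern of the dataset, a quick per-field check can be useful after downloading it. The following is a minimal sketch using only the Python standard library, not the analysis code from the paper. It assumes that a missing value is stored as an empty cell, which may not match how the dataset encodes missing metadata; the helper names `open_text` and `missing_share` are hypothetical.

```python
import csv
import gzip
from collections import Counter
from pathlib import Path

# Free-text fields may be long, so raise the csv module's per-field size limit.
csv.field_size_limit(2**31 - 1)


def open_text(path):
    """Open a CSV file for reading, transparently handling a .gz suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def missing_share(path):
    """Return {field: fraction of rows in which that field is empty}.

    Assumption: "missing" means an empty or whitespace-only cell.
    """
    empty = Counter()
    total = 0
    with open_text(path) as handle:
        reader = csv.DictReader(handle)
        fields = reader.fieldnames or []
        for row in reader:
            total += 1
            for field in fields:
                if not (row.get(field) or "").strip():
                    empty[field] += 1
    return {field: (empty[field] / total if total else 0.0) for field in fields}


if __name__ == "__main__":
    target = Path("data/final/HLT-NLP10K.csv")
    if target.exists():
        shares = missing_share(target)
        # Print the fields with the most empty cells first.
        for field, share in sorted(shares.items(), key=lambda kv: -kv[1]):
            print(f"{field}: {share:.1%} empty")
```

Run from the repository root, this prints one line per metadata field. The same function should also accept the gzipped files under data/raw/, since `open_text` switches to `gzip.open` for a `.gz` suffix.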
Run the scripts sequentially from S1 to S8:
cd src/preprocessing
python S1-All_NLP_Models.py
python S2-Filter_NLP_Models.py
...
python S8-UniqueRows_FinalDataset.py
The final dataset (HLT-NLP10K.csv) will be available under data/final/.
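For unattended runs, the sequential order can be enforced by a small driver script. The following is a sketch rather than part of the repository: it assumes every stage script is named `S<number>-<name>.py`, as in the examples above, and the helper names `ordered_stage_scripts` and `run_stages` are hypothetical. Scripts are discovered by pattern rather than listed by name, so the stages elided above do not need to be spelled out.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches stage scripts such as "S1-All_NLP_Models.py" and captures the stage number.
STAGE_PATTERN = re.compile(r"^S(\d+)-.*\.py$")


def ordered_stage_scripts(directory):
    """Return the S<number>-*.py scripts in `directory`, sorted by stage number."""
    staged = []
    for path in Path(directory).iterdir():
        match = STAGE_PATTERN.match(path.name)
        if match and path.is_file():
            staged.append((int(match.group(1)), path))
    # Sorting on the integer stage number places S10 after S2, not after S1.
    return [path for _, path in sorted(staged)]


def run_stages(directory):
    """Run each stage from inside `directory`, stopping at the first failure."""
    for script in ordered_stage_scripts(directory):
        print(f"Running {script.name}")
        # check=True raises CalledProcessError if a stage exits with a non-zero status.
        subprocess.run([sys.executable, script.name], cwd=directory, check=True)


if __name__ == "__main__":
    stage_dir = Path("src/preprocessing")
    if stage_dir.is_dir():
        run_stages(stage_dir)
```

Each script is launched with `cwd` set to the script directory, mirroring the `cd src/preprocessing` step, so relative paths inside the scripts should resolve as they do in a manual run. Stopping at the first failure avoids running later stages on incomplete intermediate files.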
The figures presented in the paper can be reproduced using:
cd src/figures
python Figure1_NLPGrowth.py
python Figure2_No.ofDownloads.py
...
python Figure5_MissingSections.py
Resulting images will be stored under outputs/figures/.
Install dependencies before running the scripts:
pip install -r requirements.txt
If you use this dataset or scripts, please cite:
HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models. Under submission.