faerber-lab/HF-NLP10K

HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models

This repository contains the data, scripts, and figures accompanying the paper:

HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models
(Under submission)


🌍 Overview

Hugging Face has become a central platform for sharing NLP models. However, inconsistent and incomplete metadata make it difficult to find suitable models.
HF-NLP10K introduces a structured dataset of over 10,000 NLP models enriched with 33 metadata fields, combining Hugging Face API data with LLM-based parsing of free-text model cards.


📦 Repository Structure

data/
├── raw/                 # Raw extracted data from Hugging Face API
│   ├── Data-Stats.csv.gz
│   └── S1-All_NLP.csv.gz
└── final/               # Final structured dataset
    └── HLT-NLP10K.csv

src/
├── preprocessing/       # Scripts for dataset construction (S1–S8)
├── analysis/            # Exploratory data analysis scripts
└── figures/             # Figure generation scripts (for paper visuals)

outputs/
└── figures/             # Generated figures and plots

requirements.txt          # Python dependencies
README.md                 # Overview and usage instructions

🧪 Dataset Highlights

  • Covers 10,022 NLP models with at least 1,000 downloads
  • Includes 33 metadata fields, such as:
    • Pipeline tag, license, training data, evaluation metrics, base model
    • Download counts, fine-tuning compatibility, model card sections (limitations, use cases, etc.)
  • Combines API enrichment with LLM-based model card parsing
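
Since incomplete metadata is the motivation for the dataset, a natural first check after downloading it is how often each field is empty. The following is a minimal sketch using only the Python standard library. The function name `missing_share` and the column names in the sample are hypothetical and are not taken from the released file; the real column names should be read from the header of data/final/HLT-NLP10K.csv.

```python
import csv
import io
from collections import Counter


def missing_share(csv_file):
    """Return {column: fraction of rows whose value is empty} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    empty = Counter()
    total = 0
    for row in reader:
        total += 1
        for column in reader.fieldnames:
            value = row.get(column)
            # Treat absent cells and whitespace-only cells as missing.
            if value is None or value.strip() == "":
                empty[column] += 1
    if total == 0:
        return {}
    return {column: empty[column] / total for column in reader.fieldnames}


# Tiny in-memory stand-in for the dataset; these column names are hypothetical.
sample = io.StringIO(
    "model_id,license,base_model\n"
    "org/model-a,mit,\n"
    "org/model-b,,bert-base-uncased\n"
)
shares = missing_share(sample)
print(shares)
```

To apply the same function to the real dataset, open the CSV with `open("data/final/HLT-NLP10K.csv", newline="", encoding="utf-8")` and pass the file object in. The UTF-8 encoding is an assumption, as is the convention that missing values are stored as empty cells.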

🔧 Reproducing the Dataset

Run the scripts sequentially from S1 to S8:

cd src/preprocessing

python S1-All_NLP_Models.py
python S2-Filter_NLP_Models.py
...
python S8-UniqueRows_FinalDataset.py

The final dataset (HLT-NLP10K.csv) will be available under data/final/.
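
Running eight scripts by hand is easy to get out of order, so the sequence can be automated. The sketch below is not part of the repository. It assumes every step script is named `S<number>-<anything>.py`, as the two names shown above suggest, and that each script can be started from inside src/preprocessing. The helper names `ordered_steps` and `run_steps` are hypothetical.

```python
import re
import subprocess
import sys
from pathlib import Path


def ordered_steps(directory, pattern=r"^S(\d+)-.*\.py$"):
    """Return the step scripts in `directory`, sorted by their step number."""
    step = re.compile(pattern)
    found = []
    for path in Path(directory).iterdir():
        match = step.match(path.name)
        if match:
            # Sort numerically so that a step 10 would come after step 2.
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def run_steps(directory, pattern=r"^S(\d+)-.*\.py$"):
    """Run each step with the current interpreter; stop at the first failure."""
    for script in ordered_steps(directory, pattern):
        print(f"Running {script.name}")
        # cwd=directory mirrors the `cd src/preprocessing` step in the instructions.
        subprocess.run([sys.executable, script.name], cwd=directory, check=True)
```

Calling `run_steps("src/preprocessing")` from the repository root would then execute the steps in numeric order and raise `subprocess.CalledProcessError` as soon as one of them exits with an error. The same helper could be pointed at the figure scripts with `run_steps("src/figures", r"^Figure(\d+)_.*\.py$")`, again assuming that naming pattern holds for all of them.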


📊 Figures

The figures presented in the paper can be reproduced using:

cd src/figures
python Figure1_NLPGrowth.py
python Figure2_No.ofDownloads.py
...
python Figure5_MissingSections.py

Resulting images will be stored under outputs/figures/.


⚙️ Requirements

Install dependencies before running the scripts:

pip install -r requirements.txt

🧩 Citation

If you use this dataset or scripts, please cite:

HF-NLP10K: A Dataset and Metadata Analysis of 10,000+ Hugging Face NLP Models. Under submission.

