Predict distributed LLM training time before you run. This tool estimates the wall-clock time for training large language models across multiple GPUs using 3D parallelism (pipeline, tensor, and data parallelism), helping you plan capacity and compare parallelization strategies without expensive trial runs.
```bash
pip install estimate-train-time  # Coming soon to PyPI
```

Note: The PyPI package is coming soon. For now, install directly from the repository:

```bash
git clone https://github.com/DebarghaG/estimate-train-time.git
cd estimate-train-time
pip install -e .
```

```bash
# List available example configurations
estimate-train-time list-examples
# Run prediction with a bundled example (Llama 7B on A100s)
estimate-train-time predict --example llemma_7b_4_2_2_P
```

Output:

```
Estimated time cost of current training config: 9480819.17 us
= 9480.82 ms
= 9.4808 s
```
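The reported figure is the predicted cost of one training batch in microseconds (consistent with the `one_batch_predict` API shown below). A minimal sketch of extrapolating it to a full run; the step count here is a hypothetical value, not something the tool reports:

```python
# Extrapolating total wall-clock time from the per-batch estimate above.
# The step count is hypothetical; substitute your own training schedule.
per_batch_us = 9_480_819.17   # per-batch estimate reported by the tool, in microseconds
total_steps = 100_000         # hypothetical number of optimizer steps

total_seconds = per_batch_us * total_steps / 1e6
print(f"Projected training time: {total_seconds / 3600:.1f} hours "
      f"({total_seconds / 86400:.2f} days)")
```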
- 3D Parallelism Support: Pipeline, tensor (model), and data parallelism (see the sketch after this list for how the three degrees combine)
- Pre-trained Regressors: Bundled models for NVIDIA A100 and GH200 GPUs
- No GPU Required: Predictions run on CPU using trained regressors
- Extensible: Add your own GPU profiles and cluster configurations
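When defining a parallelization layout, the pipeline, tensor, and data-parallel degrees must multiply to the total GPU count. The helper below is a purely illustrative sanity check and is not part of this package's API:

```python
# Illustrative only: relationship between 3D-parallelism degrees and GPU count.
def check_3d_layout(num_gpus: int, pp: int, tp: int, dp: int) -> None:
    """Raise if pipeline * tensor * data parallel degrees != total GPUs."""
    product = pp * tp * dp
    if product != num_gpus:
        raise ValueError(f"PP*TP*DP = {product}, but the cluster has {num_gpus} GPUs")

# Example: 16 GPUs as a 4-stage pipeline with 2-way tensor and 2-way data parallelism.
check_3d_layout(num_gpus=16, pp=4, tp=2, dp=2)
```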
- Getting Started - Installation and first prediction
- Core Concepts - Understanding distributed training estimation
- Configuration Reference - Config file parameters
- CLI Reference - Command-line options
- Python API - Programmatic usage
- Examples - Usage examples and custom configurations
- Advanced - Kernel sampling and extending the tool
```python
from estimate_train_time import one_batch_predict
# Predict training time from a config file
time_us = one_batch_predict("path/to/config.yml")
print(f"One batch takes {time_us / 1e6:.2f} seconds")- Python 3.8+
- pandas, numpy, scikit-learn, xgboost, pyyaml, ijson, joblib
For GPU sampling (optional): torch, flash-attn, deepspeed
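Building on the `one_batch_predict` API above, here is a sketch of comparing candidate parallelization strategies programmatically; the config paths are placeholders for files you would write yourself:

```python
from estimate_train_time import one_batch_predict

# Placeholder config files, each describing a different parallelization strategy.
candidate_configs = [
    "configs/pp4_tp2_dp2.yml",
    "configs/pp2_tp4_dp2.yml",
    "configs/pp2_tp2_dp4.yml",
]

# Rank candidates by predicted per-batch time (lower is better).
ranked = sorted((one_batch_predict(path), path) for path in candidate_configs)
for time_us, path in ranked:
    print(f"{path}: {time_us / 1e6:.2f} s per batch")
```

Because predictions run on CPU, a sweep like this is cheap compared with launching trial training runs.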
This work is supported by the National Science Foundation (NSF)-funded AI Institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE), award OAC 2112606.
If you use this tool in your research, please cite our paper, accepted to HiPC 2025 (proceedings forthcoming):
```bibtex
@article{zhang2025efficient,
title={Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM},
author={Zhang, Biyao and Zheng, Mingkai and Ganguly, Debargha and Zhang, Xuecen and Singh, Vikash and Chaudhary, Vipin and Zhang, Zhao},
journal={arXiv preprint arXiv:2509.22832},
year={2025}
}
```

MIT License - see LICENSE for details.