This repository is the official implementation of our ICML 2025 paper Synthetic Text Generation for Training Large Language Models via Gradient Matching.
conda create -n gradmm python=3.11
conda activate gradmm
pip install -r requirements.txt
cd gradmm
./scripts/admm.sh
./scripts/admm_dp.sh
For filtering, please refer to the notebook gradmm/Filtering.ipynb. Adjust the settings in the Parameters section, then run all cells in the notebook.
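The two generation scripts above (./scripts/admm.sh and ./scripts/admm_dp.sh) could be wrapped in a small helper that stops early when a script is missing. This is only a minimal sketch under stated assumptions: `run_step` is a hypothetical name that is not part of this repository, and the usage line assumes the current directory is gradmm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of this repository): run a script only if it
# exists and is executable; otherwise print an error and return non-zero.
run_step() {
    if [ ! -x "$1" ]; then
        echo "missing or non-executable script: $1" >&2
        return 1
    fi
    echo "running $1"
    "$1"
}

# Assumed usage from inside the gradmm directory:
#   run_step ./scripts/admm.sh && run_step ./scripts/admm_dp.sh
```

Chaining the calls with `&&` means the second script only starts if the first one exits successfully.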
- Obtain the synthetic data paths by running the "Print fine-tuning paths" section in the notebook gradmm/Finetuning.ipynb.
- Insert the retrieved paths into scripts/query_ft.sh, then run the following commands:
cd addax
./scripts/query_ft.sh
- To collect the fine-tuning results, paste the fine-tuning paths into the "Collect fine-tuning results" section in the notebook gradmm/Finetuning.ipynb and run the corresponding cells.
If you have any questions related to the code or the paper, feel free to email Dang Nguyen (nguyentuanhaidang@gmail.com). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!
Please cite our paper if you find the repo helpful in your work:
@article{nguyen2025synthetic,
title={Synthetic Text Generation for Training Large Language Models via Gradient Matching},
author={Nguyen*, Dang and Li*, Zeman and Bateni, Mohammadhossein and Mirrokni, Vahab and Razaviyayn, Meisam and Mirzasoleiman, Baharan},
journal={International Conference on Machine Learning (ICML)},
year={2025}
}
The structure of this repository is largely based on the official implementations of lamp and Addax. We are grateful to their authors for open-sourcing their code.