Official code for the papers:
- Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models (ICLR 2026)
- Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization (NeurIPS 2024)
generation/ # Synthetic data generation using GLIDE diffusion model
├── glide/ # GLIDE-based text-to-image generation scripts
├── calulate_fid/ # FID score computation and image selection
├── mix.py # Mix synthetic and original data
└── README.md # Generation-specific setup and usage instructions
training/ # Image classification training
├── train.py # Main training script
├── aus.py # Auto-upsampling (AUS) module
├── sam.py # Sharpness-Aware Minimization (SAM) optimizer
├── models.py # Model definitions (ResNet, ViT, etc.)
├── data_utils.py # Dataset loading and augmentation utilities
├── train_params.yaml # Full list of configurable training parameters
└── sh/ # Example shell scripts for training runs
requirements.txt # Python package requirements (training)
global_params.yaml # Global paths and W&B configuration
We release our synthetic CIFAR-10 datasets generated with GLIDE on HuggingFace:
After downloading, place the dataset directory under data_path as configured in global_params.yaml.
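To confirm the downloaded dataset sits where training expects it, a check along the following lines can help. This is a minimal sketch, not code from this repo: the helper `resolve_dataset_dir` and the directory name `synthetic_cifar10` are hypothetical, and `data_path` stands for whatever data path you configured in global_params.yaml.

```python
from pathlib import Path


def resolve_dataset_dir(data_path: str, dataset_name: str) -> Path:
    """Return <data_path>/<dataset_name>, raising if the directory is missing.

    Hypothetical helper: `dataset_name` is the name of the directory you
    downloaded; it is not a name defined by this repo.
    """
    dataset_dir = Path(data_path).expanduser() / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(
            f"Expected dataset directory at {dataset_dir}; "
            "check the data path set in global_params.yaml."
        )
    return dataset_dir


# Example call (placeholder path and directory name):
# resolve_dataset_dir("/path/to/your/data", "synthetic_cifar10")
```

Failing early with a clear message is usually easier to debug than a missing-file error raised deep inside a data loader.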
From the top-level directory:
conda create -n tada python=3.10 -y
conda activate tada
pip install -r requirements.txt
For synthetic data generation, follow the setup instructions in generation/README.md.
Edit global_params.yaml to set your local data path and W&B team:
team_path: /path/to/your/data
wandb_team: your_wandb_team
Training runs are organized by exp_name (maps to a W&B project) and run_name (maps to a W&B run and local directory). A full list of training parameters and their descriptions is in training/train_params.yaml.
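The naming scheme described above can be illustrated with a short sketch. It parses the two keys shown for global_params.yaml and derives a W&B-style identifier and a local run directory from exp_name and run_name. Everything beyond the key names is an assumption made for illustration: the helpers `parse_flat_yaml`, `wandb_run_id`, and `local_run_dir`, the `<team_path>/<exp_name>/<run_name>` layout, and the example names `cifar10` and `resnet18_tada` are hypothetical, and the repo's own code may read the file and lay out directories differently.

```python
from pathlib import PurePosixPath


def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines (a tiny subset of YAML, for illustration)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        params[key.strip()] = value.strip()
    return params


def wandb_run_id(params: dict[str, str], exp_name: str, run_name: str) -> str:
    # exp_name plays the role of the W&B project, run_name the W&B run.
    return f"{params['wandb_team']}/{exp_name}/{run_name}"


def local_run_dir(params: dict[str, str], exp_name: str, run_name: str) -> PurePosixPath:
    # Assumed layout: <team_path>/<exp_name>/<run_name>
    return PurePosixPath(params["team_path"]) / exp_name / run_name


example = """\
team_path: /path/to/your/data
wandb_team: your_wandb_team
"""
params = parse_flat_yaml(example)
print(wandb_run_id(params, "cifar10", "resnet18_tada"))
# your_wandb_team/cifar10/resnet18_tada
print(local_run_dir(params, "cifar10", "resnet18_tada"))
# /path/to/your/data/cifar10/resnet18_tada
```

Keeping the same exp_name/run_name pair for both the W&B run and the local directory makes it straightforward to match logged metrics with the files on disk.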
cd training/sh
./cifar10_original.sh
./cifar10_useful.sh
./cifar10_tada.sh
Each script trains ResNet-18 on the (augmented) CIFAR-10 dataset.
If you have questions about the code or the papers, feel free to email Dang Nguyen (nguyentuanhaidang@gmail.com). If you run into problems using the code or want to report a bug, please open an issue and describe the problem in detail so we can help you better and faster.
Please cite our papers if you find the repo helpful in your work:
@article{nguyen2024make,
title = {Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization},
author = {Nguyen, Dang and Haddad, Paymon and Gan, Eric and Mirzasoleiman, Baharan},
journal = {Advances in Neural Information Processing Systems},
year = {2024}
}
@article{nguyen2025we,
title = {Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models},
author = {Nguyen*, Dang and Li*, Jiping and Zheng*, Jinghao and Mirzasoleiman, Baharan},
journal = {International Conference on Learning Representations (ICLR)},
year = {2026}
}
We thank the authors of the following open-source projects:
- GLIDE — diffusion model used for synthetic data generation
- understanding-sam — SAM optimizer implementation