TADA

Official code for the papers:

Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models (ICLR 2026)
Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization (NeurIPS 2024)

Repository Structure

generation/           # Synthetic data generation using GLIDE diffusion model
├── glide/            # GLIDE-based text-to-image generation scripts
├── calulate_fid/     # FID score computation and image selection
├── mix.py            # Mix synthetic and original data
└── README.md         # Generation-specific setup and usage instructions
training/             # Image classification training
├── train.py          # Main training script
├── aus.py            # Auto-upsampling (AUS) module
├── sam.py            # Sharpness-Aware Minimization (SAM) optimizer
├── models.py         # Model definitions (ResNet, ViT, etc.)
├── data_utils.py     # Dataset loading and augmentation utilities
├── train_params.yaml # Full list of configurable training parameters
└── sh/               # Example shell scripts for training runs
requirements.txt      # Python package requirements (training)
global_params.yaml    # Global paths and W&B configuration

Datasets

Synthetic CIFAR-10

We release our synthetic CIFAR-10 datasets generated with GLIDE on HuggingFace:

After downloading, place the dataset directory under data_path as configured in global_params.yaml.

Environment

Training

From the top-level directory:

conda create -n tada python=3.10 -y
conda activate tada
pip install -r requirements.txt

Generation

For synthetic data generation, follow the setup instructions in generation/README.md.

Configuration

Edit global_params.yaml to set your local data path and W&B team:

team_path: /path/to/your/data
wandb_team: your_wandb_team
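
Before launching a run, it can help to confirm that the template values above were actually replaced. The snippet below is a minimal sketch and not part of this repository: `check_global_params` is a hypothetical helper that takes the already-parsed contents of global_params.yaml as a dict (for example, the result of PyYAML's `yaml.safe_load`) and reports keys that are missing or still set to the placeholder values shown above.

```python
# Hypothetical helper for illustration; not part of the TADA codebase.
# The keys and placeholder values are copied from the template above.
PLACEHOLDERS = {
    "team_path": "/path/to/your/data",
    "wandb_team": "your_wandb_team",
}


def check_global_params(params):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, placeholder in PLACEHOLDERS.items():
        value = params.get(key)
        if value is None or str(value).strip() == "":
            problems.append(f"{key} is missing or empty")
        elif str(value) == placeholder:
            problems.append(f"{key} still has the template value {placeholder!r}")
    return problems


if __name__ == "__main__":
    # Example: team_path was left unchanged, wandb_team was edited.
    example = {"team_path": "/path/to/your/data", "wandb_team": "my_team"}
    for problem in check_global_params(example):
        print(problem)
```

The function only inspects the dict it is given, so it makes no assumption about how the training code itself reads the file.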

Training runs are organized by exp_name (maps to a W&B project) and run_name (maps to a W&B run and local directory). A full list of training parameters and their descriptions is in training/train_params.yaml.
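
To make the naming scheme concrete, here is a small sketch of how the two names could be resolved. It is an illustration only: `RunLayout` and `resolve_run` are hypothetical names, and the local layout `<team_path>/<exp_name>/<run_name>` is an assumption, since the text above only states that run_name maps to a local directory. Check training/train.py for the actual behavior.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class RunLayout:
    """Where a run is assumed to live in W&B and on disk (hypothetical)."""
    wandb_entity: str   # W&B team, from wandb_team in global_params.yaml
    wandb_project: str  # exp_name maps to a W&B project
    wandb_run: str      # run_name maps to a W&B run
    local_dir: Path     # assumed layout: team_path / exp_name / run_name


def resolve_run(team_path, wandb_team, exp_name, run_name):
    """Build a RunLayout from the global config values and the two run names."""
    for label, name in (("exp_name", exp_name), ("run_name", run_name)):
        # Reject names that would be empty or escape the intended directory.
        if not name or "/" in name or "\\" in name:
            raise ValueError(f"{label} must be a non-empty name without path separators")
    return RunLayout(
        wandb_entity=wandb_team,
        wandb_project=exp_name,
        wandb_run=run_name,
        local_dir=Path(team_path) / exp_name / run_name,
    )


if __name__ == "__main__":
    layout = resolve_run("/path/to/your/data", "your_wandb_team", "cifar10", "tada_seed0")
    print(layout.wandb_project, layout.wandb_run, layout.local_dir)
```

Keeping each run under its own exp_name directory means runs from different experiments can reuse the same run_name without colliding, which is the main reason for the assumed nesting.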

Basic Usage

cd training/sh
./cifar10_original.sh
./cifar10_useful.sh
./cifar10_tada.sh

Each script trains a ResNet-18 on the (augmented) CIFAR-10 dataset.

Bugs or Questions?

If you have any questions about the code or the papers, feel free to email Dang Nguyen (nguyentuanhaidang@gmail.com). If you run into problems using the code or want to report a bug, please open an issue. Describe the problem in detail so we can help you better and faster!

Citation

Please cite our papers if you find the repo helpful in your work:

@article{nguyen2024make,
  title = {Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization},
  author = {Nguyen, Dang and Haddad, Paymon and Gan, Eric and Mirzasoleiman, Baharan},
  journal = {Advances in Neural Information Processing Systems},
  year = {2024}
}
@article{nguyen2025we,
  title = {Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models},
  author = {Nguyen*, Dang and Li*, Jiping and Zheng*, Jinghao and Mirzasoleiman, Baharan},
  journal = {International Conference on Learning Representations (ICLR)},
  year = {2026}
}

Acknowledgements

We thank the authors of the following open-source projects:

GLIDE — diffusion model used for synthetic data generation
understanding-sam — SAM optimizer implementation
