hsgser/TADA

TADA

Python 3.10 | PyTorch 2.2.1 | License: MIT

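Official code for the papers:

  • Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization (NeurIPS 2024)
  • Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models (ICLR 2026)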

Repository Structure

generation/           # Synthetic data generation using GLIDE diffusion model
├── glide/            # GLIDE-based text-to-image generation scripts
├── calulate_fid/     # FID score computation and image selection
├── mix.py            # Mix synthetic and original data
└── README.md         # Generation-specific setup and usage instructions
training/             # Image classification training
├── train.py          # Main training script
├── aus.py            # Auto-upsampling (AUS) module
├── sam.py            # Sharpness-Aware Minimization (SAM) optimizer
├── models.py         # Model definitions (ResNet, ViT, etc.)
├── data_utils.py     # Dataset loading and augmentation utilities
├── train_params.yaml # Full list of configurable training parameters
└── sh/               # Example shell scripts for training runs
requirements.txt      # Python package requirements (training)
global_params.yaml    # Global paths and W&B configuration

Datasets

Synthetic CIFAR-10

We release our synthetic CIFAR-10 datasets generated with GLIDE on HuggingFace:

After downloading, place the dataset directory under data_path as configured in global_params.yaml.
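As a quick sanity check before training, the placement described above can be verified with a few lines of Python. This is only a sketch: it assumes the configuration has already been loaded into a dictionary that exposes a data_path entry, and the helper name resolve_dataset_dir and the directory name synthetic_cifar10 are hypothetical, not names taken from this repository.

```python
from pathlib import Path


def resolve_dataset_dir(params: dict, dataset_name: str) -> Path:
    """Return <data_path>/<dataset_name>, raising if the directory is missing."""
    # "data_path" is the entry named in the text above; adjust the key if your
    # copy of global_params.yaml uses a different one.
    root = Path(params["data_path"]).expanduser()
    dataset_dir = root / dataset_name
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Expected dataset directory at {dataset_dir}")
    return dataset_dir


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    params = {"data_path": "/path/to/your/data"}
    try:
        print(resolve_dataset_dir(params, "synthetic_cifar10"))
    except FileNotFoundError as err:
        print(err)
```

Failing early with the full expected path tends to be easier to debug than a missing-file error raised later inside a data loader.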

Environment

Training

From the top-level directory:

conda create -n tada python=3.10 -y
conda activate tada
pip install -r requirements.txt
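
After installation, it may help to confirm that the active environment matches the versions this README targets (Python 3.10 and PyTorch 2.2.1). The snippet below is a minimal sketch using only the standard library; the function name environment_report is hypothetical and not part of this repository.

```python
import sys
from importlib import metadata


def environment_report(expected_python=(3, 10), packages=("torch",)):
    """Collect the interpreter version and installed versions of key packages."""
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": sys.version_info[:2] == tuple(expected_python),
    }
    for name in packages:
        try:
            # Reads the installed distribution's version without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report


if __name__ == "__main__":
    print(environment_report())
```

A value of None for torch means the package was not found in the active environment, which usually indicates that the conda environment was not activated before running pip.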

Generation

For synthetic data generation, follow the setup instructions in generation/README.md.

Configuration

Edit global_params.yaml to set your local data path and W&B team:

team_path: /path/to/your/data
wandb_team: your_wandb_team

Training runs are organized by exp_name (maps to a W&B project) and run_name (maps to a W&B run and local directory). A full list of training parameters and their descriptions is in training/train_params.yaml.
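
The mapping described above can be sketched as follows. This is an illustration rather than the repository's implementation: the function name run_layout is hypothetical, the local layout <team_path>/<exp_name>/<run_name> is an assumption, and the dictionary keys simply mirror the entity, project, and name arguments that wandb.init accepts.

```python
from pathlib import Path


def run_layout(global_params: dict, exp_name: str, run_name: str) -> dict:
    """Derive W&B identifiers and a local run directory for one training run."""
    # Shaped like the keyword arguments of wandb.init: exp_name becomes the
    # project and run_name becomes the run's display name.
    wandb_kwargs = {
        "entity": global_params["wandb_team"],
        "project": exp_name,
        "name": run_name,
    }
    # Assumed local layout (not confirmed by the repository):
    # <team_path>/<exp_name>/<run_name>
    run_dir = Path(global_params["team_path"]) / exp_name / run_name
    return {"wandb": wandb_kwargs, "run_dir": run_dir}


if __name__ == "__main__":
    # Placeholder values matching the global_params.yaml example above.
    params = {"team_path": "/path/to/your/data", "wandb_team": "your_wandb_team"}
    layout = run_layout(params, exp_name="cifar10", run_name="resnet18_original")
    print(layout["wandb"])
    print(layout["run_dir"])
```

Keeping the W&B project and the local directory keyed on the same two names makes it straightforward to locate the checkpoints that belong to a given run in the dashboard.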

Basic Usage

cd training/sh
./cifar10_original.sh
./cifar10_useful.sh
./cifar10_tada.sh

Each script trains ResNet-18 on the original or augmented CIFAR-10 dataset.

Bugs or Questions?

If you have any questions about the code or the papers, feel free to email Dang Nguyen (nguyentuanhaidang@gmail.com). If you run into problems using the code or want to report a bug, please open an issue and describe the problem in detail so we can help you better and faster.

Citation

Please cite our papers if you find the repo helpful in your work:

@article{nguyen2024make,
  title = {Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization},
  author = {Nguyen, Dang and Haddad, Paymon and Gan, Eric and Mirzasoleiman, Baharan},
  journal = {Advances in Neural Information Processing Systems},
  year = {2024}
}
@article{nguyen2025we,
  title = {Do We Need All the Synthetic Data? Towards Targeted Synthetic Image Augmentation via Diffusion Models},
  author = {Nguyen*, Dang and Li*, Jiping and Zheng*, Jinghao and Mirzasoleiman, Baharan},
  journal = {International Conference on Learning Representations (ICLR)},
  year = {2026}
}

Acknowledgements

We thank the authors of the following open-source projects:

  • GLIDE — diffusion model used for synthetic data generation
  • understanding-sam — SAM optimizer implementation
