A framework for processing and analyzing Electronic Health Records (EHR) data using transformer-based models.
BONSAI helps researchers and data scientists preprocess EHR data, train models, and generate outcomes for downstream clinical predictions and analyses.
git clone https://github.com/FGA-DIKU/BONSAI.git
pip install -e .
cp template_env .env
You can adapt the paths in .env to point to alternative directories for custom configs, input data, or the location where model checkpoints should be saved.
1. Create data.
   python bonsai/run/create_data.py --config-name examples/example_data dataset=correlated_MEDS_data
   We use the example_data.yaml config, which transforms the correlated_MEDS_data in the example_data folder into the training format. The result is saved in data/correlated_MEDS_data.
2. Pretrain model.
   python bonsai/run/pretrain.py --config-name examples/example_pretrain dataset=correlated_MEDS_data
   We use the example_pretrain.yaml config for a short, resource-light training run that can be executed locally, and point it to the dataset created in step 1.
3. Create outcomes (labels for finetuning).
   python bonsai/run/create_outcome.py --config-name examples/example_outcome1 dataset=correlated_MEDS_data
   We use the example_outcome1.yaml config, which processes the target outcomes for the correlated_MEDS_data in the example_data folder and saves them in the outcome file data/correlated_MEDS_data/outcomes/examples/example_outcome1.parquet.
4. Finetune model.
   python bonsai/run/finetune.py --config-name examples/example_finetune dataset=correlated_MEDS_data outcome=examples/example_outcome1 pretrain_path=/path/to/your/pretrained/checkpoints/best.ckpt
   We use the example_finetune.yaml config for a short, resource-light training run that can be executed locally, and point it to the dataset created in step 1, the checkpoint created in step 2, and the labels created in step 3.
5. Train model (without pretraining).
   python bonsai/run/train.py --config-name examples/example_finetune dataset=correlated_MEDS_data outcome=examples/example_outcome1
   We use the example_finetune.yaml config for a short, resource-light training run without pretraining that can be executed locally, and point it to the dataset created in step 1 and the labels created in step 3.
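The first four steps above can also be chained from a small driver script. The following is a minimal sketch, not part of BONSAI: the helper names build_pipeline and run_pipeline are hypothetical, the script paths and overrides are copied from the commands above, and the pretrained checkpoint path must be supplied by you because its location depends on your run.

```python
import subprocess
import sys

DATASET = "correlated_MEDS_data"
OUTCOME = "examples/example_outcome1"


def build_pipeline(pretrain_ckpt):
    """Return the example commands for steps 1-4 as argument lists, in order.

    `pretrain_ckpt` is the path to the checkpoint written by the pretraining
    step (an assumption: you look it up after step 2 has finished).
    """
    py = sys.executable  # use the interpreter that runs this script
    dataset = f"dataset={DATASET}"
    return [
        [py, "bonsai/run/create_data.py", "--config-name", "examples/example_data", dataset],
        [py, "bonsai/run/pretrain.py", "--config-name", "examples/example_pretrain", dataset],
        [py, "bonsai/run/create_outcome.py", "--config-name", OUTCOME, dataset],
        [
            py, "bonsai/run/finetune.py", "--config-name", "examples/example_finetune",
            dataset, f"outcome={OUTCOME}", f"pretrain_path={pretrain_ckpt}",
        ],
    ]


def run_pipeline(pretrain_ckpt):
    """Run each step in order and stop at the first failing command."""
    for cmd in build_pipeline(pretrain_ckpt):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling run_pipeline("/path/to/your/pretrained/checkpoints/best.ckpt") from the repository root would execute the steps sequentially. In practice you may prefer to run steps 1 and 2 first, locate the checkpoint, and only then start the finetuning step.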
To use the old pre-lightning version use:
git checkout tags/pre-lightning
To resume training, supply the path to the checkpoint and the run_id of the original run. Without the run_id, training will continue under a new ID in a new directory.
python bonsai/run/pretrain.py --config-name examples/example_pretrain dataset=correlated_MEDS_data paths.ckpt_path=/path/to/run_id_1234/ckpt.last run_id=1234
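As a rough illustration of the two resume modes, the sketch below assembles the argument list for the command above. The function name resume_args is hypothetical; the override keys paths.ckpt_path and run_id are taken from the command shown, and nothing else about the BONSAI configuration is assumed.

```python
def resume_args(ckpt_path, run_id=None):
    """Build the arguments for resuming the example pretraining run.

    With `run_id`, training continues in the original run directory;
    without it, the README states that a new ID and directory are used.
    """
    args = [
        "bonsai/run/pretrain.py",
        "--config-name", "examples/example_pretrain",
        "dataset=correlated_MEDS_data",
        f"paths.ckpt_path={ckpt_path}",
    ]
    if run_id is not None:
        args.append(f"run_id={run_id}")
    return args


# Resume in place (same run directory):
same_run = resume_args("/path/to/run_id_1234/ckpt.last", run_id="1234")
# Resume from the checkpoint but start a new run directory:
new_run = resume_args("/path/to/run_id_1234/ckpt.last")
```

Either list can be passed to subprocess.run together with the Python interpreter, for example subprocess.run([sys.executable, *same_run], check=True).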
We provide a standardized script for generating outcomes in create_outcome.py. If it doesn't accommodate your needs, you can also provide your own outcome file.
An outcome file requires the following 5 columns saved as a .parquet file:
- A subject_id to define the person of interest
- A split string (e.g. "train", "tuning", "held_out") that denotes which split the given row belongs to (i.e. we make one file for all splits)
- An outcome_date that denotes when the outcome happened (can be null)
- An index_date that denotes from when we consider the prediction (can't be null)
- A censor_date that denotes the data cutoff (can't be null)
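If you write your own outcome file, a small check of the schema before saving can catch mistakes early. The sketch below uses pandas and is only an illustration under stated assumptions: the helper names, the integer subject_id values, and the datetime column types are not prescribed by this README, and writing parquet with pandas additionally requires an engine such as pyarrow.

```python
import pandas as pd

REQUIRED_COLUMNS = ["subject_id", "split", "outcome_date", "index_date", "censor_date"]


def build_example_outcomes():
    """Create a tiny illustrative outcome table (values are made up)."""
    return pd.DataFrame(
        {
            "subject_id": [1, 2, 3],  # assumption: integer IDs
            "split": ["train", "tuning", "held_out"],
            # outcome_date may be null: subject 2 has no outcome
            "outcome_date": pd.to_datetime(["2020-03-01", None, "2021-07-15"]),
            "index_date": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
            "censor_date": pd.to_datetime(["2020-12-31", "2020-12-31", "2021-12-31"]),
        }
    )


def validate_outcomes(df):
    """Raise ValueError if a required column is missing or wrongly null."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for col in ("index_date", "censor_date"):
        if df[col].isna().any():
            raise ValueError(f"{col} must not contain nulls")


def save_outcomes(df, path):
    """Validate the table, then write all splits to a single parquet file."""
    validate_outcomes(df)
    df.to_parquet(path, index=False)  # needs pyarrow or fastparquet installed
```

For example, save_outcomes(build_example_outcomes(), "my_outcome.parquet") would write one file containing all three splits; the file name here is a placeholder.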
We welcome contributions! Please see our Contributing Guidelines for details on:
- Code style and formatting
- Testing requirements
- Pull request process
- Issue reporting
This project is licensed under the MIT License - see the LICENSE file for details.
If you use BONSAI in your research, please cite the following paper:
@article{Montgomery2025,
author = {Montgomery, A. and others},
title = {BONSAI: A framework for processing and analysing {E}lectronic {H}ealth {R}ecords ({EHR}) data using transformer-based models},
journal = {Journal of Open Source Software},
volume = {10},
number = {114},
pages = {8869},
year = {2025},
doi = {10.21105/joss.08869}
}