Instruction-Tuning LLaMA for Synthetic Medical Note Generation: Bridging Data Privacy and Utility in Downstream Tasks
This repository contains the code accompanying the paper "Instruction-Tuning LLaMA for Synthetic Medical Note Generation in Swedish and English". The purpose of this study was to generate synthetic Swedish and English discharge summaries that preserve privacy while mimicking the task-relevant properties of the real data needed to build high-performing downstream systems. The code in this repository is structured into chapters mirroring the paper. We recommend reading the corresponding chapters alongside the code to better understand the purpose and desired outcome of each part. Note that you need to obtain access to the datasets and several language models from external sources (always explained in the relevant chapters); this can take a while and requires planning ahead. Finally, don't forget that you are working with sensitive medical data that requires special care. For example, you are not allowed to upload any of this data to online APIs, meaning that all models used in this study must be downloaded and saved locally on your device/server before use.
This paper is based on the Master’s thesis "Instruction-Tuning LLaMA for Synthetic Medical Note Generation: Bridging Data Privacy and Utility in Downstream Tasks," written by Lotta Kiefer. The thesis was supervised by Hercules Dalianis at Stockholm University, and by Dietrich Klakow and Jesujoba Alabi at Saarland University. The thesis can be downloaded here.
The code in this repository builds on several other studies and includes code from other repositories. Specifically, it includes:
- The Axolotl project in `generation/axolotl` for the fine-tuning process. The code of the original repository was not changed, only complemented with some additional files, specified in more detail in `generation/axolotl/changes.md`.
- The medical-coding-reproducibility repository in `utility/plm_icd/medical_coding/` for building medical coding models. Some slight modifications were made to the source code, specified in more detail in `utility/plm_icd/changes.md`.
- The ROUGE-5 implementation of mdpi2021-textgen in `privacy/rouge_5`. The changes made to this code are specified in more detail in `privacy/rouge_5/changes.md`.
There are four different Docker containers you will need to run for different parts of this code. It is always specified which container to run for which part of the code and which folders to mount. Always make sure you're running the correct container! Pull the following two images from Docker Hub:
```bash
docker pull winglian/axolotl:main-20241202
docker pull vllm/vllm-openai:v0.6.4.post1
```
Build the remaining two images from the Dockerfiles saved in the `docker` folder by running
```bash
docker build -t medcode:edin -f docker/DockerfileMedcode .
docker build -t general:latest -f docker/DockerfileGeneral .
```
Due to the sensitivity of the data, you might need to transfer these images to a server with no internet access. In this case, save an image by running
```bash
docker save -o /path/for/generated/tar/file image_name
```
transfer the `.tar` file to your server, and load it by running
```bash
docker load -i /path/to/image/tar/file
```
Now you're ready to run all Docker containers needed in this study and start your experiments.
Set up the environment by running the medical coding Docker image, mounting the `data`, `utility/plm_icd`, and `preprocessing` folders:
```bash
sudo docker run --gpus all -v utility/plm_icd:/medical_coding -v data:/data -v preprocessing:/preprocessing -it medcode:edin bash
```
- Get access to the MIMIC-IV dataset on PhysioNet. Note that you need to complete a training course that takes a couple of hours before access is granted.
- Then download MIMIC-IV and MIMIC-IV-NOTE into the folders specified in `/data/mimic`.
- Change `DOWNLOAD_DIRECTORY_MIMICIV` and `DOWNLOAD_DIRECTORY_MIMICIV_NOTE` in `medical_coding/src/settings.py` to your respective paths.
- Run `python /medical_coding/prepare_data/prepare_mimiciv.py` to obtain the new file `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` alongside the splits for the dataset.
- Run `python /preprocessing/mimic/preprocessing.py --input_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --output_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` to preprocess the medical notes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/mimic/transcriptions.py --icd_file /preprocessing/mimic/ICD10_descriptions_mimic.csv --mimic_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --output_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` to obtain textual descriptions of the ICD-10 codes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/mimic/json_splits.py --notes_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --splits_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10_split.feather` to obtain the JSON files required to create prompts in the ALPACA format for MIMIC-S, MIMIC-L, and MIMIC-E.
- Get access to SEPR II by contacting the Health Bank at Stockholm University. Create a `.feather` file similar to `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` by substituting the `text` and `target` columns with the SEPR data, and store it as `/medical_coding/files/data/sepr/sepr_icd10.feather`.
- Run `python /preprocessing/sepr/transcriptions_swed.py --icd_file /preprocessing/sepr/ICD10_descriptions_sepr.csv --sepr_file /medical_coding/files/data/sepr/sepr_icd10.feather --output_file /medical_coding/files/data/sepr/sepr_icd10.feather` to obtain textual descriptions of the ICD-10 codes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/sepr/json_splits_swed.py --notes_file /medical_coding/files/data/sepr/sepr_icd10.feather --splits_file /medical_coding/files/data/sepr/sepr_icd10_split.feather` to obtain the JSON files required to create prompts in the ALPACA format for SEPR-S, SEPR-L, and SEPR-E (a sketch of the expected record format follows below).
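As referenced above, the ALPACA format stores each training example as a JSON object with `instruction`, `input`, and `output` fields. A minimal sketch of one such record (the instruction wording and field contents below are illustrative assumptions, not taken from the repository):

```python
import json

# Illustrative ALPACA-style record; the actual instruction wording and the
# ICD-10 descriptions are produced by the json_splits scripts above.
record = {
    "instruction": "Write a discharge summary for a patient with the following diagnoses.",
    "input": "Essential (primary) hypertension; Type 2 diabetes mellitus without complications",
    "output": "The patient was admitted with ...",  # the (real or synthetic) note text
}

with open("example_alpaca.json", "w") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```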
Download LLaMA-3.1-8B (or any other model you want to use) from Hugging Face by running
```bash
export HF_TOKEN=your_access_token
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir-use-symlinks False --local-dir path/to/local/dir
```
and save the model locally in the `models` folder.
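Alternatively, a minimal Python sketch using the `huggingface_hub` library (the target directory is an assumption; adjust it to your setup):

```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_TOKEN"] = ""  # insert your Hugging Face access token here

# Download the full model snapshot into the local models folder.
snapshot_download(repo_id="meta-llama/Llama-3.1-8B", local_dir="models/Llama-3.1-8B")
```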
Set up the environment by running the Axolotl Docker image, mounting the `data`, `generation/axolotl`, and `models` folders:
```bash
sudo docker run --gpus all -v data:/data -v generation/axolotl:/axolotl -v models:/models -it winglian/axolotl:main-20241202 bash
```
- Go to `/axolotl/configs/ft_llama_mimic.yaml` and make sure the `base_model` (here LLaMA-3.1-8B), `datasets/path` (here MIMIC-L), and `output_dir` are correct (a sanity-check sketch follows after the fine-tuning steps below).
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess configs/ft_llama_mimic.yaml` to preprocess the dataset before fine-tuning.
- From the `/axolotl` directory, run `accelerate launch -m axolotl.cli.train configs/ft_llama_mimic.yaml --deepspeed deepspeed_configs/zero3.json` to start the fine-tuning process.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora configs/ft_llama_mimic.yaml` to merge the LoRA adapters back into the base model. The merged model will be saved in the subdirectory `merged` inside the output path storing the fine-tuned model.
- Go to `/axolotl/configs/ft_llama_sepr.yaml` and make sure the `base_model` (here LLaMA-3.1-8B), `datasets/path` (here SEPR-L), and `output_dir` are correct.
- Go to `/axolotl/src/axolotl` and rename the file `prompters.py` to `prompters_org.py` and the file `prompters_swed.py` to `prompters.py`. This has to be done to create Swedish prompts. Don't forget to undo this change once you're done fine-tuning on the Swedish data.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess configs/ft_llama_sepr.yaml` to preprocess the dataset before fine-tuning.
- From the `/axolotl` directory, run `accelerate launch -m axolotl.cli.train configs/ft_llama_sepr.yaml --deepspeed deepspeed_configs/zero3.json` to start the fine-tuning process.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora configs/ft_llama_sepr.yaml` to merge the LoRA adapters back into the base model. The merged model will be saved in the subdirectory `merged` inside the output path storing the fine-tuned model.
Set up the environment by running the vLLM Docker image, mounting the `data`, `generation/axolotl`, and `generation/inference` folders:
```bash
docker run --entrypoint bash --gpus all -v /data:/data -v /generation/axolotl:/axolotl -v /generation/inference:/inference -it vllm/vllm-openai:v0.6.4.post1
```
Run
```bash
python3 /inference/inference_vllm_mimic.py --base_model path/to/fine-tuned/merged/llama --test_data path/to/sampling/file.json --file_out path/to/output/file.json
```
specifying the path to the fine-tuned and merged LLaMA model (e.g., `/axolotl/output/mimic_s_llama/merged`), the JSON file that should be used to generate the prompts for sampling (e.g., `/data/mimic/mimic_s.json`), and the output JSON file that stores the generated synthetic notes (e.g., `/data/mimic/synth_mimic_s.json`).
Run
```bash
python3 /inference/inference_vllm_sepr.py --base_model path/to/fine-tuned/merged/llama --test_data path/to/sampling/file.json --file_out path/to/output/file.json
```
specifying the path to the fine-tuned and merged LLaMA model (e.g., `/axolotl/output/sepr_llama/merged`), the JSON file that should be used to generate the prompts for sampling (e.g., `/data/sepr/sepr_s.json`), and the output JSON file that stores the generated synthetic notes (e.g., `/data/sepr/synth_sepr_s.json`).
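To quickly verify that generation worked, a minimal sketch for inspecting the output file (assuming the generated note is stored under an `output` field, matching the ALPACA format above):

```python
import json

# Load the generated synthetic notes and inspect the first one.
with open("/data/mimic/synth_mimic_s.json") as f:
    notes = json.load(f)

print(f"{len(notes)} synthetic notes generated")
print(notes[0].get("output", notes[0]))  # assumed field name
```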
Set up the environment by running the general Docker image, mounting the `fidelity` and `data` folders:
```bash
sudo docker run --gpus all -v fidelity:/fidelity -v data:/data -it general:latest
```
Run
```bash
python /fidelity/statistical_comparison.py --real_file path/real/documents.json \
    --field_real outputfield \
    --synthetic_file path/synthetic/documents.json \
    --field_synthetic outputfield
```
specifying the paths to the JSON files containing the real and synthetic documents (e.g., `data/mimic/mimic_s.json` and `synth_mimic_s.json`) and the fields containing the documents to analyze (e.g., `output` and `output1`). This will print a comparison of key statistical features of both datasets.
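For intuition, a minimal sketch of the kind of statistics such a comparison might cover (an illustration only, not the repository script; the paths and field name are assumptions):

```python
import json
import statistics

def describe(path, field):
    """Print simple corpus statistics for one JSON file of documents."""
    with open(path) as f:
        docs = [d[field] for d in json.load(f)]
    lengths = [len(doc.split()) for doc in docs]
    print(path)
    print("  documents:", len(docs))
    print("  mean length (words):", round(statistics.mean(lengths), 1))
    print("  median length (words):", statistics.median(lengths))
    print("  vocabulary size:", len({w for doc in docs for w in doc.split()}))

describe("data/mimic/mimic_s.json", "output")        # real notes
describe("data/mimic/synth_mimic_s.json", "output")  # synthetic notes
```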
Set up the environment by running the medical coding Docker image, mounting the `data` and `utility/plm_icd` folders:
```bash
sudo docker run --gpus all -v utility/plm_icd:/medical_coding -v data:/data -it medcode:edin bash
```
- Download RoBERTa-base-PM-M3-Voc and save the unzipped files.
- Set the `model_path` parameter in `medical_coding/configs/plm_icd.yaml` and `configs/text_transform/huggingface.yaml` to the path of the RoBERTa model folder.
- Create a `.feather` file containing the real or synthetic notes in the `text` column, the number of words in `num_words`, the ICD-10 diagnosis codes in `icd10_diag`, the ICD-10 procedure codes in `icd10_proc`, the combined codes in `target`, the number of targets in `num_target`, and the IDs in the `_id` column (see the sketch after this list). Using `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` as a base, filtering for the IDs, and just substituting the documents simplifies this process a lot.
- Store this file in the same directory as the splits file with `_id` and `split` columns. The splits used in this work are stored in `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10_split.feather` for training on MIMIC-L and `/medical_coding/files/data/mimiciv_icd10/splits_val_train.feather` for training on MIMIC-S. Change `/medical_coding/configs/data/mimiciv_icd10.yaml` by specifying `dir`, `data_filename`, and `split_filename`.
- To train the medical coding model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use.
- If you want to evaluate a trained model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x load_model=path/to/model/checkpoints trainer.epochs=0 callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use and the model folder containing the checkpoints of your medical coding model.
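As referenced in the list above, a minimal sketch of building such a `.feather` file from the MIMIC base file (assuming the synthetic JSON records carry `_id` and `output` fields; adjust names and paths to your files):

```python
import json
import pandas as pd

# Load the base dataframe and the synthetic notes.
df = pd.read_feather("/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather")
with open("/data/mimic/synth_mimic_s.json") as f:
    synth = json.load(f)

# Map assumed record ids to their synthetic note texts.
notes = {rec["_id"]: rec["output"] for rec in synth}

# Keep only the rows whose ids appear in the synthetic set, swap in the
# synthetic text, and recompute the word count; all other columns stay as-is.
df = df[df["_id"].isin(notes)].copy()
df["text"] = df["_id"].map(notes)
df["num_words"] = df["text"].str.split().str.len()

df.reset_index(drop=True).to_feather("synth_mimic_s.feather")
```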
- Get access to the model checkpoints of SweDeClin-BERT by contacting the Health Bank at Stockholm University and store them in a folder.
- Set the `model_path` parameter in `medical_coding/configs/plm_icd.yaml` and `configs/text_transform/huggingface.yaml` to the path of the SweDeClin-BERT model folder.
- Create a `.feather` file containing the real or synthetic notes in the `text` column, the number of words in `num_words`, the ICD-10 diagnosis codes in `icd10_diag`, the ICD-10 procedure codes in `icd10_proc`, the combined codes in `target`, the number of targets in `num_target`, and the IDs in the `_id` column.
- Store this file in the same directory as the splits file with `_id` and `split` columns. The splits used in this work are stored in `/medical_coding/files/data/sepr/sepr_icd10_split.feather` for training on SEPR-L and `/medical_coding/files/data/sepr/swed_splits_test_train.feather` for training on SEPR-S. Change `/medical_coding/configs/data/mimiciv_icd10.yaml` by specifying `dir`, `data_filename`, and `split_filename`.
- To train the medical coding model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use.
- If you want to evaluate a trained model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x load_model=path/to/model/checkpoints trainer.epochs=0 callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use and the model folder containing the checkpoints of your medical coding model.
Set up the environment by running
```bash
sudo docker run --gpus all -v utility/:/medical_coding -v data:/data -it general:latest
```
The error analysis is tailored to the MIMIC data and requires some adaptation before it can be applied to the SEPR models.
To analyze key features of the predictions of the medical coding model compared to the targets, run
```bash
python /medical_coding/error_analysis/predictions.py --file path/to/prediction.feather --threshold x
```
specifying the path to the test prediction file obtained in the evaluation of your medical coding model, and the optimal threshold, also obtained in that evaluation.
To get the percentage of predicted code sequences that are present in the training targets or form a subset of a code sequence from the training data, run
```bash
python /medical_coding/error_analysis/code_match.py --pred_file path/to/prediction.feather \
    --data_file path/to/training/data.feather \
    --split_file path/to/split/file.feather \
    --threshold x
```
specifying the prediction file and threshold obtained during training, as well as the paths to the data file and splits file used during training.
- Generate a CSV file storing, for each code, the F1 score and the code's frequency in the training data by running
```bash
python /medical_coding/error_analysis/get_f1_frequ.py --train_path path/to/training/data.feather \
    --split_path path/to/split/file.feather \
    --pred_file path/to/prediction.feather \
    --threshold x \
    --output_file path/to/output.csv
```
specifying the prediction file and threshold obtained during training, the paths to the data file and splits file used during training, and the path and name of the output file. For comparison, repeat this process for the model trained on real data and the model trained on synthetic data.
- To plot F1 against code frequency and obtain Spearman and Pearson correlation coefficients (see the sketch after this list), run
```bash
python /medical_coding/error_analysis/plot_f1_vs_frequ.py \
    --real_path path/to/f1_freq_df_real.csv \
    --synth_path path/to/f1_freq_df_synth.csv \
    --plot_output path/to/plot.png
```
specifying the paths to the dataframes of the real-data and synthetic-data models generated in step 1 and the name of the plot file.
- Generate a CSV file storing, for each document in the test file, the F1 score and the document length by running
```bash
python /medical_coding/error_analysis/get_f1_doc_length.py --train_path path/to/training/data.feather \
    --split_path path/to/split/file.feather \
    --pred_file path/to/prediction.feather \
    --threshold x \
    --output_file path/to/output.csv
```
specifying the prediction file and threshold obtained during training, the paths to the data file and splits file used during training, and the path and name of the output file. For comparison, repeat this process for the model trained on real data and the model trained on synthetic data.
- To plot F1 against document length and obtain Spearman and Pearson correlation coefficients, run
```bash
python /medical_coding/error_analysis/plot_f1_vs_length.py \
    --real_path path/to/f1_freq_df_real.csv \
    --synth_path path/to/f1_freq_df_synth.csv \
    --plot_output path/to/plot.png
```
specifying the paths to the dataframes of the real-data and synthetic-data models generated in step 3 and the name of the plot file.
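As referenced above, a minimal sketch of the underlying plot-and-correlate step (an illustration of the technique, not the repository script; the `frequency` and `f1` column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Assumed columns: "frequency" (code frequency in training data) and "f1".
real = pd.read_csv("f1_freq_df_real.csv")
synth = pd.read_csv("f1_freq_df_synth.csv")

for name, df in [("real", real), ("synthetic", synth)]:
    r, _ = pearsonr(df["frequency"], df["f1"])
    rho, _ = spearmanr(df["frequency"], df["f1"])
    print(f"{name}: pearson={r:.3f}, spearman={rho:.3f}")
    plt.scatter(df["frequency"], df["f1"], label=name, alpha=0.5)

plt.xscale("log")  # code frequencies are typically long-tailed
plt.xlabel("code frequency in training data")
plt.ylabel("F1")
plt.legend()
plt.savefig("f1_vs_frequency.png")
```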
To analyze how many of the incorrect predictions are WF and OOF errors and to plot the ICD-10 chapter distribution of the predictions in a pie chart, run
```bash
python /medical_coding/error_analysis/wf_oof_chapters.py \
    --preds path/to/predictions.feather \
    --threshold x \
    --plot_output path/to/plot.png
```
specifying the prediction file and threshold obtained during training and the name of the plot file.
To investigate whether the noise contained in the synthetic data is widespread (H1) or rather concentrated in a subset of documents (H2), follow these steps:
- Insert widespread noise into the real dataset by running
```bash
python /medical_coding/error_analysis/noise_h1.py --file_path path/to/real.feather \
    --split_path path/to/splits.feather \
    --noise_percentage x \
    --output_file_path path/to/output_file.feather
```
specifying the path to the real data into which the noise is to be inserted, the respective splits file, the percentage of noise to add, and the path and name of the output file containing the training data with widespread noise. Repeat this process with different noise percentages.
- Insert concentrated noise into the real dataset by running
```bash
python /medical_coding/error_analysis/noise_h2.py --file_path path/to/real.feather \
    --split_path path/to/splits.feather \
    --noise_percentage x \
    --output_file_path path/to/output_file.feather
```
specifying the path to the real data into which the noise is to be inserted, the respective splits file, the percentage of noise to add, and the path and name of the output file containing the training data with concentrated noise. Repeat this process with different noise percentages.
- Use the files obtained in steps 1 and 2 to train medical coding models as described in 4.1. Then plot F1 against frequency and document length, comparing the models trained on noisy real data to a model trained on the same amount of synthetic data, and compare the curves. Analyze which hypothesis seems more likely by checking which noise regime aligns more closely with the curve of the synthetic-data model, and which percentage of noise achieves the most similar results. A sketch contrasting the two noise regimes follows below.
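For intuition, a minimal sketch of the difference between the two noise regimes (an illustration only, not the repository scripts; it corrupts the `target` code lists of a toy dataset):

```python
import random

random.seed(0)

# Toy stand-in for the training data: each record has an id and a code list.
data = [{"_id": i, "target": ["A01", "B02"]} for i in range(100)]
noise_pct = 0.2

# H1 (widespread): corrupt one code in 20% of the documents, spread over the whole set.
h1 = [dict(rec, target=list(rec["target"])) for rec in data]
for i in random.sample(range(len(h1)), k=int(noise_pct * len(h1))):
    h1[i]["target"][-1] = "Z99"  # hypothetical wrong code

# H2 (concentrated): replace the entire code list in a 20% subset of documents.
h2 = [dict(rec, target=list(rec["target"])) for rec in data]
for i in random.sample(range(len(h2)), k=int(noise_pct * len(h2))):
    h2[i]["target"] = ["Z99"] * len(h2[i]["target"])
```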
Set up the environment by running
```bash
sudo docker run --gpus all -v privacy:/privacy -v data:/data -it general:latest
```
- To obtain a CSV file with ROUGE-5 scores between the training data and the synthetic data, run
```bash
python /privacy/rouge_5/rouge_similarity_ranking.py --realpath /path/to/training.json \
    --real_field outputfield \
    --synthpath /path/to/synthetic.json \
    --synth_field outputfield \
    --outdir path/to/output/dir \
    --n_jobs x \
    --batch_size x
```
specifying the paths to both JSON files, the names of the fields containing the documents, the number of synthetic documents processed in parallel (`--n_jobs`), and the number of synthetic documents to process before saving (`--batch_size`).
- To get the average, minimum, and maximum ROUGE-5 recall score of all documents and of the 122 most similar documents, run
```bash
python /privacy/rouge_5/evaluate_rouge.py --rouge_file path/to/rouge.csv
```
specifying the path to the CSV file storing the ROUGE-5 scores obtained in step 1.
- For further evaluation of the 20 most similar document pairs, run
```bash
python /privacy/rouge_5/longest_sequence.py --rouge_file path/to/rouge.csv
```
specifying the path to the ROUGE CSV file to obtain statistics about the longest overlapping sequences, and
```bash
python /privacy/rouge_5/longest_sequence.py --train_file path/to/training.feather \
    --splits_file path/to/splits.feather \
    --rouge_file path/to/rouge.csv
```
specifying the paths to the training and split feather files and the ROUGE CSV file to obtain the overall count of overlapping 5-grams of the 20 most similar documents in the training data.
To calculate the 8-gram overlap between the training data used to fine-tune LLaMA and the generated synthetic data, run
```bash
python3 /privacy/8_gram/ngram_overlap.py --original_file path/to/training.json \
    --original_field outputfield \
    --synthetic_file path/to/synthetic.json \
    --synthetic_field outputfield
```
specifying the paths to both JSON files and the names of the fields containing the documents. A minimal sketch of the underlying n-gram computation follows below.
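For intuition, here is what these n-gram measures compute (an illustration of the technique, not the repository code): ROUGE-n recall is the fraction of a training document's n-grams that also occur in a synthetic document, and the n-gram overlap counts the shared n-grams.

```python
from collections import Counter

def ngrams(text, n):
    """Return a multiset of word n-grams of the given text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=5):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / sum(ref.values())

real = "patient was admitted with chest pain and shortness of breath"
synth = "patient was admitted with chest pain and mild fever"
print(rouge_n_recall(real, synth, n=5))                     # ROUGE-5 recall
print(sum((ngrams(real, 8) & ngrams(synth, 8)).values()))   # shared 8-grams
```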
The XML files of the questionnaires of the MIMIC and SEPR study can be found in `readability_coherence/questionnaire`. You can import them into SoSci Survey to investigate and edit the study.
You can insert the results of the study into `readability_coherence/evaluation/study_evaluation.ipynb` to generate boxplots for the evaluation of the ratings, test for statistical significance, and investigate the correlation between readability and medical coherence.