Instruction-Tuning LLaMA for Synthetic Medical Note Generation: Bridging Data Privacy and Utility in Downstream Tasks
This repository contains the code accompanying the paper "Instruction-Tuning LLaMA for Synthetic Medical Note Generation in Swedish and English". The purpose of this study was to generate synthetic Swedish and English discharge summaries that preserve privacy while mimicking the task-relevant properties of the real data needed to build high-performing downstream systems. The code in this repository is structured into chapters mirroring the paper. We recommend reading the corresponding chapters alongside the code to better understand the purpose and desired outcome of each part. Note that you need to obtain access to the datasets and several language models from external sources (always explained in the relevant chapters); this can take a while and requires planning ahead. Finally, don't forget that you are working with sensitive medical data that requires special care. For example, you are not allowed to upload any of this data to online APIs, meaning that all models used in this study must be downloaded and saved locally on your device/server before use.
This paper is based on the Master’s thesis "Instruction-Tuning LLaMA for Synthetic Medical Note Generation: Bridging Data Privacy and Utility in Downstream Tasks," written by Lotta Kiefer. The thesis was supervised by Hercules Dalianis at Stockholm University, and by Dietrich Klakow and Jesujoba Alabi at Saarland University. The thesis can be downloaded here.
The code in this repository builds on several other studies and includes code from other repositories. Specifically, it includes:
- The Axolotl project in `generation/axolotl` for the fine-tuning process. The code of the original repository was not changed, only complemented with some additional files, specified in more detail in `generation/axolotl/changes.md`.
- The medical-coding-reproducibility repository in `utility/plm_icd/medical_coding/` for building medical coding models. Some slight modifications were made to the source code, specified in more detail in `utility/plm_icd/changes.md`.
- The ROUGE-5 implementation of mdpi2021-textgen in `privacy/rouge_5`. The changes made to this code are specified in more detail in `privacy/rouge_5/changes.md`.
There are four different Docker containers you will need to run for different parts of this code. It is always specified which container to run for which part of the code and which folders to mount. Always make sure you're running the correct container! Pull the following two images from Docker Hub:
```bash
docker pull winglian/axolotl:main-20241202
docker pull vllm/vllm-openai:v0.6.4.post1
```
Build the remaining two images from the Dockerfiles saved in the `docker` folder by running
```bash
docker build -t medcode:edin -f docker/DockerfileMedcode .
docker build -t general:latest -f docker/DockerfileGeneral .
```
Due to the sensitivity of the data, you might need to transfer these images to a server with no internet access. In this case, save an image by running
```bash
docker save -o /path/for/generated/tar/file image_name
```
transfer the `.tar` file to your server, and load it by running
```bash
docker load -i /path/to/image/tar/file
```
Now you're ready to run all Docker containers needed in this study and start your experiments.
Set up the environment by running the medical coding Docker image, mounting the `data`, `utility/plm_icd`, and `preprocessing` folders:
```bash
sudo docker run --gpus all -v utility/plm_icd:/medical_coding -v data:/data -v preprocessing:/preprocessing -it medcode:edin bash
```
- Get access to the MIMIC-IV dataset on PhysioNet. Note that you need to complete a training course that takes a couple of hours before access is granted.
- Then download MIMIC-IV and MIMIC-IV-NOTE into the folders specified in `/data/mimic`.
- Change `DOWNLOAD_DIRECTORY_MIMICIV` and `DOWNLOAD_DIRECTORY_MIMICIV_NOTE` in `medical_coding/src/settings.py` to your respective paths.
- Run `python /medical_coding/prepare_data/prepare_mimiciv.py` to obtain the new file `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` alongside the splits for the dataset.
- Run `python /preprocessing/mimic/preprocessing.py --input_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --output_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` to preprocess the medical notes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/mimic/transcriptions.py --icd_file /preprocessing/mimic/ICD10_descriptions_mimic.csv --mimic_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --output_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` to obtain textual descriptions of the ICD-10 codes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/mimic/json_splits.py --notes_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather --splits_file /medical_coding/files/data/mimiciv_icd10/mimiciv_icd10_split.feather` to obtain the JSON files required to create prompts in the ALPACA format for MIMIC-S, MIMIC-L, and MIMIC-E.
- Get access to SEPR II by contacting the Health Bank at Stockholm University. Create a `.feather` file similar to `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` by substituting the `text` and `target` columns with the SEPR data, and store it as `/medical_coding/files/data/sepr/sepr_icd10.feather`.
- Run `python /preprocessing/sepr/transcriptions_swed.py --icd_file /preprocessing/sepr/ICD10_descriptions_sepr.csv --sepr_file /medical_coding/files/data/sepr/sepr_icd10.feather --output_file /medical_coding/files/data/sepr/sepr_icd10.feather` to obtain textual descriptions of the ICD-10 codes (use an alternative output file if you want to keep the original dataframe).
- Run `python /preprocessing/sepr/json_splits_swed.py --notes_file /medical_coding/files/data/sepr/sepr_icd10.feather --splits_file /medical_coding/files/data/sepr/sepr_icd10_split.feather` to obtain the JSON files required to create prompts in the ALPACA format for SEPR-S, SEPR-L, and SEPR-E (a sketch of the expected record format follows below).
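As referenced above, the ALPACA format stores each training example as a JSON object with `instruction`, `input`, and `output` fields. A minimal sketch of one such record (the instruction wording and field contents below are illustrative assumptions, not taken from the repository):

```python
import json

# Illustrative ALPACA-style record; the actual instruction wording and the
# ICD-10 descriptions are produced by the json_splits scripts above.
record = {
    "instruction": "Write a discharge summary for a patient with the following diagnoses.",
    "input": "Essential (primary) hypertension; Type 2 diabetes mellitus without complications",
    "output": "The patient was admitted with ...",  # the (real or synthetic) note text
}

with open("example_alpaca.json", "w") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```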
Download LLaMA-3.1-8B (or any other model you want to use) from Hugging Face by running
```bash
export HF_TOKEN=your_access_token
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir-use-symlinks False --local-dir path/to/local/dir
```
and save the model locally in the `models` folder.
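Alternatively, a minimal Python sketch using the `huggingface_hub` library (the target directory is an assumption; adjust it to your setup):

```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_TOKEN"] = ""  # insert your Hugging Face access token here

# Download the full model snapshot into the local models folder.
snapshot_download(repo_id="meta-llama/Llama-3.1-8B", local_dir="models/Llama-3.1-8B")
```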
Set up the environment by running the Axolotl Docker image, mounting the `data`, `generation/axolotl`, and `models` folders:
```bash
sudo docker run --gpus all -v data:/data -v generation/axolotl:/axolotl -v models:/models -it winglian/axolotl:main-20241202 bash
```
- Go to `/axolotl/configs/ft_llama_mimic.yaml` and make sure the `base_model` (here LLaMA-3.1-8B), `datasets/path` (here MIMIC-L), and `output_dir` are correct (a sanity-check sketch follows after the fine-tuning steps below).
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess configs/ft_llama_mimic.yaml` to preprocess the dataset before fine-tuning.
- From the `/axolotl` directory, run `accelerate launch -m axolotl.cli.train configs/ft_llama_mimic.yaml --deepspeed deepspeed_configs/zero3.json` to start the fine-tuning process.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora configs/ft_llama_mimic.yaml` to merge the LoRA adapters back into the base model. The merged model will be saved in the subdirectory `merged` inside the output path storing the fine-tuned model.
- Go to `/axolotl/configs/ft_llama_sepr.yaml` and make sure the `base_model` (here LLaMA-3.1-8B), `datasets/path` (here SEPR-L), and `output_dir` are correct.
- Go to `/axolotl/src/axolotl` and rename the file `prompters.py` to `prompters_org.py` and the file `prompters_swed.py` to `prompters.py`. This has to be done to create Swedish prompts. Don't forget to undo this change once you're done fine-tuning on the Swedish data.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python -m axolotl.cli.preprocess configs/ft_llama_sepr.yaml` to preprocess the dataset before fine-tuning.
- From the `/axolotl` directory, run `accelerate launch -m axolotl.cli.train configs/ft_llama_sepr.yaml --deepspeed deepspeed_configs/zero3.json` to start the fine-tuning process.
- From the `/axolotl` directory, run `CUDA_VISIBLE_DEVICES="" python3 -m axolotl.cli.merge_lora configs/ft_llama_sepr.yaml` to merge the LoRA adapters back into the base model. The merged model will be saved in the subdirectory `merged` inside the output path storing the fine-tuned model.
Set up the environment by running the vLLM Docker image, mounting the `data`, `generation/axolotl`, and `generation/inference` folders:
```bash
docker run --entrypoint bash --gpus all -v /data:/data -v /generation/axolotl:/axolotl -v /generation/inference:/inference -it vllm/vllm-openai:v0.6.4.post1
```
Run
```bash
python3 /inference/inference_vllm_mimic.py --base_model path/to/fine-tuned/merged/llama --test_data path/to/sampling/file.json --file_out path/to/output/file.json
```
specifying the path to the fine-tuned and merged LLaMA model (e.g., `/axolotl/output/mimic_s_llama/merged`), the JSON file that should be used to generate the prompts for sampling (e.g., `/data/mimic/mimic_s.json`), and the output JSON file that stores the generated synthetic notes (e.g., `/data/mimic/synth_mimic_s.json`).
Run
```bash
python3 /inference/inference_vllm_sepr.py --base_model path/to/fine-tuned/merged/llama --test_data path/to/sampling/file.json --file_out path/to/output/file.json
```
specifying the path to the fine-tuned and merged LLaMA model (e.g., `/axolotl/output/sepr_llama/merged`), the JSON file that should be used to generate the prompts for sampling (e.g., `/data/sepr/sepr_s.json`), and the output JSON file that stores the generated synthetic notes (e.g., `/data/sepr/synth_sepr_s.json`).
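To quickly verify that generation worked, a minimal sketch for inspecting the output file (assuming the generated note is stored under an `output` field, matching the ALPACA format above):

```python
import json

# Load the generated synthetic notes and inspect the first one.
with open("/data/mimic/synth_mimic_s.json") as f:
    notes = json.load(f)

print(f"{len(notes)} synthetic notes generated")
print(notes[0].get("output", notes[0]))  # assumed field name
```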
Set up the environment by running the general Docker image, mounting the `fidelity` and `data` folders:
```bash
sudo docker run --gpus all -v fidelity:/fidelity -v data:/data -it general:latest
```
Run
```bash
python /fidelity/statistical_comparison.py --real_file path/real/documents.json \
    --field_real outputfield \
    --synthetic_file path/synthetic/documents.json \
    --field_synthetic outputfield
```
specifying the paths to the JSON files containing the real and synthetic documents (e.g., `data/mimic/mimic_s.json` and `synth_mimic_s.json`) and the fields containing the documents to analyze (e.g., `output` and `output1`). This will print a comparison of key statistical features of both datasets.
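For intuition, a minimal sketch of the kind of statistics such a comparison might cover (an illustration only, not the repository script; the paths and field name are assumptions):

```python
import json
import statistics

def describe(path, field):
    """Print simple corpus statistics for one JSON file of documents."""
    with open(path) as f:
        docs = [d[field] for d in json.load(f)]
    lengths = [len(doc.split()) for doc in docs]
    print(path)
    print("  documents:", len(docs))
    print("  mean length (words):", round(statistics.mean(lengths), 1))
    print("  median length (words):", statistics.median(lengths))
    print("  vocabulary size:", len({w for doc in docs for w in doc.split()}))

describe("data/mimic/mimic_s.json", "output")        # real notes
describe("data/mimic/synth_mimic_s.json", "output")  # synthetic notes
```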
Set up the environment by running the medical coding Docker image, mounting the `data` and `utility/plm_icd` folders:
```bash
sudo docker run --gpus all -v utility/plm_icd:/medical_coding -v data:/data -it medcode:edin bash
```
- Download RoBERTa-base-PM-M3-Voc and save the unzipped files.
- Set the `model_path` parameter in `medical_coding/configs/plm_icd.yaml` and `configs/text_transform/huggingface.yaml` to the path of the RoBERTa model folder.
- Create a `.feather` file containing the real or synthetic notes in the `text` column, the number of words in `num_words`, the ICD-10 diagnosis codes in `icd10_diag`, the ICD-10 procedure codes in `icd10_proc`, the combined codes in `target`, the number of targets in `num_target`, and the IDs in the `_id` column (see the sketch after this list). Using `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather` as a base, filtering for the IDs, and just substituting the documents simplifies this process a lot.
- Store this file in the same directory as the splits file with `_id` and `split` columns. The splits used in this work are stored in `/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10_split.feather` for training on MIMIC-L and `/medical_coding/files/data/mimiciv_icd10/splits_val_train.feather` for training on MIMIC-S. Change `/medical_coding/configs/data/mimiciv_icd10.yaml` by specifying `dir`, `data_filename`, and `split_filename`.
- To train the medical coding model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use.
- If you want to evaluate a trained model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x load_model=path/to/model/checkpoints trainer.epochs=0 callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use and the model folder containing the checkpoints of your medical coding model.
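As referenced in the list above, a minimal sketch of building such a `.feather` file from the MIMIC base file (assuming the synthetic JSON records carry `_id` and `output` fields; adjust names and paths to your files):

```python
import json
import pandas as pd

# Load the base dataframe and the synthetic notes.
df = pd.read_feather("/medical_coding/files/data/mimiciv_icd10/mimiciv_icd10.feather")
with open("/data/mimic/synth_mimic_s.json") as f:
    synth = json.load(f)

# Map assumed record ids to their synthetic note texts.
notes = {rec["_id"]: rec["output"] for rec in synth}

# Keep only the rows whose ids appear in the synthetic set, swap in the
# synthetic text, and recompute the word count; all other columns stay as-is.
df = df[df["_id"].isin(notes)].copy()
df["text"] = df["_id"].map(notes)
df["num_words"] = df["text"].str.split().str.len()

df.reset_index(drop=True).to_feather("synth_mimic_s.feather")
```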
- Get access to the model checkpoints of SweDeClin-BERT by contacting the Health Bank at Stockholm University and store them in a folder.
- Set the `model_path` parameter in `medical_coding/configs/plm_icd.yaml` and `configs/text_transform/huggingface.yaml` to the path of the SweDeClin-BERT model folder.
- Create a `.feather` file containing the real or synthetic notes in the `text` column, the number of words in `num_words`, the ICD-10 diagnosis codes in `icd10_diag`, the ICD-10 procedure codes in `icd10_proc`, the combined codes in `target`, the number of targets in `num_target`, and the IDs in the `_id` column.
- Store this file in the same directory as the splits file with `_id` and `split` columns. The splits used in this work are stored in `/medical_coding/files/data/sepr/sepr_icd10_split.feather` for training on SEPR-L and `/medical_coding/files/data/sepr/swed_splits_test_train.feather` for training on SEPR-S. Change `/medical_coding/configs/data/mimiciv_icd10.yaml` by specifying `dir`, `data_filename`, and `split_filename`.
- To train the medical coding model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use.
- If you want to evaluate a trained model, run `python main.py experiment=mimiciv_icd10/plm_icd gpu=x load_model=path/to/model/checkpoints trainer.epochs=0 callbacks=no_wandb trainer.print_metrics=true`, specifying the GPU you want to use and the model folder containing the checkpoints of your medical coding model.
Set up the environment by running
```bash
sudo docker run --gpus all -v utility/:/medical_coding -v data:/data -it general:latest
```
The error analysis is tailored to the MIMIC data and requires some adaptation before it can be applied to the SEPR models.
To analyze key features of the predictions of the medical coding model compared to the targets, run
```bash
python /medical_coding/error_analysis/predictions.py --file path/to/prediction.feather --threshold x
```
specifying the path to the test prediction file obtained in the evaluation of your medical coding model, and the optimal threshold, also obtained in that evaluation.
To get the percentage of predicted code sequences that are present in the training targets or form a subset of a code sequence from the training data, run
```bash
python /medical_coding/error_analysis/code_match.py --pred_file path/to/prediction.feather \
    --data_file path/to/training/data.feather \
    --split_file path/to/split/file.feather \
    --threshold x
```
specifying the prediction file and threshold obtained during training, as well as the paths to the data file and splits file used during training.
- Generate a CSV file storing, for each code, the F1 score and the code's frequency in the training data by running
```bash
python /medical_coding/error_analysis/get_f1_frequ.py --train_path path/to/training/data.feather \
    --split_path path/to/split/file.feather \
    --pred_file path/to/prediction.feather \
    --threshold x \
    --output_file path/to/output.csv
```
specifying the prediction file and threshold obtained during training, the paths to the data file and splits file used during training, and the path and name of the output file. For comparison, repeat this process for the model trained on real data and the model trained on synthetic data.
- To plot F1 against code frequency and obtain Spearman and Pearson correlation coefficients (see the sketch after this list), run
```bash
python /medical_coding/error_analysis/plot_f1_vs_frequ.py \
    --real_path path/to/f1_freq_df_real.csv \
    --synth_path path/to/f1_freq_df_synth.csv \
    --plot_output path/to/plot.png
```
specifying the paths to the dataframes of the real-data and synthetic-data models generated in step 1 and the name of the plot file.
- Generate a CSV file storing, for each document in the test file, the F1 score and the document length by running
```bash
python /medical_coding/error_analysis/get_f1_doc_length.py --train_path path/to/training/data.feather \
    --split_path path/to/split/file.feather \
    --pred_file path/to/prediction.feather \
    --threshold x \
    --output_file path/to/output.csv
```
specifying the prediction file and threshold obtained during training, the paths to the data file and splits file used during training, and the path and name of the output file. For comparison, repeat this process for the model trained on real data and the model trained on synthetic data.
- To plot F1 against document length and obtain Spearman and Pearson correlation coefficients, run
```bash
python /medical_coding/error_analysis/plot_f1_vs_length.py \
    --real_path path/to/f1_freq_df_real.csv \
    --synth_path path/to/f1_freq_df_synth.csv \
    --plot_output path/to/plot.png
```
specifying the paths to the dataframes of the real-data and synthetic-data models generated in step 3 and the name of the plot file.
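As referenced above, a minimal sketch of the underlying plot-and-correlate step (an illustration of the technique, not the repository script; the `frequency` and `f1` column names are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Assumed columns: "frequency" (code frequency in training data) and "f1".
real = pd.read_csv("f1_freq_df_real.csv")
synth = pd.read_csv("f1_freq_df_synth.csv")

for name, df in [("real", real), ("synthetic", synth)]:
    r, _ = pearsonr(df["frequency"], df["f1"])
    rho, _ = spearmanr(df["frequency"], df["f1"])
    print(f"{name}: pearson={r:.3f}, spearman={rho:.3f}")
    plt.scatter(df["frequency"], df["f1"], label=name, alpha=0.5)

plt.xscale("log")  # code frequencies are typically long-tailed
plt.xlabel("code frequency in training data")
plt.ylabel("F1")
plt.legend()
plt.savefig("f1_vs_frequency.png")
```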
To analyze how many of the incorrect predictions are WF and OOF errors and to plot the ICD-10 chapter distribution of the predictions in a pie chart, run
```bash
python /medical_coding/error_analysis/wf_oof_chapters.py \
    --preds path/to/predictions.feather \
    --threshold x \
    --plot_output path/to/plot.png
```
specifying the prediction file and threshold obtained during training and the name of the plot file.
To investigate whether the noise contained in the synthetic data is widespread (H1) or rather concentrated in a subset of documents (H2), follow these steps:
- Insert widespread noise into the real dataset by running
```bash
python /medical_coding/error_analysis/noise_h1.py --file_path path/to/real.feather \
    --split_path path/to/splits.feather \
    --noise_percentage x \
    --output_file_path path/to/output_file.feather
```
specifying the path to the real data into which the noise is to be inserted, the respective splits file, the percentage of noise to add, and the path and name of the output file containing the training data with widespread noise. Repeat this process with different noise percentages.
- Insert concentrated noise into the real dataset by running
```bash
python /medical_coding/error_analysis/noise_h2.py --file_path path/to/real.feather \
    --split_path path/to/splits.feather \
    --noise_percentage x \
    --output_file_path path/to/output_file.feather
```
specifying the path to the real data into which the noise is to be inserted, the respective splits file, the percentage of noise to add, and the path and name of the output file containing the training data with concentrated noise. Repeat this process with different noise percentages.
- Use the files obtained in steps 1 and 2 to train medical coding models as described in 4.1. Then plot F1 against frequency and document length, comparing the models trained on noisy real data to a model trained on the same amount of synthetic data, and compare the curves. Analyze which hypothesis seems more likely by checking which noise regime aligns more closely with the curve of the synthetic-data model, and which percentage of noise achieves the most similar results. A sketch contrasting the two noise regimes follows below.
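For intuition, a minimal sketch of the difference between the two noise regimes (an illustration only, not the repository scripts; it corrupts the `target` code lists of a toy dataset):

```python
import random

random.seed(0)

# Toy stand-in for the training data: each record has an id and a code list.
data = [{"_id": i, "target": ["A01", "B02"]} for i in range(100)]
noise_pct = 0.2

# H1 (widespread): corrupt one code in 20% of the documents, spread over the whole set.
h1 = [dict(rec, target=list(rec["target"])) for rec in data]
for i in random.sample(range(len(h1)), k=int(noise_pct * len(h1))):
    h1[i]["target"][-1] = "Z99"  # hypothetical wrong code

# H2 (concentrated): replace the entire code list in a 20% subset of documents.
h2 = [dict(rec, target=list(rec["target"])) for rec in data]
for i in random.sample(range(len(h2)), k=int(noise_pct * len(h2))):
    h2[i]["target"] = ["Z99"] * len(h2[i]["target"])
```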
Set up the environment by running
```bash
sudo docker run --gpus all -v privacy:/privacy -v data:/data -it general:latest
```
- To obtain a CSV file with ROUGE-5 scores between the training data and the synthetic data, run
```bash
python /privacy/rouge_5/rouge_similarity_ranking.py --realpath /path/to/training.json \
    --real_field outputfield \
    --synthpath /path/to/synthetic.json \
    --synth_field outputfield \
    --outdir path/to/output/dir \
    --n_jobs x \
    --batch_size x
```
specifying the paths to both JSON files, the names of the fields containing the documents, the number of synthetic documents processed in parallel (`--n_jobs`), and the number of synthetic documents to process before saving (`--batch_size`).
- To get the average, minimum, and maximum ROUGE-5 recall score of all documents and of the 122 most similar documents, run
```bash
python /privacy/rouge_5/evaluate_rouge.py --rouge_file path/to/rouge.csv
```
specifying the path to the CSV file storing the ROUGE-5 scores obtained in step 1.
- For further evaluation of the 20 most similar document pairs, run
```bash
python /privacy/rouge_5/longest_sequence.py --rouge_file path/to/rouge.csv
```
specifying the path to the ROUGE CSV file to obtain statistics about the longest overlapping sequences, and
```bash
python /privacy/rouge_5/longest_sequence.py --train_file path/to/training.feather \
    --splits_file path/to/splits.feather \
    --rouge_file path/to/rouge.csv
```
specifying the paths to the training and split feather files and the ROUGE CSV file to obtain the overall count of overlapping 5-grams of the 20 most similar documents in the training data.
To calculate the 8-gram overlap between the training data used to fine-tune LLaMA and the generated synthetic data, run
```bash
python3 /privacy/8_gram/ngram_overlap.py --original_file path/to/training.json \
    --original_field outputfield \
    --synthetic_file path/to/synthetic.json \
    --synthetic_field outputfield
```
specifying the paths to both JSON files and the names of the fields containing the documents. A minimal sketch of the underlying n-gram computation follows below.
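For intuition, here is what these n-gram measures compute (an illustration of the technique, not the repository code): ROUGE-n recall is the fraction of a training document's n-grams that also occur in a synthetic document, and the n-gram overlap counts the shared n-grams.

```python
from collections import Counter

def ngrams(text, n):
    """Return a multiset of word n-grams of the given text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=5):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return overlap / sum(ref.values())

real = "patient was admitted with chest pain and shortness of breath"
synth = "patient was admitted with chest pain and mild fever"
print(rouge_n_recall(real, synth, n=5))                     # ROUGE-5 recall
print(sum((ngrams(real, 8) & ngrams(synth, 8)).values()))   # shared 8-grams
```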
The XML files of the questionnaires of the MIMIC and SEPR study can be found in `readability_coherence/questionnaire`. You can import them into SoSci Survey to investigate and edit the study.
You can insert the results of the study into `readability_coherence/evaluation/study_evaluation.ipynb` to generate boxplots for the evaluation of the ratings, test for statistical significance, and investigate the correlation between readability and medical coherence.