This repository contains the public release artifacts for MedGuideX, a medical LLM project that converts clinical practice guidelines into executable decision logic and then into factual and counterfactual QA supervision.
The release is intentionally scoped to reviewer-useful artifacts:
- final guideline-derived factual QA data;
- final strict counterfactual QA data;
- the executable guideline-to-QA construction pipeline;
- training data preparation utilities for SFT/RL;
- executable-consistency reward code;
- lightweight validation scripts.
It intentionally excludes model weights, checkpoints, raw benchmark data, private logs, API keys, and large intermediate pipeline artifacts.

    MedGuideX/
      data/
        factual.json            # 4,963 factual QA examples, grouped by guideline function
        counterfactual.json     # 4,963 strict counterfactual QA examples
        stats.json              # release and source pipeline counts
        samples/                # one sample record per task type
        source/README.md        # raw CPG source schema note
      code/
        data_generation/v2_pipeline.py
        training/
          prepare_sft_coldstart_dataset.py
          prepare_rl_medical_reasoning_dataset.py
          medical_reasoning_reward.py
          medical_sft_dataset.py
        src/
          azure_api.py          # env-driven LLM client wrapper, no keys included
          azure_openai_judge.py
      scripts/
        create_release_dataset.py
        validate_release.py
        check_executable_consistency.py

The released dataset is a cleaned version of the US-only pipeline output. It preserves the fields needed for training and verification while removing raw CPG text, source chunks, LLM history, validation history, logs, and checkpoints.
Counts:
| Split | Functions | QA examples |
|---|---|---|
| Factual | 2,759 | 4,963 |
| Counterfactual | 699 | 4,963 |
Records are grouped by executable guideline function, and each record holds the scenarios derived from that function. Each scenario contains:
- the clinical question;
- the generated reasoning and final answer in JSON text;
- executable inputs and outputs;
- for counterfactual data, `X_base`, `X_hidden`, `X_change`, intervention values, and abduction-stability metadata.
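As a rough illustration of the grouped layout described above, the sketch below builds one in-memory record and counts function groups and scenarios. The key names (`function_name`, `scenarios`, `question`, `inputs`, `expected_output`) are hypothetical placeholders, not the released schema; the files under `data/samples/` show the actual keys.

```python
from typing import Any

# Hypothetical record shape: the key names below are assumptions for
# illustration only, not the schema of data/factual.json.
example_record: dict[str, Any] = {
    "function_name": "example_guideline_rule",
    "scenarios": [
        {
            "question": "Placeholder clinical question 1",
            "inputs": {"age": 70},
            "expected_output": "action_a",
        },
        {
            "question": "Placeholder clinical question 2",
            "inputs": {"age": 40},
            "expected_output": "action_b",
        },
    ],
}


def count_scenarios(records: list[dict[str, Any]]) -> tuple[int, int]:
    """Return (number of function groups, total number of scenarios)."""
    n_functions = len(records)
    n_scenarios = sum(len(r.get("scenarios", [])) for r in records)
    return n_functions, n_scenarios


print(count_scenarios([example_record]))
```

Under the same assumed layout, running `count_scenarios` over a loaded split would yield the two numbers reported per row in the counts table.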

    pip install -r requirements.txt
    python scripts/validate_release.py
    python scripts/check_executable_consistency.py

Expected output:

    OK factual=4963 counterfactual=4963

The second command re-executes every released factual and counterfactual scenario against its guideline function and checks that the stored oracle output is reproduced.
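The idea behind this re-execution check can be sketched as follows. This is a minimal illustration, not the logic of `scripts/check_executable_consistency.py`: the toy rule `age_threshold_rule`, the helper `check_scenarios`, and the keys `inputs` and `expected_output` are all assumptions made for the example.

```python
from typing import Any, Callable


def age_threshold_rule(age: int) -> str:
    """Toy stand-in for an executable guideline function (hypothetical)."""
    return "screen" if age >= 50 else "no_action"


def check_scenarios(
    func: Callable[..., Any], scenarios: list[dict[str, Any]]
) -> list[int]:
    """Re-run func on each scenario's inputs; return indices that mismatch."""
    mismatches = []
    for i, scenario in enumerate(scenarios):
        actual = func(**scenario["inputs"])
        if actual != scenario["expected_output"]:
            mismatches.append(i)
    return mismatches


scenarios = [
    {"inputs": {"age": 62}, "expected_output": "screen"},
    {"inputs": {"age": 35}, "expected_output": "no_action"},
    # Deliberately wrong stored output, to show what a failure looks like.
    {"inputs": {"age": 50}, "expected_output": "no_action"},
]

print(check_scenarios(age_threshold_rule, scenarios))
```

An empty list means every stored oracle output was reproduced; any index in the list points at a scenario whose stored output no longer matches its function.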

    python code/training/prepare_sft_coldstart_dataset.py \
        --task-selection both \
        --use-all \
        --factual-val 0 \
        --factual-test 0 \
        --counterfactual-val 0 \
        --counterfactual-test 0 \
        --out-dir data/sft

The script writes Parquet files for text-only SFT. The default paths point to `data/factual.json` and `data/counterfactual.json`.

    python code/training/prepare_rl_medical_reasoning_dataset.py \
        --task-selection factual_cot \
        --pool-mode all \
        --out-dir data/rl/factual_cot

The reward implementation is in `code/training/medical_reasoning_reward.py`. It checks answer correctness, response format, executable consistency, and counterfactual hidden-variable recovery when applicable.
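One plausible way to combine these four checks into a scalar reward is a weighted average that drops the hidden-variable term for factual examples. The sketch below is an assumption for illustration only: the weights, the function name `combine_reward`, and the renormalization rule are hypothetical and may differ from what `medical_reasoning_reward.py` actually does.

```python
from typing import Optional

# Hypothetical weights; the actual values and combination rule live in
# code/training/medical_reasoning_reward.py and may differ.
WEIGHTS = {
    "answer": 0.5,
    "format": 0.1,
    "consistency": 0.2,
    "hidden_recovery": 0.2,
}


def combine_reward(
    answer_correct: bool,
    format_ok: bool,
    consistent: bool,
    hidden_recovered: Optional[bool] = None,
) -> float:
    """Weighted average over the components that apply to this example.

    hidden_recovered is None for factual examples, so that term is
    dropped and the remaining weights are renormalized.
    """
    scores = {
        "answer": float(answer_correct),
        "format": float(format_ok),
        "consistency": float(consistent),
    }
    if hidden_recovered is not None:
        scores["hidden_recovery"] = float(hidden_recovered)
    total_weight = sum(WEIGHTS[name] for name in scores)
    weighted = sum(WEIGHTS[name] * value for name, value in scores.items())
    return weighted / total_weight


# Factual example: three components apply.
print(combine_reward(True, True, True))
# Counterfactual example: all four components apply.
print(combine_reward(True, True, False, hidden_recovered=False))
```

Renormalizing keeps factual and counterfactual rewards on the same 0 to 1 scale, which is one design option among several; a fixed-sum scheme without renormalization would be equally easy to write.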
The main post-training configuration template is in code/training/configs/post_training.yaml.
The full pipeline is in code/data_generation/v2_pipeline.py. To run it from raw CPG JSONL:

    export AZURE_OPENAI_API_KEY=...
    export AZURE_OPENAI_BASE_URL=...
    export GUIDELINE_OPENAI_MODEL=<your-generation-model>
    python code/data_generation/v2_pipeline.py \
        --source-jsonl data/source/us_filter.jsonl \
        --output-dir data/generated/us_only_final \
        --max-qa-jobs 4963 \
        --max-no-action-ratio 0.25

Raw CPG source documents are not bundled by default to avoid redistribution and licensing ambiguity. See `data/source/README.md` for the expected schema.
Do not upload:
- model weights or LoRA adapters;
- training checkpoints and optimizer states;
- private `.env` files, API keys, or usage logs;
- raw MIMIC/benchmark data;
- private evaluation outputs that include benchmark case text;
- raw CPG source documents unless the release license has been explicitly checked.
The included `.gitignore` blocks common model and credential files, but still run `python scripts/validate_release.py` before publishing.
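As an extra manual safeguard, a short scan for files matching the "do not upload" list can be run over the working tree. The sketch below is illustrative and is not what `validate_release.py` does: the suffix and name sets and the helper `find_blocked_files` are assumptions chosen for the example, not taken from the repository's `.gitignore`.

```python
from pathlib import Path

# Assumed patterns for files that should not be published; this list is
# illustrative and is not taken from the repository's .gitignore.
BLOCKED_SUFFIXES = {".safetensors", ".pt", ".ckpt", ".bin"}
BLOCKED_NAMES = {".env"}


def find_blocked_files(root: str) -> list[str]:
    """Return sorted relative paths under root that match a blocked pattern."""
    root_path = Path(root)
    hits = []
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        if path.suffix in BLOCKED_SUFFIXES or path.name in BLOCKED_NAMES:
            hits.append(path.relative_to(root_path).as_posix())
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_blocked_files("."):
        print("blocked:", hit)
```

Run from the repository root, it prints one line per suspicious file and nothing when the tree is clean. A name-based scan like this cannot detect secrets embedded inside otherwise allowed files.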
This repository is for research on clinical reasoning models. It is not a medical device and must not be used as a substitute for professional clinical judgment.