Merged
1 change: 1 addition & 0 deletions README.md
@@ -74,6 +74,7 @@ For users needing custom behavior or specific integrations:
* [`examples/advanced/custom_splitting/inference_individual_splitters.py`](examples/advanced/custom_splitting/inference_individual_splitters.py): Example script for inference using individual splitters.
* [`examples/advanced/custom_splitting/training_individual_splitters.ipynb`](examples/advanced/custom_splitting/training_individual_splitters.ipynb): Notebook demonstrating training data generation with individual splitters.
* [`examples/advanced/custom_splitting/training_custom_split_events.ipynb`](examples/advanced/custom_splitting/training_custom_split_events.ipynb): Notebook showing how to customize split events and forecast different event categories.
* [`examples/advanced/custom_splitting/training_forecasting_splitter_only.ipynb`](examples/advanced/custom_splitting/training_forecasting_splitter_only.ipynb): Forecasting-only example showing training data generation using only the `DataSplitterForecasting` (no event splitter).
* **Custom Text Generation**: [`examples/advanced/custom_output/customizing_text_generation.ipynb`](examples/advanced/custom_output/customizing_text_generation.ipynb)
* A comprehensive tutorial on customizing every textual component of the instruction generation pipeline. Learn how to modify preambles, event formatting, time units, genetic data tags, forecasting prompts, and more to adapt outputs for different LLMs, languages, or institutional requirements.
* **Custom Summarized Row**: [`examples/advanced/custom_output/custom_summarized_row.ipynb`](examples/advanced/custom_output/custom_summarized_row.ipynb)
7 changes: 7 additions & 0 deletions docs/examples/README.md
@@ -19,8 +19,15 @@ Located in the `advanced/` directory, these examples cover more specific use cas
### Custom Splitting (`advanced/custom_splitting/`)

* **[training_individual_splitters.ipynb](advanced/custom_splitting/training_individual_splitters.ipynb)**: Demonstrates data preparation using individual data splitters for more granular control.
* **[training_custom_split_events.ipynb](advanced/custom_splitting/training_custom_split_events.ipynb)**: Shows how to customize split events and forecast different event categories.
* **[training_forecasting_splitter_only.ipynb](advanced/custom_splitting/training_forecasting_splitter_only.ipynb)**: Forecasting-only example showing training data generation using only the `DataSplitterForecasting` (no event splitter).
* **[inference_individual_splitters.py](advanced/custom_splitting/inference_individual_splitters.py)**: A Python script showing how to run inference using the individual splitter setup.

### Custom Output (`advanced/custom_output/`)

* **[customizing_text_generation.ipynb](advanced/custom_output/customizing_text_generation.ipynb)**: A comprehensive tutorial on customizing every textual component of the instruction generation pipeline, including preambles, event formatting, time units, genetic data tags, forecasting prompts, and more.
* **[custom_summarized_row.ipynb](advanced/custom_output/custom_summarized_row.ipynb)**: Shows how to customize the summarized row section of the instruction prompt using `set_custom_summarized_row_fn()`. Includes minimal and advanced examples, plus error handling guidance.

### Pretraining (`advanced/pretraining/`)

* **[prepare_pretraining_data.py](advanced/pretraining/prepare_pretraining_data.py)**: A script to prepare data for the pretraining phase.
@@ -0,0 +1,300 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0",
"metadata": {},
"source": [
"# Forecasting-Only Example: Training Data Generation with Custom Dataset"
]
},
{
"cell_type": "markdown",
"id": "1",
"metadata": {},
"source": [
"Start by importing the required libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from twinweaver import (\n",
"    DataSplitterForecasting,\n",
"    DataManager,\n",
"    ConverterInstruction,\n",
"    Config,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3",
"metadata": {},
"source": [
"## Basic Setup\n"
]
},
{
"cell_type": "markdown",
"id": "4",
"metadata": {},
"source": [
"Load the data first. This example uses a custom dataset taken from the example data."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5",
"metadata": {},
"outputs": [],
"source": [
"df_events = pd.read_csv(\"../../example_data/events.csv\")\n",
"df_constant = pd.read_csv(\"../../example_data/constant.csv\")\n",
"df_constant_description = pd.read_csv(\"../../example_data/constant_description.csv\")"
]
},
{
"cell_type": "markdown",
"id": "6",
"metadata": {},
"source": [
"Set up the config, then the data manager and the forecasting-only pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7",
"metadata": {},
"outputs": [],
"source": [
"config = Config()  # Override values here to customize pipeline\n",
"\n",
"# <---------------------- CRITICAL CONFIGURATION ---------------------->\n",
"# 1. Event category used for data splitting (e.g., split data around Lines of Therapy 'lot')\n",
"# Has to be set for all instruction tasks\n",
"config.split_event_category = \"lot\"\n",
"\n",
"# 2. List of event categories we want to forecast (e.g., forecasting 'lab' values)\n",
"# Only needs to be set if you want to forecast variables\n",
"config.event_category_forecast = [\"lab\"]\n",
"\n",
"# No time-to-event configuration is needed for this forecasting-only example\n",
"\n",
"# Constant setup\n",
"config.constant_columns_to_use = [\"birthyear\", \"gender\", \"histology\", \"smoking_history\"] # Manually set from constant\n",
"config.constant_birthdate_column = \"birthyear\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8",
"metadata": {},
"outputs": [],
"source": [
"dm = DataManager(config=config)\n",
"dm.load_indication_data(df_events=df_events, df_constant=df_constant, df_constant_description=df_constant_description)\n",
"dm.process_indication_data()\n",
"dm.setup_unique_mapping_of_events()\n",
"dm.setup_dataset_splits()\n",
"dm.infer_var_types()\n",
"\n",
"\n",
"data_splitter_forecasting = DataSplitterForecasting(\n",
"    data_manager=dm,\n",
"    config=config,\n",
")\n",
"# If you want to manually override the variables selected for forecasting, you can skip the next line.\n",
"data_splitter_forecasting.setup_statistics()\n",
"\n",
"converter = ConverterInstruction(\n",
"    nr_tokens_budget_total=8192,\n",
"    config=config,\n",
"    dm=dm,\n",
"    variable_stats=data_splitter_forecasting.variable_stats,  # Optional, needed for forecasting QA tasks\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9",
"metadata": {},
"source": [
"## Examine patient data"
]
},
{
"cell_type": "markdown",
"id": "10",
"metadata": {},
"source": [
"From the data manager we can get a patient, for example the third patient ID."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11",
"metadata": {},
"outputs": [],
"source": [
"patientid = dm.all_patientids[2]\n",
"patientid"
]
},
{
"cell_type": "markdown",
"id": "12",
"metadata": {},
"source": [
"Let's check out the patient's data. `patient_data` is a dictionary with two keys:\n",
"- \"events\": A pandas DataFrame containing all time-series events\n",
" (original events and molecular data combined and sorted\n",
" by date).\n",
"- \"constant\": A pandas DataFrame containing the static (constant)\n",
" data for the patient."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13",
"metadata": {},
"outputs": [],
"source": [
"patient_data = dm.get_patient_data(patientid)\n",
"patient_data[\"events\"].head(20)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "14",
"metadata": {},
"outputs": [],
"source": [
"patient_data[\"constant\"]"
]
},
{
"cell_type": "markdown",
"id": "15",
"metadata": {},
"source": [
"## Convert patient data to string"
]
},
{
"cell_type": "markdown",
"id": "16",
"metadata": {},
"source": [
"We start by generating random \"splits\" in the patient trajectory. Each trajectory can yield multiple relevant samples (e.g. depending on when the therapy started), and each sample can predict different variables (e.g. neutrophils/hemoglobin/... for forecasting).\n",
"\n",
"Here we generate these random splits. We can also manually override them (see other examples on inference)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "17",
"metadata": {},
"outputs": [],
"source": [
"processed_splits_fc, split_dates = data_splitter_forecasting.get_splits_from_patient(\n",
"    patient_data,\n",
"    nr_samples_per_split=4,\n",
"    filter_outliers=False,\n",
"    include_metadata=True,\n",
"    max_num_splits_per_split_event=2,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "18",
"metadata": {},
"source": [
"Now for each split, we can generate the formatted strings. Note that `event_splits` is left empty since this example only uses the forecasting splitter."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "19",
"metadata": {},
"outputs": [],
"source": [
"split_idx = 0\n",
"p_converted = converter.forward_conversion(\n",
"    forecasting_splits=processed_splits_fc[split_idx],\n",
"    event_splits=[],  # Not needed for forecasting-only splitter\n",
"    override_mode_to_select_forecasting=\"forecasting\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "20",
"metadata": {},
"source": [
"`p_converted` is a dictionary containing the final formatted data:\n",
"- 'instruction': The complete input string for the model (context + multi-task prompt).\n",
"- 'answer': The complete target string for the model (multi-task answer).\n",
"- 'meta': A dictionary holding metadata including patient ID, structured constant and\n",
" history data used, split date, combined metadata from sub-converters, and\n",
" a list of detailed metadata for each individual task generated ('target_meta_detailed').\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21",
"metadata": {},
"outputs": [],
"source": [
"print(p_converted[\"instruction\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "22",
"metadata": {},
"outputs": [],
"source": [
"print(p_converted[\"answer\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv_dev",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
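For readers of the notebook above, the idea of splitting a patient trajectory around a split event category (`"lot"`) and forecasting another category (`"lab"`) can be illustrated with a small standalone sketch. This is not TwinWeaver's implementation: the function `split_at_events`, the tuple layout, and all event values are hypothetical and exist only for illustration. The sketch also omits the random sampling of split dates and target variables that the notebook describes.

```python
from datetime import date

# Toy trajectory: (date, event_category, event_name, value).
# All names and values are made up for illustration.
events = [
    (date(2020, 1, 5), "lab", "hemoglobin", 13.1),
    (date(2020, 2, 1), "lot", "line_1_start", None),
    (date(2020, 2, 20), "lab", "hemoglobin", 12.4),
    (date(2020, 3, 15), "lab", "neutrophils", 3.2),
    (date(2020, 6, 1), "lot", "line_2_start", None),
    (date(2020, 6, 20), "lab", "hemoglobin", 11.8),
]


def split_at_events(events, split_category, forecast_categories):
    """Hypothetical helper: return one (split_date, history, targets) tuple
    per event in `split_category`.

    history: every event on or before the split date (the model input).
    targets: later events whose category is in `forecast_categories`.
    """
    split_dates = sorted(e[0] for e in events if e[1] == split_category)
    splits = []
    for split_date in split_dates:
        history = [e for e in events if e[0] <= split_date]
        targets = [
            e for e in events
            if e[0] > split_date and e[1] in forecast_categories
        ]
        splits.append((split_date, history, targets))
    return splits


splits = split_at_events(events, split_category="lot", forecast_categories={"lab"})
for split_date, history, targets in splits:
    print(split_date, len(history), "history events,", len(targets), "target events")
```

Each split pairs everything known up to a line-of-therapy start with the lab values recorded afterwards, which is roughly the shape of one forecasting training sample. In the notebook itself this work is delegated to `DataSplitterForecasting.get_splits_from_patient`.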
7 changes: 7 additions & 0 deletions examples/README.md
@@ -19,8 +19,15 @@ Located in the `advanced/` directory, these examples cover more specific use cas
### Custom Splitting (`advanced/custom_splitting/`)

* **[training_individual_splitters.ipynb](advanced/custom_splitting/training_individual_splitters.ipynb)**: Demonstrates data preparation using individual data splitters for more granular control.
* **[training_custom_split_events.ipynb](advanced/custom_splitting/training_custom_split_events.ipynb)**: Shows how to customize split events and forecast different event categories.
* **[training_forecasting_splitter_only.ipynb](advanced/custom_splitting/training_forecasting_splitter_only.ipynb)**: Forecasting-only example showing training data generation using only the `DataSplitterForecasting` (no event splitter).
* **[inference_individual_splitters.py](advanced/custom_splitting/inference_individual_splitters.py)**: A Python script showing how to run inference using the individual splitter setup.

### Custom Output (`advanced/custom_output/`)

* **[customizing_text_generation.ipynb](advanced/custom_output/customizing_text_generation.ipynb)**: A comprehensive tutorial on customizing every textual component of the instruction generation pipeline, including preambles, event formatting, time units, genetic data tags, forecasting prompts, and more.
* **[custom_summarized_row.ipynb](advanced/custom_output/custom_summarized_row.ipynb)**: Shows how to customize the summarized row section of the instruction prompt using `set_custom_summarized_row_fn()`. Includes minimal and advanced examples, plus error handling guidance.

### Pretraining (`advanced/pretraining/`)

* **[prepare_pretraining_data.py](advanced/pretraining/prepare_pretraining_data.py)**: A script to prepare data for the pretraining phase.