LLM pipeline: To auto-generate and validate maps from profiled schemas, follow LLM_MAP_PIPELINE.md (`profile_synthea_tables.py` → `generate_linkml_map_with_llm.py`). For a minimal single-file flow, see `llm_map.py` (PydanticAI) or `baml_map.py` (BAML / OpenAI).
In this step, we will use linkml-map to specify how the source Patients data is transformed into the target OMOP-CDM person and death tables. Generating mapping rules declaratively makes data transformations much easier to maintain, validate, and execute.
We express the mapping declaratively in a YAML file that links the two models. Create a file named `models/patients_to_omop_cdm_v54.yaml` and add the following content:
```yaml
prefixes:
  linkml: https://w3id.org/linkml/
  omop_cdm: https://w3id.org/omop_cdm
  MySchema: https://w3id.org/MySchema

enum_derivations:
  GENDER_enum:
    populated_from: GENDER_enum
    mirror_source: true
  RACE_enum:
    populated_from: RACE_enum
    mirror_source: true
  ETHNICITY_enum:
    populated_from: ETHNICITY_enum
    mirror_source: true
  STATE_enum:
    populated_from: STATE_enum
    mirror_source: true
  PREFIX_enum:
    populated_from: PREFIX_enum
    mirror_source: true
  MARITAL_enum:
    populated_from: MARITAL_enum
    mirror_source: true

class_derivations:
  person:
    populated_from: Patients
    slot_derivations:
      person_id:
        expr: "target = abs(hash(src.Id)) % 1000000000"
      person_source_value:
        populated_from: Id
      gender_concept_id:
        expr: "8507 if GENDER == 'M' else (8532 if GENDER == 'F' else 8551)"
      gender_source_value:
        populated_from: GENDER
      year_of_birth:
        expr: "target = int(str(src.BIRTHDATE)[0:4]) if src.BIRTHDATE else None"
      month_of_birth:
        expr: "target = int(str(src.BIRTHDATE)[5:7]) if src.BIRTHDATE else None"
      day_of_birth:
        expr: "target = int(str(src.BIRTHDATE)[8:10]) if src.BIRTHDATE else None"
      birth_datetime:
        populated_from: BIRTHDATE
      race_concept_id:
        expr: "8527 if RACE == 'white' else (8516 if RACE == 'black' else (8515 if RACE == 'asian' else 0))"
      race_source_value:
        populated_from: RACE
      ethnicity_concept_id:
        expr: "38003563 if ETHNICITY == 'hispanic' else 38003564"
      ethnicity_source_value:
        populated_from: ETHNICITY
  death:
    populated_from: Patients
    slot_derivations:
      person_id:
        expr: "target = abs(hash(src.Id)) % 1000000000"
      death_date:
        expr: "target = str(src.DEATHDATE)[0:10] if src.DEATHDATE else None"
      death_datetime:
        populated_from: DEATHDATE
```

The key elements of this specification:

- `class_derivations`: specifies how the target classes are generated from the source class object.
- `person` & `death`: target classes defined in `models/omop_cdm_v54.yaml`.
- `populated_from`: the source class (`Patients`, inferred and generated from our `source_schemas/`).
- `expr`: allows evaluation of Python-like expressions. This enables type transformations, string formatting, resolving code mappings such as gender to Athena concept IDs (`8507`, `8532`), and extracting OMOP's `year_of_birth` via substring indexing. Under unrestricted evaluation, the result of a Python function call must be assigned to `target =`.
- `enum_derivations`: linkml-map requires enum specifications, or mirroring, for schema-inferred enum fields.
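The `person_id` derivation above can be sanity-checked in plain Python. One caveat worth noting: Python salts `str` hashes per process (controlled by `PYTHONHASHSEED`), so hash-derived IDs are stable within a single run but not across runs unless the seed is pinned. A minimal sketch (the function name is ours, for illustration):

```python
# Sketch of the person_id derivation used in the expr above.
# Caveat: Python salts str hashes per process (PYTHONHASHSEED), so these
# IDs are stable within a run but differ across runs unless the seed is fixed.
def derive_person_id(source_id: str) -> int:
    """Map a Synthea UUID string to a non-negative integer below 10**9."""
    return abs(hash(source_id)) % 1_000_000_000

pid = derive_person_id("2d4b9e6a-example-uuid")
assert 0 <= pid < 1_000_000_000
assert pid == derive_person_id("2d4b9e6a-example-uuid")  # deterministic in-process
```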
Ensure you are using the correct Python environment equipped with LinkML ecosystem libraries:
```shell
source ~/code/environments/linkml-env/bin/activate
```

To check that the specification correctly maps attributes, you can process the file using our split mappings via the CLI:
```shell
linkml-map map-data -T models/patients_to_person.yaml -s models/patients.yaml --source-type Patients data/patients.csv -o output/person.csv -f csv --unrestricted-eval

# Keep only patients with a populated DEATHDATE (column 3), dropping empty rows
awk -F',' 'NR==1 || $3 != ""' data/patients.csv > data/dead_patients.csv

linkml-map map-data -T models/patients_to_death.yaml -s models/patients.yaml --source-type Patients data/dead_patients.csv -o output/death.csv -f csv --unrestricted-eval
```

To keep the logic clean, we have incrementally refactored `patients_to_person.yaml`:
- `models/patients_to_person_v2.yaml`: replaces the Python conditional expressions for enum-backed concept IDs (e.g. `gender_concept_id`) with plain string-to-integer lookups using linkml-map's `value_mappings` dictionaries.
- `models/patients_to_person_v3.yaml`: moves value matching up to the enum level. Rather than resolving strings in slot derivations, the transformation uses `permissible_value_derivations` inside `enum_derivations`. To keep the original string in the output rather than the mapped integer, `gender_source_value` uses `expr: target = src.GENDER` to bypass the enum conversion.
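As an illustration of the v2 pattern (a sketch only; the actual `models/patients_to_person_v2.yaml` may differ in detail), a `value_mappings` slot derivation replaces the Python conditional for `gender_concept_id`:

```yaml
gender_concept_id:
  populated_from: GENDER
  value_mappings:   # plain string-to-concept-id lookup, no Python expr
    M: 8507
    F: 8532
```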
If you attempt to map attribute accesses such as `{BIRTHDATE}.year` over CLI data dumps, linkml-map will typically raise an AttributeError. This is because `csv.DictReader` parses every value as a plain string (`'1992-05-18'`), discarding any declared date datatypes.
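The failure mode is easy to reproduce: `csv.DictReader` hands back strings, and strings have no `.year` attribute.

```python
import csv
import io

# csv.DictReader yields plain strings, so date attribute access fails.
sample = io.StringIO("Id,BIRTHDATE\nabc,1992-05-18\n")
row = next(csv.DictReader(sample))
assert isinstance(row["BIRTHDATE"], str)  # '1992-05-18', not a date object
try:
    row["BIRTHDATE"].year
    raise SystemExit("unexpectedly succeeded")
except AttributeError:
    pass  # str has no .year -- the error the CLI surfaces
```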
Instead of writing string slices inline in the CLI maps, we use a programmatic Python pipeline in `scripts/person_v2.py`. It:

- loads the file into `models.patients.Patients` dataclass instances,
- converts wrapped `XSDDateTime` values back into native Python `datetime` objects,
- applies `models/patients_to_person_v4.yaml`, whose expressions can then use dynamic attribute access (e.g., `src.BIRTHDATE.year` and `src.BIRTHDATE.month`):

```shell
source ~/code/environments/linkml-env/bin/activate
python scripts/person_v2.py
```

This writes `output/person_v4.csv`.
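The core of the rehydration step can be sketched as follows (a simplification; the real script goes through `models.patients.Patients` instances and `XSDDateTime` values):

```python
from datetime import date

# Parse the ISO date string back into a date object so expressions such as
# src.BIRTHDATE.year work instead of raising AttributeError on a str.
def rehydrate_birthdate(value: str) -> date:
    return date.fromisoformat(value[:10])

bd = rehydrate_birthdate("1992-05-18")
assert (bd.year, bd.month, bd.day) == (1992, 5, 18)
```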
In a broader production deployment, pipeline logic typically orchestrates these mappings over full datasets, or compiles them directly down to Pandas/Spark transformations via the `compile` subcommands.
Next, we establish the transformation from `data/conditions.csv` into the OMOP `condition_occurrence` table.
- Source Schema: We define a schema (`models/conditions.yaml`) representing the seven CSV columns (`START`, `STOP`, `PATIENT`, `ENCOUNTER`, `SYSTEM`, `CODE`, `DESCRIPTION`). Because OMOP maps diagnoses via standard vocabulary concepts, we enforce term integrity on the `CODE` attribute by defining a LinkML `reachable_from` constraint rooted at the SNOMED CT concept tree:

  ```yaml
  enums:
    SnomedCode:
      reachable_from:
        source_ontology: obo:snomed
        source_nodes:
          - snomed:138875005  # SNOMED CT Concept Root
  ```
- Mapping Pipeline: We construct `models/conditions_to_condition_occurrence.yaml` to map our `Conditions` records into OMOP. We keep the hashing derivation on the patient UUID strings identical (`abs(hash(src.PATIENT)) % 1000000000`), ensuring referential consistency with the `person_id` values emitted earlier by the `person` mapping.
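Because `expr` values are evaluated as Python, this join relies on `hash()` being deterministic within a process. If the two tables were mapped in separate runs, `PYTHONHASHSEED` would need to be pinned for the keys to agree. A sketch, using a hypothetical UUID:

```python
import os
import subprocess
import sys

# Both person and condition_occurrence derive person_id with
# abs(hash(uuid)) % 10**9. Python salts str hashes per process, so map both
# tables in the same run, or pin PYTHONHASHSEED so the join keys agree.
code = 'print(abs(hash("example-patient-uuid")) % 1_000_000_000)'
env = dict(os.environ, PYTHONHASHSEED="0")

def run_once() -> str:
    return subprocess.run([sys.executable, "-c", code], env=env,
                          capture_output=True, text=True).stdout.strip()

first, second = run_once(), run_once()
assert first == second  # reproducible across runs once the seed is pinned
assert 0 <= int(first) < 1_000_000_000
```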
Since `SnomedCode` is validated as a dynamic enum, we mirror it in the mapping configuration to avoid derivation failures:

```yaml
enum_derivations:
  SnomedCode:
    populated_from: SnomedCode
    mirror_source: true
```

We can execute and validate this component on its own:

```shell
linkml-map map-data -T models/conditions_to_condition_occurrence.yaml -s models/conditions.yaml --source-type Conditions data/conditions.csv -o output/condition_occurrence.csv -f csv --unrestricted-eval
```

To automate map authoring, we add a lightweight "Model Alignment Agent" script: `scripts/generate_linkml_map_with_llm.py`.
It follows the same core pattern used by the LinkML Aurelian agent (reference implementation):
- Ask an LLM for a LinkML map draft.
- Validate locally.
- If validation fails, feed error messages back to the LLM and retry.
Concretely, the script:

- loads the source and target LinkML schemas,
- builds a constrained prompt from:
  - source class slots (for example, `Patients`),
  - target class attributes (for example, `person`),
- calls an OpenAI-compatible Chat Completions endpoint,
- parses the YAML proposal,
- validates its structure (`class_derivations`, the selected target class),
- runs a local dry-run mapping with `linkml-map` APIs on a sample CSV row,
- writes the final map only if validation passes.
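The retry loop at the heart of the agent can be sketched as follows (all names hypothetical; the real implementation lives in `scripts/generate_linkml_map_with_llm.py`):

```python
# Minimal sketch of the draft -> validate -> retry loop.
def generate_map(ask_llm, validate, max_attempts=3):
    feedback = ""
    for _ in range(max_attempts):
        draft = ask_llm(feedback)      # LLM proposes a YAML map
        ok, errors = validate(draft)   # dry-run with linkml-map APIs
        if ok:
            return draft               # only validated maps are written
        feedback = errors              # feed errors back and retry
    raise RuntimeError("no valid map after retries")

# Toy usage: a fake LLM that succeeds once it sees an error message.
draft = generate_map(
    ask_llm=lambda fb: "valid" if fb else "invalid",
    validate=lambda d: (d == "valid", "missing class_derivations"),
)
assert draft == "valid"
```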
```shell
source /Users/alabarga/code/environments/linkml-env/bin/activate
export OPENAI_API_KEY="your_api_key_here"

python scripts/generate_linkml_map_with_llm.py \
  --source-schema models/patients.yaml \
  --target-schema models/omop_cdm_v54.yaml \
  --source-type Patients \
  --target-class person \
  --sample-csv data/patients.csv \
  --output-map models/patients_to_person_llm_v8.yaml \
  --model gpt-4o-mini

linkml-map map-data \
  -T models/patients_to_person_llm_v8.yaml \
  -s models/patients.yaml \
  --source-type Patients \
  data/patients.csv \
  -o output/person_llm_v8.csv \
  -f csv \
  --unrestricted-eval
```

For OMOP `*_concept_id` fields, we also validate semantic correctness against the local OMOP `CONCEPT.csv.gz` table:
```shell
python scripts/validate_concept_ids.py \
  --input output/person_llm_v8.csv \
  --concept-table vocabularies/CONCEPT.csv.gz
```

This script checks that:

- the `concept_id` exists in `CONCEPT.csv.gz`,
- the `concept_id` belongs to the expected `domain_id` category (e.g., `Gender` for `gender_concept_id`),
- `concept_id = 0` is allowed as an "unknown/not mapped" sentinel.

By default, the script also derives the expected `domain_id` category for each `*_concept_id` column from the OMOP schema `bindings`, so it can be reused beyond just `person`.
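The lookup itself is straightforward. A minimal sketch, using an inline synthetic concept table (the real script reads the tab-separated `CONCEPT.csv.gz`, e.g. via `gzip.open`):

```python
import csv
import io

# Synthetic stand-in for the OMOP CONCEPT table (tab-separated).
concept_rows = "concept_id\tdomain_id\n8507\tGender\n8532\tGender\n8527\tRace\n"
domains = {r["concept_id"]: r["domain_id"]
           for r in csv.DictReader(io.StringIO(concept_rows), delimiter="\t")}

def check_concept(concept_id: str, expected_domain: str) -> bool:
    if concept_id == "0":  # 0 is the allowed "unknown/not mapped" sentinel
        return True
    # The id must exist AND sit in the expected domain (e.g. Gender).
    return domains.get(concept_id) == expected_domain

assert check_concept("8507", "Gender")
assert check_concept("0", "Gender")
assert not check_concept("8527", "Gender")  # exists, but wrong domain
```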
- The agent currently generates one target class per run (`--target-class`), which keeps feedback loops clear and easy to debug.
- For multi-table OMOP outputs (for example `person`, `death`, `condition_occurrence`), run the agent once per target class.
- Local validation remains the source of truth:
  - The agent performs a dry-run mapping using the `linkml-map` runtime.
  - It rejects drafts unless required target attributes are populated and have the expected types.
  - Only then does it write the final `linkml-map` YAML.
- The `bindings` you see in the OMOP schema for `*_concept_id` fields serve as "accepted value references"; `validate_concept_ids.py` is the step that actually enforces the semantic lookup against `CONCEPT.csv.gz`.
- Schema validation (`linkml-validate`) requires the imported `omop_vocabulary` file. This repo includes a small local stub at `models/omop_vocabulary.yaml` to unblock validation in tutorial mode.
After generating the map, you can validate the mapped output against models/omop_cdm_v54.yaml:
```shell
/Users/alabarga/code/environments/linkml-env/bin/python - <<'PY'
import csv, json, numbers
from pathlib import Path
from linkml_runtime.utils.schemaview import SchemaView
from linkml_map.transformer.object_transformer import ObjectTransformer

sv = SchemaView('models/patients.yaml')
transformer = ObjectTransformer(source_schemaview=sv)
transformer.unrestricted_eval = True
transformer.load_transformer_specification('models/patients_to_person_llm_v8.yaml')

out = []
with open('data/patients.csv', 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 50:
            break
        mapped = transformer.map_object(row, source_type='Patients')
        # JSON-safe coercion for numpy scalar ints
        for k, v in list(mapped.items()):
            if isinstance(v, numbers.Integral) and not isinstance(v, bool):
                mapped[k] = int(v)
        out.append(mapped)

Path('output/person_llm_v8_list.json').write_text(json.dumps(out), encoding='utf-8')
print('wrote output/person_llm_v8_list.json', len(out))
PY

/Users/alabarga/code/environments/linkml-env/bin/linkml-validate -s models/omop_cdm_v54.yaml -C person output/person_llm_v8_list.json
```