Architecture

Matthias Günter edited this page Feb 3, 2026 · 4 revisions

Modules / Programming

The solution is implemented in Python.

Submodules

The XSD schema for NeTEx is included as a Git submodule in the folder xsd.

Important facts about libraries

TODO

Usage of xsdata

TODO

Usage of lmdb

There is a quirk in lmdb on Windows: it cannot grow the database file automatically. The program now handles the resizing itself. However, because of the way we process the data, we try to avoid resizing altogether, so we estimate the size of the database from the input files.
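
A minimal sketch of such an estimate, not the project's actual code. The function name `estimate_map_size`, the expansion factor and the minimum size are assumptions chosen for illustration; the real heuristic may differ.

```python
import os


def estimate_map_size(paths, expansion_factor=4.0, minimum=64 * 1024 * 1024):
    """Estimate a database size in bytes from the size of the input files.

    expansion_factor and minimum are illustrative assumptions, not values
    taken from the project.
    """
    total = sum(os.path.getsize(path) for path in paths)
    # Never go below the minimum, so that tiny inputs still get a usable map.
    return max(minimum, int(total * expansion_factor))
```

The result could be passed as `map_size` to `lmdb.open`. If the estimate turns out too small, lmdb raises `lmdb.MapFullError`, and the map can be enlarged with `Environment.set_mapsize` before retrying.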

Pipeline

We consider everything a pipeline in which the intermediate results are stored in lmdb and mdbx databases. Fixes are applied at different points: some in the input files and some in the database. In one exceptional case we currently fix something in the GTFS output.

@startuml
(*) --> "Loading Defaults"
note right: Setting reasonable values for all default elements

if "is GTFS" then
  --> "GTFS loading to LMDB"
  "GTFS loading to LMDB" --> "LMDB convert to source MDBX"
  "LMDB convert to source MDBX" --> "Transform to target MDBX"
else
   -->"NeTex_preproc_cleanup"
   note left
     Depends on the problems of the source file(s)
     We may automate detection of problems.
     Here we work on the XML files (even with zip and gzip)
   end note
   "NeTex_preproc_cleanup" --> "Load source MDBX"
   note left: May depend on type of NeTEx
endif

"Load source MDBX" --> "Fix within source MDBX"
note left 
  Fixes depend on the problems of the source file(s). 
  We may automate the detection of problems in the future.
end note
"Fix within source MDBX" --> "Transform to target MDBX"
"Transform to target MDBX" --> "Store to target format"
"Store to target format" --> "Postprocessing (to be avoided)"
"Postprocessing (to be avoided)" --> "Validate result" 
"Validate result" --> (*)

@enduml
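
The stage order in the diagram above can be written down as a small sketch. The function `pipeline_stages` is hypothetical; the stage names are copied from the diagram, and the GTFS branch skips "Fix within source MDBX" only because the diagram draws it that way.

```python
def pipeline_stages(is_gtfs: bool) -> list[str]:
    """Return the stage names from the diagram in execution order."""
    stages = ["Loading Defaults"]
    if is_gtfs:
        # GTFS branch: load into LMDB, then convert to the source MDBX.
        stages += ["GTFS loading to LMDB", "LMDB convert to source MDBX"]
    else:
        # NeTEx branch: clean up the XML, load it, then fix inside the database.
        stages += [
            "NeTex_preproc_cleanup",
            "Load source MDBX",
            "Fix within source MDBX",
        ]
    # Both branches share the remaining stages.
    stages += [
        "Transform to target MDBX",
        "Store to target format",
        "Postprocessing (to be avoided)",
        "Validate result",
    ]
    return stages
```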
sequenceDiagram
title PyNeTExConv: Transformation (generalisation)

participant NeTEx XML Source
participant lmdb Source
participant lmdb Target

participant NeTEx XML Target

NeTEx XML Source->>lmdb Source: Parse XML into database<br /><br />execute: netex_to_db.py

lmdb Source->>lmdb Target: Apply transformations and<br />introduce new objects:<br /> 1. changes to the timing model<br /> 2. calendars vs availability conditions<br /> 3. geographic projections<br /><br />execute: *_db_to_db.py

lmdb Target->>NeTEx XML Target: Query both databases<br />for the objects which the target<br />database has created, or<br />exists as-is in the source.<br /><br />execute: *_db_to_xml.py
lmdb Source->>NeTEx XML Target: .
Loading

There is one notable exception: for GTFS, the data is first read into a DuckDB database, as this is fast and convenient.

Processing

Importing NeTEx sequence diagram

sequenceDiagram
title PyNeTExConv2

participant NeTEx XSD
participant NeTEx XML
participant Python Dataclasses
participant Parsing
participant DuckDB

NeTEx XSD->>Python Dataclasses: xsdata generate -c netex.conf
NeTEx XML->>Parsing:lxml.etree.iterparse<br />event-driven, SAX-based<br />XML parsing

loop #ff00ff for each first class object
Parsing->>Parsing:Inheritance stack from<br />FrameDefaults:<br /> 1. DataSourceRef<br /> 2. ResponsibilitySet<br /> 3. SrsName
Parsing->>Python Dataclasses:unmarshall etree into<br />python object

note over Parsing: Prior to marshalling, execute any code<br />which does not have interdependencies<br />but does generate changes to the object.

Python Dataclasses->>DuckDB: marshall object into:<br /> 1. pickle<br /> 2. XML (legacy)<br /><br />INSERT OR REPLACE
Python Dataclasses->>DuckDB: Recursively resolve:<br /> 1. embedded objects with ids<br /> 2. objects referencing other objects
end
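
The inheritance stack from FrameDefaults can be illustrated with a small sketch. It uses the standard library's `xml.etree.ElementTree.iterparse` in place of `lxml.etree.iterparse` so that it runs without extra dependencies. The sample XML is heavily simplified (no namespaces), and the names `SAMPLE`, `FIRST_CLASS` and `iter_with_defaults` are assumptions for illustration, not the project's code.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; real NeTEx files are far richer.
SAMPLE = """<CompositeFrame id="cf">
  <FrameDefaults>
    <DefaultDataSourceRef ref="src:A"/>
    <DefaultLocationSystem>EPSG:4326</DefaultLocationSystem>
  </FrameDefaults>
  <frames>
    <ServiceFrame id="sf">
      <FrameDefaults>
        <DefaultLocationSystem>EPSG:2056</DefaultLocationSystem>
      </FrameDefaults>
      <lines><Line id="L1"/></lines>
    </ServiceFrame>
    <TimetableFrame id="tf">
      <vehicleJourneys><ServiceJourney id="SJ1"/></vehicleJourneys>
    </TimetableFrame>
  </frames>
</CompositeFrame>"""

# Hypothetical subset of first class objects for this sketch.
FIRST_CLASS = {"Line", "ServiceJourney"}


def iter_with_defaults(source):
    """Yield (id, inherited defaults) for each first class object."""
    stack = [{}]  # one dict of defaults per open frame
    for event, elem in ET.iterparse(source, events=("start", "end")):
        tag = elem.tag
        if event == "start" and tag.endswith("Frame"):
            # A nested frame starts with a copy of its parent's defaults.
            stack.append(dict(stack[-1]))
        elif event == "end" and tag == "FrameDefaults":
            # Values given here override the inherited ones.
            for child in elem:
                stack[-1][child.tag] = child.get("ref") or (child.text or "").strip()
        elif event == "end" and tag in FIRST_CLASS:
            yield elem.get("id"), dict(stack[-1])
            elem.clear()  # release memory held by the processed object
        elif event == "end" and tag.endswith("Frame"):
            stack.pop()


if __name__ == "__main__":
    for object_id, defaults in iter_with_defaults(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(object_id, defaults)
```

In this sketch the Line inside the ServiceFrame sees the overridden location system, while the ServiceJourney in the TimetableFrame falls back to the value of the enclosing CompositeFrame.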

Transformation NeTEx sequence diagram

sequenceDiagram
title PyNeTExConv: Transformation (generalisation)

participant NeTEx XML Source
participant DuckDB Source
participant DuckDB Target

participant NeTEx XML Target

NeTEx XML Source->>DuckDB Source: Parse XML into database<br /><br />execute: netex_to_db.py

DuckDB Source->>DuckDB Target: Apply transformations and<br />introduce new objects:<br /> 1. changes to the timing model<br /> 2. calendars vs availability conditions<br /> 3. geographic projections<br /><br />execute: *_db_to_db.py

DuckDB Target->>NeTEx XML Target: Query both databases<br />for the objects which the target<br />database has created, or<br />exists as-is in the source.<br /><br />execute: *_db_to_xml.py
DuckDB Source->>NeTEx XML Target: .
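
The last step of the diagram, querying both databases and preferring objects created in the target, can be sketched with plain dictionaries standing in for the two databases. The keys and values below are made up for illustration; the real scripts work on the actual database tables.

```python
from collections import ChainMap

# Hypothetical stand-ins for the source and target databases.
source_db = {"Line:1": "original line", "DayType:1": "unchanged day type"}
target_db = {"Line:1": "transformed line", "ServiceCalendar:1": "new calendar"}

# Lookups hit the target first and fall back to the source.
merged = ChainMap(target_db, source_db)

for key in sorted(merged):
    print(key, "->", merged[key])
```

An object that was transformed is taken from the target, an object that only exists in the source is exported as-is, and new objects introduced by the transformation are included as well.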

Handling of Embeddings

Handling of operating days

Handling of interchanges

Omitted data structures

GTFS

  • Shapes
  • Flex elements (for the time being)

NeTEx

TODO

Filtering and conditional inward filtering

If only part of a network is used, the relevant data elements must be selected somehow. It is clear that all "included" elements are selected, e.g. a ServiceJourneyPattern used by a ServiceJourney that is used by a Line.

However, sometimes one of the assignment types must be included as well. For these, a dict is necessary that states in which direction the assignment may be traversed. Otherwise the whole network is selected instead of a single line.

This behaviour is defined in two structures:

filter_set = {Line, ServiceJourneyPattern, DayType, ScheduledStopPoint}
filter_set_assignment = {DayType: {DayTypeAssignment}, ScheduledStopPoint: {PassengerStopAssignment}}

  • filter_set: the elements that can be filtered for (only a limited number)
  • filter_set_assignment: the assignment is only followed when coming from the key type, e.g. DayTypeAssignment only when coming from DayType (conditional inward filtering)
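
A minimal sketch of conditional inward filtering on a toy network, not the project's implementation. Type names are plain strings instead of the generated classes, and `OBJECTS`, `FILTER_SET_ASSIGNMENT` and `select` are hypothetical names. Each object is stored as (type, list of referenced ids).

```python
# Assignment types that may be pulled in, keyed by the type we come from.
FILTER_SET_ASSIGNMENT = {
    "DayType": {"DayTypeAssignment"},
    "ScheduledStopPoint": {"PassengerStopAssignment"},
}

# Toy network: two lines whose stop points share one Quay.
OBJECTS = {
    "Line:1": ("Line", ["SJP:1"]),
    "Line:2": ("Line", ["SJP:2"]),
    "SJP:1": ("ServiceJourneyPattern", ["SSP:A"]),
    "SJP:2": ("ServiceJourneyPattern", ["SSP:B"]),
    "SSP:A": ("ScheduledStopPoint", []),
    "SSP:B": ("ScheduledStopPoint", []),
    "PSA:A": ("PassengerStopAssignment", ["SSP:A", "Quay:1"]),
    "PSA:B": ("PassengerStopAssignment", ["SSP:B", "Quay:1"]),
    "Quay:1": ("Quay", []),
}


def select(objects, start_ids):
    """Collect everything reachable from start_ids under the filter rules."""
    selected = set()
    todo = list(start_ids)
    while todo:
        object_id = todo.pop()
        if object_id in selected or object_id not in objects:
            continue
        selected.add(object_id)
        object_type, refs = objects[object_id]
        # Outward: everything this object references is included.
        todo.extend(refs)
        # Inward: only the listed assignment types, and only from this type.
        allowed = FILTER_SET_ASSIGNMENT.get(object_type, set())
        if allowed:
            for other_id, (other_type, other_refs) in objects.items():
                if other_type in allowed and object_id in other_refs:
                    todo.append(other_id)
    return selected


if __name__ == "__main__":
    print(sorted(select(OBJECTS, ["Line:1"])))
```

Starting from Line:1, the sketch reaches PSA:A through SSP:A and from there Quay:1. Because Quay is not a key in the dict, PSA:B is not pulled in through the shared Quay, so Line:2 and its elements stay out of the selection.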