Architecture
The solution is implemented in Python.
The XSD schema for NeTEx is loaded into the folder xsd as a submodule
TODO
TODO
LMDB has a quirk on Windows: it cannot grow the database file automatically. The program now resizes the file itself. However, because of the way we process the data, we try to avoid resizing, so we estimate the size of the database from the input files up front.
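As an illustration of estimating the database size from the input files, here is a minimal sketch. The function name `estimate_map_size` and all three factors are hypothetical and not taken from the project; they only show the idea of scaling the on-disk input size by an assumed expansion ratio.

```python
from pathlib import Path


def estimate_map_size(paths, expansion=4, compressed=10, minimum=64 * 1024 * 1024):
    """Return a rough upper bound (in bytes) for the database size.

    expansion:  assumed growth from input text to stored objects (hypothetical)
    compressed: assumed extra growth for .zip / .gz inputs (hypothetical)
    minimum:    lower bound so that tiny inputs still get a usable database
    """
    total = 0
    for path in map(Path, paths):
        size = path.stat().st_size
        if path.suffix.lower() in {".zip", ".gz"}:
            # Compressed inputs are much larger once unpacked.
            size *= compressed
        total += size * expansion
    return max(total, minimum)
```

With py-lmdb, the result could then be passed as the `map_size` argument of `lmdb.open(...)`, so that the file rarely needs to be resized while processing.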
We treat everything as a pipeline in which the intermediate results are stored in LMDB and MDBX databases. Fixes are applied at different points: some on the input files, some within the database. As a single exception, we currently fix one thing in the GTFS output.
@startuml
(*) --> "Loading Defaults"
note right: Setting reasonable values for all default elements
if "is GTFS" then
--> "GTFS loading to LMDB"
"GTFS loading to LMDB" --> "LMDB convert to source MDBX"
"LMDB convert to source MDBX" --> "Transform to target MDBX"
else
-->"NeTex_preproc_cleanup"
note left
Depends on the problems of the source file(s)
We may automate detection of problems.
Here we work on the XML files (even with zip and gzip)
end note
"NeTex_preproc_cleanup" --> "Load source MDBX"
note left: May depend on type of NeTEx
endif
"Load source MDBX" --> "Fix within source MDBX"
note left
Fixes depend on the problems of the source file(s).
We may automate the detection of problems in the future.
end note
"Fix within source MDBX" --> "Transform to target MDBX"
"Transform to target MDBX" --> "Store to target format"
"Store to target format" --> "Postprocessing (to be avoided)"
"Postprocessing (to be avoided)" --> "Validate result"
"Validate result" --> (*)
@enduml
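The flow in the diagram above can be restated as a small sketch that returns the ordered stages for a given input type. The stage names are copied from the diagram; the function `plan_stages` is hypothetical and only mirrors the branching, not the real implementation.

```python
def plan_stages(is_gtfs: bool) -> list[str]:
    """Return the pipeline stages in execution order, as drawn in the diagram."""
    stages = ["Loading Defaults"]
    if is_gtfs:
        # GTFS input joins the common path at the transformation step.
        stages += ["GTFS loading to LMDB", "LMDB convert to source MDBX"]
    else:
        # NeTEx input is cleaned up as XML first, then loaded and fixed in the database.
        stages += [
            "NeTex_preproc_cleanup",
            "Load source MDBX",
            "Fix within source MDBX",
        ]
    stages += [
        "Transform to target MDBX",
        "Store to target format",
        "Postprocessing (to be avoided)",
        "Validate result",
    ]
    return stages
```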
sequenceDiagram
title PyNeTExConv: Transformation (generalisation)
participant NeTEx XML Source
participant lmdb Source
participant lmdb Target
participant NeTEx XML Target
NeTEx XML Source->>lmdb Source: Parse XML into database<br /><br />execute: netex_to_db.py
lmdb Source->>lmdb Target: Apply transformations and<br />introduce new objects:<br /> 1. changes to the timing model<br /> 2. calendars vs availabilityconditions<br /> 3. geographic projections<br /><br />execute: *_db_to_db.py
lmdb Target->>NeTEx XML Target: Query both databases<br />for the objects which the target<br />database has created, or<br />exists as-is in the source.<br /><br />execute: *_db_to_xml.py
lmdb Source->>NeTEx XML Target: .
There is one notable exception: for GTFS, the data is first read into a DuckDB database, as this is really fast and convenient.
sequenceDiagram
title PyNeTExConv2
participant NeTEx XSD
participant NeTEx XML
participant Python Dataclasses
participant Parsing
participant DuckDB
NeTEx XSD->>Python Dataclasses: xsdata generate -c netex.conf
NeTEx XML->>Parsing:lxml.etree.iterparse<br />event driven sax based<br />XML-parsing
loop #ff00ff for each first class object
Parsing->>Parsing:Inheritance stack from<br />FrameDefaults:<br /> 1. DataSourceRef<br /> 2. ResponsibilitySet<br /> 3. SrsName
Parsing->>Python Dataclasses:unmarshall etree into<br />python object
note over Parsing: Prior to marshall, execute any code<br />which does not have interdependencies<br />but does generate changes to the object.
Python Dataclasses->>DuckDB: marshall object into:<br /> 1. pickle<br /> 2. XML (legacy)<br /><br />INSERT OR REPLACE
Python Dataclasses->>DuckDB: Recursively resolve:<br /> 1. embedded objects with ids<br /> 2. objects referencing other objects
end
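The parsing loop above can be sketched with the standard library only. This is a simplified stand-in, not the project's code: `xml.etree.ElementTree.iterparse` replaces `lxml.etree.iterparse`, `sqlite3` replaces DuckDB, and a plain dict replaces the xsdata-generated dataclasses. The table layout, `FIRST_CLASS` and `load_objects` are hypothetical. The FrameDefaults inheritance stack and the recursive resolution of embedded objects are left out.

```python
import io
import pickle
import sqlite3
import xml.etree.ElementTree as ET

NETEX_NS = "{http://www.netex.org.uk/netex}"
# Hypothetical subset of first-class object types, for illustration only.
FIRST_CLASS = {"Line", "ScheduledStopPoint"}


def load_objects(xml_source, con):
    """Stream xml_source and upsert each first-class object; return the number processed."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        "type TEXT, id TEXT, version TEXT, payload BLOB, "
        "PRIMARY KEY (type, id, version))"
    )
    processed = 0
    # An "end" event fires once an element and all of its children are complete.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.removeprefix(NETEX_NS)
        if tag not in FIRST_CLASS or elem.get("id") is None:
            continue
        # Stand-in for unmarshalling the element into a generated dataclass.
        obj = {
            "id": elem.get("id"),
            "version": elem.get("version", "any"),
            "name": elem.findtext(f"{NETEX_NS}Name"),
        }
        con.execute(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?, ?)",
            (tag, obj["id"], obj["version"], pickle.dumps(obj)),
        )
        elem.clear()  # drop the processed subtree to limit memory use
        processed += 1
    con.commit()
    return processed


SAMPLE = b"""<PublicationDelivery xmlns="http://www.netex.org.uk/netex">
  <Line id="L:1" version="1"><Name>Old</Name></Line>
  <Line id="L:1" version="1"><Name>New</Name></Line>
  <ScheduledStopPoint id="S:1" version="1"><Name>Stop</Name></ScheduledStopPoint>
</PublicationDelivery>"""

con = sqlite3.connect(":memory:")
processed = load_objects(io.BytesIO(SAMPLE), con)
```

Because of `INSERT OR REPLACE`, the second `Line` with the same type, id and version overwrites the first, so three elements are processed but only two rows remain.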
sequenceDiagram
title PyNeTExConv: Transformation (generalisation)
participant NeTEx XML Source
participant DuckDB Source
participant DuckDB Target
participant NeTEx XML Target
NeTEx XML Source->>DuckDB Source: Parse XML into database<br /><br />execute: netex_to_db.py
DuckDB Source->>DuckDB Target: Apply transformations and<br />introduce new objects:<br /> 1. changes to the timing model<br /> 2. calendars vs availabilityconditions<br /> 3. geographic projections<br /><br />execute: *_db_to_db.py
DuckDB Target->>NeTEx XML Target: Query both databases<br />for the objects which the target<br />database has created, or<br />exists as-is in the source.<br /><br />execute: *_db_to_xml.py
DuckDB Source->>NeTEx XML Target: .
- Shapes
- Flex elements (for the time being)
TODO
If only part of a network is used, the relevant data elements must be selected somehow. It is clear that all "included" elements belong to the selection, e.g. a ServiceJourneyPattern used by a ServiceJourney used by a Line.
However, sometimes one of the assignment types must be included as well. For these, a dict is needed that states from which direction an assignment may be traversed. Otherwise the whole network is selected instead of a single line.
This behaviour is defined in two structures:
filter_set = {Line, ServiceJourneyPattern, DayType, ScheduledStopPoint}
filter_set_assignment = {DayType: {DayTypeAssignment}, ScheduledStopPoint: {PassengerStopAssignment}}
- filter_set: which elements can be filtered for (only a limited number)
- filter_set_assignment: the assignment is only used when coming from DayType (conditional inward filtering)
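One possible reading of this conditional inward filtering is sketched below. It is an assumption about the traversal, not the project's implementation: type names are plain strings instead of the generated classes, the object graph is a hypothetical dict of `key -> (type name, referenced keys)`, and `select` is a made-up helper. Outgoing references are always followed. Incoming references are followed only from the seed itself, or when the referring type is listed in `filter_set_assignment` for the current type.

```python
from collections import deque

filter_set = {"Line", "ServiceJourneyPattern", "DayType", "ScheduledStopPoint"}
filter_set_assignment = {
    "DayType": {"DayTypeAssignment"},
    "ScheduledStopPoint": {"PassengerStopAssignment"},
}


def select(objects, seed):
    """Return the keys reachable from seed under conditional inward filtering."""
    if objects[seed][0] not in filter_set:
        raise ValueError(f"cannot filter on {objects[seed][0]}")
    # Reverse index: which objects refer to a given key?
    referrers = {}
    for key, (_type, refs) in objects.items():
        for ref in refs:
            referrers.setdefault(ref, []).append(key)
    selected = {seed}
    queue = deque([seed])
    while queue:
        key = queue.popleft()
        type_name, refs = objects[key]
        candidates = list(refs)  # outgoing references are always followed
        allowed_inward = filter_set_assignment.get(type_name, set())
        for referrer in referrers.get(key, []):
            # Inward only from the seed, or via an allowed assignment type.
            if key == seed or objects[referrer][0] in allowed_inward:
                candidates.append(referrer)
        for candidate in candidates:
            if candidate not in selected:
                selected.add(candidate)
                queue.append(candidate)
    return selected


# Hypothetical mini network: two lines sharing a day type and a stop point.
objects = {
    "L1": ("Line", []),
    "L2": ("Line", []),
    "SJ1": ("ServiceJourney", ["L1", "SJP1", "DT1"]),
    "SJ2": ("ServiceJourney", ["L2", "SJP2", "DT1"]),
    "SJP1": ("ServiceJourneyPattern", ["SSP1"]),
    "SJP2": ("ServiceJourneyPattern", ["SSP1"]),
    "SSP1": ("ScheduledStopPoint", []),
    "DT1": ("DayType", []),
    "DTA1": ("DayTypeAssignment", ["DT1"]),
    "PSA1": ("PassengerStopAssignment", ["SSP1"]),
}
selection = select(objects, "L1")
```

In this example the DayTypeAssignment and PassengerStopAssignment are picked up through the shared DayType and ScheduledStopPoint, while the second line, its ServiceJourney and its ServiceJourneyPattern stay out of the selection even though they refer to the same DayType and stop point.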