Arrow Schema (Parquet file) generation #452
Seth Fitzsimmons (mojodna) wants to merge 1 commit into codegen
Conversation
Add `--format arrow` to the CLI, producing pyarrow Schemas from Overture feature models. The renderer walks ModelSpecs top-down: primitives map to Arrow scalars via TypeRegistry, MODEL-kind fields expand recursively into nested structs, enums and literals fall back to string, and discriminated unions merge member structs into a single flattened struct with all fields nullable. Dict fields emit map types. List-wrapped fields emit list types. Field descriptions and schema-level model metadata (name, description, constraint prose) are embedded in Arrow metadata so downstream consumers can inspect documentation without the source models. pyarrow is an optional dependency (`arrow` extra) to avoid adding weight for markdown-only users.
Adam Lastowka (Rachmanin0xFF)
left a comment
Nice, solid!
Differences between the Parquets this generates and our latest release are expected (bbox float/doubles, nullability, admin_level, re-ordering of fields) -- with the exception of a codegen issue affecting divisions (I left a comment in that PR).
| """Build schema-level metadata dict, or None if empty.""" | ||
| metadata: dict[bytes, bytes] = {} | ||
| if version: | ||
| metadata[b"overture-schema.version"] = version.encode() |
We're hoping to include schema version metadata in our release parquets, too -- Jennings Anderson (@jenningsanderson) was working on writing something like this in our data normalization task (process_data).
It'd be good if we could keep the naming consistent; Jennings Anderson (@jenningsanderson), have you settled on something already?
Also, our published parquet files include some geoparquet spec info...
I guess I'm wondering whether we want:
- To validate our metadata against the schema (I would say no, since right now we're writing metadata post-schema validation)
- To copy our metadata from the schema. So version, geoparquet crs, etc. would live here, and our pipelines would copy that in during the normalization step.
- To just do nothing and not aim for any pipeline/schema parquet metadata crossover.
I like option 2.
Summary
Add `--format arrow` to the codegen CLI, producing PyArrow schemas from Overture feature models. Output is a metadata-only Parquet file per feature -- no row data, just the schema. The central data pipeline reads these files to enforce column names, types, and nullability when writing Overture Parquet datasets.
This is the first non-markdown output target, exercising the extraction/rendering split
described in the codegen design doc. The extraction layer and output layout are
untouched. The change adds one renderer module, extends
`TypeMapping` with an `arrow` column, and branches the CLI's format dispatch.
How it works
`arrow_renderer.py` converts `ModelSpec` fields to PyArrow types by walking the existing spec tree:

- Primitives map via the TypeRegistry `arrow` column (`int32` -> `pa.int32()`, `str` -> `pa.utf8()`, `Geometry` -> `pa.binary()`)
- MODEL-kind fields recurse into nested `pa.struct()` types, with cycle detection that falls back to `pa.utf8()`
- Enums and literals fall back to `pa.utf8()` (string representation)
- `list[X]` wraps with `pa.list_()`; `dict[K, V]` emits `pa.map_()`
- `BBox` gets a custom struct (`xmin`, `ymin`, `xmax`, `ymax` as `float64`)

Discriminated unions merge member structs into a single flattened struct via `merge_model_variants()`. Fields present in all variants keep their type (promoted when widths differ). Fields absent from some variants become nullable. Numeric type promotion follows Arrow/Spark conventions: the wider type wins, mixed int/float promotes to `float64`, and mixed signed/unsigned promotes to the next wider signed type.
Field descriptions embed as Arrow field metadata (`b"description"` key). Schema-level metadata carries `overture-schema.version` (from the installed package version) and `model` (the entry point string).

Changes
- `arrow_renderer.py` -- new module: `type_info_to_arrow`, `field_spec_to_arrow`, `model_spec_to_arrow_schema`, `union_spec_to_arrow_schema`, `merge_model_variants`
- `type_registry.py` -- add `arrow: str | None` to `TypeMapping`, populate for all primitives, update `for_target()` to return `None` for missing targets
- `cli.py` -- add `arrow` to format choices; `_generate_arrow()` writes metadata-only Parquet files or text to stdout
- `pyproject.toml` -- `[project.optional-dependencies] arrow = ["pyarrow>=14.0"]`
- `test_arrow_renderer.py`, `test_cli.py`

Design decisions
Metadata-only Parquet files, not Arrow IPC or JSON. The data pipeline already reads Parquet. `pq.write_metadata(schema, path)` produces a valid Parquet file with zero row groups -- `pq.read_schema(path)` recovers the full Arrow schema. No custom format to parse.
Union variants flatten to a single struct. Arrow has a union type, but Parquet does
not. Since the output targets Parquet consumption, unions flatten to a struct where
variant-specific fields are nullable. This matches how Overture data lands in Parquet
today.
pyarrow is an optional dependency. The `arrow` extra avoids adding pyarrow's weight for markdown-only users. The CLI imports lazily and produces a clear error message when pyarrow is missing.
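The lazy-import guard might look something like this. This is a generic sketch; the actual function name and error text in the PR may differ.

```python
import importlib

def require_extra(module_name: str, extra: str):
    """Import an optional dependency, or exit with a message naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise SystemExit(
            f"'{module_name}' is required for this format; "
            f"install it with: pip install 'overture-codegen[{extra}]'"
        ) from exc

# Hypothetical call site, e.g. inside _generate_arrow():
#     pa = require_extra("pyarrow", "arrow")
```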
`TypeMapping.for_target()` returns `None` instead of raising for missing targets. `BBox` has no `arrow` mapping because it uses a custom struct -- `None` signals the renderer to check `_CUSTOM_ARROW_TYPES` before falling back. This also prevents future targets from needing to backfill every registry entry on day one.
Union specs flatten to a single merged schema. `union_spec_to_arrow_schema` calls `merge_model_variants()` on the union's member types, producing one flattened struct per union feature. Variant-only fields become nullable; shared fields use type promotion. Nested union fields (like `VehicleSelector` inside segment variants) merge inline via the same path in `type_info_to_arrow`.

Test plan
- `make check` passes (all existing tests + new Arrow tests)
- `overture-codegen generate --format arrow --theme buildings` prints Arrow schema text to stdout
- `overture-codegen generate --format arrow --theme buildings --output-dir /tmp/arrow-out` writes `.parquet` files; `pq.read_schema()` recovers the schema
- `overture-codegen generate --format arrow --theme transportation --output-dir /tmp/arrow-out` writes `segment.parquet` (union type) alongside `connector.parquet`
- `overture-codegen generate --format markdown` still works identically (no regression from `TypeMapping` changes)
- With pyarrow not installed, `--format arrow` produces a clear error message