
Arrow Schema (Parquet file) generation #452

Open
Seth Fitzsimmons (mojodna) wants to merge 1 commit into codegen from arrow-schema

Conversation


Seth Fitzsimmons (mojodna) commented Feb 27, 2026

Summary

Add --format arrow to the codegen CLI, producing PyArrow schemas from Overture feature
models. Output is a metadata-only Parquet file per feature -- no row data, just the
schema. The central data pipeline reads these files to enforce column names, types, and
nullability when writing Overture Parquet datasets.

This is the first non-markdown output target, exercising the extraction/rendering split
described in the codegen design doc. The extraction layer and output layout are
untouched. The change adds one renderer module, extends TypeMapping with an arrow
column, and branches the CLI's format dispatch.

How it works

arrow_renderer.py converts ModelSpec fields to PyArrow types by walking the existing
spec tree:

  • Primitives map to Arrow scalars via the type registry's new arrow column (int32 ->
    pa.int32(), str -> pa.utf8(), Geometry -> pa.binary())
  • MODEL-kind fields expand recursively into nested pa.struct() types, with cycle
    detection that falls back to pa.utf8()
  • Enums and Literals fall back to pa.utf8() (string representation)
  • list[X] wraps with pa.list_(), dict[K, V] emits pa.map_()
  • BBox gets a custom struct (xmin, ymin, xmax, ymax as float64)

Discriminated unions merge member structs into a single flattened struct via
merge_model_variants(). Fields present in all variants keep their type (promoted when
widths differ). Fields absent from some variants become nullable. Numeric type promotion
follows Arrow/Spark conventions: wider type wins, mixed int/float promotes to float64,
mixed signed/unsigned promotes to the next wider signed type.

Field descriptions are embedded as Arrow field metadata (b"description" key). Schema-level
metadata carries overture-schema.version (from the installed package version) and
model (the entry point string).

Changes

| File | What |
| --- | --- |
| `arrow_renderer.py` | New renderer: `type_info_to_arrow`, `field_spec_to_arrow`, `model_spec_to_arrow_schema`, `union_spec_to_arrow_schema`, `merge_model_variants` |
| `type_registry.py` | Add `arrow: str \| None` to `TypeMapping`, populate for all primitives, update `for_target()` to return `None` for missing targets |
| `cli.py` | Add `arrow` to format choices; `_generate_arrow()` writes metadata-only Parquet files or text to stdout |
| `pyproject.toml` | Add `[project.optional-dependencies] arrow = ["pyarrow>=14.0"]` |
| `test_arrow_renderer.py` | Primitive mapping, fallbacks, lists, dicts, nested models, union merging, type promotion, field metadata, real-model integration |
| `test_cli.py` | Arrow CLI tests: stdout output, Parquet file generation, schema round-trip |

Design decisions

Metadata-only Parquet files, not Arrow IPC or JSON. The data pipeline already reads
Parquet. pq.write_metadata(schema, path) produces a valid Parquet file with zero row
groups -- pq.read_schema(path) recovers the full Arrow schema. No custom format to
parse.

Union variants flatten to a single struct. Arrow has a union type, but Parquet does
not. Since the output targets Parquet consumption, unions flatten to a struct where
variant-specific fields are nullable. This matches how Overture data lands in Parquet
today.

pyarrow is an optional dependency. The arrow extra avoids adding pyarrow's weight
for markdown-only users. The CLI imports lazily and produces a clear error message when
pyarrow is missing.

TypeMapping.for_target() returns None instead of raising for missing targets.
BBox has no arrow mapping because it uses a custom struct -- None signals the
renderer to check _CUSTOM_ARROW_TYPES before falling back. This also prevents future
targets from needing to backfill every registry entry on day one.

Union specs flatten to a single merged schema. union_spec_to_arrow_schema calls
merge_model_variants() on the union's member types, producing one flattened struct per
union feature. Variant-only fields become nullable; shared fields use type promotion.
Nested union fields (like VehicleSelector inside segment variants) merge inline via the
same path in type_info_to_arrow.

Test plan

  • make check passes (all existing tests + new Arrow tests)
  • overture-codegen generate --format arrow --theme buildings prints Arrow schema
    text to stdout
  • overture-codegen generate --format arrow --theme buildings --output-dir /tmp/arrow-out writes .parquet files; pq.read_schema() recovers the schema
  • overture-codegen generate --format arrow --theme transportation --output-dir /tmp/arrow-out writes segment.parquet (union type) alongside connector.parquet
  • overture-codegen generate --format markdown still works identically (no
    regression from TypeMapping changes)
  • Without pyarrow installed, --format arrow produces a clear error message

Add `--format arrow` to the CLI, producing pyarrow Schemas from
Overture feature models.

The renderer walks ModelSpecs top-down: primitives map to Arrow
scalars via TypeRegistry, MODEL-kind fields expand recursively
into nested structs, enums and literals fall back to string,
and discriminated unions merge member structs into a single
flattened struct with variant-only fields nullable. Dict fields
emit map types. List-wrapped fields emit list types.

Field descriptions and schema-level model metadata (name,
description, constraint prose) are embedded in Arrow metadata
so downstream consumers can inspect documentation without the
source models.

pyarrow is an optional dependency (`arrow` extra) to avoid
adding weight for markdown-only users.
Seth Fitzsimmons (mojodna) changed the base branch from dev to codegen February 27, 2026 00:46
Seth Fitzsimmons (mojodna) changed the title Arrow Schema generation → Arrow Schema (Parquet file) generation Feb 27, 2026
Contributor

Nice, solid!

Differences between the Parquets this generates and our latest release are expected (bbox float/doubles, nullability, admin_level, re-ordering of fields) -- with the exception of a codegen issue affecting divisions (I left a comment in that PR).

```python
"""Build schema-level metadata dict, or None if empty."""
metadata: dict[bytes, bytes] = {}
if version:
    metadata[b"overture-schema.version"] = version.encode()
```


We're hoping to include schema version metadata in our release parquets, too -- Jennings Anderson (@jenningsanderson) was working on writing something like this in our data normalization task (process_data).

It'd be good if we could keep the naming consistent; @jenningsanderson, had you settled on something already?


Also, our published parquet files include some geoparquet spec info...

I guess I'm wondering whether we want:

  1. To validate our metadata against the schema (I would say no, since right now we're writing metadata post-schema validation)
  2. To copy our metadata from the schema. So version, geoparquet crs, etc. would live here, and our pipelines would copy that in during the normalization step.
  3. To just sort of do nothing and not aim for any pipeline/schema parquet metadata crossover.

I like option 2.
