
Arrow Schema (Parquet file) generation #452

Open
Seth Fitzsimmons (mojodna) wants to merge 1 commit into codegen from arrow-schema

Conversation


Seth Fitzsimmons (mojodna) commented Feb 27, 2026

Summary

Add --format arrow to the codegen CLI, producing PyArrow schemas from Overture feature
models. Output is a metadata-only Parquet file per feature -- no row data, just the
schema. The central data pipeline reads these files to enforce column names, types, and
nullability when writing Overture Parquet datasets.

This is the first non-markdown output target, exercising the extraction/rendering split
described in the codegen design doc. The extraction layer and output layout are
untouched. The change adds one renderer module, extends TypeMapping with an arrow
column, and branches the CLI's format dispatch.

How it works

arrow_renderer.py converts ModelSpec fields to PyArrow types by walking the existing
spec tree:

  • Primitives map to Arrow scalars via the type registry's new arrow column (int32 ->
    pa.int32(), str -> pa.utf8(), Geometry -> pa.binary())
  • MODEL-kind fields expand recursively into nested pa.struct() types, with cycle
    detection that falls back to pa.utf8()
  • Enums and Literals fall back to pa.utf8() (string representation)
  • list[X] wraps with pa.list_(), dict[K, V] emits pa.map_()
  • BBox gets a custom struct (xmin, ymin, xmax, ymax as float64)

Discriminated unions merge member structs into a single flattened struct via
merge_model_variants(). Fields present in all variants keep their type (promoted when
widths differ). Fields absent from some variants become nullable. Numeric type promotion
follows Arrow/Spark conventions: wider type wins, mixed int/float promotes to float64,
mixed signed/unsigned promotes to the next wider signed type.

Field descriptions are embedded as Arrow field metadata (b"description" key). Schema-level
metadata carries overture-schema.version (from the installed package version) and
model (the entry point string).

Changes

| File | What |
| --- | --- |
| `arrow_renderer.py` | New renderer: `type_info_to_arrow`, `field_spec_to_arrow`, `model_spec_to_arrow_schema`, `union_spec_to_arrow_schema`, `merge_model_variants` |
| `type_registry.py` | Add `arrow: str \| None` to `TypeMapping`, populate for all primitives, update `for_target()` to return `None` for missing targets |
| `cli.py` | Add `arrow` to format choices; `_generate_arrow()` writes metadata-only Parquet files or text to stdout |
| `pyproject.toml` | Add `[project.optional-dependencies] arrow = ["pyarrow>=14.0"]` |
| `test_arrow_renderer.py` | Primitive mapping, fallbacks, lists, dicts, nested models, union merging, type promotion, field metadata, real-model integration |
| `test_cli.py` | Arrow CLI tests: stdout output, Parquet file generation, schema round-trip |

Design decisions

Metadata-only Parquet files, not Arrow IPC or JSON. The data pipeline already reads
Parquet. pq.write_metadata(schema, path) produces a valid Parquet file with zero row
groups -- pq.read_schema(path) recovers the full Arrow schema. No custom format to
parse.

Union variants flatten to a single struct. Arrow has a union type, but Parquet does
not. Since the output targets Parquet consumption, unions flatten to a struct where
variant-specific fields are nullable. This matches how Overture data lands in Parquet
today.

pyarrow is an optional dependency. The arrow extra avoids adding pyarrow's weight
for markdown-only users. The CLI imports lazily and produces a clear error message when
pyarrow is missing.

TypeMapping.for_target() returns None instead of raising for missing targets.
BBox has no arrow mapping because it uses a custom struct -- None signals the
renderer to check _CUSTOM_ARROW_TYPES before falling back. This also prevents future
targets from needing to backfill every registry entry on day one.

Union specs flatten to a single merged schema. union_spec_to_arrow_schema calls
merge_model_variants() on the union's member types, producing one flattened struct per
union feature. Variant-only fields become nullable; shared fields use type promotion.
Nested union fields (like VehicleSelector inside segment variants) merge inline via the
same path in type_info_to_arrow.

Test plan

  • make check passes (all existing tests + new Arrow tests)
  • overture-codegen generate --format arrow --theme buildings prints Arrow schema
    text to stdout
  • overture-codegen generate --format arrow --theme buildings --output-dir /tmp/arrow-out writes .parquet files; pq.read_schema() recovers the schema
  • overture-codegen generate --format arrow --theme transportation --output-dir /tmp/arrow-out writes segment.parquet (union type) alongside connector.parquet
  • overture-codegen generate --format markdown still works identically (no
    regression from TypeMapping changes)
  • Without pyarrow installed, --format arrow produces a clear error message

Add `--format arrow` to the CLI, producing pyarrow Schemas from
Overture feature models.

The renderer walks ModelSpecs top-down: primitives map to Arrow
scalars via TypeRegistry, MODEL-kind fields expand recursively
into nested structs, enums and literals fall back to string,
and discriminated unions merge member structs into a single
flattened struct with variant-only fields nullable. Dict fields
emit map types. List-wrapped fields emit list types.

Field descriptions and schema-level model metadata (name,
description, constraint prose) are embedded in Arrow metadata
so downstream consumers can inspect documentation without the
source models.

pyarrow is an optional dependency (`arrow` extra) to avoid
adding weight for markdown-only users.
Seth Fitzsimmons (mojodna) changed the base branch from dev to codegen February 27, 2026 00:46
Seth Fitzsimmons (mojodna) changed the title Arrow Schema generation → Arrow Schema (Parquet file) generation Feb 27, 2026
Contributor

Nice, solid!

Differences between the Parquets this generates and our latest release are expected (bbox float/doubles, nullability, admin_level, re-ordering of fields) -- with the exception of a codegen issue affecting divisions (I left a comment in that PR).

```python
"""Build schema-level metadata dict, or None if empty."""
metadata: dict[bytes, bytes] = {}
if version:
    metadata[b"overture-schema.version"] = version.encode()
```


We're hoping to include schema version metadata in our release parquets, too -- Jennings Anderson (@jenningsanderson) was working on writing something like this in our data normalization task (process_data).

It'd be good if we could keep the naming consistent; @jenningsanderson, had you settled on something already?


Also, our published parquet files include some geoparquet spec info...

I guess I'm wondering whether we want:

  1. To validate our metadata against the schema (I would say no, since right now we're writing metadata post-schema validation)
  2. To copy our metadata from the schema. So version, geoparquet crs, etc. would live here, and our pipelines would copy that in during the normalization step.
  3. To just sort of do nothing and not aim for any pipeline/schema parquet metadata crossover.

I like option 2.
