
[WIP] Add PyArrow Parquet validation to CLI#437

Open
Adam Lastowka (Rachmanin0xFF) wants to merge 23 commits into dev from generate-parquet

Conversation

Adam Lastowka (Rachmanin0xFF) commented Feb 6, 2026

Description

Adds two commands to the CLI:

  • parquet-schema: Generates an empty Parquet file with a specific type's schema.
  • validate-schema: Validates the schema of a Parquet file (or hive-partitioned dataset) by comparing its schema to a PyArrow schema generated from Pydantic.

Adds --theme and --simple options to the list-types command, making its output easier to interpret programmatically.

Additionally, changes all int8 and int16 types to int32. Parquet's compression is effective enough that there is no difference in physical (on-disk) size between the types, and int32 is more universally digestible.

This command will be used for schema validation in our release process; you can see how here.

Examples

Round-trip generate + validate

$ overture-schema parquet-schema --type building -o buildings.parquet && overture-schema validate-schema --theme buildings --type building buildings.parquet
Wrote Parquet schema to buildings.parquet
SUCCESS, schema of 'buildings.parquet' matches type 'building' (subset check)
$ echo $?
0

Validation on release data

$ overture-schema validate-schema --theme addresses --type address s3://overturemaps-us-west-2/release/2026-01-21.0/theme=addresses/type=address/
FAILURE, schema mismatch: 
's3://overturemaps-us-west-2/release/2026-01-21.0/theme=addresses/type=address/' vs type 'address'

Missing fields (2):
  - theme  (expected: string)
  - type  (expected: string)

Type mismatches (1):
  ~ sources.item.confidence
      expected: float
      actual:   double

$ echo $?
1

Arrow schema text output (uses Arrow schema's .to_string())

$ overture-schema parquet-schema --format text --type connector
id: string not null
  -- field metadata --
  description: 'A feature ID. This may be an ID associated with the Globa' + 104
bbox: struct<xmin: float, ymin: float, xmax: float, ymax: float>
  child 0, xmin: float
  child 1, ymin: float
  child 2, xmax: float
  child 3, ymax: float
  -- field metadata --
  description: 'An optional bounding box for the feature'
geometry: binary not null
  -- field metadata --
  description: 'Position of the connector'
theme: string not null
type: string not null
version: int32 not null
sources: list<item: struct<property: string not null, dataset: string not null, license: string, record_id: s (... 76 chars omitted)
  child 0, item: struct<property: string not null, dataset: string not null, license: string, record_id: string, upda (... 64 chars omitted)
      child 0, property: string not null
      -- field metadata --
      description: 'A JSON Pointer identifying the property (field) that ' + 603
      child 1, dataset: string not null
      -- field metadata --
      description: 'Name of the dataset where the source data can be foun' + 2
      child 2, license: string
      -- field metadata --
      description: 'Source data license name.

This should be a valid SPD' + 105
      child 3, record_id: string
      -- field metadata --
      description: 'Identifies the specific record within the source data' + 94
      child 4, update_time: string
      -- field metadata --
      description: 'Last update time of the source data record.'
      child 5, confidence: float
      -- field metadata --
      description: 'Confidence value from the source dataset.

This is a ' + 75
      child 6, between: list<item: double>
          child 0, item: double
      -- field metadata --
      description: 'The linearly-referenced sub-segment of the geometry, ' + 130
-- schema metadata --
overture_schema_version: '0.1.0'
model_name: 'Connector'
model_module: 'overture.schema.transportation.connector.models'

Validation currently fails on our public data due to some precision mismatches and column nullability issues.

Reference

  1. https://github.com/OvertureMaps/tf-data-platform/pull/2767
  2. https://github.com/OvertureMaps/tf-data-platform/issues/2369

Testing

Brief description of the testing done for this change showing why you are confident it works as expected and does not introduce regressions. Provide sample output data where appropriate.

TODO.

Checklist

Checklist of tasks commonly associated with schema pull requests. Please review the relevant checklists and ensure you do all the tasks that are required for the change you made.

  1. Add relevant examples.
  2. Add relevant counterexamples.
  3. Update any counterexamples that became obsolete. For example, if a counterexample uses property A but is not intended to test property A's validity, and you made a schema change that invalidates property A in that counterexample, fix the counterexample to align it with your schema change.
  4. Update in-schema documentation using plain English written in complete sentences, if an update is required.
  5. Update Docusaurus documentation, if an update is required.
  6. Review change with Overture technical writer to ensure any advanced documentation needs will be taken care of, unless the change is trivial and would not affect the documentation.

Documentation website

Update the hyperlink below to put the pull request number in.

[Docs preview for this PR.](https://dfhx9f55j8eg5.cloudfront.net/pr/<PUT THE PR # HERE>)

Vehicle dimension selectors (height, length, weight, width) use float64
instead of float32 to match the double-precision values in the data
platform. Level uses int32 instead of int16 for the same reason. Axle
count stays uint8 since it's a discrete count.
