[WIP] Add PyArrow Parquet validation to CLI#437
Open
Adam Lastowka (Rachmanin0xFF) wants to merge 23 commits into dev from
Conversation
Vehicle dimension selectors (height, length, weight, width) use float64 instead of float32 to match the double-precision values in the data platform. Level uses int32 instead of int16 for the same reason. Axle count stays uint8 since it's a discrete count.
Description
Adds two commands to the CLI:
- `parquet-schema`: Generates an empty Parquet file with a specific type's schema.
- `validate-schema`: Validates the schema of a Parquet file (or hive-partitioned dataset) by comparing it to a PyArrow schema generated from Pydantic.

Adds `--theme` and `--simple` options to the `list-types` command (this makes it easier to programmatically interpret its output).

Additionally, changes all `int8` and `int16` types to `int32`. Parquet's compression is good enough that there is no difference in physical (Parquet) file size between the types, and `int32`s are more universally digestible.

This command will be used for schema validation in our release process; you can see how here.
Examples
Round-trip generate + validate
Validation on release data
Arrow schema text output (uses Arrow schema's `.to_string()`)

Currently fails on our public data due to some precision mismatches + column nullability issues.
Reference
Testing
Brief description of the testing done for this change showing why you are confident it works as expected and does not introduce regressions. Provide sample output data where appropriate.
TODO.
Checklist
Checklist of tasks commonly-associated with schema pull requests. Please review the relevant checklists and ensure you do all the tasks that are required for the change you made.
If a counterexample uses property `A` but is not intended to test property `A`'s validity, and you made a schema change that invalidates property `A` in that counterexample, fix the counterexample to align it with your schema change.

Documentation website
Update the hyperlink below to put the pull request number in.
[Docs preview for this PR.](https://dfhx9f55j8eg5.cloudfront.net/pr/<PUT THE PR # HERE>)