feat: prototype collation support for strings #1318
Draft
laskoviymishka wants to merge 2 commits into
A working reference implementation of collation support, to drive the Iceberg collation spec discussion (Snowflake/Löser proposal, aligned with Delta's approach to collation statistics).

- collation/: parse the collation specifier grammar (locale, ci/cs, ai/as, trim, casing, "utf8" pseudo-locale) and compare/sort via golang.org/x/text/collate (CLDR/UCA, the Go ICU-equivalent).
- StringType carries an optional collation; it round-trips as a collation_spec sibling on NestedField, keeping the on-disk type name "string" so collation-unaware readers still read the column as a plain string.
- Delta-aligned collation bounds: store the original min/max VALUES (not ICU sort keys, which aren't stable across versions) tagged with a collation version, and prune only on an exact collation+version match.
- Persist them in a prototype data_file.collation_bounds Avro field (v3, experimental field IDs 9000-9006 pending an official reservation), with a full WriteManifest/ReadManifest round-trip.
- Version-gated, collator-based data-file pruning in the inclusive metrics evaluator; the strict evaluator and collated columns without valid bounds are conservatively kept (byte-order bounds must not prune a collated column).

Prototype scope and deferred items are documented in the collation package doc.
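For readers following the spec discussion, the version-gated pruning rule described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code in this PR: `CollationBounds`, `Comparer`, `MightContainEqual`, the `"en-ci"` collation name, and the `"v1"` version string are all hypothetical, and the toy `caseInsensitive` comparer stands in for a real CLDR/UCA collator such as the one from golang.org/x/text/collate.

```go
package main

import "strings"

// CollationBounds is a hypothetical stand-in for one collation_bounds entry:
// the original min/max values (not sort keys), tagged with the collation
// and the collation version they were computed under.
type CollationBounds struct {
	Collation string
	Version   string
	Lower     string
	Upper     string
}

// Comparer orders two strings under some collation, returning a value
// that is negative, zero, or positive.
type Comparer func(a, b string) int

// caseInsensitive is a toy comparer for illustration only; a real
// implementation would delegate to a CLDR/UCA collator.
func caseInsensitive(a, b string) int {
	return strings.Compare(strings.ToLower(a), strings.ToLower(b))
}

// MightContainEqual reports whether a data file may contain rows whose
// column equals value under the reader's collation. The file is pruned
// (false) only when the bounds were written under exactly the same
// collation and version and value falls outside [Lower, Upper].
func MightContainEqual(b *CollationBounds, collation, version string, cmp Comparer, value string) bool {
	if b == nil || b.Collation != collation || b.Version != version {
		// Missing or mismatched bounds cannot be trusted: keep the file.
		return true
	}
	return cmp(value, b.Lower) >= 0 && cmp(value, b.Upper) <= 0
}
```

With bounds `{"en-ci", "v1", "apple", "Mango"}`, a lookup for `"BANANA"` under `en-ci`/`v1` keeps the file, `"zebra"` prunes it, and the same `"zebra"` lookup under a different version keeps it again, because bounds from another collation version are treated as unusable.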
The gofmt formatter (run by golangci-lint in CI) requires the single-line String() methods on Float32Literal/Float64Literal to be column-aligned with their sibling methods. Align them.