feat: prototype collation support for strings #1318

Draft: laskoviymishka wants to merge 2 commits into apache:main from laskoviymishka:prototype/collation-support

@laskoviymishka

A working reference implementation of collation support, to drive the Iceberg collation spec discussion.

  • collation/: parse the collation specifier grammar (locale, ci/cs, ai/as, trim, casing, "utf8" pseudo-locale) and compare/sort via golang.org/x/text/collate (CLDR/UCA, the Go ICU-equivalent).
  • StringType carries an optional collation; it round-trips as a collation_spec sibling on NestedField, keeping the on-disk type name "string" so collation-unaware readers still read the column as a plain string.
  • Delta-aligned collation bounds: store the original min/max VALUES (not ICU sort keys, which aren't stable across versions) tagged with a collation version, and prune only on an exact collation+version match.
  • Persist them in a prototype data_file.collation_bounds Avro field (v3, experimental field IDs 9000-9006 pending an official reservation), with a full WriteManifest/ReadManifest round-trip.
  • Version-gated, collator-based data-file pruning in the inclusive metrics evaluator; the strict evaluator and collated columns without valid bounds are conservatively kept (byte-order bounds must not prune a collated column).
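The version-gated pruning rule in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `CollationBounds` struct, the `MightContain` function, the `"en-ci"` collation name, and the `"v1"` version string are all hypothetical, and a lowercase-then-compare function stands in for a real collator such as one built with golang.org/x/text/collate.

```go
package main

import (
	"fmt"
	"strings"
)

// CollationBounds is a hypothetical in-memory shape for per-file bounds:
// the original min/max values, tagged with the collation and the collation
// version that were in effect when the bounds were computed.
type CollationBounds struct {
	Collation string // illustrative name, e.g. "en-ci"
	Version   string // illustrative version tag
	Min, Max  string // original values, not sort keys
}

// Compare stands in for a real collator's comparison function.
type Compare func(a, b string) int

// caseInsensitive is a toy stand-in for a case-insensitive collation.
func caseInsensitive(a, b string) int {
	return strings.Compare(strings.ToLower(a), strings.ToLower(b))
}

// MightContain reports whether a data file could hold a row equal to v.
// It returns true (keep the file) whenever pruning is not provably safe:
// missing bounds, a different collation, or a different collation version.
func MightContain(b *CollationBounds, collation, version, v string, cmp Compare) bool {
	if b == nil || b.Collation != collation || b.Version != version {
		return true // conservative: never prune without an exact match
	}
	return cmp(v, b.Min) >= 0 && cmp(v, b.Max) <= 0
}

func main() {
	b := &CollationBounds{Collation: "en-ci", Version: "v1", Min: "apple", Max: "Mango"}

	fmt.Println(MightContain(b, "en-ci", "v1", "BANANA", caseInsensitive))   // true: inside the collated range
	fmt.Println(MightContain(b, "en-ci", "v1", "zebra", caseInsensitive))    // false: safely pruned
	fmt.Println(MightContain(b, "en-ci", "v2", "zebra", caseInsensitive))    // true: version mismatch, keep
	fmt.Println(MightContain(nil, "en-ci", "v1", "zebra", caseInsensitive)) // true: no bounds, keep
}
```

Note that "BANANA" sorts before "apple" in byte order, so byte-order bounds would wrongly prune this file for a case-insensitive column; that is the failure mode the conservative rule avoids.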

Prototype scope and deferred items are documented in the collation package doc.
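To illustrate why keeping the on-disk type name "string" helps older readers, here is a small sketch using only encoding/json. The struct names and the field JSON are assumptions made for illustration; only the `collation_spec` key comes from the description above, and the value `"en-ci"` is a placeholder rather than a confirmed specifier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// legacyField models a collation-unaware reader: it knows nothing about
// collation_spec, and encoding/json silently ignores the unknown key.
type legacyField struct {
	ID       int    `json:"id"`
	Name     string `json:"name"`
	Required bool   `json:"required"`
	Type     string `json:"type"`
}

// collatedField models a collation-aware reader that also picks up the
// sibling collation_spec key next to the unchanged "type".
type collatedField struct {
	ID            int    `json:"id"`
	Name          string `json:"name"`
	Required      bool   `json:"required"`
	Type          string `json:"type"`
	CollationSpec string `json:"collation_spec,omitempty"`
}

// fieldJSON is a hypothetical serialized field; the specifier is a placeholder.
const fieldJSON = `{"id":1,"name":"city","required":false,"type":"string","collation_spec":"en-ci"}`

func decodeLegacy(data string) (legacyField, error) {
	var f legacyField
	err := json.Unmarshal([]byte(data), &f)
	return f, err
}

func decodeCollated(data string) (collatedField, error) {
	var f collatedField
	err := json.Unmarshal([]byte(data), &f)
	return f, err
}

func main() {
	old, err := decodeLegacy(fieldJSON)
	if err != nil {
		panic(err)
	}
	aware, err := decodeCollated(fieldJSON)
	if err != nil {
		panic(err)
	}
	fmt.Println(old.Type)            // string
	fmt.Println(aware.Type)          // string
	fmt.Println(aware.CollationSpec) // en-ci
}
```

Both readers see the column as type "string"; only the collation-aware one sees the extra key.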

A working reference implementation of collation support, to drive the Iceberg
collation spec discussion (Snowflake/Löser proposal, aligned with Delta's
approach to collation statistics).

The gofmt formatter (run by golangci-lint in CI) requires the single-line
String() methods on Float32Literal/Float64Literal to be column-aligned with
their sibling methods. Align them.