Spec: Add collations for string types #16972
Adds a collation annotation on string fields and collation-aware bounds for collated columns:

- A string field may carry a provider-qualified collation (e.g. icu.en_US-ci) defining case-insensitive, accent-insensitive, or locale-aware comparison and ordering. Comparison-only; stored values are unchanged.
- data_file.collation_bounds (147) stores collation-aware lower/upper bounds as original values tagged with the collation and its implementation version. Byte-order lower/upper bounds are still written so collation-unaware readers stay correct, and must not be used to prune predicates on a collated column.
- Updates the schema JSON field attribute, the data_file table and notes, an Appendix E v3 change list, and the OpenAPI StructField.

Design rationale and differences from the original proposal are kept in a separate write-up; backed by a working iceberg-go reference implementation.
szehon-ho
left a comment
Thanks for working on this! A few comments here and inline:
- Nested string fields need an explicit story. The spec says a `string` field may carry a `collation`, and the JSON change adds `collation` to `StructField`. Does that intentionally exclude `list<string>`, `map<string, ...>`, and `map<..., string>`? If nested strings are in scope, the spec should define where collation is stored for list elements, map keys, and map values. If they are out of scope, that limitation should be explicit.
- Collation evolution can be better defined in terms of schema evolution. What happens when a column changes from one collation to another, or from byte-order to an ICU collation? Is this a compatible schema evolution, a type change, or disallowed? Existing files may contain bounds for older collations/versions, so readers should only use `collation_bounds` entries matching the collation resolved for the read schema and otherwise treat the file as possibly matching.
- Please clarify the scope beyond file pruning. If collation affects query semantics, then equality deletes, sort orders, partition transforms, and bucket/hash semantics may also need rules. If this change is only defining schema annotation and file-level pruning metadata, it would help to say that explicitly so engines do not infer broader SQL/equality semantics from the table-format metadata alone.
* Row Lineage tracking
* Binary deletion vectors
* Table encryption keys
* Collations for string types
The v3 spec is already formally adopted; should we put this in v4?
A field's collation is stored as a `collation` attribute on the field (see [Appendix C](#appendix-c-json-serialization)). The attribute is allowed only on `string` fields. If a field has no `collation` attribute, comparison defaults to UTF-8 byte order, which is the behavior of all prior versions.

A collation is identified by a provider-qualified name of the form `<provider>.<name>`, for example `icu.en_US-ci`. The provider names the library that defines the collation (`icu` for collations defined by the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/) over [CLDR](https://cldr.unicode.org/) locale data; other providers may define engine-specific collations such as case-folding variants). The name selects a locale and optional modifiers for case sensitivity (`ci`/`cs`), accent sensitivity (`ai`/`as`), trimming, and case folding. The reserved name `utf8` denotes UTF-8 byte-order comparison.
Is this missing a closing parenthesis (after "locale data")?
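To make the naming scheme in the quoted paragraph concrete, here is a minimal Python sketch of splitting a provider-qualified name into its parts. It rests on assumptions the spec text does not confirm: the `-` separator between locale and modifiers is inferred from the single example `icu.en_US-ci`, the reserved name is assumed to be written as a bare `utf8`, and `CollationName` and `parse_collation` are hypothetical names, not part of the spec or of any Iceberg library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CollationName:
    provider: Optional[str]     # e.g. "icu"; None for the reserved "utf8"
    locale: str                 # e.g. "en_US"
    modifiers: Tuple[str, ...]  # e.g. ("ci",) or ("ci", "ai")


def parse_collation(name: str) -> CollationName:
    """Split '<provider>.<name>' into parts (assumed grammar, see lead-in)."""
    if name == "utf8":
        # Assumption: the reserved byte-order name carries no provider prefix.
        return CollationName(None, "utf8", ())
    provider, sep, rest = name.partition(".")
    if not sep or not provider or not rest:
        raise ValueError(f"expected '<provider>.<name>', got {name!r}")
    # Assumption: modifiers follow the locale, separated by '-'.
    locale, *mods = rest.split("-")
    return CollationName(provider, locale, tuple(mods))
```

A real grammar would also need to handle locales that themselves contain hyphens, which this sketch does not attempt.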
The schema stores the collation **name without a version**, so any engine that supports the collation can read the table. UCA, DUCET, CLDR, and ICU collation orders are [not stable across versions](https://unicode.org/reports/tr10/#Non-Goals), so collation-aware metrics carry the implementation version they were produced under (see below) and a reader uses them only when it can produce the same order.
Nit: this describes too much about bounds, which are already covered below. Also, "can read the table" is a bit vague, since the table should be readable without stats pruning anyway, so I suggest removing it.
How about just a reference: "The schema stores the logical collation name without a version. Collation-aware file metrics store the implementation version used to compute their bounds (see below)."
Bounds store the **original values** that are the minimum and maximum under the collation order, not collation sort keys (sort keys are not stable across implementation versions). The same single-value serialization and truncation rules as `lower_bounds`/`upper_bounds` apply, except that when truncating an upper bound, the appended successor must be the next value **in collation order**. A column may have more than one entry, one per collation version, so a file can serve readers pinned to different versions during an upgrade.

**Writers** must continue to write the byte-order `lower_bounds`/`upper_bounds` for collated columns (so collation-unaware readers stay correct). A writer should additionally write `collation_bounds` when it supports the column's collation, tagging each entry with the collation and the exact implementation version used. A writer that does not support the collation or version must not write `collation_bounds` for that column.
In Iceberg, `lower_bounds` and `upper_bounds` are optional, so "must" is too strong here.
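The difference between the two kinds of bounds in the quoted text can be illustrated with a small Python sketch. It is only an approximation: `str.casefold` stands in for a real case-insensitive collation such as `icu.en_US-ci` (an actual writer would use the provider's collator), truncation is not handled, and the function names are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def byte_order_bounds(values: Iterable[str]) -> Tuple[str, str]:
    """Min/max under UTF-8 byte order (the existing lower/upper bounds)."""
    vals = list(values)
    return (min(vals, key=lambda s: s.encode("utf-8")),
            max(vals, key=lambda s: s.encode("utf-8")))


def collation_bounds(values: Iterable[str],
                     sort_key: Callable[[str], object]) -> Tuple[str, str]:
    """Original values that are min/max under the collation order.

    The sort key is used only to compare; the stored bound is the
    original string, never the key itself.
    """
    vals = list(values)
    return min(vals, key=sort_key), max(vals, key=sort_key)


def might_contain(probe: str, lo: str, hi: str,
                  sort_key: Callable[[str], object]) -> bool:
    """Range check for an equality predicate under the given ordering."""
    return sort_key(lo) <= sort_key(probe) <= sort_key(hi)


values = ["apple", "Banana", "cherry"]
ci = str.casefold  # stand-in for a case-insensitive collation (assumption)

byte_lo, byte_hi = byte_order_bounds(values)
coll_lo, coll_hi = collation_bounds(values, ci)

# A case-insensitive lookup for "APPLE" matches the row "apple".
# Byte-order bounds would wrongly exclude the file; collation bounds keep it.
wrongly_pruned = not might_contain("APPLE", byte_lo, byte_hi,
                                   lambda s: s.encode("utf-8"))
kept = might_contain("APPLE", coll_lo, coll_hi, ci)
```

Here the byte-order bounds are `"Banana"` and `"cherry"` because uppercase letters sort before lowercase ones in UTF-8, which is why they cannot be used to prune a case-insensitive predicate.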
**Readers** may use a `collation_bounds` entry to prune a file only when both the entry's `collation` and `version` match the collation the reader resolves for the column; otherwise the entry must be ignored. A reader must not use the byte-order `lower_bounds`/`upper_bounds` to prune comparison, equality, or prefix predicates on a collated column. When no usable collation bounds are available, the file must be scanned. A collation-unaware reader ignores the `collation` attribute and the `collation_bounds` field and reads the column as a UTF-8 byte-order string.
Suggestion: clarify which kinds of predicates require a collation match, and what collation-unaware readers may do:
Readers that evaluate string predicates using a field’s declared collation must not use byte-order `lower_bounds` or `upper_bounds` to prune predicates whose result depends on collation ordering or collation equality. This includes range comparisons, equality and inequality comparisons, prefix predicates, and any pattern predicate the reader evaluates using collation semantics. Such readers may use a `collation_bounds` entry only when the entry’s collation name matches the field’s resolved collation and the entry’s implementation version is known to produce the same ordering. If no matching `collation_bounds` entry is available, the reader must treat the file as possibly matching the predicate.
Collation-unaware readers ignore the field `collation` attribute and the `collation_bounds` field. Such readers continue to interpret string values and byte-order `lower_bounds` / `upper_bounds` using ordinary UTF-8 byte-order semantics.
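The reader-side rule suggested above could look roughly like the following Python sketch. Several things here are assumptions for illustration: the `lower` and `upper` attributes and the `CollationBoundsEntry` class are hypothetical (the PR text shown here only names the `collation` and `version` fields), the version string `"icu-74"` is made up, and exact version equality is used as a conservative stand-in for "known to produce the same ordering".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CollationBoundsEntry:
    collation: str  # field 151 in the PR
    version: str    # field 152 in the PR
    lower: str      # hypothetical attribute name for the lower bound
    upper: str      # hypothetical attribute name for the upper bound


def select_bounds(entries: Iterable[CollationBoundsEntry],
                  resolved_collation: str,
                  reader_version: str) -> Optional[CollationBoundsEntry]:
    """Return the entry usable by this reader, or None if there is none."""
    for entry in entries:
        if (entry.collation == resolved_collation
                and entry.version == reader_version):
            return entry
    return None


def file_might_match_eq(entries: Iterable[CollationBoundsEntry],
                        resolved_collation: str,
                        reader_version: str,
                        probe: str,
                        sort_key: Callable[[str], object]) -> bool:
    """Decide whether a file may contain rows equal to `probe`."""
    entry = select_bounds(entries, resolved_collation, reader_version)
    if entry is None:
        # No usable collation bounds: never fall back to byte-order
        # bounds, treat the file as possibly matching.
        return True
    return sort_key(entry.lower) <= sort_key(probe) <= sort_key(entry.upper)
```

The same selection step also covers the collation-evolution case raised earlier in the review: after a column's collation changes, entries written for the old collation simply fail the match and the file is scanned.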
| Field id, name | Type | Description |
|----------------|------|-------------|
| **`151 collation`** | `string` | Collation the bounds were produced for, e.g. `icu.en_US-ci` |
| **`152 version`** | `string` | Collation implementation version the bounds were selected under |
It may be worth giving an example.
This adds collation support for string types to the v3 spec, following the discussion in the dev-list thread. A string field can carry a collation (e.g. icu.en_US-ci) that defines case-insensitive, accent-insensitive, or locale-aware comparison and ordering, and a new data_file.collation_bounds field carries collation-aware min/max so collated columns can still be pruned.
The spec changes:
Two decisions I'd flag for discussion.
Backed by a working reference implementation in iceberg-go: apache/iceberg-go#1318.
Rationale and the differences from the original proposal are written up separately (proposal doc).
See also the dev-list thread. Feedback very welcome.