Spec: Add collations for string types by laskoviymishka · Pull Request #16972 · apache/iceberg

laskoviymishka · 2026-06-26T08:23:28Z

This adds collation support for string types to the v3 spec, following the discussion in the dev-list thread. A string field can carry a collation (e.g. icu.en_US-ci) that defines case-insensitive, accent-insensitive, or locale-aware comparison and ordering, and a new data_file.collation_bounds field carries collation-aware min/max values so that files can still be pruned on collated columns.

The spec changes:

- A collation attribute on string fields (provider-qualified, e.g. icu.en_US-ci; utf8 means byte order), stored unversioned in the schema so any compatible engine can read.
- A new data_file.collation_bounds field (id 147): per column, a list of collation-aware lower/upper bounds stored as original values and tagged with the collation and the implementation version they were selected under.
- Reader/writer rules: byte-order lower_bounds/upper_bounds are still written for collated columns so collation-unaware engines stay correct, but must not be used to prune predicates on a collated column; a collation_bounds entry may be used only on an exact collation + version match.
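To illustrate why byte-order bounds must not be used to prune a collated column, here is a minimal Python sketch. It uses `str.casefold` as a stand-in for a case-insensitive collation; that stand-in is an assumption made for illustration and is not how `icu.en_US-ci` is actually implemented. The helper name `may_match_eq` is hypothetical.

```python
# Stand-in for a case-insensitive collation (assumption): compare casefolded strings.
# A real icu.en_US-ci collator would come from an ICU binding.
def ci_key(value: str) -> str:
    return value.casefold()


def byte_key(value: str) -> bytes:
    return value.encode("utf-8")


values = ["apple", "Banana", "cherry"]

# Byte-order bounds, as a writer would record in lower_bounds/upper_bounds.
byte_lower = min(values, key=byte_key)  # "Banana": 'B' (0x42) sorts before 'a' (0x61)
byte_upper = max(values, key=byte_key)  # "cherry"

# Collation-aware bounds: original values selected under the collation order.
coll_lower = min(values, key=ci_key)    # "apple"
coll_upper = max(values, key=ci_key)    # "cherry"


def may_match_eq(literal: str, lower: str, upper: str, key) -> bool:
    """Return True if a file with these bounds may hold a value equal to literal."""
    return key(lower) <= key(literal) <= key(upper)


literal = "APPLE"  # equal to the stored "apple" under the case-insensitive stand-in

# Byte-order bounds wrongly exclude the file: b"APPLE" < b"Banana".
pruned_by_bytes = not may_match_eq(literal, byte_lower, byte_upper, byte_key)
# Collation bounds keep the file, so the matching row is still found.
kept_by_collation = may_match_eq(literal, coll_lower, coll_upper, ci_key)

print(pruned_by_bytes, kept_by_collation)
```

Here the byte-order check would drop a file that contains a matching row, which is the failure the reader rule guards against.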

Two decisions I'd flag for discussion.

First, bounds store original values rather than ICU sort keys: sort keys aren't stable across UCA/CLDR/ICU versions, so per-file versioning plus an exact-match read gate degrades gracefully where a pinned global version would break.
Second, I put collation_bounds on data_file as a standalone v3 field, but it could instead live in the v4 content_stats typed-stats struct - worth deciding which.

Backed by a working reference implementation in iceberg-go: apache/iceberg-go#1318.

Rationale and the differences from the original proposal are written up separately (proposal doc).

See also the dev-list thread. Feedback very welcome.

Adds a collation annotation on string fields and collation-aware bounds for collated columns:

- A string field may carry a provider-qualified collation (e.g. icu.en_US-ci) defining case-insensitive, accent-insensitive, or locale-aware comparison and ordering. Comparison-only; stored values are unchanged.
- data_file.collation_bounds (147) stores collation-aware lower/upper bounds as original values tagged with the collation and its implementation version. Byte-order lower/upper bounds are still written so collation-unaware readers stay correct, and must not be used to prune predicates on a collated column.
- Updates the schema JSON field attribute, the data_file table and notes, an Appendix E v3 change list, and the OpenAPI StructField.

Design rationale and differences from the original proposal are kept in a separate write-up; backed by a working iceberg-go reference implementation.

szehon-ho

Thanks for working on this! A few comments here and inline:

- Nested string fields need an explicit story. The spec says a string field may carry a collation, and the JSON change adds collation to StructField. Does that intentionally exclude list<string>, map<string, ...>, and map<..., string>? If nested strings are in scope, the spec should define where collation is stored for list elements, map keys, and map values. If they are out of scope, that limitation should be explicit.
- Collation evolution can be better defined in terms of schema evolution. What happens when a column changes from one collation to another, or from byte-order to an ICU collation? Is this a compatible schema evolution, a type change, or disallowed? Existing files may contain bounds for older collations/versions, so readers should only use collation_bounds entries matching the collation resolved for the read schema and otherwise treat the file as possibly matching.
- Please clarify the scope beyond file pruning. If collation affects query semantics, then equality deletes, sort orders, partition transforms, and bucket/hash semantics may also need rules. If this change is only defining schema annotation and file-level pruning metadata, it would help to say that explicitly so engines do not infer broader SQL/equality semantics from the table-format metadata alone.

szehon-ho · 2026-06-26T22:48:16Z

 * Row Lineage tracking
 * Binary deletion vectors
 * Table encryption keys
+* Collations for string types


The v3 spec is already formally adopted; should we put this in v4?

szehon-ho · 2026-06-26T23:08:39Z

+
+A field's collation is stored as a `collation` attribute on the field (see [Appendix C](#appendix-c-json-serialization)). The attribute is allowed only on `string` fields. If a field has no `collation` attribute, comparison defaults to UTF-8 byte order, which is the behavior of all prior versions.
+
+A collation is identified by a provider-qualified name of the form `<provider>.<name>`, for example `icu.en_US-ci`. The provider names the library that defines the collation (`icu` for collations defined by the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/) over [CLDR](https://cldr.unicode.org/) locale data; other providers may define engine-specific collations such as case-folding variants). The name selects a locale and optional modifiers for case sensitivity (`ci`/`cs`), accent sensitivity (`ai`/`as`), trimming, and case folding. The reserved name `utf8` denotes UTF-8 byte-order comparison.


Is it missing a closing parenthesis (after "locale data")?
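For reference, the `<provider>.<name>` form quoted above can be split with a short Python sketch. The names `CollationName` and `parse_collation` are hypothetical, and the modifier suffixes (`ci`, `ai`, and so on) are deliberately left unparsed because the quoted text does not give their exact grammar.

```python
from typing import NamedTuple, Optional


class CollationName(NamedTuple):
    provider: Optional[str]  # None for the reserved name "utf8"
    name: str                # locale plus optional modifiers, kept unparsed


def parse_collation(raw: str) -> CollationName:
    """Split a provider-qualified collation name at the first dot."""
    if raw == "utf8":
        # Reserved name: UTF-8 byte-order comparison, no provider.
        return CollationName(None, "utf8")
    provider, sep, name = raw.partition(".")
    if not sep or not provider or not name:
        raise ValueError(f"expected <provider>.<name>, got {raw!r}")
    return CollationName(provider, name)


print(parse_collation("icu.en_US-ci"))
```

Splitting at the first dot only is a design assumption; it keeps any later dots inside the name part.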

szehon-ho · 2026-06-26T23:20:50Z

+
+A collation is identified by a provider-qualified name of the form `<provider>.<name>`, for example `icu.en_US-ci`. The provider names the library that defines the collation (`icu` for collations defined by the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/) over [CLDR](https://cldr.unicode.org/) locale data; other providers may define engine-specific collations such as case-folding variants). The name selects a locale and optional modifiers for case sensitivity (`ci`/`cs`), accent sensitivity (`ai`/`as`), trimming, and case folding. The reserved name `utf8` denotes UTF-8 byte-order comparison.
+
+The schema stores the collation **name without a version**, so any engine that supports the collation can read the table. UCA, DUCET, CLDR, and ICU collation orders are [not stable across versions](https://unicode.org/reports/tr10/#Non-Goals), so collation-aware metrics carry the implementation version they were produced under (see below) and a reader uses them only when it can produce the same order.


nit: This describes too much about bounds, which is already covered below. Also, 'can read the table' is a bit vague, since the table can still be read without stats pruning, so I suggest removing it.

How about just a reference: "The schema stores the logical collation name without a version. Collation-aware file metrics store the implementation version used to compute their bounds (see below)."

szehon-ho · 2026-06-26T23:24:18Z

+
+Bounds store the **original values** that are the minimum and maximum under the collation order, not collation sort keys (sort keys are not stable across implementation versions). The same single-value serialization and truncation rules as `lower_bounds`/`upper_bounds` apply, except that when truncating an upper bound, the appended successor must be the next value **in collation order**. A column may have more than one entry, one per collation version, so a file can serve readers pinned to different versions during an upgrade.
+
+**Writers** must continue to write the byte-order `lower_bounds`/`upper_bounds` for collated columns (so collation-unaware readers stay correct). A writer should additionally write `collation_bounds` when it supports the column's collation, tagging each entry with the collation and the exact implementation version used. A writer that does not support the collation or version must not write `collation_bounds` for that column.


In Iceberg, lower_bounds and upper_bounds are optional, so 'must' is too strong here.
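As a rough illustration of "bounds store the original values ... not collation sort keys", a writer could select bounds as in the Python sketch below. The `CollationBound` class, its `lower`/`upper` field names, and the version string are hypothetical, and `str.casefold` again stands in for a real collation sort key. Truncation is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CollationBound:
    collation: str  # e.g. "icu.en_US-ci"
    version: str    # implementation version the bounds were selected under
    lower: str      # original value that is smallest in collation order
    upper: str      # original value that is largest in collation order


def collation_bound(
    values: Iterable[Optional[str]],
    collation: str,
    version: str,
    sort_key: Callable[[str], object],
) -> Optional[CollationBound]:
    """Pick min/max under sort_key but keep the original strings, not the keys."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # nothing to record for an all-null or empty column
    return CollationBound(
        collation,
        version,
        min(non_null, key=sort_key),
        max(non_null, key=sort_key),
    )


bound = collation_bound(["b", "A", None, "c"], "icu.en_US-ci", "example-version", str.casefold)
print(bound)
```

Because the stored bounds are original values, a reader with a matching implementation can recompute sort keys from them itself.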

szehon-ho · 2026-06-26T23:26:01Z

+
+**Writers** must continue to write the byte-order `lower_bounds`/`upper_bounds` for collated columns (so collation-unaware readers stay correct). A writer should additionally write `collation_bounds` when it supports the column's collation, tagging each entry with the collation and the exact implementation version used. A writer that does not support the collation or version must not write `collation_bounds` for that column.
+
+**Readers** may use a `collation_bounds` entry to prune a file only when both the entry's `collation` and `version` match the collation the reader resolves for the column; otherwise the entry must be ignored. A reader must not use the byte-order `lower_bounds`/`upper_bounds` to prune comparison, equality, or prefix predicates on a collated column. When no usable collation bounds are available, the file must be scanned. A collation-unaware reader ignores the `collation` attribute and the `collation_bounds` field and reads the column as a UTF-8 byte-order string.


Suggestion: clarify what kinds of predicates require a collation match, and clarify what collation-unaware readers may do:

Readers that evaluate string predicates using a field’s declared collation must not use byte-order lower_bounds or upper_bounds to prune predicates whose result depends on collation ordering or collation equality. This includes range comparisons, equality and inequality comparisons, prefix predicates, and any pattern predicate the reader evaluates using collation semantics. Such readers may use a collation_bounds entry only when the entry’s collation name matches the field’s resolved collation and the entry’s implementation version is known to produce the same ordering. If no matching collation_bounds entry is available, the reader must treat the file as possibly matching the predicate. Collation-unaware readers ignore the field `collation` attribute and the `collation_bounds` field. Such readers continue to interpret string values and byte-order `lower_bounds` / `upper_bounds` using ordinary UTF-8 byte-order semantics.
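The reader rule can be sketched in Python as follows. This is a simplified illustration: entries are plain dicts with hypothetical `lower`/`upper` keys, "known to produce the same ordering" is reduced to exact version equality, and `str.casefold` stands in for a real collation sort key.

```python
from typing import Callable, Optional, Sequence


def usable_entry(
    entries: Sequence[dict], resolved_collation: str, reader_version: str
) -> Optional[dict]:
    """Return the first entry whose collation and version both match, else None."""
    for entry in entries:
        if entry["collation"] == resolved_collation and entry["version"] == reader_version:
            return entry
    return None


def file_may_match_eq(
    entries: Sequence[dict],
    resolved_collation: str,
    reader_version: str,
    literal: str,
    sort_key: Callable[[str], object],
) -> bool:
    """Decide whether a file must be kept for an equality predicate on a collated column."""
    entry = usable_entry(entries, resolved_collation, reader_version)
    if entry is None:
        # No matching collation bounds: the file is treated as possibly matching.
        return True
    return sort_key(entry["lower"]) <= sort_key(literal) <= sort_key(entry["upper"])


entries = [
    {"collation": "icu.en_US-ci", "version": "v-old", "lower": "apple", "upper": "cherry"},
]

# Matching collation and version: bounds are used, "zebra" falls outside them.
print(file_may_match_eq(entries, "icu.en_US-ci", "v-old", "zebra", str.casefold))
# Version mismatch: the entry is ignored and the file is kept.
print(file_may_match_eq(entries, "icu.en_US-ci", "v-new", "zebra", str.casefold))
```

Note that the byte-order `lower_bounds`/`upper_bounds` never appear in this path; on a mismatch the sketch falls back to scanning rather than to byte-order pruning.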

szehon-ho · 2026-06-26T23:32:43Z

+| Field id, name | Type | Description |
+|----------------|------|-------------|
+| **`151 collation`** | `string` | Collation the bounds were produced for, e.g. `icu.en_US-ci` |
+| **`152 version`** | `string` | Collation implementation version the bounds were selected under |


It may be worth giving an example.

github-actions bot added the Specification (issues that may introduce spec changes) and OPENAPI labels on Jun 26, 2026

laskoviymishka mentioned this pull request Jun 26, 2026

Collation prototype: field annotation, schema round-trip, comparator #16974

Draft

laskoviymishka wants to merge 1 commit into apache:main from laskoviymishka:spec-v3-collations
