
Spec: Add collations for string types#16972

Draft
laskoviymishka wants to merge 1 commit into
apache:mainfrom
laskoviymishka:spec-v3-collations

Conversation

@laskoviymishka

Contributor

This adds collation support for string types to the v3 spec, following the discussion in the dev-list thread. A string field can carry a collation (e.g. icu.en_US-ci) that defines case-insensitive, accent-insensitive, or locale-aware comparison and ordering, and a new data_file.collation_bounds field carries collation-aware min/max so collated columns can still be pruned.

The spec changes:

  • A collation attribute on string fields (provider-qualified, e.g. icu.en_US-ci; utf8 means byte order), stored unversioned in the schema so any compatible engine can read.
  • A new data_file.collation_bounds field (id 147): per column, a list of collation-aware lower/upper bounds stored as original values and tagged with the collation and the implementation version they were selected under.
  • Reader/writer rules: byte-order lower_bounds/upper_bounds are still written for collated columns so collation-unaware engines stay correct, but must not be used to prune predicates on a collated column; a collation_bounds entry may be used only on an exact collation + version match.

Two decisions I'd flag for discussion.

  • First, bounds store original values rather than ICU sort keys: sort keys aren't stable across UCA/CLDR/ICU versions, so per-file versioning plus an exact-match read gate degrades gracefully where a pinned global version would break.
  • Second, I put collation_bounds on data_file as a standalone v3 field, but it could instead live in the v4 content_stats typed-stats struct - worth deciding which.
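To illustrate the first decision and the reader/writer rules above, here is a minimal Python sketch. It is not taken from the iceberg-go reference implementation. It uses `str.casefold` as a stand-in for a real case-insensitive collator, which is an assumption made purely for illustration, and `collation_key`, `byte_order_bounds`, and `collation_aware_bounds` are hypothetical names.

```python
from typing import Sequence, Tuple


def collation_key(value: str) -> str:
    # Stand-in for a real collator (assumption): case-insensitive ordering
    # via Unicode case folding. A real ICU collation would order differently.
    return value.casefold()


def byte_order_bounds(values: Sequence[str]) -> Tuple[str, str]:
    # Min and max under plain UTF-8 byte order.
    def utf8(v: str) -> bytes:
        return v.encode("utf-8")

    return min(values, key=utf8), max(values, key=utf8)


def collation_aware_bounds(values: Sequence[str]) -> Tuple[str, str]:
    # Min and max under the collation order. The ORIGINAL values are
    # returned, not the sort keys, so no sort key is ever stored.
    return min(values, key=collation_key), max(values, key=collation_key)


values = ["banana", "Apple", "cherry", "apple", "Zebra"]
byte_lower, byte_upper = byte_order_bounds(values)
coll_lower, coll_upper = collation_aware_bounds(values)
```

With these values the byte-order bounds are `"Apple"` and `"cherry"`, while the collation-aware bounds are `"Apple"` and `"Zebra"`. A case-insensitive predicate equal to `"zebra"` matches the stored `"Zebra"`, yet `"zebra"` sorts after `"cherry"` in byte order, so pruning on the byte-order bounds would wrongly skip the file.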

Backed by a working reference implementation in iceberg-go: apache/iceberg-go#1318.

Rationale and the differences from the original proposal are written up separately (proposal doc).

See also the dev-list thread. Feedback very welcome.

Adds a collation annotation on string fields and collation-aware bounds for
collated columns:

- A string field may carry a provider-qualified collation (e.g. icu.en_US-ci)
  defining case-insensitive, accent-insensitive, or locale-aware comparison and
  ordering. Comparison-only; stored values are unchanged.
- data_file.collation_bounds (147) stores collation-aware lower/upper bounds as
  original values tagged with the collation and its implementation version.
  Byte-order lower/upper bounds are still written so collation-unaware readers
  stay correct, and must not be used to prune predicates on a collated column.
- Updates the schema JSON field attribute, the data_file table and notes, an
  Appendix E v3 change list, and the OpenAPI StructField.

Design rationale and differences from the original proposal are kept in a
separate write-up; backed by a working iceberg-go reference implementation.

@szehon-ho szehon-ho left a comment

Member


Thanks for working on this! A few comments here and inline:

  1. Nested string fields need an explicit story. The spec says a string field may carry a collation, and the JSON change adds collation to StructField. Does that intentionally exclude list<string>, map<string, ...>, and map<..., string>? If nested strings are in scope, the spec should define where collation is stored for list elements, map keys, and map values. If they are out of scope, that limitation should be explicit.

  2. Collation evolution can be better defined in terms of schema evolution. What happens when a column changes from one collation to another, or from byte-order to an ICU collation? Is this a compatible schema evolution, a type change, or disallowed? Existing files may contain bounds for older collations/versions, so readers should only use collation_bounds entries matching the collation resolved for the read schema and otherwise treat the file as possibly matching.

  3. Please clarify the scope beyond file pruning. If collation affects query semantics, then equality deletes, sort orders, partition transforms, and bucket/hash semantics may also need rules. If this change is only defining schema annotation and file-level pruning metadata, it would help to say that explicitly so engines do not infer broader SQL/equality semantics from the table-format metadata alone.

Comment thread format/spec.md
* Row Lineage tracking
* Binary deletion vectors
* Table encryption keys
* Collations for string types

Member


The V3 spec is already formally adopted; should we put this in v4?

Comment thread format/spec.md

A field's collation is stored as a `collation` attribute on the field (see [Appendix C](#appendix-c-json-serialization)). The attribute is allowed only on `string` fields. If a field has no `collation` attribute, comparison defaults to UTF-8 byte order, which is the behavior of all prior versions.

A collation is identified by a provider-qualified name of the form `<provider>.<name>`, for example `icu.en_US-ci`. The provider names the library that defines the collation (`icu` for collations defined by the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/) over [CLDR](https://cldr.unicode.org/) locale data; other providers may define engine-specific collations such as case-folding variants). The name selects a locale and optional modifiers for case sensitivity (`ci`/`cs`), accent sensitivity (`ai`/`as`), trimming, and case folding. The reserved name `utf8` denotes UTF-8 byte-order comparison.
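The `<provider>.<name>` form described above could be split along these lines. This is a sketch only: `CollationName` and `parse_collation` are hypothetical names, and splitting at the first dot is an assumption, since the quoted text does not say whether a name may itself contain dots. The modifier grammar (`ci`/`cs`, `ai`/`as`, and so on) is not parsed because it is not fully specified here.

```python
from dataclasses import dataclass

UTF8 = "utf8"  # reserved name from the spec text: UTF-8 byte-order comparison


@dataclass(frozen=True)
class CollationName:
    provider: str  # e.g. "icu"; empty for the reserved "utf8" name
    name: str      # e.g. "en_US-ci"


def parse_collation(value: str) -> CollationName:
    """Split '<provider>.<name>' at the first dot (assumption)."""
    if value == UTF8:
        return CollationName(provider="", name=UTF8)
    provider, sep, name = value.partition(".")
    if not sep or not provider or not name:
        raise ValueError(
            f"collation must look like '<provider>.<name>': {value!r}"
        )
    return CollationName(provider=provider, name=name)
```

For example, `parse_collation("icu.en_US-ci")` yields provider `icu` and name `en_US-ci`, while a bare `en_US` is rejected.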

Member


Is it missing an end paren (after "locale data")?

Comment thread format/spec.md

A collation is identified by a provider-qualified name of the form `<provider>.<name>`, for example `icu.en_US-ci`. The provider names the library that defines the collation (`icu` for collations defined by the [Unicode Collation Algorithm](https://unicode.org/reports/tr10/) over [CLDR](https://cldr.unicode.org/) locale data; other providers may define engine-specific collations such as case-folding variants). The name selects a locale and optional modifiers for case sensitivity (`ci`/`cs`), accent sensitivity (`ai`/`as`), trimming, and case folding. The reserved name `utf8` denotes UTF-8 byte-order comparison.

The schema stores the collation **name without a version**, so any engine that supports the collation can read the table. UCA, DUCET, CLDR, and ICU collation orders are [not stable across versions](https://unicode.org/reports/tr10/#Non-Goals), so collation-aware metrics carry the implementation version they were produced under (see below) and a reader uses them only when it can produce the same order.

@szehon-ho szehon-ho Jun 26, 2026

Member


nit: this describes too much about bounds, which is already covered below. Also, 'can read the table' is a bit vague, since the table should be readable even without stats pruning, so I suggest removing it.

How about just a reference: "The schema stores the logical collation name without a version. Collation-aware file metrics store the implementation version used to compute their bounds (see below)."

Comment thread format/spec.md

Bounds store the **original values** that are the minimum and maximum under the collation order, not collation sort keys (sort keys are not stable across implementation versions). The same single-value serialization and truncation rules as `lower_bounds`/`upper_bounds` apply, except that when truncating an upper bound, the appended successor must be the next value **in collation order**. A column may have more than one entry, one per collation version, so a file can serve readers pinned to different versions during an upgrade.

**Writers** must continue to write the byte-order `lower_bounds`/`upper_bounds` for collated columns (so collation-unaware readers stay correct). A writer should additionally write `collation_bounds` when it supports the column's collation, tagging each entry with the collation and the exact implementation version used. A writer that does not support the collation or version must not write `collation_bounds` for that column.

Member


In Iceberg, lower_bounds and upper_bounds are optional, so 'must' is too strong here.

Comment thread format/spec.md

**Writers** must continue to write the byte-order `lower_bounds`/`upper_bounds` for collated columns (so collation-unaware readers stay correct). A writer should additionally write `collation_bounds` when it supports the column's collation, tagging each entry with the collation and the exact implementation version used. A writer that does not support the collation or version must not write `collation_bounds` for that column.

**Readers** may use a `collation_bounds` entry to prune a file only when both the entry's `collation` and `version` match the collation the reader resolves for the column; otherwise the entry must be ignored. A reader must not use the byte-order `lower_bounds`/`upper_bounds` to prune comparison, equality, or prefix predicates on a collated column. When no usable collation bounds are available, the file must be scanned. A collation-unaware reader ignores the `collation` attribute and the `collation_bounds` field and reads the column as a UTF-8 byte-order string.
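The reader rule above can be sketched as a small gate in Python. This is an illustration under stated assumptions, not the spec's or any engine's implementation: the `lower` and `upper` field names, the `"v1"`/`"v2"` version strings, and the helper names are hypothetical, and the `key` argument stands in for a real collator's sort-key function.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CollationBounds:
    collation: str  # field 151: collation the bounds were produced for
    version: str    # field 152: implementation version they were selected under
    lower: str      # hypothetical field name: original value, minimum in collation order
    upper: str      # hypothetical field name: original value, maximum in collation order


def find_usable_bounds(
    entries: Sequence[CollationBounds], collation: str, version: str
) -> Optional[CollationBounds]:
    # An entry is usable only on an exact collation AND version match.
    for entry in entries:
        if entry.collation == collation and entry.version == version:
            return entry
    return None


def may_contain_equal(
    entries: Sequence[CollationBounds],
    collation: str,
    version: str,
    key: Callable[[str], str],
    literal: str,
) -> bool:
    """Return False only when the file can safely be skipped for `col = literal`."""
    bounds = find_usable_bounds(entries, collation, version)
    if bounds is None:
        return True  # no usable collation bounds: the file must be scanned
    k = key(literal)
    return key(bounds.lower) <= k <= key(bounds.upper)


# Placeholder entry; "v1" is not a real ICU version string.
entries = [CollationBounds("icu.en_US-ci", "v1", "Apple", "Zebra")]
```

Using `str.casefold` as the stand-in key, a reader on `icu.en_US-ci` at `"v1"` can skip the file for the literal `"zulu"` but not for `"MANGO"`. A reader on `"v2"`, or on a different collation, finds no usable entry and has to scan.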

Member


Suggestion: clarify which kinds of predicates require a collation match, and clarify what collation-unaware readers may do:

Readers that evaluate string predicates using a field’s declared collation must not use byte-order lower_bounds or upper_bounds to prune predicates whose result depends on collation ordering or collation equality. This includes range comparisons, equality and inequality comparisons, prefix predicates, and any pattern predicate the reader evaluates using collation semantics. Such readers may use a collation_bounds entry only when the entry’s collation name matches the field’s resolved collation and the entry’s implementation version is known to produce the same ordering. If no matching collation_bounds entry is available, the reader must treat the file as possibly matching the predicate.

Collation-unaware readers ignore the field `collation` attribute and the `collation_bounds` field. Such readers continue to interpret string values and byte-order `lower_bounds` / `upper_bounds` using ordinary UTF-8 byte-order semantics.

Comment thread format/spec.md
| Field id, name | Type | Description |
|----------------|------|-------------|
| **`151 collation`** | `string` | Collation the bounds were produced for, e.g. `icu.en_US-ci` |
| **`152 version`** | `string` | Collation implementation version the bounds were selected under |

Member


It may be worth giving an example.


Labels

OPENAPI Specification Issues that may introduce spec changes.


2 participants