311 changes: 311 additions & 0 deletions format/index.md
@@ -0,0 +1,311 @@
---
title: "Index Spec"
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->

# Iceberg Index Specification

## Background and Motivation

Indexes enable query engines to locate relevant rows without scanning entire datasets.
They can accelerate point lookups, range predicates, and other retrieval patterns
while preserving Iceberg's table format, snapshot isolation, and interoperability.

Indexes are optional. Engines may choose to create, maintain, consume, or ignore them.

## Goals

Member

Not sure we need the goals section here?

Contributor

I think the spec normally has a Goals section (see udf-spec and view-spec). The overlap comes from the This specification defines: list in Background. I think Background should be the motivation and what an index is, and Goals should hold that list. So I suggest keeping Goals and removing the list from Background.


- Define a portable metadata format for indexes
- Provide a common storage architecture for index data
- Allow indexes to evolve independently from table metadata as catalog-managed objects
- Enable index sharing across engines
- Provide a framework for defining new index types and transform functions

## Overview

Indexes are stored as a collection of files with some Iceberg-table-like semantics. At a high level, an index consists of a tracking file (similar to a root manifest file) that lists a defined set of leaf files (similar to data files). Leaf files store an ordered set of rows, each containing at least a key, the path of an Iceberg table data file, and the position within that file of the row where that key is stored. The organization of leaf files is defined by an index transform, which varies with the type of index. This structure is recorded in an index metadata.json file that contains a set of snapshots, each of which points to a single tracking file that maps the complete state of the Iceberg table at a given table snapshot.

Like Iceberg tables, views, and functions:

- Metadata and data files are immutable
- Updates create new metadata files
- Catalogs perform atomic metadata swaps

Index data is stored separately from metadata.

Each index snapshot references a tracking file which describes the leaf files belonging to the snapshot.

```text
Index Metadata
  |
  +-- Index Snapshot
        |
        +-- Tracking File
              |
              +-- Leaf Data Files
```

Transform functions derive a transform value from the key columns and determine how index entries are organized within
the leaf files.
- The transform value space is divided into non-overlapping ranges.
- Each leaf file stores entries for a single range.
- The tracking file stores range bounds for each leaf file.

This structure enables efficient planning while keeping the data layout flexible for different index implementations.

## Definitions

### Index Type

The index type defines the logical category of an index and the class of queries it is designed to accelerate.

The following index type is defined in this specification:

| Type |
|--------|
| SCALAR |

Contributor

SCALAR is listed but never defined. I suggest adding a description column.


The following index types are reserved for future specifications:

| Type |
|--------|
| VECTOR |
| TERM |

The index type communicates the capabilities of an index to query engines and helps determine whether an index is
applicable to a particular query.

### Index Transform Function

Member

I think these sections (Transform, Instance, Snapshot) should follow the overview. Currently they have a lot of undefined terms in them

Contributor

Agree to move these definitions to after the Overview section.


The index transform function defines how the index organization key is derived from source table columns when rows are
stored in the index.

The transform function determines the physical organization of the indexed data and therefore influences which query
patterns can efficiently leverage the index.

The following transform functions are defined in this specification:

| Transform |
|-----------|
| IDENTITY |
| HASH |
| HILBERT |

Contributor

The Leaf Files Transform functions section also has this table and the reserved table below it. Should we remove the tables here, or remove them from the Leaf Files section, so the list lives in only one place?

Contributor

I think we agreed the organization transform is an Iceberg-style transform with a sort order, so I think we should use the Iceberg transform names: use bucket instead of hash.

I think for now the key-lookup index only needs identity and bucket, so we should move hilbert to the reserved table below.

Contributor

Also add a sentence somewhere to say that tuple transforms like (bucket(key, 256), key) (bucket first, then sort) are also supported.


The following transform function is reserved for future specifications:

| Transform |
|-----------|
| IVF |

Index instances may share the same index type while using different transform functions.

### Index Instance

An index instance is a concrete realization of an index type and transform function applied to a specific table.

Users create index instances by specifying:

- The source table
- The index type
- The transform function
- The indexed columns
- The included columns
- Index properties

Contributor

Shall we mark The included columns and Index properties optional?


Multiple instances of the same index type may exist for a table.

### Index Snapshot

An index snapshot is an immutable version of the index data generated from a specific table snapshot.

Each index snapshot references a complete set of index files and contains all data from the referenced table snapshot.

## Index Metadata

The index metadata file stores the index definition and snapshot history.

### Index Metadata File

| Requirement | Field | Type | Description |
|-------------|---------------------|--------------------------|-------------------------------------------------|
| required | format-version | int | Index specification version |
| required | uuid | string | Stable UUID assigned at creation |
| required | table-uuid | string | UUID of the indexed table |
| required | location | string | Index root location |
| required | type | string | Logical index type |
| required | transform-function | string | Physical organization transform |
| required | key-column-ids | list<int> | Indexed columns |
| optional | included-column-ids | list<int> | Included columns |
| optional | properties | map<string,string> | Index properties applicable for every snapshot |
| required | snapshots | list<index-snapshot> | Index snapshots |

Member (on the `type` field)

Are we going to have this only be chosen from a set of index types we define? Feels like we should if these are going to be interoperable. This also makes me think a bit about the "reserved" terms above. I think basically everything should be reserved unless we define it here imho.

Contributor

Agree it should be a closed set for interoperability. One step back though: I think we only scoped the key-lookup index, we didn't actually agree on a SCALAR/VECTOR/TERM type yet.

Member (on the `transform-function` field)

This probably needs to be well defined? An expression or something we explicitly make here?

## Index Snapshot

Each index snapshot corresponds to one version of the index data.

| Requirement | Field | Type | Description |
|-------------|--------------------------|--------------------|---------------------------------------------------------------------|
| required | snapshot-id | long | Index snapshot identifier |
| required | source-table-snapshot-id | long | Source table snapshot |
| required | timestamp-ms | long | Snapshot creation timestamp |
| required | tracking-file | string | Tracking file location |
| optional | properties | map<string,string> | Snapshot properties specific to this snapshot |
| optional | key-metadata | binary | Implementation-specific key metadata, for tracking file encryption. |
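As a rough illustration of the two tables above, the following Python sketch checks a metadata document for the required fields. The field names come from the tables; the helper `missing_fields`, the example values (UUIDs, locations, IDs), and the plain `"HASH"` string used for `transform-function` are assumptions made for this sketch, since the draft does not yet define how the transform is encoded.

```python
REQUIRED_METADATA_FIELDS = (
    "format-version", "uuid", "table-uuid", "location",
    "type", "transform-function", "key-column-ids", "snapshots",
)
REQUIRED_SNAPSHOT_FIELDS = (
    "snapshot-id", "source-table-snapshot-id", "timestamp-ms", "tracking-file",
)


def missing_fields(metadata: dict) -> list:
    """List required fields that are absent from an index metadata document."""
    missing = [f for f in REQUIRED_METADATA_FIELDS if f not in metadata]
    for i, snapshot in enumerate(metadata.get("snapshots", [])):
        missing += [
            f"snapshots[{i}].{f}"
            for f in REQUIRED_SNAPSHOT_FIELDS
            if f not in snapshot
        ]
    return missing


# Every value below is made up for illustration.
example_metadata = {
    "format-version": 1,
    "uuid": "11111111-2222-3333-4444-555555555555",
    "table-uuid": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
    "location": "s3://example-bucket/db/tbl/indexes/pk_idx",
    "type": "SCALAR",
    "transform-function": "HASH",  # encoding is an assumption
    "key-column-ids": [1],
    "snapshots": [
        {
            "snapshot-id": 1,
            "source-table-snapshot-id": 1001,
            "timestamp-ms": 1700000000000,
            "tracking-file": "s3://example-bucket/db/tbl/indexes/pk_idx/tracking-1.parquet",
        }
    ],
}

print(missing_fields(example_metadata))
```

Optional fields (`included-column-ids`, `properties`, `key-metadata`) are deliberately not checked.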

## Tracking File

Each index snapshot references exactly one tracking file.

It contains summary metadata about all leaf files belonging to the index snapshot and enables efficient planning
without scanning every leaf file.

The tracking file may be stored using any supported metadata file format.

### Tracking File Entry

Contributor

remove this


Each tracking file contains a collection of tracking file entries. A tracking file entry describes a single leaf file
tracked by an index snapshot. Its fields are the subset of the V4 manifest entry fields that are relevant to planning
queries against the index.
Each entry carries statistics for the leaf file it references, so engines can prune and plan without opening every
leaf file.

| Field ID | Name | Type | Requirement | Description |
|----------|--------------------|---------|--------------|--------------------------------------------------------------------------------------------------------------|
| 100 | location | string | required | Location of the referenced file. |
| 101 | file_format | string | required | File format name, such as parquet, avro, or orc. |
| 103 | record_count | long | required | Number of records contained in the referenced leaf file. |
| 104 | file_size_in_bytes | long | required | Total file size in bytes. |
| 146 | content_stats | struct | optional | Statistics used for planning and pruning, including transform-key statistics and optional column statistics. |
| 131 | key_metadata | binary | optional | Implementation-specific key metadata, used for leaf file encryption. |

Contributor (on the `content_stats` field)

Does content_stats contain the transform bounds (transform_min / transform_max)? If so, I think we should make them explicit, required fields. They're needed for routing and non-overlapping ranges, but content_stats is marked optional here, so the bounds could be missing.

Contributor

key_metadata -> key-metadata?


### Content Statistics

The content statistics structure contains transform-key statistics and optional column statistics for the referenced
file. The transform-key statistics are always present, while column statistics are optional and may be omitted for
performance reasons.

## Leaf Files

Leaf files contain the actual index entries and represent the lowest level of the index hierarchy.

Leaf files must be standard Iceberg data files and may be stored using any Iceberg-supported file format:
- Parquet
- Avro
- ORC (may be removed if ORC support is deprecated in Iceberg)

The schema of a leaf file is determined by the index definition and contains:
- All key columns defined by the index
- All included columns defined by the index
- The transform value produced by the transform function
- The source file path
- The source row position

Contributor (on the included columns)

Since this is optional, maybe word it as "Any included columns defined by the index" to make clear it can be empty?

Contributor (on the transform value)

for an identity transform on the key, the transform value equals the key column, do we still want to save the transform value?

### Transform Functions

The transform function produces a transform value for each indexed row. To enable efficient planning, the transform
value space is divided into non-overlapping ranges. Each leaf file contains entries for a single range, while the
tracking file stores the corresponding bounds for every leaf file. If, and only if, a single transform value produces
more rows than fit in one leaf file, multiple leaf files may be created for that value, and engines must read all of
them.

When a query predicate can be mapped to transform value ranges, engines can use these bounds to prune leaf files that
cannot contain matching entries, avoiding unnecessary reads.
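To make the pruning rule concrete, here is a minimal Python sketch; it is an illustration, not a normative algorithm. The `LeafEntry` class and its `transform_min` / `transform_max` fields are hypothetical names for the per-leaf bounds, since this draft does not yet fix how bounds are encoded in the tracking file. Bounds are assumed to be inclusive integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafEntry:
    """Hypothetical tracking-file entry; the bound field names are assumptions."""
    location: str
    transform_min: int
    transform_max: int


def prune_leaf_files(entries, lo, hi):
    """Keep leaf files whose inclusive bounds intersect the query range [lo, hi]."""
    return [e for e in entries if e.transform_max >= lo and e.transform_min <= hi]


entries = [
    LeafEntry("leaf-0.parquet", 0, 99),
    LeafEntry("leaf-1.parquet", 100, 199),
    # The single value 200 overflowed one leaf file, so two files share it.
    LeafEntry("leaf-2a.parquet", 200, 200),
    LeafEntry("leaf-2b.parquet", 200, 200),
    LeafEntry("leaf-3.parquet", 201, 299),
]

point = prune_leaf_files(entries, 200, 200)   # lookup of one transform value
ranged = prune_leaf_files(entries, 50, 120)   # range predicate
```

The point lookup keeps both `leaf-2a.parquet` and `leaf-2b.parquet`, matching the rule that engines must read every leaf file created for an overflowing value.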

A well-designed transform function also preserves locality between the source columns and the resulting transform value,
allowing additional pruning using column statistics stored in the tracking file. For example, a Hilbert transform can
cluster similar multi-column keys together, reducing the number of leaf files that must be read for range scans and
partial-key lookups.

The following transform functions are currently supported:

| Transform | Bound Interpretation |
|-----------|----------------------|
| IDENTITY | Original value range |
| HASH | Hash bucket range |
| HILBERT | Hilbert key range |

The following transform type is reserved for future specifications:

| Transform | Bound Interpretation |
|-----------|----------------------------|
| IVF | Centroid identifier range |


### Leaf Schema

Columns originating from the source table must preserve their original Iceberg field identifiers.
Reusing the original field IDs ensures that schema evolution, column renames, and type compatibility semantics remain
consistent between the table and the index.

The index-specific columns are:

| Field ID | Column | Type | Description |
|-----------|-----------------|--------|------------------------------------------------------------------------|
| TBD | transform_value | long | The result of applying the index transform function to the key columns |
| TBD | file_path | string | The path of the source data file the entry references |
| TBD | position | long | The row position of the entry within the source data file |

Contributor (on the `transform_value` column)

the type is not always long, maybe change to determined by the transform function?

Contributor

file_path and position are basically Iceberg's reserved _file (2147483646) and _pos (2147483645). Should we reuse those reserved IDs and give transform_value another reserved ID?


## Example: Key Lookup Index

Index Type:

```text
SCALAR
```

Transform Function:

```text
HASH(primary_key)
```

Contributor

change to bucket(primary_key, N)?

Leaf Schema:

| Column |
|------------------|
| primary_key |
| transform_value |
| file_path |
| position |

The leaf files are organized by hash key, while the tracking file stores summary information and pruning statistics.
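A small Python sketch of how a lookup against such an index might proceed, under several assumptions: CRC32 stands in for the hash (this draft does not say which hash the transform uses), `NUM_BUCKETS` is arbitrary, each bucket maps to exactly one in-memory "leaf", and rows within a leaf are sorted by key. The function names are hypothetical.

```python
import bisect
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count


def transform(primary_key: str) -> int:
    # Stand-in hash (CRC32) for illustration only; not taken from the spec.
    return zlib.crc32(primary_key.encode("utf-8")) % NUM_BUCKETS


def build_leaves(rows):
    """Group (primary_key, file_path, position) rows into one sorted leaf per bucket.

    Leaf rows follow the example schema:
    (primary_key, transform_value, file_path, position).
    """
    leaves = {}
    for primary_key, file_path, position in rows:
        value = transform(primary_key)
        leaves.setdefault(value, []).append((primary_key, value, file_path, position))
    for leaf in leaves.values():
        leaf.sort()
    return leaves


def lookup(leaves, primary_key):
    """Return (file_path, position) pairs for a key, reading a single leaf."""
    leaf = leaves.get(transform(primary_key), [])
    i = bisect.bisect_left(leaf, (primary_key,))  # first row with key >= primary_key
    hits = []
    while i < len(leaf) and leaf[i][0] == primary_key:
        hits.append((leaf[i][2], leaf[i][3]))
        i += 1
    return hits


leaves = build_leaves([
    ("user-1", "data/a.parquet", 0),
    ("user-2", "data/b.parquet", 7),
    ("user-3", "data/a.parquet", 1),
])
print(lookup(leaves, "user-2"))
```

In a real index the bucket-to-leaf routing would come from the bounds in the tracking file rather than from a dictionary.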

Contributor

The leaf files are organized by hash key -> The leaf files are organized by transform value?


## Snapshot Evolution

Index snapshots are immutable.

Updating an index creates:

1. New leaf files
2. A new tracking file that references the new leaf files and any existing leaf files that remain valid for the new snapshot
3. A new index metadata file

The catalog commits the update by atomically replacing the metadata location.
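The commit step can be pictured with the following Python sketch. It assumes the atomic replacement behaves like a compare-and-swap on the metadata location, which this draft does not state explicitly. `InMemoryCatalog`, `with_new_snapshot`, and the file names are hypothetical, and the in-memory class is not thread-safe, so it only models the semantics.

```python
import copy


class InMemoryCatalog:
    """Hypothetical catalog stand-in: maps an index name to its metadata location."""

    def __init__(self):
        self._locations = {}

    def current(self, name):
        return self._locations.get(name)

    def swap(self, name, expected, new):
        """Succeed only if the stored location still equals `expected`."""
        if self._locations.get(name) != expected:
            return False
        self._locations[name] = new
        return True


def with_new_snapshot(metadata, snapshot):
    """Return a new metadata document; the input document is left untouched."""
    updated = copy.deepcopy(metadata)
    updated["snapshots"].append(snapshot)
    return updated


catalog = InMemoryCatalog()
v1 = {"snapshots": [{"snapshot-id": 1, "tracking-file": "tracking-1"}]}
catalog.swap("pk_idx", None, "metadata-v1.json")

v2 = with_new_snapshot(v1, {"snapshot-id": 2, "tracking-file": "tracking-2"})
committed = catalog.swap("pk_idx", "metadata-v1.json", "metadata-v2.json")

# A writer that still holds the v1 location loses the race and must retry.
stale = catalog.swap("pk_idx", "metadata-v1.json", "metadata-v2-other.json")
```

Because `v1` is never modified, readers that resolved the old metadata location keep seeing a consistent index snapshot.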

## Maintenance

An index maintains a mapping between source table snapshots and index snapshots.

Engines may use this mapping to determine whether a compatible index snapshot exists for a given table snapshot.
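One way an engine might consult this mapping is sketched below in Python. It assumes "compatible" means an exact match on `source-table-snapshot-id`, and that the newest matching index snapshot wins; the draft defines neither rule. The function name `find_index_snapshot` and the sample values are hypothetical.

```python
def find_index_snapshot(index_metadata, table_snapshot_id):
    """Return the newest index snapshot built from the given table snapshot, or None."""
    matches = [
        s for s in index_metadata["snapshots"]
        if s["source-table-snapshot-id"] == table_snapshot_id
    ]
    return max(matches, key=lambda s: s["timestamp-ms"], default=None)


# Sample values are made up for illustration.
index_metadata = {
    "snapshots": [
        {"snapshot-id": 1, "source-table-snapshot-id": 1001,
         "timestamp-ms": 10, "tracking-file": "tracking-1"},
        {"snapshot-id": 2, "source-table-snapshot-id": 1002,
         "timestamp-ms": 20, "tracking-file": "tracking-2"},
        {"snapshot-id": 3, "source-table-snapshot-id": 1002,
         "timestamp-ms": 30, "tracking-file": "tracking-3"},
    ]
}

usable = find_index_snapshot(index_metadata, 1002)
absent = find_index_snapshot(index_metadata, 9999)
```

When no index snapshot matches, an engine can fall back to a regular table scan, since indexes are optional.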

## Future Extensions

Future specifications may define:

- VECTOR indexes
- TERM indexes