Core: Basic fields and schemas for column files by gaborkaszab · Pull Request #16285 · apache/iceberg

gaborkaszab · 2026-05-11T13:52:03Z

This change introduces the interface for column files and integrates it into the schema for TrackedFile.

gaborkaszab · 2026-05-11T16:02:14Z

First piece of the column update work: introducing the basic interface for column update files, a.k.a. column files.
cc @anuragmantri @rdblue @pvary @RussellSpitzer @amogh-jahagirdar @anoopj @nastra

gaborkaszab · 2026-05-21T13:56:27Z

I opened a thread on dev@ to discuss the metadata structs for column files. Once that's finalized, I'll incorporate the changes here.

gaborkaszab · 2026-06-02T13:13:44Z

cc @amogh-jahagirdar @rdblue @anoopj

gaborkaszab · 2026-06-04T07:03:03Z

          Tracking.FIRST_ROW_ID,
          Tracking.DELETED_POSITIONS,
          Tracking.REPLACED_POSITIONS,
+          Tracking.LATEST_COLUMN_FILE_SNAPSHOT_ID,


I believe the expectation is to have the physically persisted schema fields first and the manifest position after them. Hence I placed the new field before ROW_POSITION. Note that this changes the ordinal of the existing manifest position field; however, since this is under development and there are no already-written data files out there, this seems fine.

The implementation of the new field in getByPos and internalSet is meant to land in the follow-up implementation PR.

gaborkaszab · 2026-06-08T12:33:22Z

Adjusted field IDs because 157 is going to be allocated for writer_format_version in this PR.

gaborkaszab · 2026-06-12T14:39:56Z

+      this.status = EntryStatus.MODIFIED;
+    }
+    // Bumping 'dataSequenceNumber' to avoid having both equality deletes and column files.
+    this.dataSequenceNumber = null;


We discussed bumping the data sequence number when adding column files. We haven't mentioned the file sequence number, so I'm not bumping it here.
This works if the manifest owning this data file entry bumps its own sequence number when adding column files. Let me know if there is any other way of achieving this.

anuragmantri · 2026-06-18T23:32:03Z

+        deletedPositions == null && replacedPositions == null,
+        "Cannot mark column files updated on a manifest entry (deleted/replaced positions are set)");
+    this.latestColumnFileSnapshotId = newSnapshotId;
+    if (status == EntryStatus.EXISTING) {


Should we add preconditions here to check that the status is never DELETED or REPLACED?

This method is in line with the similar dvUpdated. Since DELETED and REPLACED aren't constructed through the builder, guarding against those statuses would just be noise here, IMO.
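For readers following the thread, the guard and status transition under discussion can be sketched roughly as below. This is a simplified stand-in, not the PR's class: `ColumnFileTrackingSketch`, the method name `columnFilesUpdated`, the `Object` placeholder types, and the reduced `EntryStatus` enum are all hypothetical. It also assumes that a null data sequence number is resolved later through inheritance from the owning manifest, which is how the surrounding discussion reads.

```java
/** Hypothetical, simplified stand-in for the builder under review; not the PR's class. */
final class ColumnFileTrackingSketch {
  // Reduced enum for illustration; DELETED/REPLACED are omitted on purpose.
  enum EntryStatus { EXISTING, ADDED, MODIFIED }

  private EntryStatus status;
  private Long dataSequenceNumber;
  private Long latestColumnFileSnapshotId;
  private final Object deletedPositions;  // placeholder type (assumption)
  private final Object replacedPositions; // placeholder type (assumption)

  ColumnFileTrackingSketch(
      EntryStatus status,
      Long dataSequenceNumber,
      Object deletedPositions,
      Object replacedPositions) {
    this.status = status;
    this.dataSequenceNumber = dataSequenceNumber;
    this.deletedPositions = deletedPositions;
    this.replacedPositions = replacedPositions;
  }

  ColumnFileTrackingSketch columnFilesUpdated(long newSnapshotId) {
    // Same guard as in the diff, written without Guava's Preconditions.
    if (deletedPositions != null || replacedPositions != null) {
      throw new IllegalArgumentException(
          "Cannot mark column files updated on a manifest entry (deleted/replaced positions are set)");
    }
    this.latestColumnFileSnapshotId = newSnapshotId;
    // Only EXISTING entries flip to MODIFIED; other statuses are left untouched.
    if (status == EntryStatus.EXISTING) {
      this.status = EntryStatus.MODIFIED;
    }
    // Assumption: null means "inherit from the owning manifest" at a later stage.
    this.dataSequenceNumber = null;
    return this;
  }

  EntryStatus status() {
    return status;
  }

  Long dataSequenceNumber() {
    return dataSequenceNumber;
  }

  Long latestColumnFileSnapshotId() {
    return latestColumnFileSnapshotId;
  }
}
```

In this sketch an ADDED entry keeps its status, and only the snapshot ID and the sequence number change.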

anuragmantri · 2026-06-18T23:33:48Z

            : null;
+    this.columnFiles =
+        toCopy.columnFiles != null
+            ? toCopy.columnFiles.stream().map(ColumnFile::copy).collect(Collectors.toList())


This can NPE in .map(ColumnFile::copy). Do we ensure that it's non-null?

You mean columnFiles might contain a null value? Currently, you can pass any content for columnFiles through the constructor, so technically there can be a null too. In the long run, the expectation is to build this class through its builder (PR), where we can prevent adding null to the list.
I'm not concerned about this, WDYT?

I implemented a null-safe version of the copy and added test coverage, just to be on the safe side. Let me know what you think.
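A null-safe copy along these lines might look like the sketch below. The names `ColumnFileCopySketch`, `ColumnFileStub`, and `copyColumnFiles` are hypothetical, and the choice to preserve null elements (rather than drop or reject them) is an assumption; the thread does not say which behavior the actual change picked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration of a null-safe deep copy; not the PR's implementation. */
final class ColumnFileCopySketch {
  /** Minimal stand-in for a column file with a copy() method. */
  static final class ColumnFileStub {
    private final List<Integer> fieldIds;
    private final String location;

    ColumnFileStub(List<Integer> fieldIds, String location) {
      this.fieldIds = fieldIds;
      this.location = location;
    }

    List<Integer> fieldIds() {
      return fieldIds;
    }

    String location() {
      return location;
    }

    ColumnFileStub copy() {
      // Copy the mutable list so the clone does not share state with the source.
      return new ColumnFileStub(new ArrayList<>(fieldIds), location);
    }
  }

  private ColumnFileCopySketch() {}

  static List<ColumnFileStub> copyColumnFiles(List<ColumnFileStub> source) {
    if (source == null) {
      return null; // a null list stays null
    }
    // Guard each element: a method reference like ColumnFileStub::copy would NPE on null.
    return source.stream()
        .map(file -> file == null ? null : file.copy())
        .collect(Collectors.toList());
  }
}
```

Rejecting nulls up front in a builder, as suggested in the thread, would make the per-element guard unnecessary.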

anuragmantri · 2026-06-18T23:35:33Z

+  @Override
+  public String toString() {
+    return MoreObjects.toStringHelper(this)
+        .add("field_ids", fieldIds)


Nit: Does this print the list object instead of values?

This does print the values in the list. Just checked the output:
ColumnFileStruct{field_ids=[1, 2, 3], location=s3://bucket/data/column.parquet, file_size_in_bytes=1024}
I was hesitant to add test coverage for this, since I haven't seen the output of a toString function tested anywhere else.

ToStringHelper() should handle arrays just fine. No need to add a test for this.

RussellSpitzer · 2026-06-23T16:16:20Z

        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
        break;
+      case 16:
+        this.columnFiles = copyColumnFiles((List<ColumnFile>) value);


value can be null, so we can NPE here. Above, the ArrayUtil method guards against null input.

We can cover the null either here or in copyColumnFiles.

On a broader point, do we really need to copy here? The List isn't a reusable container, so why do we need to deep copy it? For example, why isn't this just like DeletionVector or ManifestInfo?

You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.

gaborkaszab

Thanks for taking a look, @RussellSpitzer !

I removed the unwanted copy. Also did a rebase because there was a conflict with the new switch style PR.
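To illustrate why the plain cast resolves the null concern: casting a null `Object` to `List` is harmless, whereas calling a method on it is what throws. The sketch below uses hypothetical names (`PositionalSetSketch`, `set`, `toIntArray`), uses `String` as a stand-in element type, and assumes position 15 for the equality IDs; only `case 16` appears in the diff.

```java
import java.util.List;

/** Hypothetical illustration of a positional setter; not the PR's class. */
final class PositionalSetSketch {
  private int[] equalityIds;
  private List<String> columnFiles; // String stands in for the column file type

  @SuppressWarnings("unchecked")
  void set(int pos, Object value) {
    switch (pos) {
      case 15: // assumed position, for illustration only
        this.equalityIds = toIntArray((List<Integer>) value);
        break;
      case 16:
        // Casting null is a no-op, so no NPE and no copy is needed here.
        this.columnFiles = (List<String>) value;
        break;
      default:
        throw new UnsupportedOperationException("Unknown position: " + pos);
    }
  }

  // Null-guarded conversion, mirroring the behavior the review attributes to ArrayUtil.
  private static int[] toIntArray(List<Integer> ints) {
    if (ints == null) {
      return null;
    }
    return ints.stream().mapToInt(Integer::intValue).toArray();
  }

  int[] equalityIds() {
    return equalityIds;
  }

  List<String> columnFiles() {
    return columnFiles;
  }
}
```

Note that without a copy the struct shares the caller's list instance, which is fine as long as that list is not reused as a mutable container.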

gaborkaszab · 2026-06-24T10:25:33Z

        this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
        break;
+      case 16:
+        this.columnFiles = copyColumnFiles((List<ColumnFile>) value);


You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.

This change introduces the base structs for column files and also integrates them into the schema for TrackedFile and Tracking.

anoopj · 2026-06-26T23:31:54Z

+    Preconditions.checkArgument(!newColumnFiles.isEmpty(), "Invalid column files: empty");
+    Preconditions.checkArgument(
+        contentType == FileContent.DATA || contentType == FileContent.DATA_MANIFEST,
+        "Column files can only be set for DATA or DATA_MANIFEST entries, but entry type is %s",


Why do we allow column files for DATA_MANIFEST entries? Is this for metadata updates? (override DV column?)

anoopj · 2026-06-26T23:57:38Z

+import org.apache.iceberg.types.Types;
+
+/** Information about a column file. */
+interface ColumnFile {


The Efficient Column Updates proposal had sequence_number at the column file level. Is that stale? I.e., are we dropping per-file granularity?

anoopj · 2026-06-27T00:03:45Z

+    assertThat(copy.fileSizeInBytes()).isEqualTo(2048L);
+
+    // verify deep copy
+    assertThat(copy.fieldIds()).isNotSameAs(columnFile.fieldIds());


This will always pass because fieldIds() wraps the list in Collections.unmodifiableList(), which returns a different object on every call. I think the only way to test the deep copy is to actually mutate the field IDs in the source and verify that the values don't change in the copy.
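The point can be shown with a small, self-contained sketch. `DeepCopyCheckSketch` and `ColumnFileStub` are hypothetical stand-ins, and the stub is assumed to hold a mutable list behind an unmodifiable view, as the comment describes. The identity check passes even for a shallow copy, while a mutation of the source list tells the two apart.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of identity checks versus mutation checks for deep copies. */
final class DeepCopyCheckSketch {
  static final class ColumnFileStub {
    private final List<Integer> fieldIds;

    ColumnFileStub(List<Integer> fieldIds) {
      this.fieldIds = fieldIds; // keeps the caller's reference on purpose
    }

    List<Integer> fieldIds() {
      // A fresh wrapper is created on every call, so identity comparisons always differ.
      return Collections.unmodifiableList(fieldIds);
    }

    ColumnFileStub shallowCopy() {
      return new ColumnFileStub(fieldIds); // shares the backing list
    }

    ColumnFileStub deepCopy() {
      return new ColumnFileStub(new ArrayList<>(fieldIds)); // owns its own list
    }
  }

  private DeepCopyCheckSketch() {}

  /** The weak check: true for both shallow and deep copies. */
  static boolean identityCheckPasses(ColumnFileStub source, ColumnFileStub copy) {
    return copy.fieldIds() != source.fieldIds();
  }

  /** The stronger check: mutate the source's backing list and see if the copy changes. */
  static boolean survivesSourceMutation(List<Integer> backingList, ColumnFileStub copy) {
    List<Integer> before = new ArrayList<>(copy.fieldIds());
    backingList.add(-1);
    boolean unchanged = before.equals(copy.fieldIds());
    backingList.remove(backingList.size() - 1); // restore the source
    return unchanged;
  }
}
```

In an AssertJ test the mutation check would typically end with a value comparison on `copy.fieldIds()` after mutating the source, rather than `isNotSameAs`.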

anoopj · 2026-06-27T00:08:32Z

+  @Override
+  public String toString() {
+    return MoreObjects.toStringHelper(this)
+        .add("field_ids", fieldIds)


ToStringHelper() should handle arrays just fine. No need to add a test for this.

github-actions Bot added the core label May 11, 2026

gaborkaszab commented May 11, 2026

Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated

anuragmantri added this to V4: Efficient Column Updates May 11, 2026

anuragmantri moved this to Backlog in V4: Efficient Column Updates May 11, 2026

anuragmantri moved this from Backlog to In progress in V4: Efficient Column Updates May 11, 2026

gaborkaszab force-pushed the main_column_file_interface branch from 630b00e to ca3259e Compare May 11, 2026 17:04

anuragmantri reviewed May 11, 2026

gaborkaszab force-pushed the main_column_file_interface branch from ca3259e to e6f7cf6 Compare May 12, 2026 09:35

gaborkaszab commented May 13, 2026

Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated

Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated

gaborkaszab force-pushed the main_column_file_interface branch 3 times, most recently from 681633b to 813d5c0 Compare May 13, 2026 12:52

gaborkaszab force-pushed the main_column_file_interface branch 3 times, most recently from 596f6a4 to 6a1cbe9 Compare June 2, 2026 13:12

gaborkaszab changed the title ~~Core: Introduce interface for column files~~ Core: Basic fields and schemas for column files Jun 3, 2026

gaborkaszab force-pushed the main_column_file_interface branch from 6a1cbe9 to c683e72 Compare June 4, 2026 06:55

gaborkaszab commented Jun 4, 2026

pvary moved this to In review in V4: metadata tree Jun 8, 2026

pvary added this to V4: metadata tree Jun 8, 2026

gaborkaszab force-pushed the main_column_file_interface branch from c683e72 to 6222fad Compare June 8, 2026 12:32

stevenzwu reviewed Jun 9, 2026

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java

Comment thread core/src/main/java/org/apache/iceberg/TrackingStruct.java Outdated

gaborkaszab force-pushed the main_column_file_interface branch from 6222fad to 5c04f55 Compare June 11, 2026 09:27

gaborkaszab commented Jun 11, 2026

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated

Comment thread core/src/main/java/org/apache/iceberg/TrackingBuilder.java

gaborkaszab force-pushed the main_column_file_interface branch from 5c04f55 to b6ae446 Compare June 12, 2026 14:37

gaborkaszab commented Jun 12, 2026

gaborkaszab force-pushed the main_column_file_interface branch 2 times, most recently from 0a252e8 to 144cd39 Compare June 18, 2026 14:42

gaborkaszab requested review from anuragmantri and stevenzwu June 18, 2026 14:43

anuragmantri reviewed Jun 18, 2026

gaborkaszab force-pushed the main_column_file_interface branch 2 times, most recently from 35c5fb4 to 8df135e Compare June 19, 2026 11:11

gaborkaszab requested a review from anuragmantri June 19, 2026 11:12

gaborkaszab force-pushed the main_column_file_interface branch from 8df135e to e1a430d Compare June 23, 2026 14:38

RussellSpitzer reviewed Jun 23, 2026

gaborkaszab force-pushed the main_column_file_interface branch from e1a430d to c3205ae Compare June 24, 2026 10:29

gaborkaszab commented Jun 24, 2026

Core: Basic fields and schemas for column files

e235fc0

This change introduces the base structs for column files and also integrates them into the schema for TrackedFile and Tracking.

gaborkaszab force-pushed the main_column_file_interface branch from c3205ae to e235fc0 Compare June 24, 2026 10:53

gaborkaszab requested a review from RussellSpitzer June 24, 2026 12:44

anoopj reviewed Jun 27, 2026

6 participants