Skip to content

Core: Basic fields and schemas for column files#16285

Open
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_column_file_interface
Open

Core: Basic fields and schemas for column files#16285
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_column_file_interface

Conversation

@gaborkaszab

Copy link
Copy Markdown
Contributor

This change introduces the interface for column files and also integrates it to the schema for TrackedFile.

@github-actions github-actions Bot added the core label May 11, 2026
Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

First piece of the column update work: introducing the basic interface of the column updates files, aka column files
cc @anuragmantri @rdblue @pvary @RussellSpitzer @amogh-jahagirdar @anoopj @nastra

Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java
Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from ca3259e to e6f7cf6 Compare May 12, 2026 09:35
Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch 3 times, most recently from 681633b to 813d5c0 Compare May 13, 2026 12:52
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

I opened a thread on dev@ to discuss the metadata structs for column files. Once that's finalized, I'll incorporate the changes here.

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch 3 times, most recently from 596f6a4 to 6a1cbe9 Compare June 2, 2026 13:12
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

cc @amogh-jahagirdar @rdblue @anoopj

@gaborkaszab gaborkaszab changed the title Core: Introduce interface for column files Core: Basic fields and schemas for column files Jun 3, 2026
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from 6a1cbe9 to c683e72 Compare June 4, 2026 06:55
Tracking.FIRST_ROW_ID,
Tracking.DELETED_POSITIONS,
Tracking.REPLACED_POSITIONS,
Tracking.LATEST_COLUMN_FILE_SNAPSHOT_ID,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the expectation is to have the physically persisted schema fields first and the manifest position after. Hence I placed the new field before ROW_POSITION Note, this changes the ordinal of the existing manifest pos field, however, since this is under development and there are no already written data files out there, this seems fine.

The implementation of the new field in getByPos and internalSet are meant to be in the follow-up implementation PR.

@pvary pvary moved this to In review in V4: metadata tree Jun 8, 2026
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from c683e72 to 6222fad Compare June 8, 2026 12:32
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Adjusted field IDs because 157 is going to be allocated for writer_format_version in this PR.

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java
Comment thread core/src/main/java/org/apache/iceberg/TrackingStruct.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from 6222fad to 5c04f55 Compare June 11, 2026 09:27
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackingBuilder.java
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from 5c04f55 to b6ae446 Compare June 12, 2026 14:37
this.status = EntryStatus.MODIFIED;
}
// Bumping 'dataSequenceNumber' to avoid having both equality deletes and column files.
this.dataSequenceNumber = null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed bumping the data sequence number when adding column files. We haven't mentioned file seq num, so I'm not bumping it here.
This works if the manifest owning this data file entry bumps its own seq num when adding column files. Let me know if there is any other way achieving this.

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch 2 times, most recently from 0a252e8 to 144cd39 Compare June 18, 2026 14:42
deletedPositions == null && replacedPositions == null,
"Cannot mark column files updated on a manifest entry (deleted/replaced positions are set)");
this.latestColumnFileSnapshotId = newSnapshotId;
if (status == EntryStatus.EXISTING) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add preconditions here to check status should never be DELETED or REPLACED?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is inline with the similar dvUpdated. Since DELETED and REPLACED aren't constructed through the builder, guarding against those statuses would be just noise here IMO

: null;
this.columnFiles =
toCopy.columnFiles != null
? toCopy.columnFiles.stream().map(ColumnFile::copy).collect(Collectors.toList())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can NPE in .map(ColumnFile::copy), do we ensure that it's non null?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean columnFiles might contain a null value? Currently, you can pass whatever content for columnFiles through the constructor, you now technically there can be null too. In the long run the expectation is to build this class through its builder (PR) where we can prevent adding null to the list.
I'm not concerned about this, WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented a null-safe version of the copy and added test coverage, just to be on the safe side. Let me know what you think.

@Override
public String toString() {
return MoreObjects.toStringHelper(this)
.add("field_ids", fieldIds)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Does this print the list object instead of values?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does print the values in the list. Just checked the output:
ColumnFileStruct{field_ids=[1, 2, 3], location=s3://bucket/data/column.parquet, file_size_in_bytes=1024}
I was hesitating to add test coverage for this, but I haven't seen anywhere else testing the output of a toSting function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToStringHelper() should handle arrays just fine. No need to add a test for this.

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch 2 times, most recently from 35c5fb4 to 8df135e Compare June 19, 2026 11:11
@gaborkaszab gaborkaszab requested a review from anuragmantri June 19, 2026 11:12
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from 8df135e to e1a430d Compare June 23, 2026 14:38
this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
break;
case 16:
this.columnFiles = copyColumnFiles((List<ColumnFile>) value);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value can be null so we can NPE here. Above here the ArrayUtil method is guarding against the null input.

We can either cover the null here or in copyCOlumnFiles

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a broader point, do we really need to copy here? The List isn't a re-usuable container so why do we need to deep copy it? For example why isn't this just like DeletionVector or ManifestInfo?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from e1a430d to c3205ae Compare June 24, 2026 10:29

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, @RussellSpitzer !

I removed the unwanted copy. Also did a rebase because there was a conflict with the new switch style PR.

this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
break;
case 16:
this.columnFiles = copyColumnFiles((List<ColumnFile>) value);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.

This change introduces the bases structs for column files and also
integrates them to the schema for TrackedFile and Tracking.
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from c3205ae to e235fc0 Compare June 24, 2026 10:53
Preconditions.checkArgument(!newColumnFiles.isEmpty(), "Invalid column files: empty");
Preconditions.checkArgument(
contentType == FileContent.DATA || contentType == FileContent.DATA_MANIFEST,
"Column files can only be set for DATA or DATA_MANIFEST entries, but entry type is %s",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we allow column files for DATA_MANIFEST entries? Is this for metadata updates? (override DV column?)

import org.apache.iceberg.types.Types;

/** Information about a column file. */
interface ColumnFile {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Efficient Column Updates proposal had sequence_number at the column file level. Is that stale? ie are we dropping per-file granularity?

assertThat(copy.fileSizeInBytes()).isEqualTo(2048L);

// verify deep copy
assertThat(copy.fieldIds()).isNotSameAs(columnFile.fieldIds());

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will always pass because the fieldIds() wrap a Collections.umodifiableList() which will always be different. I think the only way to test deep copy is to actually mutate the field IDs in the source and verify that the values don't change in the copy.

@Override
public String toString() {
return MoreObjects.toStringHelper(this)
.add("field_ids", fieldIds)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToStringHelper() should handle arrays just fine. No need to add a test for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In progress
Status: In review

Development

Successfully merging this pull request may close these issues.

6 participants