Skip to content

Core: Add v4 manifest reader#16958

Open
anoopj wants to merge 4 commits into
apache:mainfrom
anoopj:v4-manifest-reader
Open

Core: Add v4 manifest reader#16958
anoopj wants to merge 4 commits into
apache:mainfrom
anoopj:v4-manifest-reader

Conversation

@anoopj

@anoopj anoopj commented Jun 24, 2026

Copy link
Copy Markdown
Member

The reader supports:

  • column projection
  • partition pruning via a row filter, including multi-spec manifests

Content stats reading and metadata inheritance are not yet implemented.

supports:
  - column projection
  - partition pruning via a row filter, including multi-spec manifests

Content stats reading and metadata inheritance are not yet implemented.
@anoopj anoopj moved this to In review in V4: metadata tree Jun 24, 2026
@github-actions github-actions Bot added the core label Jun 24, 2026
.specId(0);
}

private static TrackedFile fileWithStatus(EntryStatus status, String location) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using the builder here so that we can set any status here.

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in the process of getting familiar with the details here. Asked early questions for my benefit.
Thanks @anoopj !

* Builds the tracking field with {@link MetadataColumns#ROW_POSITION} appended so the reader
* populates the manifest position of each entry.
*/
private static Types.NestedField trackingWithRowPosition() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These Tracking schemas are kind of confusing. Let me know if I get it wrong!
So Tracking.schema() is the physical representation of the data, what we expect to be written, or in other words, this is the write schema.
TrackingStruct.BASE_TYPE is the logical representation including field(s) that are physically not persisted, or in other words this is the read schema.
Naively asking, since we are on the read path, can't we simply use the read schema from TrackingStruct instead of constructing it here from the physical schema?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding correct! Tracking.schema() is the on-disk schema. TrackingStruct.BASE_TYPE is the same schema plus row_position. We can reuse it. I just needed to change the visibility of TrackingStruct.BASE_TYPE from private to package private.

// manifestLocation is not stored in the manifest; the reader fills it from the file location.
// manifestPos is filled from ROW_POSITION while reading the tracking struct.
if (tracking instanceof TrackingStruct) {
((TrackingStruct) tracking).setManifestLocation(file.location());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We populate Tracking.manifestPos using a metadata column ROW_POSITION, while we populate Tracking.manifestLocation manually here. For the latter, is there a reason we can't use MetadataColumns.FILE_PATH to be consistent?
We could get rid of the custom setter TrackingStruct.setManifestlocation method too and let it flow through internalSet getByPos methods.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They look symmetric but the underlying mechanism differs. ROW_POSITION is synthesized by the reader itself. FILE_PATH us only populated when the caller injects it. I'd lean toward keeping the manual set for now


private TrackedFile prepare(TrackedFile trackedFile) {
Tracking tracking = trackedFile.tracking();
Preconditions.checkState(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking here, should we rely on the builders enforcing this invariant?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the read path, we don't use builders. Tracking here comes from deserializing the manifest and not from a builder.

But i think the check here is not needed for a different reason. We have made Tracking a required field, so the internal reader should throw before prepare() runs. So this check is likely redundant.

return entries;
}

private boolean matchesPartition(TrackedFile trackedFile) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we want to drop tuple-based pruning entirely and push predicates against partition-transform-expression columns in content_stats. Under that model the partition tuple is only needed to match data files to equality-delete files.

I am planning to bring up the discussion in Monday's sync.

import org.apache.iceberg.util.StructProjection;

/** Reader that reads a v4 manifest file as {@link TrackedFile}s. */
class V4ManifestReader extends CloseableGroup implements CloseableIterable<TrackedFile> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this name V4ManifestReader would be come stale when V5 rolls in. maybe TrackedFileManifestReader.

import org.apache.iceberg.types.Types;
import org.apache.iceberg.util.StructProjection;

/** Reader that reads a v4 manifest file as {@link TrackedFile}s. */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: v4 -> v4+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

3 participants