Skip to content

Core: Adjustments for TrackedFileBuilder#16964

Open
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfilebuilder_adjustments
Open

Core: Adjustments for TrackedFileBuilder#16964
gaborkaszab wants to merge 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfilebuilder_adjustments

Conversation

@gaborkaszab

Copy link
Copy Markdown
Contributor

Original PR: #16769
This one is a follow-up with adjustments addressing further comments.

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carried over the unclosed comments from the original PR + some random extra explanation


// optional fields
private Integer specId = null;
private PartitionData partition = null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

carry-over comment: "PartitionData should be StructLike. We don't want to rely on concrete classes here."

TrackedFileStruct constructor seem to accept PartitionData so we have to do the conversion at some point. Seemed straightforward to keep the same type in the builder as expected by the constructed class.

TrackedFileBuilder partition(PartitionData newPartitionData) {
Preconditions.checkArgument(newPartitionData != null, "Invalid partition: null");
this.partitionData = newPartitionData;
TrackedFileBuilder partition(PartitionSpec spec, StructLike newPartition) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

carry-over comment: "I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type."

carry-over answer: "I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?"

this.recordCount = source.recordCount();
this.fileSizeInBytes = source.fileSizeInBytes();
this.partitionData = (PartitionData) source.partition();
this.partition = (PartitionData) source.partition();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

carry-over comment: "We should not need unchecked casts like this."

carry-over answer: "
The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:

PartitionData partition = (PartitionData) file.partition();

We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLikeContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.

private static final PartitionData PARTITION = partition("books");

// Tracking field ordinals, looked up from the schema so the tests do not hard-code offsets.
private static final int STATUS_ORDINAL = ordinalOf(Tracking.schema(), "status");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reverted the part of the test that constructs Tracking to the state before the builder, because that seemed independent of this change. Hence these show as appeared, but in fact restored

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
this.recordCount = source.recordCount();
this.fileSizeInBytes = source.fileSizeInBytes();
this.partitionData = (PartitionData) source.partition();
this.partition = (PartitionData) source.partition();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:

PartitionData partition = (PartitionData) file.partition();

We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLikeContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.

Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),
"The same deletion vector already added");
deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe the spec requiring increasing cardinality. A writer could reserialize the DVs into a new puffin blob. Should we allow equal cardinality?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very interesting scenario, like Puffin file compaction. It may not be useful to compact DV Puffin files due to the access pattern to DV blobs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:

  1. if DV cardinality doesn't increase than we probably add the same DV again
  2. if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.

I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to keep the stricter validation for now and relax in the future if needed.

deletionVector == null
|| !deletionVector.location().equals(newDeletionVector.location())
|| deletionVector.offset() != newDeletionVector.offset(),
"Invalid DV update: same location and offset has changed cardinality");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moot if we allow equal cardinality, but the error message part about cardinality is dependent on prior precondition check. It can be a bit confusing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Simplified the message here, and don't mention cardinality

if (spec.isUnpartitioned()) {
Preconditions.checkArgument(
newPartition == null, "Invalid partition: must be null for unpartitioned spec");
this.specId = spec.specId();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this is in both branches of the if. Hoist it out?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, was going to, but then I forgot :)

"The same deletion vector already added");
deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(),
"Invalid DV update, cardinality must increase: existing=%s, new=%s",
deletionVector == null ? null : deletionVector.cardinality(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
deletionVector == null ? null : deletionVector.cardinality(),
deletionVector.cardinality(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ternary is not needed because your precondition check already validates null

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.

class TrackedFileBuilder {
private final long snapshotId;
private final FileContent contentType;
private final TrackingBuilder trackingBuilder;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can align my PR with this new shape to avoid the individual setters that you commented.
#16936

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, nothing changes from the user's perspective, it's just how we create Tracking internally.

@gaborkaszab gaborkaszab force-pushed the main_trackedfilebuilder_adjustments branch from 0514210 to 390dc8c Compare June 26, 2026 12:45

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, @stevenzwu @anoopj !

One open question remained from this set of comments: Do we want to allow came dv cardinality for DV compaction?

class TrackedFileBuilder {
private final long snapshotId;
private final FileContent contentType;
private final TrackingBuilder trackingBuilder;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, nothing changes from the user's perspective, it's just how we create Tracking internally.

if (spec.isUnpartitioned()) {
Preconditions.checkArgument(
newPartition == null, "Invalid partition: must be null for unpartitioned spec");
this.specId = spec.specId();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, was going to, but then I forgot :)

Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),
"The same deletion vector already added");
deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:

  1. if DV cardinality doesn't increase than we probably add the same DV again
  2. if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.

I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?

"The same deletion vector already added");
deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(),
"Invalid DV update, cardinality must increase: existing=%s, new=%s",
deletionVector == null ? null : deletionVector.cardinality(),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
deletionVector == null
|| !deletionVector.location().equals(newDeletionVector.location())
|| deletionVector.offset() != newDeletionVector.offset(),
"Invalid DV update: same location and offset has changed cardinality");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Simplified the message here, and don't mention cardinality

@gaborkaszab gaborkaszab force-pushed the main_trackedfilebuilder_adjustments branch from 390dc8c to 116b838 Compare June 26, 2026 13:27
!isLeafManifest(source.contentType()),
"Manifest entries cannot transition to REPLACED, but entry type is: %s",
source.contentType());
!isLeafManifest(source.contentType()), "Cannot transition manifest files to REPLACED");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
!isLeafManifest(source.contentType()), "Cannot transition manifest files to REPLACED");
!isLeafManifest(source.contentType()), "Cannot transition manifest entries to REPLACED status");

Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),
"The same deletion vector already added");
deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to keep the stricter validation for now and relax in the future if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants