Skip to content

Core: Introduce builder for TrackedFile#16769

Merged
stevenzwu merged 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfile_builder
Jun 22, 2026
Merged

Core: Introduce builder for TrackedFile#16769
stevenzwu merged 1 commit into
apache:mainfrom
gaborkaszab:main_trackedfile_builder

Conversation

@gaborkaszab

Copy link
Copy Markdown
Contributor

No description provided.

@github-actions github-actions Bot added the core label Jun 11, 2026
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up, not a blocker: We are doing back and forth conversion btw array and list. I know this follows existing pattern like BaseFile. I am wondering if it makes sense to just switch to arrays for these to splitOffsets and equalityIds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can open a separate PR for this where we can continue the discussion. WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, need a bit broader discussion on that topic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: I did some investigation on this. Apparently, based on the schema the ValueReaders expect Lists for these types and not arrays so the internal get and set functions have to return List objects.

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from f9c86c3 to 2ad4d31 Compare June 17, 2026 10:04

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look @stevenzwu !
I addressed all the comments. I have a follow-up question for you firstRowId, though.

Since MODIFIED is merged, let me rebase and eliminate the relevant TODOs

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can open a separate PR for this where we can continue the discussion. WDYT?

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 2ad4d31 to 2319f92 Compare June 17, 2026 12:05
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Rebased with main and resolved the following:

  • TrackedFileAdapters to use the builder instead of the constructor
  • TestTrackedFileBuilder to adjust to MODIFIED state where necessary

assertThat(deleteFile.format()).isEqualTo(FileFormat.AVRO);
assertThat(deleteFile.recordCount()).isEqualTo(50L);
assertThat(deleteFile.fileSizeInBytes()).isEqualTo(512L);
assertThat(deleteFile.sortOrderId()).isEqualTo(5);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EQ-deletes don't allow sort order and splitOffset through the builder. Removed the checks before no longer makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our conversation, I restored setting these fields for eq-deletes too

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 2319f92 to bf22506 Compare June 17, 2026 12:10
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch 2 times, most recently from 6060c35 to 082a487 Compare June 17, 2026 12:37
@gaborkaszab gaborkaszab requested a review from stevenzwu June 17, 2026 13:19
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;
this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null;
this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, need a bit broader discussion on that topic.

}

@Override
public boolean equals(Object other) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLike classes in Iceberg often don't implement the equals method. but in this case, I am ok with it for usage in the TrackedFileBuilder

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should remove this. There are multiple ways of considering something "equal" and we try to avoid equals methods when it is unclear which one should be used. That's the case here, where it would be reasonable to assume that two DeletionVector instances are the same if they represent the same DV (location and offset are equal) but also reasonable to assume two instances are equal if they have the same values (mostly implemented here).

In addition to the problem of what this means, there's another problem that this definition only works for other DeletionVectorStruct instances. That means the definition of equality is more murky: two DeletionVector instances can be equal by either definition above but still may not be equal here. The purpose of using interfaces for the public API is to make it so that implementations are interchangeable and this breaks that principle.

I think this was introduced to be able to detect changes applied to the builder, but we should find a better way to solve that problem rather than introducing this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue the comparison was introduced due to my comment here. Do you think it is sth we should guard against? if not, we can maybe just add Javadoc to explain the usage. if yes, should we just do a reference comparison like this.deletionVector != newDeletionVector or we want a util method like `DelectionVectorUtil.sameContents(DeletionVector, DeletionVector)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped these overrides here, and went for the verification suggested by Ryan within the builder class.

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileBuilder.java
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
@stevenzwu stevenzwu moved this to In review in V4: metadata tree Jun 17, 2026
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 082a487 to 072a4e9 Compare June 18, 2026 12:29

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @stevenzwu for looking into this. I address your comments. For the new restrictions of eq-deletes I invited @amogh-jahagirdar to share some more context.

writer_format_version landed in the meantime. Let me rebase and get rid of the relevant TODOs

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch 3 times, most recently from 0360884 to 707b01c Compare June 18, 2026 14:13
@gaborkaszab gaborkaszab requested a review from stevenzwu June 18, 2026 14:18

@stevenzwu stevenzwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one open question in the comment thread about equality files.

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need a bit more time for an in-depth review on this but I did comment on the equality delete discussion, thanks for raising that @gaborkaszab @stevenzwu ! I'll update the spec PR , I think it was a miss to be that restrictive for split offsets and sort order ID, technically they can apply for equality deletes as well

TrackedFileBuilder deletedPositions(ByteBuffer newDeletedPositions) {
Preconditions.checkArgument(newDeletedPositions != null, "Invalid deleted positions: null");
Preconditions.checkArgument(
contentType == FileContent.DATA_MANIFEST || contentType == FileContent.DELETE_MANIFEST,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may want to add a isLeafManifest() helper to this class?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a helper function for that, thx! This is called from static context too so added a FileContent parameter to the function.

Comment on lines +122 to +123
source.contentType() != FileContent.DATA_MANIFEST
&& source.contentType() != FileContent.DELETE_MANIFEST,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned below, but with the amount of times we do this kind of check, I think we'd benefit from a isLeafManifest helper

@amogh-jahagirdar

Copy link
Copy Markdown
Contributor

Also cc @anoopj in case he's interested in taking a look at this .

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 707b01c to bf3dfe7 Compare June 22, 2026 12:06
@gaborkaszab gaborkaszab requested a review from stevenzwu June 22, 2026 12:14
Comment thread core/src/main/java/org/apache/iceberg/DeletionVectorStruct.java
}

/**
* Creates a builder for a tracked file derived from {@code newSource}.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newSource reference is stale?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting. Fixed

TrackedFileBuilder sortOrderId(int newSortOrderId) {
Preconditions.checkArgument(
!isLeafManifest(contentType),
"Sort order ID can not be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Sort order ID can not be added to manifest entries, but entry type is: %s",
"Sort order ID cannot be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Preconditions.checkArgument(newSplitOffsets != null, "Invalid split offsets: null");
Preconditions.checkArgument(
!isLeafManifest(contentType),
"Split offsets can not be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Split offsets can not be added to manifest entries, but entry type is: %s",
"Split offsets cannot be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

this.sortOrderId = sortOrderId;
this.deletionVector = deletionVector;
this.manifestInfo = manifestInfo;
this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ByteBuffers.toByteArray() and ArrayUtil.to*Array() are null safe. So it's ok to drop the ternary. e.g. just do:

 this.keyMetadata = ByteBuffers.toByteArray(keyMetadata);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed!

}

/**
* This method is for signalling that a new deletion vector is added. It is forbidden to add the

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to document the invariant in the comments. I would recommend just removing the javadoc for this method. It's the only method with a javadoc anyway.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion for the doc comment came from @stevenzwu here. I think this function is easy to read and interpret as is now, we can drop the comment. Let me know if you think otherwise @stevenzwu !

fileSizeInBytes != null, "Missing required field: file size in bytes");
Preconditions.checkArgument(partitionData != null, "Missing required field: partition data");
Preconditions.checkArgument(
(contentType != FileContent.DATA_MANIFEST && contentType != FileContent.DELETE_MANIFEST)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this check escaped the isLeafManifest consolidation — could be !isLeafManifest(contentType) || manifestInfo != null.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks for spotting! Fixed

@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from bf3dfe7 to 05fec44 Compare June 22, 2026 19:00
@gaborkaszab gaborkaszab requested a review from stevenzwu June 22, 2026 19:00
@gaborkaszab gaborkaszab force-pushed the main_trackedfile_builder branch from 05fec44 to aa0ec94 Compare June 22, 2026 21:29
@stevenzwu stevenzwu merged commit 7c13104 into apache:main Jun 22, 2026
54 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in V4: metadata tree Jun 22, 2026
@stevenzwu

Copy link
Copy Markdown
Contributor

I merged this PR to unblock other PRs.

We can follow up separately if there are more comments.

@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Thank you for the review @stevenzwu @anoopj @amogh-jahagirdar !
Let me know if you find anything that needs some adjustments! I'll take care of that

}

@Test
void testDvEquality() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: Iceberg keeps most acronyms upper case, so this should be DV instead of Dv.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! Dropped the test, because also dropped the equals override

private final FileContent contentType;

// Required fields
private Integer writerFormatVersion = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this ever be null? I'm starting from the assumption that in v4 we will only write v4 metadata files. If that's true, then I think there are two cases when building a TrackedFile instance: either we are writing a v4 file or we are adapting an existing v1-3 file to be tracked in v4. The first case is easy: always set this to 4.

Does the second case go through this builder? I would expect to use a ManifestFile to TrackedFile adapter, like the adapters @anoopj introduced for TrackedFile to DataFile, DeleteFile, etc. That is a good pattern to follow because we never want a caller that is moving tracked files around to have the logic for migrating from v3 to v4.

@stevenzwu stevenzwu Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I have a follow-up PR (draft status now) to adapt v3 shapes (like ManifestFile and DataFile) to v4 TrackedFile. It would be good to use builder for those adaptions too, unless you are thinking about using the constructor directly.

Regarding the null or not, current proposal/spec PR said it is not null (0 for pre-v4 and format versions for v4+)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, would we use adapters to wrap DataFile and DeleteFile anyway? Those are what our writers produce already and I doubt that we will want to go through and update all the plumbing to produce TrackedFile directly. The utility of this class may be very limited.

I could see us producing instances of TrackedFile by reading and wrapping v3 DataFile instances, or wrapping a new DataFile instance for v4. We should verify there are cases where we would use this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, as Steven noted, we could use the TrackedFileBuilder for the wrappers around V3 metadata structs, and there we could use this writer_format_version field through the builder. See the draft PR linked by Steven and let me know if it makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we concluded on (writer_)format_version related question on the other dedicated PR. Let me know if there is still something outstanding here!
Checking again Steven's PR on the wrappers, I think it makes sense to set this through the builder.

private FileFormat fileFormat = null;
private Long recordCount = null;
private Long fileSizeInBytes = null;
private PartitionData partitionData = null;

@rdblue rdblue Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartitionData should be StructLike. We don't want to rely on concrete classes here.

Also, this is in the "required fields" section, but I think that we want to allow this to be null for unpartitioned cases. The partition tuple will be null when specId is null or an unpartitioned spec, because the actual partition type that we store is the union schema of all valid partition specs in the table. And support for empty struct types and instances is unreliable.

source.writerFormatVersion(),
source.location(),
source.fileFormat(),
(PartitionData) source.partition(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need unchecked casts like this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe

Preconditions.checkArgument(source != null, "Invalid source: null");
Preconditions.checkArgument(
!isLeafManifest(source.contentType()),
"Manifest entries cannot transition to REPLACED, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less direct than we typically prefer, but is not bad. This states a general rule and then says that it applies. We would normally say that at once: "Cannot transition %s to REPLACED".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! I think we can even remove the content type parameter of the message:
Cannot transition manifest files to REPLACED

TrackedFileBuilder sortOrderId(int newSortOrderId) {
Preconditions.checkArgument(
!isLeafManifest(contentType),
"Sort order ID cannot be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot set sort order for manifest files"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Preconditions.checkArgument(newDeletionVector != null, "Invalid deletion vector: null");
Preconditions.checkArgument(
contentType == FileContent.DATA,
"Deletion vector can only be added to DATA entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot add deletion vector for file with content: %s"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, and found couple other similar messages, fixed them too.

"Deletion vector can only be added to DATA entries, but entry type is: %s",
contentType);
Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: this. is used when assigning to a field, but not when accessing a field.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

contentType);
Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),
"The same deletion vector already added");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check that I would expect is deletionVector == null || newDeletionVector.cardinality() > deletionVector.cardinality(). That ensures that the deletion vector was updated, without relying on an equality check. Otherwise, bad data could still make it through just by using a different concrete class (which is not equal).

The cardinality check is closer to what we want although not foolproof. The underlying location and offset could be the same, although it is unlikely that would be the case. If you want to make sure that this is actually a different deletion vector, then I think the right thing to do is to do that check here:

Preconditions.checkArgument(
    deletionVector == null ||
        deletionVector.cardinality() < newDeletionVector.cardinality(),
    "Invalid DV update, cardinality must increase: existing=%s, new=%s", ...);
Preconditions.checkArgument(
    deletionVector == null ||
        !deletionVector.location().equals(newDeletionVector.location()) ||
        !deletionVector.offset().equals(newDeletionVector.offset()),
    "Invalid DV update: same offset and location has changed cardinality");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, borrowed these checks!

return this;
}

TrackedFileBuilder deletedPositions(ByteBuffer newDeletedPositions) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against this since I haven't seen how it is used yet. But I find these methods that capture data that will be passed through to Tracking suspicious. Why not pass in a new Tracking and not mirror its builder's API here? Also, if this is the pattern we want, then why not wrap a TrackingBuilder and pass this directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Felt a bit odd for me when I implemented this but went this way for two reasons:

  1. I tried to avoid allow passing Tracking in the TrackedFileBuilder for the users because similar reasons we don't allow passing status for TrackingBuilder: seems to represent an internal state and users might get it wrong to construct. We could guard against malformed Tracking objects with checks, but seemed that we already have everything to construct it internally and not push one more responsibility to the users of the builder.
  2. TrackingBuilder itself can't check for all of the constraints. For instance it doesn't know if it belongs to a data file or a delete manifest, hence can't verify that deleted/replaced positions are only allowed for manifests. We can perform such a check here, in the TrackedFileBuilder.

Introduced a TrackingBuilder member to the TrackedFileBuilder and we can delegate all these calls right away without keeping track the required information.

Preconditions.checkArgument(recordCount != null, "Missing required field: record count");
Preconditions.checkArgument(
fileSizeInBytes != null, "Missing required field: file size in bytes");
Preconditions.checkArgument(partitionData != null, "Missing required field: partition data");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!
I change partition not to be required through the builder

return this;
}

TrackedFileBuilder partition(PartitionData newPartitionData) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?

file.set(SORT_ORDER_ID_ORDINAL, 3);
file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3}));
file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L, 100L));
TrackedFile file =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the use of going through the builder for tests? I think that this should simply use the constructor and pass all values. There's an argument for using this to make the test cases more clear (this is certainly readable) but it also doesn't allow us to construct some instances that we may want to test because they violate rules that the builder enforces.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to construct these objects in a "proper" way, however for these tests it doesn't really matter. Went back to using the constructor instead.
Also revived the createTracking() function that we had earlier, including reviving 2 Tracking ordinals, because using the TrackedFileStruct contractor makes those changes independent from this PR.

TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of());
private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type");
private static final int SPEC_ID_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "spec_id");
private static final int CONTENT_STATS_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_stats");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this fix the use of SPEC_ID_ORDINAL if it is updating these tests?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 tests still used it. Got rid of those usages too.
Also got rid of the DELETION_VECTOR_ORDINAL below

// TrackedFile optional field ordinals, looked up from the schema.
private static final Types.StructType TRACKED_FILE_SCHEMA =
TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of());
private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this introduce another ordinal? I think these are bad practice (these tests should not be using set/get) so I'm surprised to see that this eliminates some, but introduces others.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my answer below.
Removed

NO_PARTITION,
1L,
1L);
private static TrackedFileStruct dummyTrackedFile(FileContent contentType) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid unnecessary renames. This should be reverted.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to warn future readers/editor of this file that the TrackedFile returned here might have broken internal invariants because of going through the constructor. (it does, it has no Tracking for instance)
Reverted the rename

1L);
private static TrackedFileStruct dummyTrackedFile(FileContent contentType) {
TrackedFileStruct file = new TrackedFileStruct();
file.set(CONTENT_TYPE_ORDINAL, contentType.id());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a bad practice, using the ordinals and the set methods. I'm surprised to see this change in the addition of builders. Instead, just update the usage of the constructor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to find the easiest way to construct TrackedFile with the desired content type, where none go the other members matter.
Got rid of this new ordinal and changed to use the constructor.

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look, @rdblue !
I published the follow-up PR for this: #16964
Note, I'm not entirely sure about the partition StructLike / PartitionData part, but we can continue there.

file.set(SORT_ORDER_ID_ORDINAL, 3);
file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3}));
file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L, 100L));
TrackedFile file =

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to construct these objects in a "proper" way, however for these tests it doesn't really matter. Went back to using the constructor instead.
Also revived the createTracking() function that we had earlier, including reviving 2 Tracking ordinals, because using the TrackedFileStruct contractor makes those changes independent from this PR.

1L);
private static TrackedFileStruct dummyTrackedFile(FileContent contentType) {
TrackedFileStruct file = new TrackedFileStruct();
file.set(CONTENT_TYPE_ORDINAL, contentType.id());

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to find the easiest way to construct TrackedFile with the desired content type, where none go the other members matter.
Got rid of this new ordinal and changed to use the constructor.

NO_PARTITION,
1L,
1L);
private static TrackedFileStruct dummyTrackedFile(FileContent contentType) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to warn future readers/editor of this file that the TrackedFile returned here might have broken internal invariants because of going through the constructor. (it does, it has no Tracking for instance)
Reverted the rename

TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of());
private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type");
private static final int SPEC_ID_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "spec_id");
private static final int CONTENT_STATS_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_stats");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 tests still used it. Got rid of those usages too.
Also got rid of the DELETION_VECTOR_ORDINAL below

// TrackedFile optional field ordinals, looked up from the schema.
private static final Types.StructType TRACKED_FILE_SCHEMA =
TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of());
private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my answer below.
Removed

Preconditions.checkArgument(source != null, "Invalid source: null");
Preconditions.checkArgument(
!isLeafManifest(source.contentType()),
"Manifest entries cannot transition to REPLACED, but entry type is: %s",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! I think we can even remove the content type parameter of the message:
Cannot transition manifest files to REPLACED

private final FileContent contentType;

// Required fields
private Integer writerFormatVersion = null;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we concluded on (writer_)format_version related question on the other dedicated PR. Let me know if there is still something outstanding here!
Checking again Steven's PR on the wrappers, I think it makes sense to set this through the builder.

}

@Override
public boolean equals(Object other) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped these overrides here, and went for the verification suggested by Ryan within the builder class.

return this;
}

TrackedFileBuilder partition(PartitionData newPartitionData) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?

source.writerFormatVersion(),
source.location(),
source.fileFormat(),
(PartitionData) source.partition(),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants