Core: Introduce builder for TrackedFile#16769
Conversation
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
Follow-up, not a blocker: We are doing back and forth conversion btw array and list. I know this follows existing pattern like BaseFile. I am wondering if it makes sense to just switch to arrays for these to splitOffsets and equalityIds.
There was a problem hiding this comment.
I can open a separate PR for this where we can continue the discussion. WDYT?
There was a problem hiding this comment.
yes, need a bit broader discussion on that topic.
There was a problem hiding this comment.
For the record: I did some investigation on this. Apparently, based on the schema the ValueReaders expect Lists for these types and not arrays so the internal get and set functions have to return List objects.
f9c86c3 to
2ad4d31
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for taking a look @stevenzwu !
I addressed all the comments. I have a follow-up question for you firstRowId, though.
Since MODIFIED is merged, let me rebase and eliminate the relevant TODOs
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
I can open a separate PR for this where we can continue the discussion. WDYT?
2ad4d31 to
2319f92
Compare
|
Rebased with main and resolved the following:
|
| assertThat(deleteFile.format()).isEqualTo(FileFormat.AVRO); | ||
| assertThat(deleteFile.recordCount()).isEqualTo(50L); | ||
| assertThat(deleteFile.fileSizeInBytes()).isEqualTo(512L); | ||
| assertThat(deleteFile.sortOrderId()).isEqualTo(5); |
There was a problem hiding this comment.
EQ-deletes don't allow sort order and splitOffset through the builder. Removed the checks before no longer makes sense.
There was a problem hiding this comment.
After our conversation, I restored setting these fields for eq-deletes too
2319f92 to
bf22506
Compare
6060c35 to
082a487
Compare
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; | ||
| this.splitOffsets = splitOffsets != null ? ArrayUtil.toLongArray(splitOffsets) : null; | ||
| this.equalityIds = equalityIds != null ? ArrayUtil.toIntArray(equalityIds) : null; |
There was a problem hiding this comment.
yes, need a bit broader discussion on that topic.
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object other) { |
There was a problem hiding this comment.
StructLike classes in Iceberg often don't implement the equals method. but in this case, I am ok with it for usage in the TrackedFileBuilder
There was a problem hiding this comment.
I think that we should remove this. There are multiple ways of considering something "equal" and we try to avoid equals methods when it is unclear which one should be used. That's the case here, where it would be reasonable to assume that two DeletionVector instances are the same if they represent the same DV (location and offset are equal) but also reasonable to assume two instances are equal if they have the same values (mostly implemented here).
In addition to the problem of what this means, there's another problem that this definition only works for other DeletionVectorStruct instances. That means the definition of equality is more murky: two DeletionVector instances can be equal by either definition above but still may not be equal here. The purpose of using interfaces for the public API is to make it so that implementations are interchangeable and this breaks that principle.
I think this was introduced to be able to detect changes applied to the builder, but we should find a better way to solve that problem rather than introducing this.
There was a problem hiding this comment.
@rdblue the comparison was introduced due to my comment here. Do you think it is sth we should guard against? if not, we can maybe just add Javadoc to explain the usage. if yes, should we just do a reference comparison like this.deletionVector != newDeletionVector or we want a util method like `DelectionVectorUtil.sameContents(DeletionVector, DeletionVector)?
There was a problem hiding this comment.
I dropped these overrides here, and went for the verification suggested by Ryan within the builder class.
082a487 to
072a4e9
Compare
There was a problem hiding this comment.
Thanks again @stevenzwu for looking into this. I address your comments. For the new restrictions of eq-deletes I invited @amogh-jahagirdar to share some more context.
writer_format_version landed in the meantime. Let me rebase and get rid of the relevant TODOs
0360884 to
707b01c
Compare
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM except for one open question in the comment thread about equality files.
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
I need a bit more time for an in-depth review on this but I did comment on the equality delete discussion, thanks for raising that @gaborkaszab @stevenzwu ! I'll update the spec PR , I think it was a miss to be that restrictive for split offsets and sort order ID, technically they can apply for equality deletes as well
| TrackedFileBuilder deletedPositions(ByteBuffer newDeletedPositions) { | ||
| Preconditions.checkArgument(newDeletedPositions != null, "Invalid deleted positions: null"); | ||
| Preconditions.checkArgument( | ||
| contentType == FileContent.DATA_MANIFEST || contentType == FileContent.DELETE_MANIFEST, |
There was a problem hiding this comment.
we may want to add a isLeafManifest() helper to this class?
There was a problem hiding this comment.
Added a helper function for that, thx! This is called from static context too so added a FileContent parameter to the function.
| source.contentType() != FileContent.DATA_MANIFEST | ||
| && source.contentType() != FileContent.DELETE_MANIFEST, |
There was a problem hiding this comment.
Mentioned below, but with the amount of times we do this kind of check, I think we'd benefit from a isLeafManifest helper
|
Also cc @anoopj in case he's interested in taking a look at this . |
707b01c to
bf3dfe7
Compare
| } | ||
|
|
||
| /** | ||
| * Creates a builder for a tracked file derived from {@code newSource}. |
There was a problem hiding this comment.
Thanks for spotting. Fixed
| TrackedFileBuilder sortOrderId(int newSortOrderId) { | ||
| Preconditions.checkArgument( | ||
| !isLeafManifest(contentType), | ||
| "Sort order ID can not be added to manifest entries, but entry type is: %s", |
There was a problem hiding this comment.
| "Sort order ID can not be added to manifest entries, but entry type is: %s", | |
| "Sort order ID cannot be added to manifest entries, but entry type is: %s", |
| Preconditions.checkArgument(newSplitOffsets != null, "Invalid split offsets: null"); | ||
| Preconditions.checkArgument( | ||
| !isLeafManifest(contentType), | ||
| "Split offsets can not be added to manifest entries, but entry type is: %s", |
There was a problem hiding this comment.
| "Split offsets can not be added to manifest entries, but entry type is: %s", | |
| "Split offsets cannot be added to manifest entries, but entry type is: %s", |
| this.sortOrderId = sortOrderId; | ||
| this.deletionVector = deletionVector; | ||
| this.manifestInfo = manifestInfo; | ||
| this.keyMetadata = keyMetadata != null ? ByteBuffers.toByteArray(keyMetadata) : null; |
There was a problem hiding this comment.
ByteBuffers.toByteArray() and ArrayUtil.to*Array() are null safe. So it's ok to drop the ternary. e.g. just do:
this.keyMetadata = ByteBuffers.toByteArray(keyMetadata);
| } | ||
|
|
||
| /** | ||
| * This method is for signalling that a new deletion vector is added. It is forbidden to add the |
There was a problem hiding this comment.
You don't need to document the invariant in the comments. I would recommend just removing the javadoc for this method. It's the only method with a javadoc anyway.
There was a problem hiding this comment.
The suggestion for the doc comment came from @stevenzwu here. I think this function is easy to read and interpret as is now, we can drop the comment. Let me know if you think otherwise @stevenzwu !
| fileSizeInBytes != null, "Missing required field: file size in bytes"); | ||
| Preconditions.checkArgument(partitionData != null, "Missing required field: partition data"); | ||
| Preconditions.checkArgument( | ||
| (contentType != FileContent.DATA_MANIFEST && contentType != FileContent.DELETE_MANIFEST) |
There was a problem hiding this comment.
nit: this check escaped the isLeafManifest consolidation — could be !isLeafManifest(contentType) || manifestInfo != null.
There was a problem hiding this comment.
Indeed, thanks for spotting! Fixed
bf3dfe7 to
05fec44
Compare
05fec44 to
aa0ec94
Compare
|
I merged this PR to unblock other PRs. We can follow up separately if there are more comments. |
|
Thank you for the review @stevenzwu @anoopj @amogh-jahagirdar ! |
| } | ||
|
|
||
| @Test | ||
| void testDvEquality() { |
There was a problem hiding this comment.
Style: Iceberg keeps most acronyms upper case, so this should be DV instead of Dv.
There was a problem hiding this comment.
Thx! Dropped the test, because also dropped the equals override
| private final FileContent contentType; | ||
|
|
||
| // Required fields | ||
| private Integer writerFormatVersion = null; |
There was a problem hiding this comment.
Would this ever be null? I'm starting from the assumption that in v4 we will only write v4 metadata files. If that's true, then I think there are two cases when building a TrackedFile instance: either we are writing a v4 file or we are adapting an existing v1-3 file to be tracked in v4. The first case is easy: always set this to 4.
Does the second case go through this builder? I would expect to use a ManifestFile to TrackedFile adapter, like the adapters @anoopj introduced for TrackedFile to DataFile, DeleteFile, etc. That is a good pattern to follow because we never want a caller that is moving tracked files around to have the logic for migrating from v3 to v4.
There was a problem hiding this comment.
yes, I have a follow-up PR (draft status now) to adapt v3 shapes (like ManifestFile and DataFile) to v4 TrackedFile. It would be good to use builder for those adaptions too, unless you are thinking about using the constructor directly.
Regarding the null or not, current proposal/spec PR said it is not null (0 for pre-v4 and format versions for v4+)
There was a problem hiding this comment.
Thinking about this more, would we use adapters to wrap DataFile and DeleteFile anyway? Those are what our writers produce already and I doubt that we will want to go through and update all the plumbing to produce TrackedFile directly. The utility of this class may be very limited.
I could see us producing instances of TrackedFile by reading and wrapping v3 DataFile instances, or wrapping a new DataFile instance for v4. We should verify there are cases where we would use this.
There was a problem hiding this comment.
Apparently, as Steven noted, we could use the TrackedFileBuilder for the wrappers around V3 metadata structs, and there we could use this writer_format_version field through the builder. See the draft PR linked by Steven and let me know if it makes sense.
There was a problem hiding this comment.
I believe we concluded on (writer_)format_version related question on the other dedicated PR. Let me know if there is still something outstanding here!
Checking again Steven's PR on the wrappers, I think it makes sense to set this through the builder.
| private FileFormat fileFormat = null; | ||
| private Long recordCount = null; | ||
| private Long fileSizeInBytes = null; | ||
| private PartitionData partitionData = null; |
There was a problem hiding this comment.
PartitionData should be StructLike. We don't want to rely on concrete classes here.
Also, this is in the "required fields" section, but I think that we want to allow this to be null for unpartitioned cases. The partition tuple will be null when specId is null or an unpartitioned spec, because the actual partition type that we store is the union schema of all valid partition specs in the table. And support for empty struct types and instances is unreliable.
| source.writerFormatVersion(), | ||
| source.location(), | ||
| source.fileFormat(), | ||
| (PartitionData) source.partition(), |
There was a problem hiding this comment.
We should not need unchecked casts like this.
There was a problem hiding this comment.
The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe
| Preconditions.checkArgument(source != null, "Invalid source: null"); | ||
| Preconditions.checkArgument( | ||
| !isLeafManifest(source.contentType()), | ||
| "Manifest entries cannot transition to REPLACED, but entry type is: %s", |
There was a problem hiding this comment.
This is less direct than we typically prefer, but is not bad. This states a general rule and then says that it applies. We would normally say that at once: "Cannot transition %s to REPLACED".
There was a problem hiding this comment.
Thx! I think we can even remove the content type parameter of the message:
Cannot transition manifest files to REPLACED
| TrackedFileBuilder sortOrderId(int newSortOrderId) { | ||
| Preconditions.checkArgument( | ||
| !isLeafManifest(contentType), | ||
| "Sort order ID cannot be added to manifest entries, but entry type is: %s", |
There was a problem hiding this comment.
"Cannot set sort order for manifest files"?
| Preconditions.checkArgument(newDeletionVector != null, "Invalid deletion vector: null"); | ||
| Preconditions.checkArgument( | ||
| contentType == FileContent.DATA, | ||
| "Deletion vector can only be added to DATA entries, but entry type is: %s", |
There was a problem hiding this comment.
"Cannot add deletion vector for file with content: %s"
There was a problem hiding this comment.
Fixed, and found couple other similar messages, fixed them too.
| "Deletion vector can only be added to DATA entries, but entry type is: %s", | ||
| contentType); | ||
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), |
There was a problem hiding this comment.
Style: this. is used when assigning to a field, but not when accessing a field.
| contentType); | ||
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); |
There was a problem hiding this comment.
The check that I would expect is deletionVector == null || newDeletionVector.cardinality() > deletionVector.cardinality(). That ensures that the deletion vector was updated, without relying on an equality check. Otherwise, bad data could still make it through just by using a different concrete class (which is not equal).
The cardinality check is closer to what we want although not foolproof. The underlying location and offset could be the same, although it is unlikely that would be the case. If you want to make sure that this is actually a different deletion vector, then I think the right thing to do is to do that check here:
Preconditions.checkArgument(
deletionVector == null ||
deletionVector.cardinality() < newDeletionVector.cardinality(),
"Invalid DV update, cardinality must increase: existing=%s, new=%s", ...);
Preconditions.checkArgument(
deletionVector == null ||
!deletionVector.location().equals(newDeletionVector.location()) ||
!deletionVector.offset().equals(newDeletionVector.offset()),
"Invalid DV update: same offset and location has changed cardinality");There was a problem hiding this comment.
Thanks, borrowed these checks!
| return this; | ||
| } | ||
|
|
||
| TrackedFileBuilder deletedPositions(ByteBuffer newDeletedPositions) { |
There was a problem hiding this comment.
I'm not against this since I haven't seen how it is used yet. But I find these methods that capture data that will be passed through to Tracking suspicious. Why not pass in a new Tracking and not mirror its builder's API here? Also, if this is the pattern we want, then why not wrap a TrackingBuilder and pass this directly?
There was a problem hiding this comment.
Felt a bit odd for me when I implemented this but went this way for two reasons:
- I tried to avoid allow passing
Trackingin theTrackedFileBuilderfor the users because similar reasons we don't allow passing status forTrackingBuilder: seems to represent an internal state and users might get it wrong to construct. We could guard against malformedTrackingobjects with checks, but seemed that we already have everything to construct it internally and not push one more responsibility to the users of the builder. TrackingBuilderitself can't check for all of the constraints. For instance it doesn't know if it belongs to a data file or a delete manifest, hence can't verify that deleted/replaced positions are only allowed for manifests. We can perform such a check here, in theTrackedFileBuilder.
Introduced a TrackingBuilder member to the TrackedFileBuilder and we can delegate all these calls right away without keeping track the required information.
| Preconditions.checkArgument(recordCount != null, "Missing required field: record count"); | ||
| Preconditions.checkArgument( | ||
| fileSizeInBytes != null, "Missing required field: file size in bytes"); | ||
| Preconditions.checkArgument(partitionData != null, "Missing required field: partition data"); |
There was a problem hiding this comment.
I don't think this is correct.
There was a problem hiding this comment.
Thx!
I change partition not to be required through the builder
| return this; | ||
| } | ||
|
|
||
| TrackedFileBuilder partition(PartitionData newPartitionData) { |
There was a problem hiding this comment.
I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type.
There was a problem hiding this comment.
I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?
| file.set(SORT_ORDER_ID_ORDINAL, 3); | ||
| file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3})); | ||
| file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L, 100L)); | ||
| TrackedFile file = |
There was a problem hiding this comment.
What is the use of going through the builder for tests? I think that this should simply use the constructor and pass all values. There's an argument for using this to make the test cases more clear (this is certainly readable) but it also doesn't allow us to construct some instances that we may want to test because they violate rules that the builder enforces.
There was a problem hiding this comment.
I wanted to construct these objects in a "proper" way, however for these tests it doesn't really matter. Went back to using the constructor instead.
Also revived the createTracking() function that we had earlier, including reviving 2 Tracking ordinals, because using the TrackedFileStruct contractor makes those changes independent from this PR.
| TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of()); | ||
| private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type"); | ||
| private static final int SPEC_ID_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "spec_id"); | ||
| private static final int CONTENT_STATS_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_stats"); |
There was a problem hiding this comment.
Why doesn't this fix the use of SPEC_ID_ORDINAL if it is updating these tests?
There was a problem hiding this comment.
2 tests still used it. Got rid of those usages too.
Also got rid of the DELETION_VECTOR_ORDINAL below
| // TrackedFile optional field ordinals, looked up from the schema. | ||
| private static final Types.StructType TRACKED_FILE_SCHEMA = | ||
| TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of()); | ||
| private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type"); |
There was a problem hiding this comment.
Why does this introduce another ordinal? I think these are bad practice (these tests should not be using set/get) so I'm surprised to see that this eliminates some, but introduces others.
There was a problem hiding this comment.
See my answer below.
Removed
| NO_PARTITION, | ||
| 1L, | ||
| 1L); | ||
| private static TrackedFileStruct dummyTrackedFile(FileContent contentType) { |
There was a problem hiding this comment.
Please avoid unnecessary renames. This should be reverted.
There was a problem hiding this comment.
I wanted to warn future readers/editor of this file that the TrackedFile returned here might have broken internal invariants because of going through the constructor. (it does, it has no Tracking for instance)
Reverted the rename
| 1L); | ||
| private static TrackedFileStruct dummyTrackedFile(FileContent contentType) { | ||
| TrackedFileStruct file = new TrackedFileStruct(); | ||
| file.set(CONTENT_TYPE_ORDINAL, contentType.id()); |
There was a problem hiding this comment.
This introduces a bad practice, using the ordinals and the set methods. I'm surprised to see this change in the addition of builders. Instead, just update the usage of the constructor.
There was a problem hiding this comment.
I wanted to find the easiest way to construct TrackedFile with the desired content type, where none go the other members matter.
Got rid of this new ordinal and changed to use the constructor.
| file.set(SORT_ORDER_ID_ORDINAL, 3); | ||
| file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3})); | ||
| file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L, 100L)); | ||
| TrackedFile file = |
There was a problem hiding this comment.
I wanted to construct these objects in a "proper" way, however for these tests it doesn't really matter. Went back to using the constructor instead.
Also revived the createTracking() function that we had earlier, including reviving 2 Tracking ordinals, because using the TrackedFileStruct contractor makes those changes independent from this PR.
| 1L); | ||
| private static TrackedFileStruct dummyTrackedFile(FileContent contentType) { | ||
| TrackedFileStruct file = new TrackedFileStruct(); | ||
| file.set(CONTENT_TYPE_ORDINAL, contentType.id()); |
There was a problem hiding this comment.
I wanted to find the easiest way to construct TrackedFile with the desired content type, where none go the other members matter.
Got rid of this new ordinal and changed to use the constructor.
| NO_PARTITION, | ||
| 1L, | ||
| 1L); | ||
| private static TrackedFileStruct dummyTrackedFile(FileContent contentType) { |
There was a problem hiding this comment.
I wanted to warn future readers/editor of this file that the TrackedFile returned here might have broken internal invariants because of going through the constructor. (it does, it has no Tracking for instance)
Reverted the rename
| TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of()); | ||
| private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type"); | ||
| private static final int SPEC_ID_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "spec_id"); | ||
| private static final int CONTENT_STATS_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_stats"); |
There was a problem hiding this comment.
2 tests still used it. Got rid of those usages too.
Also got rid of the DELETION_VECTOR_ORDINAL below
| // TrackedFile optional field ordinals, looked up from the schema. | ||
| private static final Types.StructType TRACKED_FILE_SCHEMA = | ||
| TrackedFile.schemaWithContentStats(Types.StructType.of(), Types.StructType.of()); | ||
| private static final int CONTENT_TYPE_ORDINAL = ordinalOf(TRACKED_FILE_SCHEMA, "content_type"); |
There was a problem hiding this comment.
See my answer below.
Removed
| Preconditions.checkArgument(source != null, "Invalid source: null"); | ||
| Preconditions.checkArgument( | ||
| !isLeafManifest(source.contentType()), | ||
| "Manifest entries cannot transition to REPLACED, but entry type is: %s", |
There was a problem hiding this comment.
Thx! I think we can even remove the content type parameter of the message:
Cannot transition manifest files to REPLACED
| private final FileContent contentType; | ||
|
|
||
| // Required fields | ||
| private Integer writerFormatVersion = null; |
There was a problem hiding this comment.
I believe we concluded on (writer_)format_version related question on the other dedicated PR. Let me know if there is still something outstanding here!
Checking again Steven's PR on the wrappers, I think it makes sense to set this through the builder.
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object other) { |
There was a problem hiding this comment.
I dropped these overrides here, and went for the verification suggested by Ryan within the builder class.
| return this; | ||
| } | ||
|
|
||
| TrackedFileBuilder partition(PartitionData newPartitionData) { |
There was a problem hiding this comment.
I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?
| source.writerFormatVersion(), | ||
| source.location(), | ||
| source.fileFormat(), | ||
| (PartitionData) source.partition(), |
There was a problem hiding this comment.
The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe
No description provided.