Core: Add ManifestFile adapter for v4 tracked files#16867
Conversation
b50410f to
718ae34
Compare
Add TrackedFileAdapters.asManifestFile to wrap a DATA_MANIFEST or DELETE_MANIFEST TrackedFile as a ManifestFile. Closes apache#16227
718ae34 to
93368cb
Compare
wombatu-kun
left a comment
There was a problem hiding this comment.
Code looks correct: field delegation, equals/hashCode (consistent with GenericManifestFile), constructor preconditions, and the test coverage are all solid.
Non-blocking discussion point: partitionSpecId() throws UnsupportedOperationException. That is defensible since a v4 manifest is not bound to a single spec. But @rdblue pushed back on #16100 against these adapters exposing an invalid or arbitrary spec ID, so it may be worth confirming with reviewers that a throw is the intended contract for any future caller that reads partitionSpecId().
| NO_PARTITION, | ||
| 8L, | ||
| 2048L); | ||
| file.set(MANIFEST_INFO_ORDINAL, manifestInfo); |
There was a problem hiding this comment.
we can use the builder from @gaborkaszab 's PR: #16769
There was a problem hiding this comment.
Sounds good. I can make the change if #16769 gets merged first or do this as a followup PR if this gets merged first.
There was a problem hiding this comment.
FYI, in the meantime the builder got merged
gaborkaszab
left a comment
There was a problem hiding this comment.
Thank you for the PR, @anoopj !
I took a Quick Look and left some minor comment.
|
|
||
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); |
There was a problem hiding this comment.
nit: other adapters don't check if tracking is null. Not sure we have to here. It's a required field anyway
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
As per the spec proposal data seq num is 'Optional for leaf manifests, required for root.'. I think it's valid to have it as null here. Seems also optional in ManifestFile
There was a problem hiding this comment.
This PR is for root-level manifest tracking where null is truly an invalid state. We can either validate or just have it fail with an NPE.
Seems also optional in ManifestFile
It is required for a manifest file: ManifestFile.sequenceNumber() returns a primitive
There was a problem hiding this comment.
just to add on top of what Anoop said. this is for leaf manifest file entry (TrackedFile) in the root manifest file. Hence dataSequenceNumber should be required
There was a problem hiding this comment.
Thanks for the explanation @anoopj and @stevenzwu !
| file.tracking().snapshotId() != null, "Cannot create manifest file: no snapshot ID"); | ||
| Preconditions.checkArgument( | ||
| file.manifestInfo() != null, "Cannot create manifest file: no manifest info"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
This checks an internal invariant for a TrackedFile. As per the spec proposal a DELETE_MANIFEST shouldn't have firstRowId, so I'd assume that this is a contract to keep when we construct a TrackedFile and set this value. In fact this is check in the relevant PR
| } | ||
|
|
||
| @Override | ||
| public boolean equals(Object other) { |
There was a problem hiding this comment.
The other adapters don't override equals and hashCode. Is there a particular reason it's needed here?
There was a problem hiding this comment.
I'm trying to be consistent with the semantics of GenericManifestFile.
There was a problem hiding this comment.
Do we actually use equals and hashCode? We try to avoid implementing those methods where there are different interpretations of what "equality" might mean.
There was a problem hiding this comment.
I didn't add any code that uses it. I was trying to be consistent with the behavior of the GenericManifestFile which defines equality with the location. I have removed it.
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM.
I will merge Gabor's TrackedFileBuilder once CI passed. then let's rebase this one and use the builder as applicable.
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
just to add on top of what Anoop said. this is for leaf manifest file entry (TrackedFile) in the root manifest file. Hence dataSequenceNumber should be required
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking().dataSequenceNumber() != null, | ||
| "Cannot create manifest file: no data sequence number"); |
There was a problem hiding this comment.
Nit: I think we prefer the slightly more specific "Invalid data sequence number: null" because "no X" doesn't necessarily mean "x is null".
Also, even though the attempt is to create a manifest file, that's not the actual problem. It's debatable whether to include what is being attempted, "Cannot create manifest file", but I would opt to leave it out because that's clear from other context (that the check is in asManifestFile and the TrackedManifestFile constructor).
| @Override | ||
| public int partitionSpecId() { | ||
| throw new UnsupportedOperationException( | ||
| "Tracked manifests are not bound to a single partition spec"); |
There was a problem hiding this comment.
I don't think that it is correct to say "Tracked manifests": what is a "tracked" manifest and why is one from a manifest list not "tracked"?
I think it is better to say "v4 manifests are not bound to a single partition spec".
There was a problem hiding this comment.
Fixed. I was trying to avoid having too many v4 references.
| public ManifestContent content() { | ||
| return file.contentType() == FileContent.DATA_MANIFEST | ||
| ? ManifestContent.DATA | ||
| : ManifestContent.DELETES; |
There was a problem hiding this comment.
Minor: I think it is safer to avoid a dependency between this and the constructor's verification. I would recommend using a switch statement:
switch (file.contentType()) {
case DATA_MANIFEST:
return ManifestContent.DATA;
case DELETE_MANIFEST:
return ManifestContent.DELETES;
default:
throw new UnsupportedOperationException("Unsupported content type for manifests: " + file.contentType());
}| public boolean equals(Object other) { | ||
| if (this == other) { | ||
| return true; | ||
| } else if (!(other instanceof TrackedManifestFile)) { |
There was a problem hiding this comment.
I don't think that the implementation of GenericManifestFile is very helpful. It considers two files equal if they have the same path (which I think is reasonable) but it also excludes any implementation other than GenericManifestFile, which means that they aren't actually interchangeable as the interface suggests.
In addition, this creates a situation where a.equals(b) may be true, but b.equals(a) can be false for reasonable implementations.
I'd prefer to remove this and fix places where it is assumed or used. Good catch, @gaborkaszab!
| "Cannot create manifest file: no data sequence number"); | ||
| Preconditions.checkArgument( | ||
| file.tracking().snapshotId() != null, "Cannot create manifest file: no snapshot ID"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
This also seems an internal invariant for TrackedFile. The new TrackedFileBuilder guarantees that manifestInfo is set for manifest entries.
There was a problem hiding this comment.
That makes sense. Removed.
| NO_PARTITION, | ||
| 8L, | ||
| 2048L); | ||
| file.set(MANIFEST_INFO_ORDINAL, manifestInfo); |
There was a problem hiding this comment.
FYI, in the meantime the builder got merged
| private TrackedManifestFile(TrackedFile file) { | ||
| Preconditions.checkArgument( | ||
| file.tracking() != null, "Cannot create manifest file: no tracking"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Thanks for the explanation @anoopj and @stevenzwu !
| void testManifestFileAdapterDelegation() { | ||
| Tracking tracking = createTracking(); | ||
| ManifestInfo manifestInfo = createManifestInfo(); | ||
| TrackedFileStruct file = |
There was a problem hiding this comment.
nit: Can we hide the construction of the input TrackedFile into a separate function? Like dataManifestFile() it could get ManifestInfo, FileContent, and/or Tracking parameters if that matters for the test.
There was a problem hiding this comment.
Done. extracted manifestBuilder(FileContent) / manifestFile(FileContent, ManifestInfo) and now tests build through them.
| } | ||
|
|
||
| @Test | ||
| void testManifestFileAdapterDelegation() { |
There was a problem hiding this comment.
nit: maybe parameterize the test for data + delete manifest. The testDeleteManifestContentMapsToDeletes test could be dropped then
# Conflicts: # core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java
gaborkaszab
left a comment
There was a problem hiding this comment.
just some minor comments. Other than them, LGTM!
| .manifestInfo(manifestInfo) | ||
| .keyMetadata(ByteBuffer.wrap(new byte[] {7, 8, 9})) | ||
| .build(); | ||
| populateTrackingFields(file); |
There was a problem hiding this comment.
Just FYI: in the follow-up for the TrackedFileBuilder PR, we decided to drop populateTrackingFields and went to call the constructor of TrackedFile instead using the createTracking (revived in the follow-up PR) method.
There was a problem hiding this comment.
Ack. I can rebase when the followup PR is merged.
| manifestBuilder(FileContent.DATA_MANIFEST).manifestInfo(createManifestInfo()).build(); | ||
| // Inheritance fills the data sequence number, which the adapter requires; first row ID and | ||
| // key metadata stay unset. | ||
| ((TrackingStruct) file.tracking()).set(DATA_SEQUENCE_NUMBER_ORDINAL, DATA_SEQUENCE_NUMBER); |
There was a problem hiding this comment.
instead of using the ordinal, we can use the inheritFrom(some_manifest) method to do that.
There was a problem hiding this comment.
Routing through inheritFrom() will require building a parent Tzracking with several nulls. I prefer to keep it since this is simpler.
Add TrackedFileAdapters.asManifestFile to wrap a DATA_MANIFEST or DELETE_MANIFEST TrackedFile as a ManifestFile.
The ManifestFile adapter will only be used by tests. Note that ManifestFile doesn't support manifest DVs. So in the actual manifest read path, the reader resolves the MDV first and produces the resulting TrackedFiles, which are then adapted to data/delete files.
Closes #16227