Core: Adjustments for TrackedFileBuilder#16964
Conversation
gaborkaszab
left a comment
There was a problem hiding this comment.
Carried over the unclosed comments from the original PR + some random extra explanation
|
|
||
| // optional fields | ||
| private Integer specId = null; | ||
| private PartitionData partition = null; |
There was a problem hiding this comment.
carry-over comment: "PartitionData should be StructLike. We don't want to rely on concrete classes here."
TrackedFileStruct constructor seem to accept PartitionData so we have to do the conversion at some point. Seemed straightforward to keep the same type in the builder as expected by the constructed class.
| TrackedFileBuilder partition(PartitionData newPartitionData) { | ||
| Preconditions.checkArgument(newPartitionData != null, "Invalid partition: null"); | ||
| this.partitionData = newPartitionData; | ||
| TrackedFileBuilder partition(PartitionSpec spec, StructLike newPartition) { |
There was a problem hiding this comment.
carry-over comment: "I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type."
carry-over answer: "I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?"
| this.recordCount = source.recordCount(); | ||
| this.fileSizeInBytes = source.fileSizeInBytes(); | ||
| this.partitionData = (PartitionData) source.partition(); | ||
| this.partition = (PartitionData) source.partition(); |
There was a problem hiding this comment.
carry-over comment: "We should not need unchecked casts like this."
carry-over answer: "
The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe"
There was a problem hiding this comment.
Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:
PartitionData partition = (PartitionData) file.partition();We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLike — ContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.
| private static final PartitionData PARTITION = partition("books"); | ||
|
|
||
| // Tracking field ordinals, looked up from the schema so the tests do not hard-code offsets. | ||
| private static final int STATUS_ORDINAL = ordinalOf(Tracking.schema(), "status"); |
There was a problem hiding this comment.
I reverted the part of the test that constructs Tracking to the state before the builder, because that seemed independent of this change. Hence these show as appeared, but in fact restored
| this.recordCount = source.recordCount(); | ||
| this.fileSizeInBytes = source.fileSizeInBytes(); | ||
| this.partitionData = (PartitionData) source.partition(); | ||
| this.partition = (PartitionData) source.partition(); |
There was a problem hiding this comment.
Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:
PartitionData partition = (PartitionData) file.partition();We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLike — ContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
I don't believe the spec requiring increasing cardinality. A writer could reserialize the DVs into a new puffin blob. Should we allow equal cardinality?
There was a problem hiding this comment.
this is a very interesting scenario, like Puffin file compaction. It may not be useful to compact DV Puffin files due to the access pattern to DV blobs.
There was a problem hiding this comment.
Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:
- if DV cardinality doesn't increase than we probably add the same DV again
- if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.
I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?
There was a problem hiding this comment.
+1 to keep the stricter validation for now and relax in the future if needed.
| deletionVector == null | ||
| || !deletionVector.location().equals(newDeletionVector.location()) | ||
| || deletionVector.offset() != newDeletionVector.offset(), | ||
| "Invalid DV update: same location and offset has changed cardinality"); |
There was a problem hiding this comment.
This is moot if we allow equal cardinality, but the error message part about cardinality is dependent on prior precondition check. It can be a bit confusing.
There was a problem hiding this comment.
I agree. Simplified the message here, and don't mention cardinality
| if (spec.isUnpartitioned()) { | ||
| Preconditions.checkArgument( | ||
| newPartition == null, "Invalid partition: must be null for unpartitioned spec"); | ||
| this.specId = spec.specId(); |
There was a problem hiding this comment.
Minor: this is in both branches of the if. Hoist it out?
There was a problem hiding this comment.
thx, was going to, but then I forgot :)
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), | ||
| "Invalid DV update, cardinality must increase: existing=%s, new=%s", | ||
| deletionVector == null ? null : deletionVector.cardinality(), |
There was a problem hiding this comment.
| deletionVector == null ? null : deletionVector.cardinality(), | |
| deletionVector.cardinality(), |
There was a problem hiding this comment.
ternary is not needed because your precondition check already validates null
There was a problem hiding this comment.
That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.
| class TrackedFileBuilder { | ||
| private final long snapshotId; | ||
| private final FileContent contentType; | ||
| private final TrackingBuilder trackingBuilder; |
There was a problem hiding this comment.
I can align my PR with this new shape to avoid the individual setters that you commented.
#16936
There was a problem hiding this comment.
technically, nothing changes from the user's perspective, it's just how we create Tracking internally.
0514210 to
390dc8c
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for taking a look, @stevenzwu @anoopj !
One open question remained from this set of comments: Do we want to allow came dv cardinality for DV compaction?
| class TrackedFileBuilder { | ||
| private final long snapshotId; | ||
| private final FileContent contentType; | ||
| private final TrackingBuilder trackingBuilder; |
There was a problem hiding this comment.
technically, nothing changes from the user's perspective, it's just how we create Tracking internally.
| if (spec.isUnpartitioned()) { | ||
| Preconditions.checkArgument( | ||
| newPartition == null, "Invalid partition: must be null for unpartitioned spec"); | ||
| this.specId = spec.specId(); |
There was a problem hiding this comment.
thx, was going to, but then I forgot :)
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:
- if DV cardinality doesn't increase than we probably add the same DV again
- if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.
I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), | ||
| "Invalid DV update, cardinality must increase: existing=%s, new=%s", | ||
| deletionVector == null ? null : deletionVector.cardinality(), |
There was a problem hiding this comment.
That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.
| deletionVector == null | ||
| || !deletionVector.location().equals(newDeletionVector.location()) | ||
| || deletionVector.offset() != newDeletionVector.offset(), | ||
| "Invalid DV update: same location and offset has changed cardinality"); |
There was a problem hiding this comment.
I agree. Simplified the message here, and don't mention cardinality
390dc8c to
116b838
Compare
| !isLeafManifest(source.contentType()), | ||
| "Manifest entries cannot transition to REPLACED, but entry type is: %s", | ||
| source.contentType()); | ||
| !isLeafManifest(source.contentType()), "Cannot transition manifest files to REPLACED"); |
There was a problem hiding this comment.
| !isLeafManifest(source.contentType()), "Cannot transition manifest files to REPLACED"); | |
| !isLeafManifest(source.contentType()), "Cannot transition manifest entries to REPLACED status"); |
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
+1 to keep the stricter validation for now and relax in the future if needed.
Original PR: #16769
This one is a follow-up with adjustments addressing further comments.