Core: Basic fields and schemas for column files#16285
Conversation
|
First piece of the column update work: introducing the basic interface of the column updates files, aka column files |
630b00e to
ca3259e
Compare
ca3259e to
e6f7cf6
Compare
681633b to
813d5c0
Compare
|
I opened a thread on dev@ to discuss the metadata structs for column files. Once that's finalized, I'll incorporate the changes here. |
596f6a4 to
6a1cbe9
Compare
6a1cbe9 to
c683e72
Compare
| Tracking.FIRST_ROW_ID, | ||
| Tracking.DELETED_POSITIONS, | ||
| Tracking.REPLACED_POSITIONS, | ||
| Tracking.LATEST_COLUMN_FILE_SNAPSHOT_ID, |
There was a problem hiding this comment.
I believe the expectation is to have the physically persisted schema fields first and the manifest position after. Hence I placed the new field before ROW_POSITION Note, this changes the ordinal of the existing manifest pos field, however, since this is under development and there are no already written data files out there, this seems fine.
The implementation of the new field in getByPos and internalSet are meant to be in the follow-up implementation PR.
c683e72 to
6222fad
Compare
|
Adjusted field IDs because |
6222fad to
5c04f55
Compare
5c04f55 to
b6ae446
Compare
| this.status = EntryStatus.MODIFIED; | ||
| } | ||
| // Bumping 'dataSequenceNumber' to avoid having both equality deletes and column files. | ||
| this.dataSequenceNumber = null; |
There was a problem hiding this comment.
We discussed bumping the data sequence number when adding column files. We haven't mentioned file seq num, so I'm not bumping it here.
This works if the manifest owning this data file entry bumps its own seq num when adding column files. Let me know if there is any other way achieving this.
0a252e8 to
144cd39
Compare
| deletedPositions == null && replacedPositions == null, | ||
| "Cannot mark column files updated on a manifest entry (deleted/replaced positions are set)"); | ||
| this.latestColumnFileSnapshotId = newSnapshotId; | ||
| if (status == EntryStatus.EXISTING) { |
There was a problem hiding this comment.
Should we add preconditions here to check status should never be DELETED or REPLACED?
There was a problem hiding this comment.
This method is inline with the similar dvUpdated. Since DELETED and REPLACED aren't constructed through the builder, guarding against those statuses would be just noise here IMO
| : null; | ||
| this.columnFiles = | ||
| toCopy.columnFiles != null | ||
| ? toCopy.columnFiles.stream().map(ColumnFile::copy).collect(Collectors.toList()) |
There was a problem hiding this comment.
This can NPE in .map(ColumnFile::copy), do we ensure that it's non null?
There was a problem hiding this comment.
You mean columnFiles might contain a null value? Currently, you can pass whatever content for columnFiles through the constructor, you now technically there can be null too. In the long run the expectation is to build this class through its builder (PR) where we can prevent adding null to the list.
I'm not concerned about this, WDYT?
There was a problem hiding this comment.
I implemented a null-safe version of the copy and added test coverage, just to be on the safe side. Let me know what you think.
| @Override | ||
| public String toString() { | ||
| return MoreObjects.toStringHelper(this) | ||
| .add("field_ids", fieldIds) |
There was a problem hiding this comment.
Nit: Does this print the list object instead of values?
There was a problem hiding this comment.
This does print the values in the list. Just checked the output:
ColumnFileStruct{field_ids=[1, 2, 3], location=s3://bucket/data/column.parquet, file_size_in_bytes=1024}
I was hesitating to add test coverage for this, but I haven't seen anywhere else testing the output of a toSting function.
There was a problem hiding this comment.
ToStringHelper() should handle arrays just fine. No need to add a test for this.
35c5fb4 to
8df135e
Compare
8df135e to
e1a430d
Compare
| this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value); | ||
| break; | ||
| case 16: | ||
| this.columnFiles = copyColumnFiles((List<ColumnFile>) value); |
There was a problem hiding this comment.
value can be null so we can NPE here. Above here the ArrayUtil method is guarding against the null input.
We can either cover the null here or in copyCOlumnFiles
There was a problem hiding this comment.
On a broader point, do we really need to copy here? The List isn't a re-usuable container so why do we need to deep copy it? For example why isn't this just like DeletionVector or ManifestInfo?
There was a problem hiding this comment.
You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.
e1a430d to
c3205ae
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for taking a look, @RussellSpitzer !
I removed the unwanted copy. Also did a rebase because there was a conflict with the new switch style PR.
| this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value); | ||
| break; | ||
| case 16: | ||
| this.columnFiles = copyColumnFiles((List<ColumnFile>) value); |
There was a problem hiding this comment.
You're right! I was overly cautious here, no copy is needed.
Removed the copy, just cast value to List<ColumnFile> for assignment.
This change introduces the bases structs for column files and also integrates them to the schema for TrackedFile and Tracking.
c3205ae to
e235fc0
Compare
| Preconditions.checkArgument(!newColumnFiles.isEmpty(), "Invalid column files: empty"); | ||
| Preconditions.checkArgument( | ||
| contentType == FileContent.DATA || contentType == FileContent.DATA_MANIFEST, | ||
| "Column files can only be set for DATA or DATA_MANIFEST entries, but entry type is %s", |
There was a problem hiding this comment.
Why do we allow column files for DATA_MANIFEST entries? Is this for metadata updates? (override DV column?)
| import org.apache.iceberg.types.Types; | ||
|
|
||
| /** Information about a column file. */ | ||
| interface ColumnFile { |
There was a problem hiding this comment.
The Efficient Column Updates proposal had sequence_number at the column file level. Is that stale? ie are we dropping per-file granularity?
| assertThat(copy.fileSizeInBytes()).isEqualTo(2048L); | ||
|
|
||
| // verify deep copy | ||
| assertThat(copy.fieldIds()).isNotSameAs(columnFile.fieldIds()); |
There was a problem hiding this comment.
This will always pass because the fieldIds() wrap a Collections.umodifiableList() which will always be different. I think the only way to test deep copy is to actually mutate the field IDs in the source and verify that the values don't change in the copy.
| @Override | ||
| public String toString() { | ||
| return MoreObjects.toStringHelper(this) | ||
| .add("field_ids", fieldIds) |
There was a problem hiding this comment.
ToStringHelper() should handle arrays just fine. No need to add a test for this.
This change introduces the interface for column files and also integrates it to the schema for TrackedFile.