feat(update): add MergingSnapshotUpdate#682
Conversation
e950bed to
0eebdbf
Compare
Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer. Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.
0eebdbf to
3085e61
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| const auto& metadata = table.metadata(); | ||
| auto it = metadata->refs.find(ref); | ||
| if (it == metadata->refs.cend() || it->second->type() == SnapshotRefType::kBranch) { | ||
| if (it == metadata->refs.cend()) { |
There was a problem hiding this comment.
Keep Java parity here: branches should return the current table schema, and only tags should use the referenced snapshot schema. Please keep the branch check in both SchemaFor(ref) overloads.
There was a problem hiding this comment.
Agreed. I’ll restore Java parity here by returning the current table schema for branch refs and only resolving the referenced snapshot schema for tags, in both SchemaFor(ref) overloads. I’ll also update the related docs/tests if needed
| /// \brief Return an error if any snapshot in [starting_snapshot_id+1, parent] | ||
| /// added a data file matching the given filter expression. | ||
| static Status ValidateAddedDataFiles(const TableMetadata& metadata, | ||
| int64_t starting_snapshot_id, |
There was a problem hiding this comment.
Java allows startingSnapshotId to be null and validates back to the first snapshot. Please make this optional through these validation helpers so default RowDelta/Overwrite/Rewrite validation matches Java.
There was a problem hiding this comment.
I’ll make starting_snapshot_id optional across these validation helpers and update the history traversal so a missing starting snapshot validates back to the beginning of the snapshot lineage, matching Java’s default validation behavior
| } | ||
|
|
||
| // ------------------------------------------------------------------------- | ||
| // ValidateNewDeleteFile format version tests |
There was a problem hiding this comment.
Please add Java-parity tests for the v3 gates: reject non-DV position deletes on v3, allow DVs on v3, and reject a v2-staged position delete after upgrading to v3 before commit.
There was a problem hiding this comment.
The production validation path already rechecks staged delete files during Apply, but the v3 gate coverage is incomplete. I’ll add Java-parity tests for rejecting non-DV position deletes on v3, allowing DVs on v3, and rejecting a v2-staged position delete after upgrading to v3 before commit
|
|
||
| const auto& metadata = table.metadata(); | ||
| auto it = metadata->refs.find(ref); | ||
| if (it == metadata->refs.cend() || it->second->type() == SnapshotRefType::kBranch) { |
There was a problem hiding this comment.
Why here and below are changed? I checked that they still exist in the Java implementation.
| int32_t max_changed_partitions_for_summaries_{0}; | ||
| int64_t deleted_duplicate_files_{0}; | ||
| bool trust_partition_metrics_{true}; | ||
| bool manifests_counts_set_{false}; |
There was a problem hiding this comment.
Java does not populate these fields in the summary builder and instead directly call Set(const std::string& property, const std::string& value) to set these values. Should we follow a similar pattern as these fields are not commonly used?
| } | ||
|
|
||
| deleted_duplicate_files_ += other.deleted_duplicate_files_; | ||
| // Manifest counts (manifests_counts_set_ / manifests_created_ / manifests_kept_ / |
There was a problem hiding this comment.
If we remove these fields, then these comments should be removed too.
| Status Finalize(Result<const TableMetadata*> commit_result) override; | ||
|
|
||
| protected: | ||
| struct DeleteManifestEntry { |
There was a problem hiding this comment.
Do we want to add a dedicated DataFileWrapper structure to equip it with optional fields like data_sequence_number and file_sequence_number? It can be relocated to manifest_entry.h as DeleteManifestEntry looks confusing.
| bool fail_any_delete_{false}; | ||
| bool case_sensitive_{true}; | ||
| int32_t duplicate_deletes_count_{0}; | ||
|
|
|
|
||
| std::shared_ptr<Expression> delete_expr_; | ||
| std::unordered_set<std::string> delete_paths_; | ||
| std::unordered_set<DeleteFileKey, DeleteFileKeyHash> delete_file_keys_; |
There was a problem hiding this comment.
Why not using DataFileSet? Do content_offset and content_size_in_bytes really matter in this case? I don't see Java ManifestFilterManager has this special handling.
Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer.
Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.