Skip to content

feat(update): add MergingSnapshotUpdate#682

Open
gty404 wants to merge 16 commits into
apache:mainfrom
gty404:merging-snapshot-update
Open

feat(update): add MergingSnapshotUpdate#682
gty404 wants to merge 16 commits into
apache:mainfrom
gty404:merging-snapshot-update

Conversation

@gty404
Copy link
Copy Markdown
Contributor

@gty404 gty404 commented May 26, 2026

Abstract base for merge-based snapshot operations (MergeAppend, OverwriteFiles, RowDelta, etc.), implementing the filter → write → merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a use-after-free bug in SnapshotUpdate::DeleteFile.

@gty404 gty404 force-pushed the merging-snapshot-update branch 6 times, most recently from e950bed to 0eebdbf Compare May 26, 2026 12:39
Abstract base for merge-based snapshot operations (MergeAppend,
OverwriteFiles, RowDelta, etc.), implementing the filter → write →
merge pipeline consistent with Java's MergingSnapshotProducer.

Also fixes SnapshotSummaryBuilder manifest count fields and a
use-after-free bug in SnapshotUpdate::DeleteFile.
@gty404 gty404 force-pushed the merging-snapshot-update branch from 0eebdbf to 3085e61 Compare May 26, 2026 15:40
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/manifest/manifest_filter_manager.cc Outdated
Comment thread src/iceberg/manifest/manifest_merge_manager.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/manifest/manifest_filter_manager.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/manifest/manifest_filter_manager.h Outdated
Comment thread src/iceberg/test/merging_snapshot_update_test.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc Outdated
Comment thread src/iceberg/update/merging_snapshot_update.cc
Comment thread src/iceberg/util/snapshot_util.cc Outdated
const auto& metadata = table.metadata();
auto it = metadata->refs.find(ref);
if (it == metadata->refs.cend() || it->second->type() == SnapshotRefType::kBranch) {
if (it == metadata->refs.cend()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep Java parity here: branches should return the current table schema, and only tags should use the referenced snapshot schema. Please keep the branch check in both SchemaFor(ref) overloads.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I’ll restore Java parity here by returning the current table schema for branch refs and only resolving the referenced snapshot schema for tags, in both SchemaFor(ref) overloads. I’ll also update the related docs/tests if needed

/// \brief Return an error if any snapshot in [starting_snapshot_id+1, parent]
/// added a data file matching the given filter expression.
static Status ValidateAddedDataFiles(const TableMetadata& metadata,
int64_t starting_snapshot_id,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java allows startingSnapshotId to be null and validates back to the first snapshot. Please make this optional through these validation helpers so default RowDelta/Overwrite/Rewrite validation matches Java.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ll make starting_snapshot_id optional across these validation helpers and update the history traversal so a missing starting snapshot validates back to the beginning of the snapshot lineage, matching Java’s default validation behavior

}

// -------------------------------------------------------------------------
// ValidateNewDeleteFile format version tests
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Java-parity tests for the v3 gates: reject non-DV position deletes on v3, allow DVs on v3, and reject a v2-staged position delete after upgrading to v3 before commit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The production validation path already rechecks staged delete files during Apply, but the v3 gate coverage is incomplete. I’ll add Java-parity tests for rejecting non-DV position deletes on v3, allowing DVs on v3, and rejecting a v2-staged position delete after upgrading to v3 before commit

Comment thread src/iceberg/update/snapshot_update.cc Outdated

const auto& metadata = table.metadata();
auto it = metadata->refs.find(ref);
if (it == metadata->refs.cend() || it->second->type() == SnapshotRefType::kBranch) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why here and below are changed? I checked that they still exist in the Java implementation.

Comment thread src/iceberg/snapshot.h
int32_t max_changed_partitions_for_summaries_{0};
int64_t deleted_duplicate_files_{0};
bool trust_partition_metrics_{true};
bool manifests_counts_set_{false};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java does not populate these fields in the summary builder and instead directly call Set(const std::string& property, const std::string& value) to set these values. Should we follow a similar pattern as these fields are not commonly used?

Comment thread src/iceberg/snapshot.cc
}

deleted_duplicate_files_ += other.deleted_duplicate_files_;
// Manifest counts (manifests_counts_set_ / manifests_created_ / manifests_kept_ /
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove these fields, then these comments should be removed too.

Status Finalize(Result<const TableMetadata*> commit_result) override;

protected:
struct DeleteManifestEntry {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a dedicated DataFileWrapper structure to equip it with optional fields like data_sequence_number and file_sequence_number? It can be relocated to manifest_entry.h as DeleteManifestEntry looks confusing.

bool fail_any_delete_{false};
bool case_sensitive_{true};
int32_t duplicate_deletes_count_{0};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change


std::shared_ptr<Expression> delete_expr_;
std::unordered_set<std::string> delete_paths_;
std::unordered_set<DeleteFileKey, DeleteFileKeyHash> delete_file_keys_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using DataFileSet? Do content_offset and content_size_in_bytes really matter in this case? I don't see Java ManifestFilterManager has this special handling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants