Skip to content

Core: Fix concurrent mutation of deleteFiles during parallel manifest filtering#16984

Open
liucao-dd wants to merge 1 commit into
apache:mainfrom
liucao-dd:fix-concurrent-add-race-condition-ManifestFilterManager
Open

Core: Fix concurrent mutation of deleteFiles during parallel manifest filtering#16984
liucao-dd wants to merge 1 commit into
apache:mainfrom
liucao-dd:fix-concurrent-add-race-condition-ManifestFilterManager

Conversation

@liucao-dd

@liucao-dd liucao-dd commented Jun 28, 2026

Copy link
Copy Markdown

closes #16978

Summary

  • Fix a race in ManifestFilterManager where parallel manifest filtering concurrently mutates the shared deleteFiles set, corrupting its backing store.
  • Make duplicateDeleteCount thread-safe with AtomicInteger.
  • Add a high-contention regression test for concurrent adds to deleteFiles during a full overwrite across many manifests.

Problem

ManifestFilterManager.filterManifests() filters manifests in parallel via Tasks.range(...).executeWith(workerPoolSupplier.get()). During expression-based deletes, each worker thread calls deleteFiles.add(fileCopy) on a shared instance-level set backed by WrapperSet / LinkedHashSet, which is not thread-safe.

Under heavy concurrent adds (large overwrites spanning many manifests), the backing LinkedHashMap can corrupt. Observed symptoms include ClassCastException: LinkedHashMap$Entry cannot be cast to HashMap$TreeNode, StackOverflowError, and hangs.

duplicateDeleteCount is also incremented from worker threads with a non-atomic ++, which can lose increments.

Fix

Wrap deleteFiles with Collections.synchronizedSet(newFileSet()) so concurrent add/contains calls from worker threads are safe, and convert duplicateDeleteCount to AtomicInteger.

Collections.synchronizedSet synchronizes individual method calls. Iteration is not auto-synchronized, so iterating deleteFiles must remain after the parallel filtering phase (current behavior) or be guarded by synchronized (deleteFiles). This invariant is documented at the field.

Why Collections.synchronizedSet

The fix is intentionally minimal and confined to ManifestFilterManager, while preserving the existing semantics of deleteFiles exactly:

  • Location-based matching. newFileSet() returns a DataFileSet / DeleteFileSet, which match files by file.location() via WrapperSet. A raw concurrent set keyed on DataFile / DeleteFile would use object identity because the file classes do not override equals/hashCode. That would break the explicit deleteFile(...) path: the caller-supplied file instance added to deleteFiles would not match the separate file instance read back from a manifest during filtering.
  • Insertion order. WrapperSet documents that it maintains insertion order. It is a shared api/ type used outside this path, so re-backing it with a non-ordered concurrent map to fix one caller would silently change that contract for every consumer.

Test plan

  • ./gradlew :iceberg-core:test --tests org.apache.iceberg.TestOverwrite
  • ./gradlew :iceberg-core:test --tests org.apache.iceberg.TestDeleteFiles
  • ./gradlew :iceberg-core:test --tests org.apache.iceberg.TestOverwriteWithValidation
  • ./gradlew :iceberg-core:spotlessApply :iceberg-core:spotlessJavaCheck

testFullOverwriteAcrossManyManifests disables manifest merging and appends 16 manifests of 64 files each, using file locations engineered to share one String.hashCode() so that DataFileSet wrapper hashes all collide into a single bucket. A full overwrite (alwaysTrue()) then forces many worker threads to add to deleteFiles concurrently, concentrating contention in that bucket. Without the fix this reproduces the corruption (ClassCastException / StackOverflowError / hang); with the fix it passes consistently.

AI Disclosure
LLM is used to research codebase standards and draft the regression test that actually fails without the fix, and drafting the PR description.


Seems like since I opened the issue yesterday there are already 2 open PRs attempted to close the issue, hence adding some thoughts:

A deferred-merge alternative, where worker threads collect deleted files into a concurrent structure and the caller merges them into deleteFiles after the parallel phase, is also behavior-preserving: expression-deleted files added to deleteFiles during filtering are not needed by other workers to decide expression matches, and the actual post-phase consumers (filesToBeDeleted() and required-delete validation) run after Tasks.range(...) joins. It is a reasonable design with a lock-free hot path that is has higher throughput potential, but more invasive. This is available in #16980

My PR prefers the synchronized wrapper for its minimal change and exact preservation of WrapperSet semantics; a deferred merge is a sensible alternative if we have proof on the lock contention.

In object-store-backed deployments, each manifest is read from S3/GCS, so millisecond-scale I/O per manifest dominates both the microseconds of CPU and the nanoseconds of lock. So i would expect the lock to be genuinely negligible here. If the env is local FS, warm cache, manifests of many tiny entries, high core count, then there is merit to the deferred-merge approach.

Deferring the decision to the community and maintainers.

@github-actions github-actions Bot added the core label Jun 28, 2026
@liucao-dd liucao-dd force-pushed the fix-concurrent-add-race-condition-ManifestFilterManager branch from c17686d to 9d16b52 Compare June 28, 2026 00:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Core: ManifestFilterManager thread safety issue during parallel overwrite commit

2 participants