Skip to content

Core: Fix concurrent modification of deleteFiles in ManifestFilterManager#16980

Open
lilei1128 wants to merge 2 commits into
apache:mainfrom
lilei1128:fix
Open

Core: Fix concurrent modification of deleteFiles in ManifestFilterManager#16980
lilei1128 wants to merge 2 commits into
apache:mainfrom
lilei1128:fix

Conversation

@lilei1128

@lilei1128 lilei1128 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

When multiple worker threads processed manifests in parallel via
Tasks.range(), expression-matched files were written directly to the
non-thread-safe deleteFiles set, causing data races and incorrect
deleted-file counts.

Close #16978

Fix by collecting expression-deleted files into a ConcurrentLinkedQueue
during parallel filtering, then merging into deleteFiles on the calling
thread after all workers complete. Also convert duplicateDeleteCount
from int to AtomicInteger to safely handle concurrent increments across
worker threads.

Add a regression test that appends 128 files in separate commits (one
manifest each) and deletes all via overwriteByRowFilter with a 16-thread
pool. File paths are constructed using the "Aa"/"BB" Java hash-collision
pair so all locations share the same String.hashCode(), concentrating
concurrent adds into the same DataFileSet hash bucket and making the
race deterministically reproducible on。unfixed code.

@github-actions github-actions Bot added the core label Jun 27, 2026
@lilei1128 lilei1128 changed the title Core:Fix concurrent modification of deleteFiles in ManifestFilterManager Core: Fix concurrent modification of deleteFiles in ManifestFilterManager Jun 27, 2026
}

@TestTemplate
public void testConcurrentManifestFilteringWithExpression() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up! Since I opened #16978 yesterday and also opened #16984 with a different fix, want to acknowledge that your approach is lock-free on the hot path and has higher throughput potential.

I was thinking that in object-store-backed deployments, each manifest is read from S3/GCS, so millisecond-scale I/O per manifest dominates both the microseconds of CPU and the nanoseconds of lock. Hence I went with a simpler one-liner fix to use synchronizedSet.

The one thing I’d suggest is strengthening the regression test. With only 4 manifests / 4 files, the test may pass even without the fix because it does not create enough concurrent pressure on the underlying LinkedHashSet. In #16984 I used many files across many manifests with colliding file-location hashes to make the unfixed race much easier to reproduce. I’ll defer to maintainers on which implementation they prefer, but wanted to share this since we’re addressing the same issue.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing #16984 and the collidingLocation technique — forcing all paths into the same hash bucket is a much more reliable way to reproduce the race than simply increasing file count, and we will adapt that approach in our test.

One note on the synchronizedSet approach: as your own comment acknowledges, synchronizedSet only protects individual add/contains calls — iteration is not covered. validateRequiredDeletes uses containsAll and buildSummary iterates over deleteFiles, both of which are safe today only because they run after filtering completes. That's a correctness constraint buried in call-site ordering rather than enforced by the data structure itself.

Our approach (collect into a ConcurrentLinkedQueue during parallel filtering, then merge into deleteFiles on the calling thread after Tasks.range returns) avoids this entirely — deleteFiles is never written concurrently, so iteration safety is guaranteed structurally rather than by convention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Core: ManifestFilterManager thread safety issue during parallel overwrite commit

2 participants