Skip to content

fix(amber): surface writer-thread failure as FatalError instead of silent hang#4683

Merged
Yicong-Huang merged 3 commits intoapache:mainfrom
Yicong-Huang:fix/iceberg-write-fatal-fallback
May 6, 2026
Merged

fix(amber): surface writer-thread failure as FatalError instead of silent hang#4683
Yicong-Huang merged 3 commits intoapache:mainfrom
Yicong-Huang:fix/iceberg-write-fatal-fallback

Conversation

@Yicong-Huang
Copy link
Copy Markdown
Contributor

@Yicong-Huang Yicong-Huang commented May 2, 2026

What changes were proposed in this PR?

When OutputPortResultWriterThread.run() throws (e.g. iceberg commit-retry budget exhausted), the writer thread dies silently and the worker still reports portCompleted to the controller. The user sees a 1-minute completion timeout with no signal pointing at iceberg.

Capture the failure on the writer thread, re-throw it from OutputManager.closeOutputStorageWriterIfNeeded, and let the existing DP-thread → worker-actor → controller-supervisor path surface it as a FatalError to the client.

Any related issues, documentation, discussions?

Closes #4682.

How was this PR tested?

OutputPortResultWriterThreadSpec (6 tests) covers clean run, putOne failure (close() still runs), close() failure, both-fail (close() suppressed on putOne), and OutputManager.closeOutputStorageWriterIfNeeded re-throw + no-op cases.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7, 1M context)

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 2, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.51%. Comparing base (8b5dbf8) to head (b215402).

Files with missing lines Patch % Lines
...worker/managers/OutputPortResultWriterThread.scala 69.23% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4683      +/-   ##
============================================
+ Coverage     42.49%   42.51%   +0.01%     
- Complexity     2180     2185       +5     
============================================
  Files          1005     1005              
  Lines         37429    37436       +7     
  Branches       3914     3918       +4     
============================================
+ Hits          15907    15915       +8     
+ Misses        20558    20557       -1     
  Partials        964      964              
Flag Coverage Δ
amber 43.16% <71.42%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Yicong-Huang
Copy link
Copy Markdown
Contributor Author

Another natural occurrence to fold into the test plan: PR #4648's release backport leg hit this exact path today.

Tracked separately in #4682.

@Yicong-Huang Yicong-Huang force-pushed the fix/iceberg-write-fatal-fallback branch from 97eb43d to 9262054 Compare May 3, 2026 05:03
Copy link
Copy Markdown
Contributor

@kunwp1 kunwp1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I have no changes to suggest.

@aglinxinyuan
Copy link
Copy Markdown
Contributor

Please wait for @Xiao-zhen-Liu 's review.

@aglinxinyuan aglinxinyuan removed their request for review May 4, 2026 07:43
Copy link
Copy Markdown
Contributor

@Xiao-zhen-Liu Xiao-zhen-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix, and I hope this can also solve the flaky and unreproducible #3950 test case issue. Left two minor suggestions before merge.

Yicong-Huang and others added 3 commits May 5, 2026 23:37
…lent hang

OutputPortResultWriterThread previously let exceptions in close()
escape Thread.run(), so the iceberg commit failure was invisible to
the worker, the controller still saw normal portCompleted, downstream
operators read incomplete data, and the test/user observed only a
1-minute Await timeout. Capture the failure, re-throw on
closeOutputStorageWriterIfNeeded, let DPThread's existing
MainThreadDelegateMessage path route it to the worker actor, and
let the Controller's AllForOneStrategy supervisor emit FatalError
to the client immediately.

Closes apache#4682.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add OutputPortResultWriterThreadSpec with 4 tests:

- OutputPortResultWriterThread leaves getFailure empty on a clean run.
- OutputPortResultWriterThread captures a close() exception in
  getFailure so the worker can re-throw it.
- OutputManager.closeOutputStorageWriterIfNeeded re-throws the writer
  thread's captured failure (this is the bridge from the writer
  thread to the DP thread → worker actor → controller supervisor →
  FatalError to client).
- OutputManager.closeOutputStorageWriterIfNeeded is a no-op when the
  port has no writer thread.

Together these pin every link of the fatal-error chain that this PR
introduces. The OutputManager test reaches into the private
outputPortResultWriterThreads map by reflection rather than going
through addPort, which would require a real iceberg URI; the test
file otherwise only depends on a 4-method stub BufferedItemWriter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review: a putOne failure mid-loop bypassed close() and leaked
the underlying writer's file handles. Move close() into a finally
clause; if both legs fail, attach close()'s exception as suppressed on
the original.

Tests: rework StubWriter to take onPutOne/onClose thunks, add a
putOne-failure test (asserts close() still ran) and a both-fail test
(asserts the original is captured with the close() failure as
suppressed).
@Yicong-Huang Yicong-Huang force-pushed the fix/iceberg-write-fatal-fallback branch from c709d9e to b215402 Compare May 6, 2026 06:42
@Yicong-Huang Yicong-Huang enabled auto-merge (squash) May 6, 2026 06:46
@Yicong-Huang Yicong-Huang merged commit 34b004d into apache:main May 6, 2026
18 checks passed
@Yicong-Huang Yicong-Huang deleted the fix/iceberg-write-fatal-fallback branch May 6, 2026 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Iceberg writer-thread failure dies silently, masking root cause as 1-minute test timeout

5 participants