fix(amber): surface writer-thread failure as FatalError instead of silent hang#4683
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4683 +/- ##
============================================
+ Coverage 42.49% 42.51% +0.01%
- Complexity 2180 2185 +5
============================================
Files 1005 1005
Lines 37429 37436 +7
Branches 3914 3918 +4
============================================
+ Hits 15907 15915 +8
+ Misses 20558 20557 -1
Partials 964 964
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Another natural occurrence to fold into the test plan: PR #4648's release backport leg hit this exact path today.
Tracked separately in #4682. |
97eb43d to
9262054
Compare
kunwp1
left a comment
There was a problem hiding this comment.
LGTM! I have no changes to suggest.
|
Please wait for @Xiao-zhen-Liu 's review. |
Xiao-zhen-Liu
left a comment
There was a problem hiding this comment.
Nice fix, and I hope this can also solve the flaky and unreproducible #3950 test case issue. Left two minor suggestions before merge.
…lent hang OutputPortResultWriterThread previously let exceptions in close() escape Thread.run(), so the iceberg commit failure was invisible to the worker, the controller still saw normal portCompleted, downstream operators read incomplete data, and the test/user observed only a 1-minute Await timeout. Capture the failure, re-throw on closeOutputStorageWriterIfNeeded, let DPThread's existing MainThreadDelegateMessage path route it to the worker actor, and let the Controller's AllForOneStrategy supervisor emit FatalError to the client immediately. Closes apache#4682. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add OutputPortResultWriterThreadSpec with 4 tests: - OutputPortResultWriterThread leaves getFailure empty on a clean run. - OutputPortResultWriterThread captures a close() exception in getFailure so the worker can re-throw it. - OutputManager.closeOutputStorageWriterIfNeeded re-throws the writer thread's captured failure (this is the bridge from the writer thread to the DP thread → worker actor → controller supervisor → FatalError to client). - OutputManager.closeOutputStorageWriterIfNeeded is a no-op when the port has no writer thread. Together these pin every link of the fatal-error chain that this PR introduces. The OutputManager test reaches into the private outputPortResultWriterThreads map by reflection rather than going through addPort, which would require a real iceberg URI; the test file otherwise only depends on a 4-method stub BufferedItemWriter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review: a putOne failure mid-loop bypassed close() and leaked the underlying writer's file handles. Move close() into a finally clause; if both legs fail, attach close()'s exception as suppressed on the original. Tests: rework StubWriter to take onPutOne/onClose thunks, add a putOne-failure test (asserts close() still ran) and a both-fail test (asserts the original is captured with the close() failure as suppressed).
c709d9e to
b215402
Compare
What changes were proposed in this PR?
When
OutputPortResultWriterThread.run()throws (e.g. iceberg commit-retry budget exhausted), the writer thread dies silently and the worker still reportsportCompletedto the controller. The user sees a 1-minute completion timeout with no signal pointing at iceberg.Capture the failure on the writer thread, re-throw it from
OutputManager.closeOutputStorageWriterIfNeeded, and let the existing DP-thread → worker-actor → controller-supervisor path surface it as aFatalErrorto the client.Any related issues, documentation, discussions?
Closes #4682.
How was this PR tested?
OutputPortResultWriterThreadSpec(6 tests) covers clean run, putOne failure (close() still runs), close() failure, both-fail (close() suppressed on putOne), andOutputManager.closeOutputStorageWriterIfNeededre-throw + no-op cases.Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7, 1M context)