[GOBBLIN-2267] Block GobblinYarnAppLauncher launch() until the YARN application is terminal#4203
Open
pratapaditya04 wants to merge 1 commit into
Conversation
2ef4049 to
c200885
Compare
…pplication is terminal GobblinYarnAppLauncher.launch() returned as soon as the YARN application was submitted while the status monitor kept polling on a background non-daemon thread. For the Azkaban submitter path run() therefore returned at submission time, and the launcher-owned non-daemon threads (the status monitor, the ServiceManager, the YARN client) kept the JVM alive for the whole job, so the process hung after run() finished. The status monitor also called stop() from its own worker thread on a terminal report, which then awaited termination of its own executor and stalled for the full timeout. Add a CountDownLatch that the status monitor releases once the application reaches a terminal state (or the launcher is otherwise stopped, e.g. lost AM visibility). In attached mode launch()/run() blocks on the latch and then runs stop() on the calling thread, so teardown shuts the (non-daemon) monitor and the launcher services down before launch() returns: no launcher-owned thread is left to keep the JVM alive, and stop() no longer runs on the monitor thread, removing the self-shutdown stall. Detached launches return right after submission and are torn down by the monitor as before. Add unit tests covering that a terminal or lost-AM report releases the latch without calling stop() on the monitor thread (attached), and that a detached launch still stops on the monitor thread. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c200885 to
803364e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
Summary
GobblinYarnAppLauncher.launch()returned as soon as the YARN application was submitted, while the status monitor kept polling on a background non-daemon thread. For the Azkaban submitter path this meantrun()returned at submission time and the launcher-owned non-daemon threads kept the JVM alive for the whole job, so the process hung afterrun()finished. This change makeslaunch()block (in attached mode) until the application reaches a terminal state, then tear down — sorun()represents the full application lifecycle and the JVM exits cleanly once it returns.Problem
launch()scheduled theapplicationStatusMonitorand returned immediately after submission;AzkabanGobblinYarnAppLauncher.run()therefore returned at submission time, and the launcher's non-daemon threads (the status monitor, theServiceManager, the YARN client) kept the JVM alive for the entire job.stop()from its own worker thread on a terminal report;stop()then calledshutdownExecutorService(applicationStatusMonitor)— a thread cannot await termination of its own executor, so teardown stalled for the fullawaitTerminationwindow.main()calledlaunch()thenSystem.exit(getExitCode()); with a non-blockinglaunch()that exit raced submission.Changes
CountDownLatch applicationTerminalLatchthat the status monitor releases once the application reaches a terminal state, or the launcher is otherwise stopped (e.g. lost visibility of the AM).launch()/run()blocks on the latch and then runsstop()on the calling thread. Because teardown runs off the monitor thread, it shuts the (non-daemon) monitor and the launcher services down beforelaunch()returns — no launcher-owned thread is left to keep the JVM alive, and the self-shutdown stall is gone. This also makesmain()'sSystem.exit(getExitCode())correct.@Subscribehandlers (handleApplicationReportArrivalEvent,handleGetApplicationReportFailureEvent) now signal the latch instead of callingstop()from the monitor thread. Detached launches (which have no caller blocked on the latch) still stop directly in the handler, unchanged.stop()on the calling thread.Compatibility
The only callers of
launch()aremain()andAzkabanGobblinYarnAppLauncher.run(), both of which want blocking behaviour; there are no OSS subclasses ofGobblinYarnAppLauncher. Detached-mode behaviour is preserved.Tests
Unit — adds three tests to
GobblinYarnAppLauncherTerminalGteTest: a terminal report releases the latch without callingstop()on the monitor thread (attached); the lost-AM-visibility path releases the latch even though it never setsapplicationCompleted(regression guard against an indefinite block); and a detached launch still stops directly in the handler. Existing tests continue to pass;:gobblin-yarnbuilds clean.E2E — validated via a snapshot gobblin build wired into
gobblin-temporal-workersandcarboncopy flows on prod-ltx1 across the success, cancel, and failure paths; GGW-pod launcher logs confirmlaunch()blocks on the latch and tears down on the calling thread. (Results appended in a comment.)Commits