[GOBBLIN-2267] Block GobblinYarnAppLauncher launch() until the YARN application is terminal #4203

Open
pratapaditya04 wants to merge 1 commit into apache:master from pratapaditya04:pratapaditya04/depend-106410-launcher-block-until-yarn-completion

Conversation

pratapaditya04 commented Jun 23, 2026

Contributor

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Summary

GobblinYarnAppLauncher.launch() returned as soon as the YARN application was submitted, while the status monitor kept polling on a background non-daemon thread. For the Azkaban submitter path this meant run() returned at submission time and the launcher-owned non-daemon threads kept the JVM alive for the whole job, so the process hung after run() finished. This change makes launch() block (in attached mode) until the application reaches a terminal state, then tear down — so run() represents the full application lifecycle and the JVM exits cleanly once it returns.

Problem

  • launch() scheduled the applicationStatusMonitor and returned immediately after submission; AzkabanGobblinYarnAppLauncher.run() therefore returned at submission time, and the launcher's non-daemon threads (the status monitor, the ServiceManager, the YARN client) kept the JVM alive for the entire job.
  • The status monitor also invoked stop() from its own worker thread on a terminal report; stop() then called shutdownExecutorService(applicationStatusMonitor) — a thread cannot await termination of its own executor, so teardown stalled for the full awaitTermination window.
  • main() called launch() then System.exit(getExitCode()); with a non-blocking launch() that exit raced submission.
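
The self-shutdown stall in the second bullet can be reproduced outside Gobblin. The following is a minimal sketch, not the launcher's code: the class and method names are hypothetical, and a plain single-thread executor stands in for applicationStatusMonitor. It compares awaiting termination from inside the executor's own worker with awaiting it from the calling thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of Gobblin.
class SelfShutdownStallDemo {

  /**
   * Runs a task that shuts down and then awaits termination of the executor
   * it is running on. Returns what awaitTermination reported to the worker.
   */
  static boolean awaitOwnExecutor(long timeoutMillis) {
    ExecutorService monitor = Executors.newSingleThreadExecutor();
    try {
      Future<Boolean> result = monitor.submit(() -> {
        monitor.shutdown(); // rejects new tasks; this task keeps running
        // The executor cannot terminate while its only worker is blocked
        // here, so the call waits out the whole timeout.
        return monitor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
      });
      return result.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** Same shutdown, but awaited from the caller once the task has finished. */
  static boolean awaitFromCaller(long timeoutMillis) {
    ExecutorService monitor = Executors.newSingleThreadExecutor();
    try {
      monitor.submit(() -> { }).get(); // let the worker finish its task
      monitor.shutdown();
      return monitor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("terminated when awaited from worker: " + awaitOwnExecutor(500));
    System.out.println("terminated when awaited from caller: " + awaitFromCaller(500));
  }
}
```

In the first method the worker is the thing being waited on, so the wait can only end by timing out; that is the stall described above, scaled down to a short timeout.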

Changes

  • Add a CountDownLatch applicationTerminalLatch that the status monitor releases once the application reaches a terminal state, or the launcher is otherwise stopped (e.g. lost visibility of the AM).
  • In attached mode, launch()/run() blocks on the latch and then runs stop() on the calling thread. Because teardown runs off the monitor thread, it shuts the (non-daemon) monitor and the launcher services down before launch() returns — no launcher-owned thread is left to keep the JVM alive, and the self-shutdown stall is gone. This also makes main()'s System.exit(getExitCode()) correct.
  • The two @Subscribe handlers (handleApplicationReportArrivalEvent, handleGetApplicationReportFailureEvent) now signal the latch instead of calling stop() from the monitor thread. Detached launches (which have no caller blocked on the latch) still stop directly in the handler, unchanged.
  • The monitor thread is left non-daemon as before; nothing is force-killed — the threads are explicitly stopped by stop() on the calling thread.
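
For reviewers who want the attached-mode flow in one place, here is a rough sketch of the pattern under stated assumptions. Only the names applicationTerminalLatch, launch() and stop() come from this PR; the class, the simulated monitor task, the 100 ms delay and the accessors are illustrative and do not mirror the real GobblinYarnAppLauncher.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the launcher in attached mode; not Gobblin code.
class AttachedLaunchSketch {
  private final CountDownLatch applicationTerminalLatch = new CountDownLatch(1);
  // Default thread factory, so the monitor thread is non-daemon.
  private final ExecutorService statusMonitor = Executors.newSingleThreadExecutor();
  private volatile Thread stopThread;
  private volatile boolean monitorTerminated;

  /** Starts the simulated monitor, blocks until "terminal", then tears down. */
  void launch() {
    statusMonitor.submit(() -> {
      try {
        Thread.sleep(100); // pretend the application runs for a while
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      // The monitor only signals; it never calls stop() itself.
      applicationTerminalLatch.countDown();
    });
    try {
      applicationTerminalLatch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    stop(); // teardown runs on the thread that called launch()
  }

  /** Shuts the monitor down and waits for its thread to exit. */
  void stop() {
    stopThread = Thread.currentThread();
    statusMonitor.shutdown();
    try {
      monitorTerminated = statusMonitor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  Thread stopThread() { return stopThread; }

  boolean monitorTerminated() { return monitorTerminated; }

  public static void main(String[] args) {
    AttachedLaunchSketch launcher = new AttachedLaunchSketch();
    launcher.launch();
    System.out.println("stop() ran on caller: " + (launcher.stopThread() == Thread.currentThread()));
    System.out.println("monitor terminated: " + launcher.monitorTerminated());
  }
}
```

Because stop() runs on the caller, awaitTermination is no longer waiting on its own thread, and the monitor thread has exited by the time launch() returns.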

Compatibility

The only callers of launch() are main() and AzkabanGobblinYarnAppLauncher.run(), both of which want blocking behaviour; there are no OSS subclasses of GobblinYarnAppLauncher. Detached-mode behaviour is preserved.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Unit — adds three tests to GobblinYarnAppLauncherTerminalGteTest: a terminal report releases the latch without calling stop() on the monitor thread (attached); the lost-AM-visibility path releases the latch even though it never sets applicationCompleted (regression guard against an indefinite block); and a detached launch still stops directly in the handler. Existing tests continue to pass; :gobblin-yarn builds clean.
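
The behaviour those tests pin down can be modelled without YARN or an event bus. This is a simplified sketch, not the actual test class: TerminalHandlerSketch, its detached flag, the stopCalls counter and onTerminalOrLostVisibility() are hypothetical names, and one method stands in for both @Subscribe handlers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the handler branching; not Gobblin's class or tests.
class TerminalHandlerSketch {
  final CountDownLatch applicationTerminalLatch = new CountDownLatch(1);
  final AtomicInteger stopCalls = new AtomicInteger();
  private final boolean detached;

  TerminalHandlerSketch(boolean detached) {
    this.detached = detached;
  }

  /** Stand-in for the real teardown; only counts invocations. */
  void stop() {
    stopCalls.incrementAndGet();
  }

  /** Models both the terminal-report and the lost-AM-visibility handlers. */
  void onTerminalOrLostVisibility() {
    if (detached) {
      stop(); // nobody is blocked on the latch, so stop directly
    } else {
      applicationTerminalLatch.countDown(); // wake the blocked caller instead
    }
  }

  public static void main(String[] args) {
    TerminalHandlerSketch attached = new TerminalHandlerSketch(false);
    attached.onTerminalOrLostVisibility();
    System.out.println("attached: latch count=" + attached.applicationTerminalLatch.getCount()
        + ", stop calls=" + attached.stopCalls.get());

    TerminalHandlerSketch detachedLaunch = new TerminalHandlerSketch(true);
    detachedLaunch.onTerminalOrLostVisibility();
    System.out.println("detached: stop calls=" + detachedLaunch.stopCalls.get());
  }
}
```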

E2E — validated via a snapshot gobblin build wired into gobblin-temporal-workers and carbon copy flows on prod-ltx1 across the success, cancel, and failure paths; GGW-pod launcher logs confirm launch() blocks on the latch and tears down on the calling thread. (Results appended in a comment.)

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

pratapaditya04 changed the title from "GOBBLIN-2265: Make GobblinYarnAppLauncher.launch() block in-thread until the YARN application is terminal" to "GOBBLIN-2267: Block GobblinYarnAppLauncher launch() in-thread until the YARN application is terminal" on Jun 24, 2026
pratapaditya04 marked this pull request as ready for review on June 24, 2026 09:00
pratapaditya04 force-pushed the pratapaditya04/depend-106410-launcher-block-until-yarn-completion branch from 2ef4049 to c200885 on June 24, 2026 09:03
pratapaditya04 changed the title from "GOBBLIN-2267: Block GobblinYarnAppLauncher launch() in-thread until the YARN application is terminal" to "[GOBBLIN-2267] Block GobblinYarnAppLauncher launch() in-thread until the YARN application is terminal" on Jun 24, 2026
…pplication is terminal

GobblinYarnAppLauncher.launch() returned as soon as the YARN application
was submitted while the status monitor kept polling on a background
non-daemon thread. For the Azkaban submitter path run() therefore
returned at submission time, and the launcher-owned non-daemon threads
(the status monitor, the ServiceManager, the YARN client) kept the JVM
alive for the whole job, so the process hung after run() finished. The
status monitor also called stop() from its own worker thread on a
terminal report, which then awaited termination of its own executor and
stalled for the full timeout.

Add a CountDownLatch that the status monitor releases once the
application reaches a terminal state (or the launcher is otherwise
stopped, e.g. lost AM visibility). In attached mode launch()/run() blocks
on the latch and then runs stop() on the calling thread, so teardown
shuts the (non-daemon) monitor and the launcher services down before
launch() returns: no launcher-owned thread is left to keep the JVM alive,
and stop() no longer runs on the monitor thread, removing the
self-shutdown stall. Detached launches return right after submission and
are torn down by the monitor as before.

Add unit tests covering that a terminal or lost-AM report releases the
latch without calling stop() on the monitor thread (attached), and that a
detached launch still stops on the monitor thread.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pratapaditya04 force-pushed the pratapaditya04/depend-106410-launcher-block-until-yarn-completion branch from c200885 to 803364e on June 24, 2026 09:25
pratapaditya04 changed the title from "[GOBBLIN-2267] Block GobblinYarnAppLauncher launch() in-thread until the YARN application is terminal" to "[GOBBLIN-2267] Block GobblinYarnAppLauncher launch() until the YARN application is terminal" on Jun 24, 2026