fix(trainer): remove LocalJob nested os.chdir and verify race condition #373

Open
priyank766 wants to merge 3 commits into kubeflow:main from priyank766:fix/localjob-race-condition-356

@priyank766
Contributor

What this PR does / why we need it:

When LocalJob instances run concurrently in a single process, calling os.chdir() (even inside a try...finally block within a thread) causes a race condition. The working directory is a process-wide setting, so one job's os.chdir() changes it for every other thread, and relative paths can resolve against the wrong job's directory.
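To illustrate why os.chdir() is unsafe here, the following minimal sketch (not code from this PR; the helper name is hypothetical) changes the directory from a worker thread and shows that the main thread observes the change:

```python
import os
import tempfile
import threading


def cwd_seen_by_main_after_thread_chdir() -> tuple[str, str]:
    """Return (directory the worker switched to, cwd the main thread then sees).

    Hypothetical helper for illustration only; it is not part of LocalJob.
    """
    original = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.realpath(tmp)
        try:
            # The worker thread calls os.chdir(); the change is process-wide.
            worker = threading.Thread(target=os.chdir, args=(target,))
            worker.start()
            worker.join()
            observed = os.path.realpath(os.getcwd())
        finally:
            # Restore the cwd before the temporary directory is deleted.
            os.chdir(original)
    return target, observed


if __name__ == "__main__":
    target, observed = cwd_seen_by_main_after_thread_chdir()
    print("worker changed to:", target)
    print("main thread sees: ", observed)
```

Because both values match, a second thread that relied on a relative path in the meantime would have resolved it against the first thread's directory.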

This PR removes the non-thread-safe os.chdir() calls from LocalJob.run(). Instead, it passes the execution directory directly to the child process through the cwd argument: subprocess.Popen(cwd=self.execution_dir, ...).
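A minimal sketch of the cwd-based approach, assuming nothing about LocalJob's internals beyond what is described above (run_in_dir is a hypothetical helper, and the child command is only a stand-in for a real training command):

```python
import os
import subprocess
import sys
import tempfile


def run_in_dir(execution_dir: str) -> str:
    """Run a child Python process in execution_dir and return the cwd it reports.

    Hypothetical helper; the real LocalJob.run() is not shown in this PR page.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        cwd=execution_dir,  # applies to the child only; the parent's cwd is untouched
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate()
    return out.strip()


if __name__ == "__main__":
    before = os.getcwd()
    with tempfile.TemporaryDirectory() as d:
        print("child ran in:", run_in_dir(d))
    print("parent cwd unchanged:", os.getcwd() == before)
```

Since the directory is set for the child at spawn time, no try...finally restore step is needed in the parent.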

Additionally, this PR includes an integration test, using the required TestCase fixture, that verifies concurrent jobs each run in their own execution directory without mutating the main process's working directory.

Which issue(s) this PR fixes:

Fixes #356

Checklist:

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 9, 2026 05:39
@google-oss-prow google-oss-prow bot requested a review from astefanutti March 9, 2026 05:39
@google-oss-prow
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot requested review from kramaranya and szaher March 9, 2026 05:39
@github-actions
Contributor

github-actions bot commented Mar 9, 2026

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a race condition (issue #356) in LocalJob where concurrent threads using os.chdir() could interfere with each other's working directories. The fix replaces the process-wide os.chdir() call with the thread-safe subprocess.Popen(cwd=...) parameter, and adds a concurrency regression test.

Changes:

  • Removed os.chdir()/os.getcwd() working directory manipulation from LocalJob.run() and replaced it with cwd=self.execution_dir in the subprocess.Popen call
  • Added a concurrent regression test that verifies multiple LocalJob threads don't mutate the parent process's cwd
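The shape of such a regression check can be sketched as follows. This is an illustrative stand-in, not the PR's test_concurrent_localjobs_do_not_change_cwd: run_jobs_concurrently is a hypothetical helper, and plain subprocesses take the place of LocalJob instances.

```python
import os
import subprocess
import sys
import tempfile
import threading


def run_jobs_concurrently(dirs: list[str]) -> dict[str, str]:
    """Start one thread per directory and return the cwd each child process reports.

    Hypothetical helper; a barrier releases all threads together to maximize overlap.
    """
    barrier = threading.Barrier(len(dirs))
    results: dict[str, str] = {}
    lock = threading.Lock()

    def worker(execution_dir: str) -> None:
        barrier.wait()  # all threads proceed at the same moment
        completed = subprocess.run(
            [sys.executable, "-c", "import os; print(os.getcwd())"],
            cwd=execution_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        with lock:
            results[execution_dir] = completed.stdout.strip()

    threads = [threading.Thread(target=worker, args=(d,)) for d in dirs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    before = os.getcwd()
    with tempfile.TemporaryDirectory() as root:
        dirs = [os.path.join(root, f"job-{i}") for i in range(4)]
        for d in dirs:
            os.mkdir(d)
        reported = run_jobs_concurrently(dirs)
        # Each child must have run in its own directory.
        for d in dirs:
            assert os.path.realpath(reported[d]) == os.path.realpath(d)
    # The parent's working directory must be unchanged.
    assert os.getcwd() == before
```

The two assertions mirror the properties the PR description names: per-job directory scoping and an unchanged parent cwd.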

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| kubeflow/trainer/backends/localprocess/job.py | Replaced os.chdir() with subprocess.Popen(cwd=...) and removed the finally block that restored the cwd |
| kubeflow/trainer/backends/localprocess/backend_test.py | Added test_concurrent_localjobs_do_not_change_cwd regression test using barrier-synchronized threads |

Signed-off-by: priyank <priyank8445@gmail.com>
…nt_localjobs_do_not_change_cwd function

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766 priyank766 force-pushed the fix/localjob-race-condition-356 branch from 26f73c6 to 5c65288 Compare March 9, 2026 06:31
Development

Successfully merging this pull request may close these issues.

Potential Race condition in LocalJob Executions

2 participants