fix(trainer): remove LocalJob nested os.chdir and verify race condition#373
fix(trainer): remove LocalJob nested os.chdir and verify race condition#373priyank766 wants to merge 3 commits intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR fixes a race condition (issue #356) in LocalJob where concurrent threads using os.chdir() could interfere with each other's working directories. The fix replaces the process-wide os.chdir() call with the thread-safe subprocess.Popen(cwd=...) parameter, and adds a concurrency regression test.
Changes:
- Removed
os.chdir()/os.getcwd()working directory manipulation fromLocalJob.run()and replaced it withcwd=self.execution_dirin thesubprocess.Popencall - Added a concurrent regression test that verifies multiple
LocalJobthreads don't mutate the parent process's cwd
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
kubeflow/trainer/backends/localprocess/job.py |
Replaced os.chdir() with subprocess.Popen(cwd=...) and removed the finally block that restored the cwd |
kubeflow/trainer/backends/localprocess/backend_test.py |
Added test_concurrent_localjobs_do_not_change_cwd regression test using barrier-synchronized threads |
You can also share your feedback on Copilot code review. Take the survey.
Signed-off-by: priyank <priyank8445@gmail.com>
…nt_localjobs_do_not_change_cwd function Signed-off-by: priyank <priyank8445@gmail.com>
26f73c6 to
5c65288
Compare
What this PR does / why we need it:
When running [LocalJob] instances concurrently, using
os.chdir()within a shared process (even if enclosed in atry...finallyblock within a thread) leads to race conditions where jobs mix up working directories. This can cause paths to resolve incorrectly across the threads.This PR removes the non-thread-safe
os.chdir()context switching insideLocalJob.run(). Instead, it explicitly passes the execution path down to the subprocess by leveraging the argument insubprocess.Popen(cwd=self.execution_dir, ...).Additionally, PR includes a robust integration test using the required [TestCase] fixture to ensure concurrent execution properly scopes execution paths without mutating the main process' working directory.
Which issue(s) this PR fixes:
Fixes #356
Checklist: