Mock provision: make detection of finished processes more deterministic#4437
Mock provision: make detection of finished processes more deterministic#4437mkoncek wants to merge 5 commits intoteemtee:mainfrom
Conversation
fdead6f to
702c1c4
Compare
|
@happz Rebased on the current |
702c1c4 to
c640c33
Compare
c640c33 to
6604f03
Compare
|
The primary cause of this bug was the assumption that each finished command prints a single newline on stdout after it is finished. This turned out not to be true in some environments. So we print a binary zero after the initial commands are finished and then loop until we read that binary zero back. Later the command completion is detected by returncode being written into its pre-determined file, instead of waiting on a newline on stdout again. |
|
Also: self.mock_shell.stdin.write(''.join(command + '\n' for command in commands))
# Issue a command writing a binary zero on the standard output after all
# the previous commands are finished.
self.mock_shell.stdin.write('echo -e \\\\x00\n')could be rewritten as: self.mock_shell.stdin.write('\n'.join(commands))
# Issue a command writing a binary zero on the standard output after all
# the previous commands are finished.
self.mock_shell.stdin.write('\necho -e \\\\x00\n')And then we could use a different string to mark the end of commands. Afaik in that initialization we only execute things like |
|
@LecrisUT I went over your concerns and tried to apply fixes. |
LecrisUT
left a comment
There was a problem hiding this comment.
Looks good enough for me. Could you rebase please?
Unrelated, but can you also open a PR to revert and give more context on what needs to be done?
tmt/tests/prepare/shell/main.fmf
Lines 9 to 11 in 644c58c
Would prefer to be explicit on the AVC checks that are skipped, but when I tried to revert that, it seemed to have some more genuine errors in it.
I rebased it. |
thrix
left a comment
There was a problem hiding this comment.
Code Review: Mock provision — make detection of finished processes more deterministic
Reviewed against commit 3f5cf6d.
Summary
The PR fixes a real bug where mock shell environments produce varying numbers of spurious newlines on stdout, causing the previous counting-based completion detection to hang. The fix introduces two improvements:
- In
_simple_execute: uses a sentinel keyword (TMT_FINISHED_EXEC) echoed to stdout instead of counting newlines - In
_spawn_command: uses the returncode file fd instead of mock shell stdout for completion detection
The overall approach is sound and addresses the root cause well. However, there's a bug in the sentinel detection that could cause the same hanging behavior.
Issues
1. _simple_execute sentinel comparison is fragile — potential hang (bug)
The sentinel check uses exact == equality on self.mock_shell.stdout.read(). Since stdout is a non-blocking TextIOWrapper, read() returns all currently available data. If spurious newlines and the sentinel arrive in the same pipe buffer (very likely since all commands are fast and flushed at once), read() returns e.g. '\n\n\n\nTMT_FINISHED_EXEC\n' which does NOT equal 'TMT_FINISHED_EXEC\n'. The loop hangs.
Verified with a Python test demonstrating the behavior. Fix: use endswith() instead of ==. See inline comment.
2. Stale comment (minor)
"Wait until we read the binary zero from stdout" refers to a previous implementation. Should say "finished keyword" or "sentinel". See inline comment.
Observations
3. _spawn_command: returncode_fd not in the event loop
When returncode_fd has data alongside other events, the data is left unread until it becomes the sole event. Works in practice but is less responsive than the old code. See inline comment.
Verdict
The approach is correct and a clear improvement over counting newlines. Issue #1 (fragile == comparison) must be fixed before merging — it could reproduce the exact hang this PR aims to solve.
Generated-by: Claude Code
tmt/steps/provision/mock.py
Outdated
| if ( | ||
| fileno == self.mock_shell_stdout_fd | ||
| and self.mock_shell.stdout.read() == f'{finished_keyword}\n' | ||
| ): |
There was a problem hiding this comment.
This exact == comparison is fragile and could reproduce the same hang this PR aims to fix. self.mock_shell.stdout is a non-blocking TextIOWrapper (text=True + O_NONBLOCK), so read() returns all currently available data.
If the mock shell produces spurious newlines (the very bug being fixed) AND the sentinel arrives in the same pipe buffer — which is very likely since rm, mkdir, mkfifo, chmod, and echo are all fast commands flushed to stdin at once — read() returns something like '\n\n\n\nTMT_FINISHED_EXEC\n', which does NOT equal 'TMT_FINISHED_EXEC\n'. The comparison fails and the loop hangs.
I verified this with a quick Python test:
p = subprocess.Popen(['bash', '-c', 'echo a; echo b; echo TMT_FINISHED_EXEC'],
stdout=subprocess.PIPE, text=True)
time.sleep(0.1)
fcntl.fcntl(p.stdout.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)
data = p.stdout.read()
# data == 'a\nb\nTMT_FINISHED_EXEC\n' -- does NOT == 'TMT_FINISHED_EXEC\n'Fix: use endswith() instead of ==:
and self.mock_shell.stdout.read().endswith(f'{finished_keyword}\n')Generated-by: Claude Code
There was a problem hiding this comment.
What? Why? That seems equally fragile. A split would be better recommendation over endswith if the concern is about other lines contamination.
PS: the generated comment is way too verbose and distracting. Please instruct it to be more concise next time. It took me 5 readings to understand what it is trying to say
| loop = len(commands) | ||
| while loop != 0 and self.mock_shell.poll() is None: | ||
| # Wait until we read the binary zero from stdout. | ||
| while True: |
There was a problem hiding this comment.
Stale comment — this refers to a previous implementation that used echo -e \\\\x00. Should say "finished keyword" or "sentinel" instead of "binary zero".
Generated-by: Claude Code
tmt/steps/provision/mock.py
Outdated
| if len(events) == 1 and events[0][0] == returncode_fd: | ||
| content = os.read(returncode_fd, 16) | ||
| returncode = int(content.decode('utf-8').strip()) | ||
| returncode_io.try_unregister() |
There was a problem hiding this comment.
Critical bug: Missing empty content check before parsing returncode. If os.read(returncode_fd, 16) returns empty bytes (EOF or closed file), the code attempts int(''.strip()) which raises ValueError: invalid literal for int() with base 10: ''.
Fix: Check for empty content before parsing:
if len(events) == 1 and events[0][0] == returncode_fd:
content = os.read(returncode_fd, 16)
if not content:
returncode_io.try_unregister()
continue
returncode = int(content.decode('utf-8').strip())
returncode_io.try_unregister()
break| if len(events) == 1 and events[0][0] == returncode_fd: | |
| content = os.read(returncode_fd, 16) | |
| returncode = int(content.decode('utf-8').strip()) | |
| returncode_io.try_unregister() | |
| if len(events) == 1 and events[0][0] == returncode_fd: | |
| content = os.read(returncode_fd, 16) | |
| if not content: | |
| returncode_io.try_unregister() | |
| continue | |
| returncode = int(content.decode('utf-8').strip()) | |
| returncode_io.try_unregister() | |
| break | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
There was a problem hiding this comment.
In that case maybe i can re-throw a ProvisionError?
There was a problem hiding this comment.
Maybe, when would the content be empty?
Btw, can you also give a quick explanation of the events output here and what the (fd, event) tuple looks like?
There was a problem hiding this comment.
It would be empty if the mock shell process was killed abruptly and between the other checks...
You mean add a comment in code? Because events is exactly the output of epoll.poll so I don't really want to explain what is already in epoll-s documentation.
There was a problem hiding this comment.
It would be empty if the mock shell process was killed abruptly and between the other checks...
That I gathered from the documentation part
An empty list indicates that the call timed out and no file descriptors had any events to report.
You mean add a comment? Because events is exactly the output of epoll.poll so I don't really want to explain what is already in epoll-s documentation.
Well there are a few things confusing with the current documentation state:
- It is unclear when the events length is more than 1 and what the other list members would have
- The event in the tuple is never used and upstream documentation is not really clear on what these events actually are
- The
epoll.polldocumentation is really technical. An overview documentation on how this works would be good, or a reference to a minimal project, example, or equivalents would be good to understand the overall flow
There was a problem hiding this comment.
Allright, I have a different point of view as the implementor :)
I will add some explanations.
Based on: #4173
I found out that various environments (
fedora-rawhide-x86_64) print different number of empty strings on their stdout as the commands spawned in the chroots finish. This causes some environments to hang.This PR makes the detection more deterministic.
Pull Request Checklist