Skip to content

Mock provision: make detection of finished processes more deterministic#4437

Open
mkoncek wants to merge 5 commits intoteemtee:mainfrom
mkoncek:mock-process-fix
Open

Mock provision: make detection of finished processes more deterministic#4437
mkoncek wants to merge 5 commits intoteemtee:mainfrom
mkoncek:mock-process-fix

Conversation

@mkoncek
Copy link
Collaborator

@mkoncek mkoncek commented Dec 17, 2025

Based on: #4173
I found out that various environments (fedora-rawhide-x86_64) print different number of empty strings on their stdout as the commands spawned in the chroots finish. This causes some environments to hang.
This PR makes the detection more deterministic.

Pull Request Checklist

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • adjust plugin docstring
  • modify the json schema
  • mention the version
  • include a release note

@mkoncek mkoncek added bug Something isn't working step | provision Stuff related to the provision step plugin | mock Related to mock provision plugin labels Dec 17, 2025
@happz happz added the ci | full test Pull request is ready for the full test execution label Jan 7, 2026
@happz happz added this to planning Jan 7, 2026
@github-project-automation github-project-automation bot moved this to backlog in planning Jan 7, 2026
@happz happz moved this from backlog to review in planning Jan 7, 2026
@happz happz added the status | blocked The merging of PR is blocked on some other issue label Jan 7, 2026
@mkoncek
Copy link
Collaborator Author

mkoncek commented Jan 21, 2026

@happz Rebased on the current HEAD.

@mkoncek
Copy link
Collaborator Author

mkoncek commented Feb 23, 2026

The primary cause of this bug was the assumption that each finished command prints a single newline on stdout after it is finished. This turned out not to be true in some environments. So we print a binary zero after the initial commands are finished and then loop until we read that binary zero back.

Later the command completion is detected by returncode being written into its pre-determined file, instead of waiting on a newline on stdout again.

@LecrisUT LecrisUT removed the status | blocked The merging of PR is blocked on some other issue label Feb 23, 2026
@mkoncek
Copy link
Collaborator Author

mkoncek commented Feb 23, 2026

Also:

        self.mock_shell.stdin.write(''.join(command + '\n' for command in commands))
        # Issue a command writing a binary zero on the standard output after all
        # the previous commands are finished.
        self.mock_shell.stdin.write('echo -e \\\\x00\n')

could be rewritten as:

        self.mock_shell.stdin.write('\n'.join(commands))
        # Issue a command writing a binary zero on the standard output after all
        # the previous commands are finished.
        self.mock_shell.stdin.write('\necho -e \\\\x00\n')

And then we could use a different string to mark the end of commands.

Afaik in that initialization we only execute things like rm, mkdir, mkfifo and such so there are no arbitrary strings written to stdout.

@mkoncek
Copy link
Collaborator Author

mkoncek commented Feb 26, 2026

@LecrisUT I went over your concerns and tried to apply fixes.

Copy link
Contributor

@LecrisUT LecrisUT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good enough for me. Could you rebase please?

Unrelated, but can you also open a PR to revert and give more context on what needs to be done?

# TODO Disable for now under the mock provision plan because of
# https://bugzilla.redhat.com/show_bug.cgi?id=2415701
#- provision-mock

Would prefer to be explicit on the AVC checks that are skipped, but when I tried to revert that, it seemed to have some more genuine errors in it.

@LecrisUT LecrisUT removed their assignment Mar 13, 2026
@thrix thrix force-pushed the mock-process-fix branch from 8a7c7df to 3f5cf6d Compare March 15, 2026 20:28
@thrix
Copy link
Contributor

thrix commented Mar 15, 2026

Looks good enough for me. Could you rebase please?

I rebased it.

Copy link
Contributor

@thrix thrix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Mock provision — make detection of finished processes more deterministic

Reviewed against commit 3f5cf6d.

Summary

The PR fixes a real bug where mock shell environments produce varying numbers of spurious newlines on stdout, causing the previous counting-based completion detection to hang. The fix introduces two improvements:

  1. In _simple_execute: uses a sentinel keyword (TMT_FINISHED_EXEC) echoed to stdout instead of counting newlines
  2. In _spawn_command: uses the returncode file fd instead of mock shell stdout for completion detection

The overall approach is sound and addresses the root cause well. However, there's a bug in the sentinel detection that could cause the same hanging behavior.

Issues

1. _simple_execute sentinel comparison is fragile — potential hang (bug)

The sentinel check uses exact == equality on self.mock_shell.stdout.read(). Since stdout is a non-blocking TextIOWrapper, read() returns all currently available data. If spurious newlines and the sentinel arrive in the same pipe buffer (very likely since all commands are fast and flushed at once), read() returns e.g. '\n\n\n\nTMT_FINISHED_EXEC\n' which does NOT equal 'TMT_FINISHED_EXEC\n'. The loop hangs.

Verified with a Python test demonstrating the behavior. Fix: use endswith() instead of ==. See inline comment.

2. Stale comment (minor)

"Wait until we read the binary zero from stdout" refers to a previous implementation. Should say "finished keyword" or "sentinel". See inline comment.

Observations

3. _spawn_command: returncode_fd not in the event loop

When returncode_fd has data alongside other events, the data is left unread until it becomes the sole event. Works in practice but is less responsive than the old code. See inline comment.

Verdict

The approach is correct and a clear improvement over counting newlines. Issue #1 (fragile == comparison) must be fixed before merging — it could reproduce the exact hang this PR aims to solve.

Generated-by: Claude Code

Comment on lines +194 to +197
if (
fileno == self.mock_shell_stdout_fd
and self.mock_shell.stdout.read() == f'{finished_keyword}\n'
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exact == comparison is fragile and could reproduce the same hang this PR aims to fix. self.mock_shell.stdout is a non-blocking TextIOWrapper (text=True + O_NONBLOCK), so read() returns all currently available data.

If the mock shell produces spurious newlines (the very bug being fixed) AND the sentinel arrives in the same pipe buffer — which is very likely since rm, mkdir, mkfifo, chmod, and echo are all fast commands flushed to stdin at once — read() returns something like '\n\n\n\nTMT_FINISHED_EXEC\n', which does NOT equal 'TMT_FINISHED_EXEC\n'. The comparison fails and the loop hangs.

I verified this with a quick Python test:

p = subprocess.Popen(['bash', '-c', 'echo a; echo b; echo TMT_FINISHED_EXEC'],
                      stdout=subprocess.PIPE, text=True)
time.sleep(0.1)
fcntl.fcntl(p.stdout.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)
data = p.stdout.read()
# data == 'a\nb\nTMT_FINISHED_EXEC\n'  -- does NOT == 'TMT_FINISHED_EXEC\n'

Fix: use endswith() instead of ==:

and self.mock_shell.stdout.read().endswith(f'{finished_keyword}\n')

Generated-by: Claude Code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What? Why? That seems equally fragile. A split would be better recommendation over endswith if the concern is about other lines contamination.

PS: the generated comment is way too verbose and distracting. Please instruct it to be more concise next time. It took me 5 readings to understand what it is trying to say

loop = len(commands)
while loop != 0 and self.mock_shell.poll() is None:
# Wait until we read the binary zero from stdout.
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment — this refers to a previous implementation that used echo -e \\\\x00. Should say "finished keyword" or "sentinel" instead of "binary zero".

Generated-by: Claude Code

@psss psss added this to the 1.70 milestone Mar 16, 2026
Comment on lines +421 to +424
if len(events) == 1 and events[0][0] == returncode_fd:
content = os.read(returncode_fd, 16)
returncode = int(content.decode('utf-8').strip())
returncode_io.try_unregister()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical bug: Missing empty content check before parsing returncode. If os.read(returncode_fd, 16) returns empty bytes (EOF or closed file), the code attempts int(''.strip()) which raises ValueError: invalid literal for int() with base 10: ''.

Fix: Check for empty content before parsing:

if len(events) == 1 and events[0][0] == returncode_fd:
    content = os.read(returncode_fd, 16)
    if not content:
        returncode_io.try_unregister()
        continue
    returncode = int(content.decode('utf-8').strip())
    returncode_io.try_unregister()
    break
Suggested change
if len(events) == 1 and events[0][0] == returncode_fd:
content = os.read(returncode_fd, 16)
returncode = int(content.decode('utf-8').strip())
returncode_io.try_unregister()
if len(events) == 1 and events[0][0] == returncode_fd:
content = os.read(returncode_fd, 16)
if not content:
returncode_io.try_unregister()
continue
returncode = int(content.decode('utf-8').strip())
returncode_io.try_unregister()
break

Spotted by Graphite

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case maybe i can re-throw a ProvisionError?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, when would the content be empty?

Btw, can you also give a quick explanation of the events output here and what the (fd, event) tuple looks like?

Copy link
Collaborator Author

@mkoncek mkoncek Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be empty if the mock shell process was killed abruptly and between the other checks...

You mean add a comment in code? Because events is exactly the output of epoll.poll so I don't really want to explain what is already in epoll-s documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be empty if the mock shell process was killed abruptly and between the other checks...

That I gathered from the documentation part

An empty list indicates that the call timed out and no file descriptors had any events to report.

You mean add a comment? Because events is exactly the output of epoll.poll so I don't really want to explain what is already in epoll-s documentation.

Well there are a few things confusing with the current documentation state:

  • It is unclear when the events length is more than 1 and what the other list members would have
  • The event in the tuple is never used and upstream documentation is not really clear on what these events actually are
  • The epoll.poll documentation is really technical. An overview documentation on how this works would be good, or a reference to a minimal project, example, or equivalents would be good to understand the overall flow

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allright, I have a different point of view as the implementor :)
I will add some explanations.

@mkoncek

This comment was marked as resolved.

@LecrisUT

This comment was marked as resolved.

@mkoncek

This comment was marked as resolved.

@thrix thrix self-requested a review March 16, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working ci | full test Pull request is ready for the full test execution plugin | mock Related to mock provision plugin step | provision Stuff related to the provision step

Projects

Status: review

Development

Successfully merging this pull request may close these issues.

5 participants