Mock provision: make detection of finished processes more deterministic by mkoncek · Pull Request #4437 · teemtee/tmt

mkoncek · 2025-12-17T13:21:19Z

Based on: #4173
I found out that various environments (such as fedora-rawhide-x86_64) print a different number of empty lines on their stdout as the commands spawned in the chroots finish. This causes some environments to hang.
This PR makes the detection more deterministic.


mkoncek · 2026-01-21T09:48:49Z

@happz Rebased on the current HEAD.

tmt/steps/provision/mock.py

mkoncek · 2026-02-23T12:40:30Z

The primary cause of this bug was the assumption that each command prints a single newline on stdout when it finishes. This turned out not to hold in some environments. So we now print a binary zero after the initial commands finish and then loop until we read that binary zero back.
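A minimal sketch of the end-marker idea, assuming a plain `sh` process stands in for the mock shell; `run_until_nul` is a hypothetical helper, not tmt code. It uses `printf '\000'` instead of `echo -e \\x00` because POSIX `sh` does not guarantee `echo -e`. The loop stops at the NUL byte no matter how many newlines arrive before it.

```python
import subprocess


def run_until_nul(commands: list[str]) -> bytes:
    """Run commands in one long-lived shell; return their stdout up to a NUL marker."""
    shell = subprocess.Popen(['sh'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    assert shell.stdin is not None and shell.stdout is not None
    script = ''.join(command + '\n' for command in commands)
    # Runs only after every previous command has finished: emit a single NUL byte.
    script += "printf '\\000'\n"
    shell.stdin.write(script.encode())
    shell.stdin.flush()
    received = bytearray()
    # Read whatever is available until the marker shows up; the number of
    # (possibly spurious) newlines before it no longer matters.
    while b'\x00' not in received:
        chunk = shell.stdout.read1(4096)
        if not chunk:
            raise RuntimeError('shell exited before printing the end marker')
        received += chunk
    shell.stdin.close()
    shell.wait()
    shell.stdout.close()
    return bytes(received[: received.index(b'\x00')])
```

Spurious empty lines simply end up in the returned bytes instead of being counted as completed commands.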

Later, command completion is detected by the return code being written into its predetermined file, instead of by waiting for another newline on stdout.
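The return-code half can be sketched the same way. This assumes the predetermined file is a FIFO (the initialization commands mentioned in this thread include mkfifo, but the exact tmt layout is not shown here); `run_with_returncode_fifo` is hypothetical. Completion is signalled by the status arriving on the FIFO, so stray output on stdout cannot confuse it.

```python
import os
import shlex
import subprocess
import tempfile


def run_with_returncode_fifo(command: str) -> int:
    """Run one command in a shell and learn its exit status from a FIFO, not from stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, 'returncode')
        os.mkfifo(fifo)
        shell = subprocess.Popen(
            ['sh'], stdin=subprocess.PIPE, stdout=subprocess.DEVNULL, text=True
        )
        assert shell.stdin is not None
        # The shell writes $? into the FIFO once the command has finished.
        shell.stdin.write(f'{command}\necho $? > {shlex.quote(fifo)}\n')
        shell.stdin.flush()
        # open() blocks until the shell opens the FIFO for writing, and read()
        # returns at EOF, i.e. once the shell has written the status and closed it.
        with open(fifo) as handle:
            content = handle.read().strip()
        shell.stdin.close()
        shell.wait()
    if not content:
        raise RuntimeError('the shell closed the FIFO without writing a return code')
    return int(content)
```

A real implementation would poll with a timeout rather than block in `open()`, since a shell that dies early never opens the FIFO.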

tmt/steps/provision/mock.py

mkoncek · 2026-02-23T13:34:29Z

Also:

        self.mock_shell.stdin.write(''.join(command + '\n' for command in commands))
        # Issue a command writing a binary zero on the standard output after all
        # the previous commands are finished.
        self.mock_shell.stdin.write('echo -e \\\\x00\n')

could be rewritten as:

        self.mock_shell.stdin.write('\n'.join(commands))
        # Issue a command writing a binary zero on the standard output after all
        # the previous commands are finished.
        self.mock_shell.stdin.write('\necho -e \\\\x00\n')
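To check the proposed rewrite, the two hypothetical helpers below build the stdin payload both ways. They produce identical strings for any non-empty `commands`; for an empty list the joined form only adds a leading empty line, which the shell treats as a no-op.

```python
# Same Python literal as in the snippets above.
END_MARKER = 'echo -e \\\\x00\n'


def payload_per_line(commands: list[str]) -> str:
    # Current form: every command gets its own trailing newline.
    return ''.join(command + '\n' for command in commands) + END_MARKER


def payload_joined(commands: list[str]) -> str:
    # Proposed form: newlines only between commands, plus one before the marker.
    return '\n'.join(commands) + '\n' + END_MARKER
```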

And then we could use a different string to mark the end of commands.

AFAIK, that initialization only runs things like rm, mkdir and mkfifo, so no arbitrary strings are written to stdout.

tmt/steps/provision/mock.py

mkoncek · 2026-02-26T09:41:15Z

@LecrisUT I went over your concerns and tried to apply fixes.

LecrisUT

Looks good enough for me. Could you rebase please?

Unrelated, but could you also open a PR that reverts this and gives more context on what needs to be done?

tmt/tests/prepare/shell/main.fmf

Lines 9 to 11 in 644c58c

        # TODO Disable for now under the mock provision plan because of
        # https://bugzilla.redhat.com/show_bug.cgi?id=2415701
        #- provision-mock

I would prefer to be explicit about which AVC checks are skipped, but when I tried to revert that, it seemed to surface some more genuine errors.

thrix · 2026-03-15T20:29:05Z

Looks good enough for me. Could you rebase please?

I rebased it.

thrix

Code Review: Mock provision — make detection of finished processes more deterministic

Reviewed against commit 3f5cf6d.

Summary

The PR fixes a real bug where mock shell environments produce varying numbers of spurious newlines on stdout, causing the previous counting-based completion detection to hang. The fix introduces two improvements:

In _simple_execute: uses a sentinel keyword (TMT_FINISHED_EXEC) echoed to stdout instead of counting newlines
In _spawn_command: uses the returncode file fd instead of mock shell stdout for completion detection

The overall approach is sound and addresses the root cause well. However, there's a bug in the sentinel detection that could cause the same hanging behavior.

Issues

1. _simple_execute sentinel comparison is fragile — potential hang (bug)

The sentinel check uses exact == equality on self.mock_shell.stdout.read(). Since stdout is a non-blocking TextIOWrapper, read() returns all currently available data. If spurious newlines and the sentinel arrive in the same pipe buffer (very likely since all commands are fast and flushed at once), read() returns e.g. '\n\n\n\nTMT_FINISHED_EXEC\n' which does NOT equal 'TMT_FINISHED_EXEC\n'. The loop hangs.

Verified with a Python test demonstrating the behavior. Fix: use endswith() instead of ==. See inline comment.

2. Stale comment (minor)

"Wait until we read the binary zero from stdout" refers to a previous implementation. Should say "finished keyword" or "sentinel". See inline comment.

Observations

3. _spawn_command: returncode_fd not in the event loop

When returncode_fd has data alongside other events, the data is left unread until it becomes the sole event. This works in practice but is less responsive than the old code. See inline comment.
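For this observation, a sketch of a loop that dispatches on every `(fd, mask)` pair returned by `epoll.poll()`, so the return code is consumed in the same iteration as other events. `wait_for_returncode` and its parameters are hypothetical; plain pipe descriptors stand in for the mock shell stdout and the return code file.

```python
import os
import select


def wait_for_returncode(stdout_fd: int, returncode_fd: int, timeout: float = 5.0) -> tuple[bytes, int]:
    """Collect stdout until the return code arrives, handling every ready fd per poll()."""
    epoll = select.epoll()
    epoll.register(stdout_fd, select.EPOLLIN)
    epoll.register(returncode_fd, select.EPOLLIN)
    output = bytearray()
    returncode = None
    try:
        while returncode is None:
            events = epoll.poll(timeout)
            if not events:
                raise TimeoutError(f'nothing happened within {timeout} s')
            # Dispatch on each (fd, mask) pair instead of only acting when
            # returncode_fd happens to be the single ready descriptor.
            for fd, _mask in events:
                if fd == stdout_fd:
                    chunk = os.read(fd, 4096)
                    if chunk:
                        output += chunk
                    else:
                        epoll.unregister(fd)  # EOF: the writer side is gone
                elif fd == returncode_fd:
                    content = os.read(fd, 16)
                    if not content:
                        raise RuntimeError('return code channel closed without data')
                    returncode = int(content.decode('utf-8').strip())
    finally:
        epoll.close()
    return bytes(output), returncode
```

It reads at most 4096 bytes of stdout per event, so output still buffered when the return code arrives is left unread; a fuller version would keep draining stdout before returning.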

Verdict

The approach is correct and a clear improvement over counting newlines. Issue #1 (fragile == comparison) must be fixed before merging — it could reproduce the exact hang this PR aims to solve.

Generated-by: Claude Code

thrix · 2026-03-15T20:45:46Z

tmt/steps/provision/mock.py

+                if (
+                    fileno == self.mock_shell_stdout_fd
+                    and self.mock_shell.stdout.read() == f'{finished_keyword}\n'
+                ):


This exact == comparison is fragile and could reproduce the same hang this PR aims to fix. self.mock_shell.stdout is a non-blocking TextIOWrapper (text=True + O_NONBLOCK), so read() returns all currently available data.

If the mock shell produces spurious newlines (the very bug being fixed) AND the sentinel arrives in the same pipe buffer — which is very likely since rm, mkdir, mkfifo, chmod, and echo are all fast commands flushed to stdin at once — read() returns something like '\n\n\n\nTMT_FINISHED_EXEC\n', which does NOT equal 'TMT_FINISHED_EXEC\n'. The comparison fails and the loop hangs.

I verified this with a quick Python test:

        import fcntl
        import os
        import subprocess
        import time

        p = subprocess.Popen(['bash', '-c', 'echo a; echo b; echo TMT_FINISHED_EXEC'],
                             stdout=subprocess.PIPE, text=True)
        time.sleep(0.1)
        flags = fcntl.fcntl(p.stdout.fileno(), fcntl.F_GETFL)
        fcntl.fcntl(p.stdout.fileno(), fcntl.F_SETFL, flags | os.O_NONBLOCK)
        data = p.stdout.read()
        # data == 'a\nb\nTMT_FINISHED_EXEC\n', which does NOT == 'TMT_FINISHED_EXEC\n'

Fix: use endswith() instead of ==:

        and self.mock_shell.stdout.read().endswith(f'{finished_keyword}\n')

Generated-by: Claude Code

What? Why? That seems equally fragile. A split would be a better recommendation than endswith if the concern is contamination from other lines.
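A sketch of the split-based check, assuming chunks come from repeated non-blocking `read()` calls; `SentinelDetector` is a hypothetical name. Splitting the accumulated buffer into complete lines tolerates leading spurious newlines and also a sentinel split across two reads. Neither `==` nor a per-read `endswith()` handles the split case, and nothing guarantees that a single `read()` returns the sentinel in one piece.

```python
class SentinelDetector:
    """Accumulate chunks from non-blocking reads; report once the sentinel line is complete."""

    def __init__(self, sentinel: str) -> None:
        self.sentinel = sentinel
        self.pending = ''  # trailing partial line carried over between reads

    def feed(self, chunk: str) -> bool:
        self.pending += chunk
        # Everything before the last '\n' is complete lines; the rest may be
        # the first half of the sentinel and must wait for the next read.
        *lines, self.pending = self.pending.split('\n')
        return self.sentinel in lines
```

For example, feeding `'\n\nTMT_FINI'` and then `'SHED_EXEC\n'` reports the sentinel only on the second call, while comparing each chunk with `==` or `endswith()` would miss it entirely.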

PS: the generated comment is way too verbose and distracting. Please instruct it to be more concise next time. It took me five readings to understand what it was trying to say.

thrix · 2026-03-15T20:45:47Z

tmt/steps/provision/mock.py

-        loop = len(commands)
-        while loop != 0 and self.mock_shell.poll() is None:
+        # Wait until we read the binary zero from stdout.
+        while True:


Stale comment — this refers to a previous implementation that used echo -e \\\\x00. Should say "finished keyword" or "sentinel" instead of "binary zero".

Generated-by: Claude Code


graphite-app · 2026-03-16T09:20:41Z

tmt/steps/provision/mock.py

+                if len(events) == 1 and events[0][0] == returncode_fd:
+                    content = os.read(returncode_fd, 16)
+                    returncode = int(content.decode('utf-8').strip())
+                    returncode_io.try_unregister()


Critical bug: Missing empty content check before parsing returncode. If os.read(returncode_fd, 16) returns empty bytes (EOF or closed file), the code attempts int(''.strip()) which raises ValueError: invalid literal for int() with base 10: ''.

Fix: Check for empty content before parsing:

        if len(events) == 1 and events[0][0] == returncode_fd:
            content = os.read(returncode_fd, 16)
            if not content:
                returncode_io.try_unregister()
                continue
            returncode = int(content.decode('utf-8').strip())
            returncode_io.try_unregister()
            break


Spotted by Graphite


In that case, maybe I could raise a ProvisionError?
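One way to turn the empty read into a clear error, sketched with a local `ProvisionError` stand-in because tmt's own class is not shown here; `parse_returncode` is hypothetical. Both an empty result (EOF) and non-numeric content raise the provision error instead of surfacing as a bare `ValueError`.

```python
class ProvisionError(Exception):
    """Hypothetical stand-in for tmt's ProvisionError so this sketch runs on its own."""


def parse_returncode(content: bytes) -> int:
    """Turn the bytes read from the return code file into an int, or fail loudly."""
    text = content.decode('utf-8').strip()
    if not text:
        # os.read() returned b'' (EOF): the shell went away before writing a status.
        raise ProvisionError('mock shell ended without writing a return code')
    try:
        return int(text)
    except ValueError as error:
        raise ProvisionError(f'unexpected return code content: {text!r}') from error
```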

Maybe. When would the content be empty?

Btw, can you also give a quick explanation of the events output here and what the (fd, event) tuple looks like?

It would be empty if the mock shell process was killed abruptly and between the other checks...

Do you mean adding a comment in the code? events is exactly the output of epoll.poll, so I don't really want to re-explain what is already in the epoll documentation.

It would be empty if the mock shell process was killed abruptly and between the other checks...

That much I gathered from this part of the documentation:

An empty list indicates that the call timed out and no file descriptors had any events to report.

You mean add a comment? Because events is exactly the output of epoll.poll so I don't really want to explain what is already in epoll-s documentation.

Well, a few things are confusing in the current state of the documentation:

It is unclear when events has more than one entry and what the other entries would contain.

The event in the tuple is never used, and the upstream documentation is not really clear on what these events actually are.

The epoll.poll documentation is very technical. An overview of how this works, or a reference to a minimal project or example, would help in understanding the overall flow.
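A small experiment that answers these questions on Linux: `epoll.poll()` returns a list of `(fd, event_mask)` tuples, one per registered descriptor that is ready, where the mask is a bitwise OR of `select.EPOLL*` flags. The list is empty on timeout, has more than one entry when several descriptors are ready at once, and can carry `EPOLLHUP` even if only `EPOLLIN` was requested. `epoll_demo` is a hypothetical helper using two plain pipes.

```python
import os
import select


def epoll_demo():
    """Return the fds and three epoll.poll() results for two pipes in different states."""
    r1, w1 = os.pipe()
    r2, w2 = os.pipe()
    epoll = select.epoll()
    try:
        epoll.register(r1, select.EPOLLIN)
        epoll.register(r2, select.EPOLLIN)
        idle = epoll.poll(0)        # timeout 0: return immediately; nothing is ready
        os.write(w1, b'data')
        one_ready = epoll.poll(0)   # a single (fd, event_mask) tuple for r1
        os.close(w2)                # closing the only writer makes r2 report a hang-up
        both_ready = epoll.poll(0)  # r1 still has unread data, r2 has EPOLLHUP
        return (r1, r2), idle, one_ready, both_ready
    finally:
        epoll.close()
        for fd in (r1, w1, r2):
            os.close(fd)
```

The order of tuples in the list is not something to rely on, so converting it with `dict(events)` is a convenient way to look up a particular descriptor.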

Alright, I have a different point of view as the implementer :)
I will add some explanations.

mkoncek added the bug, step | provision, and plugin | mock labels Dec 17, 2025

happz added the ci | full test label Jan 7, 2026

happz added this to planning Jan 7, 2026

github-project-automation bot moved this to backlog in planning Jan 7, 2026

happz moved this from backlog to review in planning Jan 7, 2026

happz added the status | blocked label Jan 7, 2026

mkoncek force-pushed the mock-process-fix branch from fdead6f to 702c1c4 on January 21, 2026 09:08

mkoncek force-pushed the mock-process-fix branch from 702c1c4 to c640c33 on February 23, 2026 12:24

LecrisUT reviewed Feb 23, 2026

tmt/steps/provision/mock.py (outdated, resolved)

mkoncek force-pushed the mock-process-fix branch from c640c33 to 6604f03 on February 23, 2026 12:37

graphite-app bot reviewed Feb 23, 2026

tmt/steps/provision/mock.py (resolved)

LecrisUT reviewed Feb 23, 2026

tmt/steps/provision/mock.py (outdated, resolved)

tmt/steps/provision/mock.py (outdated, resolved)

LecrisUT removed the status | blocked label Feb 23, 2026

LecrisUT reviewed Feb 23, 2026

tmt/steps/provision/mock.py (outdated, resolved)

LecrisUT assigned LecrisUT and happz Mar 12, 2026

LecrisUT approved these changes Mar 13, 2026


LecrisUT removed their assignment Mar 13, 2026

mkoncek added 2 commits March 15, 2026 21:28

mock provision: make detection of finished processes more deterministic

07c2c86

mock provision: Refactor

3f5cf6d

thrix force-pushed the mock-process-fix branch from 8a7c7df to 3f5cf6d on March 15, 2026 20:28

thrix requested changes Mar 15, 2026


psss added this to the 1.70 milestone Mar 16, 2026

mock provision: Make command finish detection more reliable

28974ed

graphite-app bot reviewed Mar 16, 2026

This comment was marked as resolved.

mkoncek added 2 commits March 16, 2026 14:14

mock provision: Improve code-level documentation

9f914b9

mock provision: Improve mock shell missing returncode handling

6f9ce99

thrix self-requested a review March 16, 2026 14:18



5 participants
