Skip to content

fix: poll deviceconnect websocket instead of fixed sleeps#2881

Open
pasinskim wants to merge 8 commits into
masterfrom
docker-fix
Open

fix: poll deviceconnect websocket instead of fixed sleeps#2881
pasinskim wants to merge 8 commits into
masterfrom
docker-fix

Conversation

@pasinskim
Copy link
Copy Markdown
Member

The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness.

Related to QA-1527 and QA-1563.

@mender-test-bot
Copy link
Copy Markdown
Contributor

@pasinskim, start a full integration test pipeline with:

  • mentioning me and start integration pipeline

my commands and options

You can prevent me from automatically starting CI pipelines:

  • if your pull request title starts with "[NoCI] ..."

You can trigger a client pipeline on multiple prs with:

  • mentioning me and start client pipeline --pr mender/127 --pr mender-connect/255

You can trigger a client pipeline for a specific Mender Client release with:

  • mentioning me and start client pipeline --release 6.0.x (can be given multiple times)
  • by default, a pipeline is triggered for each supported release the component is a part of

You can trigger GitHub->GitLab branch sync with:

  • mentioning me and sync

You can print PR statistics for a repository with:

  • mentioning me and print fast pr stats (Team stats only)
  • mentioning me and print full pr stats (Detailed report)
  • options: --repo <repo>, --team <name>, --all-repos, --exclude-drafts, --exclude-user <user>
  • mentioning me and print full pr stats --repo mender --all-repos --exclude-drafts

You can cherry pick to a given branch or branches with:

  • mentioning me and:
 cherry-pick to:
 * 1.0.x
 * 2.0.x

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2455059455

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim pasinskim force-pushed the docker-fix branch 2 times, most recently from 8924c9c to 23fae41 Compare April 15, 2026 14:01
Copy link
Copy Markdown
Contributor

@merlin-northern merlin-northern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with that change, two comments:

  • there was a reason I mentioned the HEALTHCHECK on the Mender Docker Client: let's try to consider introducing it with the dbus cli to make the container healthy when the mender-auth responds over dbus (as was the conclusion of the discussion).
  • the pipeline failed: https://gitlab.com/Northern.tech/Mender/integration/-/jobs/13944335034 why was it not reported?

Comment thread tests/tests/test_mender_connect.py Outdated
Comment on lines +215 to +229
def open_websocket_with_retry(timeout_s=180, interval_s=5):
# The docker-client fixture is function-scoped, so each test starts
# with a freshly-bootstrapped client whose deviceconnect session may
# not yet be established. Poll get_websocket() until it succeeds.
deadline = time.monotonic() + timeout_s
last_err = None
while time.monotonic() < deadline:
try:
return docker_env.devconnect.get_websocket()
except Exception as e:
last_err = e
time.sleep(interval_s)
raise AssertionError(
f"deviceconnect websocket never became available: {last_err}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet. usually we used a lib for that. but it is ok.

Copy link
Copy Markdown
Contributor

@elkoniu elkoniu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread tests/tests/test_mender_connect.py Outdated
while time.monotonic() < deadline:
try:
return docker_env.devconnect.get_websocket()
except Exception as e:
Copy link
Copy Markdown
Contributor

@p-targowicz p-targowicz Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but one thing --> do we want to catch all exceptions ? perhaps if there's some specific error idk we want to kill it instantly ? if not that's fine 🥇

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2457447851

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

Comment thread tests/tests/test_filetransfer.py Outdated
Comment on lines +110 to +114
mender_device.run(
"kill -TERM `pidof %s` 2>/dev/null; "
"while pidof %s >/dev/null 2>&1; do sleep 0.1; done"
% (connect_service_name, connect_service_name)
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this potentially can hang forever, I would introduce some way to bail out.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlin-northern thanks. That makes sense. Fixed!

Copy link
Copy Markdown
Contributor

@merlin-northern merlin-northern Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, and there is one more thing: touch /stop-respawn, remember?

Copy link
Copy Markdown
Member Author

@pasinskim pasinskim Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Actually how about

# Kill mender-connect and let the entrypoint's supervise loop restart
 # it automatically with the updated config.
 mender_device.run("kill `pidof %s` 2>/dev/null" % connect_service_name)

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline but please don't fail it this time!

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458219452

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline but please don't fail it this time!

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458246630

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2458414545

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2484811465

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim pasinskim force-pushed the docker-fix branch 4 times, most recently from b35e9c8 to d26959c Compare May 8, 2026 13:37
@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2556462905

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@merlin-northern
Copy link
Copy Markdown
Contributor

start integration pipeline

@pasinskim do not forget to start a couple of pipelines in the 4.1.x -- thats where the problems are (I think 4.1.x needs some cherry picks).

@pasinskim pasinskim force-pushed the docker-fix branch 2 times, most recently from e3ed755 to fb7a02c Compare May 27, 2026 21:19
@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2557275567

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@p-targowicz
Copy link
Copy Markdown
Contributor

I think we need to coordinate our merging @pasinskim my PR also touches the same files as yours I don't mind merging yours and then mine what's left untouched but we definitely cant merge both as we have overlapping changes #2900

@pasinskim pasinskim force-pushed the docker-fix branch 2 times, most recently from 70990ad to b6fa95f Compare May 28, 2026 12:52
@pasinskim
Copy link
Copy Markdown
Member Author

@mender-test-bot start integration pipeline

@mender-test-bot
Copy link
Copy Markdown
Contributor

Hello 😺 I created a pipeline for you here: Pipeline-2559274369

Build Configuration Matrix

Key Value
INTEGRATION_REV pull/2881/head
RUN_TESTS_FULL_INTEGRATION true

@pasinskim
Copy link
Copy Markdown
Member Author

start integration pipeline

@pasinskim do not forget to start a couple of pipelines in the 4.1.x -- thats where the problems are (I think 4.1.x needs some cherry picks).

Passed with 4.1.x here: https://gitlab.com/Northern.tech/Mender/integration/-/pipelines/2558962191

@pasinskim
Copy link
Copy Markdown
Member Author

I think we need to coordinate our merging @pasinskim my PR also touches the same files as yours I don't mind merging yours and then mine what's left untouched but we definitely cant merge both as we have overlapping changes #2900

I did run it with 4.1.x. Now will try again with main with the latest changes and we can discuss what to do.

Copy link
Copy Markdown
Contributor

@p-targowicz p-targowicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mender-test-bot
Copy link
Copy Markdown
Contributor

mender-test-bot commented May 29, 2026

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

  • Poll the deviceconnect /connect endpoint for shell
    readiness in the remote-terminal tests, and restart mender-connect
    after the network-heal step in test_in_poor_network_environment to
    avoid its 30-minute auto-recovery backoff cap.
    (QA-1527, QA-1563, QA-1591)

Copy link
Copy Markdown
Contributor

@merlin-northern merlin-northern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, this has lots of good work, but one thing is kind of breaking.

Comment on lines +298 to +300
docker_env.device.run(
"kill -TERM `pidof mender-connect` 2>/dev/null || true"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry but this breaks the idea of this test. we absolutely cannot restart mender-connect in here. by construction it has to heal the connection by itself. this is the main purpose of this test, and the main purpose of the Mender Troubleshooting Add-on -- we cannot leave the device without access to the server. and by the way this is what all the fuss here is all about. 4.1.x misses some features that fixed deviceconnect around error handling and I think that the correct path is to try to cherry-pick. could you give me more time to debug it over the weekend until Monday?

@mender-test-bot
Copy link
Copy Markdown
Contributor

mender-test-bot commented May 29, 2026

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

  • Poll the deviceconnect /connect endpoint for shell
    readiness in the remote-terminal tests, and restart mender-connect
    after the network-heal step in test_in_poor_network_environment to
    avoid its 30-minute auto-recovery backoff cap.
    (QA-1527, QA-1563, QA-1591)

pasinskim added 8 commits May 29, 2026 10:25
The docker-client fixture is function-scoped (since 59f3afc), so the
deviceconnect tests start each run with a freshly-bootstrapped client
and race the deviceconnect session handshake. The original tests
papered over this with fixed sleeps (e.g. 128s after iptables -D, 10s
after restarting deviceconnect) that are both slow and unreliable: the
docker-client's network stack varies, TCP RTO compounds the recovery
time on the OUTPUT -j DROP rule (no ICMP/RST is sent back), and the
management /connect endpoint returns HTTP 404 ("device disconnected")
until the device side is back -- which no fixed sleep can guarantee.

Poll the actual readiness instead: retry the full open-websocket +
startShell + verify-shell sequence via redo.retriable, catching the
404 (websockets.WebSocketException) and recv TimeoutError, until a
working shell is established. Tolerate the narrow "shell from a
previous attempt is still running" race that can occur if a retry
fires after a successful startShell flapped mid-verify (the device's
shell limit is 1).

For test_in_poor_network_environment specifically, after iptables -D
we restart mender-connect on the device. mender-connect's reconnect
backoff caps at 30 minutes (connectionmanager/exponentialbackoff.go
in mender-connect) and resetBackoff() is only invoked on a successful
connection -- so after a 128s outage, compounded by TCP RTO on the
DROP rule, the next reconnect attempt can be minutes away, far beyond
a reasonable test budget. Killing the process lets the entrypoint's
supervise loop respawn it at attempts=0 so it reconnects on its first
cycle. This is a deliberate test-semantics change: the assertion is
now that mender-connect re-establishes its deviceconnect session
after a restart following the network heal (which is what a
watchdog/operator would do in practice), rather than waiting out the
autonomous backoff timer. Mirrors the kill+respawn pattern in
test_filetransfer.update_limits().

The same retry pattern is applied to test_websocket_reconnect, which
suffered an identical 404 race after restarting mender-deviceconnect.

Related to QA-1527, QA-1563 and QA-1591.

Changelog: Poll the deviceconnect /connect endpoint for shell
readiness in the remote-terminal tests, and restart mender-connect
after the network-heal step in test_in_poor_network_environment to
avoid its 30-minute auto-recovery backoff cap.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove the non-existent PreserveGroup config key. Mender-connect's
PreserveOwner already covers both owner and group via os.Chown(uid, gid).
The unknown JSON field was silently ignored.

Also move the status code assertion before the file check so that
upload failures (e.g. 408 timeout) are diagnosed directly instead of
producing a confusing "assert '' == '101 102'" error.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The _run() SSH retry loop had three problems:
- Slept before the first attempt (1s wasted if already connected)
- Exponential backoff with no cap (1,2,4,8,16,32,64,128s...)
- Extra time.sleep(60) on ConnectionError on top of the backoff

After 7 failures the gap between retries was 128s; with the 60s
penalty a single transient SSH drop could stall for minutes.

Fix: try first then sleep, cap backoff at 30s, remove the redundant
60s ConnectionError penalty. Retry sequence is now 1,2,4,8,16,30,30...

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Four issues in deployments.py:

1. check_expected_statistics: slept 200ms before the first check on
   every iteration. With 59 call sites this adds up.

2. check_not_in_status: had a for/else logic bug where 'continue'
   targeted the inner for-loop instead of the outer while-loop,
   making the function silently return success in some edge cases.
   Replaced with straightforward any() check.

3. abort() and abort_finished_deployment(): both had a blind sleep(5)
   between the PUT response and the assertion on that same response's
   status code. This was completely pointless since r.status_code is   already set. Saves 5s per call (4 call sites = 20s).

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
_check_log_for_message: used a fixed 10s polling interval with
sleep-before-check. If the log message appeared within 1s, the test
still waited 10s. Replace with check-first and exponential backoff
(2,4,8,10,10...) saving ~8s per call across 7 call sites.

requests_retry: only retried on 500/502 but not 503 (Service
Unavailable) or 504 (Gateway Timeout), which are common during
container startup when the gateway is up but backends are still
loading. Adding these prevents false test failures during setup.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Convert all 4 filetransfer test classes from function-scoped to
class-scoped fixtures so each class spins up the ~21-container Docker
Compose stack once instead of per-test. This saves ~80% of the wall
time for the filetransfer suite (verified: 6 tests in 97s vs ~580s).

Follows the existing _impl + class_persistent_ pattern already used
for QEMU client fixtures.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove test_db_migration.py and its v1 legacy client fixtures from
common_setup.py. These tests verified upgrading from Mender v1.7.0
(released 2017) to the current version. This migration path that is no
longer supported. The fixtures were only used by this test file.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
test_monitorclient_remove_old_alerts read the alert store once at a
fixed offset after starting mender-monitor and asserted an exact key
count. mender-monitor purges expired records on a periodic loop with
variable restart latency, so the single timed read raced the purge
cycle. The faster mender_device.run() from the SSH backoff change
(no longer sleeping 1s before every healthy command) shifted the read
~2s earlier and collapsed the timing margin, surfacing the latent
flakiness (seen as '4' == '2' and '1' == '0').

Poll until the store drains to the expected count instead. It drains
monotonically (4 -> 2 -> 0), so 1s polling reliably observes each
plateau regardless of purge-loop latency.

Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
@mender-test-bot
Copy link
Copy Markdown
Contributor

mender-test-bot commented May 29, 2026

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

  • Poll the deviceconnect /connect endpoint for shell
    readiness in the remote-terminal tests, and restart mender-connect
    after the network-heal step in test_in_poor_network_environment to
    avoid its 30-minute auto-recovery backoff cap.
    (QA-1527, QA-1563, QA-1591)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants