fix: poll deviceconnect websocket instead of fixed sleeps by pasinskim · Pull Request #2881 · mendersoftware/integration

pasinskim · 2026-04-15T12:57:44Z

The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness.

Related to QA-1527 and QA-1563.

mender-test-bot · 2026-04-15T12:57:59Z

@pasinskim, start a full integration test pipeline with:

mentioning me and start integration pipeline

my commands and options

You can prevent me from automatically starting CI pipelines:

if your pull request title starts with "[NoCI] ..."

You can trigger a client pipeline on multiple prs with:

mentioning me and start client pipeline --pr mender/127 --pr mender-connect/255

You can trigger a client pipeline for a specific Mender Client release with:

mentioning me and start client pipeline --release 6.0.x (can be given multiple times)
by default, a pipeline is triggered for each supported release the component is a part of

You can trigger GitHub->GitLab branch sync with:

mentioning me and sync

You can print PR statistics for a repository with:

mentioning me and print fast pr stats (Team stats only)
mentioning me and print full pr stats (Detailed report)
options: --repo <repo>, --team <name>, --all-repos, --exclude-drafts, --exclude-user <user>
mentioning me and print full pr stats --repo mender --all-repos --exclude-drafts

You can cherry pick to a given branch or branches with:

mentioning me and:

 cherry-pick to:
 * 1.0.x
 * 2.0.x

pasinskim · 2026-04-15T12:58:24Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-04-15T12:58:35Z

Hello 😺 I created a pipeline for you here: Pipeline-2455059455

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

merlin-northern

I am ok with that change, two comments:

there was a reason I mentioned the HEALTHCHECK on the Mender Docker Client: let's try to consider introducing it with the dbus cli to make the container healthy when the mender-auth responds over dbus (as was the conclusion of the discussion).
the pipeline failed: https://gitlab.com/Northern.tech/Mender/integration/-/jobs/13944335034 why was it not reported?

merlin-northern · 2026-04-16T04:38:59Z

+        def open_websocket_with_retry(timeout_s=180, interval_s=5):
+            # The docker-client fixture is function-scoped, so each test starts
+            # with a freshly-bootstrapped client whose deviceconnect session may
+            # not yet be established. Poll get_websocket() until it succeeds.
+            deadline = time.monotonic() + timeout_s
+            last_err = None
+            while time.monotonic() < deadline:
+                try:
+                    return docker_env.devconnect.get_websocket()
+                except Exception as e:
+                    last_err = e
+                    time.sleep(interval_s)
+            raise AssertionError(
+                f"deviceconnect websocket never became available: {last_err}"
+            )


sweet. usually we used a lib for that. but it is ok.

elkoniu

LGTM

p-targowicz · 2026-04-16T07:41:33Z

+            while time.monotonic() < deadline:
+                try:
+                    return docker_env.devconnect.get_websocket()
+                except Exception as e:


LGTM but one thing --> do we want to catch all exceptions ? perhaps if there's some specific error idk we want to kill it instantly ? if not that's fine 🥇

pasinskim · 2026-04-16T10:33:50Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-04-16T10:34:00Z

Hello 😺 I created a pipeline for you here: Pipeline-2457447851

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

merlin-northern · 2026-04-16T15:02:34Z

+    mender_device.run(
+        "kill -TERM `pidof %s` 2>/dev/null; "
+        "while pidof %s >/dev/null 2>&1; do sleep 0.1; done"
+        % (connect_service_name, connect_service_name)
+    )


this potentially can hang forever, I would introduce some way to bail out.

@merlin-northern thanks. That makes sense. Fixed!

oh, and there is one more thing: touch /stop-respawn, remember?

Yup. Actually how about

# Kill mender-connect and let the entrypoint's supervise loop restart # it automatically with the updated config. mender_device.run("kill `pidof %s` 2>/dev/null" % connect_service_name)

pasinskim · 2026-04-16T15:11:50Z

@mender-test-bot start integration pipeline but please don't fail it this time!

mender-test-bot · 2026-04-16T15:12:00Z

Hello 😺 I created a pipeline for you here: Pipeline-2458219452

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

pasinskim · 2026-04-16T15:22:27Z

@mender-test-bot start integration pipeline but please don't fail it this time!

mender-test-bot · 2026-04-16T15:22:40Z

Hello 😺 I created a pipeline for you here: Pipeline-2458246630

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

pasinskim · 2026-04-16T16:29:23Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-04-16T16:29:33Z

Hello 😺 I created a pipeline for you here: Pipeline-2458414545

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

pasinskim · 2026-04-28T10:26:02Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-04-28T10:26:11Z

Hello 😺 I created a pipeline for you here: Pipeline-2484811465

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

pasinskim · 2026-05-27T14:36:24Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-05-27T14:36:35Z

Hello 😺 I created a pipeline for you here: Pipeline-2556462905

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

merlin-northern · 2026-05-27T19:37:44Z

start integration pipeline

@pasinskim do not forget to start a couple of pipelines in the 4.1.x -- thats where the problems are (I think 4.1.x needs some cherry picks).

pasinskim · 2026-05-27T21:22:08Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-05-27T21:22:19Z

Hello 😺 I created a pipeline for you here: Pipeline-2557275567

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

p-targowicz · 2026-05-28T09:18:18Z

I think we need to coordinate our merging @pasinskim my PR also touches the same files as yours I don't mind merging yours and then mine what's left untouched but we definitely cant merge both as we have overlapping changes #2900

pasinskim · 2026-05-28T14:34:07Z

@mender-test-bot start integration pipeline

mender-test-bot · 2026-05-28T14:34:18Z

Hello 😺 I created a pipeline for you here: Pipeline-2559274369

Build Configuration Matrix

Key	Value
INTEGRATION_REV	pull/2881/head
RUN_TESTS_FULL_INTEGRATION	true

pasinskim · 2026-05-28T14:34:38Z

start integration pipeline

@pasinskim do not forget to start a couple of pipelines in the 4.1.x -- thats where the problems are (I think 4.1.x needs some cherry picks).

Passed with 4.1.x here: https://gitlab.com/Northern.tech/Mender/integration/-/pipelines/2558962191

pasinskim · 2026-05-28T14:35:25Z

I think we need to coordinate our merging @pasinskim my PR also touches the same files as yours I don't mind merging yours and then mine what's left untouched but we definitely cant merge both as we have overlapping changes #2900

I did run it with 4.1.x. Now will try again with main with the latest changes and we can discuss what to do.

p-targowicz

lgtm

mender-test-bot · 2026-05-29T06:53:55Z

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

Poll the deviceconnect /connect endpoint for shell
readiness in the remote-terminal tests, and restart mender-connect
after the network-heal step in test_in_poor_network_environment to
avoid its 30-minute auto-recovery backoff cap.
(QA-1527, QA-1563, QA-1591)

merlin-northern

I am sorry, this has lots of good work, but one thing is kind of breaking.

merlin-northern · 2026-05-29T08:12:31Z

+        docker_env.device.run(
+            "kill -TERM `pidof mender-connect` 2>/dev/null || true"
+        )


I am sorry but this breaks the idea of this test. we absolutely cannot restart mender-connect in here. by construction it has to heal the connection by itself. this is the main purpose of this test, and the main purpose of the Mender Troubleshooting Add-on -- we cannot leave the device without access to the server. and by the way this is what all the fuss here is all about. 4.1.x misses some features that fixed deviceconnect around error handling and I think that the correct path is to try to cherry-pick. could you give me more time to debug it over the weekend until Monday?

mender-test-bot · 2026-05-29T08:22:15Z

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

Poll the deviceconnect /connect endpoint for shell
readiness in the remote-terminal tests, and restart mender-connect
after the network-heal step in test_in_poor_network_environment to
avoid its 30-minute auto-recovery backoff cap.
(QA-1527, QA-1563, QA-1591)

The docker-client fixture is function-scoped (since 59f3afc), so the deviceconnect tests start each run with a freshly-bootstrapped client and race the deviceconnect session handshake. The original tests papered over this with fixed sleeps (e.g. 128s after iptables -D, 10s after restarting deviceconnect) that are both slow and unreliable: the docker-client's network stack varies, TCP RTO compounds the recovery time on the OUTPUT -j DROP rule (no ICMP/RST is sent back), and the management /connect endpoint returns HTTP 404 ("device disconnected") until the device side is back -- which no fixed sleep can guarantee. Poll the actual readiness instead: retry the full open-websocket + startShell + verify-shell sequence via redo.retriable, catching the 404 (websockets.WebSocketException) and recv TimeoutError, until a working shell is established. Tolerate the narrow "shell from a previous attempt is still running" race that can occur if a retry fires after a successful startShell flapped mid-verify (the device's shell limit is 1). For test_in_poor_network_environment specifically, after iptables -D we restart mender-connect on the device. mender-connect's reconnect backoff caps at 30 minutes (connectionmanager/exponentialbackoff.go in mender-connect) and resetBackoff() is only invoked on a successful connection -- so after a 128s outage, compounded by TCP RTO on the DROP rule, the next reconnect attempt can be minutes away, far beyond a reasonable test budget. Killing the process lets the entrypoint's supervise loop respawn it at attempts=0 so it reconnects on its first cycle. This is a deliberate test-semantics change: the assertion is now that mender-connect re-establishes its deviceconnect session after a restart following the network heal (which is what a watchdog/operator would do in practice), rather than waiting out the autonomous backoff timer. Mirrors the kill+respawn pattern in test_filetransfer.update_limits(). The same retry pattern is applied to test_websocket_reconnect, which suffered an identical 404 race after restarting mender-deviceconnect. Related to QA-1527, QA-1563 and QA-1591. Changelog: Poll the deviceconnect /connect endpoint for shell readiness in the remote-terminal tests, and restart mender-connect after the network-heal step in test_in_poor_network_environment to avoid its 30-minute auto-recovery backoff cap. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

Remove the non-existent PreserveGroup config key. Mender-connect's PreserveOwner already covers both owner and group via os.Chown(uid, gid). The unknown JSON field was silently ignored. Also move the status code assertion before the file check so that upload failures (e.g. 408 timeout) are diagnosed directly instead of producing a confusing "assert '' == '101 102'" error. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

The _run() SSH retry loop had three problems: - Slept before the first attempt (1s wasted if already connected) - Exponential backoff with no cap (1,2,4,8,16,32,64,128s...) - Extra time.sleep(60) on ConnectionError on top of the backoff After 7 failures the gap between retries was 128s; with the 60s penalty a single transient SSH drop could stall for minutes. Fix: try first then sleep, cap backoff at 30s, remove the redundant 60s ConnectionError penalty. Retry sequence is now 1,2,4,8,16,30,30... Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

Four issues in deployments.py: 1. check_expected_statistics: slept 200ms before the first check on every iteration. With 59 call sites this adds up. 2. check_not_in_status: had a for/else logic bug where 'continue' targeted the inner for-loop instead of the outer while-loop, making the function silently return success in some edge cases. Replaced with straightforward any() check. 3. abort() and abort_finished_deployment(): both had a blind sleep(5) between the PUT response and the assertion on that same response's status code. This was completely pointless since r.status_code is already set. Saves 5s per call (4 call sites = 20s). Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

_check_log_for_message: used a fixed 10s polling interval with sleep-before-check. If the log message appeared within 1s, the test still waited 10s. Replace with check-first and exponential backoff (2,4,8,10,10...) saving ~8s per call across 7 call sites. requests_retry: only retried on 500/502 but not 503 (Service Unavailable) or 504 (Gateway Timeout), which are common during container startup when the gateway is up but backends are still loading. Adding these prevents false test failures during setup. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

Convert all 4 filetransfer test classes from function-scoped to class-scoped fixtures so each class spins up the ~21-container Docker Compose stack once instead of per-test. This saves ~80% of the wall time for the filetransfer suite (verified: 6 tests in 97s vs ~580s). Follows the existing _impl + class_persistent_ pattern already used for QEMU client fixtures. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

Remove test_db_migration.py and its v1 legacy client fixtures from common_setup.py. These tests verified upgrading from Mender v1.7.0 (released 2017) to the current version. This migration path that is no longer supported. The fixtures were only used by this test file. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

test_monitorclient_remove_old_alerts read the alert store once at a fixed offset after starting mender-monitor and asserted an exact key count. mender-monitor purges expired records on a periodic loop with variable restart latency, so the single timed read raced the purge cycle. The faster mender_device.run() from the SSH backoff change (no longer sleeping 1s before every healthy command) shifted the read ~2s earlier and collapsed the timing margin, surfacing the latent flakiness (seen as '4' == '2' and '1' == '0'). Poll until the store drains to the expected count instead. It drains monotonically (4 -> 2 -> 0), so 1s polling reliably observes each plateau regardless of purge-loop latency. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>

mender-test-bot · 2026-05-29T08:26:01Z

Merging these commits will result in the following changelog entries:

Changelogs

integration (docker-fix)

New changes in integration since master:

Poll the deviceconnect /connect endpoint for shell
readiness in the remote-terminal tests, and restart mender-connect
after the network-heal step in test_in_poor_network_environment to
avoid its 30-minute auto-recovery backoff cap.
(QA-1527, QA-1563, QA-1591)

pasinskim force-pushed the docker-fix branch 2 times, most recently from 8924c9c to 23fae41 Compare April 15, 2026 14:01

pasinskim requested a review from merlin-northern April 15, 2026 14:07

merlin-northern reviewed Apr 16, 2026

View reviewed changes

elkoniu approved these changes Apr 16, 2026

View reviewed changes

p-targowicz approved these changes Apr 16, 2026

View reviewed changes

merlin-northern approved these changes Apr 16, 2026

View reviewed changes

pasinskim force-pushed the docker-fix branch from 8020f46 to 8a001f8 Compare April 16, 2026 15:01

merlin-northern reviewed Apr 16, 2026

View reviewed changes

pasinskim force-pushed the docker-fix branch from 8a001f8 to 5c61c61 Compare April 16, 2026 15:06

pasinskim force-pushed the docker-fix branch from 5c61c61 to 66120ce Compare April 16, 2026 15:21

pasinskim force-pushed the docker-fix branch from 66120ce to 877942f Compare April 16, 2026 16:26

pasinskim force-pushed the docker-fix branch 4 times, most recently from b35e9c8 to d26959c Compare May 8, 2026 13:37

pasinskim force-pushed the docker-fix branch 2 times, most recently from e3ed755 to fb7a02c Compare May 27, 2026 21:19

pasinskim force-pushed the docker-fix branch 2 times, most recently from 70990ad to b6fa95f Compare May 28, 2026 12:52

p-targowicz approved these changes May 29, 2026

View reviewed changes

pasinskim force-pushed the docker-fix branch from b6fa95f to bf84763 Compare May 29, 2026 06:53

merlin-northern requested changes May 29, 2026

View reviewed changes

pasinskim force-pushed the docker-fix branch from bf84763 to 4f528f6 Compare May 29, 2026 08:22

pasinskim added 8 commits May 29, 2026 10:25

pasinskim force-pushed the docker-fix branch from 4f528f6 to 13da8cd Compare May 29, 2026 08:25

Conversation

pasinskim commented Apr 15, 2026

Uh oh!

mender-test-bot commented Apr 15, 2026

Uh oh!

pasinskim commented Apr 15, 2026

Uh oh!

mender-test-bot commented Apr 15, 2026

Uh oh!

merlin-northern left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

merlin-northern Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

elkoniu left a comment

Choose a reason for hiding this comment

Uh oh!

p-targowicz Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pasinskim commented Apr 16, 2026

Uh oh!

mender-test-bot commented Apr 16, 2026

Uh oh!

merlin-northern Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

pasinskim Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

merlin-northern Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pasinskim Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pasinskim commented Apr 16, 2026

Uh oh!

mender-test-bot commented Apr 16, 2026

Uh oh!

pasinskim commented Apr 16, 2026

Uh oh!

mender-test-bot commented Apr 16, 2026

Uh oh!

pasinskim commented Apr 16, 2026

Uh oh!

mender-test-bot commented Apr 16, 2026

Uh oh!

pasinskim commented Apr 28, 2026

Uh oh!

mender-test-bot commented Apr 28, 2026

Uh oh!

pasinskim commented May 27, 2026

Uh oh!

mender-test-bot commented May 27, 2026

Uh oh!

merlin-northern commented May 27, 2026

Uh oh!

pasinskim commented May 27, 2026

Uh oh!

mender-test-bot commented May 27, 2026

Uh oh!

p-targowicz commented May 28, 2026

Uh oh!

pasinskim commented May 28, 2026

Uh oh!

mender-test-bot commented May 28, 2026

Uh oh!

pasinskim commented May 28, 2026

Uh oh!

pasinskim commented May 28, 2026

Uh oh!

p-targowicz left a comment

Choose a reason for hiding this comment

Uh oh!

merlin-northern left a comment •

edited

Loading

p-targowicz Apr 16, 2026 •

edited

Loading

merlin-northern Apr 16, 2026 •

edited

Loading

pasinskim Apr 16, 2026 •

edited

Loading

mender-test-bot commented May 29, 2026 •

edited by atlassian Bot

Loading

mender-test-bot commented May 29, 2026 •

edited by atlassian Bot

Loading

mender-test-bot commented May 29, 2026 •

edited by atlassian Bot

Loading