fix: poll deviceconnect websocket instead of fixed sleeps#2881
fix: poll deviceconnect websocket instead of fixed sleeps#2881pasinskim wants to merge 8 commits into
Conversation
|
@pasinskim, start a full integration test pipeline with:
my commands and optionsYou can prevent me from automatically starting CI pipelines:
You can trigger a client pipeline on multiple prs with:
You can trigger a client pipeline for a specific Mender Client release with:
You can trigger GitHub->GitLab branch sync with:
You can print PR statistics for a repository with:
You can cherry pick to a given branch or branches with:
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2455059455 Build Configuration Matrix
|
8924c9c to
23fae41
Compare
There was a problem hiding this comment.
I am ok with that change, two comments:
- there was a reason I mentioned the HEALTHCHECK on the Mender Docker Client: let's try to consider introducing it with the dbus cli to make the container healthy when the mender-auth responds over dbus (as was the conclusion of the discussion).
- the pipeline failed: https://gitlab.com/Northern.tech/Mender/integration/-/jobs/13944335034 why was it not reported?
| def open_websocket_with_retry(timeout_s=180, interval_s=5): | ||
| # The docker-client fixture is function-scoped, so each test starts | ||
| # with a freshly-bootstrapped client whose deviceconnect session may | ||
| # not yet be established. Poll get_websocket() until it succeeds. | ||
| deadline = time.monotonic() + timeout_s | ||
| last_err = None | ||
| while time.monotonic() < deadline: | ||
| try: | ||
| return docker_env.devconnect.get_websocket() | ||
| except Exception as e: | ||
| last_err = e | ||
| time.sleep(interval_s) | ||
| raise AssertionError( | ||
| f"deviceconnect websocket never became available: {last_err}" | ||
| ) |
There was a problem hiding this comment.
sweet. usually we used a lib for that. but it is ok.
| while time.monotonic() < deadline: | ||
| try: | ||
| return docker_env.devconnect.get_websocket() | ||
| except Exception as e: |
There was a problem hiding this comment.
LGTM but one thing --> do we want to catch all exceptions ? perhaps if there's some specific error idk we want to kill it instantly ? if not that's fine 🥇
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2457447851 Build Configuration Matrix
|
| mender_device.run( | ||
| "kill -TERM `pidof %s` 2>/dev/null; " | ||
| "while pidof %s >/dev/null 2>&1; do sleep 0.1; done" | ||
| % (connect_service_name, connect_service_name) | ||
| ) |
There was a problem hiding this comment.
this potentially can hang forever, I would introduce some way to bail out.
There was a problem hiding this comment.
@merlin-northern thanks. That makes sense. Fixed!
There was a problem hiding this comment.
oh, and there is one more thing: touch /stop-respawn, remember?
There was a problem hiding this comment.
Yup. Actually how about
# Kill mender-connect and let the entrypoint's supervise loop restart
# it automatically with the updated config.
mender_device.run("kill `pidof %s` 2>/dev/null" % connect_service_name)
|
@mender-test-bot start integration pipeline but please don't fail it this time! |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458219452 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline but please don't fail it this time! |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458246630 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2458414545 Build Configuration Matrix
|
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2484811465 Build Configuration Matrix
|
b35e9c8 to
d26959c
Compare
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2556462905 Build Configuration Matrix
|
@pasinskim do not forget to start a couple of pipelines in the 4.1.x -- thats where the problems are (I think 4.1.x needs some cherry picks). |
e3ed755 to
fb7a02c
Compare
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2557275567 Build Configuration Matrix
|
|
I think we need to coordinate our merging @pasinskim my PR also touches the same files as yours I don't mind merging yours and then mine what's left untouched but we definitely cant merge both as we have overlapping changes #2900 |
70990ad to
b6fa95f
Compare
|
@mender-test-bot start integration pipeline |
|
Hello 😺 I created a pipeline for you here: Pipeline-2559274369 Build Configuration Matrix
|
Passed with |
I did run it with |
|
Merging these commits will result in the following changelog entries: Changelogsintegration (docker-fix)New changes in integration since master: |
merlin-northern
left a comment
There was a problem hiding this comment.
I am sorry, this has lots of good work, but one thing is kind of breaking.
| docker_env.device.run( | ||
| "kill -TERM `pidof mender-connect` 2>/dev/null || true" | ||
| ) |
There was a problem hiding this comment.
I am sorry but this breaks the idea of this test. we absolutely cannot restart mender-connect in here. by construction it has to heal the connection by itself. this is the main purpose of this test, and the main purpose of the Mender Troubleshooting Add-on -- we cannot leave the device without access to the server. and by the way this is what all the fuss here is all about. 4.1.x misses some features that fixed deviceconnect around error handling and I think that the correct path is to try to cherry-pick. could you give me more time to debug it over the weekend until Monday?
|
Merging these commits will result in the following changelog entries: Changelogsintegration (docker-fix)New changes in integration since master: |
The docker-client fixture is function-scoped (since 59f3afc), so the deviceconnect tests start each run with a freshly-bootstrapped client and race the deviceconnect session handshake. The original tests papered over this with fixed sleeps (e.g. 128s after iptables -D, 10s after restarting deviceconnect) that are both slow and unreliable: the docker-client's network stack varies, TCP RTO compounds the recovery time on the OUTPUT -j DROP rule (no ICMP/RST is sent back), and the management /connect endpoint returns HTTP 404 ("device disconnected") until the device side is back -- which no fixed sleep can guarantee. Poll the actual readiness instead: retry the full open-websocket + startShell + verify-shell sequence via redo.retriable, catching the 404 (websockets.WebSocketException) and recv TimeoutError, until a working shell is established. Tolerate the narrow "shell from a previous attempt is still running" race that can occur if a retry fires after a successful startShell flapped mid-verify (the device's shell limit is 1). For test_in_poor_network_environment specifically, after iptables -D we restart mender-connect on the device. mender-connect's reconnect backoff caps at 30 minutes (connectionmanager/exponentialbackoff.go in mender-connect) and resetBackoff() is only invoked on a successful connection -- so after a 128s outage, compounded by TCP RTO on the DROP rule, the next reconnect attempt can be minutes away, far beyond a reasonable test budget. Killing the process lets the entrypoint's supervise loop respawn it at attempts=0 so it reconnects on its first cycle. This is a deliberate test-semantics change: the assertion is now that mender-connect re-establishes its deviceconnect session after a restart following the network heal (which is what a watchdog/operator would do in practice), rather than waiting out the autonomous backoff timer. Mirrors the kill+respawn pattern in test_filetransfer.update_limits(). The same retry pattern is applied to test_websocket_reconnect, which suffered an identical 404 race after restarting mender-deviceconnect. Related to QA-1527, QA-1563 and QA-1591. Changelog: Poll the deviceconnect /connect endpoint for shell readiness in the remote-terminal tests, and restart mender-connect after the network-heal step in test_in_poor_network_environment to avoid its 30-minute auto-recovery backoff cap. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove the non-existent PreserveGroup config key. Mender-connect's PreserveOwner already covers both owner and group via os.Chown(uid, gid). The unknown JSON field was silently ignored. Also move the status code assertion before the file check so that upload failures (e.g. 408 timeout) are diagnosed directly instead of producing a confusing "assert '' == '101 102'" error. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
The _run() SSH retry loop had three problems: - Slept before the first attempt (1s wasted if already connected) - Exponential backoff with no cap (1,2,4,8,16,32,64,128s...) - Extra time.sleep(60) on ConnectionError on top of the backoff After 7 failures the gap between retries was 128s; with the 60s penalty a single transient SSH drop could stall for minutes. Fix: try first then sleep, cap backoff at 30s, remove the redundant 60s ConnectionError penalty. Retry sequence is now 1,2,4,8,16,30,30... Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Four issues in deployments.py: 1. check_expected_statistics: slept 200ms before the first check on every iteration. With 59 call sites this adds up. 2. check_not_in_status: had a for/else logic bug where 'continue' targeted the inner for-loop instead of the outer while-loop, making the function silently return success in some edge cases. Replaced with straightforward any() check. 3. abort() and abort_finished_deployment(): both had a blind sleep(5) between the PUT response and the assertion on that same response's status code. This was completely pointless since r.status_code is already set. Saves 5s per call (4 call sites = 20s). Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
_check_log_for_message: used a fixed 10s polling interval with sleep-before-check. If the log message appeared within 1s, the test still waited 10s. Replace with check-first and exponential backoff (2,4,8,10,10...) saving ~8s per call across 7 call sites. requests_retry: only retried on 500/502 but not 503 (Service Unavailable) or 504 (Gateway Timeout), which are common during container startup when the gateway is up but backends are still loading. Adding these prevents false test failures during setup. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Convert all 4 filetransfer test classes from function-scoped to class-scoped fixtures so each class spins up the ~21-container Docker Compose stack once instead of per-test. This saves ~80% of the wall time for the filetransfer suite (verified: 6 tests in 97s vs ~580s). Follows the existing _impl + class_persistent_ pattern already used for QEMU client fixtures. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
Remove test_db_migration.py and its v1 legacy client fixtures from common_setup.py. These tests verified upgrading from Mender v1.7.0 (released 2017) to the current version. This migration path that is no longer supported. The fixtures were only used by this test file. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
test_monitorclient_remove_old_alerts read the alert store once at a fixed offset after starting mender-monitor and asserted an exact key count. mender-monitor purges expired records on a periodic loop with variable restart latency, so the single timed read raced the purge cycle. The faster mender_device.run() from the SSH backoff change (no longer sleeping 1s before every healthy command) shifted the read ~2s earlier and collapsed the timing margin, surfacing the latent flakiness (seen as '4' == '2' and '1' == '0'). Poll until the store drains to the expected count instead. It drains monotonically (4 -> 2 -> 0), so 1s polling reliably observes each plateau regardless of purge-loop latency. Signed-off-by: pasinskim <marcin.pasinski@northern.tech>
|
Merging these commits will result in the following changelog entries: Changelogsintegration (docker-fix)New changes in integration since master: |
The docker-client fixture is function-scoped (since 59f3afc), so test_in_poor_network_environment starts each run with a freshly bootstrapped client and races the initial deviceconnect session handshake. The fixed 30s heal window after the iptables drop has the same problem. Replace both with a bounded retry loop around get_websocket() so the test waits for actual readiness.
Related to QA-1527 and QA-1563.