Skip to content

feat: add VM migration integration test#11

Merged
odesenfans merged 4 commits into
mainfrom
worktree-od+vm-migration-test
May 4, 2026
Merged

feat: add VM migration integration test#11
odesenfans merged 4 commits into
mainfrom
worktree-od+vm-migration-test

Conversation

@odesenfans
Copy link
Copy Markdown
Contributor

Summary

  • Adds test_instance_migration — an end-to-end test that creates a VM, verifies SSH on the initial CRN, unlinks that CRN, and asserts the scheduler migrates the instance to a different CRN with SSH restored
  • Adds crn_nodes session fixture to conftest.py for multi-CRN test support (gracefully skips when < 2 CRNs)
  • Requires CRN_COUNT=2 during CRN provisioning

The test defines the contract for scheduler-rs migration support, which is being developed in parallel.

Test plan

  • Run with CRN_COUNT=2 and scheduler-rs migration support enabled
  • Verify test is skipped gracefully with CRN_COUNT=1
  • Verify full test collection passes (pytest --collect-only)

@odesenfans odesenfans marked this pull request as draft April 20, 2026 15:11
@odesenfans odesenfans force-pushed the worktree-od+vm-migration-test branch from 8c9b880 to 0499cd1 Compare April 23, 2026 21:27
@odesenfans odesenfans marked this pull request as ready for review April 29, 2026 21:46
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid integration test for VM migration. The test flow (create instance, SSH verify, write marker, unlink CRN, verify migration, verify disk persistence) is well-structured with sensible timeouts. The CI workflow correctly generalizes to CRN_COUNT droplets with proper parallelization, and the crn_nodes fixture gracefully skips when fewer than 2 CRNs are available. No bugs or security issues found; a few redundant assertions and minor suggestions below.

tests/test_migration.py (line 186): Redundant assertion. _wait_for_ssh already verifies that the response contains "hello" (line 110: if result.returncode == 0 and "hello" in result.stdout). The assert on line 187 adds no additional coverage. Same pattern at line 235. Consider removing the redundant asserts to keep the test focused on the migration contract rather than re-testing the helper.

tests/test_migration.py (line 225): Redundant assertion. fetch_new_allocation() (line 206-217) already returns only when new_host != initial_crn_host. By the time execution reaches line 225, this invariant is guaranteed. Consider removing or replacing with a comment noting the invariant is enforced by the poll predicate.

tests/test_migration.py (line 56): Minor: _crn_base_url always reconstructs the URL with http:// regardless of the original scheme. If a CRN ever served HTTPS, this would silently downgrade. For a test file this is low-risk, but consider using parsed.scheme instead of hardcoding http.

.github/workflows/pr-tests.yml (line 13): The actions: read permission is noted as needed for gh CLI artifact downloads, but the gh run download / gh run list calls happen on the CCN droplet (inside crn-up.sh), not in the GitHub Actions runner. The GH_TOKEN passed via env at line 230 is sufficient for authenticated gh calls to GitHub's API. The actions: read permission on the workflow itself is harmless but unnecessary — it would only matter if this workflow read artifacts from other workflows (e.g. actions/download-artifact).

.github/workflows/pr-tests.yml (line 134): Minor: the gh CLI repo setup runs serially over SSH before the parallel apt install block. Since the keyring setup + sources.list is just a couple of commands, this is fine in practice. No change needed — just noting that this could be moved inside the parallel ( ... ) & block if CI time becomes a concern.

odesenfans added 2 commits May 3, 2026 13:48
End-to-end test that validates VM cold-migration between CRNs:
create an instance, SSH on the initial CRN and write a marker file,
unlink that CRN, wait for the scheduler to reallocate to a different
CRN, and verify SSH on the new node and that the marker file survived
(disk state preserved across migration).

- tests/test_migration.py: new migration test with helpers
- tests/conftest.py: add crn_nodes fixture for multi-CRN tests
- manifesto.yml: pin scheduler-rs/scheduler-api 0.1.0-rc1 and
  aleph-vm branch od/cold-migration-async (cold-migration-aware)
- .github/workflows/pr-tests.yml: provision 2 CRNs, install gh and
  pass GH_TOKEN on CCN for branch .deb fetch, bump test timeout to
  60 min, and add SSH keepalives so long polling phases don't drop
  the CI->droplet connection with a broken pipe
@odesenfans odesenfans force-pushed the worktree-od+vm-migration-test branch from 51b7d9c to fc2c84c Compare May 3, 2026 11:55
Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured integration test that correctly validates the full migration contract: instance creation, SSH baseline, CRN unlink, reallocation to a different CRN, and disk state preservation. The CI workflow changes cleanly generalize from single-CRN to CRN_COUNT with proper parallelization, SSH keepalive config, and GH_TOKEN for branch deployments. No genuine bugs or security issues. A few comments below.

manifesto.yml (line 29): scheduler-rs is downgraded from 0.1.0-rc3 to 0.1.0-rc1. If this is intentional to pin the migration-supporting version, add a comment explaining why rc3 isn't used. If this is a mistake, revert to rc3.

manifesto.yml (line 37): The switch from version: "1.12.0" to branch: od/cold-migration-async is a pre-requisite for this test, not a change belonging on main. Make sure this is reverted before merging, or called out in the PR description.

tests/test_migration.py (line 138): The 900s per-test timeout is tight. Worst-case sequential sum of the polling timeouts: 180 (allocation) + 120 (boot) + 60 (SSH) + 15 (marker write) + 300 (re-allocation) + 180 (reboot) + 60 (re-SSH) + 15 (marker read) = 830s, leaving ~125s buffer. Consider bumping to 960s (16 min) for more headroom, since individual steps often finish early but network hiccups can compound.

tests/test_migration.py (line 48): Minor: _http_get_json silently swallows json.JSONDecodeError (not in the except tuple) — a malformed JSON response would propagate the exception instead of returning None and retrying. Consider adding json.JSONDecodeError to the except clause for robustness, though in practice the scheduler-api returns valid JSON or an HTTP error.

.github/workflows/pr-tests.yml (line 230): Good: GH_TOKEN is passed via the GitHub Actions secrets.GITHUB_TOKEN (not hardcoded) and scoped to the digitalocean environment. The actions: read permission at line 13 is correct for gh run download to access CI artifacts across repos.

tests/conftest.py (line 271): The crn_nodes fixture hardcodes NODESTATUS_ADDR which is also hardcoded in crn-up.sh:47. Consider defining it as a module-level constant in conftest.py and importing it in crn-up.sh's inline Python, or at minimum extracting it into a shared constant to avoid drift.

Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured integration test that correctly defines the contract for VM migration. The CI workflow refactoring to support CRN_COUNT is clean, with proper parallelization and cleanup. The new crn_nodes fixture correctly gates on >= 2 CRNs. One minor suggestion: deduplicate shared utilities across test files. No blocking bugs or security issues found.

tests/test_migration.py (line 23): The functions _poll and _http_get_json are duplicated from tests/test_instances.py. Consider moving these into conftest.py as shared utilities (they're already in that file's scope) to avoid drift between the two test files.

tests/test_migration.py (line 119): _wait_for_vm_ssh_port reuses the same CRN execution-list polling pattern as test_instances.py. If the shared helpers are consolidated, this could too.

tests/test_migration.py (line 139): The 900s timeout covers the test but is tight if both VM boots are slow. The summed phase timeouts are ~1070s (180+120+60+300+180+60), which exceeds 900s. Consider bumping to 1200s, or — more defensively — note that aleph_cli calls (upload + create + unlink) are not included in the summed timeout budget since they run outside _poll.

.github/workflows/pr-tests.yml (line 269): Minor: actions/upload-artifact@v4 with a glob pattern crn*-supervisor.txt will match crn0-supervisor.txt, crn1-supervisor.txt, etc. Consider also adding a fallback if-exists note or making the step idempotent — if a CRN IP was never resolved, no file is written for that index, so the glob simply won't match, which is fine.

manifesto.yml (line 36): Pinning aleph-vm to a development branch (od/cold-migration-async) instead of a release version is expected for this PR but worth flagging — this should be reverted to a version: pin once the migration feature is released.

tests/conftest.py (line 268): Good use of pytest.skip inside the fixture. Since this is session-scoped, all tests depending on crn_nodes will be skipped together when < 2 CRNs are present. No issue, just noting this is a clean pattern.

Copy link
Copy Markdown

@foxpatch-aleph foxpatch-aleph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR is well-structured and correctly implements multi-CRN support for both the CI workflow and the test suite. The migration test cleanly validates the full lifecycle: create instance, SSH on initial CRN, unlink, verify reallocation to a different CRN, and confirm disk state preservation. The conftest fixture correctly skips when fewer than 2 CRNs are available. The CI workflow changes for dynamic CRN provisioning, parallel installs, and per-CRN log collection are consistent with existing patterns. Key improvements include SSH keepalive config for long-running sessions and GH_TOKEN passthrough for branch-based deployments. No blocking bugs or security issues found.

tests/test_migration.py (line 54): _crn_base_url doesn't guard against parsed.hostname being None (e.g., if a malformed allocation URL is returned). This would produce "http://None:4020". Consider adding: if not host: raise ValueError(...). In practice this shouldn't happen with valid scheduler data, but it's a defensive safety net for a function that parses external input.

tests/test_migration.py (line 123): Minor: the dict key lookup for port 22 handles both string and int keys (mapped_ports.get("22") or mapped_ports.get(22)). This is robust — just noting it as a good practice in case the aleph-vm API returns inconsistent key types.

.github/workflows/pr-tests.yml (line 44): Good addition of SSH keepalive config. The 30s interval x 20 max = 10 minute timeout before SSH considers the connection dead, which is appropriate for the migration test's polling phases.

.github/workflows/pr-tests.yml (line 13): The added actions: read permission is needed for gh CLI artifact downloads from branches. Good that this is scoped minimally rather than using broader permissions.

tests/conftest.py (line 273): Good use of pytest.skip in the fixture rather than pytest.fail — this correctly marks dependent tests as SKIPPED rather than FAILED when CRN_COUNT < 2.

deploy/docker-compose.yml (line 268): Good addition of ALEPH_SCHEDULER_MIGRATIONS_ENABLED. This is the key config enabling the scheduler-rs migration behavior that the test depends on.

@odesenfans odesenfans merged commit ff175a5 into main May 4, 2026
1 check passed
@odesenfans odesenfans deleted the worktree-od+vm-migration-test branch May 4, 2026 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants