
feat(gateway): add reconciler lease for HA multi-replica deployments #1577

Merged
derekwaynecarr merged 1 commit into NVIDIA:main from derekwaynecarr:feat/reconciler-lease
Jun 11, 2026

Conversation

@derekwaynecarr

@derekwaynecarr derekwaynecarr commented May 26, 2026

Collaborator

Summary

Introduce a database-backed reconciler lease so that only one gateway replica runs the watch and reconcile loops in Postgres-backed HA deployments. SQLite (single-replica) deployments skip the lease and run unconditionally as before.

The lease is a lightweight JSON record in the objects table using CAS for cross-replica safety. A lease coordinator on each replica attempts acquisition, runs renewal while holding, and releases on shutdown for fast failover. Watch and reconcile loops now accept a cancellation channel for cooperative shutdown.
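The acquire, renew, and release cycle described above can be sketched with an in-memory stand-in for the lease row. This is a minimal illustration, not the PR's implementation: `Lease`, `LeaseStore`, the `version` field used as the CAS token, and the 15-second TTL are all hypothetical, and a `Mutex` plays the role of the database's conditional update.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical lease record (the PR stores its record as JSON in the objects table).
#[derive(Clone, Debug, PartialEq)]
pub struct Lease {
    pub holder: String,
    pub version: u64, // CAS token: bumped on every successful write
    pub expires_at: Instant,
}

/// In-memory stand-in for the database row; the Mutex mimics an atomic conditional update.
#[derive(Default)]
pub struct LeaseStore {
    row: Mutex<Option<Lease>>,
}

impl LeaseStore {
    /// Succeeds only when no lease exists or the stored one has expired.
    pub fn try_acquire(&self, holder: &str, ttl: Duration, now: Instant) -> Option<Lease> {
        let mut row = self.row.lock().unwrap();
        let version = match &*row {
            Some(current) if current.expires_at > now => return None,
            Some(expired) => expired.version + 1,
            None => 1,
        };
        let lease = Lease { holder: holder.to_string(), version, expires_at: now + ttl };
        *row = Some(lease.clone());
        Some(lease)
    }

    /// Extends the lease only if the caller's guard matches the stored version (the CAS check).
    pub fn renew(&self, guard: &Lease, ttl: Duration, now: Instant) -> Option<Lease> {
        let mut row = self.row.lock().unwrap();
        let matches = matches!(&*row, Some(cur)
            if cur.version == guard.version && cur.holder == guard.holder);
        if !matches {
            return None;
        }
        let lease = Lease {
            holder: guard.holder.clone(),
            version: guard.version + 1,
            expires_at: now + ttl,
        };
        *row = Some(lease.clone());
        Some(lease)
    }

    /// Deletes the lease only if the guard is still current, so a stale holder cannot
    /// remove a lease that another replica has since taken over.
    pub fn release(&self, guard: &Lease) -> bool {
        let mut row = self.row.lock().unwrap();
        let matches = matches!(&*row, Some(cur)
            if cur.version == guard.version && cur.holder == guard.holder);
        if matches {
            *row = None;
        }
        matches
    }
}

/// Walks two hypothetical replicas through contention, renewal, and both failover paths.
pub fn demo() -> Vec<(&'static str, bool)> {
    let store = LeaseStore::default();
    let ttl = Duration::from_secs(15); // illustrative TTL, not taken from the PR
    let t0 = Instant::now();
    let secs = Duration::from_secs;

    let first = store.try_acquire("openshell-0", ttl, t0).expect("empty store");
    let standby_blocked = store.try_acquire("openshell-1", ttl, t0).is_none();
    let renewed = store.renew(&first, ttl, t0 + secs(5)).expect("guard is current");
    let stale_guard_rejected = store.renew(&first, ttl, t0 + secs(5)).is_none();
    // Graceful release lets the standby take over without waiting for the TTL.
    let released = store.release(&renewed);
    let fast_failover = store.try_acquire("openshell-1", ttl, t0 + secs(6)).is_some();
    // Without a release, the other replica has to wait until the TTL has passed.
    let blocked_before_ttl = store.try_acquire("openshell-0", ttl, t0 + secs(10)).is_none();
    let acquired_after_ttl = store.try_acquire("openshell-0", ttl, t0 + secs(30)).is_some();

    vec![
        ("standby blocked while lease is live", standby_blocked),
        ("stale guard rejected on renew", stale_guard_rejected),
        ("holder released its lease", released),
        ("standby acquired right after release", fast_failover),
        ("acquire blocked before TTL expiry", blocked_before_ttl),
        ("acquire succeeded after TTL expiry", acquired_after_ttl),
    ]
}
```

Every step returned by `demo()` is expected to report `true`. The last two steps show why release on shutdown matters: without it, a standby only takes over once the TTL has elapsed.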

Related Issue

Closes #1429

Changes

Testing

  • [x] mise run pre-commit passes
  • [x] Unit tests added/updated
  • [x] E2E tests added/updated (if applicable)

Checklist

  • [x] Follows Conventional Commits
  • [x] Commits are signed off (DCO)
  • [ ] Architecture docs updated (if applicable)

@derekwaynecarr derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 26, 2026 21:25
@copy-pr-bot

copy-pr-bot Bot commented May 26, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@derekwaynecarr

Collaborator Author

This is WIP

@derekwaynecarr derekwaynecarr marked this pull request as draft May 26, 2026 21:26
@derekwaynecarr

Collaborator Author

Need to figure out what we want to do in CI for HA setups.

@derekwaynecarr

Collaborator Author

/ok to test 26b9e70

Introduce a database-backed lease that ensures only one gateway replica
runs the watch and reconcile loops. Includes lease primitives with CAS
safety, cooperative cancellation via watch channels, SQLite bypass for
single-replica deployments, and integration tests covering failover,
contention, and CAS chain integrity.

Signed-off-by: Derek Carr <decarr@redhat.com>
@derekwaynecarr derekwaynecarr force-pushed the feat/reconciler-lease branch from 26b9e70 to ac1e51f Compare June 8, 2026 16:55
@derekwaynecarr derekwaynecarr marked this pull request as ready for review June 8, 2026 16:57
@derekwaynecarr

Collaborator Author

/ok to test ac1e51f

@TaylorMutch TaylorMutch added the test:e2e-kubernetes Requires Kubernetes end-to-end coverage label Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026


Label test:e2e-kubernetes applied, but pull-request/1577 is at 26b9e70 while the PR head is ac1e51f. A maintainer needs to comment /ok to test ac1e51febee115a734e08341307100f4f512d231 to refresh the mirror. Once the mirror catches up, re-run Branch E2E Checks from the Actions tab.

@TaylorMutch

Collaborator

/ok to test ac1e51f

@TaylorMutch TaylorMutch added the test:e2e Requires end-to-end coverage label Jun 10, 2026
@github-actions


Label test:e2e applied for ac1e51f. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

@TaylorMutch

Collaborator

My agent had this for feedback:


• Findings

  1. Medium: shutdown does not wait for the lease coordinator to finish.
     /private/tmp/openshell-pr1577-review/crates/openshell-server/src/lib.rs:350 spawns watchers without retaining a JoinHandle, then shutdown only waits for listener tasks before cleanup_on_shutdown() at
     /private/tmp/openshell-pr1577-review/crates/openshell-server/src/lib.rs:446. That makes lease release and loop cancellation best-effort; cleanup can run while reconcile/watch tasks are still active, and process exit can leave the lease until TTL expiry. The coordinator should be tracked and awaited
     before shutdown cleanup.

  2. Medium: the holder releases the DB lease before stopping its child loops. In run_as_holder, shutdown calls lease.release(guard).await before cancel_tx.send(true) and awaiting the watch/reconcile handles at
     /private/tmp/openshell-pr1577-review/crates/openshell-server/src/compute/mod.rs:816. A standby can acquire immediately while the old holder’s reconcile sweep is still running; cancellation is only observed after reconcile_store_with_backend() returns at
     /private/tmp/openshell-pr1577-review/crates/openshell-server/src/compute/mod.rs:887. Cancel and await the child loops before deleting the lease.
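The ordering recommended in finding 2 (cancel, await the loops, then release) can be sketched with standard-library threads. This is an illustration under assumptions, not the PR's code: the real loops are async tasks driven by a watch channel, whereas here an `AtomicBool` stands in for the cancellation channel, a thread stands in for the reconcile loop, and `shutdown_in_safe_order` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Shuts down a hypothetical lease holder and returns the order in which events happened.
pub fn shutdown_in_safe_order() -> Vec<&'static str> {
    let cancel = Arc::new(AtomicBool::new(false));
    let events: Arc<Mutex<Vec<&'static str>>> = Arc::new(Mutex::new(Vec::new()));

    // Stand-in for the reconcile loop: it only observes cancellation between sweeps.
    let worker = {
        let cancel = Arc::clone(&cancel);
        let events = Arc::clone(&events);
        thread::spawn(move || {
            while !cancel.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(5)); // one simulated sweep
            }
            events.lock().unwrap().push("reconcile loop stopped");
        })
    };

    // 1. Signal cancellation first (logged before the store so the order is deterministic).
    events.lock().unwrap().push("cancel signalled");
    cancel.store(true, Ordering::Release);

    // 2. Wait for the child loop to finish its in-flight sweep and exit.
    worker.join().expect("reconcile thread panicked");

    // 3. Only now give up the lease, so a standby never overlaps with a running sweep.
    events.lock().unwrap().push("lease released");

    let order = events.lock().unwrap().clone();
    order
}
```

Calling `shutdown_in_safe_order()` yields the events in the order cancel, loop stopped, lease released. Swapping steps 2 and 3 would reproduce the window described in the finding, where a standby can acquire while the old holder is still mid-sweep.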

@TaylorMutch TaylorMutch self-assigned this Jun 10, 2026
@TaylorMutch

Collaborator

I did test this locally with postgres and validated this worked as expected, so I am okay to merge this as-is to get the ball rolling forward

@TaylorMutch

TaylorMutch commented Jun 10, 2026

Collaborator

Output from my test:


• Validated PR #1577 locally on branch pr-1577.

What passed

  • Created k3d cluster openshell-dev-pr-1577.

  • Deployed bundled Postgres with replicaCount: 2 via mise run helm:skaffold:run.

  • Verified:

    • openshell StatefulSet ready at 2/2.
    • Postgres ready at 1/1.
    • Lease row existed in Postgres with holder openshell-0.
    • Only openshell-0 ran reconcile sweeps; openshell-1 stayed standby.
  • Deleted holder pod openshell-0.

    • Lease moved to openshell-1.
    • openshell-1 began reconcile sweeps.
  • Re-deployed bundled Postgres with replicaCount: 1.

    • openshell-0 acquired the lease and began reconcile sweeps.

Important caveat
Failover worked, but it appeared to wait for TTL expiry rather than immediate graceful release. After deleting/scaling down the current holder, the old holder remained in the lease row until expiry, then the next replica acquired it. That matches the lifecycle concern from the review: shutdown/release is not strongly coordinated.

@derekwaynecarr

Collaborator Author

Thanks @TaylorMutch, merging so we can move forward next with your PR.

@derekwaynecarr derekwaynecarr merged commit e73745f into NVIDIA:main Jun 11, 2026
61 of 65 checks passed

Labels

test:e2e Requires end-to-end coverage
test:e2e-kubernetes Requires Kubernetes end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(gateway): reconciler lease for HA multi-replica deployments

2 participants