Skip to content

[DPE-9326]: Add HA Failover tests#28

Open
skourta wants to merge 15 commits intodpe-9325-network-hafrom
dpe-9326-failover-tests
Open

[DPE-9326]: Add HA Failover tests#28
skourta wants to merge 15 commits intodpe-9325-network-hafrom
dpe-9326-failover-tests

Conversation

@skourta
Copy link
Copy Markdown
Contributor

@skourta skourta commented Mar 19, 2026

This pull request introduces a coordinated restart lock mechanism for workload restarts, ensures safe handling of IP/hostname changes (especially on VMs), and improves certificate SAN handling for TLS. It also adds new status reporting for unhealthy restarts and refines lock management for start and restart operations. The changes are grouped below by theme.

Restart Lock and Event Handling:

  • Introduced a new RestartLock class (subclassing DataBagLock) to coordinate restarts across units, with associated model fields (restart_member, request_restart_lock) and lock logic. This prevents concurrent restarts and ensures safe rolling restarts. [1] [2] [3]
  • Added a RestartWorkloadEvent and corresponding event source/handler in BaseEvents, which triggers coordinated restarts of Valkey and Sentinel services, checks health post-restart, and sets or clears new unhealthy restart statuses. [1] [2] [3] [4] [5]
  • Updated peer relation and leader election handlers to process both start and restart locks. [1] [2]

TLS Certificate and IP/Hostname Change Handling:

  • Implemented logic to detect when the private IP/hostname changes (on VMs), and trigger certificate regeneration or refresh if SANs have changed, using new methods get_current_sans and certificate_sans_require_update in the TLS manager. [1] [2]
  • On IP/hostname changes, the charm now emits a coordinated restart event (on VMs) to ensure services pick up new certificates and configuration.
  • Added observation of config_changed in TLS event handler for this purpose.

Lock Management Improvements:

  • Refined DataBagLock and StartLock logic: added dynamic timestamp attribute, improved lock selection, and clarified lock-free checks for both start and restart operations. [1] [2] [3] [4] [5]
  • Adjusted start event flow to defer only when necessary and ensure lock acquisition is handled after pre-checks. [1] [2]

Status and Configuration Reporting:

  • Added new status objects for reporting unhealthy states after restarts (VALKEY_UNHEALTHY_RESTART, SENTINEL_UNHEALTHY_RESTART).
  • Minor config logic tweaks for endpoint and replica configuration. [1] [2]

Dependency Update:

  • Added kubernetes as a dependency in pyproject.toml.

@skourta skourta changed the base branch from 9/edge to dpe-9325-network-ha March 19, 2026 16:27
@skourta skourta marked this pull request as ready for review March 19, 2026 21:23
Copy link
Copy Markdown
Contributor

@reneradoi reneradoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments, looks good!

environment:
TEST_MODULE: ha/test_failover.py
execute: |
tox run -e integration -- "tests/integration/$TEST_MODULE" --substrate k8s --alluredir="$SPREAD_TASK/allure-results"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to parameterize these tests, otherwise they will just be deployed with tls=off and the other test will be skipped.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a question I wanted to bring up in our 1-1. Should I also add TLS variants to all of these. Currently we do not.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed: If not too much of hassle we can add the TLS run, yes. If it takes too long or is too much work for now, we can also postpone this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants