Conversation
…ation on leader elected
Pull request overview
This PR extends the charm’s HA/integration testing capabilities (including K8S network-failure simulation) and improves runtime behavior around IP changes by coordinating restarts and refreshing/regenerating TLS certificates when SANs no longer match.
Changes:
- Add HA “network cut” integration tests for VM and K8S substrates (with TLS on/off variants) and supporting Chaos Mesh tooling.
- Enhance TLS/IP-change handling by detecting SAN drift and triggering cert regeneration/refresh plus a coordinated restart workflow.
- Improve lock coordination by introducing a restart lock and related peer state fields; add K8S client dependency for integration helpers.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 15 comments.
Summary per file:

| File | Description |
|---|---|
| tests/unit/test_charm.py | Updates unit tests and adds coverage for IP-change behavior without a TLS relation. |
| tests/spread/vm/test_network_cut_tls_on.py/task.yaml | Spread task to run the VM HA network cut test with TLS enabled. |
| tests/spread/vm/test_network_cut_tls_off.py/task.yaml | Spread task to run the VM HA network cut test with TLS disabled. |
| tests/spread/k8s/test_network_cut_tls_on.py/task.yaml | Spread task to run the K8S HA network cut test with TLS enabled. |
| tests/spread/k8s/test_network_cut_tls_off.py/task.yaml | Spread task to run the K8S HA network cut test with TLS disabled. |
| tests/integration/helpers.py | Adds TLS-aware CLI execution, more robust primary detection, and a helper to read a unit's IP. |
| tests/integration/ha/test_network_cut.py | New HA scenario test validating failover/replica counts and TLS SAN updates after an IP change. |
| tests/integration/ha/helpers/helpers.py | New HA helper utilities for cutting/restoring network on VM/K8S (Chaos Mesh + kubernetes client). |
| tests/integration/ha/helpers/deploy_chaos_mesh.sh | Script to deploy Chaos Mesh in test namespaces. |
| tests/integration/ha/helpers/destroy_chaos_mesh.sh | Script to remove Chaos Mesh resources after tests. |
| tests/integration/ha/helpers/chaos_network_loss.yml | NetworkChaos manifest template to simulate packet loss. |
| tests/integration/ha/conftest.py | Adds a fixture to deploy/destroy Chaos Mesh for K8S runs. |
| tests/integration/cw_helpers.py | Adds a TLS toggle to the continuous-writes assertion helper. |
| tests/integration/continuous_writes.py | Adds advanced TLS client configuration support for continuous writes. |
| tests/integration/conftest.py | Adds an async cleanup fixture for continuous writes. |
| src/statuses.py | Adds new maintenance statuses for unhealthy restarts. |
| src/managers/tls.py | Adds SAN parsing and comparison to decide when certs require regeneration. |
| src/managers/config.py | Ensures replica config uses the current endpoint binding. |
| src/events/tls.py | Observes config-changed to trigger cert refresh/regeneration and restart on IP change. |
| src/events/base_events.py | Introduces a restart event plus restart lock usage and restart orchestration logic. |
| src/core/models.py | Adds peer fields for restart lock coordination. |
| src/common/locks.py | Adds a restart lock and timestamps for lock request/release updates. |
| pyproject.toml | Adds the kubernetes dependency for K8S HA test tooling. |
| poetry.lock | Updates the lockfile to include new dependencies and Poetry version metadata. |
```shell
if [ "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}' | wc -l)" != "0" ]; then
    echo "deleting clusterroles"
    timeout 30 kubectl delete clusterrole "$(kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}')" || :
```
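One thing worth noting in this hunk: the quoted `"$( … )"` substitution passes every matching role name as a single argument (with embedded newlines), so `kubectl delete clusterrole` would fail whenever more than one chaos-mesh role matches. A minimal sketch of the difference, using dummy role names instead of a live cluster:

```shell
# Simulated output of: kubectl get clusterrole | grep 'chaos-mesh' | awk '{print $1}'
roles="chaos-mesh-controller
chaos-mesh-dns-server"

# Quoted substitution: both names arrive as ONE argument
set -- "$roles"
quoted_args=$#

# xargs splits on whitespace instead, yielding one name per argument,
# which is what a multi-resource `kubectl delete` expects
split_args=$(printf '%s\n' "$roles" | xargs -n1 | wc -l)

echo "quoted=$quoted_args split=$split_args"
```

In the script itself this would typically become `… | awk '{print $1}' | xargs -r timeout 30 kubectl delete clusterrole` (`-r` on GNU xargs skips the call entirely when nothing matches).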
reneradoi left a comment:
Very nice, especially the test coverage and the restart lock! I only have a few comments, but the general concept already looks good.
```python
        self.charm.cluster_manager.reload_tls_settings(tls_config)
        self.charm.sentinel_manager.restart_service()

    def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:
```
Separating concerns is nice and I really like it! After reading the code, though, I think I need to revert my earlier comment and admit that it makes more sense to keep this in the same event handler and method. The split seems overly complicated just to adhere to the structure, and half of the logic here is now unrelated to TLS concerns. Therefore, I'd say let's move this code to `base_events._on_config_changed()` and have a clean implementation there.
I'm ok with moving it back, but note that the purpose of the current code is to update the certificates on IP change.
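For context, the SAN comparison added in `src/managers/tls.py` can be thought of along these lines. This is a hedged sketch using the `cryptography` library; the function names and the idea of comparing against a set of expected SANs are illustrative, not the PR's exact code:

```python
from cryptography import x509


def sans_from_certificate(certificate_pem: str) -> set[str]:
    """Collect the DNS and IP SANs embedded in a PEM certificate."""
    cert = x509.load_pem_x509_certificate(certificate_pem.encode())
    san_ext = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
    dns_names = set(san_ext.value.get_values_for_type(x509.DNSName))
    ip_addresses = {str(ip) for ip in san_ext.value.get_values_for_type(x509.IPAddress)}
    return dns_names | ip_addresses


def certificate_requires_refresh(certificate_pem: str, expected_sans: set[str]) -> bool:
    """True when the cert's SANs have drifted from the unit's current identity."""
    return sans_from_certificate(certificate_pem) != expected_sans
```

When the unit's IP changes, the new address no longer appears in the SANs, the comparison fails, and the manager regenerates the certificate.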
src/events/base_events.py
```python
            statuses_state=self.charm.state.statuses,
        )

        self.charm.state.statuses.delete(
```
Does this reset a status that might have just been set? Or am I missing something here?
src/events/base_events.py
```python
        if event.restart_valkey:
            self.charm.workload.restart(self.charm.workload.valkey_service)
        if event.restart_sentinel:
            self.charm.sentinel_manager.restart_service()
```
We need to add error handling here in case these methods raise; otherwise the event would crash.
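A minimal sketch of the kind of guard being requested. The event class, callables, and exception type are stand-ins mirroring the PR's structure, not its actual API, and deferring the event is just one reasonable policy:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


class WorkloadError(Exception):
    """Illustrative stand-in for whatever the restart helpers raise."""


@dataclass
class RestartEvent:
    restart_valkey: bool = False
    restart_sentinel: bool = False
    deferred: bool = False

    def defer(self) -> None:
        self.deferred = True


def handle_restart(event, restart_valkey_service, restart_sentinel_service) -> None:
    """Run the requested restarts; defer instead of crashing the hook on failure."""
    try:
        if event.restart_valkey:
            restart_valkey_service()
        if event.restart_sentinel:
            restart_sentinel_service()
    except WorkloadError:
        # An uncaught exception would crash the dispatch and could leave the
        # restart lock held; defer so the restart is retried on the next event.
        logger.exception("restart failed; deferring")
        event.defer()
```

Whether to defer, set a blocked status, or release the lock on failure is a design choice the handler would need to make explicitly.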
Mehdi-Bendriss left a comment:
Thanks Smail! Good work! I have some comments.
```python
    unit_request_lock_atr_name: str
    member_with_lock_atr_name: str
    lock_timestamp: str = "databaglock_timestamp"
```
these iterations are exactly why we wanna have a unified locking mechanism 😬
Yes, I touched base with Patricia again last week about using the lib. I will open a PR for it when things settle down a little bit.
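For illustration, the timestamped peer-databag lock pattern behind these attributes could look roughly like this. The databag is simulated with a plain dict and the method names are assumptions; in the charm the data lives in the peer relation databag:

```python
import time


class RestartLock:
    """Hypothetical sketch of a databag-backed restart lock."""

    member_with_lock_atr_name = "member-with-restart-lock"
    lock_timestamp = "databaglock_timestamp"

    def __init__(self, databag: dict):
        self.databag = databag

    def acquire(self, unit: str) -> bool:
        """Grant the lock only if no member currently holds it."""
        if self.databag.get(self.member_with_lock_atr_name):
            return False
        self.databag[self.member_with_lock_atr_name] = unit
        # Timestamping each transition lets peers detect a stale lock
        # whose holder died mid-restart.
        self.databag[self.lock_timestamp] = str(time.time())
        return True

    def release(self, unit: str) -> None:
        """Release the lock, but only if this unit is the holder."""
        if self.databag.get(self.member_with_lock_atr_name) == unit:
            del self.databag[self.member_with_lock_atr_name]
            self.databag[self.lock_timestamp] = str(time.time())
```

This is exactly the kind of per-charm reimplementation the unified locking library would replace.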
src/events/base_events.py
```python
            statuses_state=self.charm.state.statuses,
        )

        self.charm.state.statuses.delete(
```
Why are we immediately clearing a running status we just set?
```python
        """Handle the `config-changed` event."""
        if (
            self.charm.state.unit_server.model.private_ip
            and self.charm.state.bind_address != self.charm.state.unit_server.model.private_ip
```
Is my understanding correct that this only applies to VM? If yes, since K8S uses FQDNs, why not exit the hook immediately?
Mostly, but we still need to update the certs for K8S.
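A hedged sketch of the decision being discussed: on K8S, units keep stable FQDNs so the bind-address comparison is effectively VM-only, but certificate SANs can still drift on both substrates and require a refresh. Function and parameter names here are illustrative, not the charm's API:

```python
def ip_change_actions(
    substrate: str,
    private_ip,
    bind_address,
    sans_match: bool,
) -> set:
    """Return which follow-up actions a config-changed run should take."""
    actions = set()
    if not sans_match:
        # SAN drift matters on both VM and K8S
        actions.add("refresh-certificates")
    if substrate != "k8s" and private_ip and bind_address != private_ip:
        # Only VM units see their bind address move; K8S units keep FQDNs
        actions.add("coordinated-restart")
    return actions
```

This makes the review point concrete: a K8S unit can skip the restart path but cannot skip the certificate check.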
```python
    # restore network to the original primary unit
    logger.info("Restoring network to original primary unit at %s", primary_hostname)
    restore_network_to_unit(substrate, juju.model, machine_name)
```
Shouldn't we distinguish between the with/without IP-change cases? Namely, resetting the ingress/egress/priority.
```python
            status, APP_NAME, unit_count=NUM_UNITS, idle_period=30
        )
    )
    c_writes.update()
```
Is this for the case where we have an IP change?
Yes, correct: this updates the list of endpoints for the continuous writes.
```python
    # read the IP from the cert and check that it differs from before when change_ip is True
    certificate_sans = get_sans_from_certificate("./client.pem")
    if change_ip:
        assert primary_ip not in certificate_sans["sans_ip"], (
```
Let's also add a check here that confirms the endpoints returned by Sentinel do not contain the old primary but do contain the new one.
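The suggested check could be sketched as a small helper in the test. The `sentinel_endpoints` list is assumed to come from querying Sentinel (e.g. via `SENTINEL replicas` or a master-address lookup); the helper name is illustrative:

```python
def assert_sentinel_endpoints_rotated(sentinel_endpoints, old_primary_ip, new_primary_ip):
    """Confirm Sentinel no longer advertises the old primary but does the new one."""
    assert old_primary_ip not in sentinel_endpoints, (
        f"Sentinel still advertises the old primary {old_primary_ip}: {sentinel_endpoints}"
    )
    assert new_primary_ip in sentinel_endpoints, (
        f"Sentinel does not advertise the new primary {new_primary_ip}: {sentinel_endpoints}"
    )
```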
```python
    assert number_of_replicas == NUM_UNITS - 1, (
        f"Expected {NUM_UNITS - 1} connected replicas after network restoration, got {number_of_replicas}."
    )
```
Let's also add a check here that confirms the endpoints returned by Sentinel contain the old primary
This pull request introduces several improvements focused on handling TLS certificate updates when a unit's IP address changes, enhances lock management logic, and adds support for advanced TLS configuration in integration tests. It also introduces new HA testing infrastructure for Kubernetes environments.

TLS certificate and IP change handling:
- Adds an `_on_ip_change` method in `base_events.py` and a custom exception, `TLSCertificatesRequireRefreshError`. Certificates are now checked for SAN changes and updated if needed, with events emitted for client relations. (`src/events/base_events.py`, `src/managers/tls.py`, `src/common/exceptions.py`, `tests/integration/continuous_writes.py`)

Lock management enhancements:
- Adds a restart lock and timestamps for lock request/release updates. (`src/common/locks.py`)

Kubernetes HA testing infrastructure:
- Adds Chaos Mesh deployment/teardown tooling and a NetworkChaos manifest for simulating network loss. (`tests/integration/ha/conftest.py`, `tests/integration/ha/helpers/deploy_chaos_mesh.sh`, `tests/integration/ha/helpers/chaos_network_loss.yml`)

Configuration and dependency updates:
- Adds the `kubernetes` dependency to `pyproject.toml` for K8S-related testing and operations. (`pyproject.toml`)
- Ensures replica configuration uses the current endpoint binding. (`src/managers/config.py`, `src/managers/cluster.py`)

These changes improve the robustness of certificate management and lock coordination, and enable advanced HA testing capabilities in Kubernetes environments.