Conversation
Add a lightweight witness/voter charm that participates in Raft consensus to provide quorum in 2-node PostgreSQL clusters without storing any PostgreSQL data.

Key components:
- Watcher charm with Raft controller integration
- Health checking for PostgreSQL endpoints
- Relation interface (postgresql_watcher) for PostgreSQL operator
- Topology and health check actions

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
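As an illustration of the health-check component, a per-endpoint check might look like the sketch below; the function name, `watcher` user, database, and timeout are assumptions, not the charm's actual code.

```python
# Sketch of an on-demand endpoint health check for the watcher; the
# function name, user, and defaults are illustrative assumptions.
import psycopg2


def check_endpoint(host: str, password: str, port: int = 5432) -> bool:
    """Return True if the PostgreSQL endpoint accepts a trivial query."""
    try:
        conn = psycopg2.connect(
            host=host,
            port=port,
            user="watcher",
            password=password,
            dbname="postgres",
            connect_timeout=3,
        )
    except psycopg2.Error:
        return False
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
    except psycopg2.Error:
        return False
    finally:
        conn.close()
```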
```python
        content = secret.get_content(refresh=True)
        return content.get("raft-password")
    except SecretNotFoundError:
        logger.warning(f"Secret {secret_id} not found")
```

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
```python
    # Get the secret ID for sharing
    try:
        secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
        logger.info(f"Got secret for update: {secret}")
```

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
```python
        secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
        logger.info(f"Got secret for update: {secret}")
        secret_id = secret.id
        logger.info(f"Initial secret_id: {secret_id}")
```

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
```python
        # the ops library lazily loads the ID. We need the ID to share with the watcher.
        logger.info("Applying secret ID workaround")
        secret_info = secret.get_info()
        logger.info(f"Secret info: {secret_info}, id={secret_info.id}")
```

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
```python
        # Use the ID directly from get_info() - it already has the full URI
        secret._id = secret_info.id
        secret_id = secret.id
        logger.info(f"Workaround secret_id: {secret_id}")
```

Check failure (Code scanning / CodeQL): Clear-text logging of sensitive information (High)
| "raft-partner-addrs": json.dumps(sorted(raft_partner_addrs)), | ||
| "raft-port": str(RAFT_PORT), | ||
| } | ||
| logger.info(f"Updating relation app data: {update_data}") |
Check failure
Code scanning / CodeQL
Clear-text logging of sensitive information High
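All six findings have the same shape: secret objects, secret IDs, or relation payloads that may embed secret references are interpolated into log lines. One way to clear them is to log only non-sensitive facts and redact the rest; the helper below sketches that idea and is not the PR's actual fix.

```python
# Sketch: a redaction helper so log lines confirm progress without
# exposing secret IDs, secret objects, or relation payloads.
import logging

logger = logging.getLogger(__name__)


def redact(value: object) -> str:
    """Return a non-reversible placeholder for sensitive values."""
    return "<redacted>" if value else "<unset>"


# Instead of: logger.info(f"Initial secret_id: {secret_id}")
# log only the fact that an ID was (or was not) resolved:
# logger.info("Resolved raft secret ID: %s", redact(secret_id))
```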
Codecov Report

❌ Patch coverage is
❌ Your project check has failed because the head coverage (68.57%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           16/edge    #1401    +/-  ##
==========================================
- Coverage    70.53%   68.57%   -1.97%
==========================================
  Files           16       17       +1
  Lines         4297     4694     +397
  Branches       691      749      +58
==========================================
+ Hits          3031     3219     +188
- Misses        1055     1240     +185
- Partials       211      235      +24
```

☔ View full report in Codecov by Sentry.
… pysyncobj Raft service

Add a standalone raft_service.py that implements a KVStoreTTL-compatible Raft node managed as a systemd service, eliminating the dependency on the charmed-postgresql snap. Remove automatic health checks in favor of on-demand checks via an action, since the watcher lacks PostgreSQL credentials.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
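raft_service.py itself is not shown in this excerpt; the following is a minimal sketch of what a standalone pysyncobj witness node under that design can look like. The class name, addresses, and configuration values are assumptions, and the TTL-expiry sketch further down extends this class.

```python
# Minimal sketch of a standalone pysyncobj witness node: it votes and
# replicates a small keyspace but never stores PostgreSQL data.
import time

from pysyncobj import SyncObj, SyncObjConf, replicated


class WatcherKVStore(SyncObj):
    def __init__(self, self_addr: str, partner_addrs: list[str]):
        conf = SyncObjConf(dynamicMembershipChange=True, autoTick=True)
        super().__init__(self_addr, partner_addrs, conf=conf)
        self._data = {}

    @replicated
    def _set(self, key, value):
        self._data[key] = value

    @replicated
    def _delete(self, key):
        self._data.pop(key, None)


if __name__ == "__main__":
    # Addresses would come from the relation data in the real charm;
    # run as a systemd service, this process just keeps ticking.
    node = WatcherKVStore("10.0.0.3:2222", ["10.0.0.1:2222", "10.0.0.2:2222"])
    while True:
        time.sleep(1)
```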
…tereo mode tests

Replace cut_network_from_unit_without_ip_change with cut_network_from_unit in stereo mode integration tests. The iptables-based approach with REJECT was still causing timeouts; removing the interface entirely triggers faster TCP connection failures. Added use_ip_from_inside=True for check_writes, since restored units get new IPs. Also adds a spread task for stereo mode tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
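For context, the interface-removal approach on LXD can be sketched as below; the lxc invocations are an assumption about how cut_network_from_unit is implemented, not quotes from the test suite.

```python
# Sketch: isolate a unit's LXD machine by masking its NIC so peers see
# immediate TCP connection failures instead of iptables REJECT timeouts.
import subprocess


def cut_network_from_unit(machine_name: str) -> None:
    # A "none" device masks the profile-inherited eth0 (assumption).
    subprocess.check_call(
        ["lxc", "config", "device", "add", machine_name, "eth0", "none"]
    )


def restore_network_for_unit(machine_name: str) -> None:
    # Unmasking restores connectivity; the unit may come back with a new
    # IP, which is why the tests pass use_ip_from_inside=True afterwards.
    subprocess.check_call(
        ["lxc", "config", "device", "remove", machine_name, "eth0"]
    )
```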
Add the Raft member proactively during an IP change to prevent race conditions where a member restarts Patroni before being added to the cluster. Implement watcher removal from Raft on relation departure to maintain correct quorum calculations. Add an idempotency check before adding the watcher to Raft. Use fresh peer IPs for Raft member addition instead of cached values. Update stereo mode tests with iptables-based network isolation and Raft health verification.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
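pysyncobj supports dynamic membership via addNodeToCluster()/removeNodeFromCluster() when SyncObjConf(dynamicMembershipChange=True) is set, and exposes cluster peers via otherNodes. The idempotency check and departure-time removal could be sketched as follows; the wrapper functions and address handling are assumptions.

```python
# Sketch: idempotent Raft membership changes for the watcher.
from pysyncobj import SyncObj


def add_watcher_if_missing(node: SyncObj, watcher_addr: str) -> None:
    # Skip the add when the watcher is already a member (idempotency check).
    if watcher_addr in {str(n) for n in node.otherNodes}:
        return
    node.addNodeToCluster(watcher_addr)


def remove_watcher(node: SyncObj, watcher_addr: str) -> None:
    # Called on relation departure so quorum reflects the remaining members.
    if watcher_addr in {str(n) for n in node.otherNodes}:
        node.removeNodeFromCluster(watcher_addr)
```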
…o tests

Build the watcher charm automatically if not found, and deploy charms sequentially instead of concurrently to improve reliability.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add idempotency check to skip deployment if already in expected state
- Clean up unexpected state before redeploying to avoid test pollution
- Add wait_for_idle after replica shutdown to allow cluster stabilization

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
…fy_raft_cluster_health call

- Add use_ip_from_inside=True to test_watcher_network_isolation to handle stale IPs
- Fix verify_raft_cluster_health call in test_health_check_action to pass required arguments

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add __expire_keys and _onTick methods to WatcherKVStoreTTL to match Patroni's KVStoreTTL behavior. When the watcher becomes the Raft leader (e.g., when the PostgreSQL primary is network-isolated), it must expire stale leader keys so that a replica can acquire leadership.

Without this fix, the watcher would become Raft leader but wouldn't process TTL expirations, causing the old Patroni leader key to remain valid and preventing failover.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
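Patroni's KVStoreTTL expires keys from its tick loop while it is leader, and the watcher needs the same hook. The sketch below follows the method names in the commit message, but the bodies are assumptions modelled on that behavior; _isLeader and _onTick are pysyncobj internals, and values in _data are assumed to carry an absolute 'expire' timestamp.

```python
# Sketch: TTL expiry on the watcher's Raft node, mirroring Patroni's
# KVStoreTTL. Extends the WatcherKVStore class from the earlier sketch.
import time

from pysyncobj import replicated


class WatcherKVStoreTTL(WatcherKVStore):
    @replicated
    def _expire(self, keys):
        # Replicated so all members agree on which keys were expired.
        for key in keys:
            self._data.pop(key, None)

    def __expire_keys(self):
        now = time.time()
        expired = [k for k, v in self._data.items()
                   if v.get("expire") and v["expire"] <= now]
        if expired:
            self._expire(expired)

    def _onTick(self, timeToWait=0.0):
        super()._onTick(timeToWait)
        # Only the leader drives expiry; this is what lets a replica take
        # over once the old Patroni leader key times out.
        if self._isLeader():
            self.__expire_keys()
```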
Juju action results require hyphenated keys (e.g., 'healthy-count') rather than underscored keys. Fixed the health check action to use the proper key format and updated test expectations.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
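In ops, action results are returned via event.set_results(), and per the commit Juju rejects result keys containing underscores. A sketch of the corrected handler follows; the helper and counts are illustrative.

```python
# Sketch of the corrected action handler; only the key format is the point.
from ops.charm import ActionEvent


def _on_health_check_action(self, event: ActionEvent) -> None:
    healthy, total = self._check_postgresql_endpoints()  # hypothetical helper
    event.set_results({
        "healthy-count": healthy,  # valid: lowercase words joined by hyphens
        "total-count": total,      # "healthy_count" would be rejected by Juju
    })
```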
…sues

- Add watcher PostgreSQL user for health check authentication:
  - Create 'watcher' user with password via relation secret
  - Add pg_hba.conf entry for watcher IP in patroni.yml template
  - Pass password from relation secret to health checker
- Fix lint issues:
  - Extract S3 initialization to _handle_s3_initialization() to reduce _on_peer_relation_changed complexity from 11 to 10
  - Use absolute paths for subprocess commands (/usr/bin/systemctl, etc.)
  - Update type hints to use modern syntax (X | None vs Optional[X])
  - Fix line length formatting issues
- Fix unit test failures:
  - Add missing mocks in test_update_member_ip for endpoint methods
  - Add _units_ips mock in test_update_relation_data_leader
- Fix integration test:
  - Add check_watcher_ip parameter to verify_raft_cluster_health() to handle watcher IP changes after network isolation tests
- Update watcher charm to handle IP changes (see the sketch after this commit message):
  - Add _update_unit_address_if_changed() for IP change detection
  - Call from config-changed and update-status events

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
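A minimal sketch of the IP-change detection mentioned above, assuming the address is read from the charm's network binding and the last known value is cached in peer relation data; only _update_unit_address_if_changed() is named in the commit, so the binding name, relation name, and follow-up call are assumptions.

```python
# Sketch: detect a changed unit IP and refresh the stored address.
# Wired to config-changed and update-status per the commit message.
def _update_unit_address_if_changed(self) -> None:
    binding = self.model.get_binding("watcher")  # endpoint name assumed
    if binding is None:
        return
    current = str(binding.network.bind_address)
    peers = self.model.get_relation("watcher-peers")  # relation name assumed
    if not current or peers is None:
        return
    if peers.data[self.unit].get("address") != current:
        peers.data[self.unit]["address"] = current
        self._rejoin_raft_cluster(current)  # hypothetical follow-up
```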
Force-pushed from ff68921 to 2c2e8d3
Remove outdated constraint about deploy order being critical for stereo mode with Raft DCS. Testing confirmed that 2 PostgreSQL units can now be deployed simultaneously without causing split-brain. Also update deprecated relate() calls to integrate().

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
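The relate() call is deprecated in python-libjuju in favor of integrate(), matching the Juju 3 CLI. In a pytest-operator test the updated call looks roughly like this; application names are placeholders.

```python
# Sketch: deprecated relate() vs. integrate() in a pytest-operator test.
import pytest


@pytest.mark.abort_on_fail
async def test_build_and_deploy(ops_test):
    # Before (deprecated): await ops_test.model.relate(...)
    await ops_test.model.integrate("postgresql", "postgresql-watcher")
    await ops_test.model.wait_for_idle(apps=["postgresql", "postgresql-watcher"])
```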
Summary
This PR introduces a proof-of-concept implementation for PostgreSQL stereo mode - a 2-node PostgreSQL cluster configuration with a lightweight watcher/witness charm that provides Raft quorum without running PostgreSQL. This enables automatic failover in 2-node clusters by providing the necessary third vote for consensus.
The implementation includes:
- postgresql-watcher charm that participates in Raft consensus using pysyncobj
- watcher relation

How It Works
Deployment
Known Issues / TODOs