
[WIP] Stereo mode PoC #1401

Draft
marceloneppel wants to merge 12 commits into 16/edge from stereo-mode-poc

Conversation

marceloneppel (Member) commented Jan 27, 2026

Summary

This PR introduces a proof-of-concept implementation for PostgreSQL stereo mode - a 2-node PostgreSQL cluster configuration with a lightweight watcher/witness charm that provides Raft quorum without running PostgreSQL. This enables automatic failover in 2-node clusters by providing the necessary third vote for consensus.

The implementation includes:

  • A new postgresql-watcher charm that participates in Raft consensus using pysyncobj (a minimal sketch follows this list)
  • Health monitoring of PostgreSQL endpoints via psycopg2 connections
  • A systemd service for persistence between charm hook invocations
  • KVStoreTTL-compatible class for Patroni DCS compatibility with TTL expiry logic for failover
  • Integration with the main PostgreSQL operator via a new watcher relation
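
As a rough illustration of the witness side, the sketch below shows a minimal Raft voter built on pysyncobj. The class name, addresses, port, and password are placeholders; the PR's actual raft_service.py is more involved.

# Minimal sketch of a Raft witness built on pysyncobj; it only votes and stores
# no PostgreSQL data. Addresses, port, and password are placeholders, not the
# charm's real configuration.
from pysyncobj import SyncObj, SyncObjConf


class Witness(SyncObj):
    """A voting member that contributes to quorum without holding cluster state."""

    def __init__(self, self_addr, partner_addrs, password):
        conf = SyncObjConf(
            password=password,              # shared Raft secret (from the relation secret)
            dynamicMembershipChange=True,   # members can be added/removed at runtime
            autoTick=True,                  # run the Raft loop in a background thread
        )
        super().__init__(self_addr, partner_addrs, conf=conf)


if __name__ == "__main__":
    # Hypothetical endpoints: the two PostgreSQL units plus this watcher unit.
    Witness("10.0.0.3:2222", ["10.0.0.1:2222", "10.0.0.2:2222"], "raft-password")

The point is simply that the process takes part in leader election and voting while exposing no replicated PostgreSQL state of its own.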

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ PostgreSQL  │     │ PostgreSQL  │     │   Watcher   │
│   Primary   │◄───►│   Replica   │◄───►│  (Witness)  │
│  Raft Vote  │     │  Raft Vote  │     │  Raft Vote  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    Raft Consensus
                   (3 nodes = quorum)
  • In a 2-node PostgreSQL cluster, both nodes have Raft votes
  • The watcher provides the 3rd vote needed for quorum
  • If one PostgreSQL node fails, the remaining node + watcher form quorum (2/3)
  • Patroni can elect a new leader without manual intervention
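
The quorum rule behind these bullets is plain majority arithmetic; a tiny illustrative snippet (not code from the PR):

# Illustrative only: Raft needs a strict majority of the voting members.
def has_quorum(total_voters, live_voters):
    return live_voters >= total_voters // 2 + 1

assert has_quorum(3, 2)       # one PostgreSQL node down: surviving node + watcher still have quorum
assert not has_quorum(2, 1)   # without the watcher, a lone surviving node cannot elect a leader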

Deployment

# Deploy PostgreSQL with 2 units
juju deploy postgresql --channel 16/edge/stereo-mode-poc --base ubuntu@24.04 --num-units 2

# Deploy watcher (ideally in different AZ)
juju deploy postgresql-watcher --channel latest/edge/stereo-mode-poc --base ubuntu@24.04

# Integrate PostgreSQL with watcher
juju integrate postgresql:watcher postgresql-watcher:watcher

# Deploy the test application
juju deploy postgresql-test-app

# Integrate PostgreSQL with test app
juju integrate postgresql postgresql-test-app:database

# Wait for all applications to be active
juju status --watch 5s

# Start continuous writes to test the cluster
juju run postgresql-test-app/0 start-continuous-writes

# Verify writes are increasing (run multiple times to see count increase)
juju run postgresql-test-app/0 show-continuous-writes
# Example output: writes: "38"

juju run postgresql-test-app/0 show-continuous-writes
# Example output: writes: "87"

# You can also check the watcher topology
juju run postgresql-watcher/0 show-topology

# Trigger a manual health check
juju run postgresql-watcher/0 trigger-health-check

Known Issues / TODOs

  • Draft/WIP status - needs further testing and review
  • Code scanning flagged clear-text logging of sensitive information
  • Patch coverage is low (46.68%) and project coverage (68.57%) is below the 70% target
  • A constraint that forbids deploying the watcher in the same availability zone as the PostgreSQL units is still pending

Add a lightweight witness/voter charm that participates in Raft
consensus to provide quorum in 2-node PostgreSQL clusters without
storing any PostgreSQL data.

Key components:
- Watcher charm with Raft controller integration
- Health checking for PostgreSQL endpoints
- Relation interface (postgresql_watcher) for PostgreSQL operator
- Topology and health check actions

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
github-actions bot added the Libraries: Out of sync (the charm libs used are out-of-sync) label Jan 27, 2026
    content = secret.get_content(refresh=True)
    return content.get("raft-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# Get the secret ID for sharing
try:
    secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
    logger.info(f"Got secret for update: {secret}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
logger.info(f"Got secret for update: {secret}")
secret_id = secret.id
logger.info(f"Initial secret_id: {secret_id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# the ops library lazily loads the ID. We need the ID to share with the watcher.
logger.info("Applying secret ID workaround")
secret_info = secret.get_info()
logger.info(f"Secret info: {secret_info}, id={secret_info.id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# Use the ID directly from get_info() - it already has the full URI
secret._id = secret_info.id
secret_id = secret.id
logger.info(f"Workaround secret_id: {secret_id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

    "raft-partner-addrs": json.dumps(sorted(raft_partner_addrs)),
    "raft-port": str(RAFT_PORT),
}
logger.info(f"Updating relation app data: {update_data}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 46.68305% with 217 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.57%. Comparing base (44c4c40) to head (ef92a6b).

Files with missing lines    Patch %   Lines
src/relations/watcher.py    53.35%    128 Missing and 18 partials ⚠️
src/cluster.py               6.00%    47 Missing ⚠️
src/charm.py                38.46%    18 Missing and 6 partials ⚠️

❌ Your project check has failed because the head coverage (68.57%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1401      +/-   ##
===========================================
- Coverage    70.53%   68.57%   -1.97%     
===========================================
  Files           16       17       +1     
  Lines         4297     4694     +397     
  Branches       691      749      +58     
===========================================
+ Hits          3031     3219     +188     
- Misses        1055     1240     +185     
- Partials       211      235      +24     

☔ View full report in Codecov by Sentry.
… pysyncobj Raft service

Add a standalone raft_service.py that implements a KVStoreTTL-compatible
Raft node managed as a systemd service, eliminating the dependency on
the charmed-postgresql snap. Remove automatic health checks in favor of
on-demand checks via an action, since the watcher lacks PostgreSQL credentials.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
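
For context on what "managed as a systemd service" can look like from charm code, here is a hedged sketch; the unit name, file paths, and the EnvironmentFile indirection are assumptions rather than the PR's actual wiring, and the indirection is one way to avoid the clear-text storage finding shown below.

# Sketch only: render a systemd unit for the standalone Raft service and start it.
# SERVICE_FILE, the ExecStart path, and the environment file are hypothetical.
import subprocess
from pathlib import Path

SERVICE_FILE = "/etc/systemd/system/postgresql-watcher-raft.service"

UNIT_TEMPLATE = """[Unit]
Description=PostgreSQL watcher Raft witness

[Service]
# Keeping the Raft password in a root-only EnvironmentFile (instead of in the
# unit file itself) is one way to avoid storing it in clear text here.
EnvironmentFile=/etc/postgresql-watcher/raft.env
ExecStart=/usr/bin/python3 /var/lib/postgresql-watcher/raft_service.py

[Install]
WantedBy=multi-user.target
"""

def install_service():
    Path(SERVICE_FILE).write_text(UNIT_TEMPLATE)
    subprocess.run(["/usr/bin/systemctl", "daemon-reload"], check=True)
    subprocess.run(["/usr/bin/systemctl", "enable", "--now", "postgresql-watcher-raft"], check=True)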
    return

# Write service file
Path(SERVICE_FILE).write_text(service_content)

Check failure: Code scanning / CodeQL: Clear-text storage of sensitive information (High). This expression stores sensitive data (password) as clear text.
…tereo mode tests

Replace cut_network_from_unit_without_ip_change with cut_network_from_unit
in stereo mode integration tests. The iptables-based approach with REJECT
was still causing timeouts; removing the interface entirely triggers faster
TCP connection failures. Added use_ip_from_inside=True for check_writes
since restored units get new IPs. Also adds spread task for stereo mode tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add Raft member proactively during IP change to prevent race conditions
where member restarts Patroni before being added to cluster. Implement
watcher removal from Raft on relation departure to maintain correct
quorum calculations. Add idempotency check before adding watcher to Raft.
Use fresh peer IPs for Raft member addition instead of cached values.
Update stereo mode tests with iptables-based network isolation and Raft
health verification.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
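
The commit above adds and removes Raft members as IPs change and relations depart. A minimal sketch of dynamic membership with pysyncobj, assuming the SyncObj was created with dynamicMembershipChange enabled and that the charm keeps its own record of known members (an assumption, not the PR's data structure):

# Sketch: dynamic Raft membership with pysyncobj.
def add_member(raft, known_members, new_addr):
    if new_addr in known_members:   # idempotency guard: already part of the cluster
        return
    raft.addNodeToCluster(new_addr)
    known_members.add(new_addr)

def remove_member(raft, known_members, old_addr):
    # Removing a departed watcher keeps the quorum calculation correct.
    if old_addr in known_members:
        raft.removeNodeFromCluster(old_addr)
        known_members.discard(old_addr)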
…o tests

Build the watcher charm automatically if not found and deploy charms
sequentially instead of concurrently to improve reliability.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add idempotency check to skip deployment if already in expected state
- Clean up unexpected state before redeploying to avoid test pollution
- Add wait_for_idle after replica shutdown to allow cluster stabilization

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel marceloneppel changed the title [WIP] Stereo mode poc [WIP] Stereo mode PoC Jan 30, 2026
…fy_raft_cluster_health call

- Add use_ip_from_inside=True to test_watcher_network_isolation to handle stale IPs
- Fix verify_raft_cluster_health call in test_health_check_action to pass required arguments

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add __expire_keys and _onTick methods to WatcherKVStoreTTL to match
Patroni's KVStoreTTL behavior. When the watcher becomes the Raft leader
(e.g., when PostgreSQL primary is network-isolated), it must expire
stale leader keys so that a replica can acquire leadership.

Without this fix, the watcher would become Raft leader but wouldn't
process TTL expirations, causing the old Patroni leader key to remain
valid and preventing failover.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
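
A minimal sketch of the TTL-expiry idea described above; this is not Patroni's KVStoreTTL, just the shape of it: keys carry an expiration timestamp, and only the leader drops lapsed keys so a replica can take over the leader key.

# Illustrative TTL expiry: the Raft leader periodically removes lapsed keys so
# a stale /leader entry cannot block failover. Names are assumptions.
import time

class TinyKVStoreTTL:
    def __init__(self):
        self._data = {}   # key -> {"value": ..., "expire": float or None}

    def set(self, key, value, ttl=None):
        expire = time.time() + ttl if ttl else None
        self._data[key] = {"value": value, "expire": expire}

    def _expire_keys(self):
        now = time.time()
        for key in [k for k, v in self._data.items() if v["expire"] and v["expire"] <= now]:
            del self._data[key]

    def on_tick(self, is_leader):
        # Only the leader expires keys; without this, an isolated old primary's
        # leader key would stay "valid" forever and no replica could be promoted.
        if is_leader:
            self._expire_keys()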
Juju action results require hyphenated keys (e.g., 'healthy-count')
rather than underscored keys. Fixed the health check action to use
proper key format and updated test expectations.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
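
A small sketch of that constraint using ops' ActionEvent; the key name comes from the commit, while the handler and its helper are hypothetical.

# Sketch: Juju rejects underscores in action result keys, so return hyphenated
# keys. This would be a method of the watcher charm class; _check_endpoints()
# is a hypothetical helper returning the healthy endpoints.
from ops.charm import ActionEvent

def _on_trigger_health_check_action(self, event: ActionEvent) -> None:
    healthy = self._check_endpoints()
    event.set_results({
        "healthy-count": len(healthy),    # accepted
        # "healthy_count": len(healthy),  # rejected: underscores are not valid in result keys
    })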
…sues

- Add watcher PostgreSQL user for health check authentication:
  - Create 'watcher' user with password via relation secret
  - Add pg_hba.conf entry for watcher IP in patroni.yml template
  - Pass password from relation secret to health checker

- Fix lint issues:
  - Extract S3 initialization to _handle_s3_initialization() to reduce
    _on_peer_relation_changed complexity from 11 to 10
  - Use absolute paths for subprocess commands (/usr/bin/systemctl, etc.)
  - Update type hints to use modern syntax (X | None vs Optional[X])
  - Fix line length formatting issues

- Fix unit test failures:
  - Add missing mocks in test_update_member_ip for endpoint methods
  - Add _units_ips mock in test_update_relation_data_leader

- Fix integration test:
  - Add check_watcher_ip parameter to verify_raft_cluster_health()
    to handle watcher IP changes after network isolation tests

- Update watcher charm to handle IP changes:
  - Add _update_unit_address_if_changed() for IP change detection
  - Call from config-changed and update-status events

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
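
To make the health-check path above concrete, here is a hedged sketch of an endpoint probe with psycopg2; the "watcher" user and the relation-provided password mirror the commit, while the host, database, and timeout are assumptions.

# Sketch: check whether a PostgreSQL endpoint accepts connections from the watcher.
import psycopg2

def endpoint_is_healthy(host, password):
    try:
        with psycopg2.connect(
            host=host,
            user="watcher",          # user created via the relation secret
            password=password,
            dbname="postgres",
            connect_timeout=3,
        ) as connection:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1;")
                return cursor.fetchone() == (1,)
    except psycopg2.Error:
        return False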
    content = secret.get_content(refresh=True)
    return content.get("raft-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.
Remove outdated constraint about deploy order being critical for
stereo mode with Raft DCS. Testing confirmed that 2 PostgreSQL
units can now be deployed simultaneously without causing split-brain.

Also update deprecated relate() calls to integrate().

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
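
For the relate() to integrate() change, a sketch of the updated call as it might appear in a pytest-operator test; the surrounding test body is hypothetical, and the availability of integrate() in the libjuju version the tests use is assumed from the commit.

# Sketch of the updated test call, assuming pytest-operator with python-libjuju.
async def test_integrate_watcher(ops_test):
    # Previously: await ops_test.model.relate("postgresql:watcher", "postgresql-watcher:watcher")
    await ops_test.model.integrate("postgresql:watcher", "postgresql-watcher:watcher")
    await ops_test.model.wait_for_idle(status="active")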
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
