
[WIP] Stereo mode PoC #1401

Draft
marceloneppel wants to merge 12 commits into 16/edge from stereo-mode-poc

Conversation

marceloneppel (Member) commented Jan 27, 2026

Summary

This PR introduces a proof-of-concept implementation for PostgreSQL stereo mode - a 2-node PostgreSQL cluster configuration with a lightweight watcher/witness charm that provides Raft quorum without running PostgreSQL. This enables automatic failover in 2-node clusters by providing the necessary third vote for consensus.

The implementation includes:

  • A new postgresql-watcher charm that participates in Raft consensus using pysyncobj (a minimal sketch follows this list)
  • Health monitoring of PostgreSQL endpoints via psycopg2 connections
  • A systemd service for persistence between charm hook invocations
  • KVStoreTTL-compatible class for Patroni DCS compatibility with TTL expiry logic for failover
  • Integration with the main PostgreSQL operator via a new watcher relation
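
As a rough illustration of the witness side, the sketch below shows a minimal Raft voter built on pysyncobj. The class name, addresses, port, and password are placeholders; the PR's actual raft_service.py is more involved.

# Minimal sketch of a Raft witness built on pysyncobj; it only votes and stores
# no PostgreSQL data. Addresses, port, and password are placeholders, not the
# charm's real configuration.
from pysyncobj import SyncObj, SyncObjConf


class Witness(SyncObj):
    """A voting member that contributes to quorum without holding cluster state."""

    def __init__(self, self_addr, partner_addrs, password):
        conf = SyncObjConf(
            password=password,              # shared Raft secret (from the relation secret)
            dynamicMembershipChange=True,   # members can be added/removed at runtime
            autoTick=True,                  # run the Raft loop in a background thread
        )
        super().__init__(self_addr, partner_addrs, conf=conf)


if __name__ == "__main__":
    # Hypothetical endpoints: the two PostgreSQL units plus this watcher unit.
    Witness("10.0.0.3:2222", ["10.0.0.1:2222", "10.0.0.2:2222"], "raft-password")

The point is simply that the process takes part in leader election and voting while exposing no replicated PostgreSQL state of its own.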

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ PostgreSQL  │     │ PostgreSQL  │     │   Watcher   │
│   Primary   │◄───►│   Replica   │◄───►│  (Witness)  │
│  Raft Vote  │     │  Raft Vote  │     │  Raft Vote  │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                    Raft Consensus
                   (3 nodes = quorum)
  • In a 2-node PostgreSQL cluster, both nodes have Raft votes
  • The watcher provides the 3rd vote needed for quorum
  • If one PostgreSQL node fails, the remaining node + watcher form quorum (2/3)
  • Patroni can elect a new leader without manual intervention
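
The quorum rule behind these bullets is plain majority arithmetic; a tiny illustrative snippet (not code from the PR):

# Illustrative only: Raft needs a strict majority of the voting members.
def has_quorum(total_voters, live_voters):
    return live_voters >= total_voters // 2 + 1

assert has_quorum(3, 2)       # one PostgreSQL node down: surviving node + watcher still have quorum
assert not has_quorum(2, 1)   # without the watcher, a lone surviving node cannot elect a leader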

Deployment

# Deploy PostgreSQL with 2 units
juju deploy postgresql --channel 16/edge/stereo-mode-poc --base ubuntu@24.04 --num-units 2

# Deploy watcher (ideally in different AZ)
juju deploy postgresql-watcher --channel latest/edge/stereo-mode-poc --base ubuntu@24.04

# Integrate PostgreSQL with watcher
juju integrate postgresql:watcher postgresql-watcher:watcher

# Deploy the test application
juju deploy postgresql-test-app

# Integrate PostgreSQL with test app
juju integrate postgresql postgresql-test-app:database

# Wait for all applications to be active
juju status --watch 5s

# Start continuous writes to test the cluster
juju run postgresql-test-app/0 start-continuous-writes

# Verify writes are increasing (run multiple times to see count increase)
juju run postgresql-test-app/0 show-continuous-writes
# Example output: writes: "38"

juju run postgresql-test-app/0 show-continuous-writes
# Example output: writes: "87"

# You can also check the watcher topology
juju run postgresql-watcher/0 show-topology

# Trigger a manual health check
juju run postgresql-watcher/0 trigger-health-check

Known Issues / TODOs

  • Draft/WIP status - needs further testing and review
  • Code scanning flagged clear-text logging of sensitive information
  • Patch coverage is low (46.68%) and project coverage (68.57%) is below the 70% target
  • A constraint that forbids deploying the watcher in the same availability zone as the PostgreSQL units is still pending

Add a lightweight witness/voter charm that participates in Raft
consensus to provide quorum in 2-node PostgreSQL clusters without
storing any PostgreSQL data.

Key components:
- Watcher charm with Raft controller integration
- Health checking for PostgreSQL endpoints
- Relation interface (postgresql_watcher) for PostgreSQL operator
- Topology and health check actions

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
github-actions bot added the Libraries: Out of sync (the charm libs used are out-of-sync) label Jan 27, 2026
    content = secret.get_content(refresh=True)
    return content.get("raft-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# Get the secret ID for sharing
try:
    secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
    logger.info(f"Got secret for update: {secret}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

secret = self.charm.model.get_secret(label=WATCHER_SECRET_LABEL)
logger.info(f"Got secret for update: {secret}")
secret_id = secret.id
logger.info(f"Initial secret_id: {secret_id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# the ops library lazily loads the ID. We need the ID to share with the watcher.
logger.info("Applying secret ID workaround")
secret_info = secret.get_info()
logger.info(f"Secret info: {secret_info}, id={secret_info.id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

# Use the ID directly from get_info() - it already has the full URI
secret._id = secret_info.id
secret_id = secret.id
logger.info(f"Workaround secret_id: {secret_id}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

    "raft-partner-addrs": json.dumps(sorted(raft_partner_addrs)),
    "raft-port": str(RAFT_PORT),
}
logger.info(f"Updating relation app data: {update_data}")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.

codecov bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 46.68305% with 217 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.57%. Comparing base (44c4c40) to head (ef92a6b).

Files with missing lines    Patch %   Lines
src/relations/watcher.py    53.35%    128 Missing and 18 partials ⚠️
src/cluster.py               6.00%    47 Missing ⚠️
src/charm.py                38.46%    18 Missing and 6 partials ⚠️

❌ Your project check has failed because the head coverage (68.57%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1401      +/-   ##
===========================================
- Coverage    70.53%   68.57%   -1.97%     
===========================================
  Files           16       17       +1     
  Lines         4297     4694     +397     
  Branches       691      749      +58     
===========================================
+ Hits          3031     3219     +188     
- Misses        1055     1240     +185     
- Partials       211      235      +24     

☔ View full report in Codecov by Sentry.
… pysyncobj Raft service

Add a standalone raft_service.py that implements a KVStoreTTL-compatible
Raft node managed as a systemd service, eliminating the dependency on
the charmed-postgresql snap. Remove automatic health checks in favor of
on-demand checks via an action, since the watcher lacks PostgreSQL credentials.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
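
For context on what "managed as a systemd service" can look like from charm code, here is a hedged sketch; the unit name, file paths, and the EnvironmentFile indirection are assumptions rather than the PR's actual wiring, and the indirection is one way to avoid the clear-text storage finding shown below.

# Sketch only: render a systemd unit for the standalone Raft service and start it.
# SERVICE_FILE, the ExecStart path, and the environment file are hypothetical.
import subprocess
from pathlib import Path

SERVICE_FILE = "/etc/systemd/system/postgresql-watcher-raft.service"

UNIT_TEMPLATE = """[Unit]
Description=PostgreSQL watcher Raft witness

[Service]
# Keeping the Raft password in a root-only EnvironmentFile (instead of in the
# unit file itself) is one way to avoid storing it in clear text here.
EnvironmentFile=/etc/postgresql-watcher/raft.env
ExecStart=/usr/bin/python3 /var/lib/postgresql-watcher/raft_service.py

[Install]
WantedBy=multi-user.target
"""

def install_service():
    Path(SERVICE_FILE).write_text(UNIT_TEMPLATE)
    subprocess.run(["/usr/bin/systemctl", "daemon-reload"], check=True)
    subprocess.run(["/usr/bin/systemctl", "enable", "--now", "postgresql-watcher-raft"], check=True)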
    return

# Write service file
Path(SERVICE_FILE).write_text(service_content)

Check failure: Code scanning / CodeQL: Clear-text storage of sensitive information (High). This expression stores sensitive data (password) as clear text.
…tereo mode tests

Replace cut_network_from_unit_without_ip_change with cut_network_from_unit
in stereo mode integration tests. The iptables-based approach with REJECT
was still causing timeouts; removing the interface entirely triggers faster
TCP connection failures. Added use_ip_from_inside=True for check_writes
since restored units get new IPs. Also adds spread task for stereo mode tests.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add Raft member proactively during IP change to prevent race conditions
where member restarts Patroni before being added to cluster. Implement
watcher removal from Raft on relation departure to maintain correct
quorum calculations. Add idempotency check before adding watcher to Raft.
Use fresh peer IPs for Raft member addition instead of cached values.
Update stereo mode tests with iptables-based network isolation and Raft
health verification.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
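
The commit above adds and removes Raft members as IPs change and relations depart. A minimal sketch of dynamic membership with pysyncobj, assuming the SyncObj was created with dynamicMembershipChange enabled and that the charm keeps its own record of known members (an assumption, not the PR's data structure):

# Sketch: dynamic Raft membership with pysyncobj.
def add_member(raft, known_members, new_addr):
    if new_addr in known_members:   # idempotency guard: already part of the cluster
        return
    raft.addNodeToCluster(new_addr)
    known_members.add(new_addr)

def remove_member(raft, known_members, old_addr):
    # Removing a departed watcher keeps the quorum calculation correct.
    if old_addr in known_members:
        raft.removeNodeFromCluster(old_addr)
        known_members.discard(old_addr)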
…o tests

Build the watcher charm automatically if not found and deploy charms
sequentially instead of concurrently to improve reliability.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
- Add idempotency check to skip deployment if already in expected state
- Clean up unexpected state before redeploying to avoid test pollution
- Add wait_for_idle after replica shutdown to allow cluster stabilization

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
@marceloneppel marceloneppel changed the title [WIP] Stereo mode poc [WIP] Stereo mode PoC Jan 30, 2026
…fy_raft_cluster_health call

- Add use_ip_from_inside=True to test_watcher_network_isolation to handle stale IPs
- Fix verify_raft_cluster_health call in test_health_check_action to pass required arguments

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
Add __expire_keys and _onTick methods to WatcherKVStoreTTL to match
Patroni's KVStoreTTL behavior. When the watcher becomes the Raft leader
(e.g., when PostgreSQL primary is network-isolated), it must expire
stale leader keys so that a replica can acquire leadership.

Without this fix, the watcher would become Raft leader but wouldn't
process TTL expirations, causing the old Patroni leader key to remain
valid and preventing failover.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
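
A minimal sketch of the TTL-expiry idea described above; this is not Patroni's KVStoreTTL, just the shape of it: keys carry an expiration timestamp, and only the leader drops lapsed keys so a replica can take over the leader key.

# Illustrative TTL expiry: the Raft leader periodically removes lapsed keys so
# a stale /leader entry cannot block failover. Names are assumptions.
import time

class TinyKVStoreTTL:
    def __init__(self):
        self._data = {}   # key -> {"value": ..., "expire": float or None}

    def set(self, key, value, ttl=None):
        expire = time.time() + ttl if ttl else None
        self._data[key] = {"value": value, "expire": expire}

    def _expire_keys(self):
        now = time.time()
        for key in [k for k, v in self._data.items() if v["expire"] and v["expire"] <= now]:
            del self._data[key]

    def on_tick(self, is_leader):
        # Only the leader expires keys; without this, an isolated old primary's
        # leader key would stay "valid" forever and no replica could be promoted.
        if is_leader:
            self._expire_keys()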
Juju action results require hyphenated keys (e.g., 'healthy-count')
rather than underscored keys. Fixed the health check action to use
proper key format and updated test expectations.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
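
A small sketch of that constraint using ops' ActionEvent; the key name comes from the commit, while the handler and its helper are hypothetical.

# Sketch: Juju rejects underscores in action result keys, so return hyphenated
# keys. This would be a method of the watcher charm class; _check_endpoints()
# is a hypothetical helper returning the healthy endpoints.
from ops.charm import ActionEvent

def _on_trigger_health_check_action(self, event: ActionEvent) -> None:
    healthy = self._check_endpoints()
    event.set_results({
        "healthy-count": len(healthy),    # accepted
        # "healthy_count": len(healthy),  # rejected: underscores are not valid in result keys
    })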
…sues

- Add watcher PostgreSQL user for health check authentication:
  - Create 'watcher' user with password via relation secret
  - Add pg_hba.conf entry for watcher IP in patroni.yml template
  - Pass password from relation secret to health checker

- Fix lint issues:
  - Extract S3 initialization to _handle_s3_initialization() to reduce
    _on_peer_relation_changed complexity from 11 to 10
  - Use absolute paths for subprocess commands (/usr/bin/systemctl, etc.)
  - Update type hints to use modern syntax (X | None vs Optional[X])
  - Fix line length formatting issues

- Fix unit test failures:
  - Add missing mocks in test_update_member_ip for endpoint methods
  - Add _units_ips mock in test_update_relation_data_leader

- Fix integration test:
  - Add check_watcher_ip parameter to verify_raft_cluster_health()
    to handle watcher IP changes after network isolation tests

- Update watcher charm to handle IP changes:
  - Add _update_unit_address_if_changed() for IP change detection
  - Call from config-changed and update-status events

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
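
To make the health-check path above concrete, here is a hedged sketch of an endpoint probe with psycopg2; the "watcher" user and the relation-provided password mirror the commit, while the host, database, and timeout are assumptions.

# Sketch: check whether a PostgreSQL endpoint accepts connections from the watcher.
import psycopg2

def endpoint_is_healthy(host, password):
    try:
        with psycopg2.connect(
            host=host,
            user="watcher",          # user created via the relation secret
            password=password,
            dbname="postgres",
            connect_timeout=3,
        ) as connection:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1;")
                return cursor.fetchone() == (1,)
    except psycopg2.Error:
        return False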
    content = secret.get_content(refresh=True)
    return content.get("raft-password")
except SecretNotFoundError:
    logger.warning(f"Secret {secret_id} not found")

Check failure: Code scanning / CodeQL: Clear-text logging of sensitive information (High). This expression logs sensitive data (secret) as clear text.
Remove outdated constraint about deploy order being critical for
stereo mode with Raft DCS. Testing confirmed that 2 PostgreSQL
units can now be deployed simultaneously without causing split-brain.

Also update deprecated relate() calls to integrate().

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
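
For the relate() to integrate() change, a sketch of the updated call as it might appear in a pytest-operator test; the surrounding test body is hypothetical, and the availability of integrate() in the libjuju version the tests use is assumed from the commit.

# Sketch of the updated test call, assuming pytest-operator with python-libjuju.
async def test_integrate_watcher(ops_test):
    # Previously: await ops_test.model.relate("postgresql:watcher", "postgresql-watcher:watcher")
    await ops_test.model.integrate("postgresql:watcher", "postgresql-watcher:watcher")
    await ops_test.model.wait_for_idle(status="active")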
Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>
