4 changes: 4 additions & 0 deletions src/how-to/administrate/README.md
@@ -101,8 +101,12 @@
## 12. Migration

- [Migrate team features](migrate-team-features.md)
- [Migrate to postgresql](migrate-to-postgresql.md)

## 13. Wire-utility

- [Wire utility tool](wire-utility-tool.md)

## 14. Postgresql(On prem)
- [Setup postgresql cluster for on-prem environment](postgresql-cluster.md)
<!-- TODO: .. include:: administration/redis.rst -->
365 changes: 365 additions & 0 deletions src/how-to/administrate/migrate-to-postgresql.md
@@ -0,0 +1,365 @@
# Migrate Galley Data from Cassandra to PostgreSQL

Use this procedure to migrate Galley-managed data from Cassandra to PostgreSQL. The migration path is available starting with `wire-server` `5.24`. This migration is only required if you need channel search and channel management from Team Settings on releases that support PostgreSQL-backed conversation data.

This guide covers these data domains:

- Conversations
- Conversation codes
- Team features

After the migration is complete, PostgreSQL becomes the authoritative store for the migrated domains.

## Before You Start

Make sure all of the following are true before changing any migration settings:

- You are running `wire-server` `5.24` or later, with support for PostgreSQL-backed conversation data.
- PostgreSQL is deployed and reachable from the cluster. If you still need to set up the custom PostgreSQL cluster for an on-prem environment, see [PostgreSQL High Availability Cluster - Quick Setup](postgresql-cluster.md).
- `galley` and `background-worker` both have PostgreSQL host, database, user, and password configured.
- The `cassandra-migrations` job for your Wire upgrade has already completed successfully.
- You have enough PostgreSQL connections available for the temporary migration workload.

The `cassandra-migrations` job only prepares schema and metadata. It does not copy conversation data from Cassandra into PostgreSQL. The data copy is performed by `background-worker`.

## PostgreSQL Connection Budget

`postgresqlPool.size` is a per-pod setting. To estimate how many PostgreSQL connections Wire can open, multiply each service pool size by the number of replicas for that service, then sum the results. In a standard Wire deployment sharing the same PostgreSQL primary, this means at least `brig`, `galley`, and `background-worker`.

Use this formula:

```text
total_postgresql_connections =
(brig_pool_size * brig_replicas) +
(galley_pool_size * galley_replicas) +
(background_worker_pool_size * background_worker_replicas)
```

This number is the minimum application-side connection budget you should plan for on the PostgreSQL primary when Wire connects through the read-write service.
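
The formula above can be sketched as a small helper. This is a minimal illustration, not part of Wire: the function names, the `Layout` type, and the `max_connections=700` value are assumptions chosen for the example, and the pool sizes and replica counts are placeholders you should replace with your own.

```python
from typing import Mapping, Tuple

# service name -> (postgresqlPool.size per pod, number of replicas)
Layout = Mapping[str, Tuple[int, int]]


def total_postgresql_connections(layout: Layout) -> int:
    """Sum pool_size * replicas over every service sharing the PostgreSQL primary."""
    return sum(pool_size * replicas for pool_size, replicas in layout.values())


def headroom(layout: Layout, max_connections: int) -> int:
    """Connections left over for admin sessions, monitoring, and migration spikes."""
    return max_connections - total_postgresql_connections(layout)


# Illustrative layout (assumed values): three replicas per service.
example_layout = {
    "brig": (100, 3),
    "galley": (100, 3),
    "background-worker": (5, 3),
}

print(total_postgresql_connections(example_layout))   # 615
print(headroom(example_layout, max_connections=700))  # 85
```

A negative `headroom` result means the chosen `max_connections` cannot cover the application-side pools even before administrative sessions are counted.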

### Default starting point

The default pool size of `100` for `brig` and `galley` is intentionally generous. It gives flexibility and is a safe starting point when traffic is unknown or when you want to avoid early pool pressure.

Default calculation:

```text
brig: 100 * 3 = 300
galley: 100 * 3 = 300
background-worker: 5 * 3 = 15
total: 615
```

If you use this layout, set `max_connections` above `615` and keep additional headroom for:

- PostgreSQL administrative sessions
- Monitoring and maintenance jobs
- Temporary migration increases
- Future replica scaling on the Wire side

### Low-traffic starting point

For staging, medium-sized, or other low-traffic environments, the default of `100` for `brig` and `galley` is often more than needed. A common reduced-pool starting point is:

```text
brig: 10 * 3 = 30
galley: 10 * 3 = 30
background-worker: 5 * 1 = 5
total: 65
```

With this reduced layout, `max_connections = 100` is often a reasonable starting point because it leaves headroom above the 65 application-side connections. On self-managed PostgreSQL, set this in the PostgreSQL server configuration. In the provided Wire Ansible deployment, that means the `postgresql.conf` template.

### Tune from observed traffic

Do not stop at the default or reduced example. Watch the actual pool usage for at least 24 hours, then adjust both `postgresqlPool.size` and PostgreSQL `max_connections` from the observed traffic pattern.

Useful metrics for sizing:

- `wire_hasql_pool_session_count`: shows how many sessions are open over time
- `wire_hasql_pool_in_use`: shows how many connections are actively checked out
- `wire_hasql_pool_ready_for_use`: shows whether idle connections are available
- `wire_hasql_pool_session_failure_count`: should stay at `0`; increases indicate pool pressure or connectivity issues
- `rate(wire_hasql_pool_connection_established_count[5m])`: shows connection churn and can reveal undersized or unstable pools

Signs that the reduced sizing is sufficient:

- `wire_hasql_pool_session_count` stays well below the configured pool size for each pod
- `wire_hasql_pool_in_use` stays low and `wire_hasql_pool_ready_for_use` stays available
- `wire_hasql_pool_session_failure_count` remains `0`
- There are no acquisition timeout errors in service logs

Signs that you should increase the pool size and recalculate `max_connections`:

- `wire_hasql_pool_session_count` regularly approaches the configured pool size
- `wire_hasql_pool_in_use` remains high or `wire_hasql_pool_ready_for_use` frequently drops to `0`
- `wire_hasql_pool_session_failure_count` increases
- You see acquisition timeout errors or sustained connection churn

After changing a service pool size, recalculate the total connection budget and raise PostgreSQL `max_connections` accordingly.

For managed PostgreSQL and on-prem installations, always set `max_connections` above the total calculated for the pool sizes you actually chose.

## Migration States

Each domain is controlled with `postgresMigration.<domain>` and can be in one of these states:

- `cassandra`: reads and writes stay on Cassandra
- `migration-to-postgresql`: new writes go to both Cassandra and PostgreSQL
- `postgresql`: reads and writes use PostgreSQL only

For each domain, the migration always follows the same sequence:

1. Enable dual-write by setting `postgresMigration.<domain>: migration-to-postgresql`.
2. Start the backfill by setting the matching `migrate*` flag on `background-worker`.
3. Cut over by setting `postgresMigration.<domain>: postgresql` and turning the `migrate*` flag off again.

Once a domain is moved to `migration-to-postgresql`, do not set it back to `cassandra`.
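
As a sanity check before a rollout, the state sequence can be expressed as a small guard. This is a sketch under assumptions: the guide only explicitly forbids moving from `migration-to-postgresql` back to `cassandra`. Rejecting every other backward move, and rejecting a direct jump from `cassandra` to `postgresql` (which would skip the backfill), is a conservative reading of the sequence above, not documented Wire behavior. The function name is hypothetical.

```python
# Migration states in the order the procedure walks through them.
STATES = ["cassandra", "migration-to-postgresql", "postgresql"]


def is_allowed_transition(current: str, target: str) -> bool:
    """Allow staying in place or moving exactly one step forward.

    Raises ValueError if either value is not a known state.
    """
    step = STATES.index(target) - STATES.index(current)
    return step in (0, 1)


print(is_allowed_transition("cassandra", "migration-to-postgresql"))  # True
print(is_allowed_transition("migration-to-postgresql", "cassandra"))  # False
print(is_allowed_transition("cassandra", "postgresql"))               # False
```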

## Important Rules

### Keep `galley` and `background-worker` aligned

`background-worker.config.postgresMigration.<domain>` must always match `galley.config.postgresMigration.<domain>`.
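
One way to catch drift between the two services is to compare their values before deploying. The sketch below assumes you have already loaded the `galley` and `background-worker` sections of your Helm values into plain dictionaries. How you load them is up to you, and the function name is hypothetical.

```python
from typing import Dict, List

DOMAINS = ["conversation", "conversationCodes", "teamFeatures"]


def mismatched_domains(galley_values: Dict, worker_values: Dict) -> List[str]:
    """Return the domains whose postgresMigration value differs between services."""
    galley = galley_values.get("config", {}).get("postgresMigration", {})
    worker = worker_values.get("config", {}).get("postgresMigration", {})
    return [d for d in DOMAINS if galley.get(d) != worker.get(d)]


# Illustrative values: background-worker was not updated for conversations.
galley_values = {"config": {"postgresMigration": {
    "conversation": "migration-to-postgresql",
    "conversationCodes": "cassandra",
    "teamFeatures": "cassandra",
}}}
worker_values = {"config": {"postgresMigration": {
    "conversation": "cassandra",
    "conversationCodes": "cassandra",
    "teamFeatures": "cassandra",
}}}

print(mismatched_domains(galley_values, worker_values))  # ['conversation']
```

An empty list means the two services agree on every domain.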

### Plan extra PostgreSQL capacity for the migration window

The steady-state pool size is often too small for the backfill step. If you see connection acquisition timeouts during migration, increase `background-worker.config.postgresqlPool.size` and `background-worker.config.postgresqlPool.acquisitionTimeout` before retrying.

### Migrate one domain at a time

Do not migrate conversations, conversation codes, and team features in the same deployment. Finish one domain completely before starting the next one.

## Base Configuration

Start from a safe baseline where PostgreSQL is configured but Cassandra is still authoritative.

```yaml
galley:
  config:
    postgresql:
      host: postgresql-external-rw
      port: "5432"
      user: wire-server
      dbname: wire-server
    postgresMigration:
      conversation: cassandra
      conversationCodes: cassandra
      teamFeatures: cassandra

background-worker:
  config:
    postgresql:
      host: postgresql-external-rw
      port: "5432"
      user: wire-server
      dbname: wire-server
    postgresqlPool:
      size: 5
      acquisitionTimeout: 10s
      agingTimeout: 1d
      idlenessTimeout: 10m
    postgresMigration:
      conversation: cassandra
      conversationCodes: cassandra
      teamFeatures: cassandra
    migrateConversations: false
    migrateConversationCodes: false
    migrateTeamFeatures: false
```

Deploy this first and verify both services are healthy.

## Migration Procedure

Apply the following procedure to one domain at a time.

### Step 1: Enable dual-write

Set the selected domain to `migration-to-postgresql` in both `galley` and `background-worker`.

Example for conversations:

```yaml
galley:
  config:
    postgresMigration:
      conversation: migration-to-postgresql
      conversationCodes: cassandra
      teamFeatures: cassandra

background-worker:
  config:
    postgresMigration:
      conversation: migration-to-postgresql
      conversationCodes: cassandra
      teamFeatures: cassandra
    migrateConversations: false
```

After the rollout:

- `galley` should restart cleanly.
- New writes for that domain should be written to both Cassandra and PostgreSQL.
- No backfill should run yet.

### Step 2: Start the backfill

Enable the matching migration flag on `background-worker`.

Flags by domain:

- Conversations: `migrateConversations: true`
- Conversation codes: `migrateConversationCodes: true`
- Team features: `migrateTeamFeatures: true`

Example for conversations:

```yaml
background-worker:
  config:
    migrateConversations: true
    postgresqlPool:
      size: 10
      acquisitionTimeout: 30s
    migrateConversationsOptions:
      pageSize: 10000
      parallelism: 2
```

`migrateConversationsOptions` is only used for conversation migration. Conversation codes and team features do not use this block.

### Step 3: Monitor the migration

Use logs and Prometheus metrics to confirm progress.

Check `background-worker` logs:

```bash
kubectl logs -f deploy/background-worker -n default
```

Useful log patterns:

- `finished migration`
- `error occurred`
- `estimatedRows`

Useful Prometheus metrics:

| Metric | Meaning |
| --- | --- |
| `wire_local_convs_migration_finished` | Local conversation migration is complete when the value is `1` |
| `wire_user_remote_convs_migration_finished` | Remote conversation index migration is complete when the value is `1` |
| `wire_team_features_migration_finished` | Team features migration is complete when the value is `1` |
| `wire_hasql_pool_ready_for_use` | PostgreSQL pool is healthy when each pod reports ready connections |
| `wire_hasql_pool_session_failure_count` | Should remain `0` |

There is no dedicated Prometheus completion metric for conversation codes. Validate that migration through logs.
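
If you want to script the completion check, the completion metrics can be read from scraped Prometheus text output. The sketch below is an assumption-laden illustration: how you obtain the metrics text is left open, the sample text is made up rather than real `background-worker` output, and the parser is deliberately simplified (it assumes no trailing timestamps on sample lines). The function names are hypothetical.

```python
from typing import List


def metric_values(metrics_text: str, metric_name: str) -> List[float]:
    """Collect sample values for one metric from Prometheus text exposition output."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # The metric name ends at the first "{" (labels) or the first space.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric_name:
            # Simplification: the value is the last whitespace-separated token.
            values.append(float(line.rsplit(" ", 1)[1]))
    return values


def migration_finished(metrics_text: str, metric_name: str) -> bool:
    """True only if the metric is present and every sample equals 1."""
    values = metric_values(metrics_text, metric_name)
    return bool(values) and all(v == 1.0 for v in values)


# Made-up sample, not real output.
sample = """
# illustrative sample
wire_local_convs_migration_finished 1.0
wire_user_remote_convs_migration_finished 0.0
"""

print(migration_finished(sample, "wire_local_convs_migration_finished"))        # True
print(migration_finished(sample, "wire_user_remote_convs_migration_finished"))  # False
```

Treating a missing metric as "not finished" is a deliberate choice here: it avoids cutting over when the metric has simply not been scraped yet.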

### Step 4: Cut over to PostgreSQL

When the migration has finished, set the selected domain to `postgresql` in both services and disable the matching migration flag.

Example for conversations:

```yaml
galley:
  config:
    postgresMigration:
      conversation: postgresql
      conversationCodes: cassandra
      teamFeatures: cassandra

background-worker:
  config:
    postgresMigration:
      conversation: postgresql
      conversationCodes: cassandra
      teamFeatures: cassandra
    migrateConversations: false
    postgresqlPool:
      size: 5
      acquisitionTimeout: 10s
```

After this rollout, reads and writes for the selected domain use PostgreSQL only.

## Recommended Domain Order

Migrate domains in this order:

1. Conversations
2. Conversation codes
3. Team features

This puts the largest and most operationally sensitive migration first, while the remaining domains are still on Cassandra and your rollback options for them are at their best.

## Final Configuration

When all domains have been migrated, both services should point all supported Galley data to PostgreSQL.

```yaml
galley:
  config:
    postgresql:
      host: postgresql-external-rw
      port: "5432"
      user: wire-server
      dbname: wire-server
    postgresMigration:
      conversation: postgresql
      conversationCodes: postgresql
      teamFeatures: postgresql

background-worker:
  config:
    postgresql:
      host: postgresql-external-rw
      port: "5432"
      user: wire-server
      dbname: wire-server
    postgresqlPool:
      size: 5
      acquisitionTimeout: 10s
      agingTimeout: 1d
      idlenessTimeout: 10m
    postgresMigration:
      conversation: postgresql
      conversationCodes: postgresql
      teamFeatures: postgresql
    migrateConversations: false
    migrateConversationCodes: false
    migrateTeamFeatures: false
```

## Post-Migration Checks

After the last cutover:

- Confirm `galley` and `background-worker` pods are healthy.
- Confirm `wire_hasql_pool_session_failure_count` stays at `0`.
- Confirm channel search and Team Settings channel management work as expected.
- Confirm no migration flags remain set to `true`.
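
The configuration-related checks in this list can be scripted against your values files. This is a minimal sketch, assuming the `galley` and `background-worker` sections are already loaded into dictionaries. The function name and the message wording are hypothetical, and the pod health and metric checks still need to be done separately.

```python
from typing import Dict, List

DOMAINS = ["conversation", "conversationCodes", "teamFeatures"]
FLAGS = ["migrateConversations", "migrateConversationCodes", "migrateTeamFeatures"]


def post_migration_issues(galley_values: Dict, worker_values: Dict) -> List[str]:
    """List config settings that contradict a fully finished migration."""
    issues = []
    services = {"galley": galley_values, "background-worker": worker_values}
    for service, values in services.items():
        states = values.get("config", {}).get("postgresMigration", {})
        for domain in DOMAINS:
            if states.get(domain) != "postgresql":
                issues.append(f"{service}: {domain} is {states.get(domain)!r}")
    worker_config = worker_values.get("config", {})
    for flag in FLAGS:
        if worker_config.get(flag) is True:
            issues.append(f"background-worker: {flag} is still true")
    return issues


# Illustrative values: everything migrated, but one flag was left on.
migrated = {"conversation": "postgresql", "conversationCodes": "postgresql",
            "teamFeatures": "postgresql"}
galley_values = {"config": {"postgresMigration": migrated}}
worker_values = {"config": {"postgresMigration": migrated, "migrateTeamFeatures": True}}

print(post_migration_issues(galley_values, worker_values))
# ['background-worker: migrateTeamFeatures is still true']
```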

## Troubleshooting

### Migration does not start

Check the migration flag names carefully. For example, `migrateConversations` is correct, while `migrateConversation` is ignored.

### Pods fail to start with a storage-location parse error

This usually means a `postgresMigration` value was written as a boolean instead of a string. Use only:

- `cassandra`
- `migration-to-postgresql`
- `postgresql`
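
A quick pre-deploy check can catch non-string or misspelled values before the pods do. The sketch below assumes the `postgresMigration` block has already been loaded into a dictionary. The function name is hypothetical.

```python
from typing import Any, Dict

ALLOWED = {"cassandra", "migration-to-postgresql", "postgresql"}


def invalid_migration_values(post_migration: Dict[str, Any]) -> Dict[str, Any]:
    """Return the entries whose value is not one of the allowed strings."""
    return {
        domain: value
        for domain, value in post_migration.items()
        if not (isinstance(value, str) and value in ALLOWED)
    }


# Illustrative block with a boolean and a misspelled state.
block = {
    "conversation": True,
    "conversationCodes": "postgres",
    "teamFeatures": "cassandra",
}

print(invalid_migration_values(block))
# {'conversation': True, 'conversationCodes': 'postgres'}
```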

### PostgreSQL acquisition timeouts appear during migration

Increase `background-worker.config.postgresqlPool.size` and `background-worker.config.postgresqlPool.acquisitionTimeout`, then redeploy `background-worker`.

### No PostgreSQL pool metrics appear for `background-worker`

`background-worker` may not emit `wire_hasql_pool_*` metrics until it has attempted to use PostgreSQL. This is expected before the migration flag is enabled.