From b05751286f007628edc1c36693ceb73dd5149812 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:11:40 +0000 Subject: [PATCH 01/13] Initial plan From e79df89bf1d8f9936ffad1c91c47106f92360cb7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:15:32 +0000 Subject: [PATCH 02/13] docs: add litestream replication design Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 103 ++++++++++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 docs/litestream-replication-design.md diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md new file mode 100644 index 0000000..3bdfc7e --- /dev/null +++ b/docs/litestream-replication-design.md @@ -0,0 +1,103 @@ +# Design: Embed Litestream for SQLite replication + +## Background + +`sqlite-rest` currently opens a local SQLite database file and serves RESTful access to it. There is no built-in durability story beyond a single node. [Litestream](https://litestream.io/) provides streaming WAL replication and restore for SQLite. Litestream ships a Go library that can be embedded to continuously replicate a database file to durable object storage and restore it at startup. This document proposes how to integrate that library without changing the external REST API. + +## Goals + +- Offer optional replication for the served SQLite database using the Litestream Go library. +- Provide an opt-in configuration surface (CLI flags/env) to: + - Restore a database from a configured Litestream replica before the server starts handling traffic. + - Continuously replicate WAL/snapshots to one or more replicas (initially a single replica). +- Align lifecycle with the existing `serve` command: replication should start/stop with the process and respect graceful shutdown. 
+- Expose basic observability for replication health (log + Prometheus counters/gauges). + +## Non-goals + +- Implementing multi-writer/leader election; replication is single-writer with read-only restores. +- Changing the REST API surface or authentication model. +- Building a full Litestream CLI wrapper (only the embedded library flows we need). + +## Current state and constraints + +- The server opens the database via `openDB` using a DSN passed to `serve`. +- Metrics and pprof servers already share the process lifecycle and respect the same `done` channel. +- Docker image and CLI use a single database file on local disk; WAL mode is implicitly enabled by the SQLite driver. + +## Proposed approach + +### High-level flow + +1. **Configuration** (new `ReplicationOptions`): + - `--replication-enabled` (bool, default false). + - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing). + - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). + - `--replication-restore-from` (optional override to restore from a different replica URL). + - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). + +2. **Restore before serving**: + - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`. + - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries. + +3. **Start replication alongside the server**: + - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL. + - Start replication in a goroutine using the same `done` channel used by the HTTP/metrics/pprof servers for coordinated shutdown. 
+ - Ensure the replicator stops cleanly on context cancellation and flushes pending WAL frames. + +4. **Observability**: + - Log key lifecycle events (restore start/finish, replicate start/stop, errors). + - Add Prometheus metrics (e.g., `replication_last_snapshot_timestamp`, `replication_bytes_replicated_total`, `replication_errors_total`, `replication_lag_seconds`) populated via Litestream stats callbacks or polling the replicator state. + +5. **Failure handling**: + - If restore fails: abort startup with a clear error. + - If replication fails at runtime: surface errors via logs/metrics but keep the HTTP server running; rely on process restarts or admin action to recover. + +### API surface changes + +- Extend `ServerOptions` (or adjacent option struct) with `ReplicationOptions` and bind new CLI flags on `serve`. +- Keep defaults disabled to avoid changing existing deployments. +- No changes to request handlers or DB query path. + +### Configuration mapping + +- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. +- **File**: support `file://` URLs for local/dev validation. +- Future: allow multiple replicas by accepting repeated `--replication-replica-url` flags; initial scope is a single replica to minimize surface area. + +### Lifecycle integration sketch + +``` +restoreIfNeeded(ctx, dbPath, restoreURL, restoreOpts) +db := openDB(...) +replicator := newReplicator(dbPath, replicaURL, tuneOpts) +go replicator.Start(ctx) // ctx tied to serve command cancellation +go metricsServer.Start(ctx) +go pprofServer.Start(ctx) +server.Start(ctx.Done()) +``` + +### Testing strategy (future implementation) + +- Unit: flag parsing → `ReplicationOptions` defaults/validation. 
+- Integration (temporary files): start a litestream replicator pointing to a `file://` replica, perform writes via HTTP handlers, assert replica files advance (e.g., WAL or snapshot count). +- Restore path: seed replica, delete local DB, start server with `--replication-enabled --replication-restore-from `, assert DB is restored before serving. +- Metrics: expose fake replicator stats and assert Prometheus gauges/counters are set. + +## Migration & compatibility + +- Replication is opt-in; existing CLI invocations keep current behavior. +- Docker image remains the same; enabling replication requires supplying new flags/env and storage credentials. + +## Open questions + +- Should we expose multiple replicas at launch or keep single-replica until requested? +- How strict should startup be when replication is enabled but the remote is unreachable (fail fast vs. warn and continue)? +- What are the sensible defaults for snapshot/retention to balance durability and cost? + +## Implementation plan (for future PRs) + +1. Add `ReplicationOptions` with CLI/env bindings and validation. +2. Add restore step before `openDB` in `serve`. +3. Wire Litestream replicator lifecycle to the server context and add metrics/logging. +4. Add targeted tests and minimal docs/README snippet for enabling replication. 
From 913d09da8f017c92cfb50846190728654a95d7a8 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:16:30 +0000 Subject: [PATCH 03/13] docs: clarify replication startup default Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 3bdfc7e..2362cad 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -92,7 +92,7 @@ server.Start(ctx.Done()) ## Open questions - Should we expose multiple replicas at launch or keep single-replica until requested? -- How strict should startup be when replication is enabled but the remote is unreachable (fail fast vs. warn and continue)? +- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded-start` escape hatch if operators need to accept the data-loss risk. - What are the sensible defaults for snapshot/retention to balance durability and cost? 
## Implementation plan (for future PRs) From bbb612ee80960967601f340aeea4d3bd93efea6f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:17:04 +0000 Subject: [PATCH 04/13] docs: refine multi-replica config note Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 2362cad..ce125a8 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -63,7 +63,7 @@ - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. - **File**: support `file://` URLs for local/dev validation. -- Future: allow multiple replicas by accepting repeated `--replication-replica-url` flags; initial scope is a single replica to minimize surface area. +- Future: allow multiple replicas via a single comma-separated flag (e.g., `--replication-replica-urls`) or config file entry instead of repeated flags; initial scope is a single replica to minimize surface area. ### Lifecycle integration sketch From 322f752a0cc7f0a5d4d220473c7ac85815c4ecaa Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:17:41 +0000 Subject: [PATCH 05/13] docs: align replica flag naming Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index ce125a8..21758c1 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -31,7 +31,7 @@ 1. 
**Configuration** (new `ReplicationOptions`): - `--replication-enabled` (bool, default false). - - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing). +- `--replication-replica-urls` (comma-separated, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; initial implementation can accept a single entry). - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). - `--replication-restore-from` (optional override to restore from a different replica URL). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). From 46325d512e4a52395fcae12858b41a15acd99d31 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:18:16 +0000 Subject: [PATCH 06/13] docs: clarify replica flag scope and error handling Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 21758c1..f0d9a06 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -31,7 +31,7 @@ 1. **Configuration** (new `ReplicationOptions`): - `--replication-enabled` (bool, default false). -- `--replication-replica-urls` (comma-separated, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; initial implementation can accept a single entry). +- `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file). 
- `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). - `--replication-restore-from` (optional override to restore from a different replica URL). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). @@ -63,7 +63,7 @@ - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. - **File**: support `file://` URLs for local/dev validation. -- Future: allow multiple replicas via a single comma-separated flag (e.g., `--replication-replica-urls`) or config file entry instead of repeated flags; initial scope is a single replica to minimize surface area. +- Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area. ### Lifecycle integration sketch @@ -75,6 +75,8 @@ go replicator.Start(ctx) // ctx tied to serve command cancellation go metricsServer.Start(ctx) go pprofServer.Start(ctx) server.Start(ctx.Done()) +// Error handling: monitor replicator error channel/state changes; log and increment metrics, +// and optionally trigger process shutdown if replication is marked as required. 
``` ### Testing strategy (future implementation) From f16d06539963b2a4eaa25513f507618164492b03 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:18:58 +0000 Subject: [PATCH 07/13] docs: add go syntax hint to lifecycle sketch Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index f0d9a06..501ee7a 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -67,7 +67,7 @@ ### Lifecycle integration sketch -``` +```go restoreIfNeeded(ctx, dbPath, restoreURL, restoreOpts) db := openDB(...) replicator := newReplicator(dbPath, replicaURL, tuneOpts) From 8aaf3aec7269cf37ba71758d380493ffe4b41669 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:19:52 +0000 Subject: [PATCH 08/13] docs: expand restore config and error handling Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 501ee7a..b745a7b 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -31,9 +31,11 @@ 1. **Configuration** (new `ReplicationOptions`): - `--replication-enabled` (bool, default false). -- `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file). 
+ - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file). - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). - `--replication-restore-from` (optional override to restore from a different replica URL). + - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring). + - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; can be set to tolerate small staleness before triggering a restore). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). 2. **Restore before serving**: @@ -76,7 +78,8 @@ go metricsServer.Start(ctx) go pprofServer.Start(ctx) server.Start(ctx.Done()) // Error handling: monitor replicator error channel/state changes; log and increment metrics, -// and optionally trigger process shutdown if replication is marked as required. +// and optionally trigger process shutdown if replication is marked as required. On error channel +// receive, cancel the shared context to shut down servers when degraded starts are disallowed. 
``` ### Testing strategy (future implementation) From 229cb9e750acfd89f6959f82be7391411b6ebe14 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:20:41 +0000 Subject: [PATCH 09/13] docs: note s3 perms and rename degraded flag Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index b745a7b..82ac60c 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -63,7 +63,7 @@ ### Configuration mapping -- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. +- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. Document minimal IAM needs (typically `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` for the configured prefix) so operators can keep replication credentials least-privileged. - **File**: support `file://` URLs for local/dev validation. - Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area. @@ -97,7 +97,7 @@ server.Start(ctx.Done()) ## Open questions - Should we expose multiple replicas at launch or keep single-replica until requested? -- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded-start` escape hatch if operators need to accept the data-loss risk. 
+- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded` escape hatch if operators need to accept the data-loss risk. - What are the sensible defaults for snapshot/retention to balance durability and cost? ## Implementation plan (for future PRs) From 6afce55f6274a69d6e4babb7a1c75dea6b8b548e Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:21:16 +0000 Subject: [PATCH 10/13] docs: clarify restore lag and degraded handling Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 82ac60c..7ccffa0 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -35,7 +35,7 @@ - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). - `--replication-restore-from` (optional override to restore from a different replica URL). - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring). - - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; can be set to tolerate small staleness before triggering a restore). + - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). 2. 
**Restore before serving**: @@ -78,8 +78,9 @@ go metricsServer.Start(ctx) go pprofServer.Start(ctx) server.Start(ctx.Done()) // Error handling: monitor replicator error channel/state changes; log and increment metrics, -// and optionally trigger process shutdown if replication is marked as required. On error channel -// receive, cancel the shared context to shut down servers when degraded starts are disallowed. +// and optionally trigger process shutdown if replication is marked as required (i.e., when +// `--replication-allow-degraded` is false). On error channel receive, cancel the shared context +// to shut down servers when degraded starts are disallowed. ``` ### Testing strategy (future implementation) From f01868e027496b36af351636d02014995e86bd77 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:33:11 +0000 Subject: [PATCH 11/13] docs: shift to litestream config and clarify divergence Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index 7ccffa0..d1f3b0d 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -31,16 +31,18 @@ 1. **Configuration** (new `ReplicationOptions`): - `--replication-enabled` (bool, default false). - - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file). - - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream). - - `--replication-restore-from` (optional override to restore from a different replica URL). 
+ - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream). + - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config). - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring). - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore). - - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.). + - `--replication-allow-degraded` (bool, default false; when false, runtime replication errors or failed restores will stop the process). + - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.). + - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from` and `--replication-allow-degraded`) and leave all other Litestream knobs to the config file. 2. **Restore before serving**: - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`. - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries. + - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. 
An explicit `--replication-allow-degraded` plus a force-restore knob can opt into overwriting local state. 3. **Start replication alongside the server**: - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL. @@ -65,7 +67,7 @@ - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. Document minimal IAM needs (typically `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` for the configured prefix) so operators can keep replication credentials least-privileged. - **File**: support `file://` URLs for local/dev validation. -- Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area. +- Future: allow multiple replicas (multiple remote destinations for the same SQLite DB) by expanding the config surface (e.g., via Litestream config file); initial scope is a single replica to minimize surface area. ### Lifecycle integration sketch @@ -92,7 +94,7 @@ server.Start(ctx.Done()) ## Migration & compatibility -- Replication is opt-in; existing CLI invocations keep current behavior. +- Replication is opt-in and disabled by default; existing CLI invocations keep current behavior. Detailed replication tuning stays in the Litestream config file to minimize sqlite-rest surface changes. - Docker image remains the same; enabling replication requires supplying new flags/env and storage credentials. 
## Open questions From d41a7dd2514369327ef5d056f622d6bdf6a39a3f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:40:02 +0000 Subject: [PATCH 12/13] docs: simplify flags and fail-fast defaults Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 17 +++++------------ 1 file changed, 5 insertions(+), 12 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index d1f3b0d..f78698d 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -33,16 +33,13 @@ - `--replication-enabled` (bool, default false). - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream). - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config). - - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring). - - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore). - - `--replication-allow-degraded` (bool, default false; when false, runtime replication errors or failed restores will stop the process). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.). - - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from` and `--replication-allow-degraded`) and leave all other Litestream knobs to the config file. 
+ - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from`) and leave all other Litestream knobs to the config file. 2. **Restore before serving**: - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`. - - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries. - - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. An explicit `--replication-allow-degraded` plus a force-restore knob can opt into overwriting local state. + - Restore should be idempotent (skip when the local DB is already ahead) and rely on Litestream config knobs (snapshot interval/retention/restore lag) for tuning. + - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. 3. **Start replication alongside the server**: - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL. @@ -79,10 +76,8 @@ go replicator.Start(ctx) // ctx tied to serve command cancellation go metricsServer.Start(ctx) go pprofServer.Start(ctx) server.Start(ctx.Done()) -// Error handling: monitor replicator error channel/state changes; log and increment metrics, -// and optionally trigger process shutdown if replication is marked as required (i.e., when -// `--replication-allow-degraded` is false). 
On error channel receive, cancel the shared context -// to shut down servers when degraded starts are disallowed. +// Error handling: monitor replicator error channel/state changes; log and increment metrics. +// On error channel receive, cancel the shared context to shut down servers (fail-fast default). ``` ### Testing strategy (future implementation) @@ -99,8 +94,6 @@ server.Start(ctx.Done()) ## Open questions -- Should we expose multiple replicas at launch or keep single-replica until requested? -- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded` escape hatch if operators need to accept the data-loss risk. - What are the sensible defaults for snapshot/retention to balance durability and cost? ## Implementation plan (for future PRs) From 61a3a12eb2da4bedab3900769aef80fb9310b0c3 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 14 Jan 2026 07:59:06 +0000 Subject: [PATCH 13/13] docs: remove restore-from flag and tighten ux Co-authored-by: bcho <1975118+bcho@users.noreply.github.com> --- docs/litestream-replication-design.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md index f78698d..2793202 100644 --- a/docs/litestream-replication-design.md +++ b/docs/litestream-replication-design.md @@ -32,9 +32,8 @@ 1. **Configuration** (new `ReplicationOptions`): - `--replication-enabled` (bool, default false). - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream). 
- - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config). - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.). - - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from`) and leave all other Litestream knobs to the config file. + - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`) and leave all other Litestream knobs to the config file. 2. **Restore before serving**: - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`.
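
Reviewer note on the final flag surface: after this patch the design keeps only `--replication-enabled` and `--replication-config`, plus their env var mirrors. A minimal sketch of how `ReplicationOptions` might be bound and validated is shown below. It uses the standard library `flag` package purely for illustration (the flag library `sqlite-rest` actually uses is not shown in this series), and two behaviours are assumptions rather than part of the design: flags take precedence over env vars, and a config path is required whenever replication is enabled.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strconv"
)

// ReplicationOptions mirrors the two flags kept in the final design.
type ReplicationOptions struct {
	Enabled    bool
	ConfigPath string // path to the Litestream YAML config
}

// parseReplicationOptions reads env var mirrors first, then lets CLI flags
// override them (assumed precedence). getenv is injected so tests can fake
// the environment.
func parseReplicationOptions(args []string, getenv func(string) string) (ReplicationOptions, error) {
	var opts ReplicationOptions

	if v := getenv("SQLITEREST_REPLICATION_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return opts, fmt.Errorf("SQLITEREST_REPLICATION_ENABLED: %w", err)
		}
		opts.Enabled = enabled
	}
	opts.ConfigPath = getenv("SQLITEREST_REPLICATION_CONFIG")

	fs := flag.NewFlagSet("serve", flag.ContinueOnError)
	fs.BoolVar(&opts.Enabled, "replication-enabled", opts.Enabled,
		"enable Litestream replication (default false)")
	fs.StringVar(&opts.ConfigPath, "replication-config", opts.ConfigPath,
		"path to Litestream YAML config")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}

	// Assumed validation rule: enabling replication without a config is an error.
	if opts.Enabled && opts.ConfigPath == "" {
		return opts, errors.New("--replication-config is required when --replication-enabled is set")
	}
	return opts, nil
}

func main() {
	opts, err := parseReplicationOptions(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid replication options:", err)
		os.Exit(2)
	}
	fmt.Printf("replication enabled=%t config=%q\n", opts.Enabled, opts.ConfigPath)
}
```

With no flags and no env vars set, the options stay disabled, which preserves the opt-in behaviour described under "Migration & compatibility".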