From b05751286f007628edc1c36693ceb73dd5149812 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:11:40 +0000
Subject: [PATCH 01/13] Initial plan


From e79df89bf1d8f9936ffad1c91c47106f92360cb7 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:15:32 +0000
Subject: [PATCH 02/13] docs: add litestream replication design

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 103 ++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 docs/litestream-replication-design.md

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
new file mode 100644
index 0000000..3bdfc7e
--- /dev/null
+++ b/docs/litestream-replication-design.md
@@ -0,0 +1,103 @@
+# Design: Embed Litestream for SQLite replication
+
+## Background
+
+`sqlite-rest` currently opens a local SQLite database file and serves RESTful access to it. There is no built-in durability story beyond a single node. [Litestream](https://litestream.io/) provides streaming WAL replication and restore for SQLite. Litestream ships a Go library that can be embedded to continuously replicate a database file to durable object storage and restore it at startup. This document proposes how to integrate that library without changing the external REST API.
+
+## Goals
+
+- Offer optional replication for the served SQLite database using the Litestream Go library.
+- Provide an opt-in configuration surface (CLI flags/env) to:
+  - Restore a database from a configured Litestream replica before the server starts handling traffic.
+  - Continuously replicate WAL/snapshots to one or more replicas (initially a single replica).
+- Align lifecycle with the existing `serve` command: replication should start/stop with the process and respect graceful shutdown.
+- Expose basic observability for replication health (log + Prometheus counters/gauges).
+
+## Non-goals
+
+- Implementing multi-writer/leader election; replication is single-writer with read-only restores.
+- Changing the REST API surface or authentication model.
+- Building a full Litestream CLI wrapper; we embed only the library flows we need.
+
+## Current state and constraints
+
+- The server opens the database via `openDB` using a DSN passed to `serve`.
+- Metrics and pprof servers already share the process lifecycle and respect the same `done` channel.
+- Docker image and CLI use a single database file on local disk; WAL mode is implicitly enabled by the SQLite driver.
+
+## Proposed approach
+
+### High-level flow
+
+1. **Configuration** (new `ReplicationOptions`):
+   - `--replication-enabled` (bool, default false).
+   - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing).
+   - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
+   - `--replication-restore-from` (optional override to restore from a different replica URL).
+   - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).
+
+2. **Restore before serving**:
+   - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`.
+   - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries.
+
+3. **Start replication alongside the server**:
+   - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL.
+   - Start replication in a goroutine using the same `done` channel used by the HTTP/metrics/pprof servers for coordinated shutdown.
+   - Ensure the replicator stops cleanly on context cancellation and flushes pending WAL frames.
+
+4. **Observability**:
+   - Log key lifecycle events (restore start/finish, replicate start/stop, errors).
+   - Add Prometheus metrics (e.g., `replication_last_snapshot_timestamp`, `replication_bytes_replicated_total`, `replication_errors_total`, `replication_lag_seconds`) populated via Litestream stats callbacks or polling the replicator state.
+
+5. **Failure handling**:
+   - If restore fails: abort startup with a clear error.
+   - If replication fails at runtime: surface errors via logs/metrics but keep the HTTP server running; rely on process restarts or admin action to recover.
+
+### API surface changes
+
+- Extend `ServerOptions` (or adjacent option struct) with `ReplicationOptions` and bind new CLI flags on `serve`.
+- Keep defaults disabled to avoid changing existing deployments.
+- No changes to request handlers or DB query path.
+
+### Configuration mapping
+
+- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO.
+- **File**: support `file://` URLs for local/dev validation.
+- Future: allow multiple replicas by accepting repeated `--replication-replica-url` flags; initial scope is a single replica to minimize surface area.
+
+### Lifecycle integration sketch
+
+```
+restoreIfNeeded(ctx, dbPath, restoreURL, restoreOpts)
+db := openDB(...)
+replicator := newReplicator(dbPath, replicaURL, tuneOpts)
+go replicator.Start(ctx) // ctx tied to serve command cancellation
+go metricsServer.Start(ctx)
+go pprofServer.Start(ctx)
+server.Start(ctx.Done())
+```
+
+### Testing strategy (future implementation)
+
+- Unit: flag parsing → `ReplicationOptions` defaults/validation.
+- Integration (temporary files): start a Litestream replicator pointing to a `file://` replica, perform writes via HTTP handlers, assert replica files advance (e.g., WAL or snapshot count).
+- Restore path: seed replica, delete local DB, start server with `--replication-enabled --replication-restore-from <replica>`, assert DB is restored before serving.
+- Metrics: expose fake replicator stats and assert Prometheus gauges/counters are set.
+
+## Migration & compatibility
+
+- Replication is opt-in; existing CLI invocations keep current behavior.
+- Docker image remains the same; enabling replication requires supplying new flags/env and storage credentials.
+
+## Open questions
+
+- Should we expose multiple replicas at launch or keep single-replica until requested?
+- How strict should startup be when replication is enabled but the remote is unreachable (fail fast vs. warn and continue)?
+- What are the sensible defaults for snapshot/retention to balance durability and cost?
+
+## Implementation plan (for future PRs)
+
+1. Add `ReplicationOptions` with CLI/env bindings and validation.
+2. Add restore step before `openDB` in `serve`.
+3. Wire Litestream replicator lifecycle to the server context and add metrics/logging.
+4. Add targeted tests and minimal docs/README snippet for enabling replication.

From 913d09da8f017c92cfb50846190728654a95d7a8 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:16:30 +0000
Subject: [PATCH 03/13] docs: clarify replication startup default

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 3bdfc7e..2362cad 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -92,7 +92,7 @@ server.Start(ctx.Done())
 ## Open questions
 
 - Should we expose multiple replicas at launch or keep single-replica until requested?
-- How strict should startup be when replication is enabled but the remote is unreachable (fail fast vs. warn and continue)?
+- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded-start` escape hatch if operators need to accept the data-loss risk.
 - What are the sensible defaults for snapshot/retention to balance durability and cost?
 
 ## Implementation plan (for future PRs)

From bbb612ee80960967601f340aeea4d3bd93efea6f Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:17:04 +0000
Subject: [PATCH 04/13] docs: refine multi-replica config note

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 2362cad..ce125a8 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -63,7 +63,7 @@
 
 - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO.
 - **File**: support `file://` URLs for local/dev validation.
-- Future: allow multiple replicas by accepting repeated `--replication-replica-url` flags; initial scope is a single replica to minimize surface area.
+- Future: allow multiple replicas via a single comma-separated flag (e.g., `--replication-replica-urls`) or config file entry instead of repeated flags; initial scope is a single replica to minimize surface area.
 
 ### Lifecycle integration sketch
 

From 322f752a0cc7f0a5d4d220473c7ac85815c4ecaa Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:17:41 +0000
Subject: [PATCH 05/13] docs: align replica flag naming

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index ce125a8..21758c1 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -31,7 +31,7 @@
 
 1. **Configuration** (new `ReplicationOptions`):
    - `--replication-enabled` (bool, default false).
-   - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing).
+- `--replication-replica-urls` (comma-separated, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; initial implementation can accept a single entry).
    - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
    - `--replication-restore-from` (optional override to restore from a different replica URL).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).

From 46325d512e4a52395fcae12858b41a15acd99d31 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:18:16 +0000
Subject: [PATCH 06/13] docs: clarify replica flag scope and error handling

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 21758c1..f0d9a06 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -31,7 +31,7 @@
 
 1. **Configuration** (new `ReplicationOptions`):
    - `--replication-enabled` (bool, default false).
-- `--replication-replica-urls` (comma-separated, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; initial implementation can accept a single entry).
+- `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file).
    - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
    - `--replication-restore-from` (optional override to restore from a different replica URL).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).
@@ -63,7 +63,7 @@
 
 - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO.
 - **File**: support `file://` URLs for local/dev validation.
-- Future: allow multiple replicas via a single comma-separated flag (e.g., `--replication-replica-urls`) or config file entry instead of repeated flags; initial scope is a single replica to minimize surface area.
+- Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area.
 
 ### Lifecycle integration sketch
 
@@ -75,6 +75,8 @@ go replicator.Start(ctx) // ctx tied to serve command cancellation
 go metricsServer.Start(ctx)
 go pprofServer.Start(ctx)
 server.Start(ctx.Done())
+// Error handling: monitor replicator error channel/state changes; log and increment metrics,
+// and optionally trigger process shutdown if replication is marked as required.
 ```
 
 ### Testing strategy (future implementation)

From f16d06539963b2a4eaa25513f507618164492b03 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:18:58 +0000
Subject: [PATCH 07/13] docs: add go syntax hint to lifecycle sketch

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index f0d9a06..501ee7a 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -67,7 +67,7 @@
 
 ### Lifecycle integration sketch
 
-```
+```go
 restoreIfNeeded(ctx, dbPath, restoreURL, restoreOpts)
 db := openDB(...)
 replicator := newReplicator(dbPath, replicaURL, tuneOpts)

From 8aaf3aec7269cf37ba71758d380493ffe4b41669 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:19:52 +0000
Subject: [PATCH 08/13] docs: expand restore config and error handling

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 501ee7a..b745a7b 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -31,9 +31,11 @@
 
 1. **Configuration** (new `ReplicationOptions`):
    - `--replication-enabled` (bool, default false).
-- `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file).
+   - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file).
    - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
    - `--replication-restore-from` (optional override to restore from a different replica URL).
+   - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring).
+   - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; can be set to tolerate small staleness before triggering a restore).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).
 
 2. **Restore before serving**:
@@ -76,7 +78,8 @@ go metricsServer.Start(ctx)
 go pprofServer.Start(ctx)
 server.Start(ctx.Done())
 // Error handling: monitor replicator error channel/state changes; log and increment metrics,
-// and optionally trigger process shutdown if replication is marked as required.
+// and optionally trigger process shutdown if replication is marked as required. On error channel
+// receive, cancel the shared context to shut down servers when degraded starts are disallowed.
 ```
 
 ### Testing strategy (future implementation)

From 229cb9e750acfd89f6959f82be7391411b6ebe14 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:20:41 +0000
Subject: [PATCH 09/13] docs: note s3 perms and rename degraded flag

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index b745a7b..82ac60c 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -63,7 +63,7 @@
 
 ### Configuration mapping
 
-- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO.
+- **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. Document minimal IAM needs (typically `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` for the configured prefix) so operators can keep replication credentials least-privileged.
 - **File**: support `file://` URLs for local/dev validation.
 - Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area.
 
@@ -97,7 +97,7 @@ server.Start(ctx.Done())
 ## Open questions
 
 - Should we expose multiple replicas at launch or keep single-replica until requested?
-- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded-start` escape hatch if operators need to accept the data-loss risk.
+- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded` escape hatch if operators need to accept the data-loss risk.
 - What are the sensible defaults for snapshot/retention to balance durability and cost?
 
 ## Implementation plan (for future PRs)

From 6afce55f6274a69d6e4babb7a1c75dea6b8b548e Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:21:16 +0000
Subject: [PATCH 10/13] docs: clarify restore lag and degraded handling

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 82ac60c..7ccffa0 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -35,7 +35,7 @@
    - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
    - `--replication-restore-from` (optional override to restore from a different replica URL).
    - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring).
-   - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; can be set to tolerate small staleness before triggering a restore).
+   - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).
 
 2. **Restore before serving**:
@@ -78,8 +78,9 @@ go metricsServer.Start(ctx)
 go pprofServer.Start(ctx)
 server.Start(ctx.Done())
 // Error handling: monitor replicator error channel/state changes; log and increment metrics,
-// and optionally trigger process shutdown if replication is marked as required. On error channel
-// receive, cancel the shared context to shut down servers when degraded starts are disallowed.
+// and optionally trigger process shutdown if replication is marked as required (i.e., when
+// `--replication-allow-degraded` is false). On error channel receive, cancel the shared context
+// to shut down servers when degraded starts are disallowed.
 ```
 
 ### Testing strategy (future implementation)

From f01868e027496b36af351636d02014995e86bd77 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:33:11 +0000
Subject: [PATCH 11/13] docs: shift to litestream config and clarify divergence

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index 7ccffa0..d1f3b0d 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -31,16 +31,18 @@
 
 1. **Configuration** (new `ReplicationOptions`):
    - `--replication-enabled` (bool, default false).
-   - `--replication-replica-url` (string, required when enabled; supports Litestream URLs like `s3://bucket/path` or `file:///...` for local testing; multi-replica support would likely rename this to `--replication-replica-urls` or move to a config file).
-   - `--replication-snapshot-interval` / `--replication-retention` (optional tuning, passed through to Litestream).
-   - `--replication-restore-from` (optional override to restore from a different replica URL).
+   - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream).
+   - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config).
    - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring).
    - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore).
-   - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, etc.).
+   - `--replication-allow-degraded` (bool, default false; when false, runtime replication errors or failed restores will stop the process).
+   - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.).
+   - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from` and `--replication-allow-degraded`) and leave all other Litestream knobs to the config file.
 
 2. **Restore before serving**:
    - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`.
    - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries.
+   - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. An explicit `--replication-allow-degraded` plus a force-restore knob can opt into overwriting local state.
 
 3. **Start replication alongside the server**:
    - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL.
@@ -65,7 +67,7 @@
 
 - **S3**: use Litestream’s S3 replica driver; accept AWS creds via standard env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) and allow custom endpoint for MinIO. Document minimal IAM needs (typically `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, and `s3:DeleteObject` for the configured prefix) so operators can keep replication credentials least-privileged.
 - **File**: support `file://` URLs for local/dev validation.
-- Future: allow multiple replicas by expanding the flag surface (e.g., adding `--replication-replica-urls` or a config file section); initial scope is a single replica to minimize surface area.
+- Future: allow multiple replicas (multiple remote destinations for the same SQLite DB) by expanding the config surface (e.g., via Litestream config file); initial scope is a single replica to minimize surface area.
 
 ### Lifecycle integration sketch
 
@@ -92,7 +94,7 @@ server.Start(ctx.Done())
 
 ## Migration & compatibility
 
-- Replication is opt-in; existing CLI invocations keep current behavior.
+- Replication is opt-in and disabled by default; existing CLI invocations keep current behavior. Detailed replication tuning stays in the Litestream config file to minimize sqlite-rest surface changes.
 - Docker image remains the same; enabling replication requires supplying new flags/env and storage credentials.
 
 ## Open questions

From d41a7dd2514369327ef5d056f622d6bdf6a39a3f Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:40:02 +0000
Subject: [PATCH 12/13] docs: simplify flags and fail-fast defaults

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index d1f3b0d..f78698d 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -33,16 +33,13 @@
    - `--replication-enabled` (bool, default false).
    - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream).
    - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config).
-   - `--replication-restore-interval` (duration, default `0` meaning latest; limits how far back to search for a snapshot when restoring).
-   - `--replication-restore-lag` (duration, default `0` meaning no lag allowed; used during startup restore decisions to tolerate a small amount of staleness between the local DB and the replica before forcing a restore).
-   - `--replication-allow-degraded` (bool, default false; when false, runtime replication errors or failed restores will stop the process).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.).
-   - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from` and `--replication-allow-degraded`) and leave all other Litestream knobs to the config file.
+   - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from`) and leave all other Litestream knobs to the config file.
 
 2. **Restore before serving**:
    - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`.
-   - Restore should be idempotent (skip when the local DB is already ahead) and respect a configurable `--replication-restore-interval` / `--replication-restore-lag` window to avoid long restores on healthy primaries.
-   - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data. An explicit `--replication-allow-degraded` plus a force-restore knob can opt into overwriting local state.
+   - Restore should be idempotent (skip when the local DB is already ahead) and rely on Litestream config knobs (snapshot interval/retention/restore lag) for tuning.
+   - Divergence handling: if the local WAL lineage differs from the remote replica (e.g., split-brain), default to fail-fast and require operator action (e.g., force-restore from the chosen replica or re-seed) to avoid serving inconsistent data.
 
 3. **Start replication alongside the server**:
    - After opening the DB (once restore is done), create a Litestream replicator instance bound to the same database path and replica URL.
@@ -79,10 +76,8 @@ go replicator.Start(ctx) // ctx tied to serve command cancellation
 go metricsServer.Start(ctx)
 go pprofServer.Start(ctx)
 server.Start(ctx.Done())
-// Error handling: monitor replicator error channel/state changes; log and increment metrics,
-// and optionally trigger process shutdown if replication is marked as required (i.e., when
-// `--replication-allow-degraded` is false). On error channel receive, cancel the shared context
-// to shut down servers when degraded starts are disallowed.
+// Error handling: monitor replicator error channel/state changes; log and increment metrics.
+// When the error channel delivers an error, cancel the shared context to shut down the servers (fail-fast default).
 ```
 
 ### Testing strategy (future implementation)
@@ -99,8 +94,6 @@ server.Start(ctx.Done())
 
 ## Open questions
 
-- Should we expose multiple replicas at launch or keep single-replica until requested?
-- How strict should startup be when replication is enabled but the remote is unreachable? **Recommendation:** fail fast by default to avoid running without configured durability, with an explicit `--replication-allow-degraded` escape hatch if operators need to accept the data-loss risk.
 - What are the sensible defaults for snapshot/retention to balance durability and cost?
 
 ## Implementation plan (for future PRs)

From 61a3a12eb2da4bedab3900769aef80fb9310b0c3 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Wed, 14 Jan 2026 07:59:06 +0000
Subject: [PATCH 13/13] docs: remove restore-from flag and tighten ux

Co-authored-by: bcho <1975118+bcho@users.noreply.github.com>
---
 docs/litestream-replication-design.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
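
Note on the two-flag surface this patch settles on: `ReplicationOptions` and its env mirrors might reduce to something like the sketch below. The struct fields, `Validate`, and `FromEnv` are hypothetical. Treating `--replication-config` as required whenever replication is enabled is an assumption carried over from the earlier "required when enabled" wording for the replica URL.

```go
package main

import (
	"errors"
	"strconv"
)

// ReplicationOptions holds the two settings kept after this patch.
type ReplicationOptions struct {
	Enabled    bool   // --replication-enabled
	ConfigPath string // --replication-config (path to the Litestream YAML config)
}

// Validate rejects an enabled configuration that has no config path (assumed rule).
func (o ReplicationOptions) Validate() error {
	if o.Enabled && o.ConfigPath == "" {
		return errors.New("--replication-config is required when --replication-enabled is set")
	}
	return nil
}

// FromEnv builds options from the env var mirrors. lookup has the same shape
// as os.LookupEnv, which keeps the function easy to test.
func FromEnv(lookup func(string) (string, bool)) (ReplicationOptions, error) {
	var o ReplicationOptions
	if v, ok := lookup("SQLITEREST_REPLICATION_ENABLED"); ok {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return o, err
		}
		o.Enabled = enabled
	}
	if v, ok := lookup("SQLITEREST_REPLICATION_CONFIG"); ok {
		o.ConfigPath = v
	}
	return o, o.Validate()
}
```

Calling `FromEnv(os.LookupEnv)` would read the real environment; how flags and env vars take precedence over each other is not decided in the design and is left out here.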

diff --git a/docs/litestream-replication-design.md b/docs/litestream-replication-design.md
index f78698d..2793202 100644
--- a/docs/litestream-replication-design.md
+++ b/docs/litestream-replication-design.md
@@ -32,9 +32,8 @@
 1. **Configuration** (new `ReplicationOptions`):
    - `--replication-enabled` (bool, default false).
    - `--replication-config` (string, path to Litestream YAML config; preferred path to keep sqlite-rest changes minimal and delegate detailed tuning like snapshot/retention/replicas to Litestream).
-   - `--replication-restore-from` (optional override to restore from a different replica URL; if omitted, use the primary replica from the Litestream config).
    - Env var mirrors for container use (e.g., `SQLITEREST_REPLICATION_ENABLED`, `SQLITEREST_REPLICATION_CONFIG`, etc.).
-   - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`, optional `--replication-restore-from`) and leave all other Litestream knobs to the config file.
+   - Recommended CLI UX: keep flags minimal (`--replication-enabled`, `--replication-config`) and leave all other Litestream knobs to the config file.
 
 2. **Restore before serving**:
    - If enabled, run a Litestream restore for the configured database path **before** opening the DB handle used by `sqlite-rest`.