This ROADMAP tracks progress through verifiable Gates and sub-task checklists, not date commitments. The project identity is an MIT-licensed PostgreSQL Kubernetes Operator. We target production-grade operational quality without forking, embedding, or wrapping external operator runtimes.
| Marker | Meaning |
|---|---|
| `[x]` | Code and tests exist; e2e or unit tests guard regressions. |
| `[~]` | Partial — e.g. CRD field only, helper not wired in, or e2e missing. |
| `[ ]` | Not started (design or PoC only). |
The Verify row on each sub-task quotes the verification command or e2e file.
- External design is fair game — public operator design documents and distributed-SQL papers inform our internal design, only as references.
- External systems must not ship inside this product — external sharding extensions, third-party operator CRDs, external HA agents, and third-party distributed-SQL backends are excluded from the runtime artifact.
- Implement as a new service — the operator manager, instance manager, sharding metadata, router, and backup orchestration are written in this repository under permissive-license-compatible dependencies.
- Production-grade quality bar — the target level for HA / backup / restore / upgrade / observability / security UX. Not a claim of using any specific external product.
| Item | State | Evidence |
|---|---|---|
| Project / chart name | `postgres-operator` | GitHub repo, Helm chart, and GitOps path are aligned |
| License | MIT | LICENSE, ADR-0003 |
| Latest published release | 0.3.0-alpha.16 (Helm) / 0.3.0-alpha.17 (live deploy) | Helm index keiailab.github.io/postgres-operator tops out at 0.3.0-alpha.16; 0.4.0-beta.1 is the source chart appVersion, not published to GHCR/Helm/OLM as of 2026-06-04. "Level 4 Deep Insights" is a code capability, not a release claim. |
| OLM bundle | `bundle/manifests/` aligned with 8 CRDs + alm-examples + CSV descriptions | `operator-sdk bundle validate --select-optional suite=operatorframework` is clean (T26) |
| Declarative DB surface | Pooler / PostgresDatabase / PostgresUser / ScheduledBackup / ImageCatalog / ClusterImageCatalog / externalClusters / replica cluster | T22 / T24 / T25 cycles completed; live kind smoke automation (T27) in progress |
| Local 4-layer gate | L1 lefthook pre-commit + L2 pre-push + L3 make validate/audit + L4 PR evidence | ADR-0009 / RFC-0002; version-drift assertion and bundle validate are automated (T26) |
| Production deployment | operator-only (Flux 0.3.0-alpha.17) | 0 live PostgresCluster instances as of 2026-06-04 — the operator manager runs but manages no workload. Day-0 single-shard Ready (PostgreSQL 18.3) was demonstrated on a throwaway namespace, not in production. |
| GHCR runtime image | Publicly pullable | `ghcr.io/keiailab/pg:18` restarts with no pull secret |
| HA replicas | Partial (Replicas field only) | `api/v1alpha1/postgrescluster_types.go` |
| Backup / restore | Partially implemented | BackupJob phase transitions + ScheduledBackup CRD/controller + RestorePIT call path + pgBackRest command-runner plugin + K8s sidecar exec path. Actual restore drill is still pending. |
| 1.0.0 GA | Not yet | HA / backup / chaos / soak still required |
Goal: a user can deploy the operator + a single-shard Postgres cluster via GitOps.
- CRD `PostgresCluster` definition — `api/v1alpha1/postgrescluster_types.go` (RFC-0001 v2 schema).
- CRD `BackupJob` definition (Phase 1 spec) — `api/v1alpha1/backupjob_types.go`.
- `PostgresClusterReconciler` builds desired state (ConfigMap / headless Service / StatefulSet) — `internal/controller/postgrescluster_controller.go`.
- Status phase transitions (Provisioning → Ready) — `internal/controller/status.go`, `aggregate_status.go`.
- Pod readiness tracking — reconciler endpoint watch.
- ArgoCD `Synced/Healthy` — verified on production (`platform-data-postgres-operator`).
- GHCR public pull — `ghcr.io/keiailab/pg:18` restarts with no pull secret.
- Day-0 e2e — `test/e2e/e2e_test.go`, `postgrescluster_e2e_test.go`.
- Verify: ArgoCD `Synced/Healthy` + Pod `1/1` Running + `psql -c 'select version()'`.
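
The Provisioning → Ready transition in the checklist above can be pictured as a pure function of Pod readiness. The following is a minimal sketch under stated assumptions: `Phase`, `AggregatePhase`, and its parameters are hypothetical names for illustration, not the contents of `internal/controller/status.go`.

```go
package main

import "fmt"

// Phase is a hypothetical status phase type used only in this sketch.
type Phase string

const (
	PhaseProvisioning Phase = "Provisioning"
	PhaseReady        Phase = "Ready"
)

// AggregatePhase is a hypothetical pure function: the cluster is Ready only
// when at least one Pod is desired and every desired Pod reports ready.
func AggregatePhase(desiredPods, readyPods int) Phase {
	if desiredPods > 0 && readyPods >= desiredPods {
		return PhaseReady
	}
	return PhaseProvisioning
}

func main() {
	fmt.Println(AggregatePhase(1, 0)) // Pod not yet ready
	fmt.Println(AggregatePhase(1, 1)) // Pod 1/1 ready
}
```

Keeping the decision free of Kubernetes client calls means it can be unit-tested without envtest.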
Goal: usable as a single-PostgreSQL production database, with HA.
- `Replicas` field (0–15 async replicas) — `postgrescluster_types.go`.
- STS scale mapping — reconciler.
- Primary-delete e2e baseline — `test/e2e/failover_e2e_test.go`.
- Automatic PDB creation — `internal/controller/pdb.go`.
- PVC fencing (split-brain fail-fast) — `internal/controller/failover/pvc_fence_runbook.go` (`DecidePVCFence` pure decision function + 4 reasons: MultiAttach / SplitBrain / StaleLease / PromotionRace) + `docs/runbooks/pvc-fence.md` (158 lines, 8 sections; SOP for automatic apply, release, and post-incident analysis). 5 sub-tests PASS (`TestPVCFenceRunbook`, D.1.1, 2026-05-19).
- Automatic failover logic — new directory `internal/controller/failover/`.
  - Primary failure detection — `internal/controller/failover/detection.go` (`DetectPrimaryFailure` + `SelectPromotionCandidate`, pure functions, 4 `FailureReason` enums, 9 unit tests, PR #38).
  - Standby promotion (`pg_ctl promote` or logical-replication promotion) — `internal/controller/failover/promotion.go` (`BuildPromotionPlan` + `Promoter` interface + `PromoteFromDecision` helper, 4-step plan: RemoveStandbySignal / PgCtlPromote / WaitNotInRecovery / UpdateInstanceRole; 6 unit tests; PR #39). `internal/controller/failover_promoter.go` implements the replica-Pod `postgres`-container exec and the promoted `instance-status` annotation patch.
  - Post-Ready primary-failure status surface — `status.phase=Degraded` + `FailoverReady=False` + promotion-candidate message.
  - Replica rejoin (`pg_basebackup` or `pg_rewind`) — first-boot `pg_basebackup` + existing-PGDATA old-primary marker generalization + current-primary endpoint main env + `pg_rewind` command-runner + HBA normal-connection auth + fresh `pg_basebackup` fallback all done. Live A.1 basebackup drill PASS (T31, 2026-05-17, commits 09abbb5/dca3fa0): `quickstart-shard-0-1` standby PVC delete + in-pod PGDATA wipe + Pod kill → the reconciler init container runs a fresh `pg_basebackup` → `pg_stat_replication{application_name=quickstart-shard-0-1, state=streaming, sync_state=async, lag=0}` recovers. The evidence also covers the workaround path for STS PVC retention `Retain`. The A.2 `pg_rewind` live drill is a separate task (regression in the live trigger of the `SMOKE_FAILOVER` operator-driven promotion; delegated to the `docs/g1-ha-election-fact-fix` scope). Post-failover auto-rejoin gap (#205, fix PR #206, 2026-06-04): the drill above covers only initial seeding. A case where the standby could not resume streaming after a primary restart was found live, so operator auto re-seed was implemented. The live stuck state is non-deterministic (in the 2026-06-04 failover the standby recovered on its own).
  - Synchronous replication — `spec.postgresql.synchronous.{method,number,dataDurability}` + CEL `number<=shards.replicas` + `ANY/FIRST N (...)` rendering + `required`/`preferred` quorum policy + standby `application_name` wiring + ConfigMap-hash rolling reconcile all done. Live B.1~B.3 RPO=0 drill PASS (T31, 2026-05-17, commit dca3fa0): `synchronous_standby_names='ANY 1 ("quickstart-shard-0-1","quickstart-shard-0-0")'` applied → `sync/quorum replica count=1` → after a 1000-row commit, `commit_lsn=0/3DA43A0 / flush_lsn=0/3DA43A0` (`pg_wal_lsn_diff=0`) → RPO=0 demonstrated directly. Drill function: `hack/smoke.sh::drill_sync` (`SMOKE_SYNC=1`). The B.4 sync-standby kill scenario is opt-in (`SMOKE_SYNC_KILL=1`).
  - [~] HA election distributed lock (K8s Lease) — `internal/controller/failover/lease.go` (`FailoverLeaseName` + `LeaseConfig` + `NewLease`/`Run`/`IsLeader`, thin adapter over `internal/instance/election.Real` per §2 Simplicity; 2 unit tests with fake clientset verify single-leader + handoff). `test/e2e/ha_lease_election_test.go` newly written (D.2.2): scale the operator manager to 2 replicas → verify the Lease holderIdentity is a single Pod → kill the leader → handoff (within LeaseDuration 15s) → verify that failover-lease and manager-lease are separate. `//go:build e2e` build PASS. The live multi-replica drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
- Backup / restore controller implementation — `internal/controller/backupjob_controller.go` reconcile switch + Phase transitions + ScheduledBackup cron + restore PIT call path + both `executionMode=job` and sidecar + 3 plugins registered (pgBackRest + WAL-G + Barman). All 6 child sub-tasks are [x] (Phase transitions / ScheduledBackup / RestorePIT / executionMode=job / Plugin invocation / Sidecar mode). 8 + 5 unit tests in place (D.4.1 parent closed, 2026-05-19).
  - `BackupJob.Phase` transitions (Pending → Running → Succeeded/Failed) — `internal/controller/backupjob_controller.go` reconcile switch + 8 unit tests.
  - `ScheduledBackup` CRD / controller — 6-field cron schedule → atomic `BackupJob` creation; `suspend` / `immediate` / `ownerReference` / `concurrency` guards; 5 unit tests.
  - `BackupJob.spec.type=restore` → `BackupPlugin.RestorePIT(targetTime)` call path + required `targetTime` validation.
  - `BackupJob.spec.executionMode=job` → owned `batch/v1.Job` create + observe; `jobTemplate` standard env injection.
  - Plugin invocation — pgBackRest + WAL-G (`internal/plugin/backup/walg/`) + Barman (`internal/plugin/backup/barman/`): three BackupPlugin implementations complete. Both new plugins satisfy the BackupPlugin + BackupCommandPlugin interfaces, with a pluggable Runner, Validate (WAL-G: `WALG_*` prefix required; Barman: server identifier), BackupCommand/RestoreCommand, and a ParseBackupResult regex. WAL-G: 12 sub-tests PASS; Barman: 13 sub-tests PASS (D.3.1, 2026-05-19).
  - Sidecar mode branch — pgBackRest argv delivered via K8s `pods/exec` to the ready primary Pod's `postgres` container.
- [~] PITR restore — `BackupRestoreSpec.TargetTime`-driven pgBackRest `restore --type=time --target=...` call path + sidecar exec path both present. `test/e2e/pitr_restore_e2e_test.go` newly written (D.3.2): full backup → insert marker 'before' and record the timestamp → insert 'after' → restore with type=time and targetTime → 'before' present + 'after' absent + pg_stat_database checksum_failures=0. `//go:build e2e` build PASS. The live kind drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
- Upgrade rollback runbook — `docs/runbooks/upgrade.md`, 206 lines (11 sections: 4-category matrix + 9-item pre-upgrade checklist + ImageCatalog procedure + 3 upgrade procedures (patch/minor major/major) + operator binary upgrade + 3 rollback branches + post-upgrade verification SOP + e2e + references). D.2.3 verify (≥150) PASS, 2026-05-19.
- RTO / RPO measurement + recording — `docs/runbooks/ha.md` (SLO RTO≤60s + RPO=0 + verify steps) (PR #54).
- Verify: after primary delete, a replica is promoted within N seconds + `pg_is_in_recovery()=false` + 0 data loss; after a fresh-cluster restore, data checksums match.
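
To make the `ANY/FIRST N (...)` rendering from the synchronous-replication item concrete, here is a minimal sketch. It assumes a hypothetical helper named `RenderSynchronousStandbyNames`; the operator's real function name, signature, and validation are not shown in this roadmap. The bound check mirrors the CEL rule `number<=shards.replicas` only loosely, using the length of the standby list.

```go
package main

import (
	"fmt"
	"strings"
)

// RenderSynchronousStandbyNames is a hypothetical helper that builds the value
// of PostgreSQL's synchronous_standby_names setting.
// method is "ANY" or "FIRST" (case-insensitive); number is how many standbys
// must confirm a commit; standbys are the application_name values.
func RenderSynchronousStandbyNames(method string, number int, standbys []string) (string, error) {
	m := strings.ToUpper(method)
	if m != "ANY" && m != "FIRST" {
		return "", fmt.Errorf("unsupported method %q", method)
	}
	if number < 1 || number > len(standbys) {
		return "", fmt.Errorf("number %d must be between 1 and %d", number, len(standbys))
	}
	quoted := make([]string, len(standbys))
	for i, s := range standbys {
		// Double-quote each name; embedded quotes are doubled.
		quoted[i] = `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
	}
	return fmt.Sprintf("%s %d (%s)", m, number, strings.Join(quoted, ",")), nil
}

func main() {
	v, err := RenderSynchronousStandbyNames("any", 1,
		[]string{"quickstart-shard-0-1", "quickstart-shard-0-0"})
	if err != nil {
		panic(err)
	}
	fmt.Println(v)
}
```

With the two standby names from the B.1~B.3 drill, the sketch produces the same string shape that the drill reports as applied.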
Goal: cover the production-grade operational surface.
- `/metrics` baseline exposure (port 8443) — `internal/controller/metrics.go`, `cmd/main.go`.
- TLS path setup (certificate mount + `ssl=on`) — `internal/controller/builders.go:renderPostgresConf()`, `tls.go`.
- Topology spread integration — `internal/controller/topology_spread.go`.
- PVC online resize — `internal/controller/pvc_resize.go`.
- Cascade-delete guard — `internal/controller/cascade_delete_test.go`.
- [~] cert-manager integration — mount path only; issuance mechanism still TBD.
- Automatic PrometheusRule generation — Helm metrics Service / ServiceMonitor / PrometheusRule rendering + real `postgres_operator_backupjob_phase` metric driving BackupJob failure alerts. Verify PASS: `helm template charts/postgres-operator --set metrics.enabled=true --set metrics.prometheusRule.enabled=true | grep -cE "alert:"` = 8 alerts (ReconcileFailureRate / LeaderElectionLost / ReplicationLagHigh / ConnectionsHigh / PrimaryDown / BackupFailed / LocksHigh / WorkqueueDepthHigh) ≥ 8 (D.5.2, 2026-05-19).
  - Replication-lag warning — instance status `LagBytes` → `postgres_operator_postgrescluster_replication_lag_bytes` + Helm `PostgresReplicationLagHigh`.
  - Pooler failure / saturation warnings — `postgres_operator_pooler_phase{phase="Failed"}` + render verification of `cnpg_pgbouncer_*` exporter-metric-driven collection-failure / client-waiting / max-wait alerts (metric prefix retained for ecosystem exporter compatibility).
  - Disk pressure — `kubelet_volume_stats_*` data-PVC alert.
  - Backup failure — `postgres_operator_backupjob_phase{phase="Failed"}`.
- [~] Grafana dashboards — Helm dashboard ConfigMap rendering done (`postgres-operator-cluster-overview.json`, `postgres-operator-pooler.json`); live Grafana import / panel verification still pending.
- [~] Connection pooler (PgBouncer) — `Pooler` CRD + ConfigMap / Deployment / Service reconcile (first slice). `test/e2e/pooler_e2e_test.go` newly written (D.5.4 + D.5.5): Deployment 2/2 Ready + Service psql SELECT 1 + PAUSE/RESUME toggle + exporter `/metrics` exposing pgbouncer_pools. `//go:build e2e` build PASS. The live kind drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
  - CRD `Pooler.spec.{cluster, instances, type, pgbouncer.poolMode, pgbouncer.parameters}` added.
  - Separate PgBouncer Deployment / Service / ConfigMap created + `userlist.txt` Secret fail-closed validation.
  - Default PgBouncer readiness / liveness / startup probes + exporter `/metrics` readiness / liveness probes.
  - PgBouncer parameter allowlist + operator-owned-key fail-closed validation.
  - Automatic topology spread + PodDisruptionBudget when `instances > 1`.
  - Stronger rolling-update defaults — `maxUnavailable=0`, `maxSurge=1`, `minReadySeconds=5`.
  - Pooler parity surface — `deploymentStrategy`, `serviceAccountName`, status `backendTargets` / `configHash`.
  - `pg_hba` → PgBouncer `pg_hba.conf` rendering + operator-owned validation of `auth_type=hba` / `auth_hba_file`.
  - User-supplied server / client TLS Secret rendering + Secret/key fail-closed validation.
  - `type=ro` full ready-replica host-list rendering + `server_round_robin=1` + `server_login_retry=2` defaults.
  - [~] PgBouncer exporter — explicit sidecar + `metrics` ServicePort + PodMonitor selector label/sample + PrometheusRule alert render verification on standard PgBouncer metric prefixes; live Prometheus scrape / Grafana verification still pending.
  - Built-in auth user automation (T27 ⑤) — `keiailab_pooler_pgbouncer` LOGIN role + `<pooler-name>-builtin-auth` Secret auto-provisioned when `authSecretRef` is empty.
  - Built-in auth password rotation (T27 ⑥) — `postgres.keiailab.io/rotate-pooler-password=true` annotation triggers in-place `ALTER ROLE` + Secret update + status timestamp; ConfigHash now includes userlist for auto-reload.
  - Built-in TLS auto-issuance (T29) — `internal/postgres/tls_auto.go` (`IssueSelfSigned` RSA-2048 + x509 self-signed CA + ServerAuth+ClientAuth ExtKeyUsage + `ShouldRenew` 30d skew). 9 sub-tests PASS (`TestIssueSelfSigned` + `TestShouldRenew`). Covers environments without cert-manager (in-process issuance, D.6.1, 2026-05-19).
  - Paused PAUSE/RESUME reconciliation — `spec.paused` → PgBouncer `SIGUSR1` / `SIGUSR2`, `status.paused`, Pod annotation audit.
  - Pooler Service `psql` smoke — 2026-05-12 `SMOKE_POOLER=1 ./hack/smoke.sh --keep` on kind passed (`quickstart` + Pooler Service `SELECT 1 = 1`, PAUSE blocks new clients with timeout, RESUME re-enables `SELECT 1 = 1`, Deployment `2/2`).
  - In-place PgBouncer config reload — patching `pgbouncer.parameters` waits for the ConfigMap `config.sha256` projection, sends `SIGHUP` to ready Pods, and audits the Pod hash annotation while preserving Deployment generation and Pod names.
- User / DB / RBAC declarative.
  - [~] CRD `PostgresDatabase` — `spec.cluster/name/owner/ensure/tablespace/extensions/schemas/fdws/servers/privileges` + ready-primary `psql` reconcile + `status.applied` + `databaseReclaimPolicy=delete` finalizer + database/schema privilege grant/revoke implemented. `test/e2e/postgresdatabase_e2e_test.go` newly written (D.5.6): CR apply → status.applied=true / pg_database check / extension + schema applied / reclaim=delete finalizer DROP. `//go:build e2e` build PASS. The live kind drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
  - [~] CRD `PostgresUser` — `spec.cluster/name/ensure/login/superuser/createdb/createrole/replication/bypassrls/inherit/connectionLimit/inRoles/passwordSecretRef/disablePassword/validUntil` + ready-primary `psql` reconcile + `status.applied` / `passwordSecretResourceVersion` implemented; membership `REVOKE` + password Secret username match + `disablePassword` fail-closed + referenced-Secret update watch + `PostgresCluster.status.managedRolesStatus` aggregation done. `test/e2e/postgresuser_e2e_test.go` newly written (D.5.7): initial role creation → pg_roles check + connect with the initial password → Secret patch → connect with the updated password PASS + previous password rejected → CR deletion DROP ROLE. `//go:build e2e` build PASS. The live kind drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
  - Role/permission reconcile — `PostgresUser` role flags + membership `GRANT` / `REVOKE` + cluster-level managed-role status + database-object privilege model (`internal/postgres/grants.go` `BuildGrantSQL` / `BuildRevokeSQL` / `BuildDefaultPrivilegesSQL`: 5 ObjectClass values DATABASE/SCHEMA/TABLE/SEQUENCE/FUNCTION + PG 18 allowed privilege set + WITH GRANT OPTION + ALTER DEFAULT PRIVILEGES + double-quote escape + deterministic output). 13 sub-tests PASS (`TestObjectGrants`, D.5.8, 2026-05-19).
- Upgrade smoke — `test/e2e/version_upgrade_e2e_test.go`, 175 lines, `//go:build e2e` (PG 17 → 18 rolling upgrade + 3 hypotheses verified: A STS image update / B spec.postgresVersion preserved / C Pod rotation returns to Phase=Running; plus an unsupported-version reject scenario (patch to 15 → controller IsSupported rejects, STS image stays at 18)). This e2e is consistent with the stable matrix (16/17/18) in internal/version/matrix.go. The live kind run is deferred to a separate turn after the cluster mesh is restored. The "14→15→16" assumption in the P-D verify step conflicts with the PG 18+ minimum policy (ARCHITECTURE L122) and was corrected to the authoritative 16/17/18 (D.6.3, 2026-05-19).
- Security defaults hardening — `internal/controller/security_defaults.go` (`PodSecurityRestrictedLabels` PSA v1.29+ restricted enforce/audit/warn + `RestrictedSecurityContext` AllowPrivEsc=false/Privileged=false/ROfs=true/NonRoot=true/Caps=ALL drop/Seccomp=RuntimeDefault + `BuildDefaultDenyNetworkPolicies` 4-5 policies: default-deny + allow-intra (replication) + allow-client (Pooler ns) + allow-egress (DNS) + optional allow-metrics monitoring scrape). 3 tests / 5 sub-tests PASS (D.6.4, 2026-05-19).
- [~] ImageCatalog / ClusterImageCatalog — CRD + `spec.imageCatalogRef.{apiGroup,kind,name,major}` + catalog image → StatefulSet init/main container image + image-hash annotation rollout-drift tracking + catalog watch / envtest done. `test/e2e/imagecatalog_e2e_test.go` newly written (D.5.9): ImageCatalog apply (17+18) → STS image 17 + Ready → patch major 18 → STS image rollout + image-hash annotation drift tracking. `//go:build e2e` build PASS. Remaining: the live kind drill in a separate turn after the cluster mesh is restored (extension-image volume mount + official digest catalog are also follow-ups, 2026-05-19).
- [~] Replica clusters / externalClusters — `externalClusters[].connectionParameters` + `password` + `sslKey` / `sslCert` / `sslRootCert` + `bootstrap.pg_basebackup.source` + `replica.enabled` / `source` surface, streaming standalone replica bootstrap, ordinal-0 external `pg_basebackup`, `standby.signal` / `primary_conninfo`, password passfile + TLS client/root cert conninfo, persistent-follower election that blocks local promotion, and fail-closed status all verified. `test/e2e/external_clusters_drill_e2e_test.go` newly written (D.5.10): source → replica cluster (replica.enabled=true) → in_recovery=t maintained + source data streaming + primary lease holder blocked (fail-closed). `//go:build e2e` build PASS. WAL-archive hybrid + distributed-topology demotion + the live cross-cluster drill are deferred to a separate turn (2026-05-19).
- [~] Declarative hibernation — hibernation annotation `cnpg.io/hibernation=on/off` (retained for ecosystem-tool compatibility), shard StatefulSet/PVC-template preservation + `replicas=0`, native router `replicas=0`, `status.phase=Hibernated`, hibernation condition, all envtest-verified. `SMOKE_HIBERNATION=1` path PVC marker preservation + rehydration round-trip. `test/e2e/hibernation_e2e_test.go` newly written (D.5.11): marker INSERT → hibernation=on → STS replicas=0 + Phase=Hibernated + PVC preserved → hibernation=off → back to Ready + marker 'keep-me' preserved. `//go:build e2e` build PASS. The live kind drill is deferred to a separate turn after the cluster mesh is restored (2026-05-19).
- Release smoke test — `scripts/release-smoke-test.sh`, 6 stages (1/6 GH Release tag+assets / 2/6 GHCR image manifest / 3/6 GitHub Pages / 4/6 helm index / 5/6 helm pull+template default+all-features / 6/6 trivy post-publish HIGH+CRITICAL fixed only). Baseline grep verify PASS (all 6/6 stages printed) (D.6.5, 2026-05-19).
- Verify: PrometheusRule / Grafana dashboard rendering, `psql` access through the Pooler Service, live PgBouncer exporter scrape, and an upgrade rolling restart succeed.
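
The Pooler items above rely on a config hash (`configHash`, the `config.sha256` projection, and "ConfigHash now includes userlist") to decide when to reload PgBouncer. A minimal sketch of a deterministic hash over rendered files follows. `ConfigHash` and the file names used in `main` are assumptions for illustration; the operator's actual hashing inputs and encoding are not specified in this roadmap.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// ConfigHash is a hypothetical helper: it hashes rendered config files in
// sorted key order so the result does not depend on Go's map iteration order.
func ConfigHash(files map[string]string) string {
	keys := make([]string, 0, len(files))
	for k := range files {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix name and content so adjacent fields cannot be
		// confused with each other (for example "ab"+"c" versus "a"+"bc").
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(files[k]), files[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := ConfigHash(map[string]string{
		"pgbouncer.ini": "pool_mode = transaction\n",
		"userlist.txt":  "\"app\" \"old-secret\"\n",
	})
	after := ConfigHash(map[string]string{
		"pgbouncer.ini": "pool_mode = transaction\n",
		"userlist.txt":  "\"app\" \"new-secret\"\n",
	})
	// A rotated password changes the hash, which is what would trigger a reload.
	fmt.Println(before != after)
}
```

Including the userlist in the hashed input is the design point: a password rotation that leaves `pgbouncer.ini` untouched still changes the hash.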
Goal: implement sharding metadata in-house, without any external sharding runtime.
Status correction (2026-06-04): `internal/router/*` (vindex / metadata_store / scatter) was removed as dead code in #124 (3900 lines). The `[x]` items below describe the pre-removal state and are pending re-implementation; they were verified absent on the live branch.
- `ShardingMode` field (`none` / `native`) — `postgrescluster_types.go`. Constants + Spec round-trip guarded by `TestShardingMode` (`api/v1alpha1/postgrescluster_types_test.go`); enum validation is enforced at the apiserver via the `+kubebuilder:validation:Enum=none;native` marker. RFC 0001 §3.1 / RFC 0002.
- `ShardsSpec` (initial shard count / replicas / storage) — `postgrescluster_types.go`. Field round-trip + `DeepCopy` slice independence + `Replicas=0` (HA-off dev) guarded by `TestShardsSpec` (`api/v1alpha1/postgrescluster_types_test.go`). RFC 0001 §3.1.
- Sharding plugin interface — `internal/plugin/sharding/api.go`. Compile-time interface freeze + `Registry` register/get/Names round-trip + `Capabilities` advertisement + `ErrUnsupported` sentinel guarded by the `TestShardingPlugin` umbrella (`internal/plugin/sharding/api_test.go`). RFC 0001~0005 / RFC 0004 (router architecture).
- `ShardRange` CRD — `api/v1alpha1/shardrange_types.go` + `config/crd/bases/postgres.keiailab.io_shardranges.yaml` (RFC 0002, offline yaml parse PASS, `make manifests` passes).
  - Hash-range / list / range policy branching — `internal/router/vindex.go` (`ResolveShard` pure evaluation + 4 vindex branches: hash/range working + consistent-hash/lookup `ErrVindexUnsupported` deferred + 3 hash functions murmur3/fnv/crc32 + `ValidateNoOverlap` overlap detection + in-house murmur3 implementation with 0 external deps). 9 sub-tests PASS (`TestResolveShard`, D.8.2, 2026-05-19). pg-router reconciler integration follows the cmd/pg-router/ PoC.
  - Metadata store (Postgres system catalog) — `internal/router/metadata_store.go`: `Store` interface (Migrate/Upsert/List/Delete/CurrentVersion) + `PostgresStore` `sql.DB` implementation + `SchemaMigrations` versioned DDL (v1 namespace+tables+index, v2 placement hints columns) + transactional Upsert ON CONFLICT generation+1 + sorted List + Validation (rejects empty cluster/keyspace/Lo/Hi/ShardID). The rationale for not choosing a sidecar (reuses PG ACID + replication + backup, integrates with the operator's existing SQL path, adds no operational surface) is documented in the body. 9 sub-tests PASS (`TestPostgresStore`, sqlmock-based, D.8.3, 2026-05-19).
- `pg-router` service PoC — new `cmd/pg-router/`.
  - SQL parser (libpg_query or homegrown).
  - Shard-placement lookup.
  - Connection routing (libpq passthrough).
- Manual shard placement — `internal/router/placement.go` (PlacementSpec{ShardID, PreferredZone, PreferredNode, Weight} + `ValidatePlacement` rejects duplicate/empty/negative values). This is the placement intent layer for D.8.8 (2026-05-19).
- GitOps drift guard — `internal/router/placement.go` (`DetectPlacementDrift` 6 reasons: Missing/Extra/ZoneMismatch/NodeMismatch/NotReady/RangeUncovered + deterministic ordering + `HasDrift` helper). 3-way cross-check of ShardRange.ranges[].shard ↔ PlacementSpec ↔ ObservedShard. 6 sub-tests + 4 ValidatePlacement sub-tests PASS (D.8.8, 2026-05-19).
- Verify: queries through `pg-router` on a 2-shard cluster are routed to the correct shard.
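
The hash-range branch of the policy item above pairs a pure lookup with an overlap check. The sketch below illustrates that pairing under stated assumptions: the `Range` struct, inclusive bounds, and these signatures are hypothetical, and FNV-1a from the standard library stands in for the in-house murmur3 mentioned in the checklist. It is not the removed `internal/router/vindex.go`.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
	"sort"
)

// Range is a hypothetical hash range with inclusive bounds.
type Range struct {
	Lo, Hi  uint32
	ShardID string
}

// ErrNoShard is returned when no range covers the key's hash.
var ErrNoShard = errors.New("no shard range covers key")

// ValidateNoOverlap sorts a copy of the ranges by Lo and rejects inverted
// bounds or any range that starts at or before the previous range's Hi.
func ValidateNoOverlap(ranges []Range) error {
	sorted := append([]Range(nil), ranges...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Lo < sorted[j].Lo })
	for i, r := range sorted {
		if r.Lo > r.Hi {
			return fmt.Errorf("range for %s has Lo > Hi", r.ShardID)
		}
		if i > 0 && sorted[i-1].Hi >= r.Lo {
			return fmt.Errorf("ranges for %s and %s overlap", sorted[i-1].ShardID, r.ShardID)
		}
	}
	return nil
}

// ResolveShard hashes the key with FNV-1a (32-bit) and returns the shard
// whose range contains the hash. It performs no I/O.
func ResolveShard(key string, ranges []Range) (string, error) {
	h := fnv.New32a()
	h.Write([]byte(key))
	sum := h.Sum32()
	for _, r := range ranges {
		if sum >= r.Lo && sum <= r.Hi {
			return r.ShardID, nil
		}
	}
	return "", ErrNoShard
}

func main() {
	ranges := []Range{
		{Lo: 0, Hi: 0x7FFFFFFF, ShardID: "shard-0"},
		{Lo: 0x80000000, Hi: 0xFFFFFFFF, ShardID: "shard-1"},
	}
	if err := ValidateNoOverlap(ranges); err != nil {
		panic(err)
	}
	shard, err := ResolveShard("tenant-42", ranges)
	fmt.Println(shard, err)
}
```

Because both functions are pure, the 2-shard Verify scenario can be approximated in a unit test before any router process exists.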
Goal: split / rebalance without data loss.
- `ShardSplitJob` CRD — `api/v1alpha1/shardsplitjob_types.go` (~180 lines): ShardSplitJobSpec (Cluster/Keyspace/Direction/Sources/Targets/CutoverWindow/CDCMaxLag/AllowForwardOnly) + ShardSplitTarget (ShardID/Ranges/Placement) + ShardSplitJobStatus (Phase 11-enum/ObservedGeneration/StartedAt/CompletedAt/CurrentLagBytes/CutoverStartedAt/SnapshotLSN/FailureReason/Conditions) + ShardSplitDirection 2-enum (split/merge) + zz_generated_shardsplitjob.go deepcopy. 5 sub-tests PASS (`TestShardSplitJob`, D.9.1, 2026-05-19). The live CRD apply is deferred to a separate turn after the mesh is restored.
- 7-step e2e scenario — `internal/controller/shardsplit/`: Step interface freeze + 7 concrete step implementations (StepSnapshotWAL/Bootstrap/InitialCopy/CDCCatchup/Cutover/RoutingUpdate/Cleanup) + Dependencies interface (8 methods: Snapshot/BootstrapTarget/InitialCopy/StartCDC/CDCLag/Cutover/UpdateRouting/CleanupSource) + `RunAll` orchestrator (state machine + phase transition + automatic Failed handling). 14 sub-tests PASS (`TestStepRun` 11 + `TestRunAll_*` 5: HappyPath/SnapshotFailure/CDCNotReady/NilJob/PendingPhaseInit). The real K8s/SQL Dependencies implementation is a multi-month sprint (D.9.2 closed, 2026-05-19).
  - 1. Snapshot + WAL capture — `StepSnapshotWAL.Run` (Dependencies.Snapshot → records status.SnapshotLSN, sets startedAt, D.9.3).
  - 2. Bootstrap the target shard — `StepBootstrap.Run` (calls Dependencies.BootstrapTarget on every target, D.9.4).
  - 3. Initial copy — `StepInitialCopy.Run` (verifies the SnapshotLSN precondition + calls Dependencies.InitialCopy on each target, D.9.5).
  - 4. CDC catch-up — `StepCDCCatchup.Run` (Dependencies.StartCDC + measures CDCLag → updates status.CurrentLagBytes, D.9.6).
  - 5. Cutover (minimal write-block window) — `StepCutover.Run` (`CDCReadyForCutover` precondition + records status.CutoverStartedAt + Dependencies.Cutover with window, D.9.7).
  - 6. Routing update — `StepRoutingUpdate.Run` (Dependencies.UpdateRouting: atomic update of ShardRange CRD ranges + metadata store, D.9.8).
  - 7. Source cleanup — `StepCleanup.Run` (Dependencies.CleanupSource + records status.CompletedAt, D.9.9).
- Cutover rollback / forward-only verification — `internal/controller/shardsplit/steps.go` `RollbackAllowed(job)` policy function: not allowed in Cleanup/Completed; not allowed in Cutover/RoutingUpdate when AllowForwardOnly is set; allowed otherwise. `ValidateTransition` blocks post-cutover Aborted + `IsTerminal` classifies 3 phases. 7 sub-tests PASS (`TestStateMachine`, D.9.10, 2026-05-19).
- Verify: data integrity during split (checksum) + cutover-window measurement + rollback feasibility.
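
The orchestrator and rollback policy described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `Job`, `Step`, `NamedStep`, and the string phase names are hypothetical stand-ins, the context and Dependencies plumbing is omitted, and the signatures do not claim to match `internal/controller/shardsplit/`.

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical, trimmed-down view of a shard split job.
type Job struct {
	Phase            string
	FailureReason    string
	AllowForwardOnly bool
}

// Step is a hypothetical step contract: a name plus a run function.
type Step interface {
	Name() string
	Run(job *Job) error
}

// ErrCDCNotReady is a sample failure used by the demo below.
var ErrCDCNotReady = errors.New("cdc lag above threshold")

type namedStep struct {
	name string
	err  error
}

func (s namedStep) Name() string   { return s.name }
func (s namedStep) Run(*Job) error { return s.err }

// NamedStep builds a stub step that returns err when run.
func NamedStep(name string, err error) Step { return namedStep{name, err} }

// RunAll runs steps in order, records the current phase, and marks the job
// Failed on the first error so later steps never execute.
func RunAll(job *Job, steps []Step) error {
	if job == nil {
		return errors.New("nil job")
	}
	for _, s := range steps {
		job.Phase = s.Name()
		if err := s.Run(job); err != nil {
			job.Phase = "Failed"
			job.FailureReason = fmt.Sprintf("%s: %v", s.Name(), err)
			return err
		}
	}
	job.Phase = "Completed"
	return nil
}

// RollbackAllowed mirrors the policy stated in the checklist: never after
// Cleanup/Completed, and not during Cutover/RoutingUpdate for forward-only jobs.
func RollbackAllowed(job *Job) bool {
	switch job.Phase {
	case "Cleanup", "Completed":
		return false
	case "Cutover", "RoutingUpdate":
		return !job.AllowForwardOnly
	default:
		return true
	}
}

func main() {
	job := &Job{}
	err := RunAll(job, []Step{
		NamedStep("SnapshotWAL", nil),
		NamedStep("CDCCatchup", ErrCDCNotReady),
		NamedStep("Cutover", nil),
	})
	fmt.Println(job.Phase, job.FailureReason, err != nil, RollbackAllowed(job))
}
```

Stub steps make it possible to exercise the state machine before the real Kubernetes and SQL dependencies exist, which matches how the checklist separates the orchestrator from its Dependencies implementation.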
Goal: clearly bound cross-shard query / transaction support.
- Scatter-gather query path — `internal/router/scatter.go` real implementation: fan-out goroutines + pluggable ShardExecutor interface (the real libpq passthrough is delegated to an external implementation) + 2 policies FailFast/BestEffort + 2 strategies MergeConcat/MergeOrderBy + context cancellation. 9 sub-tests PASS (`TestScatterGather`, D.10.1, 2026-05-19). Wire-protocol v3 forwarding itself follows the pg-router PoC (D.8.4).
- 2PC / saga distributed-transaction choice — ADR-0015 decision (2PC primary + saga deferred) + `internal/tx/2pc.go` real in-memory state machine: Begin/Enlist/Prepare/Commit/Rollback + State (Active/Prepared/Committed/RolledBack/InDoubt) + parallel goroutine prepare + automatic rollback on partial failure + InDoubt marking + deterministic GID/TxID issuance. 8 sub-tests PASS (`TestTwoPhaseCommit`, D.10.2, 2026-05-19). Tx log persistence (etcd) + Lease election integration follow D.2.2.
- Isolation matrix documented — which isolation levels hold under which conditions. Evidence: `docs/sql/isolation-matrix.md` (D.10.3).
- [~] Benchmarks — sysbench / pgbench variants (`test/bench/pgbench.sh` + `sysbench.sh` + `docs/perf/baseline.md` skeleton; pending live measurement).
- Verify: per-isolation-level anomaly / no-anomaly table + benchmark numbers.
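
The 2PC behaviour listed above (parallel prepare, automatic rollback on partial failure, InDoubt on a failed commit) can be illustrated with a small in-memory coordinator. This is a sketch under stated assumptions: `Participant`, `RunTwoPhase`, and `FakeShard` are hypothetical names, contexts and persistence are omitted, and `errors.Join` requires Go 1.20 or later. It does not reproduce `internal/tx/2pc.go`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// State is the terminal outcome of one distributed transaction.
type State string

const (
	Committed  State = "Committed"
	RolledBack State = "RolledBack"
	InDoubt    State = "InDoubt"
)

// Participant is a hypothetical shard-side contract for this sketch.
type Participant interface {
	Prepare(gid string) error
	Commit(gid string) error
	Rollback(gid string) error
}

// RunTwoPhase prepares all participants in parallel. Any prepare error rolls
// everything back; a commit error after a successful prepare leaves the
// transaction InDoubt because some shards may already have committed.
func RunTwoPhase(gid string, parts []Participant) (State, error) {
	errs := make([]error, len(parts))
	var wg sync.WaitGroup
	for i, p := range parts {
		wg.Add(1)
		go func(i int, p Participant) {
			defer wg.Done()
			errs[i] = p.Prepare(gid) // each goroutine writes its own slot
		}(i, p)
	}
	wg.Wait()
	if err := errors.Join(errs...); err != nil {
		for _, p := range parts {
			_ = p.Rollback(gid) // best-effort rollback of every participant
		}
		return RolledBack, err
	}
	for _, p := range parts {
		if err := p.Commit(gid); err != nil {
			return InDoubt, err
		}
	}
	return Committed, nil
}

// FakeShard is a test double that records calls and can be told to fail.
type FakeShard struct {
	FailPrepare, FailCommit bool
	Calls                   []string
}

func (f *FakeShard) Prepare(string) error {
	f.Calls = append(f.Calls, "prepare")
	if f.FailPrepare {
		return errors.New("prepare failed")
	}
	return nil
}

func (f *FakeShard) Commit(string) error {
	f.Calls = append(f.Calls, "commit")
	if f.FailCommit {
		return errors.New("commit failed")
	}
	return nil
}

func (f *FakeShard) Rollback(string) error {
	f.Calls = append(f.Calls, "rollback")
	return nil
}

func main() {
	state, err := RunTwoPhase("gid-1", []Participant{&FakeShard{}, &FakeShard{FailPrepare: true}})
	fmt.Println(state, err)
}
```

The InDoubt outcome is the case that the checklist's follow-up work (transaction log persistence) would need to resolve, since an in-memory coordinator cannot recover it after a restart.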
Goal: commercial-grade quality.
- e2e baseline — `test/e2e/`.
- Long-running soak — ≥ 7 days, no downtime. (NON-GOAL for single session — 7-day wall clock required)
- Chaos engineering — pod kill / network partition / disk pressure. (multi-day chaos drill required)
- Restore rehearsal — periodic automated backup-restore + verification. (monthly cron drill — out of single session)
- Upgrade matrix — N → N+1 / N → N+2 / minor patches — `test/e2e/version_upgrade_e2e_test.go` covers both sides of the matrix: the PG 17→18 rolling upgrade and the unsupported-15 reject. Consistent with the stable matrix (16/17/18) in internal/version/matrix.go. Consistent with the GH Actions ban (RFC-0002): run locally with `make test-e2e-version-upgrade`. D.6.3 dependency satisfied (D.11.4, 2026-05-19).
- SBOM + signing — `scripts/sbom-attach.sh`, 126 lines (syft SPDX-JSON SBOM generation → cosign sign image → cosign attest --type spdxjson → cosign verify + verify-attestation; branches on COSIGN_KEY or keyless OIDC; covers IMAGE_OPERATOR plus optional IMAGE_PG). Consistent with RFC-0002 (manual or local run on release tag push, without GH Actions). bash syntax PASS (D.11.5, 2026-05-19).
- Docs / runbooks complete.
  - HA / backup / restore / upgrade / security / migration runbooks — `docs/runbooks/{ha,backup,restore,upgrade,security,migration,pvc-fence}.md`: all 7 runbooks exist (6 required + pvc-fence added this turn). upgrade was expanded to 206 lines this turn (D.2.3). Verify `ls docs/runbooks/{ha,backup,restore,upgrade,security,migration}.md` PASS (D.11.7, 2026-05-19).
- Verify: 7-day soak passes + N chaos scenarios pass + SBOM attached + every runbook exists.
- ❌ Repackaging an external PostgreSQL operator.
- ❌ External sharding-extension built-in features (external sharding extensions are design references, not runtime dependencies).
- ❌ A general-purpose Plugin SDK product story (retired from the v0.x archive).
- ❌ GitHub Actions as a required release gate — see RFC 0002 (org-wide). Delegated to the local 4-layer gate.
- ❌ Date-based roadmap deadlines — see the org-wide `workflow.md`.
- ❌ Marketing HA / backup features as `production-ready` before they are verified.
| Date | Change |
|---|---|
| 2026-05-16 | G3 §Sharding foundation: flipped ShardingMode / ShardsSpec / Sharding plugin interface [~] → [x] with unit-test coverage (TestShardingMode, TestShardsSpec, TestShardingPlugin). Plans 2026-05-14-4-operators-100pct/P-D §D.7. |
| 2026-05-12 | Backup/restore gap closed: added ScheduledBackup CRD/controller, BackupJob creation on cron firing, BackupJob.spec.type=restore → RestorePIT call path, executionMode=job runner Job lifecycle, pgBackRest command-runner plugin registration, and the sidecar pod-exec path. |
| 2026-05-12 | Observability gap closed: added Helm metrics Service / ServiceMonitor / PrometheusRule + postgres_operator_backupjob_phase Prometheus metric. |
| 2026-05-11 | G1 §Backup/Restore BackupJob.Phase transitions (Pending → Running → Succeeded/Failed) implemented + 8 unit tests — [x] (ralph-loop iter#3). |
| 2026-05-11 | Full rewrite — introduced Gate-scoped sub-task checklists, buffer indicators, and removed any date-style language. |
| 2026-05-07 | Released 0.3.0-alpha.3, switched to public GHCR pull, removed legacy staging operator, and made the "no embedded external systems" principle explicit. |
© 2026 keiailab · MIT · keiailab.com