feat: surface alignment + memory persistence + progressive delivery by yairfalse · Pull Request #35 · false-systems/nopea

yairfalse · 2026-03-08T01:26:50Z

Summary

Surface Alignment: Extract Nopea.Surface facade — CLI, MCP, and HTTP now delegate to a shared module for consistent behaviour and graceful degradation. Adds explain, health, services, promote, rollback commands across all surfaces. MCP grows from 5 to 8 tools.
Memory Persistence: Graph persists to .nopea/graph.etf as version-prefixed ETF binary. Restore chain: ETS snapshot → disk → fresh graph. Survives full application restarts.
Progressive Delivery: Canary/blue_green deploys honestly return :progressing instead of :completed. Per-rollout Progressive.Monitor GenServer polls Kulta Rollout CRD status, supports manual promote/rollback, self-terminates on completion/failure/timeout (1h max).

Test plan

305 tests pass, 0 failures
0 Credo issues
Surface test: all 6 functions tested with Cache/Memory running
Memory persistence test: persist, restore, corrupt file handling, terminate callback
Progressive monitor test: start, promote, rollback, polling, list_active
Deploy progressive test: canary → :progressing, blue_green → :progressing, direct → :completed
Existing deploy tests updated: canary/blue_green assert :progressing
MCP test updated: 8 tools listed

🤖 Generated with Claude Code

- Add missing Logger metadata keys to config (count, keys, retry, max_retries, type, endpoint, path, manifest_count) - Extract handle_send_failure/4 from Events.Emitter.handle_info to reduce nesting depth from 3 to 2 - Remove redundant last with clause in Occurrence.persist — the final File.write becomes the with body's return value directly - Replace length(list) >= 1 with list != [] in query_test.exs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Unified facade backing all user-facing interfaces — CLI, MCP, and HTTP now delegate to Surface for consistent behaviour and graceful degradation when optional subsystems aren't running. - Extract Nopea.Surface with status/context/history/explain/health/services - Rewire CLI to Surface, add explain/health/services/mcp commands - Rewire MCP to Surface, add nopea_services tool (6 tools total) - Add GET /api/status/:service, /api/explain/:service, /api/services - Fix HTTP router to use Deploy.deploy/1 instead of Deploy.run/1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Graph survives restarts — persisted as version-prefixed ETF binary. Restore chain: ETS snapshot → disk → fresh graph. - Add workdir to Memory GenServer state - Persist after record_deploy, decay, and terminate/2 - Version byte prefix (<<1>>) for future format migrations - :erlang.binary_to_term with [:safe] prevents code execution - Update existing tests to use tmp_dir for isolation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Canary and blue_green deploys now honestly report :progressing instead of :completed. A per-rollout Monitor polls the Kulta Rollout CRD and supports manual promote/rollback from all surfaces. - Add :progressing status to Result, add Result.progressing/5 - Add Progressive.Rollout struct for rollout lifecycle tracking - Add patch_resource to K8s Behaviour for promote/rollback - Implement Progressive.Monitor GenServer (poll/promote/rollback) - Add Progressive.Supervisor (DynamicSupervisor) to OTP tree - Integrate progressive flow into Deploy pipeline - Add promote/rollback to Surface, CLI, MCP, and HTTP - MCP now exposes 8 tools (added nopea_promote, nopea_rollback) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Introduces a shared Nopea.Surface façade so CLI/MCP/HTTP surfaces delegate to consistent domain logic, adds on-disk persistence for the Memory graph, and implements progressive delivery monitoring for canary/blue-green deploys.

Changes:

Add Nopea.Surface and update CLI/MCP/HTTP routes to delegate to it (plus new surface commands/tools/endpoints).
Persist Memory graph to disk (.nopea/graph.etf) with a restore chain (Cache snapshot → disk → fresh graph).
Implement progressive delivery rollout monitoring (DynamicSupervisor + per-rollout GenServer) and return :progressing for canary/blue-green deploys.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
test/nopea/surface_test.exs	Adds tests for the new `Nopea.Surface` façade behaviors.
test/nopea/progressive/monitor_test.exs	Exercises progressive rollout monitoring, promote/rollback, polling, and list_active.
test/nopea/memory_test.exs	Updates Memory tests to isolate persistence via `tmp_dir` workdir.
test/nopea/memory_persistence_test.exs	Adds coverage for disk persistence/restore and corrupt-file handling.
test/nopea/memory/query_test.exs	Minor assertion adjustments for non-empty results.
test/nopea/mcp_test.exs	Updates MCP tool list expectations to 8 tools.
test/nopea/deploy_test.exs	Updates canary/blue_green expectations to `:progressing`.
test/nopea/deploy_progressive_test.exs	Adds integration-style tests for progressive deploy behavior and monitor startup.
test/nopea/application_test.exs	Adjusts Memory startup tests to use `tmp_dir` workdir.
lib/nopea/surface.ex	New shared façade for status/context/history/explain/health/services/promote/rollback.
lib/nopea/progressive/supervisor.ex	Adds DynamicSupervisor to manage rollout monitor processes.
lib/nopea/progressive/rollout.ex	Introduces rollout state struct/type for progressive delivery tracking.
lib/nopea/progressive/monitor.ex	Adds per-rollout GenServer polling Kulta Rollout CRD and supporting promote/rollback.
lib/nopea/occurrence.ex	Refactors persistence flow for ETF writing in occurrences.
lib/nopea/memory.ex	Adds disk persistence + restore for Memory graph; supports `workdir:` and terminate persistence.
lib/nopea/mcp.ex	Adds new tools (services/promote/rollback) and delegates tool logic to `Nopea.Surface`.
lib/nopea/k8s/behaviour.ex	Extends K8s behaviour with `patch_resource/5`.
lib/nopea/k8s.ex	Implements `patch_resource/5` using `K8s.Client.patch`.
lib/nopea/events/emitter.ex	Refactors retry failure handling into a helper function.
lib/nopea/deploy/result.ex	Adds `:progressing` status and a constructor for progressive deploy results.
lib/nopea/deploy.ex	Returns `:progressing` for canary/blue_green, starts monitor, and emits progressing logs.
lib/nopea/cli.ex	Adds/rewires CLI commands and delegates to `Nopea.Surface` (incl. `mcp`).
lib/nopea/application.ex	Adds progressive supervisor to the supervision tree (gated with deploy supervisor flag).
lib/nopea/api/router.ex	Adds new HTTP endpoints and routes deploy through `Deploy.deploy/1`; delegates to `Nopea.Surface`.
config/config.exs	Expands allowed logger metadata keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-08T01:33:52Z

+      {:ok, {applied, :progressing}} ->
+        duration_ms = duration_ms(start_time)
+        result = Result.progressing(deploy_id, spec, strategy, applied, duration_ms)
+        record_outcome(result, context)


When a progressive strategy returns :progressing, this branch still calls record_outcome/2, which writes status: :progressing into Memory + Cache and generates an occurrence. The monitor later records a terminal outcome only to Memory, so Cache (and /api/status, services, etc.) can remain stuck at :progressing and Memory analytics will double-count (progressing + completed/failed) for a single deploy.

Consider not calling record_outcome/2 here (or make it an explicit “in-progress” record that doesn’t overwrite service_state), and instead update Cache/service state + occurrence when the rollout reaches a terminal phase.

Suggested change

record_outcome(result, context)

Fixed in 3d0ea1e — removed record_outcome(result, context) from the :progressing branch. The Monitor now owns the full lifecycle: it records the outcome (and updates Cache) when the rollout reaches a terminal phase.

Copilot · 2026-03-08T01:33:53Z

+  defp record_outcome(rollout) do
+    if Process.whereis(Nopea.Memory) do
+      status = if rollout.phase in [:completed, :promoted], do: :completed, else: :failed
+
+      Nopea.Memory.record_deploy(%{
+        service: rollout.service,
+        namespace: rollout.namespace,
+        status: status,
+        error: if(status == :failed, do: :rollout_failed),
+        duration_ms: DateTime.diff(rollout.updated_at, rollout.started_at, :millisecond),
+        concurrent_deploys: []
+      })
+    end


record_outcome/1 only writes the terminal result to Memory. Since Deploy.run/1 stores the initial :progressing state in Cache/service_state, the cache-backed surfaces can remain permanently :progressing unless the monitor also updates Cache (and the cached deployment record) on completion/failure/timeout.

Consider mirroring the Deploy.record_outcome cache update logic here (or factoring it into a shared helper) so terminal rollout outcomes are reflected across all surfaces.

Fixed in 3d0ea1e — record_outcome/1 in Monitor now updates Cache with the terminal status via Nopea.Cache.put_service_state/2, matching the pattern used in Deploy.record_outcome/2.

Copilot · 2026-03-08T01:33:53Z

+
+    with :ok <- File.mkdir_p(Path.dirname(path)) do
+      binary = @graph_version <> :erlang.term_to_binary(graph)
+      File.write(path, binary)
+    end
+  rescue
+    error ->
+      Logger.warning("Failed to persist graph to disk",
+        error: inspect(error)
+      )


persist_to_disk/2 doesn’t log or handle File.write/2 failures (it only rescues exceptions). If File.write/2 returns {:error, reason}, persistence silently fails and callers ignore the returned error, risking undetected data loss.

Handle the File.write/2 return value (e.g., log {:error, reason}) so operational failures are visible.

Suggested change

with :ok <- File.mkdir_p(Path.dirname(path)) do

binary = @graph_version <> :erlang.term_to_binary(graph)

File.write(path, binary)

end

rescue

error ->

Logger.warning("Failed to persist graph to disk",

error: inspect(error)

)

binary = @graph_version <> :erlang.term_to_binary(graph)

with :ok <- File.mkdir_p(Path.dirname(path)),

:ok <- File.write(path, binary) do

:ok

else

{:error, reason} ->

Logger.warning("Failed to persist graph to disk",

error: inspect(reason)

)

{:error, reason}

end

rescue

error ->

Logger.warning("Failed to persist graph to disk",

error: inspect(error)

)

{:error, error}

Fixed in 3d0ea1e — File.write/2 is now chained into the with clause, and failures are logged via Logger.warning with the error reason. The rescue clause also logs instead of silently swallowing.

Copilot · 2026-03-08T01:33:53Z

+  post "/api/promote/:deploy_id" do
+    case Nopea.Surface.promote(deploy_id) do
+      {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
+      {:error, :not_found} -> json(conn, 404, %{error: "not_found"})
+      {:error, reason} -> json(conn, 500, %{error: inspect(reason)})
+    end
+  end
+
+  post "/api/rollback/:deploy_id" do
+    case Nopea.Surface.rollback(deploy_id) do
+      {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
+      {:error, :not_found} -> json(conn, 404, %{error: "not_found"})
+      {:error, reason} -> json(conn, 500, %{error: inspect(reason)})
+    end


These endpoints return 500 with inspect(reason) in the JSON body. This can leak internal details (atoms, exception messages, potentially upstream responses) to API clients.

Prefer logging reason server-side and returning a stable, client-safe error payload (similar to handle_deploy/1’s Internal server error).

Fixed in 3d0ea1e — error responses now return a generic "Internal server error" message. The actual reason is logged server-side via Logger.error with the deploy_id for correlation.

Copilot · 2026-03-08T01:33:53Z

+  post "/api/promote/:deploy_id" do
+    case Nopea.Surface.promote(deploy_id) do
+      {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
+      {:error, :not_found} -> json(conn, 404, %{error: "not_found"})
+      {:error, reason} -> json(conn, 500, %{error: inspect(reason)})
+    end
+  end
+
+  post "/api/rollback/:deploy_id" do
+    case Nopea.Surface.rollback(deploy_id) do
+      {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
+      {:error, :not_found} -> json(conn, 404, %{error: "not_found"})
+      {:error, reason} -> json(conn, 500, %{error: inspect(reason)})
+    end


The POST /api/promote/:deploy_id and POST /api/rollback/:deploy_id routes perform privileged rollout operations without any authentication or authorization checks. If the Nopea HTTP API is reachable from an untrusted network, an attacker could arbitrarily promote or roll back progressive rollouts and thus modify the state of the Kubernetes cluster. Add strong authentication/authorization in front of these endpoints (or ensure they are only reachable behind an authenticated proxy) so that only authorized operators can trigger rollout changes.

Agreed — auth is needed before production use. This PR focuses on the core progressive delivery mechanics. Auth middleware (API key or mTLS) is a follow-up concern that should apply uniformly to all mutating endpoints, not just promote/rollback. Will track separately.

…s, sanitize HTTP errors - Remove record_outcome from :progressing branch to prevent Cache getting stuck at :progressing status - Add Cache update to Monitor.record_outcome so final rollout status is persisted - Chain File.write into with clause in persist_to_disk with proper error logging - Replace inspect(reason) in HTTP error responses with generic "Internal server error" and server-side Logger.error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yairfalse and others added 4 commits March 6, 2026 01:55

yairfalse requested a review from Copilot March 8, 2026 01:27

Copilot started reviewing on behalf of yairfalse March 8, 2026 01:27 View session

Copilot AI reviewed Mar 8, 2026

View reviewed changes

yairfalse merged commit 1ded75d into main Mar 10, 2026
0 of 2 checks passed

yairfalse deleted the feat/surface-persistence-progressive branch March 10, 2026 23:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: surface alignment + memory persistence + progressive delivery#35

feat: surface alignment + memory persistence + progressive delivery#35
yairfalse merged 5 commits into
mainfrom
feat/surface-persistence-progressive

yairfalse commented Mar 8, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 8, 2026

Uh oh!

yairfalse Mar 8, 2026

Uh oh!

Copilot AI Mar 8, 2026

Uh oh!

yairfalse Mar 8, 2026

Uh oh!

Copilot AI Mar 8, 2026

Uh oh!

yairfalse Mar 8, 2026

Uh oh!

Copilot AI Mar 8, 2026

Uh oh!

yairfalse Mar 8, 2026

Uh oh!

Copilot AI Mar 8, 2026

Uh oh!

yairfalse Mar 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yairfalse commented Mar 8, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

yairfalse Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

yairfalse Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

yairfalse Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

yairfalse Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

yairfalse Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants