feat: surface alignment + memory persistence + progressive delivery#35
Conversation
- Add missing Logger metadata keys to config (count, keys, retry, max_retries, type, endpoint, path, manifest_count) - Extract handle_send_failure/4 from Events.Emitter.handle_info to reduce nesting depth from 3 to 2 - Remove redundant last with clause in Occurrence.persist — the final File.write becomes the with body's return value directly - Replace length(list) >= 1 with list != [] in query_test.exs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Unified facade backing all user-facing interfaces — CLI, MCP, and HTTP now delegate to Surface for consistent behaviour and graceful degradation when optional subsystems aren't running. - Extract Nopea.Surface with status/context/history/explain/health/services - Rewire CLI to Surface, add explain/health/services/mcp commands - Rewire MCP to Surface, add nopea_services tool (6 tools total) - Add GET /api/status/:service, /api/explain/:service, /api/services - Fix HTTP router to use Deploy.deploy/1 instead of Deploy.run/1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graph survives restarts — persisted as version-prefixed ETF binary. Restore chain: ETS snapshot → disk → fresh graph. - Add workdir to Memory GenServer state - Persist after record_deploy, decay, and terminate/2 - Version byte prefix (<<1>>) for future format migrations - :erlang.binary_to_term with [:safe] prevents code execution - Update existing tests to use tmp_dir for isolation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Canary and blue_green deploys now honestly report :progressing instead of :completed. A per-rollout Monitor polls the Kulta Rollout CRD and supports manual promote/rollback from all surfaces. - Add :progressing status to Result, add Result.progressing/5 - Add Progressive.Rollout struct for rollout lifecycle tracking - Add patch_resource to K8s Behaviour for promote/rollback - Implement Progressive.Monitor GenServer (poll/promote/rollback) - Add Progressive.Supervisor (DynamicSupervisor) to OTP tree - Integrate progressive flow into Deploy pipeline - Add promote/rollback to Surface, CLI, MCP, and HTTP - MCP now exposes 8 tools (added nopea_promote, nopea_rollback) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Introduces a shared Nopea.Surface façade so CLI/MCP/HTTP surfaces delegate to consistent domain logic, adds on-disk persistence for the Memory graph, and implements progressive delivery monitoring for canary/blue-green deploys.
Changes:
- Add
Nopea.Surfaceand update CLI/MCP/HTTP routes to delegate to it (plus new surface commands/tools/endpoints). - Persist Memory graph to disk (
.nopea/graph.etf) with a restore chain (Cache snapshot → disk → fresh graph). - Implement progressive delivery rollout monitoring (DynamicSupervisor + per-rollout GenServer) and return
:progressingfor canary/blue-green deploys.
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| test/nopea/surface_test.exs | Adds tests for the new Nopea.Surface façade behaviors. |
| test/nopea/progressive/monitor_test.exs | Exercises progressive rollout monitoring, promote/rollback, polling, and list_active. |
| test/nopea/memory_test.exs | Updates Memory tests to isolate persistence via tmp_dir workdir. |
| test/nopea/memory_persistence_test.exs | Adds coverage for disk persistence/restore and corrupt-file handling. |
| test/nopea/memory/query_test.exs | Minor assertion adjustments for non-empty results. |
| test/nopea/mcp_test.exs | Updates MCP tool list expectations to 8 tools. |
| test/nopea/deploy_test.exs | Updates canary/blue_green expectations to :progressing. |
| test/nopea/deploy_progressive_test.exs | Adds integration-style tests for progressive deploy behavior and monitor startup. |
| test/nopea/application_test.exs | Adjusts Memory startup tests to use tmp_dir workdir. |
| lib/nopea/surface.ex | New shared façade for status/context/history/explain/health/services/promote/rollback. |
| lib/nopea/progressive/supervisor.ex | Adds DynamicSupervisor to manage rollout monitor processes. |
| lib/nopea/progressive/rollout.ex | Introduces rollout state struct/type for progressive delivery tracking. |
| lib/nopea/progressive/monitor.ex | Adds per-rollout GenServer polling Kulta Rollout CRD and supporting promote/rollback. |
| lib/nopea/occurrence.ex | Refactors persistence flow for ETF writing in occurrences. |
| lib/nopea/memory.ex | Adds disk persistence + restore for Memory graph; supports workdir: and terminate persistence. |
| lib/nopea/mcp.ex | Adds new tools (services/promote/rollback) and delegates tool logic to Nopea.Surface. |
| lib/nopea/k8s/behaviour.ex | Extends K8s behaviour with patch_resource/5. |
| lib/nopea/k8s.ex | Implements patch_resource/5 using K8s.Client.patch. |
| lib/nopea/events/emitter.ex | Refactors retry failure handling into a helper function. |
| lib/nopea/deploy/result.ex | Adds :progressing status and a constructor for progressive deploy results. |
| lib/nopea/deploy.ex | Returns :progressing for canary/blue_green, starts monitor, and emits progressing logs. |
| lib/nopea/cli.ex | Adds/rewires CLI commands and delegates to Nopea.Surface (incl. mcp). |
| lib/nopea/application.ex | Adds progressive supervisor to the supervision tree (gated with deploy supervisor flag). |
| lib/nopea/api/router.ex | Adds new HTTP endpoints and routes deploy through Deploy.deploy/1; delegates to Nopea.Surface. |
| config/config.exs | Expands allowed logger metadata keys. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| {:ok, {applied, :progressing}} -> | ||
| duration_ms = duration_ms(start_time) | ||
| result = Result.progressing(deploy_id, spec, strategy, applied, duration_ms) | ||
| record_outcome(result, context) |
There was a problem hiding this comment.
When a progressive strategy returns :progressing, this branch still calls record_outcome/2, which writes status: :progressing into Memory + Cache and generates an occurrence. The monitor later records a terminal outcome only to Memory, so Cache (and /api/status, services, etc.) can remain stuck at :progressing and Memory analytics will double-count (progressing + completed/failed) for a single deploy.
Consider not calling record_outcome/2 here (or make it an explicit “in-progress” record that doesn’t overwrite service_state), and instead update Cache/service state + occurrence when the rollout reaches a terminal phase.
| record_outcome(result, context) |
There was a problem hiding this comment.
Fixed in 3d0ea1e — removed record_outcome(result, context) from the :progressing branch. The Monitor now owns the full lifecycle: it records the outcome (and updates Cache) when the rollout reaches a terminal phase.
| defp record_outcome(rollout) do | ||
| if Process.whereis(Nopea.Memory) do | ||
| status = if rollout.phase in [:completed, :promoted], do: :completed, else: :failed | ||
|
|
||
| Nopea.Memory.record_deploy(%{ | ||
| service: rollout.service, | ||
| namespace: rollout.namespace, | ||
| status: status, | ||
| error: if(status == :failed, do: :rollout_failed), | ||
| duration_ms: DateTime.diff(rollout.updated_at, rollout.started_at, :millisecond), | ||
| concurrent_deploys: [] | ||
| }) | ||
| end |
There was a problem hiding this comment.
record_outcome/1 only writes the terminal result to Memory. Since Deploy.run/1 stores the initial :progressing state in Cache/service_state, the cache-backed surfaces can remain permanently :progressing unless the monitor also updates Cache (and the cached deployment record) on completion/failure/timeout.
Consider mirroring the Deploy.record_outcome cache update logic here (or factoring it into a shared helper) so terminal rollout outcomes are reflected across all surfaces.
There was a problem hiding this comment.
Fixed in 3d0ea1e — record_outcome/1 in Monitor now updates Cache with the terminal status via Nopea.Cache.put_service_state/2, matching the pattern used in Deploy.record_outcome/2.
|
|
||
| with :ok <- File.mkdir_p(Path.dirname(path)) do | ||
| binary = @graph_version <> :erlang.term_to_binary(graph) | ||
| File.write(path, binary) | ||
| end | ||
| rescue | ||
| error -> | ||
| Logger.warning("Failed to persist graph to disk", | ||
| error: inspect(error) | ||
| ) |
There was a problem hiding this comment.
persist_to_disk/2 doesn’t log or handle File.write/2 failures (it only rescues exceptions). If File.write/2 returns {:error, reason}, persistence silently fails and callers ignore the returned error, risking undetected data loss.
Handle the File.write/2 return value (e.g., log {:error, reason}) so operational failures are visible.
| with :ok <- File.mkdir_p(Path.dirname(path)) do | |
| binary = @graph_version <> :erlang.term_to_binary(graph) | |
| File.write(path, binary) | |
| end | |
| rescue | |
| error -> | |
| Logger.warning("Failed to persist graph to disk", | |
| error: inspect(error) | |
| ) | |
| binary = @graph_version <> :erlang.term_to_binary(graph) | |
| with :ok <- File.mkdir_p(Path.dirname(path)), | |
| :ok <- File.write(path, binary) do | |
| :ok | |
| else | |
| {:error, reason} -> | |
| Logger.warning("Failed to persist graph to disk", | |
| error: inspect(reason) | |
| ) | |
| {:error, reason} | |
| end | |
| rescue | |
| error -> | |
| Logger.warning("Failed to persist graph to disk", | |
| error: inspect(error) | |
| ) | |
| {:error, error} |
There was a problem hiding this comment.
Fixed in 3d0ea1e — File.write/2 is now chained into the with clause, and failures are logged via Logger.warning with the error reason. The rescue clause also logs instead of silently swallowing.
| post "/api/promote/:deploy_id" do | ||
| case Nopea.Surface.promote(deploy_id) do | ||
| {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout)) | ||
| {:error, :not_found} -> json(conn, 404, %{error: "not_found"}) | ||
| {:error, reason} -> json(conn, 500, %{error: inspect(reason)}) | ||
| end | ||
| end | ||
|
|
||
| post "/api/rollback/:deploy_id" do | ||
| case Nopea.Surface.rollback(deploy_id) do | ||
| {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout)) | ||
| {:error, :not_found} -> json(conn, 404, %{error: "not_found"}) | ||
| {:error, reason} -> json(conn, 500, %{error: inspect(reason)}) | ||
| end |
There was a problem hiding this comment.
These endpoints return 500 with inspect(reason) in the JSON body. This can leak internal details (atoms, exception messages, potentially upstream responses) to API clients.
Prefer logging reason server-side and returning a stable, client-safe error payload (similar to handle_deploy/1’s Internal server error).
There was a problem hiding this comment.
Fixed in 3d0ea1e — error responses now return a generic "Internal server error" message. The actual reason is logged server-side via Logger.error with the deploy_id for correlation.
| post "/api/promote/:deploy_id" do | ||
| case Nopea.Surface.promote(deploy_id) do | ||
| {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout)) | ||
| {:error, :not_found} -> json(conn, 404, %{error: "not_found"}) | ||
| {:error, reason} -> json(conn, 500, %{error: inspect(reason)}) | ||
| end | ||
| end | ||
|
|
||
| post "/api/rollback/:deploy_id" do | ||
| case Nopea.Surface.rollback(deploy_id) do | ||
| {:ok, rollout} -> json(conn, 200, Map.from_struct(rollout)) | ||
| {:error, :not_found} -> json(conn, 404, %{error: "not_found"}) | ||
| {:error, reason} -> json(conn, 500, %{error: inspect(reason)}) | ||
| end |
There was a problem hiding this comment.
The POST /api/promote/:deploy_id and POST /api/rollback/:deploy_id routes perform privileged rollout operations without any authentication or authorization checks. If the Nopea HTTP API is reachable from an untrusted network, an attacker could arbitrarily promote or roll back progressive rollouts and thus modify the state of the Kubernetes cluster. Add strong authentication/authorization in front of these endpoints (or ensure they are only reachable behind an authenticated proxy) so that only authorized operators can trigger rollout changes.
There was a problem hiding this comment.
Agreed — auth is needed before production use. This PR focuses on the core progressive delivery mechanics. Auth middleware (API key or mTLS) is a follow-up concern that should apply uniformly to all mutating endpoints, not just promote/rollback. Will track separately.
…s, sanitize HTTP errors - Remove record_outcome from :progressing branch to prevent Cache getting stuck at :progressing status - Add Cache update to Monitor.record_outcome so final rollout status is persisted - Chain File.write into with clause in persist_to_disk with proper error logging - Replace inspect(reason) in HTTP error responses with generic "Internal server error" and server-side Logger.error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Surface Alignment: Extract
Nopea.Surfacefacade — CLI, MCP, and HTTP now delegate to a shared module for consistent behaviour and graceful degradation. Addsexplain,health,services,promote,rollbackcommands across all surfaces. MCP grows from 5 to 8 tools.Memory Persistence: Graph persists to
.nopea/graph.etfas version-prefixed ETF binary. Restore chain: ETS snapshot → disk → fresh graph. Survives full application restarts.Progressive Delivery: Canary/blue_green deploys honestly return
:progressinginstead of:completed. Per-rolloutProgressive.MonitorGenServer polls Kulta Rollout CRD status, supports manual promote/rollback, self-terminates on completion/failure/timeout (1h max).Test plan
🤖 Generated with Claude Code