Skip to content

feat: surface alignment + memory persistence + progressive delivery#35

Merged
yairfalse merged 5 commits into
mainfrom
feat/surface-persistence-progressive
Mar 10, 2026
Merged

feat: surface alignment + memory persistence + progressive delivery#35
yairfalse merged 5 commits into
mainfrom
feat/surface-persistence-progressive

Conversation

@yairfalse
Copy link
Copy Markdown
Collaborator

Summary

  • Surface Alignment: Extract Nopea.Surface facade — CLI, MCP, and HTTP now delegate to a shared module for consistent behaviour and graceful degradation. Adds explain, health, services, promote, rollback commands across all surfaces. MCP grows from 5 to 8 tools.

  • Memory Persistence: Graph persists to .nopea/graph.etf as version-prefixed ETF binary. Restore chain: ETS snapshot → disk → fresh graph. Survives full application restarts.

  • Progressive Delivery: Canary/blue_green deploys honestly return :progressing instead of :completed. Per-rollout Progressive.Monitor GenServer polls Kulta Rollout CRD status, supports manual promote/rollback, self-terminates on completion/failure/timeout (1h max).

Test plan

  • 305 tests pass, 0 failures
  • 0 Credo issues
  • Surface test: all 6 functions tested with Cache/Memory running
  • Memory persistence test: persist, restore, corrupt file handling, terminate callback
  • Progressive monitor test: start, promote, rollback, polling, list_active
  • Deploy progressive test: canary → :progressing, blue_green → :progressing, direct → :completed
  • Existing deploy tests updated: canary/blue_green assert :progressing
  • MCP test updated: 8 tools listed

🤖 Generated with Claude Code

yairfalse and others added 4 commits March 6, 2026 01:55
- Add missing Logger metadata keys to config (count, keys, retry,
  max_retries, type, endpoint, path, manifest_count)
- Extract handle_send_failure/4 from Events.Emitter.handle_info to
  reduce nesting depth from 3 to 2
- Remove redundant last with clause in Occurrence.persist — the final
  File.write becomes the with body's return value directly
- Replace length(list) >= 1 with list != [] in query_test.exs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Unified facade backing all user-facing interfaces — CLI, MCP, and HTTP
now delegate to Surface for consistent behaviour and graceful degradation
when optional subsystems aren't running.

- Extract Nopea.Surface with status/context/history/explain/health/services
- Rewire CLI to Surface, add explain/health/services/mcp commands
- Rewire MCP to Surface, add nopea_services tool (6 tools total)
- Add GET /api/status/:service, /api/explain/:service, /api/services
- Fix HTTP router to use Deploy.deploy/1 instead of Deploy.run/1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Graph survives restarts — persisted as version-prefixed ETF binary.
Restore chain: ETS snapshot → disk → fresh graph.

- Add workdir to Memory GenServer state
- Persist after record_deploy, decay, and terminate/2
- Version byte prefix (<<1>>) for future format migrations
- :erlang.binary_to_term with [:safe] prevents code execution
- Update existing tests to use tmp_dir for isolation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Canary and blue_green deploys now honestly report :progressing instead
of :completed. A per-rollout Monitor polls the Kulta Rollout CRD and
supports manual promote/rollback from all surfaces.

- Add :progressing status to Result, add Result.progressing/5
- Add Progressive.Rollout struct for rollout lifecycle tracking
- Add patch_resource to K8s Behaviour for promote/rollback
- Implement Progressive.Monitor GenServer (poll/promote/rollback)
- Add Progressive.Supervisor (DynamicSupervisor) to OTP tree
- Integrate progressive flow into Deploy pipeline
- Add promote/rollback to Surface, CLI, MCP, and HTTP
- MCP now exposes 8 tools (added nopea_promote, nopea_rollback)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a shared Nopea.Surface façade so CLI/MCP/HTTP surfaces delegate to consistent domain logic, adds on-disk persistence for the Memory graph, and implements progressive delivery monitoring for canary/blue-green deploys.

Changes:

  • Add Nopea.Surface and update CLI/MCP/HTTP routes to delegate to it (plus new surface commands/tools/endpoints).
  • Persist Memory graph to disk (.nopea/graph.etf) with a restore chain (Cache snapshot → disk → fresh graph).
  • Implement progressive delivery rollout monitoring (DynamicSupervisor + per-rollout GenServer) and return :progressing for canary/blue-green deploys.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/nopea/surface_test.exs Adds tests for the new Nopea.Surface façade behaviors.
test/nopea/progressive/monitor_test.exs Exercises progressive rollout monitoring, promote/rollback, polling, and list_active.
test/nopea/memory_test.exs Updates Memory tests to isolate persistence via tmp_dir workdir.
test/nopea/memory_persistence_test.exs Adds coverage for disk persistence/restore and corrupt-file handling.
test/nopea/memory/query_test.exs Minor assertion adjustments for non-empty results.
test/nopea/mcp_test.exs Updates MCP tool list expectations to 8 tools.
test/nopea/deploy_test.exs Updates canary/blue_green expectations to :progressing.
test/nopea/deploy_progressive_test.exs Adds integration-style tests for progressive deploy behavior and monitor startup.
test/nopea/application_test.exs Adjusts Memory startup tests to use tmp_dir workdir.
lib/nopea/surface.ex New shared façade for status/context/history/explain/health/services/promote/rollback.
lib/nopea/progressive/supervisor.ex Adds DynamicSupervisor to manage rollout monitor processes.
lib/nopea/progressive/rollout.ex Introduces rollout state struct/type for progressive delivery tracking.
lib/nopea/progressive/monitor.ex Adds per-rollout GenServer polling Kulta Rollout CRD and supporting promote/rollback.
lib/nopea/occurrence.ex Refactors persistence flow for ETF writing in occurrences.
lib/nopea/memory.ex Adds disk persistence + restore for Memory graph; supports workdir: and terminate persistence.
lib/nopea/mcp.ex Adds new tools (services/promote/rollback) and delegates tool logic to Nopea.Surface.
lib/nopea/k8s/behaviour.ex Extends K8s behaviour with patch_resource/5.
lib/nopea/k8s.ex Implements patch_resource/5 using K8s.Client.patch.
lib/nopea/events/emitter.ex Refactors retry failure handling into a helper function.
lib/nopea/deploy/result.ex Adds :progressing status and a constructor for progressive deploy results.
lib/nopea/deploy.ex Returns :progressing for canary/blue_green, starts monitor, and emits progressing logs.
lib/nopea/cli.ex Adds/rewires CLI commands and delegates to Nopea.Surface (incl. mcp).
lib/nopea/application.ex Adds progressive supervisor to the supervision tree (gated with deploy supervisor flag).
lib/nopea/api/router.ex Adds new HTTP endpoints and routes deploy through Deploy.deploy/1; delegates to Nopea.Surface.
config/config.exs Expands allowed logger metadata keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/nopea/deploy.ex Outdated
{:ok, {applied, :progressing}} ->
duration_ms = duration_ms(start_time)
result = Result.progressing(deploy_id, spec, strategy, applied, duration_ms)
record_outcome(result, context)
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a progressive strategy returns :progressing, this branch still calls record_outcome/2, which writes status: :progressing into Memory + Cache and generates an occurrence. The monitor later records a terminal outcome only to Memory, so Cache (and /api/status, services, etc.) can remain stuck at :progressing and Memory analytics will double-count (progressing + completed/failed) for a single deploy.

Consider not calling record_outcome/2 here (or make it an explicit “in-progress” record that doesn’t overwrite service_state), and instead update Cache/service state + occurrence when the rollout reaches a terminal phase.

Suggested change
record_outcome(result, context)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d0ea1e — removed record_outcome(result, context) from the :progressing branch. The Monitor now owns the full lifecycle: it records the outcome (and updates Cache) when the rollout reaches a terminal phase.

Comment on lines +270 to +282
defp record_outcome(rollout) do
if Process.whereis(Nopea.Memory) do
status = if rollout.phase in [:completed, :promoted], do: :completed, else: :failed

Nopea.Memory.record_deploy(%{
service: rollout.service,
namespace: rollout.namespace,
status: status,
error: if(status == :failed, do: :rollout_failed),
duration_ms: DateTime.diff(rollout.updated_at, rollout.started_at, :millisecond),
concurrent_deploys: []
})
end
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record_outcome/1 only writes the terminal result to Memory. Since Deploy.run/1 stores the initial :progressing state in Cache/service_state, the cache-backed surfaces can remain permanently :progressing unless the monitor also updates Cache (and the cached deployment record) on completion/failure/timeout.

Consider mirroring the Deploy.record_outcome cache update logic here (or factoring it into a shared helper) so terminal rollout outcomes are reflected across all surfaces.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d0ea1erecord_outcome/1 in Monitor now updates Cache with the terminal status via Nopea.Cache.put_service_state/2, matching the pattern used in Deploy.record_outcome/2.

Comment thread lib/nopea/memory.ex Outdated
Comment on lines +165 to +174

with :ok <- File.mkdir_p(Path.dirname(path)) do
binary = @graph_version <> :erlang.term_to_binary(graph)
File.write(path, binary)
end
rescue
error ->
Logger.warning("Failed to persist graph to disk",
error: inspect(error)
)
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

persist_to_disk/2 doesn’t log or handle File.write/2 failures (it only rescues exceptions). If File.write/2 returns {:error, reason}, persistence silently fails and callers ignore the returned error, risking undetected data loss.

Handle the File.write/2 return value (e.g., log {:error, reason}) so operational failures are visible.

Suggested change
with :ok <- File.mkdir_p(Path.dirname(path)) do
binary = @graph_version <> :erlang.term_to_binary(graph)
File.write(path, binary)
end
rescue
error ->
Logger.warning("Failed to persist graph to disk",
error: inspect(error)
)
binary = @graph_version <> :erlang.term_to_binary(graph)
with :ok <- File.mkdir_p(Path.dirname(path)),
:ok <- File.write(path, binary) do
:ok
else
{:error, reason} ->
Logger.warning("Failed to persist graph to disk",
error: inspect(reason)
)
{:error, reason}
end
rescue
error ->
Logger.warning("Failed to persist graph to disk",
error: inspect(error)
)
{:error, error}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d0ea1eFile.write/2 is now chained into the with clause, and failures are logged via Logger.warning with the error reason. The rescue clause also logs instead of silently swallowing.

Comment thread lib/nopea/api/router.ex
Comment on lines +75 to +88
post "/api/promote/:deploy_id" do
case Nopea.Surface.promote(deploy_id) do
{:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
{:error, :not_found} -> json(conn, 404, %{error: "not_found"})
{:error, reason} -> json(conn, 500, %{error: inspect(reason)})
end
end

post "/api/rollback/:deploy_id" do
case Nopea.Surface.rollback(deploy_id) do
{:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
{:error, :not_found} -> json(conn, 404, %{error: "not_found"})
{:error, reason} -> json(conn, 500, %{error: inspect(reason)})
end
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These endpoints return 500 with inspect(reason) in the JSON body. This can leak internal details (atoms, exception messages, potentially upstream responses) to API clients.

Prefer logging reason server-side and returning a stable, client-safe error payload (similar to handle_deploy/1’s Internal server error).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 3d0ea1e — error responses now return a generic "Internal server error" message. The actual reason is logged server-side via Logger.error with the deploy_id for correlation.

Comment thread lib/nopea/api/router.ex
Comment on lines +75 to +88
post "/api/promote/:deploy_id" do
case Nopea.Surface.promote(deploy_id) do
{:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
{:error, :not_found} -> json(conn, 404, %{error: "not_found"})
{:error, reason} -> json(conn, 500, %{error: inspect(reason)})
end
end

post "/api/rollback/:deploy_id" do
case Nopea.Surface.rollback(deploy_id) do
{:ok, rollout} -> json(conn, 200, Map.from_struct(rollout))
{:error, :not_found} -> json(conn, 404, %{error: "not_found"})
{:error, reason} -> json(conn, 500, %{error: inspect(reason)})
end
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The POST /api/promote/:deploy_id and POST /api/rollback/:deploy_id routes perform privileged rollout operations without any authentication or authorization checks. If the Nopea HTTP API is reachable from an untrusted network, an attacker could arbitrarily promote or roll back progressive rollouts and thus modify the state of the Kubernetes cluster. Add strong authentication/authorization in front of these endpoints (or ensure they are only reachable behind an authenticated proxy) so that only authorized operators can trigger rollout changes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — auth is needed before production use. This PR focuses on the core progressive delivery mechanics. Auth middleware (API key or mTLS) is a follow-up concern that should apply uniformly to all mutating endpoints, not just promote/rollback. Will track separately.

…s, sanitize HTTP errors

- Remove record_outcome from :progressing branch to prevent Cache
  getting stuck at :progressing status
- Add Cache update to Monitor.record_outcome so final rollout status
  is persisted
- Chain File.write into with clause in persist_to_disk with proper
  error logging
- Replace inspect(reason) in HTTP error responses with generic
  "Internal server error" and server-side Logger.error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yairfalse yairfalse merged commit 1ded75d into main Mar 10, 2026
0 of 2 checks passed
@yairfalse yairfalse deleted the feat/surface-persistence-progressive branch March 10, 2026 23:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants