fix: rollback restores configuration snapshot alongside application image (#4260) by neuralmint · Pull Request #4 · neuralmint/AgentOrchestration

neuralmint · 2026-05-25T11:25:15Z

Summary

Fixes bounty #4260 — deployment rollback now restores both the application image and the configuration snapshot that was recorded alongside that release.

Problem

When a release is rolled back, only the application image is restored. Configuration maps or feature flags that changed after the release remain, causing the previous application version to start with incompatible settings.

Fix

- ReleaseManager (src/deploy/release.py): each release records both an image_digest and a deep-copied config_snapshot. Configuration is frozen at creation time, so subsequent mutations don't affect recorded releases.
- Rollback restores the paired configuration snapshot alongside the image digest.
- Post-rollback verification checks internal consistency (valid image, valid config dict, valid version).
- CLI updates:
  - ao deploy --image-digest X manifest.json: records a release with the manifest as config snapshot.
  - ao release list: lists all releases with image digest and config key count.
  - ao release show <version>: shows full release details.
  - ao release rollback --confirm <version>: restores image AND config from target release.
  - ao --release-db <path>: optional persistent JSON-backed release store.
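
The pairing described above can be illustrated with a minimal in-memory sketch. It is not the code in src/deploy/release.py: the method names `record`, `rollback`, and `verify`, and the `active_image` / `active_config` attributes, are assumptions made for illustration. Only `ReleaseManager`, `Release`, `image_digest`, and `config_snapshot` come from this description.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Release:
    version: str
    image_digest: str
    config_snapshot: dict


class ReleaseManager:
    """Illustrative stand-in; method and attribute names are hypothetical."""

    def __init__(self) -> None:
        self._releases = {}
        self.active_image: Optional[str] = None
        self.active_config: dict = {}

    def record(self, version: str, image_digest: str, config: dict) -> Release:
        # Deep-copy so later mutations of `config` cannot leak into the record.
        release = Release(version, image_digest, copy.deepcopy(config))
        self._releases[version] = release
        self.active_image = image_digest
        self.active_config = copy.deepcopy(config)
        return release

    def rollback(self, version: str) -> Release:
        if version not in self._releases:
            raise KeyError(f"unknown release: {version}")
        release = self._releases[version]
        # Restore the image AND the configuration recorded with it.
        self.active_image = release.image_digest
        self.active_config = copy.deepcopy(release.config_snapshot)
        self.verify(release)
        return release

    def verify(self, release: Release) -> None:
        # Structural checks only: valid version, valid image, valid config dict.
        if not release.version:
            raise ValueError("release has no version")
        if not isinstance(self.active_image, str) or not self.active_image:
            raise ValueError("active image digest is missing")
        if not isinstance(self.active_config, dict):
            raise ValueError("active config is not a dict")


manager = ReleaseManager()
config = {"feature_flags": {"new_checkout": False}}
manager.record("1.0.0", "sha256:aaa", config)
config["feature_flags"]["new_checkout"] = True  # mutation after the release
manager.record("1.1.0", "sha256:bbb", config)
manager.rollback("1.0.0")
print(manager.active_image, manager.active_config)
```

Because the snapshot is deep-copied at record time, the rollback to 1.0.0 brings back the flag value that release was created with, not the value set afterwards.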

Acceptance Criteria

Each release records the image and configuration digests used.
Rollback restores the matching configuration snapshot.
Post-rollback verification checks structural consistency.
35 new tests covering core logic, rollback, serialization, edge cases, and CLI integration.
All 67 existing non-asyncio tests continue to pass.

Test Output (35 new tests)

35 passed in 0.11s

Fixes #4260

Bounty #3947: Bound retry metadata growth on repeated failures.

Changes:
- Added MAX_RETRY_METADATA hard cap (100) to prevent unbounded retry counter growth.
- Added dead_letter store for permanently failed tasks.
- fail() now enforces the repeated-failures invariant before re-enqueueing: tasks that exceed max_retries or the hard cap go to dead letter instead.
- enqueue() rejects tasks past the hard cap and returns None for the caller to handle.
- Added preserve_retries parameter to enqueue() so retry metadata is preserved during re-enqueue (idempotent retry path).
- Scheduled task promotion (dequeue) also respects the hard cap.
- Added detailed logging for all retry/rejection decisions.
- Backward compatible: default max_retries remains 3; existing callers unaffected.
- Regression tests cover: repeated-failures trigger, metadata bound, idempotent fail, dead-letter isolation, exhausted enqueue rejection.
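
A rough sketch of the retry bound this commit describes, under stated assumptions: the `TaskQueue` class, the `retries` key on task dicts, and the return values of `fail()` are hypothetical. `MAX_RETRY_METADATA`, `dead_letter`, `max_retries`, `enqueue()`, `fail()`, and `preserve_retries` are the names given in the commit message, but their real signatures may differ.

```python
import logging
from collections import deque
from typing import Optional

logger = logging.getLogger("retry_queue_sketch")

MAX_RETRY_METADATA = 100  # hard cap named in the commit message


class TaskQueue:
    """Hypothetical queue; only the retry-bounding logic is sketched."""

    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries
        self.pending = deque()
        self.dead_letter = []

    def enqueue(self, task: dict, preserve_retries: bool = False) -> Optional[dict]:
        # Assumption: the cap is checked on the incoming retry count.
        if task.get("retries", 0) > MAX_RETRY_METADATA:
            logger.warning("rejecting task %s: past hard cap", task.get("id"))
            return None
        if not preserve_retries:
            task["retries"] = 0  # a fresh enqueue starts from zero
        self.pending.append(task)
        return task

    def dequeue(self) -> Optional[dict]:
        return self.pending.popleft() if self.pending else None

    def fail(self, task: dict) -> bool:
        """Return True if the task was re-enqueued, False if dead-lettered."""
        retries = task.get("retries", 0) + 1
        if retries > self.max_retries or retries > MAX_RETRY_METADATA:
            # Never store a counter above the hard cap.
            task["retries"] = min(retries, MAX_RETRY_METADATA)
            self.dead_letter.append(task)
            logger.warning("task %s moved to dead letter", task.get("id"))
            return False
        task["retries"] = retries
        return self.enqueue(task, preserve_retries=True) is not None


queue = TaskQueue(max_retries=3)
queue.enqueue({"id": "t1"})
outcomes = []
while (task := queue.dequeue()) is not None:
    outcomes.append(queue.fail(task))
print(outcomes, len(queue.dead_letter))
```

With the default of three retries, a task that keeps failing is re-enqueued three times and dead-lettered on the fourth failure, so the retry counter cannot grow without bound.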

Add an atomic state precondition in the scheduler dequeue path to reject tasks whose associated run has been deleted. This prevents stale, duplicate, or policy-violating transitions when a workflow is removed concurrently with run materialization.

Changes:
- Add tracking set with method
- Add precondition in (rejects both queued and scheduled tasks for deleted runs)
- Bounded audit metadata via structured logging (warn-level with run_id and task_id context)
- Fix pre-existing bug: dict now stores task dicts alongside timestamps so data is not lost during promotion
- Wire up WorkflowManager in OrchestrationEngine for future mark_run_deleted integration
- Add 5 deterministic regression tests covering:
  * Dequeue rejection for deleted runs
  * Scheduled task skip for deleted runs
  * Idempotent mark_run_deleted
  * Normal unaffected workflows
  * Isolated deletion between concurrent runs

Closes #3977
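
One way such a dequeue precondition could look, as a sketch only. The commit message lost its identifiers in extraction, so everything here except `mark_run_deleted` is a hypothetical name (`Scheduler`, `_deleted_runs`, `_scheduled`, `schedule`, the `run_id` / `task_id` keys). A single lock is assumed as the means of making the check atomic with the dequeue.

```python
import logging
import threading
import time
from collections import deque
from typing import Optional

logger = logging.getLogger("scheduler_sketch")


class Scheduler:
    """Hypothetical scheduler; only the deleted-run precondition is sketched."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queue = deque()
        self._scheduled = {}  # task_id -> (run_at, task dict)
        self._deleted_runs = set()

    def mark_run_deleted(self, run_id: str) -> None:
        with self._lock:
            self._deleted_runs.add(run_id)  # set.add is naturally idempotent

    def enqueue(self, task: dict) -> None:
        with self._lock:
            self._queue.append(task)

    def schedule(self, task: dict, run_at: float) -> None:
        with self._lock:
            # Store the task dict next to its timestamp so promotion keeps it.
            self._scheduled[task["task_id"]] = (run_at, task)

    def _rejected(self, task: dict) -> bool:
        if task["run_id"] in self._deleted_runs:
            logger.warning(
                "skipping task for deleted run",
                extra={"run_id": task["run_id"], "task_id": task["task_id"]},
            )
            return True
        return False

    def dequeue(self, now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        with self._lock:  # check and pop happen under the same lock
            due = [tid for tid, (run_at, _) in self._scheduled.items() if run_at <= now]
            for tid in due:
                _, task = self._scheduled.pop(tid)
                if not self._rejected(task):
                    self._queue.append(task)
            while self._queue:
                task = self._queue.popleft()
                if not self._rejected(task):
                    return task
            return None


scheduler = Scheduler()
scheduler.enqueue({"task_id": "a", "run_id": "run-1"})
scheduler.enqueue({"task_id": "b", "run_id": "run-2"})
scheduler.mark_run_deleted("run-1")
print(scheduler.dequeue())
```

Holding the lock across both the membership check and the pop is what keeps a concurrent `mark_run_deleted` from slipping in between them; tasks for other runs are unaffected.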

Adds a data lake governance module that enforces purpose limitation on ingestion writes. Every data lake write now requires purpose metadata (purpose, data class, owner, destination) and is blocked when the destination is not approved for that data class.

New components:
- DataClassificationRegistry: registers data classes with approved destinations; supports wildcard (all destinations) via empty set
- PurposeMetadata: declares purpose, data class, owner, destination
- IngestionManifest: full manifest for data lake writes
- DataLakeGovernor: validates manifests, enforces policy, records audit log with grouping by purpose and owner
- Custom errors: MissingPurposeMetadataError, DataClassNotRegisteredError, DestinationNotApprovedError

All 19 new tests pass. Existing test suite unaffected.

Closes #3998
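
The policy check can be sketched roughly as follows. The class and error names are taken from the commit message, but the method names (`register`, `is_approved`, `authorize_write`), the field names, and the shape of the audit log are assumptions. IngestionManifest and the grouping by purpose and owner are left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class MissingPurposeMetadataError(Exception):
    pass


class DataClassNotRegisteredError(Exception):
    pass


class DestinationNotApprovedError(Exception):
    pass


@dataclass(frozen=True)
class PurposeMetadata:
    purpose: str
    data_class: str
    owner: str
    destination: str


class DataClassificationRegistry:
    def __init__(self) -> None:
        self._approved = {}

    def register(self, data_class: str, destinations: Iterable[str] = ()) -> None:
        # An empty set acts as a wildcard: every destination is approved.
        self._approved[data_class] = frozenset(destinations)

    def is_approved(self, data_class: str, destination: str) -> bool:
        if data_class not in self._approved:
            raise DataClassNotRegisteredError(data_class)
        approved = self._approved[data_class]
        return not approved or destination in approved


class DataLakeGovernor:
    def __init__(self, registry: DataClassificationRegistry) -> None:
        self.registry = registry
        self.audit_log = []

    def authorize_write(self, metadata: Optional[PurposeMetadata]) -> None:
        # Hypothetical entry point: raises on any policy violation.
        if metadata is None or not all(
            (metadata.purpose, metadata.data_class, metadata.owner, metadata.destination)
        ):
            raise MissingPurposeMetadataError("purpose metadata is incomplete")
        if not self.registry.is_approved(metadata.data_class, metadata.destination):
            raise DestinationNotApprovedError(
                f"{metadata.destination} is not approved for {metadata.data_class}"
            )
        self.audit_log.append(metadata)


registry = DataClassificationRegistry()
registry.register("pii", {"lake/secure"})
registry.register("public")  # wildcard
governor = DataLakeGovernor(registry)
governor.authorize_write(PurposeMetadata("billing", "pii", "team-a", "lake/secure"))
governor.authorize_write(PurposeMetadata("analytics", "public", "team-b", "lake/any"))
print(len(governor.audit_log))
```

Raising distinct exception types for missing metadata, an unregistered class, and an unapproved destination lets callers tell a malformed request apart from a policy rejection.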

- Add release workflow (release.yml) that:
  - Triggers on version tags (v*)
  - Builds packages with uv build
  - Generates build provenance attestation via actions/attest-build-provenance
  - Creates GitHub Releases with attested artifacts
  - Publishes to PyPI with attestation support
- Add artifact verification section to README with gh CLI instructions

The attestation includes source repository, commit SHA, workflow run, and artifact digest, enabling consumers to verify artifact provenance.

Closes #4050

…tadata

Closes #4088

Multi-stage Dockerfile isolates all build-time-only ARG declarations (BUILD_ENV, PIP_INDEX_URL, UV_VERSION) inside the builder stage. The final runtime stage inherits zero build-time ARGs, preventing leakage into image history, labels, or environment variables.

Changes:
- Dockerfile: two-stage build (builder → final), ARGs only in builder
- .dockerignore: exclude dev/CI artifacts from build context
- infra/docker-compose.yml: pass args only to builder stage
- infra/scripts/audit_image_metadata.sh: CI audit for leaked metadata
- .github/workflows/ci.yml: add docker-build-and-audit job
- Makefile: docker-audit / docker-build-slim targets

…es not support ARG expansion)

Docker's COPY --from= instruction does not support variable expansion for image references. The previous approach used:

    COPY --from=ghcr.io/astral-sh/uv:${UV_VERSION} /uv /usr/local/bin/uv

which fails at build time with:

    'variable expansion is not supported for --from'

Fix: create a dedicated uv-image stage using FROM with the ARG, then COPY --from=uv-image using a static stage name. This is the documented Docker workaround for this limitation.

Also moved UV_VERSION ARG to global scope (before first FROM) so it's available to the uv-image FROM line, and removed it from the builder stage since it's no longer consumed there.

…mage

Bounty #4260: Deployment rollback now restores BOTH image and configuration, preventing incompatible settings at startup.

Changes:
- Add src/deploy/ (ReleaseManager, Release dataclass): records image digest and a deep-copied config snapshot per release.
- Rollback restores the paired config snapshot, not just the image.
- Post-rollback verification checks internal consistency.
- CLI gains `release list`, `release show`, `release rollback` subcmds.
- `deploy` command now records release metadata at deploy time.
- 35 tests covering core logic, rollback, serialization, edge cases, and CLI integration.

Fixes #4260

neuralmint and others added 7 commits May 24, 2026 23:51

neuralmint wants to merge 7 commits into main from fix/release-rollback-paired-config-4260
