fix: rollback restores configuration snapshot alongside application image (#4260) #4

Open
neuralmint wants to merge 7 commits into main from fix/release-rollback-paired-config-4260

Conversation

@neuralmint
Owner

Summary

Fixes bounty #4260 — deployment rollback now restores both the application image and the configuration snapshot that was recorded alongside that release.

Problem

When a release is rolled back, only the application image is restored. Configuration maps or feature flags that changed after the release remain, causing the previous application version to start with incompatible settings.

Fix

  • ReleaseManager (src/deploy/release.py): Each release records both an image_digest and a deep-copied config_snapshot. Configuration is frozen at creation time so subsequent mutations don't affect recorded releases.
  • Rollback restores the paired configuration snapshot alongside the image digest.
  • Post-rollback verification checks internal consistency (valid image, valid config dict, valid version).
  • CLI updates:
    • ao deploy --image-digest X manifest.json — records a release with the manifest as config snapshot.
    • ao release list — lists all releases with image digest and config key count.
    • ao release show <version> — shows full release details.
    • ao release rollback --confirm <version> — restores image AND config from target release.
    • ao --release-db <path> — optional persistent JSON-backed release store.
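The pairing described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in src/deploy/release.py: the method names `record`, `rollback`, and `verify`, the `active_image` and `active_config` attributes, and the exact verification rules are assumptions made for the example.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Release:
    """One recorded release: an image digest paired with its config."""
    version: str
    image_digest: str
    config_snapshot: Dict[str, Any]


class ReleaseManager:
    """Hypothetical in-memory manager; the PR's real class may differ."""

    def __init__(self) -> None:
        self._releases: Dict[str, Release] = {}
        self.active_image: Optional[str] = None
        self.active_config: Dict[str, Any] = {}

    def record(self, version: str, image_digest: str,
               config: Dict[str, Any]) -> Release:
        # Deep-copy so later mutations of `config` cannot alter the record.
        release = Release(version, image_digest, copy.deepcopy(config))
        self._releases[version] = release
        self.active_image = image_digest
        self.active_config = copy.deepcopy(config)
        return release

    def rollback(self, version: str) -> Release:
        release = self._releases[version]  # KeyError for unknown versions
        self.verify(release)
        # Restore the image AND its paired configuration together.
        self.active_image = release.image_digest
        self.active_config = copy.deepcopy(release.config_snapshot)
        return release

    @staticmethod
    def verify(release: Release) -> None:
        # Structural checks only (assumed rules): non-empty version,
        # non-empty image digest, and a dict-shaped config snapshot.
        if not isinstance(release.version, str) or not release.version:
            raise ValueError("release has no valid version")
        if not isinstance(release.image_digest, str) or not release.image_digest:
            raise ValueError("release has no valid image digest")
        if not isinstance(release.config_snapshot, dict):
            raise ValueError("release config snapshot is not a dict")
```

In this sketch, recording `v1`, mutating the caller's config dict, recording `v2`, and then calling `rollback("v1")` leaves `active_config` equal to the settings that were live when `v1` was recorded, because each snapshot is copied on the way in and on the way out.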

Acceptance Criteria

  • Each release records the image and configuration digests used.
  • Rollback restores the matching configuration snapshot.
  • Post-rollback verification checks structural consistency.
  • 35 new tests covering core logic, rollback, serialization, edge cases, and CLI integration.
  • All 67 existing non-asyncio tests continue to pass.

Test Output (35 new tests)

35 passed in 0.11s

Fixes #4260

neuralmint and others added 7 commits May 24, 2026 23:51
Bounty #3947 — Bound retry metadata growth on repeated failures.

Changes:
- Added MAX_RETRY_METADATA hard cap (100) to prevent unbounded retry
  counter growth.
- Added dead_letter store for permanently failed tasks.
- fail() now enforces the repeated-failures invariant before re-enqueueing:
  tasks that exceed max_retries or the hard cap go to dead letter instead.
- enqueue() rejects tasks past the hard cap and returns None for the
  caller to handle.
- Added preserve_retries parameter to enqueue() so retry metadata is
  preserved during re-enqueue (idempotent retry path).
- Scheduled task promotion (dequeue) also respects the hard cap.
- Added detailed logging for all retry/rejection decisions.
- Backward compatible: default max_retries remains 3; existing callers
  unaffected.
- Regression tests cover: repeated-failures trigger, metadata bound,
  idempotent fail, dead-letter isolation, exhausted enqueue rejection.
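
The retry bound in this commit might look something like the sketch below. Only `MAX_RETRY_METADATA`, `fail()`, `enqueue()`, `preserve_retries`, `max_retries`, and the dead-letter store come from the commit message; the `TaskQueue` class name, the dict-shaped tasks, the `retries` key, and the exact comparison operators are assumptions.

```python
import logging
from collections import deque
from typing import Any, Deque, Dict, List, Optional

logger = logging.getLogger(__name__)

MAX_RETRY_METADATA = 100  # hard cap named in the commit message

Task = Dict[str, Any]


class TaskQueue:
    """Hypothetical queue; only the retry-bounding logic is sketched."""

    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries
        self.pending: Deque[Task] = deque()
        self.dead_letter: List[Task] = []

    def enqueue(self, task: Task, preserve_retries: bool = False) -> Optional[Task]:
        incoming = task.get("retries", 0)
        if incoming > MAX_RETRY_METADATA:
            # Past the hard cap: reject and let the caller decide.
            logger.warning("rejecting task %s: retries=%s", task.get("id"), incoming)
            return None
        queued = dict(task)
        # Assumed behaviour: a fresh enqueue resets the counter, while the
        # re-enqueue path keeps it.
        queued["retries"] = incoming if preserve_retries else 0
        self.pending.append(queued)
        return queued

    def fail(self, task: Task) -> Optional[Task]:
        # Clamp the counter so the stored metadata can never exceed the cap.
        retries = min(task.get("retries", 0) + 1, MAX_RETRY_METADATA)
        failed = {**task, "retries": retries}
        if retries > self.max_retries or retries >= MAX_RETRY_METADATA:
            logger.warning("dead-lettering task %s after %s retries",
                           task.get("id"), retries)
            self.dead_letter.append(failed)
            return None
        return self.enqueue(failed, preserve_retries=True)
```

With the default `max_retries` of 3, a task in this sketch is re-enqueued on its first three failures and moved to `dead_letter` on the fourth.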

Add an atomic state precondition in the scheduler dequeue path to
reject tasks whose associated run has been deleted. This prevents
stale, duplicate, or policy-violating transitions when a workflow
is removed concurrently with run materialization.

Changes:
- Add  tracking set with  method
- Add  precondition in  — rejects
  both queued and scheduled tasks for deleted runs
- Bounded audit metadata via structured logging (warn-level with
  run_id and task_id context)
- Fix pre-existing bug:  dict now stores task dicts
  alongside timestamps so data is not lost during promotion
- Wire up WorkflowManager in OrchestrationEngine for future
  mark_run_deleted integration
- Add 5 deterministic regression tests covering:
  * Dequeue rejection for deleted runs
  * Scheduled task skip for deleted runs
  * Idempotent mark_run_deleted
  * Normal unaffected workflows
  * Isolated deletion between concurrent runs

Closes #3977
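
As a rough illustration of the dequeue precondition, the sketch below checks a deleted-run set under the same lock that guards the queue, so the check and the pop happen atomically. `mark_run_deleted` is named in the commit message; the `Scheduler` class, the `_deleted_runs` set, and the task dict shape are hypothetical, and the scheduled-task promotion path is omitted.

```python
import logging
import threading
from collections import deque
from typing import Any, Deque, Dict, Optional, Set

logger = logging.getLogger(__name__)

Task = Dict[str, Any]


class Scheduler:
    """Hypothetical scheduler showing only the deleted-run check."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queue: Deque[Task] = deque()
        self._deleted_runs: Set[str] = set()  # assumed name for the tracking set

    def mark_run_deleted(self, run_id: str) -> None:
        with self._lock:
            # Adding to a set is idempotent, so repeated calls are harmless.
            self._deleted_runs.add(run_id)

    def enqueue(self, task: Task) -> None:
        with self._lock:
            self._queue.append(task)

    def dequeue(self) -> Optional[Task]:
        # The membership check and the pop share one lock, so a run cannot
        # be deleted between the check and the hand-off.
        with self._lock:
            while self._queue:
                task = self._queue.popleft()
                if task["run_id"] in self._deleted_runs:
                    logger.warning(
                        "rejecting task for deleted run",
                        extra={"run_id": task["run_id"], "task_id": task["task_id"]},
                    )
                    continue
                return task
            return None
```

Tasks for other runs are unaffected: only entries whose `run_id` is in the deleted set are dropped, and each drop emits a single warning rather than accumulating state.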

Adds a data lake governance module that enforces purpose limitation on
ingestion writes. Every data lake write now requires purpose metadata
(purpose, data class, owner, destination) and is blocked when the
destination is not approved for that data class.

New components:
- DataClassificationRegistry: registers data classes with approved
  destinations; supports wildcard (all destinations) via empty set
- PurposeMetadata: declares purpose, data class, owner, destination
- IngestionManifest: full manifest for data lake writes
- DataLakeGovernor: validates manifests, enforces policy, records
  audit log with grouping by purpose and owner
- Custom errors: MissingPurposeMetadataError, DataClassNotRegisteredError,
  DestinationNotApprovedError

All 19 new tests pass. Existing test suite unaffected.

Closes #3998
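
The policy check could be sketched along these lines. The class and error names are taken from the commit message, but every method name (`register`, `is_approved`, `authorize`), the field names, and the `audit_log` attribute are assumptions, and `IngestionManifest` is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional


class MissingPurposeMetadataError(Exception):
    pass


class DataClassNotRegisteredError(Exception):
    pass


class DestinationNotApprovedError(Exception):
    pass


@dataclass(frozen=True)
class PurposeMetadata:
    purpose: str
    data_class: str
    owner: str
    destination: str


class DataClassificationRegistry:
    def __init__(self) -> None:
        self._approved: Dict[str, FrozenSet[str]] = {}

    def register(self, data_class: str, destinations: Iterable[str] = ()) -> None:
        # An empty set is the wildcard: every destination is approved.
        self._approved[data_class] = frozenset(destinations)

    def is_approved(self, data_class: str, destination: str) -> bool:
        if data_class not in self._approved:
            raise DataClassNotRegisteredError(data_class)
        approved = self._approved[data_class]
        return not approved or destination in approved


class DataLakeGovernor:
    def __init__(self, registry: DataClassificationRegistry) -> None:
        self._registry = registry
        self.audit_log: List[PurposeMetadata] = []

    def authorize(self, metadata: Optional[PurposeMetadata]) -> None:
        # Every write must declare all four fields before policy is checked.
        if metadata is None or not all(
            (metadata.purpose, metadata.data_class, metadata.owner, metadata.destination)
        ):
            raise MissingPurposeMetadataError("purpose metadata is incomplete")
        if not self._registry.is_approved(metadata.data_class, metadata.destination):
            raise DestinationNotApprovedError(
                f"{metadata.destination} is not approved for {metadata.data_class}"
            )
        self.audit_log.append(metadata)
```

In this sketch a write is recorded in the audit log only after it passes both checks, so blocked writes never appear there; whether the real module also logs rejections is not stated in the commit message.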

- Add release workflow (release.yml) that:
  - Triggers on version tags (v*)
  - Builds packages with uv build
  - Generates build provenance attestation via actions/attest-build-provenance
  - Creates GitHub Releases with attested artifacts
  - Publishes to PyPI with attestation support
- Add artifact verification section to README with gh CLI instructions

The attestation includes source repository, commit SHA, workflow run,
and artifact digest — enabling consumers to verify artifact provenance.

Closes #4050

…tadata

Closes #4088

Multi-stage Dockerfile isolates all build-time-only ARG declarations
(BUILD_ENV, PIP_INDEX_URL, UV_VERSION) inside the builder stage.
The final runtime stage inherits zero build-time ARGs, preventing
leakage into image history, labels, or environment variables.

Changes:
- Dockerfile: two-stage build (builder → final), ARGs only in builder
- .dockerignore: exclude dev/CI artifacts from build context
- infra/docker-compose.yml: pass args only to builder stage
- infra/scripts/audit_image_metadata.sh: CI audit for leaked metadata
- .github/workflows/ci.yml: add docker-build-and-audit job
- Makefile: docker-audit / docker-build-slim targets

…es not support ARG expansion)

Docker's COPY --from= instruction does not support variable expansion for
image references. The previous approach used:
  COPY --from=ghcr.io/astral-sh/uv:${UV_VERSION} /uv /usr/local/bin/uv
which fails at build time with:
  'variable expansion is not supported for --from'

Fix: create a dedicated uv-image stage using FROM with the ARG, then
COPY --from=uv-image using a static stage name. This is the documented
Docker workaround for this limitation.

Also moved UV_VERSION ARG to global scope (before first FROM) so it's
available to the uv-image FROM line, and removed it from the builder
stage since it's no longer consumed there.

…mage

Bounty #4260 — Deployment rollback now restores BOTH image and
configuration, preventing incompatible settings at startup.

Changes:
- Add src/deploy/ (ReleaseManager, Release dataclass) — records image
  digest and a deep-copied config snapshot per release.
- Rollback restores the paired config snapshot, not just the image.
- Post-rollback verification checks internal consistency.
- CLI gains `release list`, `release show`, `release rollback` subcmds.
- `deploy` command now records release metadata at deploy time.
- 35 tests covering core logic, rollback, serialization, edge cases,
  and CLI integration.

Fixes #4260
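
For the optional JSON-backed release store, one possible shape is a plain round trip through a list of objects. The functions `save_releases` and `load_releases`, the file layout, and the write-then-rename step are all assumptions for illustration; the PR does not describe the on-disk format.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class Release:
    version: str
    image_digest: str
    config_snapshot: Dict[str, Any]


def save_releases(path: Path, releases: List[Release]) -> None:
    """Write all releases as a JSON array (hypothetical layout)."""
    payload = [asdict(release) for release in releases]
    # Write to a sibling temp file, then rename, so a crash mid-write
    # does not leave a truncated store behind.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2, sort_keys=True))
    tmp.replace(path)


def load_releases(path: Path) -> List[Release]:
    """Read releases back; a missing file means an empty store."""
    if not path.exists():
        return []
    return [Release(**item) for item in json.loads(path.read_text())]
```

A store like this only round-trips config snapshots that are JSON-serializable, which holds when the snapshot originates from a manifest.json file.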
Development

Successfully merging this pull request may close these issues.

[ Bounty $6k ] [ Deploy ] Roll back configuration with application version — release rollback
