Merged
17 changes: 17 additions & 0 deletions AGENTS.md
@@ -239,6 +239,23 @@ pins, or verification scope, update the corresponding `gov-infra` docs and/or ve
Do not weaken standards silently just to get green checks. If a verifier or rubric requirement needs to change, treat
that as a policy change: update the rubric/docs/evidence surface intentionally and keep the change explicit.

## Consumer release verification standard

When `lesser-host` consumes release assets from another repo (`lesser`, `lesser-body`, or future managed deploy
dependencies), no contract or rollout-readiness approval is valid until the exact published release assets have been
verified through the real consumer path.

Minimum bar:

- download the exact published release assets that `lesser-host` will consume
- verify every manifest, checksum, path, and schema field that the managed runner enforces
- run the real `lesser-host` ingestion/deploy path against those exact published artifacts whenever practical, rather
  than relying only on repo-local fixtures, source review, or synthetic tests
- treat source-level or fixture-only validation as insufficient for consumer signoff

If those steps have not happened yet, do not claim the producer contract is ready, complete, or safe for managed
rollout.

## Testing and linting

Go:
46 changes: 46 additions & 0 deletions docs/managed-update-recovery.md
@@ -50,6 +50,50 @@ Each job response includes the fields needed to diagnose and safely retry:
Portal and operator UIs should surface those fields directly instead of inferring recovery state from raw DynamoDB
records.

## Body update failure classes

For `body_only` update jobs, portal and operator surfaces should treat `error_code` as the canonical classifier instead
of trying to infer failure type from free-form text.

- `body_release_preflight_failed`
  - Meaning: the requested `lesser-body` release was rejected before a CodeBuild runner started. Typical causes are
    missing release assets, checksum coverage gaps, or manifest-contract mismatches.
  - Expected evidence:
    - `failed_phase=body`
    - `body_status=failed`
    - `body_error` and `error_message` begin with `lesser-body release preflight failed:`
    - `run_id`, `run_url`, and `body_run_url` are usually empty because no runner was launched
  - Recovery: publish or select a release whose assets satisfy the managed consumer contract, then submit a fresh
    `POST /api/v1/portal/instances/{slug}/updates` request.

- `body_deploy_failed`
  - Meaning: the `RUN_MODE=lesser-body` CodeBuild runner started and reached a terminal failure state.
  - Expected evidence:
    - `failed_phase=body`
    - `body_status=failed`
    - `body_error` and `error_message` describe the sanitized terminal runner failure
    - `body_run_url` and `run_url` preserve the CodeBuild deep link when one was observed
  - Recovery: inspect the preserved CodeBuild link, fix the release or configuration problem, and submit a new update
    job. No DynamoDB edits are required.

- `body_receipt_load_failed`
  - Meaning: the body phase reached receipt ingest, but `lesser-host` could not load the expected
    `managed/updates/<slug>/<jobId>/body-state.json` receipt after exhausting retries.
  - Expected evidence:
    - `failed_phase=body`
    - `body_status=failed`
    - `body_error` and `error_message` begin with `failed to load lesser-body receipt:`
    - `body_run_url` and `run_url` may still be present because the runner itself can have completed successfully before
      receipt ingestion failed
  - Recovery: restore the missing or malformed receipt condition, then submit a fresh retry through the normal portal
    update API.

These distinctions are intentional:

- preflight rejection means the release contract was blocked before AWS-side execution
- deploy failure means the runner itself failed and CodeBuild evidence should exist
- receipt-ingest failure means the runner may have succeeded, but durable deploy evidence could not be loaded afterward
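
The three classes above can be sketched as a lookup keyed on `error_code`, which is roughly what a portal or operator surface would need in order to avoid parsing free-form text. This is a minimal illustration only: the `bodyFailure` type and `classifyBodyFailure` function are hypothetical names and are not part of `lesser-host`.

```go
package main

import "fmt"

// bodyFailure is a hypothetical view model for one body_only failure class.
type bodyFailure struct {
	RunnerStarted bool   // true when a CodeBuild runner was launched
	Recovery      string // operator-facing recovery hint
}

// classifyBodyFailure maps a documented error_code to its failure class.
// The second return value is false for codes this sketch does not know.
func classifyBodyFailure(errorCode string) (bodyFailure, bool) {
	switch errorCode {
	case "body_release_preflight_failed":
		return bodyFailure{
			RunnerStarted: false,
			Recovery:      "publish or select a release that satisfies the managed consumer contract, then submit a fresh update",
		}, true
	case "body_deploy_failed":
		return bodyFailure{
			RunnerStarted: true,
			Recovery:      "inspect the preserved CodeBuild link, fix the release or configuration problem, then submit a new update",
		}, true
	case "body_receipt_load_failed":
		return bodyFailure{
			RunnerStarted: true,
			Recovery:      "restore the missing or malformed receipt, then submit a fresh update",
		}, true
	default:
		return bodyFailure{}, false
	}
}

func main() {
	codes := []string{
		"body_release_preflight_failed",
		"body_deploy_failed",
		"body_receipt_load_failed",
	}
	for _, code := range codes {
		f, _ := classifyBodyFailure(code)
		fmt.Printf("%s: runner started=%t; recovery: %s\n", code, f.RunnerStarted, f.Recovery)
	}
}
```

Returning an explicit `false` for unknown codes lets a UI fall back to showing the raw `error_message` instead of guessing a class.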

## Canonical retry workflow

1. Inspect the most recent update job with `GET /api/v1/portal/instances/{slug}/updates`.
@@ -76,6 +120,8 @@ Retry safety rules:
  that stale marker back to `error`.
- If a CodeBuild runner disappears before the worker records a terminal result, the update worker now reconciles that as a
  terminal error instead of leaving the job permanently in `deploy.wait`.
- The same sweep path reconciles terminal `body.deploy.wait` failures and preserves the best available CodeBuild deep
  link in `run_url` and `body_run_url`.

The supported recovery path is therefore:

13 changes: 13 additions & 0 deletions docs/portal.md
@@ -76,6 +76,19 @@ Update jobs return the failure and recovery fields operators need directly from
The canonical retry/recovery contract for failed release-driven managed updates is documented in
`docs/managed-update-recovery.md`.

For `body_only` updates, portal and operator surfaces should present the `error_code` distinctions directly:

- `body_release_preflight_failed`: the requested `lesser-body` release was rejected before CodeBuild started. Expect
  `failed_phase=body`, `body_status=failed`, and usually no `body_run_url`.
- `body_deploy_failed`: the `lesser-body` runner reached a terminal CodeBuild failure. Expect `failed_phase=body`,
  `body_status=failed`, and a preserved `body_run_url` / `run_url` when the runner deep link was observed.
- `body_receipt_load_failed`: the body runner may have completed, but receipt ingest failed afterward. Expect
  `failed_phase=body`, `body_status=failed`, and a `body_error` / `error_message` that names receipt loading.

Portal and operator UIs should use those fields to explain recovery steps instead of asking operators to inspect raw
table state. Once the latest job is terminal `error`, the supported recovery path is simply to correct the underlying
release or receipt problem and submit a fresh `POST /api/v1/portal/instances/{slug}/updates` request.

### Budgets

- `GET /api/v1/portal/instances/{slug}/budgets` (list)
55 changes: 55 additions & 0 deletions internal/controlplane/handlers_portal_updates_internal_test.go
@@ -226,6 +226,61 @@ func TestHandlePortalCreateInstanceUpdateJob_AllowsRetryAfterFailedUpdate(t *tes
	require.Equal(t, "v1.2.6", parsed.LesserVersion)
}

func TestHandlePortalCreateInstanceUpdateJob_AllowsRetryAfterFailedBodyUpdate(t *testing.T) {
	t.Parallel()

	tdb := newPortalTestDB()
	qUpdate := new(ttmocks.MockQuery)
	tdb.db.On("Model", mock.AnythingOfType("*models.UpdateJob")).Return(qUpdate).Maybe()
	addStandardMockQueryStubs(qUpdate)

	s := &Server{
		cfg: config.Config{
			Stage:                   "lab",
			WebAuthnRPID:            "example.com",
			ManagedInstanceRoleName: "role",
		},
		store: store.New(tdb.db),
	}

	ctx := &apptheory.Context{AuthIdentity: "alice", RequestID: "rid"}
	ctx.Params = map[string]string{"slug": testPortalInstanceSlugDemo}
	body, _ := json.Marshal(createUpdateJobRequest{BodyOnly: true, LesserBodyVersion: "v0.2.3"})
	ctx.Request.Body = body

	tdb.qInstance.On("First", mock.AnythingOfType("*models.Instance")).Return(nil).Run(func(args mock.Arguments) {
		dest := testutil.RequireMockArg[*models.Instance](t, args, 0)
		*dest = models.Instance{
			Slug:                   testPortalInstanceSlugDemo,
			Owner:                  "alice",
			HostedAccountID:        "123",
			HostedRegion:           "us-east-1",
			HostedBaseDomain:       "demo.example.com",
			LesserVersion:          "v1.2.6",
			LesserBodyVersion:      "v0.2.3",
			BodyEnabled:            nil,
			UpdateStatus:           models.UpdateJobStatusError,
			UpdateJobID:            "job-failed-body",
			LesserBodyUpdateStatus: models.UpdateJobStatusError,
			LesserBodyUpdateJobID:  "job-failed-body",
		}
	}).Once()
	expectEmptyUpdateJobHistory(t, qUpdate)

	resp, err := s.handlePortalCreateInstanceUpdateJob(ctx)
	require.NoError(t, err)
	require.Equal(t, 202, resp.Status)

	var parsed updateJobResponse
	require.NoError(t, json.Unmarshal(resp.Body, &parsed))
	require.Equal(t, updateJobKindBody, parsed.Kind)
	require.NotEmpty(t, parsed.ID)
	require.NotEqual(t, "job-failed-body", parsed.ID)
	require.Equal(t, models.UpdateJobStatusQueued, parsed.Status)
	require.Equal(t, "v0.2.3", parsed.LesserBodyVersion)
	require.True(t, parsed.BodyOnly)
}

func TestHandlePortalCreateInstanceUpdateJob_DoesNotDefaultLesserBodyVersionForLesserUpdates(t *testing.T) {
	t.Parallel()

182 changes: 182 additions & 0 deletions internal/provisionworker/release_preflight_internal_test.go
@@ -81,6 +81,44 @@ func newHappyManagedLesserReleaseClient(t *testing.T, versions ...string) *http.
	}))
}

func newHappyManagedLesserBodyReleaseClient(t *testing.T, stage string, versions ...string) *http.Client {
	t.Helper()

	if len(versions) == 0 {
		versions = []string{"v0.2.3"}
	}
	stage = normalizeManagedLesserStage(stage)
	if stage == "" {
		stage = managedStageDev
	}

	responses := map[string][]byte{}
	for _, version := range versions {
		version = strings.TrimSpace(version)
		if version == "" {
			continue
		}
		releasePath := fmt.Sprintf("/equaltoai/lesser-body/releases/download/%s/lesser-body-release.json", version)
		checksumsPath := fmt.Sprintf("/equaltoai/lesser-body/releases/download/%s/checksums.txt", version)
		responses[releasePath] = lesserBodyReleaseManifestJSON(t, version, stage)
		responses[checksumsPath] = lesserBodyChecksumsTXT(stage, true)
	}

	return newManagedReleaseTestClient(t, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, ok := responses[r.URL.Path]
		if !ok {
			http.NotFound(w, r)
			return
		}
		if strings.HasSuffix(r.URL.Path, ".json") {
			w.Header().Set("Content-Type", "application/json")
		} else {
			w.Header().Set("Content-Type", "text/plain")
		}
		_, _ = w.Write(body)
	}))
}

func lesserReleaseManifestJSON(t *testing.T, version string) []byte {
	t.Helper()

@@ -125,6 +163,73 @@ func lesserBundleManifestJSON(t *testing.T) []byte {
	return raw
}

func lesserBodyReleaseManifestJSON(t *testing.T, version string, stage string) []byte {
	t.Helper()

	if strings.TrimSpace(stage) == "" {
		stage = managedStageDev
	}
	templatePath := fmt.Sprintf("lesser-body-managed-%s.template.json", stage)
	raw, err := json.Marshal(map[string]any{
		"schema":  1,
		"name":    "lesser-body",
		"version": version,
		"git_sha": "bodysha",
		"artifacts": map[string]any{
			"checksums": map[string]any{
				"path":      "checksums.txt",
				"algorithm": "sha256",
			},
			"lambda_zip": map[string]any{
				"path":   "lesser-body.zip",
				"sha256": "zip-sha",
			},
			"deploy_manifest": map[string]any{
				"path":   "lesser-body-deploy.json",
				"sha256": "manifest-sha",
				"schema": 1,
			},
			"deploy_templates": map[string]any{
				stage: map[string]any{
					"path":   templatePath,
					"sha256": "template-sha",
					"format": "cloudformation-json",
				},
			},
			"deploy_script": map[string]any{
				"path":   "deploy-lesser-body-from-release.sh",
				"sha256": "script-sha",
			},
		},
		"deploy": map[string]any{
			"schema":                   1,
			"manifest_path":            "lesser-body-deploy.json",
			"template_selection":       "by_stage",
			"source_checkout_required": false,
			"npm_install_required":     false,
		},
	})
	require.NoError(t, err)
	return raw
}

func lesserBodyChecksumsTXT(stage string, includeReleaseChecksum bool) []byte {
	if strings.TrimSpace(stage) == "" {
		stage = managedStageDev
	}
	templatePath := fmt.Sprintf("lesser-body-managed-%s.template.json", stage)
	lines := []string{
		"zip-sha lesser-body.zip",
		"manifest-sha lesser-body-deploy.json",
		fmt.Sprintf("template-sha %s", templatePath),
		"script-sha deploy-lesser-body-from-release.sh",
	}
	if includeReleaseChecksum {
		lines = append(lines, "release-sha lesser-body-release.json")
	}
	return []byte(strings.Join(lines, "\n") + "\n")
}

func TestPreflightManagedLesserRelease_ValidatesReleaseAndBundleManifest(t *testing.T) {
	t.Parallel()

@@ -150,6 +255,33 @@ func TestPreflightManagedLesserRelease_ValidatesReleaseAndBundleManifest(t *test
	require.NoError(t, srv.preflightManagedLesserRelease(context.Background(), version))
}

func TestPreflightManagedLesserBodyRelease_ValidatesReleaseManifestAndChecksums(t *testing.T) {
	t.Parallel()

	const version = "v0.2.3"
	const stage = managedStageDev
	handler := http.NewServeMux()
	handler.HandleFunc("/equaltoai/lesser-body/releases/download/"+version+"/lesser-body-release.json", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(lesserBodyReleaseManifestJSON(t, version, stage))
	})
	handler.HandleFunc("/equaltoai/lesser-body/releases/download/"+version+"/checksums.txt", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		_, _ = w.Write(lesserBodyChecksumsTXT(stage, true))
	})

	srv := &Server{
		cfg: config.Config{
			Stage:                        "lab",
			ManagedLesserBodyGitHubOwner: "equaltoai",
			ManagedLesserBodyGitHubRepo:  "lesser-body",
		},
		releaseHTTPClient: newManagedReleaseTestClient(t, handler),
	}

	require.NoError(t, srv.preflightManagedLesserBodyRelease(context.Background(), version, stage))
}

func TestAdvanceUpdateDeployReleasePreflightFailureFailsBeforeRunnerStarts(t *testing.T) {
	t.Parallel()

@@ -211,6 +343,56 @@ func TestAdvanceUpdateDeployReleasePreflightFailureFailsBeforeRunnerStarts(t *te
	}
}

func TestAdvanceUpdateBodyReleasePreflightFailureFailsBeforeRunnerStarts(t *testing.T) {
	t.Parallel()

	st, db := newBranchTestStore()
	mockBranchInstanceLookup(t, db, managedUpdateRunnerInstance(), nil)

	fakeCB := &fakeCodebuild{
		startOut: &codebuild.StartBuildOutput{
			Build: &cbtypes.Build{Id: aws.String("run-should-not-start")},
		},
	}
	const version = "v0.2.2"
	handler := http.NewServeMux()
	handler.HandleFunc("/equaltoai/lesser-body/releases/download/"+version+"/lesser-body-release.json", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write(lesserBodyReleaseManifestJSON(t, version, managedStageDev))
	})
	handler.HandleFunc("/equaltoai/lesser-body/releases/download/"+version+"/checksums.txt", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		_, _ = w.Write(lesserBodyChecksumsTXT(managedStageDev, false))
	})

	srv := &Server{
		cfg: config.Config{
			Stage:                        "lab",
			ManagedLesserBodyGitHubOwner: "equaltoai",
			ManagedLesserBodyGitHubRepo:  "lesser-body",
		},
		store:             st,
		releaseHTTPClient: newManagedReleaseTestClient(t, handler),
		cb:                fakeCB,
	}

	job := managedUpdateRunnerJob(updateStepBodyDeployStart)
	job.LesserBodyVersion = version
	delay, done, err := srv.advanceUpdateBodyDeployStart(context.Background(), job, "req", time.Unix(1, 0).UTC())
	require.NoError(t, err)
	require.False(t, done)
	require.Zero(t, delay)
	require.Equal(t, models.UpdateJobStatusError, job.Status)
	require.Equal(t, updateStepFailed, job.Step)
	require.Equal(t, "body_release_preflight_failed", job.ErrorCode)
	require.Equal(t, updatePhaseBody, job.FailedPhase)
	require.Equal(t, updatePhaseStatusFailed, job.BodyStatus)
	require.Contains(t, job.ErrorMessage, "lesser-body release preflight failed")
	require.Contains(t, job.BodyError, "checksum entry missing for lesser-body-release.json")
	require.Empty(t, job.RunID)
	require.Empty(t, fakeCB.startInputs)
}

func TestValidateManagedLesserLambdaBundleManifest_RequiresFileInventory(t *testing.T) {
	t.Parallel()