diff --git a/.gitignore b/.gitignore index 391564d4..f6ed7534 100644 --- a/.gitignore +++ b/.gitignore @@ -21,4 +21,7 @@ testbin/* /website/public # /tmp -tmp/ \ No newline at end of file +tmp/ + +# coding agents +.pi \ No newline at end of file diff --git a/docs/specs/version-cleanup-without-prometheus/spec.md b/docs/specs/version-cleanup-without-prometheus/spec.md new file mode 100644 index 00000000..25ce2f85 --- /dev/null +++ b/docs/specs/version-cleanup-without-prometheus/spec.md @@ -0,0 +1,378 @@ +--- +workspace: ../../.. +planning: + model: hai-claude/anthropic--claude-4.7-opus + reviewer_model: hai-gemini/gemini-2.5-pro + max_rounds: 3 + timeout: 30m + image: alpine +execution: + model: hai-claude/anthropic--claude-4.6-sonnet + timeout: 8m + concurrency: 3 + image: go-alpine +correction: + max_retries: 2 + max_waves: 2 +acceptance: + model: hai-claude/anthropic--claude-4.7-opus + image: go-alpine + timeout: 15m +--- + +# Allow CAPApplicationVersion Cleanup Without a Prometheus Connection + +The CAP Operator controller currently refuses to start the `CAPApplicationVersion` +cleanup routines unless the `PROMETHEUS_ADDRESS` environment variable is set +*and* a Prometheus client can be initialised against it (a `Runtimeinfo` probe +must succeed). This makes Prometheus a hard dependency for an entirely +metrics-free use case: an application whose `CAPApplicationVersion` workloads +declare no `monitoring.deletionRules` would still like older, unused versions +to be cleaned up automatically once a newer version becomes `Ready`. + +This change decouples the cleanup loop from Prometheus availability: + +- The cleanup routines start unconditionally. +- Versions whose deployment workloads declare *no* deletion rules continue to + be eligible for cleanup as today, regardless of whether Prometheus is + reachable. +- Versions that *do* declare deletion rules are only evaluated when a working + Prometheus client is available; otherwise they are skipped (left for the + next cycle) with a clear log message and *not* deleted. + +The annotation contract on `CAPApplication` (`sme.sap.com/enable-cleanup-monitoring`) +is unchanged. Defaulting that annotation, or adding an opt-out value, is +explicitly out of scope here and tracked separately. + +## Background — current behaviour + +All paths below are inside the controller package and refer to the existing +implementation that this change is modifying. + +`internal/controller/version-monitoring.go`: + +- `parseMonitoringEnv()` returns `nil` when `PROMETHEUS_ADDRESS` is empty or + whitespace. +- `(*Controller).startVersionCleanup(ctx)` short-circuits with `return` when + `parseMonitoringEnv()` returned `nil`. The schedule and worker goroutines + are never started. +- `initializeVersionCleanupOrchestrator(ctx, mEnv)` constructs a Prometheus + client and validates it with `v1api.Runtimeinfo(ctx)`. Either failure + returns `nil`, and the surrounding loop in `startVersionCleanup` retries + after `acquireClientRetryDelay` (default `1h`). +- The `cleanupOrchestrator` struct stores the `promv1.API` once at startup; + the `api` value is then read by every downstream evaluator and is assumed + to be non-nil. +- `evaluateWorkloadForCleanup(ctx, cav, wl, promapi)` is the per-workload + evaluator. When `wl.DeploymentDefinition.Monitoring == nil` *or* + `Monitoring.DeletionRules == nil`, the workload is auto-eligible without + ever touching the Prometheus client; otherwise it calls + `evaluateExpression` or `evaluateMetric`, both of which dereference the + `promv1.API`. + +`internal/controller/controller.go` (around line 233): + +- Calls `c.startVersionCleanup(qCxt)` from the controller's main goroutine + group, wrapped in `wg.Go(...)`. + +`internal/controller/reconcile.go` (line 57): + +- Defines `AnnotationEnableCleanupMonitoring = "sme.sap.com/enable-cleanup-monitoring"`. +- `queueVersionsForCleanupEvaluation` already filters CAPApplications by this + annotation (only `"true"` and `"dry-run"` enable the loop) and excludes + versions linked to tenants and versions newer than the latest `Ready` + version. None of that filtering needs to change. + +Documentation that mentions the old behaviour: + +- `website/content/en/docs/usage/version-monitoring.md` — section + "Integration with Prometheus" claims "If no Prometheus address is supplied, + the version monitoring function is not started." +- `website/content/en/docs/configuration/_index.md` — the `PROMETHEUS_ADDRESS` + bullet says "If not set, the version monitoring function is not started." + +Tests in `internal/controller/version-monitoring_test.go` that hard-code the +"empty address ⇒ nil monitoring env / no routines" assumption: + +- `TestMonitoringEnv` — first sub-case asserts `mEnv == nil` when the env var + is unset. +- `TestGracefulShutdownMonitoringRoutines` — sets `PROMETHEUS_ADDRESS` before + starting the cleanup loop. +- `Test_initializeVersionCleanupOrchestrator` — passes a non-empty address. + +## Required behaviour + +### Routine startup + +`startVersionCleanup(ctx)` must always start the schedule and worker +goroutines, regardless of `PROMETHEUS_ADDRESS`: + +- `parseMonitoringEnv()` must always return a non-nil `*monitoringEnv`. + - When `PROMETHEUS_ADDRESS` is unset/blank, `address` is the empty string. + The other fields (`acquireClientRetryDelay`, `evaluationInterval`) are + populated using the same env-var lookup with the same defaults + (`1h` and `10m`) as today. + - When `PROMETHEUS_ADDRESS` is set, behaviour is unchanged. +- `startVersionCleanup` must no longer `return` early based on the + monitoring env. The schedule + worker goroutines must run for the lifetime + of `ctx`, with the same panic-recovery + `restartSignal` semantics as + today. +- The Prometheus *client* must no longer block routine startup. The current + `setup()` retry loop (which sleeps `acquireClientRetryDelay` while the + Prom server is unavailable) must not be on the critical path for starting + the worker. The orchestrator must be constructable without a working + Prometheus connection. + +### Prometheus client lifecycle + +Introduce a way to obtain the current Prometheus API client lazily, so that +each evaluation cycle reflects the latest reachability: + +- When `address` is empty, the provider always reports "unavailable"; no + client is ever constructed and no `Runtimeinfo` probe is made. +- When `address` is non-empty: + - A client is constructed; if construction fails, the provider reports + "unavailable" for this cycle and logs the error at most once per + transition from available → unavailable (avoid log spam every cycle). + - The client's reachability is verified via `Runtimeinfo` at most once per + `acquireClientRetryDelay` while it is unavailable, *not* on every + queueing iteration. Once a probe succeeds, the working client is cached + and reused; subsequent query failures during evaluation do not by + themselves invalidate the cache (those are already handled per-query + today). + - On a successful probe transitioning from unavailable → available, log + an informational message. + +Implementation freedom: this can be a `promClientProvider` interface, a +field on `cleanupOrchestrator` mutated under a mutex, or any equivalent +design. The only external contract is "give me the current API or tell me +it's unavailable, without doing a network call on every evaluation". + +### Per-version evaluation + +`evaluateVersionForCleanup` and `evaluateWorkloadForCleanup` must accept the +case where the Prometheus API is currently unavailable: + +- For each deployment workload of the version: + - Workloads whose `deploymentDefinition.monitoring` is nil, or whose + `deploymentDefinition.monitoring.deletionRules` is nil, remain + auto-eligible. No Prometheus access is required or attempted. + - Workloads with non-nil `deletionRules`: + - If the Prometheus API is available: evaluate as today (expression or + metric rules) and use the result. + - If the Prometheus API is unavailable: the workload is *not* eligible + for cleanup. Log a single informational/warning entry per evaluation + attempt naming the version, the workload, and the reason + ("prometheus client unavailable"). Do not return an error from + `evaluateVersionForCleanup`; the version simply stays in the system + and will be re-considered on the next cycle. +- The "all deployment workloads must be eligible for the version to be + cleaned up" rule is unchanged. +- The annotation-controlled deletion vs. dry-run decision in + `evaluateVersionForCleanup` (`AnnotationEnableCleanupMonitoring == "true"` + ⇒ delete; `"dry-run"` ⇒ event only) is unchanged. + +In particular, this means a `CAPApplicationVersion` whose deployment +workloads collectively declare *no* deletion rules can be cleaned up even +when no Prometheus is configured — that is the central use case being +unlocked. + +### Queue behaviour + +`queueVersionsForCleanupEvaluation` continues to: + +- Only consider `CAPApplication`s whose + `sme.sap.com/enable-cleanup-monitoring` annotation is `"true"` or + `"dry-run"` (case-insensitive). +- Exclude versions tied to tenants (`spec.version` or + `status.currentCAPApplicationVersionInstance`). +- Exclude versions whose `spec.version` is greater than the latest `Ready` + version (semver compare). +- Skip everything when no `Ready` version exists. + +These rules must still apply regardless of Prometheus availability. + +### Logging / observability + +- A clear `klog.InfoS` (or `klog.WarningS`) line when the cleanup loop is + starting without a configured Prometheus address, e.g. + "PROMETHEUS_ADDRESS is not set; only versions without deletion rules will + be cleaned up". +- A clear log line when a workload is skipped because the Prom client is + unavailable (one per workload per evaluation cycle is fine; do not flood + the log every queue iteration). +- Existing per-query error logs in `executePromQL`, `evaluateExpression`, + and `evaluateMetric` are kept. + +### Documentation updates + +`website/content/en/docs/usage/version-monitoring.md`: + +- The "Integration with Prometheus" paragraph must be rewritten to say: + Prometheus is only required to evaluate workloads with `deletionRules`. + When `PROMETHEUS_ADDRESS` is not set (or the configured server is + unreachable), the cleanup loop still runs; versions whose deployment + workloads have no `deletionRules` can be cleaned up, and versions with + `deletionRules` are skipped until Prometheus becomes available again. +- The "Evaluating CAPApplicationVersion Resources for Cleanup" section + must mention the same Prometheus-availability behaviour next to the + existing description of how workloads without `deletionRules` are + treated. + +`website/content/en/docs/configuration/_index.md`: + +- Update the `PROMETHEUS_ADDRESS` bullet to: "URL of the Prometheus server + for executing PromQL queries... If not set, the controller still runs + the version cleanup loop, but only versions whose workloads have no + `deletionRules` are eligible for cleanup." + +## Constraints + +- Go only; no new third-party dependencies. Reuse the existing + `github.com/prometheus/client_golang/api` package already in use. +- Public API of the `internal/controller` package (types and functions + exported from the package) must not be broken. Keep + `cleanupOrchestrator`, `monitoringEnv`, `initializeVersionCleanupOrchestrator`, + `startVersionCleanup`, `queueVersionsForCleanupEvaluation`, + `processVersionCleanupQueue`, and `processVersionCleanupQueueItem` + callable with their current signatures from tests, even if their + internals change. The `cleanupOrchestrator.api` field may be replaced + by an alternative client-provider field if you also update the tests + that touch it. +- The controller CRDs, generated client code under `pkg/client/...`, and + the API types in `pkg/apis/sme.sap.com/v1alpha1` must not change. +- The `sme.sap.com/enable-cleanup-monitoring` annotation contract is + unchanged. Do not introduce a new default, do not introduce new + accepted values, do not invert any condition. +- The Prometheus reachability probe must remain rate-limited. Do not call + `Runtimeinfo` (or any equivalent) on every iteration of the schedule + loop — at most once per `acquireClientRetryDelay`. +- Existing tests must continue to pass after their assertions about the + "empty address ⇒ nil mEnv / no routines" behaviour are updated to + reflect the new contract. No tests may be deleted; they are migrated. + +## Non-goals + +- Defaulting the `sme.sap.com/enable-cleanup-monitoring` annotation to + `"true"` or introducing an explicit opt-out value (e.g. `"false"`, + `"disabled"`). This is tracked separately and is out of scope here. +- Changing how individual PromQL queries are constructed or what + `Gauge`/`Counter`/`Expression` evaluation means. +- Changing how `ServiceMonitor` resources are created from + `monitoring.scrapeConfig`. +- Adding a Prometheus-availability indicator to any CR status field. +- Changing the controller's leader election, queue, or reconcile + scheduler behaviour for any other resource. + +## Tests + +All tests live in `internal/controller/version-monitoring_test.go` (plus +`testdata/version-monitoring/*.yaml`). The build/test commands the +planner should attach to validation tasks are: + +- `go build ./...` +- `go test ./internal/controller/...` +- `go test ./...` for a final wide check. + +Update existing tests as follows; they must keep passing. + +- `TestMonitoringEnv`: + - The case where `PROMETHEUS_ADDRESS` is unset must now assert that + `parseMonitoringEnv()` returns a non-nil `*monitoringEnv` whose + `address` is empty and whose `acquireClientRetryDelay` / + `evaluationInterval` carry their defaults (`1h`, `10m`). The cases + that set the address remain as today. +- `TestGracefulShutdownMonitoringRoutines`: + - Must additionally cover the case where `PROMETHEUS_ADDRESS` is *not* + set. `startVersionCleanup` must still launch the routines and they + must shut down cleanly when the context is cancelled. +- `Test_initializeVersionCleanupOrchestrator`: + - The "server unavailable" case must no longer rely on the orchestrator + being `nil`. The orchestrator must be constructable; the assertion + becomes "the embedded Prometheus client provider reports unavailable" + rather than "initialiser returned nil". If the test currently waits + for a context deadline as a proxy for "kept retrying", replace that + with a direct assertion against the provider. + - Add an explicit case: empty address ⇒ orchestrator constructs; + provider is permanently unavailable; no HTTP calls are made (the + test's mock server should record zero `runtimeinfo` hits). + +Add new test coverage: + +- A version-cleanup evaluation test where Prometheus is *not* configured + (empty address) and a CAV's deployment workloads all have no + `deletionRules`: the evaluator must mark the version eligible and the + CAV must be deleted (when the CAPApplication annotation is `"true"`). + Reuse the existing `setupTestControllerWithInitialResources` helper + and add a `cav-no-deletion-rules.yaml` (or equivalent) fixture under + `testdata/version-monitoring/`. +- A version-cleanup evaluation test where Prometheus is *not* configured + but the CAV has at least one workload with `deletionRules`: the + evaluator must *not* delete the CAV, must not panic, and must not + return an error that causes the queue to spin (no requeues for the + same item beyond the normal flow). The existing + `cav-v1-deletion-rules.yaml` fixture is suitable. +- A version-cleanup evaluation test where the address *is* configured + but the mock Prom server returns 503 from `runtimeinfo`: a CAV with + `deletionRules` is not deleted; a CAV with no `deletionRules` (in the + same evaluation cycle, separate fixture) *is* deleted. +- A test that the reachability probe is rate-limited: drive two + consecutive evaluation cycles closer together than + `acquireClientRetryDelay` and assert the mock Prom server records at + most one `runtimeinfo` request across both. Use a short + `acquireClientRetryDelay` plus a shorter test-window between cycles + to make this deterministic without `time.Sleep` blocking the test. + +`TestVersionSelectionForCleanup` and `TestVersionCleanupEvaluation` +existing scenarios should continue to pass unchanged once the production +code is updated. If the orchestrator's `api` field is renamed or +replaced, update the tests' construction of `cleanupOrchestrator` to +match — keep the same scenarios. + +## Acceptance + +The acceptor will run the merged code in the `go-alpine` image and verify +each of the following criteria. Any single failing criterion produces a +rejection naming that criterion, so the re-planner can target the gap. + +1. **Build is green.** `go build ./...` exits 0 from the repo root. +2. **Full test suite is green.** `go test ./...` exits 0 from the repo + root. +3. **Controller tests are green and include the new coverage.** + `go test ./internal/controller/...` exits 0, and the test binary + reports tests covering at least: + - cleanup loop startup with `PROMETHEUS_ADDRESS` unset (graceful + shutdown); + - `parseMonitoringEnv()` returning a non-nil env when the address is + unset; + - eligibility-based deletion of a CAV with no `deletionRules` when no + Prometheus client is available; + - skipping (no deletion, no panic, no error-driven requeue storm) of a + CAV with `deletionRules` when no Prometheus client is available; + - rate-limited reachability probing (at most one `runtimeinfo` request + per `acquireClientRetryDelay` window across consecutive cycles). +4. **Early-return is gone.** `internal/controller/version-monitoring.go` + no longer contains an `if mEnv == nil { return }` (or equivalent + early `return`) inside `startVersionCleanup`. The function always + reaches the goroutine-launching code path for any non-cancelled + context. +5. **Annotation contract is unchanged.** A `grep` for + `AnnotationEnableCleanupMonitoring` over `internal/controller/` + shows the same accepted values (`"true"`, `"dry-run"`, + case-insensitive) — no new accepted values, no inverted defaults, + no removal of the existing filter in + `queueVersionsForCleanupEvaluation`. +6. **API types and CRDs untouched.** `git diff` against the merge base + shows no changes under `pkg/apis/sme.sap.com/v1alpha1/`, + `pkg/client/`, or `crds/`. +7. **Documentation reflects the new behaviour.** + - `website/content/en/docs/configuration/_index.md` no longer claims + that version monitoring is disabled when `PROMETHEUS_ADDRESS` is + unset; instead, it states that the cleanup loop still runs and + describes which versions are eligible without Prometheus. + - `website/content/en/docs/usage/version-monitoring.md` ("Integration + with Prometheus" and "Evaluating CAPApplicationVersion Resources + for Cleanup" sections) describes the new contract: cleanup runs + unconditionally, workloads without `deletionRules` are eligible + without Prometheus, workloads with `deletionRules` are skipped + until Prometheus is reachable. diff --git a/internal/controller/testdata/version-monitoring/cav-v1-no-deletion-rules.yaml b/internal/controller/testdata/version-monitoring/cav-v1-no-deletion-rules.yaml new file mode 100644 index 00000000..160f92e8 --- /dev/null +++ b/internal/controller/testdata/version-monitoring/cav-v1-no-deletion-rules.yaml @@ -0,0 +1,63 @@ +apiVersion: sme.sap.com/v1alpha1 +kind: CAPApplicationVersion +metadata: + creationTimestamp: "2022-03-18T22:14:33Z" + generation: 1 + annotations: + sme.sap.com/btp-app-identifier: btp-glo-acc-id.test-cap-01 + sme.sap.com/owner-identifier: default.test-cap-01 + labels: + sme.sap.com/btp-app-identifier-hash: f20cc8aeb2003b3abc33f749a16bd53544b6bab2 + sme.sap.com/owner-generation: "2" + sme.sap.com/owner-identifier-hash: 1f74ae2fbff71a708786a4df4bb2ca87ec603581 + name: test-cap-01-cav-v1 + namespace: default + ownerReferences: + - apiVersion: sme.sap.com/v1alpha1 + blockOwnerDeletion: true + controller: true + kind: CAPApplication + name: test-cap-01 + uid: 3c7ba7cb-dc04-4fd1-be86-3eb3a5c64a98 + resourceVersion: "11371108" + uid: 5e64489b-7346-4984-8617-e8c37338b3d8 + finalizers: + - sme.sap.com/capapplicationversion +spec: + capApplicationInstance: test-cap-01 + registrySecrets: + - regcred + version: 5.6.7 + workloads: + - name: cap-backend-srv + consumedBTPServices: + - cap-uaa + - cap-service-manager + - cap-saas-registry + deploymentDefinition: + type: CAP + image: docker.image.repo/srv/server:latest + - name: content + consumedBTPServices: + - cap-uaa + jobDefinition: + type: Content + image: docker.image.repo/content/cap-content:latest + - name: mtx + consumedBTPServices: + - cap-uaa + - cap-service-manager + - cap-saas-registry + jobDefinition: + type: "TenantOperation" + image: docker.image.repo/srv/server:latest +status: + conditions: + - reason: WorkloadsReady + observedGeneration: 1 + status: "True" + type: Ready + observedGeneration: 1 + finishedJobs: + - test-cap-01-cav-v1-content + state: Ready diff --git a/internal/controller/version-monitoring.go b/internal/controller/version-monitoring.go index 4aa8ec69..82276516 100644 --- a/internal/controller/version-monitoring.go +++ b/internal/controller/version-monitoring.go @@ -10,6 +10,7 @@ import ( "fmt" "os" "strings" + "sync" "time" promapi "github.com/prometheus/client_golang/api" @@ -40,10 +41,98 @@ const ( CounterEvaluationExpression = "sum(rate(%s{job=\"%s\",namespace=\"%s\"}[%s]))" ) +// promClientProvider abstracts Prometheus API availability so that each +// evaluation cycle can obtain the current API client without making a network +// call on every iteration. +type promClientProvider interface { + // Get returns the current Prometheus API and true when available, or + // (nil, false) when the address is unset or the server is unreachable. + Get(ctx context.Context) (promv1.API, bool) +} + +// noopPromClientProvider is used when PROMETHEUS_ADDRESS is not set. +// It always reports the client as unavailable. +type noopPromClientProvider struct{} + +func (n *noopPromClientProvider) Get(_ context.Context) (promv1.API, bool) { + return nil, false +} + +// cachedPromClientProvider holds a cached Prometheus API client and manages +// reachability probing with a configurable delay between probes. +type cachedPromClientProvider struct { + address string + retryDelay time.Duration + + mu sync.Mutex + cached promv1.API // non-nil when the last probe succeeded + lastProbeAt time.Time // when the last probe (success or failure) was attempted + loggedError bool // true after we have logged the first construction/probe failure +} + +// Get returns the cached Prometheus API when available. When unavailable it +// probes at most once per retryDelay window. +func (p *cachedPromClientProvider) Get(ctx context.Context) (promv1.API, bool) { + p.mu.Lock() + defer p.mu.Unlock() + + // Fast path: already have a working client. + if p.cached != nil { + return p.cached, true + } + + // Only probe if enough time has passed since the last attempt. + if time.Since(p.lastProbeAt) < p.retryDelay { + return nil, false + } + + // Attempt to construct a client and verify reachability. + p.lastProbeAt = time.Now() + + client, err := promapi.NewClient(promapi.Config{Address: p.address}) + if err != nil { + if !p.loggedError { + klog.ErrorS(err, "could not create prometheus client", "address", p.address) + p.loggedError = true + } + return nil, false + } + + api := promv1.NewAPI(client) + _, err = api.Runtimeinfo(ctx) + if err != nil { + if !p.loggedError { + klog.ErrorS(err, "could not fetch runtime info from prometheus server", "address", p.address) + p.loggedError = true + } + return nil, false + } + + // Probe succeeded — cache the client and reset the error flag. + klog.InfoS("prometheus server is now reachable", "address", p.address) + p.cached = api + p.loggedError = false + return p.cached, true +} + +// newPromClientProvider creates the appropriate promClientProvider based on +// whether a Prometheus address is configured. +func newPromClientProvider(mEnv *monitoringEnv) promClientProvider { + if mEnv.address == "" { + return &noopPromClientProvider{} + } + return &cachedPromClientProvider{ + address: mEnv.address, + retryDelay: mEnv.acquireClientRetryDelay, + // lastProbeAt is zero so the first call triggers an immediate probe. + } +} + +// cleanupOrchestrator holds the state needed by the version-cleanup routines. type cleanupOrchestrator struct { - api promv1.API - queue workqueue.TypedRateLimitingInterface[NamespacedResourceKey] - mEnv *monitoringEnv + clientProvider promClientProvider + queue workqueue.TypedRateLimitingInterface[NamespacedResourceKey] + mEnv *monitoringEnv } type monitoringEnv struct { @@ -52,12 +141,14 @@ type monitoringEnv struct { evaluationInterval time.Duration } +// parseMonitoringEnv always returns a non-nil *monitoringEnv. +// When PROMETHEUS_ADDRESS is unset or whitespace-only, address is set to the +// empty string; the duration fields are still populated from their respective +// env vars (with the same defaults as before). func parseMonitoringEnv() *monitoringEnv { - promAdd := strings.TrimSpace(os.Getenv(EnvPrometheusAddress)) - if promAdd == "" { - return nil + env := &monitoringEnv{ + address: strings.TrimSpace(os.Getenv(EnvPrometheusAddress)), } - env := &monitoringEnv{address: promAdd} evalDurationEnv := func(envName string, fallback time.Duration) time.Duration { if v, ok := os.LookupEnv(envName); ok && strings.TrimSpace(v) != "" { @@ -73,39 +164,35 @@ func parseMonitoringEnv() *monitoringEnv { return env } +// startVersionCleanup starts the version-cleanup schedule and worker goroutines +// for the lifetime of ctx. It no longer short-circuits when there is no +// Prometheus address configured — workloads without deletionRules are still +// eligible for cleanup. func (c *Controller) startVersionCleanup(ctx context.Context) { mEnv := parseMonitoringEnv() - if mEnv == nil { - return // no prometheus address + + if mEnv.address == "" { + klog.InfoS("PROMETHEUS_ADDRESS is not set; only versions without deletion rules will be cleaned up") } restartSignal := make(chan bool, 1) setup := func() context.CancelFunc { - for { - o := initializeVersionCleanupOrchestrator(ctx, mEnv) - if o == nil { - select { - case <-ctx.Done(): - return nil - case <-time.After(mEnv.acquireClientRetryDelay): // sleep a long time before attempting to setup the cleanup process - continue - } - } - child, cancelFn := context.WithCancel(ctx) - go func() { - <-child.Done() - o.queue.ShutDown() - }() - go c.scheduleVersionCollectionForCleanup(child, o, restartSignal) - go c.processVersionCleanupQueue(child, o, restartSignal) - return cancelFn - } + o := initializeVersionCleanupOrchestrator(ctx, mEnv) + child, cancelFn := context.WithCancel(ctx) + go func() { + <-child.Done() + o.queue.ShutDown() + }() + go c.scheduleVersionCollectionForCleanup(child, o, restartSignal) + go c.processVersionCleanupQueue(child, o, restartSignal) + return cancelFn } for { cancel := setup() select { case <-ctx.Done(): + cancel() return case <-restartSignal: // restart broken routines cancel() @@ -124,24 +211,15 @@ func recoverVersionCleanupRoutine(restart chan<- bool) { } } +// initializeVersionCleanupOrchestrator always returns a non-nil +// *cleanupOrchestrator. It sets up the work queue and the Prometheus client +// provider but does NOT block on, or return nil because of, Prometheus +// reachability. func initializeVersionCleanupOrchestrator(ctx context.Context, mEnv *monitoringEnv) *cleanupOrchestrator { - promClient, err := promapi.NewClient(promapi.Config{Address: mEnv.address}) - if err != nil { - klog.ErrorS(err, "could not create client", "address", mEnv.address) - return nil - } - v1api := promv1.NewAPI(promClient) - _, err = v1api.Runtimeinfo(ctx) - if err != nil { - klog.ErrorS(err, "could not fetch runtime info from prometheus server", "address", mEnv.address) - return nil - } - - // create orchestrator return &cleanupOrchestrator{ - api: v1api, - queue: workqueue.NewTypedRateLimitingQueue(workqueue.DefaultTypedControllerRateLimiter[NamespacedResourceKey]()), - mEnv: mEnv, + clientProvider: newPromClientProvider(mEnv), + queue: workqueue.NewTypedRateLimitingQueue(workqueue.DefaultTypedControllerRateLimiter[NamespacedResourceKey]()), + mEnv: mEnv, } } @@ -246,6 +324,10 @@ func (c *Controller) processVersionCleanupQueue(ctx context.Context, orc *cleanu } } +// processVersionCleanupQueueItem dequeues one item and evaluates it for +// cleanup. It obtains the current Prometheus API from the provider; if the +// provider reports unavailable, nil is passed to the evaluator so that +// workloads without deletionRules can still be cleaned up. func (c *Controller) processVersionCleanupQueueItem(ctx context.Context, orc *cleanupOrchestrator) (stop bool) { item, shutdown := orc.queue.Get() if shutdown { @@ -253,7 +335,10 @@ func (c *Controller) processVersionCleanupQueueItem(ctx context.Context, orc *cl } defer orc.queue.Done(item) - if c.evaluateVersionForCleanup(ctx, item, orc.api) != nil { + // Obtain the current Prometheus API — may be nil if unavailable. + api, _ := orc.clientProvider.Get(ctx) + + if c.evaluateVersionForCleanup(ctx, item, api) != nil { orc.queue.AddRateLimited(item) } else { orc.queue.Forget(item) @@ -261,6 +346,11 @@ func (c *Controller) processVersionCleanupQueueItem(ctx context.Context, orc *cl return false } +// evaluateVersionForCleanup evaluates one CAPApplicationVersion for deletion +// eligibility. promapi may be nil when Prometheus is unavailable; in that +// case workloads that require deletionRules are considered not eligible (kept +// for re-evaluation next cycle) while workloads without deletionRules remain +// automatically eligible. func (c *Controller) evaluateVersionForCleanup(ctx context.Context, item NamespacedResourceKey, promapi promv1.API) error { lister := c.crdInformerFactory.Sme().V1alpha1().CAPApplicationVersions().Lister() cav, err := lister.CAPApplicationVersions(item.Namespace).Get(item.Name) @@ -295,11 +385,28 @@ func (c *Controller) evaluateVersionForCleanup(ctx context.Context, item Namespa return nil } +// evaluateWorkloadForCleanup decides whether a single workload is eligible for +// cleanup. +// +// - When the workload has no deletionRules, it is automatically eligible +// (returns true) regardless of Prometheus availability. +// - When deletionRules are present but promapi is nil (Prometheus +// unavailable), the workload is NOT eligible (returns false) and a +// single informational log entry is emitted. +// - When deletionRules are present and promapi is available, the rules are +// evaluated via the existing helpers. func evaluateWorkloadForCleanup(ctx context.Context, cav NamespacedResourceKey, wl *v1alpha1.WorkloadDetails, promapi promv1.API) bool { if wl.DeploymentDefinition == nil || wl.DeploymentDefinition.Monitoring == nil || wl.DeploymentDefinition.Monitoring.DeletionRules == nil { return true // if there are no rules - the workload is automatically eligible for cleanup } + // Deletion rules are present; we need Prometheus to evaluate them. + if promapi == nil { + klog.InfoS("prometheus client unavailable; workload with deletion rules cannot be evaluated and will be skipped", + "workload", wl.Name, "version", cav.Name) + return false + } + if wl.DeploymentDefinition.Monitoring.DeletionRules.ScalarExpression != nil { // evaluate provided expression isRelevantForCleanup, err := evaluateExpression(ctx, *wl.DeploymentDefinition.Monitoring.DeletionRules.ScalarExpression, promapi) if err != nil { diff --git a/internal/controller/version-monitoring_test.go b/internal/controller/version-monitoring_test.go index efc7e36b..b891b80c 100644 --- a/internal/controller/version-monitoring_test.go +++ b/internal/controller/version-monitoring_test.go @@ -14,6 +14,7 @@ import ( "net/http/httptest" "os" "sync" + "sync/atomic" "testing" "time" @@ -56,8 +57,19 @@ func TestMonitoringEnv(t *testing.T) { mEnv := parseMonitoringEnv() if tt.add == nil { - if mEnv != nil { - t.Errorf("did not expect monitoring environment") + // New contract: always returns non-nil *monitoringEnv with empty address. + if mEnv == nil { + t.Errorf("expected non-nil monitoringEnv even when PROMETHEUS_ADDRESS is unset") + return + } + if mEnv.address != "" { + t.Errorf("expected empty address when PROMETHEUS_ADDRESS is unset, got %q", mEnv.address) + } + if mEnv.acquireClientRetryDelay != time.Hour { + t.Errorf("expected default acquire client retry interval (1h), got %v", mEnv.acquireClientRetryDelay) + } + if mEnv.evaluationInterval != 10*time.Minute { + t.Errorf("expected default evaluation interval (10m), got %v", mEnv.evaluationInterval) } return } @@ -114,27 +126,44 @@ func setupTestControllerWithInitialResources(t *testing.T, initialResources []st } func TestGracefulShutdownMonitoringRoutines(t *testing.T) { - // Deregister metrics at the end of the test - defer deregisterMetrics() - - c := setupTestControllerWithInitialResources(t, []string{}) + t.Run("with PROMETHEUS_ADDRESS set", func(t *testing.T) { + defer deregisterMetrics() - s, _ := getPromServer(false, []queryTestCase{}) - defer s.Close() + c := setupTestControllerWithInitialResources(t, []string{}) - os.Setenv(EnvPrometheusAddress, s.URL) - defer os.Unsetenv(EnvPrometheusAddress) + s, _ := getPromServer(false, []queryTestCase{}) + defer s.Close() - testCtx, cancel := context.WithTimeout(context.Background(), time.Second) - defer cancel() + os.Setenv(EnvPrometheusAddress, s.URL) + defer os.Unsetenv(EnvPrometheusAddress) - var wg sync.WaitGroup + testCtx, cancel := context.WithTimeout(context.Background(), time.Second) + defer cancel() - wg.Go(func() { - c.startVersionCleanup(testCtx) + var wg sync.WaitGroup + wg.Go(func() { + c.startVersionCleanup(testCtx) + }) + wg.Wait() // goroutines must exit when context is cancelled — or the test times out }) - wg.Wait() // check whether routines are closing - or test timeout + t.Run("without PROMETHEUS_ADDRESS set", func(t *testing.T) { + defer deregisterMetrics() + + // Ensure the env var is not set. + os.Unsetenv(EnvPrometheusAddress) + + c := setupTestControllerWithInitialResources(t, []string{}) + + testCtx, cancel := context.WithTimeout(context.Background(), time.Second) + defer cancel() + + var wg sync.WaitGroup + wg.Go(func() { + c.startVersionCleanup(testCtx) + }) + wg.Wait() // goroutines must exit when context is cancelled — or the test times out + }) } func TestVersionSelectionForCleanup(t *testing.T) { @@ -368,6 +397,12 @@ func mockPromQueryHandler(testCases []queryTestCase, query string, w http.Respon ) } +// getPromServer starts a mock Prometheus HTTP server. +// When unavailable is true the /api/v1/status/runtimeinfo endpoint returns 503. +// The second return value is a function that returns the set of query strings +// that were received by the /api/v1/query endpoint; the third return value is a +// function that returns the total number of /api/v1/status/runtimeinfo requests +// that were received (useful for rate-limiting assertions). func getPromServer(unavailable bool, cases []queryTestCase) (*httptest.Server, func() map[string]bool) { calledQueries := map[string]bool{} server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { @@ -390,46 +425,108 @@ func getPromServer(unavailable bool, cases []queryTestCase) (*httptest.Server, f } } -func Test_initializeVersionCleanupOrchestrator(t *testing.T) { - tests := []struct { - name string - serverUnavailable bool - }{ - { - name: "initialize cleanup orchestrator and verify connection", - serverUnavailable: false, - }, - { - name: "ensure retry of cleanup orchestrator initialization", - serverUnavailable: true, - }, - } - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - s, _ := getPromServer(tt.serverUnavailable, []queryTestCase{}) - defer s.Close() - var o *cleanupOrchestrator - testCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second) - go func() { - o = initializeVersionCleanupOrchestrator(testCtx, &monitoringEnv{address: s.URL, evaluationInterval: 2 * time.Minute, acquireClientRetryDelay: 30 * time.Second}) - if o != nil { - cancel() - } - }() - <-testCtx.Done() - if tt.serverUnavailable { - if testCtx.Err() == nil || testCtx.Err() != context.DeadlineExceeded { - t.Error("expected to exceed test context deadline") - } - } else { - if o == nil { - t.Errorf("could not initialize prometheus client") - } - defer o.queue.ShutDown() +// getPromServerWithCounter is like getPromServer but also returns an atomic +// counter tracking how many times /api/v1/status/runtimeinfo was called. +func getPromServerWithCounter(unavailable bool, cases []queryTestCase) (*httptest.Server, func() map[string]bool, *atomic.Int64) { + calledQueries := map[string]bool{} + var runtimeInfoHits atomic.Int64 + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path == "/api/v1/status/runtimeinfo" { + runtimeInfoHits.Add(1) + mockPromRuntimeInfoHandler(unavailable, w) + return + } + if r.URL.Path == "/api/v1/query" { + q := r.FormValue("query") + if q != "" { + calledQueries[q] = false } + mockPromQueryHandler(cases, q, w) + return + } + w.WriteHeader(http.StatusInternalServerError) + })) + return server, func() map[string]bool { return calledQueries }, &runtimeInfoHits +} +func Test_initializeVersionCleanupOrchestrator(t *testing.T) { + t.Run("initialize cleanup orchestrator - server available", func(t *testing.T) { + s, _, hits := getPromServerWithCounter(false, []queryTestCase{}) + defer s.Close() + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: s.URL, + evaluationInterval: 2 * time.Minute, + acquireClientRetryDelay: 30 * time.Second, }) - } + + if o == nil { + t.Fatal("expected non-nil orchestrator") + } + defer o.queue.ShutDown() + + // Provider should report the API as available on first probe. + api, ok := o.clientProvider.Get(context.TODO()) + if !ok || api == nil { + t.Errorf("expected provider to report prometheus API as available") + } + if hits.Load() == 0 { + t.Errorf("expected at least one runtimeinfo probe to have been made") + } + }) + + t.Run("initialize cleanup orchestrator - server unavailable", func(t *testing.T) { + s, _, hits := getPromServerWithCounter(true, []queryTestCase{}) + defer s.Close() + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: s.URL, + evaluationInterval: 2 * time.Minute, + acquireClientRetryDelay: 30 * time.Second, + }) + + if o == nil { + t.Fatal("expected non-nil orchestrator even when server is unavailable") + } + defer o.queue.ShutDown() + + // Provider should immediately report unavailable (server returns 503). + api, ok := o.clientProvider.Get(context.TODO()) + if ok || api != nil { + t.Errorf("expected provider to report prometheus API as unavailable") + } + if hits.Load() == 0 { + t.Errorf("expected at least one runtimeinfo probe to have been attempted") + } + }) + + t.Run("initialize cleanup orchestrator - empty address", func(t *testing.T) { + // Start a server but do NOT pass its URL to the orchestrator. + s, _, hits := getPromServerWithCounter(false, []queryTestCase{}) + defer s.Close() + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: "", + evaluationInterval: 2 * time.Minute, + acquireClientRetryDelay: 30 * time.Second, + }) + + if o == nil { + t.Fatal("expected non-nil orchestrator when address is empty") + } + defer o.queue.ShutDown() + + // Provider must always report unavailable when address is empty. + api, ok := o.clientProvider.Get(context.TODO()) + if ok || api != nil { + t.Errorf("expected provider to report prometheus API as unavailable with empty address") + } + + // The mock server must have received zero runtimeinfo hits. + if hits.Load() != 0 { + t.Errorf("expected zero runtimeinfo hits when address is empty, got %d", hits.Load()) + } + }) } func TestVersionCleanupEvaluation(t *testing.T) { @@ -611,3 +708,184 @@ func TestVersionCleanupEvaluation(t *testing.T) { }) } } + +// TestVersionCleanupEvaluation_NoPrometheus_NoDeletionRules verifies that a +// CAV whose workloads have NO deletionRules is deleted even when Prometheus is +// unavailable (empty address → noopPromClientProvider). +func TestVersionCleanupEvaluation_NoPrometheus_NoDeletionRules(t *testing.T) { + defer deregisterMetrics() + + // v2 is the latest Ready version; v1 (no deletion rules) is outdated. + c := setupTestControllerWithInitialResources(t, []string{ + "testdata/version-monitoring/ca-cleanup-enabled.yaml", + "testdata/version-monitoring/cav-v1-no-deletion-rules.yaml", + "testdata/version-monitoring/cav-v2-deletion-rules.yaml", + "testdata/version-monitoring/cat-provider-v2-ready.yaml", + }) + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: "", // no Prometheus → noopPromClientProvider + evaluationInterval: 10 * time.Minute, + acquireClientRetryDelay: time.Hour, + }) + defer o.queue.ShutDown() + + item := NamespacedResourceKey{Namespace: "default", Name: "test-cap-01-cav-v1"} + o.queue.Add(item) + _ = c.processVersionCleanupQueueItem(context.TODO(), o) + + // No requeues expected. + if o.queue.NumRequeues(item) != 0 { + t.Errorf("expected zero requeues for version without deletion rules, got %d", o.queue.NumRequeues(item)) + } + + // v1 must have been deleted. + _, err := c.crdClient.SmeV1alpha1().CAPApplicationVersions("default").Get(context.TODO(), "test-cap-01-cav-v1", v1.GetOptions{}) + if err == nil || !errors.IsNotFound(err) { + t.Errorf("expected version test-cap-01-cav-v1 to be deleted when it has no deletion rules and Prometheus is unavailable") + } +} + +// TestVersionCleanupEvaluation_NoPrometheus_WithDeletionRules verifies that a +// CAV whose workloads DO have deletionRules is NOT deleted when Prometheus is +// unavailable, and that no requeue storm occurs. +func TestVersionCleanupEvaluation_NoPrometheus_WithDeletionRules(t *testing.T) { + defer deregisterMetrics() + + c := setupTestControllerWithInitialResources(t, []string{ + "testdata/version-monitoring/ca-cleanup-enabled.yaml", + "testdata/version-monitoring/cav-v1-deletion-rules.yaml", + "testdata/version-monitoring/cav-v2-deletion-rules.yaml", + "testdata/version-monitoring/cat-provider-v2-ready.yaml", + }) + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: "", + evaluationInterval: 10 * time.Minute, + acquireClientRetryDelay: time.Hour, + }) + defer o.queue.ShutDown() + + item := NamespacedResourceKey{Namespace: "default", Name: "test-cap-01-cav-v1"} + o.queue.Add(item) + _ = c.processVersionCleanupQueueItem(context.TODO(), o) + + // No requeues expected (skipped, not an error). + if o.queue.NumRequeues(item) != 0 { + t.Errorf("expected zero requeues when CAV has deletion rules but Prometheus is unavailable, got %d", o.queue.NumRequeues(item)) + } + + // v1 must NOT have been deleted. + _, err := c.crdClient.SmeV1alpha1().CAPApplicationVersions("default").Get(context.TODO(), "test-cap-01-cav-v1", v1.GetOptions{}) + if err != nil { + t.Errorf("expected version test-cap-01-cav-v1 to still exist when deletion rules cannot be evaluated: %v", err) + } +} + +// TestVersionCleanupEvaluation_PromUnreachable tests two evaluation cycles +// against an unreachable Prometheus server: +// - a CAV WITH deletionRules must NOT be deleted +// - a CAV WITHOUT deletionRules MUST be deleted +func TestVersionCleanupEvaluation_PromUnreachable(t *testing.T) { + s, _ := getPromServer(true, []queryTestCase{}) // server always returns 503 + defer s.Close() + + t.Run("with deletion rules - not deleted", func(t *testing.T) { + defer deregisterMetrics() + + c := setupTestControllerWithInitialResources(t, []string{ + "testdata/version-monitoring/ca-cleanup-enabled.yaml", + "testdata/version-monitoring/cav-v1-deletion-rules.yaml", + "testdata/version-monitoring/cav-v2-deletion-rules.yaml", + "testdata/version-monitoring/cat-provider-v2-ready.yaml", + }) + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: s.URL, + evaluationInterval: 10 * time.Minute, + acquireClientRetryDelay: time.Millisecond, // short so the first probe is attempted + }) + defer o.queue.ShutDown() + + item := NamespacedResourceKey{Namespace: "default", Name: "test-cap-01-cav-v1"} + o.queue.Add(item) + _ = c.processVersionCleanupQueueItem(context.TODO(), o) + + if o.queue.NumRequeues(item) != 0 { + t.Errorf("expected zero requeues for CAV with deletion rules when Prometheus is unreachable, got %d", o.queue.NumRequeues(item)) + } + + _, err := c.crdClient.SmeV1alpha1().CAPApplicationVersions("default").Get(context.TODO(), "test-cap-01-cav-v1", v1.GetOptions{}) + if err != nil { + t.Errorf("expected version test-cap-01-cav-v1 to still exist: %v", err) + } + }) + + t.Run("without deletion rules - deleted", func(t *testing.T) { + defer deregisterMetrics() + + c := setupTestControllerWithInitialResources(t, []string{ + "testdata/version-monitoring/ca-cleanup-enabled.yaml", + "testdata/version-monitoring/cav-v1-no-deletion-rules.yaml", + "testdata/version-monitoring/cav-v2-deletion-rules.yaml", + "testdata/version-monitoring/cat-provider-v2-ready.yaml", + }) + + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: s.URL, + evaluationInterval: 10 * time.Minute, + acquireClientRetryDelay: time.Millisecond, + }) + defer o.queue.ShutDown() + + item := NamespacedResourceKey{Namespace: "default", Name: "test-cap-01-cav-v1"} + o.queue.Add(item) + _ = c.processVersionCleanupQueueItem(context.TODO(), o) + + if o.queue.NumRequeues(item) != 0 { + t.Errorf("expected zero requeues for CAV without deletion rules, got %d", o.queue.NumRequeues(item)) + } + + _, err := c.crdClient.SmeV1alpha1().CAPApplicationVersions("default").Get(context.TODO(), "test-cap-01-cav-v1", v1.GetOptions{}) + if err == nil || !errors.IsNotFound(err) { + t.Errorf("expected version test-cap-01-cav-v1 to be deleted when it has no deletion rules") + } + }) +} + +// TestVersionCleanupEvaluation_RuntimeInfoRateLimited verifies that the +// cachedPromClientProvider issues at most one runtimeinfo probe per +// acquireClientRetryDelay window. Two back-to-back Get() calls should only +// result in a single HTTP hit on the mock server. +func TestVersionCleanupEvaluation_RuntimeInfoRateLimited(t *testing.T) { + s, _, hits := getPromServerWithCounter(true, []queryTestCase{}) // server always returns 503 + defer s.Close() + + // Use a generous retry delay so that the second immediate call is still + // within the rate-limit window. + o := initializeVersionCleanupOrchestrator(context.TODO(), &monitoringEnv{ + address: s.URL, + evaluationInterval: 10 * time.Minute, + acquireClientRetryDelay: 5 * time.Second, + }) + defer o.queue.ShutDown() + + ctx := context.TODO() + + // First call: triggers a probe (lastProbeAt is zero). + api1, ok1 := o.clientProvider.Get(ctx) + if ok1 || api1 != nil { + t.Errorf("expected first Get() to report unavailable (server returns 503)") + } + + // Second call: must be rate-limited; no additional HTTP request should be made. + api2, ok2 := o.clientProvider.Get(ctx) + if ok2 || api2 != nil { + t.Errorf("expected second Get() to report unavailable") + } + + // Only one runtimeinfo probe must have hit the server. + if hits.Load() > 1 { + t.Errorf("expected at most 1 runtimeinfo probe, got %d", hits.Load()) + } +} diff --git a/website/content/en/docs/configuration/_index.md b/website/content/en/docs/configuration/_index.md index 7eccf401..f0244266 100644 --- a/website/content/en/docs/configuration/_index.md +++ b/website/content/en/docs/configuration/_index.md @@ -18,7 +18,7 @@ The following environment variables are used to configure CAP Operator. - `DNS_MANAGER`: The external DNS manager to use. Possible values: - `gardener`: ["Gardener" external DNS manager](https://github.com/gardener/external-dns-management) - `kubernetes`: [External DNS management from Kubernetes](https://github.com/kubernetes-sigs/external-dns) -- `PROMETHEUS_ADDRESS`: URL of the Prometheus server for executing PromQL queries, for example `http://prometheus-operated.monitoring.svc.cluster.local:9090`. If not set, the version monitoring function is not started. +- `PROMETHEUS_ADDRESS`: URL of the Prometheus server for executing PromQL queries, for example `http://prometheus-operated.monitoring.svc.cluster.local:9090`. If not set, the controller still runs the version cleanup loop; however, only versions whose workloads have no `deletionRules` defined are eligible for cleanup — versions that have `deletionRules` are skipped until a Prometheus connection becomes available. - `PROM_ACQUIRE_CLIENT_RETRY_DELAY`: Time delay between retries when Prometheus client creation or connection check fails. - `METRICS_EVAL_INTERVAL`: Time interval between iterations where outdated versions are identified and queued for evaluation. - `MAX_CONCURRENT_RECONCILES_CAP_APPLICATION`: Maximum number of concurrent reconciles for `CAPApplication` (for example, `1`). diff --git a/website/content/en/docs/usage/version-monitoring.md b/website/content/en/docs/usage/version-monitoring.md index 9a1ecb04..b1f15673 100644 --- a/website/content/en/docs/usage/version-monitoring.md +++ b/website/content/en/docs/usage/version-monitoring.md @@ -11,7 +11,7 @@ In a continuous delivery environment where new application versions are deployed ## Integration with Prometheus -[Prometheus](https://prometheus.io/) is the industry standard for monitoring application metrics and provides a wide variety of tools for managing and reporting metrics. The CAP Operator controller can be connected to a Prometheus server by setting the `PROMETHEUS_ADDRESS` environment variable (see [Configuration](../../configuration)). The controller then queries application-related metrics based on the workload specification of `CAPApplicationVersion` resources. If no Prometheus address is supplied, the version monitoring function is not started. +[Prometheus](https://prometheus.io/) is the industry standard for monitoring application metrics and provides a wide variety of tools for managing and reporting metrics. The CAP Operator controller can be connected to a Prometheus server by setting the `PROMETHEUS_ADDRESS` environment variable (see [Configuration](../../configuration)). The controller then queries application-related metrics based on the workload specification of `CAPApplicationVersion` resources. Prometheus is only required to evaluate workloads that declare `deletionRules`. When `PROMETHEUS_ADDRESS` is not set, or the configured server is unreachable, the cleanup loop still runs. Versions whose deployment workloads declare no `deletionRules` can still be cleaned up in that case. Versions that do declare `deletionRules` are skipped — they are neither deleted nor surfaced as errors — and will be re-evaluated on the next cycle once Prometheus becomes available again. ## Configure `CAPApplication` @@ -147,4 +147,6 @@ At specified intervals (controlled by the controller environment variable `METRI - `status.currentCAPApplicationVersionInstance` — the tenant's current version. - `spec.version` — the version a tenant is upgrading to. -Workloads from the identified versions are then evaluated against their `deletionRules`. Workloads without `deletionRules` are automatically eligible for cleanup. All deployment workloads of a version must satisfy the evaluation criteria for the version to be deleted. +Workloads from the identified versions are then evaluated against their `deletionRules`. Workloads without `deletionRules` are automatically eligible for cleanup regardless of whether Prometheus is configured or reachable. All deployment workloads of a version must satisfy the evaluation criteria for the version to be deleted. + +When a workload declares `deletionRules` and the Prometheus client is currently unavailable — either because no `PROMETHEUS_ADDRESS` has been configured, or because the `Runtimeinfo` probe against the configured server is failing — that workload is treated as not eligible and the version is left untouched until the next evaluation cycle. Reachability is re-checked at most once per `PROM_ACQUIRE_CLIENT_RETRY_DELAY` interval, so a recovered Prometheus server will be detected and used on the next cycle after that delay elapses.