Skip to content

allow CAPApplicationVersion cleanup without a Prometheus connection#416

Draft
skrishnan-sap wants to merge 8 commits into
mainfrom
vers-clean-wo-prom-client
Draft

allow CAPApplicationVersion cleanup without a Prometheus connection#416
skrishnan-sap wants to merge 8 commits into
mainfrom
vers-clean-wo-prom-client

Conversation

@skrishnan-sap
Copy link
Copy Markdown
Contributor

Allow CAPApplicationVersion cleanup without a Prometheus connection

Problem

The version-cleanup routine only started when PROMETHEUS_ADDRESS was set and the controller could reach the Prometheus server (Runtimeinfo probe succeeded). This made Prometheus a hard dependency even for applications that don't define any monitoring.deletionRules on their workloads and would simply like to clean up older versions once a new one becomes Ready.

Change

The version-cleanup loop is now decoupled from Prometheus availability:

  • The schedule + worker goroutines start unconditionally for the lifetime of the controller's context.
  • When PROMETHEUS_ADDRESS is not set (or the configured server is unreachable), the loop still runs:
    • Versions whose deployment workloads declare no deletionRules are eligible for cleanup as before.
    • Versions that do declare deletionRules are skipped (not deleted, not requeued in error) and re-evaluated on the next cycle once Prometheus becomes available again.
  • Reachability is probed at most once per PROM_ACQUIRE_CLIENT_RETRY_DELAY, never on every queue iteration.

The sme.sap.com/enable-cleanup-monitoring annotation contract on CAPApplication is unchanged ("true" and "dry-run", case-insensitive). Defaulting that annotation or adding an opt-out value is explicitly out of scope here.

Implementation details (internal/controller/version-monitoring.go)

  • parseMonitoringEnv() now always returns a non-nil *monitoringEnv. When PROMETHEUS_ADDRESS is unset, address is empty; the duration fields keep their existing defaults (acquireClientRetryDelay = 1h, evaluationInterval = 10m).
  • startVersionCleanup(ctx) no longer short-circuits when no address is configured. It logs once at startup ("PROMETHEUS_ADDRESS is not set; only versions without deletion rules will be cleaned up") and proceeds to launch the goroutines.
  • initializeVersionCleanupOrchestrator(ctx, mEnv) now always returns a non-nil *cleanupOrchestrator. It no longer blocks on Prometheus reachability.
  • New promClientProvider abstraction:
    • noopPromClientProvider — used when address == "". Always reports unavailable; never constructs a client; never calls Runtimeinfo.
    • cachedPromClientProvider — constructs the client lazily, caches the working promv1.API on success, and rate-limits reachability probes via a mutex + lastProbeAt. Logs the first failure (and the recovery transition) but does not spam.
  • cleanupOrchestrator.api was replaced by cleanupOrchestrator.clientProvider; processVersionCleanupQueueItem now obtains the API via clientProvider.Get(ctx) and passes it (possibly nil) to evaluateVersionForCleanup.
  • evaluateWorkloadForCleanup accepts a possibly-nil promv1.API. Workloads without deletionRules remain auto-eligible (no Prom access). Workloads with deletionRules and an unavailable client are marked not eligible and a single info log entry is emitted naming the version + workload + reason.

Tests (internal/controller/version-monitoring_test.go)

  • TestMonitoringEnv — empty-address case now asserts a non-nil env with empty address and the documented duration defaults.
  • TestGracefulShutdownMonitoringRoutines — extended to cover the no-address case; goroutines must still launch and shut down cleanly on context cancellation.
  • Test_initializeVersionCleanupOrchestrator — orchestrator is always non-nil; reachability is asserted directly against the new provider; new explicit empty-address subtest verifies zero runtimeinfo HTTP hits.
  • New tests:
    • TestVersionCleanupEvaluation_NoPrometheus_NoDeletionRules — empty address, CAV without deletion rules → CAV is deleted.
    • TestVersionCleanupEvaluation_NoPrometheus_WithDeletionRules — empty address, CAV with deletion rules → CAV is not deleted, no requeue, no panic.
    • TestVersionCleanupEvaluation_PromUnreachable — address set but server returns 503 from runtimeinfo; CAV with deletion rules not deleted, CAV without deletion rules deleted in the same cycle.
    • TestVersionCleanupEvaluation_RuntimeInfoRateLimited — two consecutive probe attempts within acquireClientRetryDelay produce at most one runtimeinfo request on the mock server.
  • New fixture internal/controller/testdata/version-monitoring/cav-v1-no-deletion-rules.yaml for the no-rules scenarios.

Documentation

  • website/content/en/docs/configuration/_index.mdPROMETHEUS_ADDRESS bullet rewritten: the controller still runs the cleanup loop without a Prometheus connection; only versions with no deletionRules are cleaned up in that case.
  • website/content/en/docs/usage/version-monitoring.md:
    • "Integration with Prometheus" rewritten to clarify Prometheus is only needed for evaluating deletionRules.
    • "Evaluating CAPApplicationVersion Resources for Cleanup" extended with the new availability behaviour and the rate-limited reachability probing.

Verification

  • go build ./...
  • go test ./...
  • go test ./internal/controller/...
  • All 16 acceptance criteria passed (build/test green, runtime probes rate-limited, annotation contract preserved, no diff under pkg/apis/sme.sap.com/v1alpha1/, pkg/client/, or crds/, docs updated).

Diff stat

internal/controller/testdata/version-monitoring/cav-v1-no-deletion-rules.yaml |  63 +++
internal/controller/version-monitoring.go                                     | 197 +++++++---
internal/controller/version-monitoring_test.go                                | 384 ++++++++++++++++---
website/content/en/docs/configuration/_index.md                               |   2 +-
website/content/en/docs/usage/version-monitoring.md                           |   6 +-
5 files changed, 551 insertions(+), 101 deletions(-)

Non-goals

  • Defaulting sme.sap.com/enable-cleanup-monitoring to "true" or introducing an opt-out value (e.g. "false"/"disabled") — tracked separately.
  • Changes to PromQL query construction, ServiceMonitor creation, or any CR status fields.

@Pavan-SAP Pavan-SAP added the build-pr-images Set this label on PRs to build docker images for changes label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

build-pr-images Set this label on PRs to build docker images for changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants