Feat: Keycloak bootstrap — operand infra and realm import#412
Feat: Keycloak bootstrap — operand infra and realm import#412ChristianZaccaria wants to merge 2 commits into
Conversation
cwiklik
left a comment
There was a problem hiding this comment.
Substantial, mostly well-built bootstrap — but it doesn't start cleanly, and CI proves it. The blocker is structural: an ungated bootstrap whose Postgres step is fatal, so the operator crash-loops wherever the keycloak namespace isn't pre-created. The E2E log shows exactly this, ×6:
"problem running manager","error":"ensuring Postgres: creating keycloak-db-secret: namespaces \"keycloak\" not found" → main.main cmd/main.go:666 → exit
The operator never becomes ready, so the E2E specs time out (180s). That's a REQUEST_CHANGES regardless of the rest. Fix = gate it (like MLflow/OTel) and make the Postgres step non-fatal / namespace-tolerant (comments #1, #2).
The rest is genuinely good: crypto/rand passwords, secret-preservation done right (don't overwrite existing DB/test-users secrets, with tests), a solid Postgres securityContext (RunAsNonRoot, drop ALL, seccomp), and RBAC synced across all three (kubebuilder markers + config/rbac/role.yaml + chart) — the hygiene that was missing in #378.
Tests: good coverage of postgres/secrets/idempotency/realm-import, but ensureKeycloakCR/ensureRoute aren't asserted — their GVKs aren't in the test scheme, so they fail gracefully and silently in tests. Register the GVKs and assert them.
Nits: feat: lacks emoji prefix; the Makefile docker-buildx change is the same one as #411 (verified safe — Dockerfile already carries --platform), duplicated here; one-shot bootstrap won't retry if the Keycloak CRDs/Route appear after startup (same pattern as #411).
| setupLog.Info("OTel collector bootstrap enabled") | ||
| } | ||
|
|
||
| keycloakBootstrap := &bootstrap.KeycloakBootstrapRunnable{ |
There was a problem hiding this comment.
must-fix: The Keycloak bootstrap is registered unconditionally — every other bootstrap here (MLflow, OTel) is behind an if enable… gate. Combined with the fatal Postgres path (see keycloak.go), this crash-loops the operator on any cluster without a pre-existing keycloak namespace — exactly the E2E failure (namespaces "keycloak" not found → problem running manager → exit, ×6). Gate this behind a feature flag (e.g. --enable-keycloak-bootstrap, default off).
There was a problem hiding this comment.
I added a namespace check at top of Start() on the KeycloakBootstrapRunnable. I'm thinking of skip adding a flag since Keycloak is a core Kagenti dependency, but happy to add --enable-keycloak-bootstrap if the team prefers explicit opt-in.
| log.Info("Starting Keycloak infrastructure bootstrap", "namespace", r.Namespace) | ||
|
|
||
| if err := r.ensurePostgres(ctx, log); err != nil { | ||
| return fmt.Errorf("ensuring Postgres: %w", err) |
There was a problem hiding this comment.
must-fix: ensurePostgres returning an error makes Start() fatal (the manager exits), unlike the sibling steps right below (ensureKeycloakCR/ensureRoute/ensureRealmBootstrap) which log + return nil. A missing keycloak namespace shouldn't take down the whole operator. Either create/ensure the namespace, or handle its absence gracefully like the other steps.
There was a problem hiding this comment.
Good catch, thanks! I made ensurePostgres non-fatal (log + skip, like the other steps). Operator won't crash-loop on any cluster state now.
| Type: corev1.SecretTypeOpaque, | ||
| Data: map[string][]byte{ | ||
| "username": []byte("testuser"), | ||
| "password": []byte("thisisonly4testingNOT4prod"), |
There was a problem hiding this comment.
suggestion: DB password is hardcoded (thisisonly4testingNOT4prod) here and duplicated as a plaintext env in the StatefulSet (POSTGRESQL_PASSWORD), not sourced from keycloak-db-secret via secretKeyRef. Inconsistent with the randomly generated kagenti-test-users passwords just below. For ephemeral test infra it's tolerable, but recommend generating it like the test users and having Postgres read it from the secret.
There was a problem hiding this comment.
Fixed — DB password is now generated via crypto/rand (same randomPassword() used for test users) and the StatefulSet reads both POSTGRESQL_USER and POSTGRESQL_PASSWORD from keycloak-db-secret via secretKeyRef. Single source of truth, no more hardcoded creds.
| // mapsEqual does a shallow comparison sufficient for detecting spec drift on | ||
| // the top-level keys. For the Keycloak CR this is adequate since we own the | ||
| // entire spec. | ||
| func mapsEqual(a, b map[string]any) bool { |
There was a problem hiding this comment.
suggestion: mapsEqual does a stringify-based shallow compare. If the Keycloak operator defaults fields into .spec, len(existing) != len(desired) → you re-Update the CR on every operator restart, fighting the operator's defaults. Server-Side Apply (with a field manager) is the idiomatic "I own this spec" idempotent approach and avoids the drift churn.
There was a problem hiding this comment.
Agreed — switched both ensureKeycloakCR and ensureRealmImport to Server-Side Apply with client.FieldOwner("kagenti-operator-bootstrap") + ForceOwnership. Removed the mapsEqual helper entirely. Now operator restarts are no-ops when the spec hasn't changed, and operator-defaulted fields won't trigger spurious updates.
| return nil | ||
| } | ||
|
|
||
| func (r *KeycloakBootstrapRunnable) discoverPublicURL(ctx context.Context, log logr.Logger) string { |
There was a problem hiding this comment.
suggestion: discoverPublicURL runs right after the Route is created, when spec.host is typically not populated yet → returns "" → empty __AUDIENCE_URL__ in the realm's kagenti-platform-audience mapper (the one "so AuthBridge ext-proc accepts the token"). Lean on the explicit KeycloakPublicURL (which main passes); the discovery fallback is unreliable at bootstrap time and silently yields a broken audience claim.
There was a problem hiding this comment.
Good call. Kept auto-discovery but added a retry loop (6 attempts, 5s apart) so we don't race the router. If KeycloakPublicURL is explicitly set it's used directly; otherwise we poll the Route for spec.host. If it's still empty after 30s, realm import is skipped with a clear log rather than silently baking in an empty audience.
A user could explicitly set it via KAGENTI_KEYCLOAK_PUBLIC_URL env var.
Signed-off-by: ChristianZaccaria <christian.zaccaria.cz@gmail.com>
c40ecbf to
e8840f1
Compare
…and use Server-Side Apply with Route discovery retry for idempotent Keycloak bootstrap Signed-off-by: ChristianZaccaria <christian.zaccaria.cz@gmail.com>
pdettori
left a comment
There was a problem hiding this comment.
All five original review issues are addressed in the fix commit (c6ccab1):
- Unconditional bootstrap → namespace existence check, skips gracefully
- Fatal ensurePostgres → errors logged + skipped
- Hardcoded DB password →
crypto/rand+secretKeyRef - mapsEqual shallow compare → Server-Side Apply with FieldOwner
- Route discovery race → retry loop (6×5s)
E2E failure is a pre-existing webhook TLS race also present on main — not introduced by this PR.
One suggestion carried forward: test scheme lacks Keycloak/Route GVK registration (see inline).
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
There was a problem hiding this comment.
suggestion: Test scheme only registers corev1 and appsv1 — k8s.keycloak.org/v2alpha1 (Keycloak, KeycloakRealmImport) and route.openshift.io/v1 (Route) GVKs are missing. This means ensureKeycloakCR, ensureRealmImport, and ensureRoute silently succeed in tests without their output being verifiable. Registering those CRDs in the test scheme and adding assertions would close the coverage gap flagged in the first review.
|
@cwiklik pls. have another look |
Summary
Adds a
KeycloakBootstrapRunnablethat creates and maintains all Keycloak infrastructure at operator startup: Postgres (StatefulSet, Service, Secret, ConfigMap), Keycloak CR, Route, test-users Secret with random passwords, and KeycloakRealmImport CR. This moves Keycloak operand management from the kagenti-deps Helm chart into the operator for better lifecycle control.Changes
internal/bootstrap/keycloak.go— new runnable managing 8 resources (create-if-absent, spec-drift correction, no-delete for Keycloak CR)internal/bootstrap/keycloak_test.go— unit tests for creation, idempotency, and secret preservationcmd/main.go— register bootstrap with realm and public URL paramscharts/kagenti-operator/templates/rbac/role.yaml— RBAC for keycloaks, keycloakrealmimports, statefulsets, routesMakefile— fix docker-buildx duplicate --platform flagTest Plan
KeycloakRealmImportstatus showsDone: True