Skip to content

Feat: Keycloak bootstrap — operand infra and realm import#412

Open
ChristianZaccaria wants to merge 2 commits into
kagenti:mainfrom
ChristianZaccaria:pr-8
Open

Feat: Keycloak bootstrap — operand infra and realm import#412
ChristianZaccaria wants to merge 2 commits into
kagenti:mainfrom
ChristianZaccaria:pr-8

Conversation

@ChristianZaccaria

@ChristianZaccaria ChristianZaccaria commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a KeycloakBootstrapRunnable that creates and maintains all Keycloak infrastructure at operator startup: Postgres (StatefulSet, Service, Secret, ConfigMap), Keycloak CR, Route, test-users Secret with random passwords, and KeycloakRealmImport CR. This moves Keycloak operand management from the kagenti-deps Helm chart into the operator for better lifecycle control.

Changes

  • internal/bootstrap/keycloak.go — new runnable managing 8 resources (create-if-absent, spec-drift correction, no-delete for Keycloak CR)
  • internal/bootstrap/keycloak_test.go — unit tests for creation, idempotency, and secret preservation
  • cmd/main.go — register bootstrap with realm and public URL params
  • charts/kagenti-operator/templates/rbac/role.yaml — RBAC for keycloaks, keycloakrealmimports, statefulsets, routes
  • Makefile — fix docker-buildx duplicate --platform flag

Test Plan

  • Deploy operator, verify all 8 resources created in keycloak namespace
  • Verify KeycloakRealmImport status shows Done: True
  • Confirm realm OIDC discovery endpoint responds
  • Restart operator — verify idempotent (no duplicate resources)
  • Delete kagenti-test-users Secret, restart — verify new passwords generated

@cwiklik cwiklik left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substantial, mostly well-built bootstrap — but it doesn't start cleanly, and CI proves it. The blocker is structural: an ungated bootstrap whose Postgres step is fatal, so the operator crash-loops wherever the keycloak namespace isn't pre-created. The E2E log shows exactly this, ×6:

"problem running manager","error":"ensuring Postgres: creating keycloak-db-secret: namespaces \"keycloak\" not found" → main.main cmd/main.go:666 → exit

The operator never becomes ready, so the E2E specs time out (180s). That's a REQUEST_CHANGES regardless of the rest. Fix = gate it (like MLflow/OTel) and make the Postgres step non-fatal / namespace-tolerant (comments #1, #2).

The rest is genuinely good: crypto/rand passwords, secret-preservation done right (don't overwrite existing DB/test-users secrets, with tests), a solid Postgres securityContext (RunAsNonRoot, drop ALL, seccomp), and RBAC synced across all three (kubebuilder markers + config/rbac/role.yaml + chart) — the hygiene that was missing in #378.

Tests: good coverage of postgres/secrets/idempotency/realm-import, but ensureKeycloakCR/ensureRoute aren't asserted — their GVKs aren't in the test scheme, so they fail gracefully and silently in tests. Register the GVKs and assert them.

Nits: feat: lacks emoji prefix; the Makefile docker-buildx change is the same one as #411 (verified safe — Dockerfile already carries --platform), duplicated here; one-shot bootstrap won't retry if the Keycloak CRDs/Route appear after startup (same pattern as #411).

setupLog.Info("OTel collector bootstrap enabled")
}

keycloakBootstrap := &bootstrap.KeycloakBootstrapRunnable{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must-fix: The Keycloak bootstrap is registered unconditionally — every other bootstrap here (MLflow, OTel) is behind an if enable… gate. Combined with the fatal Postgres path (see keycloak.go), this crash-loops the operator on any cluster without a pre-existing keycloak namespace — exactly the E2E failure (namespaces "keycloak" not foundproblem running manager → exit, ×6). Gate this behind a feature flag (e.g. --enable-keycloak-bootstrap, default off).

@ChristianZaccaria ChristianZaccaria Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a namespace check at top of Start() on the KeycloakBootstrapRunnable. I'm thinking of skip adding a flag since Keycloak is a core Kagenti dependency, but happy to add --enable-keycloak-bootstrap if the team prefers explicit opt-in.

log.Info("Starting Keycloak infrastructure bootstrap", "namespace", r.Namespace)

if err := r.ensurePostgres(ctx, log); err != nil {
return fmt.Errorf("ensuring Postgres: %w", err)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must-fix: ensurePostgres returning an error makes Start() fatal (the manager exits), unlike the sibling steps right below (ensureKeycloakCR/ensureRoute/ensureRealmBootstrap) which log + return nil. A missing keycloak namespace shouldn't take down the whole operator. Either create/ensure the namespace, or handle its absence gracefully like the other steps.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks! I made ensurePostgres non-fatal (log + skip, like the other steps). Operator won't crash-loop on any cluster state now.

Type: corev1.SecretTypeOpaque,
Data: map[string][]byte{
"username": []byte("testuser"),
"password": []byte("thisisonly4testingNOT4prod"),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: DB password is hardcoded (thisisonly4testingNOT4prod) here and duplicated as a plaintext env in the StatefulSet (POSTGRESQL_PASSWORD), not sourced from keycloak-db-secret via secretKeyRef. Inconsistent with the randomly generated kagenti-test-users passwords just below. For ephemeral test infra it's tolerable, but recommend generating it like the test users and having Postgres read it from the secret.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — DB password is now generated via crypto/rand (same randomPassword() used for test users) and the StatefulSet reads both POSTGRESQL_USER and POSTGRESQL_PASSWORD from keycloak-db-secret via secretKeyRef. Single source of truth, no more hardcoded creds.

// mapsEqual does a shallow comparison sufficient for detecting spec drift on
// the top-level keys. For the Keycloak CR this is adequate since we own the
// entire spec.
func mapsEqual(a, b map[string]any) bool {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: mapsEqual does a stringify-based shallow compare. If the Keycloak operator defaults fields into .spec, len(existing) != len(desired) → you re-Update the CR on every operator restart, fighting the operator's defaults. Server-Side Apply (with a field manager) is the idiomatic "I own this spec" idempotent approach and avoids the drift churn.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — switched both ensureKeycloakCR and ensureRealmImport to Server-Side Apply with client.FieldOwner("kagenti-operator-bootstrap") + ForceOwnership. Removed the mapsEqual helper entirely. Now operator restarts are no-ops when the spec hasn't changed, and operator-defaulted fields won't trigger spurious updates.

return nil
}

func (r *KeycloakBootstrapRunnable) discoverPublicURL(ctx context.Context, log logr.Logger) string {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: discoverPublicURL runs right after the Route is created, when spec.host is typically not populated yet → returns "" → empty __AUDIENCE_URL__ in the realm's kagenti-platform-audience mapper (the one "so AuthBridge ext-proc accepts the token"). Lean on the explicit KeycloakPublicURL (which main passes); the discovery fallback is unreliable at bootstrap time and silently yields a broken audience claim.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Kept auto-discovery but added a retry loop (6 attempts, 5s apart) so we don't race the router. If KeycloakPublicURL is explicitly set it's used directly; otherwise we poll the Route for spec.host. If it's still empty after 30s, realm import is skipped with a clear log rather than silently baking in an empty audience.

A user could explicitly set it via KAGENTI_KEYCLOAK_PUBLIC_URL env var.

Signed-off-by: ChristianZaccaria <christian.zaccaria.cz@gmail.com>
@ChristianZaccaria ChristianZaccaria force-pushed the pr-8 branch 2 times, most recently from c40ecbf to e8840f1 Compare June 9, 2026 11:05
@ChristianZaccaria ChristianZaccaria changed the title feat: Keycloak bootstrap — operand infra and realm import Feat: Keycloak bootstrap — operand infra and realm import Jun 9, 2026
…and use Server-Side Apply with Route discovery retry for idempotent Keycloak bootstrap

Signed-off-by: ChristianZaccaria <christian.zaccaria.cz@gmail.com>

@pdettori pdettori left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All five original review issues are addressed in the fix commit (c6ccab1):

  • Unconditional bootstrap → namespace existence check, skips gracefully
  • Fatal ensurePostgres → errors logged + skipped
  • Hardcoded DB passwordcrypto/rand + secretKeyRef
  • mapsEqual shallow compare → Server-Side Apply with FieldOwner
  • Route discovery race → retry loop (6×5s)

E2E failure is a pre-existing webhook TLS race also present on main — not introduced by this PR.

One suggestion carried forward: test scheme lacks Keycloak/Route GVK registration (see inline).


Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Test scheme only registers corev1 and appsv1k8s.keycloak.org/v2alpha1 (Keycloak, KeycloakRealmImport) and route.openshift.io/v1 (Route) GVKs are missing. This means ensureKeycloakCR, ensureRealmImport, and ensureRoute silently succeed in tests without their output being verifiable. Registering those CRDs in the test scheme and adding assertions would close the coverage gap flagged in the first review.

@pdettori

Copy link
Copy Markdown
Member

@cwiklik pls. have another look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants