mtls: Pods restart multiple times when SPIRE is first enabled #374

@mahil-2040

Description

What happened:

When SPIRE is enabled via helm upgrade --set spire.enabled=true, the Router and WorkloadManager pods crash and restart several times before stabilizing (Router: 5 restarts, WM: 3, SPIRE Agent: 3).

The root cause is a startup race condition. The spiffe-helper sidecar and the main container start simultaneously, but the main container needs cert files that spiffe-helper hasn't written yet. Both Router (cmd/router/main.go:66) and WorkloadManager (cmd/workload-manager/main.go:82) call klog.Fatalf if certs aren't found within 30s (pkg/mtls/wait.go:30), crashing the pod.

What you expected to happen:
Pods should start cleanly with 0 restarts.

How to reproduce it (as minimally and precisely as possible):

  1. Install AgentCube without SPIRE on a Kind cluster.
  2. Run helm upgrade with --set spire.enabled=true.
  3. Watch pods via kubectl get pods -n agentcube -w and observe multiple CrashLoopBackOff cycles.

Anything else we need to know?:

Suggested fix: Convert spiffe-helper from a regular sidecar to a Kubernetes native sidecar by moving it to initContainers with restartPolicy: Always. Kubernetes then starts spiffe-helper before the main container. Adding a startupProbe that checks for the cert files would ensure they are written before the main container starts.

Environment:

  • agentcube version:
  • Kubernetes version: v1.32.2 (Kind v0.27.0)
  • Others: Helm v3.17.3

Labels: kind/bug