What happened:
When SPIRE is enabled via helm upgrade --set spire.enabled=true, the Router and WorkloadManager pods crash and restart several times before stabilizing (Router: 5 restarts, WorkloadManager: 3, SPIRE Agent: 3).
The root cause is a startup race condition. The spiffe-helper sidecar and the main container start simultaneously, but the main container needs cert files that spiffe-helper hasn't written yet. Both Router (cmd/router/main.go:66) and WorkloadManager (cmd/workload-manager/main.go:82) call klog.Fatalf if certs aren't found within 30s (pkg/mtls/wait.go:30), crashing the pod.
What you expected to happen:
Pods should start cleanly with 0 restarts.
How to reproduce it (as minimally and precisely as possible):
- Install AgentCube without SPIRE on a Kind cluster.
- Run helm upgrade with --set spire.enabled=true.
- Watch pods via kubectl get pods -n agentcube -w and observe multiple CrashLoopBackOff cycles.
Anything else we need to know?:
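Suggested fix: Convert spiffe-helper from a regular sidecar to a Kubernetes native sidecar by moving it to initContainers with restartPolicy: Always. Kubernetes starts native sidecars before the main containers, so spiffe-helper runs first and can write the cert files before the main container starts. Adding a startupProbe that checks for the cert files would make Kubernetes hold the main container until they actually exist.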
Environment:
- agentcube version:
- Kubernetes version: v1.32.2 (Kind v0.27.0)
- Others: Helm v3.17.3