fix(helm): upgrade-safe deployments — PVC for data, zero-downtime RollingUpdate (#157)
Conversation
…date strategy

- Add pvc.yaml for /app/data with `helm.sh/resource-policy: keep` annotation (persistence.enabled=false by default; emptyDir is preserved for stateless installs)
- Update deployment to use the PVC when persistence.enabled=true, else emptyDir
- Add explicit strategy: RollingUpdate with maxSurge: 1, maxUnavailable: 0 for zero-downtime upgrades
- Add persistence config block to values.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR enhances the Helm chart with persistent storage support and explicit rolling update configuration for safer deployments. It adds opt-in PVC support for the /app/data directory (used for downloaded plugins and runtime state) and configures zero-downtime rolling updates.
Changes:
- Add `persistence.*` configuration options in values.yaml (disabled by default)
- Add pvc.yaml template with `helm.sh/resource-policy: keep` annotation
- Configure RollingUpdate strategy with `maxSurge: 1`, `maxUnavailable: 0`
- Conditionally mount PVC or emptyDir based on the persistence.enabled flag
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| deploy/helm/workflow/values.yaml | Adds persistence configuration section with enabled, storageClass, accessMode, and size options |
| deploy/helm/workflow/templates/pvc.yaml | New template creating PVC with keep policy when persistence is enabled |
| deploy/helm/workflow/templates/deployment.yaml | Adds explicit RollingUpdate strategy and conditional volume mounting (PVC vs emptyDir) |
Comments suppressed due to low confidence (1)
deploy/helm/workflow/values.yaml:97
- The comment states "With emptyDir (disabled), data is lost on pod restart" but this is slightly misleading. With emptyDir, data is lost when the pod is deleted or rescheduled to a different node, but it survives container restarts within the same pod. A container restart (e.g., due to a crash) does not lose emptyDir data.
Consider clarifying: "With emptyDir (disabled), data is lost when the pod is deleted or rescheduled — plugins must be re-fetched."
```yaml
# With emptyDir (disabled), data is lost on pod restart — plugins must be re-fetched.
```
```yaml
{{- if .Values.persistence.storageClass }}
storageClassName: {{ .Values.persistence.storageClass }}
```
The storageClassName condition uses a simple truthiness check, so when storageClass is an empty string (""), the storageClassName field will not be rendered. In Kubernetes, however, explicitly setting storageClassName: "" has a special meaning: it requests a PV with no storage class (i.e., it disables dynamic provisioning).
The current logic treats the empty string as "not set", which may not be the intended behavior. Consider using hasKey to distinguish "field omitted" from "field set to empty string", or document that an empty string should not be used when the user wants to disable dynamic provisioning.
Suggested change:

```diff
-{{- if .Values.persistence.storageClass }}
-storageClassName: {{ .Values.persistence.storageClass }}
+{{- if hasKey .Values.persistence "storageClass" }}
+storageClassName: {{ .Values.persistence.storageClass | quote }}
```
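Another option, seen in several widely used public charts, is a `-` sentinel value instead of `hasKey`. The following is a hypothetical sketch of that convention, not code from this PR: `-` requests an explicit empty class (disabling dynamic provisioning), any other non-empty value names a class, and an empty string omits the field entirely.

```yaml
# Hypothetical sketch of the "-" sentinel convention for storageClass:
#   storageClass: "-"        -> storageClassName: "" (no dynamic provisioning)
#   storageClass: "fast-ssd" -> storageClassName: "fast-ssd"
#   storageClass: ""         -> field omitted (use the cluster default class)
{{- if .Values.persistence.storageClass }}
  {{- if eq "-" .Values.persistence.storageClass }}
storageClassName: ""
  {{- else }}
storageClassName: {{ .Values.persistence.storageClass | quote }}
  {{- end }}
{{- end }}
```

This keeps all three Kubernetes behaviors reachable from values.yaml without relying on whether the key is present.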
```yaml
# Plugin and runtime data persistence
# With emptyDir (disabled), data is lost on pod restart — plugins must be re-fetched.
# Enable to retain downloaded plugins and runtime state across restarts.
```
The documentation should explicitly warn about the compatibility issues between persistence and autoscaling/multiple replicas. The comment mentions data loss on pod restart and plugin re-fetching, but doesn't mention the critical limitation that ReadWriteOnce access mode is incompatible with:
- Autoscaling (autoscaling.enabled=true)
- Multiple replicas (replicaCount > 1)
- Rolling updates across different nodes
Suggest adding a warning like: "Note: Default ReadWriteOnce mode supports only single replica. For multiple replicas or autoscaling, use ReadWriteMany access mode (requires compatible storage class) or disable persistence."
Suggested change:

```diff
 # Enable to retain downloaded plugins and runtime state across restarts.
+# Note: Default ReadWriteOnce mode supports only a single replica. For autoscaling (autoscaling.enabled=true),
+# multiple replicas (replicaCount > 1), or rolling updates across different nodes, use ReadWriteMany
+# (requires a compatible StorageClass) or disable persistence.
```
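As a sketch of what a multi-replica configuration might look like under this warning (the StorageClass name is a placeholder for whatever RWX-capable class exists in the target cluster; the size and replica count are illustrative, not from this PR):

```yaml
# Hypothetical values override for a multi-replica install.
# "nfs-client" is a placeholder and must match an RWX-capable
# StorageClass actually available in the cluster.
persistence:
  enabled: true
  storageClass: "nfs-client"
  accessMode: ReadWriteMany
  size: 5Gi
replicaCount: 3
```

With ReadWriteMany, all replicas (and both old and new pods during a rolling update) can mount the same volume regardless of which node they land on.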
```yaml
{{- if .Values.persistence.enabled }}
persistentVolumeClaim:
  claimName: {{ include "workflow.fullname" . }}-data
{{- else }}
emptyDir: {}
{{- end }}
```
When persistence is enabled with ReadWriteOnce access mode, the deployment should include podAffinity or topologySpreadConstraints to ensure all pods are scheduled on the same node. Without this, the default scheduler may place pods on different nodes, causing mount failures during rolling updates or scaling operations.
Consider adding a conditional block in the deployment that sets podAffinity when persistence.enabled=true and persistence.accessMode=ReadWriteOnce. For example:
```yaml
{{- if and .Values.persistence.enabled (eq .Values.persistence.accessMode "ReadWriteOnce") }}
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            {{- include "workflow.selectorLabels" . | nindent 12 }}
        topologyKey: kubernetes.io/hostname
{{- end }}
```

This ensures all replicas are co-located on the same node, allowing them to share the RWO volume.
```yaml
labels:
  {{- include "workflow.labels" . | nindent 4 }}
annotations:
  # Keep PVC on helm uninstall to prevent data loss
```
The helm.sh/resource-policy: keep annotation means the PVC will persist after helm uninstall, which is the intended behavior for data retention. However, this has operational implications that should be documented:
- When reinstalling with a different release name, a new PVC will be created (old data not reused)
- When reinstalling with the same release name, the old PVC will be reused (data preserved)
- Manual PVC cleanup is required after helm uninstall if data is no longer needed
- This could lead to orphaned PVCs accumulating over time
Consider adding a note in the comment or values.yaml documentation about PVC lifecycle management and cleanup procedures.
Suggested change:

```diff
-# Keep PVC on helm uninstall to prevent data loss
+# Keep PVC on helm uninstall to prevent data loss.
+# Note:
+# - This PVC will persist after `helm uninstall` and must be deleted manually if no longer needed.
+# - Reinstalling the chart with the SAME release name will reuse this existing PVC (data preserved).
+# - Reinstalling the chart with a DIFFERENT release name will create a new PVC (old data not reused).
+# - Over time, this can lead to orphaned PVCs accumulating; ensure you have a cleanup process in place.
```
```yaml
persistence:
  enabled: false
  storageClass: ""
  accessMode: ReadWriteOnce
```
The default accessMode ReadWriteOnce conflicts with the rolling update strategy and potential multi-replica scenarios. With RWO, the PVC can only be mounted by pods on a single node. During a rolling update with maxSurge: 1, if the new pod is scheduled on a different node than the old pod, it cannot mount the PVC, causing the deployment to fail.
Consider:
- Document this limitation prominently in the comment above (e.g., "ReadWriteOnce requires all pods on same node; scaling beyond 1 replica requires ReadWriteMany")
- Consider defaulting to ReadWriteMany if the storage backend supports it
- Add validation or warnings when persistence.enabled=true with replicaCount > 1 or autoscaling.enabled=true
This is particularly critical because autoscaling is supported in this chart (can scale to 10 replicas), but would be incompatible with RWO persistence.
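The validation suggested above could be implemented with Helm's built-in `fail` function. A minimal sketch, assuming the chart's existing `autoscaling.enabled` and `replicaCount` values (the exact placement and message wording are illustrative):

```yaml
# Hypothetical guard near the top of deployment.yaml: abort template
# rendering when RWO persistence is combined with settings that imply
# more than one pod, instead of deploying a configuration that will
# fail at mount time.
{{- if and .Values.persistence.enabled (eq .Values.persistence.accessMode "ReadWriteOnce") }}
  {{- if or .Values.autoscaling.enabled (gt (int .Values.replicaCount) 1) }}
    {{- fail "persistence.accessMode=ReadWriteOnce requires replicaCount=1 and autoscaling.enabled=false; use ReadWriteMany or disable persistence" }}
  {{- end }}
{{- end }}
```

Failing at `helm install`/`helm upgrade` time surfaces the conflict immediately, rather than as a scheduling or volume-attach error after rollout has begun.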
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
The RollingUpdate strategy with maxSurge: 1 and maxUnavailable: 0 can cause deployment failures when combined with persistence.enabled=true and the default ReadWriteOnce access mode. During the rolling update, both old and new pods may need to mount the PVC simultaneously, but RWO only allows one pod to mount it at a time (and only on the same node).
Consider adding a note in the deployment that this strategy requires either:
- Persistence disabled (emptyDir)
- ReadWriteMany access mode for persistence
- Node affinity to ensure pods stay on the same node
- Or use Recreate strategy when persistence is enabled with RWO
Alternatively, the chart could dynamically adjust the strategy based on persistence.enabled and persistence.accessMode values.
Suggested change:

```diff
-strategy:
-  type: RollingUpdate
-  rollingUpdate:
-    maxSurge: 1
-    maxUnavailable: 0
+strategy:
+  {{- if and .Values.persistence.enabled (eq (default "ReadWriteOnce" .Values.persistence.accessMode) "ReadWriteOnce") }}
+  type: Recreate
+  {{- else }}
+  type: RollingUpdate
+  rollingUpdate:
+    maxSurge: 1
+    maxUnavailable: 0
+  {{- end }}
```
Summary
- `pvc.yaml` for `/app/data` with `helm.sh/resource-policy: keep` — the PVC survives `helm uninstall`, preventing loss of downloaded plugins and runtime state. Opt-in via `persistence.enabled=true` (default: `false` / emptyDir for stateless installs).
- `RollingUpdate` strategy with `maxSurge: 1, maxUnavailable: 0` — ensures old pods keep serving traffic until the new pod is fully healthy (passes the readiness probe at `/healthz`).

Test plan

- `helm upgrade --install workflow ./deploy/helm/workflow` — verify no data loss
- With `persistence.enabled=true`: confirm the PVC is created and survives `helm uninstall`

🤖 Generated with Claude Code