
fix(helm): upgrade-safe deployments — PVC for data, zero-downtime RollingUpdate#157

Merged
intel352 merged 1 commit into main from
feature/helm-upgrade-safety
Feb 25, 2026

Conversation

@intel352
Contributor

Summary

  • Add pvc.yaml for /app/data with helm.sh/resource-policy: keep — PVC survives helm uninstall, preventing loss of downloaded plugins and runtime state. Opt-in via persistence.enabled=true (default: false / emptyDir for stateless installs).
  • Explicit RollingUpdate strategy with maxSurge: 1, maxUnavailable: 0 — ensures old pods keep serving traffic until the new pod is fully healthy (passes readiness probe at /healthz).
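In Deployment terms, the strategy change amounts to roughly this stanza (a sketch using the standard Deployment schema; the exact placement in this chart's deployment.yaml may differ):

```yaml
# deployment.yaml (excerpt): zero-downtime rolling updates
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one new pod before touching the old one
      maxUnavailable: 0  # never take a serving pod down early
```

With these values, Kubernetes starts the replacement pod first and only terminates the old pod once the new one passes its readiness probe.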

Test plan

  • helm upgrade --install workflow ./deploy/helm/workflow — verify no data loss
  • With persistence.enabled=true: confirm PVC is created and survives helm uninstall
  • Rolling update: confirm zero downtime (old pod stays up until new pod is ready)
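For reference, the opt-in PVC template described above amounts to roughly the following sketch (based on the values and the `workflow.fullname` helper quoted in the review below; the PR's actual template may differ in detail):

```yaml
# templates/pvc.yaml (sketch)
{{- if .Values.persistence.enabled }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ include "workflow.fullname" . }}-data
  annotations:
    # Helm's documented keep policy: the PVC survives `helm uninstall`
    helm.sh/resource-policy: keep
spec:
  accessModes:
    - {{ .Values.persistence.accessMode }}
  resources:
    requests:
      storage: {{ .Values.persistence.size }}
{{- end }}
```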

🤖 Generated with Claude Code

…date strategy

- Add pvc.yaml for /app/data with helm.sh/resource-policy: keep annotation
  (persistence.enabled=false by default; emptyDir is preserved for stateless installs)
- Update deployment to use PVC when persistence.enabled=true, else emptyDir
- Add explicit strategy: RollingUpdate maxSurge:1 maxUnavailable:0 for zero-downtime upgrades
- Add persistence config block to values.yaml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 25, 2026 00:24
@intel352 intel352 merged commit e948204 into main Feb 25, 2026
19 checks passed
@intel352 intel352 deleted the feature/helm-upgrade-safety branch February 25, 2026 00:27

Copilot AI left a comment


Pull request overview

This PR enhances the Helm chart with persistent storage support and explicit rolling update configuration for safer deployments. It adds opt-in PVC support for the /app/data directory (used for downloaded plugins and runtime state) and configures zero-downtime rolling updates.

Changes:

  • Add persistence.* configuration options in values.yaml (disabled by default)
  • Add pvc.yaml template with helm.sh/resource-policy: keep annotation
  • Configure RollingUpdate strategy with maxSurge: 1, maxUnavailable: 0
  • Conditionally mount PVC or emptyDir based on persistence.enabled flag

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

Files reviewed:

  • deploy/helm/workflow/values.yaml: Adds persistence configuration section with enabled, storageClass, accessMode, and size options
  • deploy/helm/workflow/templates/pvc.yaml: New template creating a PVC with the keep policy when persistence is enabled
  • deploy/helm/workflow/templates/deployment.yaml: Adds explicit RollingUpdate strategy and conditional volume mounting (PVC vs emptyDir)
Comments suppressed due to low confidence (1)

deploy/helm/workflow/values.yaml:97

  • The comment states "With emptyDir (disabled), data is lost on pod restart" but this is slightly misleading. With emptyDir, data is lost when the pod is deleted or rescheduled to a different node, but it survives container restarts within the same pod. A container restart (e.g., due to a crash) does not lose emptyDir data.

Consider clarifying: "With emptyDir (disabled), data is lost when the pod is deleted or rescheduled — plugins must be re-fetched."

# With emptyDir (disabled), data is lost on pod restart — plugins must be re-fetched.

Comment on lines +14 to +15
{{- if .Values.persistence.storageClass }}
storageClassName: {{ .Values.persistence.storageClass }}

Copilot AI Feb 25, 2026


The storageClassName condition uses a simple if check, but when storageClass is an empty string (""), it will not render the storageClassName field. However, in Kubernetes, explicitly setting storageClassName: "" has a special meaning: it requests a PV with no storage class (disables dynamic provisioning).

The current logic treats empty string as "not set", which may not be the intended behavior. Consider using if ne .Values.persistence.storageClass "" to explicitly check for non-empty string, or document that empty string should not be used if the user wants to disable dynamic provisioning.

Suggested change:

- {{- if .Values.persistence.storageClass }}
- storageClassName: {{ .Values.persistence.storageClass }}
+ {{- if hasKey .Values.persistence "storageClass" }}
+ storageClassName: {{ .Values.persistence.storageClass | quote }}


# Plugin and runtime data persistence
# With emptyDir (disabled), data is lost on pod restart — plugins must be re-fetched.
# Enable to retain downloaded plugins and runtime state across restarts.

Copilot AI Feb 25, 2026


The documentation should explicitly warn about the compatibility issues between persistence and autoscaling/multiple replicas. The comment mentions data loss on pod restart and plugin re-fetching, but doesn't mention the critical limitation that ReadWriteOnce access mode is incompatible with:

  • Autoscaling (autoscaling.enabled=true)
  • Multiple replicas (replicaCount > 1)
  • Rolling updates across different nodes

Suggest adding a warning like: "Note: Default ReadWriteOnce mode supports only single replica. For multiple replicas or autoscaling, use ReadWriteMany access mode (requires compatible storage class) or disable persistence."

Suggested change:

- # Enable to retain downloaded plugins and runtime state across restarts.
+ # Enable to retain downloaded plugins and runtime state across restarts.
+ # Note: Default ReadWriteOnce mode supports only a single replica. For autoscaling (autoscaling.enabled=true),
+ # multiple replicas (replicaCount > 1), or rolling updates across different nodes, use ReadWriteMany
+ # (requires a compatible StorageClass) or disable persistence.

Comment on lines +98 to +103
{{- if .Values.persistence.enabled }}
persistentVolumeClaim:
  claimName: {{ include "workflow.fullname" . }}-data
{{- else }}
emptyDir: {}
{{- end }}

Copilot AI Feb 25, 2026


When persistence is enabled with ReadWriteOnce access mode, the deployment should include podAffinity or topologySpreadConstraints to ensure all pods are scheduled on the same node. Without this, the default scheduler may place pods on different nodes, causing mount failures during rolling updates or scaling operations.

Consider adding a conditional block in the deployment that sets podAffinity when persistence.enabled=true and persistence.accessMode=ReadWriteOnce. For example:

{{- if and .Values.persistence.enabled (eq .Values.persistence.accessMode "ReadWriteOnce") }}
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          {{- include "workflow.selectorLabels" . | nindent 12 }}
      topologyKey: kubernetes.io/hostname
{{- end }}

This ensures all replicas are co-located on the same node, allowing them to share the RWO volume.

  labels:
    {{- include "workflow.labels" . | nindent 4 }}
  annotations:
    # Keep PVC on helm uninstall to prevent data loss

Copilot AI Feb 25, 2026


The helm.sh/resource-policy: keep annotation means the PVC will persist after helm uninstall, which is the intended behavior for data retention. However, this has operational implications that should be documented:

  1. When reinstalling with a different release name, a new PVC will be created (old data not reused)
  2. When reinstalling with the same release name, the old PVC will be reused (data preserved)
  3. Manual PVC cleanup is required after helm uninstall if data is no longer needed
  4. This could lead to orphaned PVCs accumulating over time

Consider adding a note in the comment or values.yaml documentation about PVC lifecycle management and cleanup procedures.

Suggested change:

- # Keep PVC on helm uninstall to prevent data loss
+ # Keep PVC on helm uninstall to prevent data loss.
+ # Note:
+ # - This PVC will persist after `helm uninstall` and must be deleted manually if no longer needed.
+ # - Reinstalling the chart with the SAME release name will reuse this existing PVC (data preserved).
+ # - Reinstalling the chart with a DIFFERENT release name will create a new PVC (old data not reused).
+ # - Over time, this can lead to orphaned PVCs accumulating; ensure you have a cleanup process in place.

persistence:
  enabled: false
  storageClass: ""
  accessMode: ReadWriteOnce

Copilot AI Feb 25, 2026


The default accessMode ReadWriteOnce conflicts with the rolling update strategy and potential multi-replica scenarios. With RWO, the PVC can only be mounted by pods on a single node. During a rolling update with maxSurge: 1, if the new pod is scheduled on a different node than the old pod, it cannot mount the PVC, causing the deployment to fail.

Consider:

  1. Document this limitation prominently in the comment above (e.g., "ReadWriteOnce requires all pods on same node; scaling beyond 1 replica requires ReadWriteMany")
  2. Consider defaulting to ReadWriteMany if the storage backend supports it
  3. Add validation or warnings when persistence.enabled=true with replicaCount > 1 or autoscaling.enabled=true

This is particularly critical because autoscaling is supported in this chart (can scale to 10 replicas), but would be incompatible with RWO persistence.
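For illustration, on a cluster whose storage backend supports shared volumes, a user could override the defaults like this (the StorageClass name here is hypothetical; any RWX-capable class works):

```yaml
# values override for multi-replica installs (sketch)
persistence:
  enabled: true
  storageClass: "nfs-client"   # hypothetical RWX-capable StorageClass
  accessMode: ReadWriteMany    # allows pods on different nodes to share the volume
  size: 1Gi
```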

Comment on lines +11 to +15
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

Copilot AI Feb 25, 2026


The RollingUpdate strategy with maxSurge: 1 and maxUnavailable: 0 can cause deployment failures when combined with persistence.enabled=true and the default ReadWriteOnce access mode. During the rolling update, both old and new pods may need to mount the PVC simultaneously, but RWO only allows one pod to mount it at a time (and only on the same node).

Consider adding a note in the deployment that this strategy requires either:

  1. Persistence disabled (emptyDir)
  2. ReadWriteMany access mode for persistence
  3. Node affinity to ensure pods stay on the same node
  4. Or use Recreate strategy when persistence is enabled with RWO

Alternatively, the chart could dynamically adjust the strategy based on persistence.enabled and persistence.accessMode values.

Suggested change:

- strategy:
-   type: RollingUpdate
-   rollingUpdate:
-     maxSurge: 1
-     maxUnavailable: 0
+ strategy:
+ {{- if and .Values.persistence.enabled (eq (default "ReadWriteOnce" .Values.persistence.accessMode) "ReadWriteOnce") }}
+   type: Recreate
+ {{- else }}
+   type: RollingUpdate
+   rollingUpdate:
+     maxSurge: 1
+     maxUnavailable: 0
+ {{- end }}

