
Previously working step no longer working as expected... #1922

@pommetjehorlepiep

Description


Dagu deployed using the Dagu Helm chart, running in Kubernetes (1.34)
The step works in Dagu version 2.3.1
The step fails in Dagu version >= 2.3.8, the earliest post-2.3.1 version in which I was able to run the step at all (due to an unrelated issue that was fixed in 2.3.8)

Problem

I have a step that runs kubectl commands. This step runs fine in Dagu version 2.3.1.

Setup required to make this work:

  • kubectl installed in the Dagu worker container
  • ServiceAccount added (to obtain the ServiceAccountToken needed to access the k8s cluster)
  • ClusterRole added
  • ClusterRoleBinding added
  • the Dagu Helm chart updated accordingly

All of this worked fine in Dagu <= 2.3.1

Simple example definition

steps:
  - command: kubectl get pods

Helm chart changes

ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: dagu
  labels:
    app.kubernetes.io/name: dagu
    app.kubernetes.io/instance: dagu
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: dagu-1.0.4
automountServiceAccountToken: true

ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dagu-worker
  labels:
    app.kubernetes.io/name: dagu
    app.kubernetes.io/instance: dagu
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: dagu-1.0.4
rules:
- apiGroups:
    - ""
  resources:
    - pods
  verbs:
    - delete
    - get
    - list
    - watch
    # note: "exec" is not a valid verb on pods; exec access is granted via the pods/exec rule below
- apiGroups:
    - batch
  resources:
    - jobs
  verbs:
    - create
    - delete
    - get
    - list
    - watch
- apiGroups:
    - apps
  resources:
    - deployments
    - replicasets
    - statefulsets
  verbs:
    - get
    - list
    - watch
- apiGroups:
    - apps
  resources:
    - deployments/scale
    - statefulsets/scale
  verbs:
    - get
    - patch
    - update
- apiGroups:
    - ""
  resources:
    - pods/exec
  verbs:
    - create
- apiGroups:
    - ""
  resources:
    - pods/log
  verbs:
    - get

ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: dagu-worker-binding
  labels:
    app.kubernetes.io/name: dagu
    app.kubernetes.io/instance: dagu
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: dagu-1.0.4
subjects:
- kind: ServiceAccount
  name: dagu
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: dagu-worker

Worker deployment template

{{- range $poolName, $pool := .Values.workerPools }}
{{- if not (regexMatch "^[a-z][a-z0-9-]*$" $poolName) }}
{{- fail (printf "invalid workerPool name %q: must match ^[a-z][a-z0-9-]*$" $poolName) }}
{{- end }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "dagu.fullname" $ }}-worker-{{ $poolName }}
  labels:
    {{- include "dagu.labels" $ | nindent 4 }}
    app.kubernetes.io/component: worker
    daguit.dev/worker-pool: {{ $poolName }}
spec:
  replicas: {{ $pool.replicas }}
  selector:
    matchLabels:
      {{- include "dagu.selectorLabels" $ | nindent 6 }}
      app.kubernetes.io/component: worker
      daguit.dev/worker-pool: {{ $poolName }}
  template:
    metadata:
      labels:
        {{- include "dagu.labels" $ | nindent 8 }}
        app.kubernetes.io/component: worker
        daguit.dev/worker-pool: {{ $poolName }}
    spec:
      # Disable Kubernetes Service env var injection to avoid overriding
      # dagu config values (e.g., scheduler.port) with Service URLs
      enableServiceLinks: false
      serviceAccountName: dagu
      containers:
        - name: worker
          image: "{{ $.Values.image.repository }}:{{ $.Values.image.tag }}"
          imagePullPolicy: {{ $.Values.image.pullPolicy }}
          command:
            - dagu
            - worker
            - --config
            - /etc/dagu/dagu.yaml
            {{- if $pool.labels }}
            - --worker.labels
            - {{ include "dagu.workerLabels" $pool.labels | quote }}
            {{- end }}
          env:
            - name: DAGU_HOME
              value: /data
            - name: DAGU_WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: health
              containerPort: {{ $.Values.worker.healthPort }}
              protocol: TCP
          volumeMounts:
            - name: data
              mountPath: /data
            - name: config
              mountPath: /etc/dagu
            - name: sa-token
              mountPath: /serviceaccount
              readOnly: true
          livenessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: health
            initialDelaySeconds: 5
            periodSeconds: 5
          resources:
            {{- toYaml $pool.resources | nindent 12 }}
      {{- with $pool.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with $pool.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with $pool.affinity }}
      affinity:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: {{ include "dagu.fullname" $ }}-data
        - name: config
          configMap:
            name: {{ include "dagu.fullname" $ }}-config
        - name: sa-token
          projected:
            sources:
              - serviceAccountToken:
                  path: token
                  expirationSeconds: 3600
{{- end }}

Dagu 2.3.1 output

NAME                                  READY   STATUS    RESTARTS      AGE
...
dagu-coordinator-59bb4d4b4d-ssdlx     1/1     Running   0             51s
dagu-scheduler-6c8c94985-dqkmx        1/1     Running   0             51s
dagu-ui-6b7f8587c4-6cqj4              1/1     Running   0             51s
dagu-worker-general-5d66564b-7sqg8    1/1     Running   0             51s
dagu-worker-general-5d66564b-bvv8c    1/1     Running   0             45s
dagu-worker-general-5d66564b-c8tzh    1/1     Running   0             51s
dagu-worker-general-5d66564b-hf56x    1/1     Running   0             44s
dagu-worker-general-5d66564b-ptplv    1/1     Running   0             51s
metallb-controller-765c495b75-lklnp   1/1     Running   0             41d
metallb-speaker-88r7s                 4/4     Running   6 (41d ago)   76d
metallb-speaker-fr8z2                 4/4     Running   6 (41d ago)   76d
metallb-speaker-lhs5t                 4/4     Running   6 (41d ago)   76d
metallb-speaker-vk228                 4/4     Running   6 (41d ago)   76d
metallb-speaker-wmcgb                 4/4     Running   6 (41d ago)   76d
...

Dagu >= 2.3.8 output

E0401 13:04:24.195945      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.196438      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.197924      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.198283      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.199746      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
--- error ---
calhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.196438      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.197924      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.198283      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
E0401 13:04:24.199746      22 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"http://localhost:8080/api?timeout=32s\": dial tcp [::1]:8080: connect: connection refused"
The connection to the server localhost:8080 was refused - did you specify the right host or port?
 
--- script content ---
       1: kubectl get pods
---
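For reference, kubectl falls back to localhost:8080 when it finds neither a kubeconfig nor a usable in-cluster configuration (the KUBERNETES_SERVICE_HOST/KUBERNETES_SERVICE_PORT env vars plus the default token mount). A hypothetical diagnostic step to see what the step's environment actually contains:

```shell
# Hypothetical diagnostic step: dump the settings kubectl's in-cluster
# fallback depends on, as seen from inside the Dagu step.
TOKEN_DIR=/var/run/secrets/kubernetes.io/serviceaccount
echo "KUBECONFIG=${KUBECONFIG:-<unset>}"
echo "KUBERNETES_SERVICE_HOST=${KUBERNETES_SERVICE_HOST:-<unset>}"
echo "KUBERNETES_SERVICE_PORT=${KUBERNETES_SERVICE_PORT:-<unset>}"
if [ -r "$TOKEN_DIR/token" ]; then
  echo "default token mount: readable"
else
  echo "default token mount: NOT readable"
fi
```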

This is basically the same error you get when the ServiceAccountToken is not accessible in the container...
Which is weird, because it is accessible in the container!

Running kubectl exec -ti <dagu worker pod> -- mount | sort shows

/dev/nvme0n1p2 on /dev/termination-log type ext4 (rw,relatime)
/dev/nvme0n1p2 on /etc/dagu type ext4 (ro,relatime)
/dev/nvme0n1p2 on /etc/hosts type ext4 (rw,relatime)
cgroup2 on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)
csi-cephfs-node.1@fd902db0-38df-42f9-9f8a-13984e978c48.cephfs=/volumes/csi/csi-vol-a43d6495-3cf8-4161-aee1-1362c7e93672/b5c8b265-3c88-4e7a-8465-c834beb2fef5 on /data type ceph (rw,relatime,name=csi-cephfs-node.1,secret=<hidden>,acl,mon_addr=10.105.204.197:6789/10.102.124.85:6789/10.110.254.72:6789)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containers/storage/overlay/l/LMWL4SOWOPOTERLN4RKDFL3ZEO:/var/lib/containers/storage/overlay/l/6XARVL5MUEFZYVSFARZYVZ5O7D:/var/lib/containers/storage/overlay/l/G2W4NYK5LDNBS4OWMLOKRQNFZZ:/var/lib/containers/storage/overlay/l/IZJDEP7IJRALGNXQ4SA2ETGX4F,upperdir=/var/lib/containers/storage/overlay/cf88184480e428dcccf4f7bec6bf1934af43642207de3dd2955c8e0c52c06eda/diff,workdir=/var/lib/containers/storage/overlay/cf88184480e428dcccf4f7bec6bf1934af43642207de3dd2955c8e0c52c06eda/work,uuid=on,volatile,nouserxattr)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
shm on /dev/shm type tmpfs (rw,relatime,size=65536k,inode64)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755,inode64)
tmpfs on /etc/hostname type tmpfs (rw,relatime,size=3275676k,mode=755,inode64)
tmpfs on /etc/resolv.conf type tmpfs (rw,nosuid,nodev,noexec,relatime,size=3275676k,mode=755,inode64)
tmpfs on /proc/acpi type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /proc/asound type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /proc/scsi type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /run/.containerenv type tmpfs (rw,relatime,size=3275676k,mode=755,inode64)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=262144k,inode64,noswap)
tmpfs on /serviceaccount type tmpfs (ro,relatime,size=262144k,inode64,noswap)
tmpfs on /sys/devices/system/cpu/cpu0/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu1/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu2/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu3/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu4/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu5/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu6/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/system/cpu/cpu7/thermal_throttle type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/devices/virtual/powercap type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
tmpfs on /sys/firmware type tmpfs (ro,relatime,size=3275676k,mode=755,inode64)
udev on /proc/interrupts type devtmpfs (ro,relatime,size=16337712k,nr_inodes=4084428,mode=755,inode64)
udev on /proc/kcore type devtmpfs (ro,relatime,size=16337712k,nr_inodes=4084428,mode=755,inode64)
udev on /proc/keys type devtmpfs (ro,relatime,size=16337712k,nr_inodes=4084428,mode=755,inode64)
udev on /proc/latency_stats type devtmpfs (ro,relatime,size=16337712k,nr_inodes=4084428,mode=755,inode64)
udev on /proc/timer_list type devtmpfs (ro,relatime,size=16337712k,nr_inodes=4084428,mode=755,inode64)

and kubectl exec -ti <dagu worker pod> -- ls -l /serviceaccount

total 0
lrwxrwxrwx    1 root     root            12 Apr  1 13:03 token -> ..data/token

... which clearly shows that the token kubectl needs is available inside the pod.

On top of that, when I run kubectl get pods from within the pod itself, it works as expected.

Summary

  • kubectl get pods fails when run as part of a step in Dagu >= 2.3.8 (possibly in any version > 2.3.1)
  • kubectl get pods runs fine when executed in the pod itself
  • I had a look at the code changes, but couldn't see anything obvious that would impact step behaviour
  • The only thing I can think of is some kind of jail / chroot / hardening being introduced that effectively makes the /serviceaccount mount invisible
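If the cause does turn out to be the step's environment, one possible workaround sketch (untested in this setup) is to bypass the in-cluster fallback entirely by passing the connection details to kubectl explicitly; the paths and the kubernetes.default.svc name below are standard Kubernetes defaults:

```shell
# Possible workaround sketch (untested): pass server, token and CA to kubectl
# explicitly instead of relying on the in-cluster fallback.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
APISERVER="https://${KUBERNETES_SERVICE_HOST:-kubernetes.default.svc}:${KUBERNETES_SERVICE_PORT:-443}"
if command -v kubectl >/dev/null 2>&1 && [ -r "$SA_DIR/token" ]; then
  kubectl get pods \
    --server="$APISERVER" \
    --token="$(cat "$SA_DIR/token")" \
    --certificate-authority="$SA_DIR/ca.crt"
else
  echo "kubectl or token unavailable; would have used $APISERVER"
fi
```

If this variant works inside a step while the plain kubectl get pods fails, that would narrow the regression down to environment handling rather than mount visibility.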
