Add Temporal workflow engine to automation chart#516
Conversation
This PR adds Temporal as a dependency for the automation service to provide durable workflow execution for automation runs. ## Changes ### charts/automation/Chart.yaml - Added Temporal chart dependency (version >=1.0.0-0) - Bumped chart version to 0.2.0 ### charts/automation/values.yaml - Added temporal configuration section with: - external: Configuration for connecting to external Temporal cluster - server: Temporal subchart configuration (PostgreSQL persistence) - admintools: Admin tools pod configuration - web: Temporal Web UI configuration - Added worker deployment configuration ### charts/automation/templates/_env.yaml - Added Temporal env vars (AUTOMATION_TEMPORAL_HOST, PORT, NAMESPACE, TASK_QUEUE) - Dynamically sets host based on whether Temporal is deployed as subchart ### charts/automation/templates/temporal-init.yaml (new) - Pre-install hook Job that creates Temporal databases in PostgreSQL - Creates temporal and temporal_visibility databases - Creates temporal_user with appropriate permissions - Enables reusing parent chart's PostgreSQL instance ### charts/automation/templates/worker-deployment.yaml (new) - Separate deployment for Temporal workers - Waits for Temporal to be ready before starting - Runs automation.temporal.worker module ### charts/openhands/values.yaml - Added temporal configuration in automation subchart values - Pre-configured to use parent's PostgreSQL (oh-main-postgresql:5432) - Added worker configuration ## DB Reuse Pattern When automation is deployed as a subchart of openhands: 1. temporal.enabled=true deploys Temporal pods 2. temporal-init job creates databases in parent's PostgreSQL 3. Temporal server connects to parent's PostgreSQL via connectAddr 4. automation worker connects to Temporal via internal service name This follows the same pattern as other subcharts (e.g., keycloak) that reuse the parent's PostgreSQL instance. --- *This PR was created by an AI assistant (OpenHands) on behalf of the user.* Co-authored-by: openhands <openhands@all-hands.dev>
- Bump openhands chart version from 0.3.11 to 0.3.12 - Update automation dependency version from 0.1.0 to 0.2.0 This is required since values.yaml was modified to add Temporal configuration. Co-authored-by: openhands <openhands@all-hands.dev>
The temporal-db-init job was configured as a pre-install/pre-upgrade Helm hook, which caused a deadlock during deployment: 1. Helm hooks run before main release resources are created 2. The job waits for PostgreSQL to be available 3. PostgreSQL is deployed as a subchart in the main release 4. Deadlock: hook waits for PostgreSQL, but PostgreSQL can't deploy until hooks complete Fix: Remove the hook annotations so the job deploys alongside the main release. The job already has a built-in wait loop (60 iterations × 5 seconds = 5 minutes) that will wait for PostgreSQL to become available. Co-authored-by: openhands <openhands@all-hands.dev>
…config - Add connectProtocol: 'tcp' to both default and visibility datastores (required by Temporal chart 1.0.0-rc.3 for SQL connections) - Set createDatabase: false since databases are created by the temporal-db-init job, avoiding permission issues with temporal_user Co-authored-by: openhands <openhands@all-hands.dev>
The Temporal subchart creates services named {release-name}-frontend,
not {release-name}-temporal-frontend. Updated the templates to use
the correct service name pattern.
Co-authored-by: openhands <openhands@all-hands.dev>
Two changes to fix conflicts between ddtrace and Temporal's workflow sandbox: 1. Worker deployment: Remove ddtrace-run wrapper since it conflicts with Temporal's workflow sandbox import system (beartype circular import) 2. API deployment: Set AUTOMATION_SKIP_WORKER=true when temporal.enabled, so the API doesn't create an in-process worker (which would fail with ddtrace). The separate worker deployment handles workflow execution. Co-authored-by: openhands <openhands@all-hands.dev>
Add a Job that runs after Temporal is deployed to create the 'default' namespace (or whatever is configured in temporal.namespace). This ensures the automation worker and API can connect to Temporal immediately. The job: - Waits for Temporal frontend to be healthy - Checks if the namespace already exists - Creates the namespace with 72h retention if it doesn't exist Also added temporal.namespace to values.yaml for configurability. Co-authored-by: openhands <openhands@all-hands.dev>
| # When deployed as subchart of openhands: | ||
| # - Temporal reuses the parent's PostgreSQL instance | ||
| # - Set temporal.enabled=true with temporal.server.config.persistence pointing to parent DB | ||
| temporal: |
There was a problem hiding this comment.
We should reconsider provisioning of the temporal cluster inside of this chart.
If we ever want to use Temporal in other parts of our stack (runtime-api, anyone?) we would want to share an instance across them.
Where we need isolation between different application's workflows, namespaces are a perfect fit for that.
As a side note, I'm advocating that we stop this pattern of provisioning dependencies in all of our charts. We generally don't use runtime-api and automation in standalone configurations where they are deployed without first deploying openhands, so I'd like to stop maintaining these additional paths that we never actually exercise.
| external: | ||
| host: "temporal-frontend" | ||
| port: 7233 | ||
| namespace: "default" |
There was a problem hiding this comment.
Let's move these outside the temporal values.
This pattern of adding values that don't actually exist happens elsewhere in this repo, but it's something we should get away from. It's confusing because I have to look at the chart source to understand whether or not a value is actually used in the dependency or just by us, and when we started adding linters to this repo it may fail should the dependency not allow additional properties.
| limits: | ||
| cpu: 200m | ||
| memory: 256Mi |
There was a problem hiding this comment.
What limits does Temporal set by default? Do we know why we need to override them, are the default limits too much, too little?
| Note: This is intentionally NOT a Helm hook because PostgreSQL is deployed | ||
| as part of the same release. Using pre-install/pre-upgrade hooks would create | ||
| a deadlock where this job waits for PostgreSQL, but PostgreSQL can't be | ||
| deployed until the hooks complete. | ||
| */}} |
There was a problem hiding this comment.
Creating jobs without hooks, and leaning on ttlSecondsAfterFinished for cleanup, means that if these jobs fail we have to wait 300 seconds before we can run another helm deploy.
Most fields on a Job are also immutable, so we can't update the job after it's been deployed. We have to wait for it to be cleaned up. Aaaand we have a running timer once the job fails to capture the logs.
Recommend either using an init container or you can create a 1 replica deployment that ends with a container that just runs pause.
Summary
This PR adds Temporal as a dependency for the automation service to provide durable workflow execution for automation runs.
Changes
charts/automation/Chart.yaml
charts/automation/values.yaml
temporalconfiguration section with:external: Configuration for connecting to external Temporal clusterserver: Temporal subchart configuration (PostgreSQL persistence)admintools: Admin tools pod configurationweb: Temporal Web UI configurationworkerdeployment configurationcharts/automation/templates/_env.yaml
AUTOMATION_TEMPORAL_HOST,PORT,NAMESPACE,TASK_QUEUE)charts/automation/templates/temporal-init.yaml (new)
temporalandtemporal_visibilitydatabasestemporal_userwith appropriate permissionscharts/automation/templates/worker-deployment.yaml (new)
automation.temporal.workermodulecharts/openhands/values.yaml
oh-main-postgresql:5432)DB Reuse Pattern
When automation is deployed as a subchart of openhands:
temporal.enabled=truedeploys Temporal podstemporal-initjob creates databases in parent's PostgreSQLconnectAddrThis follows the same pattern as other subcharts (e.g., keycloak) that reuse the parent's PostgreSQL instance.
Deployment Example
To enable Temporal when deploying automation as a subchart:
This PR was created by an AI assistant (OpenHands) on behalf of the user.