DesiredHash mismatch when secret refs appear outside env_vars (e.g. Droplet user_data) #558

@intel352

Description

Surfaced by

Core-dump TC2 staging cutover (jon@langevin.me, 2026-05-05). After the workflow#541 fix landed in v0.21.2 and the corresponding deploy.yml stopgap was removed (dropping `STAGING_PG_PASSWORD` from the Plan env block), `wfctl infra apply --plan plan.json` fails with:

```
error: plan stale: config hash mismatch (run wfctl infra plan again)
```

Root cause

`infraPreserveKeys` in `cmd/wfctl/infra.go` only preserves the `env_vars`, `env_vars_secret`, and `secret_env_vars` submap keys through plan-time serialization. Other config fields that legitimately contain `${VAR}` references, e.g. Droplet `user_data` (a cloud-init script that needs the random_hex secret to provision Postgres), are still substituted by `os.ExpandEnv` at plan time.
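For illustration, a minimal sketch of that gap, assuming plan-time resolution walks the config map and skips only the preserved submaps (the walk and names below are hypothetical, not the actual wfctl code):

```go
package infra

import "os"

// infraPreserveKeys mirrors the submaps the issue says are preserved; the
// resolveConfig walk below is a hypothetical sketch, not the real wfctl code.
var infraPreserveKeys = map[string]bool{
	"env_vars":        true,
	"env_vars_secret": true,
	"secret_env_vars": true,
}

func resolveConfig(cfg map[string]any) {
	for key, val := range cfg {
		if infraPreserveKeys[key] {
			continue // ${VAR} references in these submaps survive to apply time
		}
		if s, ok := val.(string); ok {
			cfg[key] = os.ExpandEnv(s) // user_data is substituted here at plan time
		}
	}
}
```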

When Plan runs WITHOUT the secret in env (post-#541, no need for the stopgap) and Apply runs WITH the secret in env (W-5 JIT for the value during apply-time substitution):

  • Plan-time: `user_data` contains `POSTGRES_PASSWORD: '${STAGING_PG_PASSWORD}'` → `os.ExpandEnv` substitutes to `''` (empty).
  • Apply-time: `user_data` contains the same template → `os.ExpandEnv` substitutes to the actual hex value.

`desiredStateHash(specs)` JSON-serializes the resolved specs and SHA-256s them. The two substitutions diverge → DesiredHash mismatch → `plan stale`.

Reproduction

Core-dump's infra.yaml has a Droplet whose `config.user_data` cloud-init script references `${STAGING_PG_PASSWORD}`:

```yaml
- name: coredump-staging-pg
  type: infra.droplet
  config:
    user_data: |
      #cloud-config
      write_files:
        - path: /opt/coredump-pg/docker-compose.yml
          content: |
            services:
              postgres:
                environment:
                  POSTGRES_PASSWORD: '${STAGING_PG_PASSWORD}'
```

Run `wfctl infra plan` with `STAGING_PG_PASSWORD` unset (R-A4 in v0.21.2 sees the top-level `secrets.generate` declaration and skips the env-var-resolution check), then run `wfctl infra apply` with it set. The hash check fails.

Real example

core-dump deploy.yml run 25380846940 on commit 723a55a8: the Plan step succeeded with 2 updates planned (firewall + container_service), and the Apply step failed with the message above. wfctl v0.21.2 and DO plugin v0.10.1 were in use.

Fix options

Option A (narrow): Add `user_data` to `infraPreserveKeys`. Pros: surgical. Cons: doesn't generalize — every new field that legitimately holds `${VAR}` needs to be added.

Option B (broad, correct): Use a `TolerantEnvProvider`-style preservation sentinel for ALL string-valued config fields whenever the referenced var is in `cfg.Secrets.Generate`. Plan emits the literal, Apply substitutes, and the hash uses the sentinel, so Plan and Apply produce hash-identical specs. This is what `infraPreserveKeys` should have been: every config string referencing a declared secret is preserved through plan and substituted at apply-time driver dispatch (a rough sketch follows the options below).

Option C (workaround): Re-add `STAGING_PG_PASSWORD` (and any other top-level-secret env vars) to the Plan + Validate steps in deploy.yml. This is the pre-#541 stopgap; works but undoes the W-541 cleanup gain.
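For Option B, a rough sketch of the preservation pass, assuming a helper applied to every string-valued config field at plan time; `preserveDeclaredSecrets` and its arguments are hypothetical names, not the existing wfctl API:

```go
package infra

import "os"

// preserveDeclaredSecrets expands env references in a config string, except
// for variables declared under secrets.generate, which are kept as literal
// ${VAR} sentinels. Plan serializes (and hashes) the sentinel form; Apply runs
// the real substitution at driver dispatch. Hypothetical helper, not current code.
func preserveDeclaredSecrets(s string, generated map[string]bool) string {
	return os.Expand(s, func(name string) string {
		if generated[name] {
			return "${" + name + "}" // preserve through plan; substitute at apply
		}
		return os.Getenv(name)
	})
}
```

With something like this in place, Plan and Apply both hash the sentinel form, so the secret only needs to be present in the environment for the apply-time substitution step.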

Recommended

Option B with a follow-up to deprecate `infraPreserveKeys` once `cfg.Secrets.Generate`-aware preservation is the default. Until then, downstream consumers (core-dump, BMW) should re-add the env stopgap on Plan + Validate as documented in the deploy.yml comment block.
