Secret/env drift is skipped when manifest content is unchanged #22

@vanducng

Problem

A rotated provider secret in Kubernetes did not reach GoClaw, even though gcplane manages the provider apiKey via writeOnlyHash.

Incident evidence from SHTP on 2026-05-29:

  • Trace 019e726a-b40d-7189-8bb9-5639a9c76963 failed at the first LLM call: HTTP 401: zai-coding: token expired or incorrect.
  • GoClaw provider verify reproduced the failure for both glm-5.1 and glm-5-turbo.
  • The Kubernetes secret GOCLAW_ZAI_CODING_API_KEY was present and non-empty.
  • Live GoClaw provider zai-coding had write_only_hash=58d2....
  • Desired hash from the current Kubernetes secret was 3c87....
  • gcplane plan -f shtp --force with the current env correctly showed Provider/zai-coding writeOnlyHash drift.
  • The running gcplane service logs repeatedly showed manifest unchanged, skipping for tenant shtp, so it never reconciled this secret-only drift.

Manual workaround: we updated only the zai-coding provider via the GoClaw API, supplying the current secret and the desired write_only_hash. Provider verify then passed for both models.

Root Cause

The controller skip is based on manifest source hash only. Secret/env values referenced by ${ENV_VAR} are not part of that hash. If a Kubernetes Secret changes while the mounted git/config content does not, gcplane treats the tenant as unchanged and skips reconciliation.
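
A minimal sketch of why the skip fires, assuming the controller hashes only the raw manifest bytes. The names `sourceHash` and `shouldSkip` and the manifest layout are hypothetical, chosen for illustration; they are not taken from the gcplane source.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sourceHash is a hypothetical stand-in for the controller's manifest hash.
// It covers only the raw manifest bytes, including the literal "${ENV_VAR}"
// placeholder, and never the value that placeholder resolves to.
func sourceHash(manifest []byte) string {
	sum := sha256.Sum256(manifest)
	return hex.EncodeToString(sum[:])
}

// shouldSkip mirrors a skip decision keyed on the manifest hash alone.
func shouldSkip(lastHash string, manifest []byte) bool {
	return sourceHash(manifest) == lastHash
}

func main() {
	// Assumed manifest shape; only the ${...} reference matters here.
	manifest := []byte("kind: Provider\nname: zai-coding\napiKey: ${GOCLAW_ZAI_CODING_API_KEY}\n")
	last := sourceHash(manifest) // recorded after the previous reconcile

	// The Secret is rotated at this point, but the manifest bytes are
	// identical, so a hash-only check still decides to skip.
	fmt.Println("skip after rotation:", shouldSkip(last, manifest))
}
```

Because the rotated value never enters the hash input, no change to the Secret alone can flip the skip decision.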

There is a second Kubernetes-specific caveat: the deployment injects secrets through envFrom, so running pods keep old env values until restarted. Even if the controller did not skip, a running pod may still hold stale secret values after K8s Secret rotation.

Expected Behavior

Provider secret rotations should converge without requiring unrelated manifest edits or manual provider updates.

Suggested Design

Options, from conservative to stronger:

  1. Include a hash of resolved write-only fields in the tenant/source hash for skip decisions. For provider apiKey, the resolved env value should affect the hash without logging/exposing the value.
  2. Re-run reconciliation periodically for resources with write-only fields even when the manifest file hash is unchanged. This can be bounded, e.g. every N intervals or when verifyProviders fails.
  3. Do not skip verifyProviders on unchanged manifests. If provider verification fails and desired write-only hash differs, force a provider update.
  4. For K8s deployments, document or automate rollout restart on gcplane-secrets updates. Better: mount secrets as files and resolve file:// on each reconcile, because mounted Secret volumes update without restarting the pod.
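
Option 1 could be sketched roughly as follows. This is an illustration under assumptions, not the gcplane implementation: `skipKey` and `writeField` are hypothetical names, and the map of resolved values stands in for whatever the resolver produces for write-only fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"hash"
	"sort"
)

// writeField feeds a length-prefixed field into the hash so that adjacent
// fields cannot be confused with each other ("ab"+"c" vs "a"+"bc").
func writeField(h hash.Hash, b []byte) {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(b)))
	h.Write(n[:])
	h.Write(b)
}

// skipKey combines the manifest bytes with the resolved values of write-only
// fields. Only the digest leaves this function; the values are never logged.
func skipKey(manifest []byte, resolved map[string]string) string {
	names := make([]string, 0, len(resolved))
	for name := range resolved {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for a stable hash

	h := sha256.New()
	writeField(h, manifest)
	for _, name := range names {
		writeField(h, []byte(name))
		writeField(h, []byte(resolved[name]))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	manifest := []byte("apiKey: ${GOCLAW_ZAI_CODING_API_KEY}\n")
	// "old-key" and "new-key" are placeholder values for illustration.
	before := skipKey(manifest, map[string]string{"GOCLAW_ZAI_CODING_API_KEY": "old-key"})
	after := skipKey(manifest, map[string]string{"GOCLAW_ZAI_CODING_API_KEY": "new-key"})
	fmt.Println("skip after rotation:", before == after)
}
```

With the resolved value in the hash input, a rotation changes the key and the skip no longer applies. This only helps if the process actually sees the new value, so with envFrom it still depends on a pod restart or on the file-based resolution described in option 4.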

Safety Requirements

  • Never log resolved secret values.
  • Continue to log only write-only hashes, either as prefixes or in full.
  • Avoid broad forced updates of all providers when only one provider key drifted.
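
One way to meet these requirements is to compare write-only hashes per provider and act only on the ones that differ. The sketch below assumes the live and desired hashes are available as maps keyed by provider name; `driftedProviders` and `hashPrefix` are hypothetical helpers, and the hash strings are made-up placeholders rather than values from the incident.

```go
package main

import (
	"fmt"
	"sort"
)

// hashPrefix shortens a write-only hash for log output.
// It operates on the hash only; the secret value is never passed in.
func hashPrefix(h string) string {
	if len(h) <= 4 {
		return h
	}
	return h[:4]
}

// driftedProviders returns, in sorted order, the providers whose desired
// write-only hash differs from the live one. A provider missing from live
// has an empty live hash and is therefore reported as drifted.
func driftedProviders(live, desired map[string]string) []string {
	var out []string
	for name, want := range desired {
		if live[name] != want {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Placeholder hashes for illustration only.
	live := map[string]string{"zai-coding": "aaaa0000", "other": "cccc2222"}
	desired := map[string]string{"zai-coding": "bbbb1111", "other": "cccc2222"}

	for _, name := range driftedProviders(live, desired) {
		fmt.Printf("Provider/%s writeOnlyHash drift: live=%s desired=%s\n",
			name, hashPrefix(live[name]), hashPrefix(desired[name]))
	}
}
```

Only the returned providers would be updated, so an unrelated provider with a matching hash is left untouched.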
