Manifest-changed apply silently no-ops when only unobservable fields differ (e.g. CronJob.message) #9

@vanducng

TL;DR

When a manifest edit changes only fields that are absent from the resource's list-API response (e.g. `CronJob.message`, `CronJob.agentKey`, `Agent.toolsConfig`, `Provider.apiKey`), gcplane logs a `resource no-op may hide drift in unobservable fields` warning and skips the apply, even though the manifest hash changed and the controller correctly entered the reconcile loop.

Net effect for GitOps users: a `git push` that rewrites a CronJob `message` reports success at every layer (CI, git-sync, gcplane controller), but the new message never reaches GoClaw's DB. The cron keeps firing with the previous prompt until some observable field also drifts.

Reproduction

Real incident, May 21 2026, prod everest cluster, tenant annhien:

  1. Pushed goclaw-config@2666f4a — rewrote the bao-cao-doanh-so-17h cron's message field (~22 KB → ~1.1 KB). No other field touched.
  2. git-sync picked up the commit within 30s. gcplane controller fetched the new manifest, computed a new hash, entered reconcile.
  3. Reconciler observed all 3 CronJob resources, found observable surface (schedule, enabled, deliver*) matched DB → marked no-op + logged the warning:
    ```
    level=WARN msg="resource no-op may hide drift in unobservable fields" tenant=annhien kind=CronJob name=bao-cao-doanh-so-17h unobservable_fields="[agentKey message]" hint="this field is not returned by the GoClaw list API; the reconciler cannot detect drift in it. Force re-apply by toggling another observable field (e.g. enabled) or by deleting and re-creating the resource."
    ```
  4. `reconcile complete tenant=annhien creates=0 updates=3 noops=33 applied=3 failed=0` — the 3 "applied" were unrelated MCPServer drift; the CronJobs landed in the 33 noops.
  5. Subsequent reconciles correctly observed `hash unchanged, skipping`. The new prompt was effectively un-deployable through GitOps.
  6. Manual recovery: forced observable drift by changing `tz: Asia/Ho_Chi_Minh` → `tz: Asia/Saigon` (two IANA names for the same zone; `Asia/Saigon` is the legacy alias). On the next reconcile the resources re-applied and the message body flowed through with them.

Why this is severe

  • GitOps trust assumption violated. Users (correctly) expect that any manifest change pushed through the normal git-sync path actually deploys. A successful push followed by a successful reconcile cycle that silently retains stale state is invisible unless someone reads the logs.
  • The warning is unhelpful for the realistic case. The hint says "toggle another observable field" — but for CronJobs the only safe observable toggle requires editing a field that has no semantic meaning (we used a tz-alias change). For other resource kinds the only suggestion ("delete and re-create") implies downtime.
  • Compounds with the pod-restart no-op bug. A pod restart (e.g. healthcheck failure → rollout) does NOT recover this state — the fresh pod also no-ops because the observable surface still matches DB. We verified this: after `kubectl rollout restart deployment/gcplane` the new pod's first reconcile still showed `CronJob ... no-op may hide drift`.

Source-level analysis

  • `internal/controller/controller.go:174-184` — `if hash == c.lastHash` skip is correct (controller-level dedup).
  • `internal/reconciler/engine.go:380-403` — emits the warning then returns without scheduling an Update. The decision to no-op is made purely on observable diff, with no consideration that the manifest itself just changed.
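To make the failure mode concrete, here is a minimal Go model of an observable-only comparison. It is not gcplane's actual code: the `CronJobSpec` and `ObservedCronJob` types, the `observableDiff` helper, and the sample schedule value are all hypothetical and exist only to illustrate why a `message`-only edit produces an empty diff.

```go
package main

import "fmt"

// CronJobSpec is a hypothetical, trimmed-down model of a CronJob manifest.
type CronJobSpec struct {
	Schedule string
	Enabled  bool
	Message  string // not returned by the list API (unobservable)
	AgentKey string // not returned by the list API (unobservable)
}

// ObservedCronJob models what a list API would return: observable fields only.
type ObservedCronJob struct {
	Schedule string
	Enabled  bool
}

// observableDiff returns the names of observable fields that differ.
// Message and AgentKey cannot be compared because the observed side lacks them.
func observableDiff(desired CronJobSpec, observed ObservedCronJob) []string {
	var diff []string
	if desired.Schedule != observed.Schedule {
		diff = append(diff, "schedule")
	}
	if desired.Enabled != observed.Enabled {
		diff = append(diff, "enabled")
	}
	return diff
}

func main() {
	// Illustrative values; the real schedule is not part of this report.
	observed := ObservedCronJob{Schedule: "0 17 * * *", Enabled: true}
	desired := CronJobSpec{
		Schedule: "0 17 * * *",
		Enabled:  true,
		Message:  "rewritten prompt",
		AgentKey: "sales-analyst",
	}
	// The diff is empty although Message changed, so the resource is a no-op.
	fmt.Println("observable diff:", observableDiff(desired, observed))
}
```

In this model, the only way to get a non-empty diff is to touch `Schedule` or `Enabled`, which matches the workaround the warning hint suggests.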

Proposed fix (one of)

Option A — smart default (preferred): When a resource's observable surface diff is empty but it has non-empty unobservable fields in the manifest, AND the controller-level hash transitioned this cycle, treat it as drift and emit an Update. The Update already sends the full spec including unobservable fields, so the fix is small and contained to `stepCompare` in `engine.go`.

Sketch:
```go
// in engine.go around the unobservable-field warning
if len(present) > 0 && opts.ManifestChangedThisCycle {
	// Manifest hash transitioned this cycle and we have unobservable fields
	// that could have changed. Treat as drift to avoid silent no-op.
	rc.action = ActionUpdate
	rc.reason = "manifest hash transitioned with unobservable fields present"
	return
}
```

The controller would need to pass a `ManifestChangedThisCycle bool` into `ReconcileOpts` (trivial — it already knows from the `hash != c.lastHash` branch).
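A self-contained sketch of how the two pieces might fit together is below. Everything except the `ManifestChangedThisCycle` field name, `ActionUpdate`, and the reason string is an assumption made for illustration: the `Controller.Begin` method, the `decide` helper, and the `ActionNoop` constant are hypothetical stand-ins for the real controller and `stepCompare` code, which I have not reproduced here.

```go
package main

import "fmt"

// Action is the outcome of comparing one resource (hypothetical type).
type Action string

const (
	ActionNoop   Action = "noop"
	ActionUpdate Action = "update"
)

// ReconcileOpts carries per-cycle context from the controller to the engine.
type ReconcileOpts struct {
	ManifestChangedThisCycle bool
}

// Controller remembers the last manifest hash it reconciled.
type Controller struct {
	lastHash string
}

// Begin reports whether a reconcile should run for this hash and, if so,
// which options to pass down. An unchanged hash is skipped, as today.
// Simplification: real code would likely record the hash only after success.
func (c *Controller) Begin(hash string) (ReconcileOpts, bool) {
	if hash == c.lastHash {
		return ReconcileOpts{}, false
	}
	c.lastHash = hash
	return ReconcileOpts{ManifestChangedThisCycle: true}, true
}

// decide models the end of the compare step for a single resource.
func decide(observableDiff, unobservable []string, opts ReconcileOpts) (Action, string) {
	if len(observableDiff) > 0 {
		return ActionUpdate, "observable drift"
	}
	if len(unobservable) > 0 && opts.ManifestChangedThisCycle {
		return ActionUpdate, "manifest hash transitioned with unobservable fields present"
	}
	return ActionNoop, "no drift detected"
}

func main() {
	c := &Controller{lastHash: "old-hash"}

	opts, run := c.Begin("new-hash")
	action, reason := decide(nil, []string{"agentKey", "message"}, opts)
	fmt.Println(run, action, reason)

	// Same hash again: the controller-level dedup still skips the cycle.
	_, run = c.Begin("new-hash")
	fmt.Println(run)
}
```

One consequence worth checking against the real code: if `lastHash` lives only in memory, the first cycle after a pod restart would also count as a transition, so resources with unobservable fields would be re-applied once per restart.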

Option B — explicit opt-in: Add a `gcplane.io/always-reapply: "true"` annotation users can stamp on resources known to have significant unobservable fields. Backwards-compatible but pushes the burden to users.
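For Option B, the check itself could be as small as the sketch below. The annotation key is the one proposed above; where annotations live on a gcplane resource and the `alwaysReapply` helper name are assumptions for illustration.

```go
package main

import "fmt"

// alwaysReapplyAnnotation is the opt-in key proposed in Option B.
const alwaysReapplyAnnotation = "gcplane.io/always-reapply"

// alwaysReapply reports whether a resource opted in to unconditional re-apply.
// Only the exact string "true" counts; a missing key or nil map means no.
func alwaysReapply(annotations map[string]string) bool {
	return annotations[alwaysReapplyAnnotation] == "true"
}

func main() {
	optedIn := map[string]string{alwaysReapplyAnnotation: "true"}
	fmt.Println(alwaysReapply(optedIn), alwaysReapply(nil))
}
```

Because the controller-level hash dedup would still skip unchanged manifests, an opted-in resource would presumably be re-applied once per manifest change rather than on every poll.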

Option C — escalate the warning severity: Promote the warning to an ERROR + non-zero exit on `gcplane apply` when the user did NOT pass `--force`. Cheap to ship, makes the silent failure loud. Doesn't fix `serve` mode though.
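A rough sketch of what Option C could look like in the `gcplane apply` path follows. The `checkHiddenDrift` and `exitCode` helpers, the `ErrHiddenDrift` sentinel, and the specific exit code value are hypothetical; only the `--force` flag and the intent (fail loudly unless forced) come from the proposal above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrHiddenDrift marks an apply whose no-op resources carry unobservable fields.
var ErrHiddenDrift = errors.New("no-op resources have unobservable fields; drift cannot be ruled out")

// checkHiddenDrift returns an error naming the affected resources unless the
// user passed --force or there is nothing to report.
func checkHiddenDrift(noopsWithUnobservable []string, force bool) error {
	if force || len(noopsWithUnobservable) == 0 {
		return nil
	}
	return fmt.Errorf("%w: %v", ErrHiddenDrift, noopsWithUnobservable)
}

// exitCode maps the result to a process exit code (the value 2 is arbitrary).
func exitCode(err error) int {
	if errors.Is(err, ErrHiddenDrift) {
		return 2
	}
	if err != nil {
		return 1
	}
	return 0
}

func main() {
	err := checkHiddenDrift([]string{"CronJob/bao-cao-doanh-so-17h"}, false)
	fmt.Println("error:", err)
	fmt.Println("exit code:", exitCode(err))
}
```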

Option D — push upstream fix to goclaw. Have the cron-list WS API return `message` and `agentKey` so they become observable. Cleanest long-term but blocked on goclaw repo (issues disabled, separate PR cycle).

I'd favour shipping A + D in parallel: A is a 20-line patch that prevents the entire class of silent-no-op bugs across all resource kinds; D closes the gap permanently for CronJob specifically.

Evidence bundle

Pod, reconcile timestamps, hash transitions, and recovery commit are all available in the goclaw-config repo.

Related unobservable-field surfaces noted in the same logs

Worth auditing whether each of these has the same trap:

  • `Provider.apiKey` (anthropic, openai, gemini, openrouter, dashscope, zai-coding) — rotating an API key in YAML would silently no-op.
  • `Agent.contextFiles`, `Agent.toolsConfig` (van-anh, marketing-agent, sales-analyst, support, assistant) — editing tools config silently no-ops.
  • `Channel.agentKey`, `Channel.config`, `Channel.credentials` — rebinding a channel to a different agent silently no-ops.
  • `CronJob.agentKey`, `CronJob.message` — confirmed above.

Each of these is a latent footgun for normal GitOps workflows. Fix A above addresses the whole class.
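Until a fix lands, the list above could be turned into a pre-merge check along these lines. This is only a sketch: the `unobservableFields` table is copied from the bullets above, while the `silentNoopRisk` helper and the idea of feeding it the changed field names from a manifest diff are assumptions. It also assumes that any field not in the table is observable, which would need verifying per resource kind.

```go
package main

import (
	"fmt"
	"sort"
)

// unobservableFields lists, per kind, the fields reported as unobservable.
var unobservableFields = map[string][]string{
	"Provider": {"apiKey"},
	"Agent":    {"contextFiles", "toolsConfig"},
	"Channel":  {"agentKey", "config", "credentials"},
	"CronJob":  {"agentKey", "message"},
}

// silentNoopRisk returns the changed fields, sorted, when every changed field
// is unobservable, meaning the edit would be skipped as a no-op. It returns
// nil if nothing changed or if any changed field is assumed observable.
func silentNoopRisk(kind string, changed []string) []string {
	hidden := make(map[string]bool)
	for _, f := range unobservableFields[kind] {
		hidden[f] = true
	}
	var risky []string
	for _, f := range changed {
		if !hidden[f] {
			return nil // an observable field changed, so the apply proceeds
		}
		risky = append(risky, f)
	}
	sort.Strings(risky)
	return risky
}

func main() {
	fmt.Println(silentNoopRisk("CronJob", []string{"message"}))
	fmt.Println(silentNoopRisk("CronJob", []string{"message", "schedule"}))
}
```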
