Skip to content

tracking: extend write_only_hash drift detection to remaining 9 kinds #11

Description

@vanducng

Context

writeOnlyHash drift-detection (gcplane reconciler.WriteOnlyHashField + goclaw migrations 000059 cron + 000060 providers, goclaw PR #3 #5 #6 #7 + gcplane PR #10) is shipped for CronJob + Provider as of goclaw v3.12.1 + gcplane v1.5.0.

The remaining 9 manifest kinds with write-only fields are still on the legacy blind-noop path. gcplane v1.5.0's kindsSupportingWriteOnlyHash allowlist gate prevents an infinite update loop on those kinds (good), but drift in their write-only fields is never detected (bad) — same failure mode the zai-coding 401 incident exposed for Provider.apiKey. Rotating any of these write-only fields silently does nothing until someone bumps an observable field.

Affected kinds (from internal/manifest/field_config.go::writeOnlyFields)

Priority Kind Write-only fields Transport Pattern template
P1 Channel agentKey, credentials, config HTTP Provider (goclaw PR #5)
P1 MCPCredentials credentials HTTP Provider (goclaw PR #5)
P1 SecureCLI env (encrypted env vars) HTTP Provider (goclaw PR #5)
P2 Agent contextFiles, systemPrompt, 12+ JSONB configs HTTP Provider (goclaw PR #5)
P2 MCPServer grants HTTP Provider (goclaw PR #5)
P3 AgentTeam lead, members, displayName WS-RPC CronJob (goclaw PR #3)
P3 Skill sourceDir, version HTTP Provider (goclaw PR #5)
P3 SecureCLIGrant agentKey, binaryName (manifest refs) HTTP Provider (goclaw PR #5)
P3 AgentLink sourceAgent, targetAgent, settings WS-RPC CronJob (goclaw PR #3)

AgentTeam.displayName being write-only is suspicious — investigate whether it's actually observable but mis-classified. ‡ Skill.sourceDir/version are local-only / server-managed; hash mostly catches "manifest changed but goclaw lost it" cases.

P1 = credential-rotation impacted (same failure mode as zai-coding 401). Channel rotations are most frequent in practice (telegram/slack/zalo bot tokens).

Per-kind goclaw checklist (~11 mechanical edits each)

Follow goclaw#5 for HTTP-transport kinds, goclaw#3 for WS-RPC kinds:

  • migrations/0000XX_KIND_write_only_hash.up.sqlALTER TABLE <table> ADD COLUMN write_only_hash TEXT NOT NULL DEFAULT '';
  • migrations/0000XX_KIND_write_only_hash.down.sql — DROP COLUMN
  • internal/upgrade/version.go — bump RequiredSchemaVersion by 1 (the version_test.go guard fails CI if you forget)
  • internal/store/KIND_store.go — add WriteOnlyHash string with json:\"write_only_hash,omitempty\" db:\"write_only_hash\" (HTTP) or json:\"writeOnlyHash\" (WS-RPC)
  • internal/http/validate.go — add \"write_only_hash\": true to the kind's allowlist (HTTP only)
  • HTTP handler — Update handler binds field from updates map (usually flows through automatically once allowlist + struct align; verify Create path's JSON decode picks it up)
  • internal/gateway/methods/KIND.go — Add WriteOnlyHash to inline params struct + apply via patch (WS-RPC only — see cron.go:74-114)
  • internal/store/pg/KIND.go — INSERT column + placeholder + UPSERT EXCLUDED.write_only_hash; add to all SELECT lists
  • internal/store/sqlitestore/KIND.go — same as pg; bump SchemaVersion const + add ALTER to incremental migrations map in schema.go
  • internal/store/sqlitestore/sqlx_scan_structs.go — add field to scan struct + mapper (if HTTP kind uses sqlx scan struct)
  • internal/store/sqlitestore/schema.sql — add column to fresh-DB CREATE TABLE
  • internal/store/sqlitestore/schema_migration_test.go — DROP COLUMN fixup for targetVersion < NEW so v11→current migration path still works
  • internal/store/sqlitestore/KIND_write_only_hash_test.go — round-trip test (mirror providers_write_only_hash_test.go or cron_crud_test.go::TestSQLiteCronStore_WriteOnlyHashRoundtrip)

Then in this repo (gcplane), one-line per kind:

  • internal/manifest/field_config.go::kindsSupportingWriteOnlyHash — add the kind

Per-kind gotchas to verify before changes

  • Channel — has 3 write-only fields; ensure the hash captures all 3 in HashWriteOnlyFields ordering (it sorts by field name, so should be fine, but verify the test confirms agentKey + credentials + config round-trip)
  • MCPCredentials — single-field, simplest case
  • SecureCLIenv is a JSONB map of encrypted vars; ensure hash is stable for map ordering (HashWriteOnlyFields uses json.Marshal of a slice-of-pairs which preserves order — but map iteration in spec build must be sorted upstream)
  • Agent — 12+ write-only fields; hash captures a lot. Watch for false-drift if reconciler injects defaults that the API doesn't echo
  • AgentTeam — investigate displayName mis-classification first; the actual write-only set may be just lead + members
  • MCPServergrants is JSONB; same map-ordering concern as SecureCLI.env
  • AgentLink/SecureCLIGrant — write-only fields are manifest references (slugs) resolved to UUIDs at create time; the hash captures the manifest's intent rather than DB state. Confirm this is the right semantic before adding.
  • SkillsourceDir is local-fs path on the gcplane runner, never sent to goclaw. Verify the hash excludes it via ParseIgnoreFields annotation pattern or by removing from spec before hashing.

Suggested rollout order

  1. MCPCredentials — simplest (single field), lowest blast radius, validates the cron→provider→Nth-kind generalization
  2. SecureCLI — single field (env), high value (encrypted CLI secrets)
  3. Channel — biggest win operationally, but 3 fields means most edge cases
  4. MCPServer then Agent — broader scope
  5. AgentTeam + AgentLink + SecureCLIGrant — investigate semantics first
  6. Skill — lowest priority

Verification (per-kind, post-deploy)

# After deploying goclaw with the new migration + gcplane with the allowlist add:
kubectl logs deploy/gcplane | grep \"write-only hash mismatch\" | grep kind=KIND_NAME
# Expect: one mismatch+update per existing row, then steady-state silence.

kubectl logs deploy/gcplane | grep \"no-op may hide drift\" | grep kind=KIND_NAME
# Expect: empty (the WARN that exists today should disappear).

Open questions

  • Should we generate the boilerplate (codegen for the 5-layer touch)? After 11 kinds × 11 edits, the manual pattern is tedious but each kind has subtle differences (different SQL helper, different scan struct shape). Codegen overhead probably not worth it for ≤11 kinds.
  • Should kindsSupportingWriteOnlyHash be a per-kind annotation on the Resource definition instead of a global map? Reduces drift between gcplane and goclaw releases (currently a manual sync).
  • Worth adding a CI check that fails if a goclaw migration adds write_only_hash to a new table but gcplane's allowlist isn't updated in the same release?

References

  • Original incident report: infra/plans/reports/debugger-260521-1113-zai-coding-401-provider-resync.md
  • Provider rollout plan: gcplane/plans/260521-1127-provider-writeonly-hash-resync/
  • Memory note: infra/memory/gcplane-provider-resync.md
  • Shipped artifacts: goclaw #5 #6 #7, gcplane #10, tags goclaw v3.12.1 + gcplane v1.5.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions