feat(vigil): plugin-hook-liveness canary#558

Merged
cirwel merged 2 commits into master from feat/vigil-plugin-hook-liveness
Jun 2, 2026
@cirwel cirwel commented Jun 2, 2026

Why

On 2026-06-02 the unitares-governance-plugin hook chain was found dark for ~2.5 weeks. Every command in the plugin's hooks.json wrapped ${CLAUDE_PLUGIN_ROOT} in single quotes, which /bin/sh won't expand — so every hook resolved to a literal path and died "No such file". ~/.unitares/checkins.log had frozen at 2026-05-16 and nothing surfaced it. In a project whose thesis is observability of agent behavior, there was no canary asking "are my own hooks even firing?". (Plugin fix: cirwel/unitares-governance-plugin 0.4.5.)

This adds that canary.
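
The quoting failure described above can be reproduced in isolation. The sketch below runs two command strings through /bin/sh, one single-quoted and one double-quoted. The plugin root value and the `hooks/run.sh` path are hypothetical stand-ins, not the plugin's actual hooks.json entries.

```python
import subprocess

# Hypothetical plugin root; the real value is supplied by Claude Code at dispatch time.
env = {"CLAUDE_PLUGIN_ROOT": "/opt/plugin", "PATH": "/usr/bin:/bin"}


def sh_echo(command: str) -> str:
    """Run a command string through /bin/sh and return its stdout."""
    result = subprocess.run(
        ["/bin/sh", "-c", command],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Single quotes suppress parameter expansion, so the path stays literal.
single_quoted = sh_echo("echo '${CLAUDE_PLUGIN_ROOT}/hooks/run.sh'")
# Double quotes allow expansion while still protecting spaces in the path.
double_quoted = sh_echo('echo "${CLAUDE_PLUGIN_ROOT}/hooks/run.sh"')

print(single_quoted)  # ${CLAUDE_PLUGIN_ROOT}/hooks/run.sh
print(double_quoted)  # /opt/plugin/hooks/run.sh
```

Executing the single-quoted form as a command path is what produces "No such file": the shell looks for a file whose name literally begins with `${CLAUDE_PLUGIN_ROOT}`.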

Design

The load-bearing constraint: the canary is external to the hook chain. A heartbeat written by the hooks can't detect the hooks being un-dispatchable — the exact failure was that the wrapper never ran, so anything it was meant to touch never got touched. A self-reported signal is blind to its own death.

So PluginHookLiveness (running inside Vigil, a separate launchd process) compares two independent signals:

  • Ground truth: ~/.claude/history.jsonl (advances on every prompt; the plugin can't suppress it).
  • Hook artifact: newest mtime across checkins.log (send) + hook-skips.log (gated skip). Newest-wins means a healthy-but-gated chain still reads alive: the signal is "did a hook dispatch", not "did a check-in record".

It's a divergence test, not a staleness threshold: it fires only when there's recent CC activity and every hook artifact is stale — distinguishing "hooks dark" from "operator idle" so a quiet weekend doesn't page anyone.
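
The divergence logic above can be sketched as a pure function. This is a minimal illustration, not the code merged in this PR: the signature, the `Assessment` type, and the `operator_idle` reason string are assumptions; only the 12h/24h defaults, the "live" reading, and the `plugin_hook_chain_dark` key come from the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Assessment:
    # Hypothetical result type; the real check emits a Vigil finding.
    ok: bool
    reason: str


def assess(
    now: datetime,
    last_cc_activity: Optional[datetime],
    artifact_mtimes: Sequence[datetime],
    activity_hours: float = 12.0,
    stale_hours: float = 24.0,
) -> Assessment:
    """Report dark only if CC is recently active AND every artifact is stale."""
    # No recent Claude Code activity: the operator is idle, so there is
    # nothing for the hooks to have dispatched on.
    if last_cc_activity is None or now - last_cc_activity > timedelta(hours=activity_hours):
        return Assessment(ok=True, reason="operator_idle")
    # Newest-wins: one fresh artifact (send or gated skip) proves a dispatch.
    newest = max(artifact_mtimes, default=None)
    if newest is not None and now - newest <= timedelta(hours=stale_hours):
        return Assessment(ok=True, reason="live")
    return Assessment(ok=False, reason="plugin_hook_chain_dark")


now = datetime(2026, 6, 2, 12, 0)
recent_prompt = now - timedelta(hours=1)
frozen_checkins = now - timedelta(days=17)
fresh_skip = now - timedelta(hours=2)

dark = assess(now, recent_prompt, [frozen_checkins])
gated = assess(now, recent_prompt, [frozen_checkins, fresh_skip])
idle = assess(now, now - timedelta(days=3), [frozen_checkins])
```

Here `dark` is the 2026-06-02 shape (active CC, checkins.log 17 days stale), `gated` shows a fresh hook-skips.log entry keeping the chain alive, and `idle` shows stale artifacts being tolerated when there is no recent activity to diverge from.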

Honest scope (no silent caps)

  • The artifact set spans plugin and user-level (~/.claude/hooks) governance hooks → it answers "is the governance hook layer alive", slightly broader than "is the plugin chain alive". That's the right default.
  • Soft spot: a single session held open >stale_hours with no dispatch could read dark. Default window (24h) makes this rare; severity is warning (advisory to verify), not critical.
  • Needs a consumer. Per the EISV write-only-signal lesson, an alert nobody reads is the same failure one layer up. This emits a standard Vigil finding (fingerprint_key=plugin_hook_chain_dark) — route it to the Discord governance bridge / Sentinel where it'll actually be seen. (Follow-up, not in this PR.)

Config

VIGIL_HOOK_ACTIVITY_HOURS (default 12), VIGIL_HOOK_STALE_HOURS (default 24), VIGIL_CC_HISTORY_PATH, VIGIL_HOOK_ARTIFACT_PATHS.
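
As a rough sketch of how these variables might be read, assuming details the description does not state: that `VIGIL_HOOK_ARTIFACT_PATHS` is an `os.pathsep`-separated list, that hook-skips.log sits next to checkins.log under ~/.unitares, and that the history path defaults to ~/.claude/history.jsonl. The `load_config` helper is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the canary's settings from the environment, applying defaults."""
    env = os.environ if env is None else env

    # ASSUMPTION: both default artifacts live under ~/.unitares.
    default_artifacts = [
        Path(os.path.expanduser("~/.unitares/checkins.log")),
        Path(os.path.expanduser("~/.unitares/hook-skips.log")),
    ]
    # ASSUMPTION: multiple paths are separated by os.pathsep (":" on POSIX).
    raw_paths = env.get("VIGIL_HOOK_ARTIFACT_PATHS", "")
    artifacts = [Path(os.path.expanduser(p)) for p in raw_paths.split(os.pathsep) if p]

    history = env.get("VIGIL_CC_HISTORY_PATH", "~/.claude/history.jsonl")
    return {
        "activity_hours": float(env.get("VIGIL_HOOK_ACTIVITY_HOURS", "12")),
        "stale_hours": float(env.get("VIGIL_HOOK_STALE_HOURS", "24")),
        "history_path": Path(os.path.expanduser(history)),
        "artifact_paths": artifacts or default_artifacts,
    }


defaults = load_config({})
overridden = load_config({
    "VIGIL_HOOK_STALE_HOURS": "48",
    "VIGIL_HOOK_ARTIFACT_PATHS": os.pathsep.join(["/tmp/a.log", "/tmp/b.log"]),
})
```

Passing the environment in as a mapping keeps the helper testable without mutating `os.environ`.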

Tests

9 unit tests on the pure assess() (no clock/fs coupling), including the 2026-06-02 reproduction (CC active + checkins.log 17 days stale → plugin_hook_chain_dark) and the newest-artifact-wins gated-chain case. Full Vigil suite: 100 passed.

Verified live: against this machine the check reads "live" off real turn_stop/auto_edit rows the now-fixed plugin hooks wrote this session (UUID cc447979) — checkins.log went from frozen-May-16 to actively writing.

Note

Pre-existing unrelated failure in test_lease_plane_canonicalize.py::...on_macos (sandbox TMPDIR is /private/tmp not /private/var) — reproduces with this branch's changes stashed; not introduced here.

🤖 Generated with Claude Code

cirwel added 2 commits June 2, 2026 05:49
Expose provenance_context on process_agent_update for model-facing S22 metadata.

Recover S22 fields mangled into recent_tool_results without promoting operational args or creating bogus tool evidence.

Verified with focused pytest suite, compileall, ruff undefined-name checks, diff check, static scan, and independent review.

On 2026-06-02 the unitares-governance-plugin hook chain was found dark for
~2.5 weeks (single-quoted ${CLAUDE_PLUGIN_ROOT} in hooks.json suppressed
shell expansion; every hook died "No such file"). checkins.log had frozen
at 2026-05-16 and nothing surfaced it — no canary asked "are my own hooks
firing?".

This adds that canary as a Vigil check. The load-bearing constraint: it is
EXTERNAL to the hook chain (a heartbeat written by the hooks can't detect the
hooks being un-dispatchable — the exact failure was the wrapper never running).
It's a divergence test, not a staleness threshold: it compares Claude Code's
own activity log (~/.claude/history.jsonl, which the plugin can't suppress)
against the newest hook artifact (checkins.log / hook-skips.log), and fires
only when there is recent CC activity AND every hook artifact is stale — so
"hooks dark" is distinguished from "operator idle". Taking the newest artifact
across send+skip logs means a healthy-but-gated chain still reads alive
(signal = "did a hook dispatch", not "did a check-in record").

Severity warning (advisory to verify, not a critical page). Thresholds are
env-configurable (VIGIL_HOOK_ACTIVITY_HOURS / VIGIL_HOOK_STALE_HOURS).
Registered as a built-in; 9 unit tests on the pure assess() incl. the
2026-06-02 reproduction case.

github-actions Bot commented Jun 2, 2026

✅ Documentation Validation Passed

Tool Count: 7 tools
Version: 2.13.0

All documentation is synchronized with the codebase.


cirwel commented Jun 2, 2026

Correction to the "needs a consumer" caveat above — verified, it's already wired.

I flagged in the description that routing plugin_hook_chain_dark to a consumer was a follow-up. That's wrong; I traced it and the canary is routed by construction, because it uses the standard Vigil finding contract:

PluginHookLiveness (ok=False, fingerprint_key=plugin_hook_chain_dark)
→ agent.py transition-emit (agents/vigil/agent.py:820, edge-triggered on healthy→dark via was_healthy, so one page per transition and no spam)
→ post_finding(event_type="vigil_finding", severity="warning", …)
→ POST /api/findings (src/http_api.py:1849), which passes validation: type ends in _finding, warning ∈ _FINDING_SEVERITIES, all required fields supplied
→ event_detector.record_event → broadcaster_instance.broadcast_event
→ dashboard WS + Discord bridge WS (the existing event-visibility pipeline).

Same path resident_tag_hygiene (which also emits warning) already rides. So no additional wiring is needed — the 2.5-week-silent failure this canary targets would now surface as a Discord/dashboard event on the transition. No follow-up required.
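
The edge-triggered behaviour mentioned in the trace can be illustrated with a small sketch. It is not the code at agents/vigil/agent.py:820: the `TransitionEmitter` class, its starting state, and the callback shape are assumptions; only the `was_healthy` idea and the `post_finding` keyword values are taken from the comment above.

```python
from typing import Callable, List


class TransitionEmitter:
    """Emit a finding only on the healthy -> dark edge, not on every dark poll."""

    def __init__(self, post_finding: Callable[..., None]) -> None:
        self._post_finding = post_finding
        # ASSUMPTION: the check is treated as healthy before its first run.
        self._was_healthy = True

    def observe(self, ok: bool) -> bool:
        """Record one check result; return True if a finding was emitted."""
        emitted = False
        if self._was_healthy and not ok:
            self._post_finding(
                event_type="vigil_finding",
                severity="warning",
                fingerprint_key="plugin_hook_chain_dark",
            )
            emitted = True
        self._was_healthy = ok
        return emitted


sent: List[dict] = []
emitter = TransitionEmitter(lambda **finding: sent.append(finding))

# healthy, dark, still dark, recovered, dark again
results = [emitter.observe(ok) for ok in (True, False, False, True, False)]
print(results)  # [False, True, False, False, True]
```

A chain that stays dark across many polls produces one finding, and a second finding is sent only after the chain has recovered and gone dark again.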

@cirwel cirwel merged commit f2a8dd5 into master Jun 2, 2026
6 checks passed
@cirwel cirwel deleted the feat/vigil-plugin-hook-liveness branch June 2, 2026 16:19