VXD includes three monitoring systems that keep the pipeline running without human intervention: the Watchdog, the Supervisor, and the TUI Dashboard. This guide explains how each works and when you need to step in.
The Watchdog monitors individual agent sessions in real time. It runs continuously while agents are executing.
Every poll_interval_ms milliseconds (default: 10000, i.e. 10 seconds), the Watchdog:
- Captures the last N lines of output from each tmux session
- Computes a SHA-256 fingerprint of the output
- Compares the fingerprint to the previous check
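The polling steps above can be sketched roughly as follows. This is a minimal illustration, not VXD's actual implementation: the names `fingerprint` and `StuckDetector` and the `last_n` default are hypothetical, and the captured text is assumed to come from something like `tmux capture-pane -p -t <session>`.

```python
import hashlib
import time
from typing import Optional


def fingerprint(output: str, last_n: int = 50) -> str:
    """Return the SHA-256 hex digest of the last `last_n` lines of output."""
    tail = "\n".join(output.splitlines()[-last_n:])
    return hashlib.sha256(tail.encode("utf-8")).hexdigest()


class StuckDetector:
    """Tracks one session's fingerprint and how long it has stayed unchanged."""

    def __init__(self, stuck_threshold_s: float):
        self.stuck_threshold_s = stuck_threshold_s
        self._last_fp: Optional[str] = None
        self._unchanged_since: Optional[float] = None

    def check(self, output: str, now: Optional[float] = None) -> bool:
        """Record one poll; return True once output has been static too long."""
        now = time.monotonic() if now is None else now
        fp = fingerprint(output)
        if fp != self._last_fp:
            # Output changed since the previous poll: restart the clock.
            self._last_fp = fp
            self._unchanged_since = now
            return False
        return now - self._unchanged_since >= self.stuck_threshold_s
```

Comparing digests rather than raw text keeps the per-session state small: only one fixed-size string needs to be remembered between polls.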
| Condition | Detection | Action | Event |
|---|---|---|---|
| Permission prompt | permission_pattern regex matches | Sends "Y" to session | (none) |
| Plan mode | plan_mode_pattern regex matches | Sends Escape key | (none) |
| Agent stuck | Fingerprint unchanged for stuck_threshold_s | Emits stuck event | AGENT_STUCK |
| Agent done | idle_pattern regex matches | Marks story complete | STORY_COMPLETED |

    monitor:
      poll_interval_ms: 10000           # Check frequency
      stuck_threshold_s: 600            # Seconds of unchanged output before AGENT_STUCK fires (informational only; does not kill the agent)
      context_freshness_tokens: 150000  # Token limit warning

Each runtime defines regex patterns for status detection:

    runtimes:
      claude-code:
        detection:
          idle_pattern: "^\\$\\s*$"        # Shell prompt = done
          permission_pattern: "\\[Y/n\\]"  # Permission request
          plan_mode_pattern: "Plan mode"   # Claude entered plan mode

These patterns are compiled at startup. If an agent enters an unexpected state, adjust the patterns to match your runtime's output format.
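To experiment with pattern changes before editing the config, the claude-code patterns above can be tried against sample output. The sketch below is illustrative only: the `classify` helper is hypothetical, and checking just the last non-empty line (and checking permission before plan mode before idle) are assumptions, not documented VXD behavior.

```python
import re

# Patterns from the claude-code example, with the YAML string escapes removed.
PATTERNS = [
    ("permission", re.compile(r"\[Y/n\]")),
    ("plan_mode", re.compile(r"Plan mode")),
    ("idle", re.compile(r"^\$\s*$")),
]


def classify(output: str):
    """Return 'permission', 'plan_mode', 'idle', or None for captured output."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    last = lines[-1] if lines else ""
    for name, pattern in PATTERNS:
        if pattern.search(last):
            return name
    return None
```

For example, `classify("Allow edit? [Y/n]")` returns `"permission"`, while ordinary build output returns `None`.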
| Model Speed | Recommended stuck_threshold_s |
|---|---|
| Fast (Haiku, GPT-4o-mini) | 60-90s |
| Medium (Sonnet) | 120-180s |
| Slow (Opus, complex stories) | 180-300s |
The Supervisor provides periodic high-level oversight across all stories in a requirement.
The Supervisor is an LLM-powered agent (Sonnet by default) that reviews:
- The original requirement
- Current status of all stories
- Progress so far
It produces a structured assessment:

    {
      "on_track": true/false,
      "concerns": ["list of concerns"],
      "reprioritize": ["story IDs to reprioritize"]
    }

| Outcome | Event |
|---|---|
| Everything on track | SUPERVISOR_CHECK |
| Drift detected | SUPERVISOR_DRIFT_DETECTED |
| Reprioritization needed | SUPERVISOR_REPRIORITIZE |
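One plausible reading of how the assessment fields map onto these events is sketched below. The mapping is an assumption inferred from the field and event names; the `events_for` helper is hypothetical and VXD's actual rules may differ.

```python
import json


def events_for(assessment_json: str) -> list:
    """Map a Supervisor assessment (JSON text) to the event names it implies."""
    assessment = json.loads(assessment_json)
    if assessment["on_track"]:
        events = ["SUPERVISOR_CHECK"]
    else:
        events = ["SUPERVISOR_DRIFT_DETECTED"]
    # Assumption: any listed story ID also triggers a reprioritize event.
    if assessment.get("reprioritize"):
        events.append("SUPERVISOR_REPRIORITIZE")
    return events
```

Note that the placeholder `true/false` in the schema above is not valid JSON; a real assessment carries either `true` or `false`.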
If the Supervisor determines stories are drifting from the original requirement, VXD:
- Emits SUPERVISOR_DRIFT_DETECTED with details
- Logs concerns for visibility
- May reprioritize remaining stories
You can view supervisor findings via:

    vxd events --type SUPERVISOR_DRIFT_DETECTED

When an agent repeatedly fails (stuck, review rejected, QA failures), VXD escalates through a 5-tier chain.
| Trigger | Threshold | Action |
|---|---|---|
| Execution failures | max_retries_before_escalation (default: 2) | Escalate to next tier |
| QA failures | max_qa_failures_before_escalation (default: 3) | Escalate to next tier |
| Agent stuck | After stuck detection + retry | Escalate to next tier |
Tier 0: Same-role retry with smart error analysis
- Classifies errors into 8 categories (missing_symbol, syntax, type_error,
import, test_failure, build_config, environment, timeout)
- Provides targeted fix suggestions to the retry agent
- Actual build/test/lint output is passed, not just "QA failed"
Tier 1: Senior developer (more capable model)
Tier 2: Manager diagnosis (Sonnet-class LLM)
- Analyzes full failure history across all attempts
- May rewrite the story description (STORY_REWRITTEN event)
Tier 3: Tech Lead re-planning
- Decomposes failing story into smaller sub-stories (STORY_SPLIT event)
- Updates the dependency DAG
Tier 4: Pause (human intervention required)
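The threshold check and the tier progression can be pictured with a small sketch. The `Attempts` type and both function names are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption; only the two config keys and their defaults come from the table above.

```python
from dataclasses import dataclass

# Tier labels paraphrased from the chain above; index = tier number.
TIERS = [
    "same-role retry",
    "senior developer",
    "manager diagnosis",
    "tech lead re-planning",
    "pause (human intervention required)",
]


@dataclass
class Attempts:
    execution_failures: int = 0
    qa_failures: int = 0


def should_escalate(
    attempts: Attempts,
    max_retries_before_escalation: int = 2,
    max_qa_failures_before_escalation: int = 3,
) -> bool:
    """True once either failure counter reaches its configured threshold."""
    return (
        attempts.execution_failures >= max_retries_before_escalation
        or attempts.qa_failures >= max_qa_failures_before_escalation
    )


def next_tier(current: int) -> int:
    """Advance one tier, stopping at the final pause tier."""
    return min(current + 1, len(TIERS) - 1)
```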

    # List all escalations
    vxd escalations
    # Example output:
    # ESC-001  STORY-003  junior-01  "Agent stuck after 2 retries"  unresolved
    # ESC-002  STORY-005  inter-02   "QA failed 3 times"            resolved

Escalations also appear in the Dashboard's Escalation panel.
VXD is designed to recover gracefully when the orchestrator process dies mid-run.
On vxd resume, an advisory lock file is acquired at ~/.vxd/projects/<name>/state/run.lock. This prevents concurrent VXD instances from corrupting state.
- The lock file contains the PID of the owning process
- Stale locks (PID no longer alive) are automatically cleaned up
- Use --force to override a stuck lock
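A PID-based lock with stale detection can be sketched as below. This is a POSIX-only illustration, not VXD's code: `pid_alive` and `acquire_lock` are hypothetical names, and the `force` parameter merely mirrors the behavior described for --force.

```python
import os


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def acquire_lock(path: str, force: bool = False) -> bool:
    """Take the lock; return False if a live process already owns it."""
    if os.path.exists(path):
        try:
            with open(path) as f:
                owner = int(f.read().strip())
        except (OSError, ValueError):
            owner = None  # unreadable or garbled lock: treat as stale
        if owner is not None and pid_alive(owner) and not force:
            return False
        try:
            os.remove(path)  # stale or forced: clear the old lock
        except FileNotFoundError:
            pass
    try:
        # O_EXCL makes creation fail if another process won the race.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True
```

Creating the file with `O_EXCL` matters because two instances might both see a stale lock and try to replace it at the same moment; only one creation can succeed.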
The monitor writes checkpoints at phase transitions:
Phase transitions: dispatching → monitoring → merging → completed
Each checkpoint records:
- Requirement ID, wave number, current phase
- Active agents (story ID, session name, worktree path, branch)
- Merging story (if mid-merge)
- PID and timestamp
Checkpoints are written atomically (temp file + rename) to prevent corruption.
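The temp-file-plus-rename pattern looks roughly like this in practice. The sketch is a generic illustration: `write_checkpoint`, the JSON encoding, and the temp-file naming are all assumptions rather than VXD's actual on-disk format.

```python
import json
import os
import tempfile


def write_checkpoint(path: str, checkpoint: dict) -> None:
    """Write `checkpoint` so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem) so
    # that the final rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".checkpoint-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(checkpoint, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp, path)  # atomically swaps in the new file
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If the process dies before `os.replace`, the previous checkpoint is still intact; a half-written file is never visible under the real name.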
On resume, VXD inspects all stories and detects 5 recovery scenarios:
| Scenario | Detection | Recovery Action |
|---|---|---|
| Lost story | in_progress, no tmux, no worktree | Reset to draft |
| Orphan agent | Dead tmux session, worktree exists | Reset to draft |
| Mid-merge crash | PR created but not merged | Resume merge |
| Pre-PR crash | Review passed, no PR | Create PR and merge |
| Stuck in review | Review passed, QA never ran | Reset to review_passed |
A RECOVERY_COMPLETED event is emitted with details of all corrective actions.
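The scenario table can be read as a decision function over each story's observed state. The sketch below is one hypothetical encoding: the `StoryState` fields, the order of the checks, and the way "pre-PR crash" is distinguished from "stuck in review" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoryState:
    status: str
    tmux_alive: bool = False
    worktree_exists: bool = False
    review_passed: bool = False
    qa_ran: bool = False
    pr_created: bool = False
    pr_merged: bool = False


def recovery_action(s: StoryState):
    """Return (scenario, action) for a story, or None if nothing is wrong."""
    if s.pr_created and not s.pr_merged:
        return ("mid-merge crash", "resume merge")
    if s.review_passed and not s.qa_ran:
        return ("stuck in review", "reset to review_passed")
    if s.review_passed and not s.pr_created:
        return ("pre-PR crash", "create PR and merge")
    if s.status == "in_progress" and not s.tmux_alive:
        if s.worktree_exists:
            return ("orphan agent", "reset to draft")
        return ("lost story", "reset to draft")
    return None
```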
VXD parses agent output logs to extract structured trace events for monitoring and metrics.
| Kind | Detection | Example |
|---|---|---|
| tool_call | Tool invocation patterns (Read, Write, Edit, Bash, etc.) | Read /path/to/file |
| file_edit | Edited/Updated/Modified + filename | Edited main.go |
| file_create | Created/Wrote + filename | Created handler.go |
| command | Shell prompt + command | $ go test ./... |
| error | Error/FAIL/panic/fatal patterns | FAIL: TestHandler |
| test | PASS/FAIL/ok test patterns | ok pkg 0.5s |
| commit | Git commit patterns | [main abc123] |
| progress | General activity indicators | Status messages |
The vxd metrics command aggregates trace data across all stories:
- Total tool calls — how many tool invocations agents made
- Total file edits / creates — volume of code changes
- Total commands — shell commands executed
- Total errors — errors encountered during execution
- Total tests — test runs detected
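As a rough picture of how log lines might become trace kinds and then totals, consider the sketch below. The regexes are hypothetical stand-ins written to match the example column of the trace table; VXD's real patterns are not shown in this guide, and `classify_line` and `aggregate` are illustrative names.

```python
import re
from collections import Counter

# First matching rule wins. "error" is listed before "test" so that a FAIL
# line counts as an error, as in the table's example.
RULES = [
    ("commit", re.compile(r"^\[[\w./-]+ [0-9a-f]{6,40}\]")),
    ("command", re.compile(r"^\$ \S")),
    ("error", re.compile(r"\b(Error|FAIL|panic|fatal)\b")),
    ("test", re.compile(r"^(PASS|ok)\b")),
    ("file_edit", re.compile(r"^(Edited|Updated|Modified) \S+")),
    ("file_create", re.compile(r"^(Created|Wrote) \S+")),
    ("tool_call", re.compile(r"^(Read|Write|Edit|Bash)\b")),
]


def classify_line(line: str) -> str:
    """Return the trace kind for one log line; 'progress' is the fallback."""
    for kind, pattern in RULES:
        if pattern.search(line):
            return kind
    return "progress"


def aggregate(log: str) -> Counter:
    """Count trace kinds across all non-blank lines of a log."""
    return Counter(classify_line(ln) for ln in log.splitlines() if ln.strip())
```

Rule order is a design choice here: several kinds overlap (a failing test is both an error and a test), so whichever rule comes first decides the count.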
VXD ships two dashboard modes: a terminal UI (TUI) and a browser-based web dashboard. Both show the same five sections and refresh every 2 seconds.

    vxd dashboard

The TUI is a single-pane layout: all five sections are visible simultaneously without switching tabs.
┌─ Agents ────────────────────────────────────────────────────────┐
│ junior-01 STORY-001 working senior-01 (idle) idle │
│ junior-02 STORY-002 working │
└─────────────────────────────────────────────────────────────────┘
Pipeline: REQ-01HXYZ ████████████░░░░░░░░ 8/13 stories in_progress
┌─ Stories ───────────────────────────────────────────────────────┐
│ STORY-001 [2] Add /healthz route in_progress │
│ STORY-002 [3] Implement uptime tracking in_progress │
│ STORY-003 [2] Add integration tests blocked │
│ ... │
└─────────────────────────────────────────────────────────────────┘
┌─ Activity ──────────────────────────────────────────────────────┐
│ 14:05 STORY_REVIEW_PASSED STORY-001 │
│ 14:04 STORY_COMPLETED STORY-002 │
│ 14:02 AGENT_SPAWNED junior-01 │
└─────────────────────────────────────────────────────────────────┘
┌─ Escalations ───────────────────────────────────────────────────┐
│ (none) │
└─────────────────────────────────────────────────────────────────┘
| Key | Action |
|---|---|
| j / k | Scroll the stories table down / up |
| w | Open the web dashboard in your browser |
| q / Ctrl+C | Quit |

    vxd dashboard --web
    vxd dashboard --web --port 9090  # custom port (default: 8787)

Opens a browser-based dashboard at http://localhost:8787. Binds to localhost only, with no external access.
The web dashboard mirrors the TUI: agents, a pipeline summary bar with progress, a stories table, an activity log, and an escalations panel (collapsible).
From the web dashboard you can issue commands directly to the running pipeline:
| Command | Description |
|---|---|
| Pause | Pause requirement intake — no new stories are dispatched |
| Resume | Resume a paused requirement |
| Retry | Retry a failed or stuck story with the same agent |
| Reassign | Reassign a story to a different agent tier |
| Escalate | Manually escalate a story to the next agent tier |
| Kill agent | Terminate an agent's tmux session |
| Edit story | Update a story's description before re-dispatch |
Destructive actions (kill, reassign, edit) show a confirmation dialog before executing.
Command results appear as toast notifications.
The web dashboard connects to VXD over WebSocket on the same port. Three message types are used:
- State broadcast: full dashboard snapshot sent every 2 seconds to all connected clients
- Event push: individual events pushed immediately when they are emitted (e.g., STORY_COMPLETED)
- Command / result: client sends a JSON command object; server replies with a result message
The client reconnects automatically on disconnect with exponential backoff.
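Exponential backoff means each failed reconnect waits roughly twice as long as the previous one, up to a ceiling. The sketch below shows the general technique only; the base delay, cap, and optional jitter are illustrative values, not the dashboard client's actual settings (the real client is JavaScript).

```python
import random


def backoff_delays(base_s: float = 1.0, cap_s: float = 30.0, jitter: float = 0.0):
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at cap_s.

    With jitter > 0, each delay is stretched by a random factor of up to
    (1 + jitter) so that many clients do not reconnect in lockstep.
    """
    delay = base_s
    while True:
        yield delay * (1 + jitter * random.random())
        delay = min(cap_s, delay * 2)
```

A client would take the next delay after each failed attempt and start a fresh generator once a connection succeeds.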
- Vanilla HTML, CSS, and JavaScript — no external dependencies or build step
- Dark theme
- No authentication in v1 — restrict network access at the OS level if needed
- Empty states are shown for each section when there is no data yet
For quick checks without the full dashboard:

    # Current status of all requirements and stories
    vxd status
    # Status of a specific requirement
    vxd status --req REQ-01HXYZ
    # List all agents, optionally filtered
    vxd agents
    vxd agents --status working
    vxd agents --status stuck
    # Recent events (newest first)
    vxd events --limit 20
    # Events of a specific type
    vxd events --type AGENT_STUCK
    # Events for a specific story
    vxd events --story STORY-001
    # All escalations
    vxd escalations

VXD is designed to run autonomously, but some situations require human attention:
| Signal | What to do |
|---|---|
| Tier 4 pause (escalation exhausted) | Review the story requirements — they may be ambiguous or infeasible |
| Repeated QA failures across stories | Check if lint/build/test commands are correct in config |
| Supervisor drift detected | Review the original requirement and story decomposition |
| Agent stuck with high stuck_threshold_s | Check if the runtime CLI is responsive (tmux attach -t <session>) |
| No progress after vxd resume | Run vxd preflight to verify environment, check API keys and CLIs |
| Stories awaiting approval | Use vxd review, vxd approve, or vxd reject to advance the pipeline |
| Lock file blocking resume | Another VXD instance may be running; use --force only if the lock is stale |