shepard-system · digitalashes · Mar 12, 2026 · Mar 11, 2026 · Mar 11, 2026 · Mar 12, 2026
@@ -0,0 +1,61 @@
+---
+name: obs-alerts
+description: Show active alerts and their resolution steps. Use this skill whenever the user asks about alerts, wants to know if anything is broken or firing, asks about Alertmanager state, mentions alert silences, or wants to troubleshoot a specific alert like OTelCollectorDown, LokiDown, HighSessionCost, or SensitiveFileAccess. Also use when the user says "any alerts?" or "is everything ok" in the context of the obs stack.
+disable-model-invocation: false
+---
+
+# Obs Alerts
+
+Active alerts, silences, and resolution guidance.
+
+## Queries to run
+
+Use `scripts/obs-api.sh`. Run independent queries in parallel.
+
+### Active alerts
+
+```bash
+scripts/obs-api.sh am /api/v2/alerts --raw --jq 'if length == 0 then "No active alerts" else .[] | "\(.labels.alertname)\t\(.labels.severity // "?")\t\(.status.state)\t\(.startsAt)" end'
+```
+
+### Silences
+
+```bash
+scripts/obs-api.sh am /api/v2/silences --raw --jq '.[] | select(.status.state == "active") | "\(.matchers | map("\(.name)=\(.value)") | join(", "))\tuntil \(.endsAt)"'
+```
+
+### Prometheus rule states (firing/pending)
+
+```bash
+scripts/obs-api.sh prom /api/v1/rules --raw --jq '.data.groups[].rules[] | select(.state == "firing" or .state == "pending") | "\(.name)\t\(.state)\t\(.annotations.summary // "")"'
+```
+
+### Total rule count
+
+```bash
+scripts/obs-api.sh prom /api/v1/rules --raw --jq '[.data.groups[].rules | length] | add'
+```
+
+## Resolution hints
+
+For alert-specific fix instructions, read `references/resolution-hints.md` in this skill directory.
+
+## Output columns
+
+| Column | Source |
+|--------|--------|
+| Alert | alertname label |
+| Severity | severity label |
+| Status | firing/pending |
+| Since | startsAt timestamp |
+| Description | summary annotation |
+
+For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.
+
+## Presentation
+
+1. Alert table: Alert | Severity | Status | Since
+2. If alerts are firing, read `references/resolution-hints.md` and provide the fix
+3. If silences are active, mention them
+4. If no alerts: "All clear. 0 of 16 rules firing."
+5. Show total rule count for context
@@ -0,0 +1,41 @@
+# Alert Resolution Hints
+
+Known alerts and suggested fixes for the obs stack.
+
+## Infrastructure Tier
+
+| Alert | Fix |
+|-------|-----|
+| **OTelCollectorDown** | `docker compose restart otel-collector` — check logs: `docker compose logs otel-collector --tail=20` |
+| **CollectorExportFailedSpans** | Check collector logs for export errors, verify Tempo is up |
+| **CollectorExportFailedMetrics** | Check collector logs, verify Prometheus is up and scraping |
+| **CollectorExportFailedLogs** | Check collector logs, verify Loki is up |
+| **CollectorHighMemory** | Reduce batch size in `configs/otel-collector/config.yaml` or increase container memory limit |
+| **PrometheusHighMemory** | Reduce retention or increase memory limit in docker-compose.yaml |
+
+## Pipeline Tier
+
+| Alert | Fix |
+|-------|-----|
+| **LokiDown** | `docker compose restart loki` — check disk space, WAL at `/loki/ruler-wal` |
+| **ShepherdServicesDown** | OTel Collector exporter endpoint (:8889) down — usually means collector needs restart |
+| **TempoDown** | `docker compose restart tempo` — check memory (2G limit), verify WAL directory |
+| **PrometheusTargetDown** | Check which target is down: `scripts/obs-api.sh prom /api/v1/targets --raw --jq '.data.activeTargets[] | select(.health=="down")'` |
+| **LokiRecordingRulesFailing** | Check Loki ruler logs, verify `configs/loki/rules/fake/codex.yaml` syntax |
+
+## Services Tier
+
+| Alert | Fix |
+|-------|-----|
+| **HighSessionCost** | Review session in Grafana Cost dashboard — model choice or long session? Consider switching to cheaper model. |
+| **HighTokenBurn** | Check for runaway loops or large file reads. Look at Operations dashboard event stream. |
+| **HighToolErrorRate** | Check Quality dashboard — which tools are failing? Common: Read on deleted files, Bash timeout. |
+| **SensitiveFileAccess** | Check Operations dashboard for which files were accessed. Review PreToolUse guard patterns. |
+| **NoTelemetryReceived** | Hooks not installed or CLI not in use. Run `./hooks/install.sh` and verify with `./scripts/test-signal.sh`. |
+
+## Inhibit Rules
+
+These suppress downstream alerts to reduce noise:
+- `OTelCollectorDown` → suppresses `ShepherdServicesDown` + all business-logic alerts
+- `LokiDown` → suppresses `LokiRecordingRulesFailing` + `HighTokenBurn`
+- `ShepherdServicesDown` → suppresses `NoTelemetryReceived`, `HighToolErrorRate`, `HighSessionCost`
@@ -0,0 +1,78 @@
+---
+name: obs-cost
+description: Show AI coding cost and token usage. Use this skill whenever the user asks how much they spent, wants a cost breakdown by provider or model, asks about token consumption, wants to compare Claude vs Gemini costs, or mentions anything about budget, spend, billing, or usage dollars. Supports time ranges like today, yesterday, week, 24h. Even if the user just says "how much did I spend" or "cost report" — use this skill.
+disable-model-invocation: false
+---
+
+# Obs Cost Report
+
+Cost and token usage breakdown by provider and model.
+
+## Arguments
+
+User may specify a time range. Default: `24h`.
+Mapping: `today` → `24h`, `yesterday` → offset query, `week` → `7d`.
+Replace `[24h]` in queries below with the appropriate range.
+
+## Queries to run
+
+Use `scripts/obs-api.sh` for all queries. Run independent queries in parallel.
+
+### Claude cost by model
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.model != "") | "\(.metric.model)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (model) (max_over_time(shepherd_claude_code_cost_usage_USD_total[24h])))'
+```
+
+### Claude cost total
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(max_over_time(shepherd_claude_code_cost_usage_USD_total[24h]))'
+```
+
+### Claude tokens by type
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.type != "") | "\(.metric.type)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (type) (max_over_time(shepherd_claude_code_token_usage_tokens_total[24h])))'
+```
+
+### Gemini tokens by type
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.type != "") | "\(.metric.type)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (type) (max_over_time(shepherd_gemini_cli_token_usage_total[24h])))'
+```
+
+### Codex tokens
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:tokens_input:1m[24h]))'
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:tokens_output:1m[24h]))'
+```
+
+### Session count per provider
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_claude_code_session_count_total[24h]))'
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_gemini_cli_session_count_total[24h]))'
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:sessions:1m[24h]))'
+```
+
+## Output columns
+
+| Section | Columns |
+|---------|---------|
+| Cost by model | Model \| Cost ($) |
+| Tokens | Provider \| Type \| Count |
+| Sessions | Provider \| Count |
+
+For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.
+
+## Presentation
+
+1. **Total cost** (sum across providers)
+2. **Cost by model** table
+3. **Token breakdown** per provider (input/output/cache)
+4. **Session count** per provider
+5. Calculate cost-per-session and tokens-per-dollar where meaningful
+6. If all values are 0: "No activity in the last 24h. Stack may not be receiving telemetry — try `/obs-status`."
+7. Note: only Claude emits cost metrics. Gemini and Codex show tokens only.
@@ -0,0 +1,51 @@
+---
+name: obs-query
+description: Execute PromQL or LogQL queries against Prometheus and Loki. Use this skill whenever the user wants to run a custom query, check a specific metric value, search logs, asks "what is the value of metric X", wants to explore data in Prometheus or Loki, or pastes a PromQL/LogQL expression. Also use when the user asks to query traces, check recording rules output, or debug metric values that don't match dashboard expectations.
+disable-model-invocation: false
+---
+
+# Obs Query
+
+Execute arbitrary PromQL or LogQL queries and present results.
+
+## Query type detection
+
+- **LogQL**: starts with `{` (stream selector) → route to Loki
+- **PromQL**: everything else → route to Prometheus
+
+## How to execute
+
+### PromQL instant query
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result' --data-urlencode "query=<EXPR>"
+```
+
+### PromQL range query
+
+```bash
+scripts/obs-api.sh prom /api/v1/query_range --raw --jq '.data.result' --data-urlencode "query=<EXPR>" --data-urlencode "start=$(date -v-1H +%s)" --data-urlencode "end=$(date +%s)" --data-urlencode "step=60"
+```
+
+### LogQL query
+
+```bash
+scripts/obs-api.sh loki /loki/api/v1/query_range --raw --jq '.data.result' --data-urlencode "query=<EXPR>" --data-urlencode "limit=20"
+```
+
+## Examples
+
+For a comprehensive list of PromQL and LogQL examples, read `references/examples.md` in this skill directory.
+
+## Output
+
+For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.
+
+## Instructions
+
+1. Take the user's query from the skill argument (everything after `/obs-query`)
+2. If no query provided, read `references/examples.md` and show common examples
+3. Detect query type and execute with appropriate endpoint
+4. Format results as table (instant vectors), summary (range), or log lines
+5. If query fails, show the error and suggest fixes
+6. **Safety**: read-only GET queries only — never POST, PUT, DELETE, or admin endpoints
@@ -0,0 +1,61 @@
+# PromQL & LogQL Examples
+
+Quick reference for querying the obs stack.
+
+## PromQL (Prometheus)
+
+### Basics
+- `up` — all scrape targets with health status
+- `shepherd_tool_calls_total` — raw tool call counters
+- `shepherd_events_total` — raw event counters
+
+### Hook Metrics (aggregated across sessions)
+- `sum(increase(shepherd_tool_calls_total[1h]))` — total tool calls in last hour
+- `topk(5, sum by (tool) (increase(shepherd_tool_calls_total[24h])))` — top 5 tools
+- `sum by (source) (increase(shepherd_events_total[1h]))` — events by provider
+- `sum(increase(shepherd_sensitive_file_access_total[24h]))` — sensitive access count
+
+### Native OTel — Claude
+- `shepherd_claude_code_cost_usage_USD_total` — cost per session (has `session_id`, `model` labels)
+- `shepherd_claude_code_token_usage_tokens_total` — tokens per session (`type`: input/output/cacheRead/cacheCreation)
+- `count(max_over_time(shepherd_claude_code_session_count_total[24h]))` — count distinct sessions
+
+### Native OTel — Gemini
+- `shepherd_gemini_cli_token_usage_total` — tokens (`type`: input/output/thought/cache/tool)
+- `shepherd_gemini_cli_tool_call_count_total` — tool calls by `function_name`
+- `shepherd_gemini_cli_api_request_count_total` — API requests by `status_code`
+
+### Recording Rules — Codex
+- `shepherd:codex:sessions:1m` — session count (1m buckets, use `sum_over_time`)
+- `shepherd:codex:tokens_input:1m` / `shepherd:codex:tokens_output:1m` — tokens
+- `shepherd:codex:tool_calls_by_tool:1m` — tool calls by `tool_name`
+
+### Span Metrics (from Tempo)
+- `traces_spanmetrics_calls_total{span_name="claude.session"}` — session trace counts
+- `traces_spanmetrics_calls_total{span_name=~"claude.tool.*"}` — tool call traces
+- `traces_spanmetrics_latency_bucket{span_name=~"*.tool.*"}` — tool duration histogram
+
+## LogQL (Loki)
+
+### Stream Selectors
+- `{service_name="claude-code"}` — Claude Code logs
+- `{service_name="codex_cli_rs"}` — Codex CLI logs
+- `{service_name="gemini-cli"}` — Gemini CLI logs
+
+### Filtering & Parsing
+- `{service_name="claude-code"} | json` — parse JSON log body
+- `{service_name="claude-code"} | json | line_format "{{.body}}"` — show just the body
+- `{service_name="claude-code"} |= "error"` — filter for errors
+- `{service_name="claude-code"} | json | body_event_type="claude_code.tool_result"` — filter by event type
+
+### Aggregations
+- `count_over_time({service_name="claude-code"}[1h])` — log count in last hour
+- `rate({service_name="claude-code"}[5m])` — log rate
+- `sum by (service_name) (count_over_time({service_name=~".+"}[1h]))` — logs per service
+
+## Key Gotchas
+
+- Native OTel metrics have per-session `session_id` label — use `max_over_time()` not `increase()` for totals
+- `increase()` returns floats — wrap in `round()` for integer counters
+- Codex recording rules: label is `tool_name` (not `tool`)
+- Empty model label: filter with `model!=""` when grouping
@@ -0,0 +1,55 @@
+---
+name: obs-sessions
+description: Show recent AI coding sessions with details. Use this skill whenever the user asks about their recent sessions, wants to see session history, asks "what did I do today/yesterday", wants to know which models were used, how many tools were called per session, or asks about session duration and traces. Also use when the user wants to find a specific session or compare sessions across providers.
+disable-model-invocation: false
+---
+
+# Obs Sessions
+
+Recent sessions across all providers with model, tools, cost, and duration.
+
+## Queries to run
+
+Use `scripts/obs-api.sh`. Run independent queries in parallel.
+
+### Claude sessions (from native OTel cost metrics — has session_id + model)
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[:20] | .[] | "\(.metric.session_id)\t\(.metric.model)\t\(.value[1])"' --data-urlencode 'query=sort_desc(max by (session_id, model) (shepherd_claude_code_cost_usage_USD_total))'
+```
+
+### Tempo traces (all providers — has traceID, service, duration)
+
+```bash
+scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22claude.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
+scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22gemini.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
+scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22codex.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
+```
+
+### Session count per provider
+
+```bash
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_claude_code_session_count_total[24h]))'
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_gemini_cli_session_count_total[24h]))'
+scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:sessions:1m[24h]))'
+```
+
+## Output columns
+
+| Column | Source |
+|--------|--------|
+| Provider | claude/gemini/codex (from Tempo service name) |
+| Session | first 8 chars of session_id or traceID |
+| Model | from Claude cost metrics (others: from Tempo trace attributes if available) |
+| Duration | from Tempo trace |
+| Cost | Claude native OTel (others: `—`) |
+
+For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.
+
+## Presentation
+
+1. Session table sorted by most recent, max 20 rows
+2. Truncate session IDs to first 8 chars
+3. Highlight sessions with high cost (>$1), many tools (>50), or errors
+4. Mention Grafana Session Timeline dashboard for full trace view
+5. If no sessions: "No sessions recorded. Use a CLI with hooks installed, then check `/obs-status`."