Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 61 additions & 0 deletions .claude/skills/obs-alerts/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
---
name: obs-alerts
description: Show active alerts and their resolution steps. Use this skill whenever the user asks about alerts, wants to know if anything is broken or firing, asks about Alertmanager state, mentions alert silences, or wants to troubleshoot a specific alert like OTelCollectorDown, LokiDown, HighSessionCost, or SensitiveFileAccess. Also use when the user says "any alerts?" or "is everything ok" in the context of the obs stack.
disable-model-invocation: false
---

# Obs Alerts

Active alerts, silences, and resolution guidance.

## Queries to run

Use `scripts/obs-api.sh`. Run independent queries in parallel.

### Active alerts

```bash
scripts/obs-api.sh am /api/v2/alerts --raw --jq 'if length == 0 then "No active alerts" else .[] | "\(.labels.alertname)\t\(.labels.severity // "?")\t\(.status.state)\t\(.startsAt)" end'
```

### Silences

```bash
scripts/obs-api.sh am /api/v2/silences --raw --jq '.[] | select(.status.state == "active") | "\(.matchers | map("\(.name)=\(.value)") | join(", "))\tuntil \(.endsAt)"'
```

### Prometheus rule states (firing/pending)

```bash
scripts/obs-api.sh prom /api/v1/rules --raw --jq '.data.groups[].rules[] | select(.state == "firing" or .state == "pending") | "\(.name)\t\(.state)\t\(.annotations.summary // "")"'
```

### Total rule count

```bash
scripts/obs-api.sh prom /api/v1/rules --raw --jq '[.data.groups[].rules | length] | add'
```

## Resolution hints

For alert-specific fix instructions, read `references/resolution-hints.md` in this skill directory.

## Output columns

| Column | Source |
|--------|--------|
| Alert | alertname label |
| Severity | severity label |
| Status | firing/pending |
| Since | startsAt timestamp |
| Description | summary annotation |

For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.

## Presentation

1. Alert table: Alert | Severity | Status | Since
2. If alerts are firing, read `references/resolution-hints.md` and provide the fix
3. If silences are active, mention them
4. If no alerts: "All clear. 0 of 16 rules firing."
5. Show total rule count for context
41 changes: 41 additions & 0 deletions .claude/skills/obs-alerts/references/resolution-hints.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# Alert Resolution Hints

Known alerts and suggested fixes for the obs stack.

## Infrastructure Tier

| Alert | Fix |
|-------|-----|
| **OTelCollectorDown** | `docker compose restart otel-collector` — check logs: `docker compose logs otel-collector --tail=20` |
| **CollectorExportFailedSpans** | Check collector logs for export errors, verify Tempo is up |
| **CollectorExportFailedMetrics** | Check collector logs, verify Prometheus is up and scraping |
| **CollectorExportFailedLogs** | Check collector logs, verify Loki is up |
| **CollectorHighMemory** | Reduce batch size in `configs/otel-collector/config.yaml` or increase container memory limit |
| **PrometheusHighMemory** | Reduce retention or increase memory limit in docker-compose.yaml |

## Pipeline Tier

| Alert | Fix |
|-------|-----|
| **LokiDown** | `docker compose restart loki` — check disk space, WAL at `/loki/ruler-wal` |
| **ShepherdServicesDown** | OTel Collector exporter endpoint (:8889) down — usually means collector needs restart |
| **TempoDown** | `docker compose restart tempo` — check memory (2G limit), verify WAL directory |
| **PrometheusTargetDown** | Check which target is down: `scripts/obs-api.sh prom /api/v1/targets --raw --jq '.data.activeTargets[] | select(.health=="down")'` |
| **LokiRecordingRulesFailing** | Check Loki ruler logs, verify `configs/loki/rules/fake/codex.yaml` syntax |

## Services Tier

| Alert | Fix |
|-------|-----|
| **HighSessionCost** | Review session in Grafana Cost dashboard — model choice or long session? Consider switching to cheaper model. |
| **HighTokenBurn** | Check for runaway loops or large file reads. Look at Operations dashboard event stream. |
| **HighToolErrorRate** | Check Quality dashboard — which tools are failing? Common: Read on deleted files, Bash timeout. |
| **SensitiveFileAccess** | Check Operations dashboard for which files were accessed. Review PreToolUse guard patterns. |
| **NoTelemetryReceived** | Hooks not installed or CLI not in use. Run `./hooks/install.sh` and verify with `./scripts/test-signal.sh`. |

## Inhibit Rules

These suppress downstream alerts to reduce noise:
- `OTelCollectorDown` → suppresses `ShepherdServicesDown` + all business-logic alerts
- `LokiDown` → suppresses `LokiRecordingRulesFailing` + `HighTokenBurn`
- `ShepherdServicesDown` → suppresses `NoTelemetryReceived`, `HighToolErrorRate`, `HighSessionCost`
78 changes: 78 additions & 0 deletions .claude/skills/obs-cost/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
---
name: obs-cost
description: Show AI coding cost and token usage. Use this skill whenever the user asks how much they spent, wants a cost breakdown by provider or model, asks about token consumption, wants to compare Claude vs Gemini costs, or mentions anything about budget, spend, billing, or usage dollars. Supports time ranges like today, yesterday, week, 24h. Even if the user just says "how much did I spend" or "cost report" — use this skill.
disable-model-invocation: false
---

# Obs Cost Report

Cost and token usage breakdown by provider and model.

## Arguments

User may specify a time range. Default: `24h`.
Mapping: `today` → `24h`, `yesterday` → offset query, `week` → `7d`.
Replace `[24h]` in queries below with the appropriate range.

## Queries to run

Use `scripts/obs-api.sh` for all queries. Run independent queries in parallel.

### Claude cost by model

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.model != "") | "\(.metric.model)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (model) (max_over_time(shepherd_claude_code_cost_usage_USD_total[24h])))'
```

### Claude cost total

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(max_over_time(shepherd_claude_code_cost_usage_USD_total[24h]))'
```

### Claude tokens by type

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.type != "") | "\(.metric.type)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (type) (max_over_time(shepherd_claude_code_token_usage_tokens_total[24h])))'
```

### Gemini tokens by type

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[] | select(.metric.type != "") | "\(.metric.type)\t\(.value[1])"' --data-urlencode 'query=sort_desc(sum by (type) (max_over_time(shepherd_gemini_cli_token_usage_total[24h])))'
```

### Codex tokens

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:tokens_input:1m[24h]))'
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:tokens_output:1m[24h]))'
```

### Session count per provider

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_claude_code_session_count_total[24h]))'
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_gemini_cli_session_count_total[24h]))'
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:sessions:1m[24h]))'
```

## Output columns

| Section | Columns |
|---------|---------|
| Cost by model | Model \| Cost ($) |
| Tokens | Provider \| Type \| Count |
| Sessions | Provider \| Count |

For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.

## Presentation

1. **Total cost** (sum across providers)
2. **Cost by model** table
3. **Token breakdown** per provider (input/output/cache)
4. **Session count** per provider
5. Calculate cost-per-session and tokens-per-dollar where meaningful
6. If all values are 0: "No activity in the last 24h. Stack may not be receiving telemetry — try `/obs-status`."
7. Note: only Claude emits cost metrics. Gemini and Codex show tokens only.
51 changes: 51 additions & 0 deletions .claude/skills/obs-query/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
---
name: obs-query
description: Execute PromQL or LogQL queries against Prometheus and Loki. Use this skill whenever the user wants to run a custom query, check a specific metric value, search logs, asks "what is the value of metric X", wants to explore data in Prometheus or Loki, or pastes a PromQL/LogQL expression. Also use when the user asks to query traces, check recording rules output, or debug metric values that don't match dashboard expectations.
disable-model-invocation: false
---

# Obs Query

Execute arbitrary PromQL or LogQL queries and present results.

## Query type detection

- **LogQL**: starts with `{` (stream selector) → route to Loki
- **PromQL**: everything else → route to Prometheus

## How to execute

### PromQL instant query

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result' --data-urlencode "query=<EXPR>"
```

### PromQL range query

```bash
scripts/obs-api.sh prom /api/v1/query_range --raw --jq '.data.result' --data-urlencode "query=<EXPR>" --data-urlencode "start=$(date -v-1H +%s)" --data-urlencode "end=$(date +%s)" --data-urlencode "step=60"
```

### LogQL query

```bash
scripts/obs-api.sh loki /loki/api/v1/query_range --raw --jq '.data.result' --data-urlencode "query=<EXPR>" --data-urlencode "limit=20"
```

## Examples

For a comprehensive list of PromQL and LogQL examples, read `references/examples.md` in this skill directory.

## Output

For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.

## Instructions

1. Take the user's query from the skill argument (everything after `/obs-query`)
2. If no query provided, read `references/examples.md` and show common examples
3. Detect query type and execute with appropriate endpoint
4. Format results as table (instant vectors), summary (range), or log lines
5. If query fails, show the error and suggest fixes
6. **Safety**: read-only GET queries only — never POST, PUT, DELETE, or admin endpoints
61 changes: 61 additions & 0 deletions .claude/skills/obs-query/references/examples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# PromQL & LogQL Examples

Quick reference for querying the obs stack.

## PromQL (Prometheus)

### Basics
- `up` — all scrape targets with health status
- `shepherd_tool_calls_total` — raw tool call counters
- `shepherd_events_total` — raw event counters

### Hook Metrics (aggregated across sessions)
- `sum(increase(shepherd_tool_calls_total[1h]))` — total tool calls in last hour
- `topk(5, sum by (tool) (increase(shepherd_tool_calls_total[24h])))` — top 5 tools
- `sum by (source) (increase(shepherd_events_total[1h]))` — events by provider
- `sum(increase(shepherd_sensitive_file_access_total[24h]))` — sensitive access count

### Native OTel — Claude
- `shepherd_claude_code_cost_usage_USD_total` — cost per session (has `session_id`, `model` labels)
- `shepherd_claude_code_token_usage_tokens_total` — tokens per session (`type`: input/output/cacheRead/cacheCreation)
- `count(max_over_time(shepherd_claude_code_session_count_total[24h]))` — count distinct sessions

### Native OTel — Gemini
- `shepherd_gemini_cli_token_usage_total` — tokens (`type`: input/output/thought/cache/tool)
- `shepherd_gemini_cli_tool_call_count_total` — tool calls by `function_name`
- `shepherd_gemini_cli_api_request_count_total` — API requests by `status_code`

### Recording Rules — Codex
- `shepherd:codex:sessions:1m` — session count (1m buckets, use `sum_over_time`)
- `shepherd:codex:tokens_input:1m` / `shepherd:codex:tokens_output:1m` — tokens
- `shepherd:codex:tool_calls_by_tool:1m` — tool calls by `tool_name`

### Span Metrics (from Tempo)
- `traces_spanmetrics_calls_total{span_name="claude.session"}` — session trace counts
- `traces_spanmetrics_calls_total{span_name=~"claude.tool.*"}` — tool call traces
- `traces_spanmetrics_latency_bucket{span_name=~"*.tool.*"}` — tool duration histogram

## LogQL (Loki)

### Stream Selectors
- `{service_name="claude-code"}` — Claude Code logs
- `{service_name="codex_cli_rs"}` — Codex CLI logs
- `{service_name="gemini-cli"}` — Gemini CLI logs

### Filtering & Parsing
- `{service_name="claude-code"} | json` — parse JSON log body
- `{service_name="claude-code"} | json | line_format "{{.body}}"` — show just the body
- `{service_name="claude-code"} |= "error"` — filter for errors
- `{service_name="claude-code"} | json | body_event_type="claude_code.tool_result"` — filter by event type

### Aggregations
- `count_over_time({service_name="claude-code"}[1h])` — log count in last hour
- `rate({service_name="claude-code"}[5m])` — log rate
- `sum by (service_name) (count_over_time({service_name=~".+"}[1h]))` — logs per service

## Key Gotchas

- Native OTel metrics have per-session `session_id` label — use `max_over_time()` not `increase()` for totals
- `increase()` returns floats — wrap in `round()` for integer counters
- Codex recording rules: label is `tool_name` (not `tool`)
- Empty model label: filter with `model!=""` when grouping
55 changes: 55 additions & 0 deletions .claude/skills/obs-sessions/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
---
name: obs-sessions
description: Show recent AI coding sessions with details. Use this skill whenever the user asks about their recent sessions, wants to see session history, asks "what did I do today/yesterday", wants to know which models were used, how many tools were called per session, or asks about session duration and traces. Also use when the user wants to find a specific session or compare sessions across providers.
disable-model-invocation: false
---

# Obs Sessions

Recent sessions across all providers with model, tools, cost, and duration.

## Queries to run

Use `scripts/obs-api.sh`. Run independent queries in parallel.

### Claude sessions (from native OTel cost metrics — has session_id + model)

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[:20] | .[] | "\(.metric.session_id)\t\(.metric.model)\t\(.value[1])"' --data-urlencode 'query=sort_desc(max by (session_id, model) (shepherd_claude_code_cost_usage_USD_total))'
```

### Tempo traces (all providers — has traceID, service, duration)

```bash
scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22claude.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22gemini.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
scripts/obs-api.sh tempo '/api/search?q=%7Bspan%3Aname%3D%22codex.session%22%7D&limit=10' --raw --jq '.traces[:10][] | "\(.traceID)\t\(.rootServiceName)\t\(.durationMs)ms"'
```

### Session count per provider

```bash
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_claude_code_session_count_total[24h]))'
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=count(max_over_time(shepherd_gemini_cli_session_count_total[24h]))'
scripts/obs-api.sh prom /api/v1/query --raw --jq '.data.result[0].value[1] // "0"' --data-urlencode 'query=sum(sum_over_time(shepherd:codex:sessions:1m[24h]))'
```

## Output columns

| Column | Source |
|--------|--------|
| Provider | claude/gemini/codex (from Tempo service name) |
| Session | first 8 chars of session_id or traceID |
| Model | from Claude cost metrics (others: from Tempo trace attributes if available) |
| Duration | from Tempo trace |
| Cost | Claude native OTel (others: `—`) |

For output format options (table/csv/json), read `.claude/skills/obs-shared/assets/output-formats.md`.

## Presentation

1. Session table sorted by most recent, max 20 rows
2. Truncate session IDs to first 8 chars
3. Highlight sessions with high cost (>$1), many tools (>50), or errors
4. Mention Grafana Session Timeline dashboard for full trace view
5. If no sessions: "No sessions recorded. Use a CLI with hooks installed, then check `/obs-status`."
Loading
Loading