From 9b64aa37ff389137d9326c65b516b9ad0f62ba05 Mon Sep 17 00:00:00 2001 From: kulvirgit Date: Fri, 20 Mar 2026 17:28:13 -0700 Subject: [PATCH 1/2] =?UTF-8?q?feat:=20add=20/telemetry-report=20skill=20f?= =?UTF-8?q?or=20Azure=20App=20Insights=20=E2=86=92=20Jira=20pipeline?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Queries 6 error categories from App Insights (core failures, provider errors, app errors, agent failure rates, engine errors, SQL failures), deduplicates against a local JSON store + JQL backstop, and auto-creates Jira tickets for new issues above configurable thresholds. Co-Authored-By: Claude Opus 4.6 (1M context) --- .claude/commands/telemetry-report.md | 267 +++++++++++++++++++++++++++ data/telemetry/.gitignore | 1 + 2 files changed, 268 insertions(+) create mode 100644 .claude/commands/telemetry-report.md create mode 100644 data/telemetry/.gitignore diff --git a/.claude/commands/telemetry-report.md b/.claude/commands/telemetry-report.md new file mode 100644 index 0000000000..9a6eefcd13 --- /dev/null +++ b/.claude/commands/telemetry-report.md @@ -0,0 +1,267 @@ +--- +name: telemetry-report +description: "Query Azure App Insights, surface errors, create Jira tickets" +--- + +You are running the telemetry-report pipeline. Follow every step below in order. Do not skip steps. + +Arguments: $ARGUMENTS +- If arguments contain `dry-run`: report only, do not create Jira tickets. +- If arguments contain `lookback=Xh`: replace `2h` in all KQL queries with the specified value (e.g., `lookback=4h` → `ago(4h)`). Default: `2h`. + +Parse the lookback value now. If not provided, use `2h`. 
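As a rough illustration of the argument handling described above, the following Python sketch extracts the `dry-run` flag and the lookback window from the raw argument string and substitutes the window into a query. The function names `parse_arguments` and `apply_lookback` are hypothetical, and accepting only whole-hour values such as `4h` is an assumption based on the `lookback=Xh` form.

```python
import re

DEFAULT_LOOKBACK = "2h"


def parse_arguments(arguments: str) -> tuple[bool, str]:
    """Return (dry_run, lookback) parsed from the raw argument string."""
    tokens = arguments.split()
    dry_run = "dry-run" in tokens
    lookback = DEFAULT_LOOKBACK
    for token in tokens:
        # Assumption: only whole-hour values of the form lookback=<N>h are accepted.
        match = re.fullmatch(r"lookback=(\d+h)", token)
        if match:
            lookback = match.group(1)
    return dry_run, lookback


def apply_lookback(kql: str, lookback: str) -> str:
    """Swap the default ago(2h) window in a query for the requested one."""
    return kql.replace("ago(2h)", f"ago({lookback})")
```

With no arguments the sketch falls back to `2h` and leaves the queries unchanged.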
+ +--- + +## Step 1: Preflight + +Run this smoke query to verify Azure auth AND App Insights resource access: + +```bash +az monitor app-insights query --app altimate-code-os --resource-group altimate-code \ + --analytics-query "customEvents | take 1" --output json +``` + +If this command fails, report the exact error to the user and **STOP** — do not proceed to Step 2. + +--- + +## Step 2: Query App Insights (6 queries) + +Run all 6 queries using `az monitor app-insights query --app altimate-code-os --resource-group altimate-code --analytics-query "..." --output json`. + +Replace `2h` with the lookback value from arguments parsing. + +**Q1: Core Failures (threshold: >10)** +```kql +customEvents +| where timestamp > ago(2h) and name == "core_failure" +| extend err = tostring(customDimensions.error_message), + tool = tostring(customDimensions.tool_name), + err_class = tostring(customDimensions.error_class) +| summarize count() by err, tool, err_class +| where count_ > 10 +| order by count_ desc +``` + +**Q2: Provider Errors (threshold: >5)** +```kql +customEvents +| where timestamp > ago(2h) and name == "provider_error" +| extend provider = tostring(customDimensions.provider_id), + model = tostring(customDimensions.model_id), + err_type = tostring(customDimensions.error_type) +| summarize count() by provider, model, err_type +| where count_ > 5 +``` + +**Q3: Application Errors (threshold: >5)** +```kql +customEvents +| where timestamp > ago(2h) and name == "error" +| extend err_name = tostring(customDimensions.error_name), + context = tostring(customDimensions.context) +| summarize count() by err_name, context +| where count_ > 5 +``` + +**Q4: Agent Failure Rate (threshold: >15% or >10 errors, min 5 sessions)** +```kql +customEvents +| where timestamp > ago(2h) and name == "agent_outcome" +| extend outcome = tostring(customDimensions.outcome), + agent = tostring(customDimensions.agent) +| summarize total = count(), + errors = countif(outcome in ("error", 
"abandoned", "aborted")) + by agent +| where total >= 5 +| extend error_rate = round(100.0 * errors / total, 1) +| where error_rate > 15 or errors > 10 +``` + +**Q5: Engine Errors (threshold: >2)** +```kql +customEvents +| where timestamp > ago(2h) and name == "engine_error" +| extend phase = tostring(customDimensions.phase), + err = tostring(customDimensions.error_message) +| summarize count() by phase, err +| where count_ > 2 +``` + +**Q6: SQL Failures (threshold: >5)** +```kql +customEvents +| where timestamp > ago(2h) and name == "sql_execute_failure" +| extend err = tostring(customDimensions.error_message), + wh_type = tostring(customDimensions.warehouse_type) +| summarize count() by err, wh_type +| where count_ > 5 +``` + +**IMPORTANT:** Empty results are normal and expected — they mean no issues above threshold. Report "No signals" for empty queries, not failure. Only treat non-zero exit codes as failures. + +--- + +## Step 3: Read Dedup Store & Classify + +Read `data/telemetry/seen-issues.json`. If the file is missing or contains invalid JSON, start with an empty object `{}`. + +For each row returned above threshold from any query: + +### 3a. Normalize the error message + +Apply these transformations in order to produce a normalized error string: +1. Replace file paths (anything matching `/path/to/something.ext` patterns) with `` +2. Replace UUIDs (8-4-4-4-12 hex pattern) with `` +3. Replace large numbers (>6 digits) with `` +4. Collapse consecutive whitespace to a single space +5. Lowercase the entire string +6. Truncate to 80 characters + +### 3b. 
Generate dedup key + +Format: `{event_type}::{dimension}::{normalized_error}` + +- Q1 rows: `core_failure::{tool}::{normalized_err}` +- Q2 rows: `provider_error::{provider}::{err_type}` +- Q3 rows: `error::{err_name}::{context}` +- Q4 rows: `agent_failure::{agent}::{rate_bucket}` (use the agent name + rate bucket, not the raw `error_rate`: "high" for >30%, "moderate" for >15%) +- Q5 rows: `engine_error::{phase}::{normalized_err}` +- Q6 rows: `sql_failure::{wh_type}::{normalized_err}` + +### 3c. Classify each issue + +- Key **exists** in the dedup store → **KNOWN** (will update `last_count` and `last_checked` in Step 6) +- Key **does not exist** → **NEW** (will create a Jira ticket in Step 5) + +--- + +## Step 4: Report + +Output a markdown report to the user in this format: + +``` +# Telemetry Report — {current ISO timestamp} + +## Summary +- Lookback: {lookback_value} | Queries run: 6 | Above threshold: {N} | New: {M} | Known: {K} + +| Sev | Issue | Count | Trend | Status | Jira | +|-----|-------|-------|-------|--------|------| +``` + +For each issue above threshold, add a row. + +**Severity assignment:** +- **P0**: count > 100, or failure rate > 30% +- **P1**: count 20–100, or failure rate 15–30% +- **P2**: all other threshold violations + +**Trend calculation** (compare current `count_` to `last_count` in dedup store): +- `↑` = current > 1.2 × last_count +- `↓` = current < 0.8 × last_count +- `→` = within 20% +- `(new)` = no previous data (key not in store) + +**Status column**: NEW or KNOWN +**Jira column**: ticket ID if known, "Creating..." if new, "—" if dry-run + +If ALL 6 queries returned empty results, output: + +``` +# Telemetry Report — {timestamp} + +All clear — no issues above threshold in the last {lookback_value}. +``` + +--- + +## Step 5: Create Jira Tickets (NEW issues only) + +**If `dry-run` was specified in arguments, skip this entire step.** Output: "Dry-run mode — skipping ticket creation." 
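The severity and trend rules from Step 4 can be sketched in Python as follows. This is only an illustration: the function names `severity` and `trend` are hypothetical, and the boundary handling (exactly 100 events or exactly 30% landing in P1) is one reading of the stated ranges rather than something the step spells out.

```python
from typing import Optional


def severity(count: int, failure_rate: Optional[float] = None) -> str:
    """Map a count (and, for Q4 rows, a failure rate in percent) to P0/P1/P2."""
    if count > 100 or (failure_rate is not None and failure_rate > 30):
        return "P0"
    # Assumption: the 20-100 and 15-30% ranges are inclusive at both ends.
    if count >= 20 or (failure_rate is not None and failure_rate >= 15):
        return "P1"
    return "P2"


def trend(current: int, last_count: Optional[int]) -> str:
    """Compare the current count with the stored last_count (None if the key is new)."""
    if last_count is None:
        return "(new)"
    if current > 1.2 * last_count:
        return "↑"
    if current < 0.8 * last_count:
        return "↓"
    return "→"
```

For Q4 rows the sketch assumes `count` is the `errors` column and `failure_rate` is `error_rate`; rows from the other queries would pass only `count_`.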
+ +**Cap: maximum 5 tickets per run.** Prioritize by severity (P0 first), then by count descending. + +For each NEW issue (up to 5): + +### 5a. JQL backstop search + +Before creating, search for an existing ticket: + +Use `mcp__atlassian__searchJiraIssuesUsingJql` with: +- `cloudId`: `ae6de7ce-ca58-46e8-b583-1468bf597470` +- `jql`: `project = AI AND labels = "altimate-code" AND summary ~ "{event_type}: {short_desc}" AND status != Done AND created >= -30d` +- `maxResults`: `1` + +Where `{short_desc}` is the first 40 chars of the normalized error. + +If a matching ticket is found: treat as KNOWN, record the ticket ID, do NOT create a new ticket. Update the report table accordingly. + +### 5b. Create ticket + +If no existing ticket is found, create one using `mcp__atlassian__createJiraIssue` with: +- `cloudId`: `ae6de7ce-ca58-46e8-b583-1468bf597470` +- `projectKey`: `AI` +- `issueTypeName`: `Bug` (use `Task` for Q4 agent_outcome issues) +- `summary`: `[altimate-code] {event_type}: {short_desc}` (max 100 chars; same `{short_desc}` as in the backstop search so later runs can match it) +- `description`: see template below +- `contentFormat`: `markdown` +- `additional_fields`: `{ "labels": ["altimate-code", "telemetry-auto"] }` + +**Description template:** +``` +## Telemetry Alert + +**Event Type:** {event_type} +**Severity:** {P0/P1/P2} +**Count (last {lookback}):** {count} +**Error Details:** {full_error_message} +**Affected:** {tool/provider/agent name} + +## KQL Query +\`\`\`kql +{the exact query used to detect this issue} +\`\`\` + +--- +Auto-generated by /telemetry-report +``` + +Record the created ticket ID (e.g., `AI-6000`) for the dedup store update. + +**If the Atlassian MCP tools are not available**, output the full report but add a warning: "Jira MCP not connected — ticket creation skipped. Connect Atlassian MCP and re-run." + +--- + +## Step 6: Update Dedup Store + +Write the updated dedup store to `data/telemetry/seen-issues.json`. 
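A minimal Python sketch of this store update, assuming the entry fields used in this step (`jira_ticket`, `first_seen`, `last_count`, `last_checked`), UTC timestamps in the `2026-03-20T02:00:00Z` style, and the 30-day cleanup. The helper names and the shape of the `issues` argument are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

STORE_PATH = Path("data/telemetry/seen-issues.json")
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumption: all stored timestamps use this form
RETENTION = timedelta(days=30)


def load_store(path: Path) -> dict:
    """Read the store; a missing or invalid file yields an empty object."""
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def _parse(stamp: str) -> datetime:
    return datetime.strptime(stamp, ISO_FORMAT).replace(tzinfo=timezone.utc)


def update_store(store: dict, issues: dict, now: datetime) -> dict:
    """issues maps dedup key -> (current count, ticket ID or None)."""
    stamp = now.strftime(ISO_FORMAT)
    for key, (count, ticket) in issues.items():
        if key in store:  # KNOWN: refresh count and check time only
            store[key]["last_count"] = count
            store[key]["last_checked"] = stamp
        else:  # NEW: create a full entry
            store[key] = {
                "jira_ticket": ticket,
                "first_seen": stamp,
                "last_count": count,
                "last_checked": stamp,
            }
    cutoff = now - RETENTION
    # Drop entries whose last check is more than 30 days old.
    return {k: v for k, v in store.items() if _parse(v["last_checked"]) >= cutoff}


def save_store(path: Path, store: dict) -> None:
    """Write the store as 2-space indented JSON."""
    path.write_text(json.dumps(store, indent=2) + "\n")
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the cleanup deterministic and easy to check.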
+ +For each issue processed: +- **NEW entries**: add with `jira_ticket` (the ID of the ticket created in Step 5b or matched by the JQL backstop in Step 5a, or `null` if no ticket was created or matched, e.g. in dry-run), `first_seen` (current ISO timestamp), `last_count` (current count), `last_checked` (current ISO timestamp) +- **KNOWN entries**: update `last_count` and `last_checked` + +**30-day cleanup**: Remove any entries whose `last_checked` is more than 30 days old. + +The JSON shape of each entry: +```json +{ + "dedup_key_here": { + "jira_ticket": "AI-6000", + "first_seen": "2026-03-20T00:00:00Z", + "last_count": 512, + "last_checked": "2026-03-20T02:00:00Z" + } +} +``` + +Write the file with 2-space indented JSON for readability. + +--- + +## Done + +After completing all steps, output: +- Total issues found / tickets created / known issues updated +- Reminder: use `/loop 2h /telemetry-report` for continuous monitoring diff --git a/data/telemetry/.gitignore b/data/telemetry/.gitignore new file mode 100644 index 0000000000..81bb91e3b6 --- /dev/null +++ b/data/telemetry/.gitignore @@ -0,0 +1 @@ +seen-issues.json From 717fd2b0a869681226d0850b70556e541263f8e4 Mon Sep 17 00:00:00 2001 From: kulvirgit Date: Fri, 20 Mar 2026 17:59:11 -0700 Subject: [PATCH 2/2] docs: add README for telemetry-report pipeline Co-Authored-By: Claude Opus 4.6 (1M context) --- data/telemetry/README.md | 53 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 data/telemetry/README.md diff --git a/data/telemetry/README.md b/data/telemetry/README.md new file mode 100644 index 0000000000..931701c86c --- /dev/null +++ b/data/telemetry/README.md @@ -0,0 +1,53 @@ +# Telemetry-to-Jira Pipeline + +`/telemetry-report` is a Claude Code skill that queries Azure Application Insights for altimate-code errors and auto-creates Jira tickets for new issues. 
+ +## Usage + +``` +/telemetry-report # full run — query + report + Jira tickets +/telemetry-report dry-run # report only, no tickets +/telemetry-report lookback=4h # custom lookback window (default: 2h) +/loop 2h /telemetry-report # continuous monitoring every 2 hours +``` + +## What it queries + +| # | Event | Threshold | Description | +|---|-------|-----------|-------------| +| Q1 | `core_failure` | >10 | Tool-level failures (read, edit, sql_analyze, etc.) | +| Q2 | `provider_error` | >5 | LLM provider errors (auth, rate limits, model errors) | +| Q3 | `error` | >5 | Application-level errors | +| Q4 | `agent_outcome` | >15% or >10 | Agent failure/abandon/abort rates | +| Q5 | `engine_error` | >2 | Python engine sidecar errors | +| Q6 | `sql_execute_failure` | >5 | SQL execution failures by warehouse type | + +## How it works + +1. **Preflight** — smoke query to verify Azure CLI auth + App Insights access +2. **Query** — runs 6 KQL queries against `altimate-code-os` App Insights +3. **Dedup** — checks `seen-issues.json` + JQL backstop to avoid duplicate tickets +4. **Report** — outputs markdown table with severity (P0/P1/P2), trend arrows, status +5. **Jira tickets** — creates up to 5 tickets per run in the AI project, labeled `altimate-code` + `telemetry-auto` +6. 
**Update store** — writes back to `seen-issues.json`, cleans entries older than 30 days + +## Severity levels + +- **P0**: count > 100 or failure rate > 30% +- **P1**: count 20–100 or failure rate 15–30% +- **P2**: all other threshold violations + +## Files + +| File | Purpose | +|------|---------| +| `.claude/commands/telemetry-report.md` | The skill definition | +| `data/telemetry/seen-issues.json` | Dedup store (git-ignored) | +| `data/telemetry/.gitignore` | Excludes the dedup store | +| `data/telemetry/README.md` | This file | + +## Prerequisites + +- Azure CLI installed and authenticated (`az login`) +- Access to the `altimate-code-os` App Insights resource in resource group `altimate-code` +- Atlassian MCP connected (for Jira ticket creation; reports still work without it)