diff --git a/agents/monitoring/api-runtime/SKILL.md b/agents/monitoring/api-runtime/SKILL.md index 84ece4a..f3d579a 100644 --- a/agents/monitoring/api-runtime/SKILL.md +++ b/agents/monitoring/api-runtime/SKILL.md @@ -1,84 +1,263 @@ --- name: api-runtime-monitor -description: Monitors LTX API runtime performance, latency, error rates, and throughput. Alerts on performance degradation or errors. -tags: [monitoring, api, performance, latency, errors] +description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org." +tags: [monitoring, api, performance, latency, errors, throughput] --- # API Runtime Monitor -## When to use - -- "Monitor API latency" -- "Alert on API errors" -- "Track API throughput" -- "Monitor inference time" -- "Alert on API performance degradation" - -## What it monitors - -- **Latency**: Request processing time, inference time, queue time -- **Error rates**: % of failed requests, error types, error sources -- **Throughput**: Requests per hour/day, by endpoint/model -- **Performance**: P50/P95/P99 latency, success rate -- **Utilization**: API usage by org, model, resolution - -## Steps - -1. **Gather requirements from user:** - - Which performance metric to monitor (latency, errors, throughput) - - Alert threshold (e.g., "P95 latency > 30s", "error rate > 5%", "throughput drops > 20%") - - Time window (hourly, daily) - - Scope (all requests, specific endpoint, specific org) - - Notification channel - -2. **Read shared files:** - - `shared/bq-schema.md` — GPU cost table (has API runtime data) and ltxvapi tables - - `shared/metric-standards.md` — Performance metric patterns - -3. 
**Identify data source:** - - For LTX API: Use `ltxvapi_api_requests_with_be_costs` or `gpu_request_attribution_and_cost` - - **Key columns explained:** - - `request_processing_time_ms`: Total time from request submission to completion - - `request_inference_time_ms`: GPU processing time (actual model inference) - - `request_queue_time_ms`: Time waiting in queue before processing starts - - `result`: Request outcome (success, failed, timeout, etc.) - - `error_type`: Classification of errors (infrastructure vs applicative) - - `endpoint`: API endpoint called (e.g., /generate, /upscale) - - `model_type`: Model used (ltxv2, retake, etc.) - - `org_name`: Customer organization making the request - -4. **Write monitoring SQL:** - - Query relevant performance metric - - Calculate percentiles (P50, P95, P99) for latency - - Calculate error rate (failed / total requests) - - Compare against baseline - -5. **Present to user:** - - Show SQL query - - Show example alert format with performance breakdown - - Confirm threshold values - -6. 
**Set up alert** (manual for now): - - Document SQL - - Configure notification to engineering team - -## Reference files - -| File | Read when | -|------|-----------| -| `shared/product-context.md` | LTX products and business context | -| `shared/bq-schema.md` | API tables and GPU cost table schema | -| `shared/metric-standards.md` | Performance metric patterns | -| `shared/event-registry.yaml` | Feature events (if analyzing event-driven metrics) | -| `shared/gpu-cost-query-templates.md` | GPU cost queries (if analyzing cost-related performance) | -| `shared/gpu-cost-analysis-patterns.md` | Cost analysis patterns (if analyzing cost-related performance) | - -## Rules - -- DO use APPROX_QUANTILES for percentile calculations (P50, P95, P99) -- DO separate errors by error_source (infrastructure vs applicative) -- DO filter by result = 'success' for success rate calculations -- DO break down by endpoint, model, and resolution for detailed analysis -- DO compare current performance against historical baseline -- DO alert engineering team for infrastructure errors, product team for applicative errors -- DO partition by dt for performance +## 1. Overview (Why?) + +LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention. + +This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis. + +**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification. + +## 2. Requirements (What?) 
+ +Monitor these outcomes autonomously: + +- [ ] P95 latency spikes (> 2x baseline or > 60s) +- [ ] Error rate increases (> 5% or DoD increase > 50%) +- [ ] Throughput drops (> 30% DoD/WoW) +- [ ] Queue time excessive (> 50% of processing time) +- [ ] Infrastructure errors (> 10 requests/hour) +- [ ] Alerts include breakdown by endpoint, model, organization +- [ ] Results formatted by priority (infrastructure vs applicative errors) +- [ ] Findings routed to appropriate team (API team or Engineering) + +## 3. Progress Tracker + +* [ ] Read shared knowledge (schema, metrics, performance patterns) +* [ ] Identify data source (ltxvapi tables or GPU cost table) +* [ ] Write monitoring SQL with percentile calculations +* [ ] Execute query for target date range +* [ ] Analyze results by endpoint, model, organization +* [ ] Separate infrastructure vs applicative errors +* [ ] Present findings with performance breakdown +* [ ] Route alerts to appropriate team + +## 4. Implementation Plan + +### Phase 1: Read Alert Thresholds + +**Generic thresholds** (data-driven analysis pending): +- P95 latency > 2x baseline or > 60s +- Error rate > 5% or DoD increase > 50% +- Throughput drops > 30% DoD/WoW +- Queue time > 50% of processing time +- Infrastructure errors > 10 requests/hour + +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring). 
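+
+As an illustrative sketch only, the latency and error-rate checks above can be expressed as boolean flags over daily metrics with a precomputed 7-day baseline. Column names follow the `metrics_with_baseline` CTE defined in Phase 4; thresholds are the generic ones pending production analysis:
+
+```sql
+-- Sketch: apply the generic Phase 1 thresholds to one day of metrics.
+SELECT
+  dt,
+  endpoint,
+  model_type,
+  org_name,
+  p95_latency_ms,
+  error_rate_pct,
+  -- P95 latency: > 2x baseline or > 60s absolute (60,000 ms)
+  (p95_latency_ms > 2 * p95_latency_baseline_7d
+    OR p95_latency_ms > 60000) AS p95_latency_alert,
+  -- Error rate: > 5% absolute, or > 50% increase vs baseline
+  (error_rate_pct > 5
+    OR SAFE_DIVIDE(error_rate_pct - error_rate_baseline_7d,
+                   error_rate_baseline_7d) > 0.5) AS error_rate_alert
+FROM metrics_with_baseline
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
+```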
+ +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model, API context +- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema +- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput) +- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics) +- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance) +- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance) + +**Data nuances**: +- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` +- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance +- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name` + +### Phase 3: Identify Data Source + +✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics** + +**Key columns**: +- `request_processing_time_ms`: Total time from request submission to completion +- `request_inference_time_ms`: GPU processing time (actual model inference) +- `request_queue_time_ms`: Time waiting in queue before processing starts +- `result`: Request outcome (success, failed, timeout, etc.) +- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative) +- `endpoint`: API endpoint called (e.g., /generate, /upscale) +- `model_type`: Model used (ltxv2, retake, etc.) 
+- `org_name`: Customer organization making the request
+
+[!IMPORTANT] Verify column name: `error_type` vs `error_source` in actual schema
+
+### Phase 4: Write Monitoring SQL
+
+✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**
+
+```sql
+WITH api_metrics AS (
+  SELECT
+    DATE(action_ts) AS dt,
+    endpoint,
+    model_type,
+    org_name,
+    COUNT(*) AS total_requests,
+    COUNTIF(result = 'success') AS successful_requests,
+    COUNTIF(result != 'success') AS failed_requests,
+    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
+    AVG(request_queue_time_ms) AS avg_queue_time_ms,
+    AVG(request_inference_time_ms) AS avg_inference_time_ms
+  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
+  -- 8 full days: 7 baseline days plus the target day; exclude today's partial data
+  WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))
+    AND action_ts < TIMESTAMP(CURRENT_DATE())
+  GROUP BY dt, endpoint, model_type, org_name
+),
+metrics_with_baseline AS (
+  SELECT
+    *,
+    -- One row per (endpoint, model, org) per day, so 7 preceding rows = 7 prior days
+    AVG(p95_latency_ms) OVER (
+      PARTITION BY endpoint, model_type, org_name
+      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
+    ) AS p95_latency_baseline_7d,
+    AVG(error_rate_pct) OVER (
+      PARTITION BY endpoint, model_type, org_name
+      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
+    ) AS error_rate_baseline_7d
+  FROM api_metrics
+)
+SELECT * FROM metrics_with_baseline
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
+```
+
+**Key patterns**:
+- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
+- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
+- **Baseline**: 7-day rolling average per endpoint, model, and organization
+- **Time window**: Last 8 full days — 7 baseline days plus the target day (shorter than usage monitoring due to higher frequency data)
+
+### Phase 5: Execute Query
+
+Run query using: 
+```bash +bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " + +" +``` + +### Phase 6: Analyze Results + +**For latency trends**: +- Compare P95 latency vs baseline (7-day avg) +- Flag if P95 > 2x baseline or > 60s absolute +- Identify which endpoint/model/org drove spikes + +**For error rate analysis**: +- Compare error rate vs baseline +- Separate errors by `error_type`/`error_source` (infrastructure vs applicative) +- Flag if error rate > 5% or DoD increase > 50% + +**For throughput**: +- Track requests per hour/day by endpoint +- Flag throughput drops > 30% DoD/WoW +- Identify which endpoints lost traffic + +**For queue analysis**: +- Calculate queue time as % of total processing time +- Flag if queue time > 50% of processing time +- Indicates capacity/scaling issues + +### Phase 7: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video") +- **Root cause**: Which endpoint/model/org drove the issue +- **Breakdown**: Performance metrics by dimension +- **Error classification**: Infrastructure vs applicative errors +- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure) + +**Alert format**: +``` +⚠️ API PERFORMANCE ALERT: + • Endpoint: /v1/text-to-video + Model: ltxv2 + Metric: P95 Latency + Current: 85s | Baseline: 30s + Change: +183% + + Error rate: 8.2% (baseline: 2.1%) + Error type: Infrastructure + +Recommendation: Alert Engineering team for infrastructure issue +``` + +### Phase 8: Route Alert + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification by error type: + - Infrastructure errors → Engineering team + - Applicative errors → API/Product team +4. Include endpoint, model, and org details in alert + +## 5. 
Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products and API context +- **`shared/bq-schema.md`** — API tables and GPU cost table schema +- **`shared/metric-standards.md`** — Performance metric patterns +- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics) +- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance) +- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns + +### Data Sources + +**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` +- Partitioned by `action_ts` (TIMESTAMP) +- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name` + +**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Contains API runtime data but not primary source for performance metrics + +### Endpoints +Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate` + +### Models +Common models: `ltxv2`, `retake`, etc. + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** use absolute thresholds without baseline comparison +- **DO NOT** mix infrastructure and applicative errors in same alert +- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance +- **DO NOT** forget to separate errors by error type/source + +[!IMPORTANT] Verify column name in schema: `error_type` vs `error_source` + +### DO + +- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99) +- **DO** separate errors by error_source (infrastructure vs applicative) +- **DO** filter by `result = 'success'` for success rate calculations +- **DO** break down by endpoint, model, and organization for detailed analysis +- **DO** compare current performance against historical baseline (7-day rolling avg) +- **DO** alert engineering team for infrastructure errors +- **DO** alert product/API team for applicative errors +- **DO** partition on `action_ts` or `dt` for performance +- **DO** use `ltx-dwh-explore` as execution project +- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100` +- **DO** flag P95 latency > 2x baseline or > 60s +- **DO** flag error rate > 5% or DoD increase > 50% +- **DO** flag throughput drops > 30% DoD/WoW +- **DO** flag queue time > 50% of processing time +- **DO** flag infrastructure errors > 10 requests/hour +- **DO** include endpoint, model, org details in all alerts +- **DO** validate unusual patterns with API/Engineering team before alerting + +### Completion Criteria + +✅ All performance metrics monitored (latency, errors, throughput, queue time) +✅ Alerts fire with thresholds (generic pending production analysis) +✅ Endpoint/model/org breakdown provided +✅ Errors separated by type (infrastructure vs applicative) +✅ Findings routed to appropriate team +✅ Partition filtering applied for performance +✅ Column name verified (error_type vs error_source) diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md 
index c2f177e..fb5dd29 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -1,6 +1,6 @@ --- name: be-cost-monitoring -description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift. +description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org." tags: [monitoring, costs, gpu, infrastructure, alerts] compatibility: - BigQuery access (ltx-dwh-prod-processed) @@ -9,52 +9,98 @@ compatibility: # Backend Cost Monitoring -## When to use +## 1. Overview (Why?) -- "Monitor GPU costs" -- "Analyze cost trends (daily/weekly/monthly)" -- "Detect cost anomalies or spikes" -- "Break down API costs by endpoint/model/org" -- "Break down Studio costs by process/workspace" -- "Compare API vs Studio cost distribution" -- "Investigate cost-per-request efficiency drift" -- "Monitor GPU utilization and idle costs" -- "Alert on cost budget breaches" -- "Day-over-day or week-over-week cost comparisons" +LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day-of-week. Generic alerting thresholds miss true anomalies while flagging normal variance. -## Steps +This skill provides **autonomous cost monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. -### 1. 
Gather Requirements +**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms. -Ask the user: -- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies -- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process? -- **Time window**: Daily, weekly, monthly? How far back? -- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns -- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline) +**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize). -### 2. Read Shared Knowledge +## 2. Requirements (What?) -Before writing SQL: -- **`shared/product-context.md`** — LTX products and business context +Monitor these outcomes autonomously: + +- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown) +- [ ] Inference cost spikes (volume surge, costlier model, heavy customer) +- [ ] Idle-to-inference ratio degradation (utilization dropping) +- [ ] Failure rate spikes (wasted compute + service quality issue) +- [ ] Cost-per-request drift (model regression or resolution creep) +- [ ] Day-over-day cost jumps (early warning signals) +- [ ] Volume drops (potential outage) +- [ ] Overhead spikes (system anomalies) +- [ ] Alerts prioritized by tier (High/Medium/Low) +- [ ] Vertical-specific thresholds (API vs Studio) + +## 3. Progress Tracker + +* [ ] Read production thresholds and shared knowledge +* [ ] Select appropriate query template(s) +* [ ] Execute query for 3 days ago +* [ ] Analyze results by tier priority +* [ ] Identify root cause (endpoint, model, org, process) +* [ ] Present findings with cost breakdown +* [ ] Route alerts to appropriate team (API/Studio/Engineering) + +## 4. 
Implementation Plan + +### Phase 1: Read Production Thresholds + +Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`: + +#### Tier 1 — High Priority (daily monitoring) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown | +| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer | +| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) | + +#### Tier 2 — Medium Priority (daily monitoring) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue | +| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep | +| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach | + +#### Tier 3 — Low Priority (weekly review) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue | +| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) | + +**Per-Vertical Thresholds:** +- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30% +- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15% + +**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold + +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) - **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL 
queries - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) -Key learnings: +**Data nuances**: - Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` - Partitioned by `dt` (DATE) — always filter for performance -- `cost_category`: inference (requests), idle, overhead, unused -- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only) -- For infrastructure cost: `SUM(row_cost)` across all categories +- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago) +- Cost categories: inference, idle, overhead, unused +- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference only) +- Infrastructure cost = `SUM(row_cost)` across all categories -### 3. Run Query Templates +### Phase 3: Select Query Template -**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview. +✅ **PREFERRED: Run all 11 templates for comprehensive overview** -**If user specified specific analysis:** Select appropriate template: +If user didn't specify, run all templates from `shared/gpu-cost-query-templates.md`: + +**If user specified specific analysis**, select appropriate template: | User asks... | Use template | |-------------|-------------| @@ -68,9 +114,7 @@ Key learnings: | "Cost efficiency by model" | Cost per Request by Model | | "Which orgs cost most?" | API Cost by Organization | -See `shared/gpu-cost-query-templates.md` for all 11 query templates. - -### 4. Execute Query +### Phase 4: Execute Query Run query using: ```bash @@ -79,84 +123,107 @@ bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " " ``` -Or use BigQuery console with project `ltx-dwh-explore`. 
+**Or** use BigQuery console with project `ltx-dwh-explore` + +**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` -### 5. Analyze Results +### Phase 5: Analyze Results -**For cost trends:** +**For cost trends**: - Compare current period vs baseline (7-day avg, prior week, prior month) - Calculate % change and flag significant shifts (>15-20%) -**For anomaly detection:** +**For anomaly detection**: - Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg) - Investigate root cause: specific endpoint/model/org, error rate spike, billing type change -**For breakdowns:** +**For breakdowns**: - Identify top cost drivers (endpoint, model, org, process) - Calculate cost per request to spot efficiency issues - Check failure costs (wasted spend on errors) -### 6. Present Findings +### Phase 6: Present Findings Format results with: -- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday") +- **Summary**: Key finding (e.g., "GPU costs spiked 45% 3 days ago") - **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%") -- **Breakdown**: Top contributors by dimension +- **Breakdown**: Top contributors by dimension (endpoint, model, org, process) - **Recommendation**: Action to take (investigate org X, optimize model Y, alert team) +- **Priority tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low) -### 7. Set Up Alert (if requested) +### Phase 7: Set Up Alert (Optional) For ongoing monitoring: 1. Save SQL query 2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold -4. Route alerts to Slack channel or Linear issue +3. Configure notification threshold by tier +4. Route alerts to appropriate team: + - API cost spikes → API team + - Studio cost spikes → Studio team + - Infrastructure issues → Engineering team -## Schema Reference +## 5. 
Context & References -For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`. +### Production Thresholds +- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for LTX division (60-day analysis) -## Reference Files +### Shared Knowledge +- **`shared/product-context.md`** — LTX products and business context +- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) +- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) +- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries +- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) + +### Data Source +Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `dt` (DATE) +- Division: `LTX` (LTX API + LTX Studio) +- Cost categories: inference (27%), idle (72%), overhead (0.3%) +- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name` + +### Query Templates +See `shared/gpu-cost-query-templates.md` for 11 production-ready queries -| File | Read when | -|------|-----------| -| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations | -| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) | -| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) | -| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) | -| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, investigation playbooks | +## 6. 
Constraints & Done -## Rules +### DO NOT -### Query Best Practices +- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize) +- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting +- **DO NOT** mix inference and non-inference rows in same aggregation without filtering +- **DO NOT** use absolute thresholds — always compare to baseline (avg+2σ/avg+3σ) +- **DO NOT** skip partition filtering — always filter on `dt` for performance + +### DO - **DO** always filter on `dt` partition column for performance +- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date - **DO** filter `cost_category = 'inference'` for request-level analysis -- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis -- **DO** include LT team requests only when analyzing total infrastructure spend or debugging +- **DO** exclude Lightricks team with `is_lt_team IS FALSE` for customer-facing cost analysis +- **DO** include LT team only when analyzing total infrastructure spend or debugging - **DO** use `ltx-dwh-explore` as execution project - **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero -- **DO** compare against baseline (7-day avg, prior period) for trends -- **DO** round cost values to 2 decimal places for readability - -### Cost Calculation - - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request - **DO** use `SUM(row_cost)` across all rows for total infrastructure cost -- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting) -- **DO NOT** mix inference and non-inference rows in same aggregation without filtering - -### Analysis - +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** round cost values to 2 decimal places for readability - **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 
std devs) - **DO** investigate failure costs (wasted spend on errors) - **DO** break down by endpoint/model for API, by process for Studio - **DO** check cost per request trends to spot efficiency degradation - **DO** validate results against total infrastructure spend - -### Alerts - - **DO** set thresholds based on historical baseline, not absolute values - **DO** alert engineering team for cost spikes > 30% vs baseline - **DO** include cost breakdown and root cause in alerts - **DO** route API cost alerts to API team, Studio alerts to Studio team + +### Completion Criteria + +✅ All cost categories analyzed (idle, inference, overhead) +✅ Production thresholds applied by tier (High/Medium/Low) +✅ Alerts prioritized and routed to appropriate teams +✅ Root cause investigation completed (endpoint/model/org/process) +✅ Findings presented with cost breakdown and recommendations +✅ Query uses 3-day lookback (not yesterday) +✅ Partition filtering applied for performance diff --git a/agents/monitoring/enterprise/SKILL.md b/agents/monitoring/enterprise/SKILL.md index 2cf15ce..39c2403 100644 --- a/agents/monitoring/enterprise/SKILL.md +++ b/agents/monitoring/enterprise/SKILL.md @@ -1,67 +1,255 @@ --- name: enterprise-monitor -description: Monitors enterprise account health, usage, and contract compliance. Alerts on low engagement, quota breaches, or churn risk. -tags: [monitoring, enterprise, accounts, contracts] +description: "Monitor enterprise account health, usage, and contract compliance. Detects low engagement, quota breaches, and churn risk. Use when: (1) detecting enterprise account usage drops, (2) alerting on inactive accounts, (3) investigating power user engagement changes." 
+tags: [monitoring, enterprise, accounts, contracts, churn-risk] --- # Enterprise Monitor -## When to use - -- "Monitor enterprise account usage" -- "Monitor enterprise churn risk" -- "Alert when enterprise account is inactive" - -## What it monitors - -- **Account usage**: DAU, WAU, MAU per enterprise org -- **Token consumption**: Usage vs contracted quota, historical consumption trends -- **User activation**: % of seats active -- **Engagement**: Video generations, image generations, downloads per org -- **Churn signals**: Declining usage, inactive users - -## Steps - -1. **Gather requirements from user:** - - Which enterprise org(s) to monitor (or all) - - Alert threshold based on historical usage of each account (e.g., "usage drops > 30% vs their baseline", "MAU below their 30-day average", "< 50% of contracted quota") - - Time window (weekly, monthly) - - Notification channel (Slack, email, Linear issue) - -2. **Read shared files:** - - `shared/product-context.md` — LTX products, enterprise business model, user types - - `shared/bq-schema.md` — Enterprise user segmentation queries - - `shared/metric-standards.md` — Enterprise metrics, quota tracking - - `shared/event-registry.yaml` — Feature events (if analyzing engagement) - - `shared/gpu-cost-query-templates.md` — GPU cost queries (if analyzing infrastructure costs) - - `shared/gpu-cost-analysis-patterns.md` — Cost analysis patterns (if analyzing infrastructure costs) - -3. **Identify enterprise users:** - - Use enterprise segmentation CTE from bq-schema.md (lines 441-461) - - Apply McCann split (McCann_NY vs McCann_Paris) - - Exclude Lightricks and Popular Pays - -4. 
**Write monitoring SQL:** - - Query org-level usage metrics - - Set baseline for each org based on their historical usage (e.g., 30-day average, 90-day trend) - - Compare current usage against org-specific baseline or contracted quota - - Flag orgs below threshold or showing decline - - Flag meaningful drops for power users (users with top usage within each org) - -5. **Present to user:** - - Show SQL query - - Show example alert format with org name and metrics - - Confirm threshold values and alert logic - -6. **Set up alert** (manual for now): - - Document SQL - - Configure notification to customer success team - -## Rules - -- DO use EXACT enterprise segmentation CTE from bq-schema.md without modification -- DO apply McCann split (McCann_NY vs McCann_Paris) -- DO exclude Lightricks and Popular Pays from enterprise orgs -- DO break out pilot vs contracted accounts -- DO NOT alert on free/self-serve users — this agent is enterprise-only -- DO include org name in alert for easy customer success follow-up +## 1. Overview (Why?) + +Enterprise accounts (contract and pilot) represent high-value customers with negotiated quotas and specific engagement patterns. Unlike self-serve users, enterprise usage should be monitored per-organization with org-specific baselines, since each has different team sizes, use cases, and contract terms. + +This skill provides **autonomous enterprise account monitoring** that detects declining usage, underutilization of quotas, inactive periods, and power user drops — all of which signal churn risk or engagement problems that require customer success intervention. + +**Problem solved**: Identify enterprise churn risk early through usage signals — before contracts end or accounts go completely inactive — with org-level root cause analysis. + +## 2. Requirements (What?) 
+
+Monitor these outcomes autonomously:
+
+- [ ] DAU/WAU/MAU drops per enterprise org (> 30% vs org baseline)
+- [ ] Token consumption vs contracted quota (underutilization < 50%)
+- [ ] User activation (% of seats active)
+- [ ] Video/image generation engagement per org
+- [ ] Power user drops within org (> 20% decline)
+- [ ] Zero activity for 7+ consecutive days
+- [ ] Alerts include org name for customer success follow-up
+- [ ] Pilot vs contract accounts separated
+- [ ] McCann split applied (McCann_NY vs McCann_Paris)
+- [ ] Lightricks and Popular Pays excluded
+
+## 3. Progress Tracker
+
+* [ ] Read shared knowledge (enterprise segmentation, schema, metrics)
+* [ ] Identify enterprise users with segmentation CTE
+* [ ] Write monitoring SQL with org-specific baselines
+* [ ] Execute query for target date range
+* [ ] Analyze results by org and account type
+* [ ] Identify power user drops within each org
+* [ ] Present findings with org-level details
+* [ ] Route alerts to customer success team
+
+## 4. Implementation Plan
+
+### Phase 1: Read Alert Thresholds
+
+**Generic thresholds** (data-driven analysis pending):
+- DoD/WoW usage drops > 30% vs org's baseline
+- MAU below org's 30-day average
+- Token consumption < 50% of contracted quota (underutilization)
+- Power user drops > 20% within org
+- Zero activity for 7+ consecutive days
+
+> [!IMPORTANT]
+> These are generic thresholds. Consider creating production thresholds based on org-specific analysis.
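Applied per org, the generic thresholds above reduce to a small rule check. A minimal sketch follows; the function name and input shape are hypothetical illustrations, not part of the shared pipeline:

```python
# Hypothetical sketch of the generic Phase 1 thresholds; the function
# name and inputs are illustrative, not the production alert pipeline.
def check_org_thresholds(dau, dau_baseline_30d, tokens_used, contracted_quota,
                         power_user_drop_pct, consecutive_inactive_days):
    """Return the list of generic threshold breaches for one enterprise org."""
    breaches = []
    # DoD/WoW usage drop > 30% vs the org's own 30-day baseline
    if dau_baseline_30d and (dau_baseline_30d - dau) / dau_baseline_30d > 0.30:
        breaches.append("usage drop > 30% vs 30-day baseline")
    # Token consumption < 50% of contracted quota (underutilization)
    if contracted_quota and tokens_used / contracted_quota < 0.50:
        breaches.append("quota underutilization (< 50% consumed)")
    # Power user drop > 20% within the org
    if power_user_drop_pct > 0.20:
        breaches.append("power user drop > 20%")
    # Zero activity for 7+ consecutive days
    if consecutive_inactive_days >= 7:
        breaches.append("zero activity for 7+ consecutive days")
    return breaches
```

An org at 50 DAU against a 100-DAU baseline with only a third of its quota consumed would trip the first two rules.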
+ +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events (if analyzing engagement) + +**Data nuances**: +- Use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- Apply McCann split: `McCann_NY` vs `McCann_Paris` +- Exclude: `Lightricks`, `Popular Pays`, `None` +- Contract accounts: Indegene, HearWell_BeWell, Novig, Cylndr Studios, Miroma, Deriv, McCann_Paris +- Pilot accounts: All other enterprise orgs + +### Phase 3: Identify Enterprise Users + +✅ **PREFERRED: Use exact segmentation CTE from bq-schema.md** + +```sql +WITH ent_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' + ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) + END AS org + FROM `ltx-dwh-prod-processed.web.ltxstudio_users` + WHERE is_enterprise_user + AND current_customer_plan_type IN ('contract', 'pilot') + AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None') +), +enterprise_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN org + ELSE CONCAT(org, ' Pilot') + END AS org, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN 'Contract' + ELSE 'Pilot' + END AS account_type + FROM ent_users + WHERE org NOT IN ('Lightricks', 'Popular 
Pays', 'None')
+)
+```
+
+### Phase 4: Write Monitoring SQL
+
+✅ **PREFERRED: Monitor all enterprise orgs with org-specific baselines**
+
+```sql
+-- NOTE: run with the ent_users / enterprise_users CTEs from Phase 3
+-- prepended; enterprise_users is referenced in the JOIN below.
+WITH org_metrics AS (
+  SELECT
+    a.dt,
+    e.org,
+    e.account_type,
+    COUNT(DISTINCT a.lt_id) AS dau,
+    SUM(a.num_tokens_consumed) AS tokens,
+    SUM(a.num_generate_image) AS image_gens,
+    SUM(a.num_generate_video) AS video_gens
+  FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a
+  JOIN enterprise_users e ON a.lt_id = e.lt_id
+  WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
+    AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+  GROUP BY a.dt, e.org, e.account_type
+),
+metrics_with_baseline AS (
+  SELECT
+    *,
+    AVG(dau) OVER (
+      PARTITION BY org
+      ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
+    ) AS dau_baseline_30d,
+    AVG(tokens) OVER (
+      PARTITION BY org
+      ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
+    ) AS tokens_baseline_30d
+  FROM org_metrics
+)
+SELECT * FROM metrics_with_baseline
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
+```
+
+**Key patterns**:
+- **Org-specific baselines**: Each org compared to its own 30-day average
+- **McCann split**: Separate McCann_NY and McCann_Paris
+- **Account type**: Contract vs Pilot
+
+### Phase 5: Analyze Results
+
+**For usage trends**:
+- Compare org's current usage vs their 30-day baseline
+- Flag orgs with DoD/WoW drops > 30% vs baseline
+- Flag orgs with MAU below their 30-day average
+
+**For quota analysis**:
+- Compare token consumption vs contracted quota
+- Flag underutilization (< 50% of quota)
+- Identify orgs approaching or exceeding quota
+
+**For engagement**:
+- Track power users within each org (top 20% by token usage)
+- Flag power user drops > 20% within org
+- Flag zero activity for 7+ consecutive days
+
+### Phase 6: Present Findings
+
+Format results with:
+- **Summary**: Key finding (e.g., "Novig enterprise account inactive for 7 days")
+- **Org details**: Org name, account type (Contract/Pilot), baseline usage
+- 
**Metrics**: DAU, tokens, generations vs baseline +- **Recommendation**: Customer success action (reach out, investigate, adjust quota) + +**Alert format**: +``` +⚠️ ENTERPRISE ALERT: + • Org: Novig (Contract) + Metric: Token consumption + Current: 0 tokens | Baseline: 27K/day + Drop: -100% | Zero activity for 7 days + +Recommendation: Contact account manager immediately +``` + +### Phase 7: Route Alert + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification for customer success team +4. Include org name and account manager contact in alert + +## 5. Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events for engagement analysis + +### Data Sources +- **Users table**: `ltx-dwh-prod-processed.web.ltxstudio_users` +- **Usage table**: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Key columns: `lt_id`, `enterprise_name_at_purchase`, `current_enterprise_name`, `organization_name`, `current_customer_plan_type` + +### Enterprise Orgs + +**Contract accounts**: +- Indegene +- HearWell_BeWell +- Novig +- Cylndr Studios +- Miroma +- Deriv +- McCann_Paris + +**Pilot accounts**: All other enterprise orgs (suffixed with " Pilot") + +**Excluded**: Lightricks, Popular Pays, None + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** modify enterprise segmentation CTE — use exact version from bq-schema.md +- **DO NOT** alert on free/self-serve users — this agent is enterprise-only +- **DO NOT** combine McCann_NY and McCann_Paris — keep them separate +- **DO NOT** include Lightricks or Popular Pays in enterprise monitoring +- **DO NOT** use generic baselines — each org compared to its own historical usage + +### DO + +- **DO** use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- **DO** apply McCann split (McCann_NY vs McCann_Paris) +- **DO** exclude Lightricks, Popular Pays, and None from enterprise orgs +- **DO** break out pilot vs contracted accounts +- **DO** include org name in alert for customer success follow-up +- **DO** use org-specific baselines (each org's 30-day average) +- **DO** flag DoD/WoW usage drops > 30% vs org baseline +- **DO** flag MAU below org's 30-day average +- **DO** flag token consumption < 50% of contracted quota +- **DO** flag power user drops > 20% within org +- **DO** flag zero activity for 7+ consecutive days +- **DO** route alerts to customer success team with org details +- **DO** validate unusual patterns with customer success before alerting + +### Completion Criteria + +✅ All enterprise orgs monitored (Contract and Pilot) +✅ Org-specific baselines applied +✅ McCann split applied (NY vs Paris) +✅ Lightricks and Popular Pays excluded +✅ Alerts include org name and account type +✅ Usage drops, quota issues, and engagement drops detected +✅ Findings routed to customer success team diff --git a/agents/monitoring/revenue/SKILL.md b/agents/monitoring/revenue/SKILL.md index 07f45b4..804d688 100644 --- a/agents/monitoring/revenue/SKILL.md +++ b/agents/monitoring/revenue/SKILL.md @@ -1,64 +1,224 @@ --- name: revenue-monitor -description: Monitors revenue metrics, tracks subscription changes, and alerts on revenue anomalies or threshold breaches. 
-tags: [monitoring, revenue, subscriptions] +description: "Monitor LTX Studio revenue metrics and subscription changes. Detects revenue anomalies, churn spikes, and refund issues. Use when: (1) detecting revenue drops, (2) alerting on churn rate changes, (3) investigating subscription tier movements." +tags: [monitoring, revenue, subscriptions, churn, mrr] --- # Revenue Monitor -## When to use +## 1. Overview (Why?) -- "Monitor revenue trends" -- "Alert when revenue drops" -- "Track subscription churn" -- "Monitor MRR/ARR changes" -- "Alert on refund spikes" +LTX Studio revenue varies by tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot). Revenue monitoring requires tracking both top-line metrics (MRR, ARR) and operational indicators (churn, refunds, tier movements) to detect problems early. -## What it monitors +This skill provides **autonomous revenue monitoring** with alerting on revenue drops, churn rate spikes, refund increases, and subscription changes. It monitors all segments and identifies which tiers or plan types drive changes. -- **Revenue metrics**: MRR, ARR, daily revenue -- **Subscription metrics**: New subscriptions, cancellations, churns, renewals -- **Refunds**: Refund rate, refund amount -- **Tier changes**: Upgrades, downgrades -- **Enterprise contracts**: Contract value, renewals +**Problem solved**: Detect revenue problems, churn risk, and subscription health issues before they compound — with segment-level root cause analysis. -## Steps +## 2. Requirements (What?) -1. **Gather requirements from user:** - - Which revenue metric to monitor - - Alert threshold (e.g., "drop > 10%", "< $X per day") - - Time window (daily, weekly, monthly) - - Notification channel (Slack, email) +Monitor these outcomes autonomously: -2. **Read shared files:** - - `shared/bq-schema.md` — Subscription tables (ltxstudio_user_tiers_dates, etc.) 
- - `shared/metric-standards.md` — Revenue metric definitions +- [ ] Revenue drops by segment (tier, plan type) +- [ ] MRR and ARR trend changes +- [ ] Churn rate increases above baseline +- [ ] Refund rate spikes +- [ ] New subscription volume drops +- [ ] Tier movements (upgrades vs downgrades) +- [ ] Enterprise contract renewals approaching (< 90 days) +- [ ] Alerts fire only when changes exceed thresholds +- [ ] Root cause identifies which tiers/plans drove changes -3. **Write monitoring SQL:** - - Query current metric value - - Compare against historical baseline or threshold - - Flag anomalies or breaches +## 3. Progress Tracker -4. **Present to user:** - - Show SQL query - - Show example alert format - - Confirm threshold values +* [ ] Read shared knowledge (schema, metrics, business context) +* [ ] Write monitoring SQL with baseline comparisons +* [ ] Execute query for target date range +* [ ] Analyze results by segment (tier, plan type) +* [ ] Identify root cause (which segments drove changes) +* [ ] Present findings with severity and recommendations +* [ ] Set up ongoing alerts (if requested) -5. **Set up alert** (manual for now): - - Document SQL in Hex or BigQuery scheduled query - - Configure Slack webhook or notification +## 4. Implementation Plan -## Reference files +### Phase 1: Read Alert Thresholds -| File | Read when | -|------|-----------| -| `shared/bq-schema.md` | Writing SQL for subscription/revenue tables | -| `shared/metric-standards.md` | Defining revenue metrics | +**Generic thresholds** (data-driven analysis pending): +- Revenue drop > 15% DoD or > 10% WoW +- Churn rate > 5% or increase > 2x baseline +- Refund rate > 3% or spike > 50% DoD +- New subscriptions drop > 20% WoW +- Enterprise renewals < 90 days out -## Rules +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on 60-day analysis (similar to usage/GPU cost monitoring). 
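The generic revenue thresholds above can likewise be expressed as a small rule check. This is a sketch only; the function name and inputs are hypothetical, not the production alert pipeline:

```python
# Hypothetical sketch of the generic revenue thresholds; names are
# illustrative, not the production alert pipeline.
def revenue_threshold_flags(mrr, mrr_prev_day, churned_subs, active_subs,
                            churn_rate_baseline, refund_rate):
    """Return the list of generic revenue threshold breaches for one segment."""
    flags = []
    # Revenue drop > 15% DoD
    if mrr_prev_day and (mrr_prev_day - mrr) / mrr_prev_day > 0.15:
        flags.append("MRR drop > 15% DoD")
    # Churn rate > 5% or more than 2x the segment's baseline
    churn_rate = churned_subs / active_subs if active_subs else 0.0
    if churn_rate > 0.05 or (churn_rate_baseline and churn_rate > 2 * churn_rate_baseline):
        flags.append("churn rate above baseline")
    # Refund rate > 3%
    if refund_rate > 0.03:
        flags.append("refund rate > 3%")
    return flags
```

In practice these checks run per segment (tier and plan type), so a drop confined to one tier is flagged against that tier's own baseline rather than diluted across the total.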
-- DO use LTX Studio subscription tables from bq-schema.md -- DO exclude is_lt_team unless explicitly requested -- DO validate thresholds with user before setting alerts -- DO NOT hardcode dates — use rolling windows -- DO account for timezone differences in daily revenue calculations +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables (ltxstudio_user_tiers_dates, ltxstudio_subscriptions, etc.), user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV) +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-driven revenue) + +**Data nuances**: +- Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type` +- Exclude `is_lt_team` unless explicitly requested +- Revenue calculations vary by plan type (self-serve vs contract/pilot) + +### Phase 3: Write Monitoring SQL + +✅ **PREFERRED: Monitor all segments with baseline comparisons** + +```sql +WITH revenue_metrics AS ( + SELECT + dt, + griffin_tier_name, + plan_type, + COUNT(DISTINCT lt_id) AS active_subscribers, + SUM(CASE WHEN subscription_start_date = dt THEN 1 ELSE 0 END) AS new_subs, + SUM(CASE WHEN subscription_end_date = dt THEN 1 ELSE 0 END) AS churned_subs, + SUM(mrr_amount) AS total_mrr, + SUM(arr_amount) AS total_arr + FROM subscription_table + WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + AND dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + AND is_lt_team IS FALSE + GROUP BY dt, griffin_tier_name, plan_type +), +metrics_with_baseline AS ( + SELECT + *, + AVG(total_mrr) OVER ( + PARTITION BY griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS mrr_baseline_7d, + AVG(churned_subs) OVER ( + PARTITION BY 
griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS churn_baseline_7d + FROM revenue_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Segmentation**: By tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot) +- **Baseline**: 7-day rolling average for comparison +- **Time window**: Yesterday only + +### Phase 4: Execute Query + +Run query using: +```bash +bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " + +" +``` + +### Phase 5: Analyze Results + +**For revenue trends**: +- Compare current period vs baseline (7-day avg, prior week, prior month) +- Calculate % change and flag significant shifts (>15% DoD, >10% WoW) +- Identify which tiers/plans drove changes + +**For churn analysis**: +- Calculate churn rate = churned_subs / active_subscribers +- Compare to baseline and flag if > 5% or increase > 2x baseline +- Identify which tiers have highest churn + +**For subscription health**: +- Track new subscription volume by tier +- Monitor upgrade vs downgrade ratios +- Flag enterprise renewal dates < 90 days out + +### Phase 6: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "MRR dropped 18% yesterday") +- **Root cause**: Which segment drove the change (e.g., "Standard tier churned 12 users") +- **Breakdown**: Metrics by tier and plan type +- **Recommendation**: Action to take (investigate churn reason, contact at-risk accounts) + +**Alert format**: +``` +⚠️ REVENUE ALERT: + • Metric: MRR Drop + Current: $X | Baseline: $Y + Change: -Z% + Segment: Standard tier, self-serve + +Recommendation: Investigate recent Standard tier churns +``` + +### Phase 7: Set Up Alert (Optional) + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification threshold +4. Route alerts to Revenue/Growth team + +## 5. 
Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables, user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV, retention) +- **`shared/event-registry.yaml`** — Feature events for feature-driven revenue analysis + +### Data Sources +Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type`, `mrr_amount`, `arr_amount` +- Filter: `is_lt_team IS FALSE` for customer revenue + +### Tiers +- Free (no revenue) +- Lite (lowest paid tier) +- Standard (mid-tier) +- Pro (high-tier) +- Enterprise (contract/pilot) + +### Plan Types +- Self-serve (automated subscription) +- Contract (enterprise contract) +- Pilot (enterprise pilot) + +## 6. Constraints & Done + +### DO NOT + +- **DO NOT** include is_lt_team users unless explicitly requested +- **DO NOT** hardcode dates — use rolling windows +- **DO NOT** use absolute thresholds — compare to baseline +- **DO NOT** mix plan types without proper segmentation +- **DO NOT** ignore timezone differences in daily revenue calculations + +### DO + +- **DO** use LTX Studio subscription tables from bq-schema.md +- **DO** exclude is_lt_team unless explicitly requested +- **DO** validate thresholds with user before setting alerts +- **DO** use rolling windows (7-day, 30-day baselines) +- **DO** account for timezone differences in daily revenue calculations +- **DO** segment by tier AND plan type for root cause analysis +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** calculate churn rate = churned / active subscribers +- **DO** flag MRR drops > 15% DoD or > 10% WoW +- **DO** flag churn rate > 5% or increase > 2x baseline +- **DO** flag refund rate > 3% or spike > 50% DoD +- **DO** flag new subscription 
drops > 20% WoW +- **DO** track enterprise renewal dates < 90 days out +- **DO** include segment breakdown in all alerts +- **DO** validate unusual patterns with Revenue/Growth team before alerting + +### Completion Criteria + +✅ All revenue metrics monitored (MRR, ARR, churn, refunds, new subs) +✅ Alerts fire with thresholds (generic pending production analysis) +✅ Segment-level root cause identified +✅ Findings presented with recommendations +✅ Timezone handling applied to daily revenue +✅ Enterprise renewals tracked diff --git a/agents/monitoring/usage/README.md b/agents/monitoring/usage/README.md new file mode 100644 index 0000000..2fff6c2 --- /dev/null +++ b/agents/monitoring/usage/README.md @@ -0,0 +1,184 @@ +# Usage Monitor + +Autonomous usage monitoring for LTX Studio using statistical anomaly detection. + +## Overview + +Detects **data spikes** (both increases and decreases) in user engagement metrics by comparing yesterday's values against the last 10 same-day-of-week data points using 2 standard deviations (2σ) threshold. + +**Problem solved:** Early detection of significant changes in user behavior, product adoption, feature launches, enterprise churn risk, or engagement shifts. + +## Quick Start + +```bash +# Install dependency (one-time) +pip3 install google-cloud-bigquery + +# Monitor yesterday (default) +python3 usage_monitor.py + +# Monitor specific date +python3 usage_monitor.py --date 2026-03-09 + +# Show help +python3 usage_monitor.py --help +``` + +## What It Monitors + +### Segments (prioritized) +1. **Enterprise Contract** - Contracted enterprise accounts (weekdays only) +2. **Enterprise Pilot** - Pilot enterprise accounts (weekdays only) +3. **Heavy Users** - Active 4+ weeks, paying, consuming tokens +4. **Paying non-Enterprise** - Paid tiers excluding enterprise +5. 
**Free** - Free tier users + +### Metrics per Segment +- **DAU** (Daily Active Users) +- **Token consumption** +- **Image generations** +- **Video generations** (skipped for Enterprise - too volatile) + +## Statistical Method + +**Alert Logic:** `|yesterday_value - μ| > 2σ` + +Where: +- **μ (mean)** = average of last 10 same-day-of-week values +- **σ (stddev)** = standard deviation of last 10 same-day-of-week values +- **z-score** = (yesterday - μ) / σ + +**Why same-day-of-week?** +- Monday usage differs from Friday usage +- Comparing Monday to last 10 Mondays (not all days) +- Accounts for weekly seasonality + +**Why 2 standard deviations?** +- Captures 95.4% of normal variance +- Auto-adapts to each segment's natural patterns +- Balances early detection with false positive reduction + +**Severity Levels:** +- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor +- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate +- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action + +## Example Output + +``` +================================================================================ + LTX STUDIO USAGE MONITORING - 2026-03-09 + Day: WEEKDAY + Method: 2 Standard Deviations (2σ) from last 10 same-day-of-week +================================================================================ + +⏳ Running BigQuery query... 
+✅ Query complete (12,345,678 bytes processed) + +🔴 3 ALERTS DETECTED + +================================================================================ + +⚠️ WARNING ALERTS (2): + + • Free - Tokens + Current: 4,497,947 | Mean (μ): 3,068,455 | Std Dev (σ): 426,074 + Z-score: 3.36 (3σ < |z| ≤ 4.5σ) + Change: +46.6% from mean + + • Heavy Users - Tokens + Current: 2,157,891 | Mean (μ): 1,246,532 | Std Dev (σ): 263,847 + Z-score: 3.45 (3σ < |z| ≤ 4.5σ) + Change: +73.0% from mean + +ℹ️ NOTICE ALERTS (1): + + • Paying non-Enterprise - Dau + Current: 1,234 | Mean (μ): 1,450 | Std Dev (σ): 89 + Z-score: -2.43 (2σ < |z| ≤ 3σ) + Change: -14.9% from mean + +================================================================================ +Total: 0 CRITICAL, 2 WARNING, 1 NOTICE +================================================================================ + +Note: 2σ threshold captures 95.4% of normal variance + CRITICAL: |z| > 4.5, WARNING: 3 < |z| ≤ 4.5, NOTICE: 2 < |z| ≤ 3 +``` + +## Root Cause Investigation + +When alerts fire for **Enterprise** segments, use `investigate_root_cause.sql` to drill down to organization level: + +```bash +# Copy SQL to BigQuery console or bq CLI +# Shows week-over-week change by organization +``` + +For other segments (Heavy, Paying, Free), investigate: +- Tier distribution (Standard vs Pro vs Lite) +- Feature launches or product changes +- Marketing campaigns or promotional activity + +## Exception Handling + +**Enterprise weekends are suppressed:** +- Enterprise Contract and Pilot alerts only fire on weekdays (Mon-Fri) +- Weekend usage is too sparse and volatile for statistical reliability +- Prevents false positives from low weekend activity + +## Files + +- **`SKILL.md`** - Agent instructions and workflow documentation +- **`usage_monitor.py`** - Combined SQL + alerting logic (303 lines) +- **`investigate_root_cause.sql`** - Org-level drill-down for Enterprise alerts +- **`README.md`** - This file + +## Interpreting Results + 
+**Positive spikes (increase):** +- Feature launches driving adoption +- Marketing campaigns succeeding +- Product improvements resonating +- Enterprise expansion + +**Negative spikes (decrease):** +- Churn events +- Product issues or bugs +- Competitive threats +- Enterprise account going dormant + +**For CRITICAL alerts:** +- Immediate investigation required +- Contact account managers (if Enterprise) +- Check for system outages or data pipeline issues + +**For WARNING alerts:** +- Monitor for persistence (repeats next day?) +- Investigate root cause during business hours + +**For NOTICE alerts:** +- Track trend over next few days +- Document for pattern analysis + +## Integration + +**Future enhancements:** +- Slack notifications for CRITICAL/WARNING alerts +- Linear issue creation for persistent alerts +- Historical alert log and trend analysis +- Automated root cause recommendations + +## Requirements + +- Python 3.7+ +- `google-cloud-bigquery` package +- BigQuery access to `ltx-dwh-prod-processed` dataset +- Execution project: `ltx-dwh-explore` + +## Data Source + +- **Table:** `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- **Partitioned by:** `dt` (DATE) +- **Lookback window:** 70 days (ensures 10+ same-DOW data points) +- **LT team excluded:** Already filtered at table level