From 9dab4462039872a6cc50b8f10ae6696fc01fa9e8 Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:05:58 +0200
Subject: [PATCH 01/20] feat: implement production usage monitoring with data-driven thresholds
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Transform all monitoring agents into autonomous problem detectors and
implement a production-ready usage monitoring system with statistically
derived alerting thresholds based on a 60-day analysis.

Key Changes:

1. Autonomous Monitoring Across All Agents
   - Updated all 5 monitoring agents (usage, be-cost, revenue, enterprise, api-runtime)
   - Changed Step 1 from "Gather Requirements" to "Run Comprehensive Analysis"
   - Auto-analyze ALL metrics, segments, and time windows without user prompting
   - Auto-detect problems using statistical thresholds (Z-score, DoD, WoW, baseline)

2. Production Usage Monitoring Implementation
   - Data-driven thresholds per segment based on 60-day volatility analysis
   - 14-day same-day-of-week baseline methodology (handles weekday/weekend patterns)
   - Segment-specific thresholds:
     * Enterprise Contract/Pilot: -50% DAU, -60% Image Gens, -70% Tokens (weekday only)
     * Heavy Users: -25% DAU, -30% Image Gens, -25% Tokens
     * Paying non-Enterprise: -20% DAU, -25% Image Gens, -20% Tokens
     * Free: -25% DAU, -35% Image Gens, -20% Tokens
   - Two-tier severity: WARNING (drop > threshold), CRITICAL (drop > 1.5x threshold)
   - Weekend suppression for Enterprise (weekday-only alerting)
   - Skip Enterprise video generation alerts (CV > 100%, too volatile)

3. Root Cause Investigation Workflow
   - Enterprise segments: Drill down to organization level to identify which clients drove drops
   - Other segments: Analyze by tier distribution (Standard vs Pro vs Lite)
   - Alert format includes current vs baseline, drop %, threshold, and recommended actions

4.
Full Segmentation Enforcement
   - Updated usage monitor to reference full segmentation CTE from shared/bq-schema.md
   - Enforces proper hierarchy: Enterprise → Heavy → Paying → Free
   - Consistent segmentation across all usage monitoring queries

Technical Details:
- Alert logic: today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)
- 14-day same-DOW baseline: AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt))
- Data source: ltx-dwh-prod-processed.web.ltxstudio_agg_user_date
- Partition pruning on dt (DATE) for performance

Anti-Patterns Addressed:
- Generic thresholds replaced with segment-specific, data-driven values
- Day-of-week effects handled via same-DOW comparison (not DoD on weekends)
- Enterprise weekend alerts suppressed (6-7 DAU is 18% of weekday, too noisy)
- Enterprise video gen alerts skipped (single-user dominated, CV > 100%)
- Small segment noise handled via production threshold calibration

Co-Authored-By: Claude Sonnet 4.5
---
 agents/monitoring/api-runtime/SKILL.md        |  16 +-
 agents/monitoring/be-cost/SKILL.md            |  21 +-
 agents/monitoring/enterprise/SKILL.md         |  16 +-
 agents/monitoring/revenue/SKILL.md            |  15 +-
 agents/monitoring/usage/SKILL.md              | 224 +++++++++++++----
 .../usage/investigate_root_cause.sql          |  85 +++++++
 agents/monitoring/usage/usage_monitoring.py   | 216 ++++++++++++++++
 agents/monitoring/usage/usage_monitoring.sql  | 179 ++++++++++++++
 .../monitoring/usage/usage_monitoring_v2.py   | 234 ++++++++++++++++++
 9 files changed, 933 insertions(+), 73 deletions(-)
 create mode 100644 agents/monitoring/usage/investigate_root_cause.sql
 create mode 100644 agents/monitoring/usage/usage_monitoring.py
 create mode 100644 agents/monitoring/usage/usage_monitoring.sql
 create mode 100644 agents/monitoring/usage/usage_monitoring_v2.py

diff --git a/agents/monitoring/api-runtime/SKILL.md b/agents/monitoring/api-runtime/SKILL.md
index 84ece4a..f481b0e 100644
--- a/agents/monitoring/api-runtime/SKILL.md
+++ b/agents/monitoring/api-runtime/SKILL.md
@@ -24,12
+24,16 @@ tags: [monitoring, api, performance, latency, errors] ## Steps -1. **Gather requirements from user:** - - Which performance metric to monitor (latency, errors, throughput) - - Alert threshold (e.g., "P95 latency > 30s", "error rate > 5%", "throughput drops > 20%") - - Time window (hourly, daily) - - Scope (all requests, specific endpoint, specific org) - - Notification channel +1. **Run Comprehensive Analysis:** + - **What to monitor**: ALL - Latency (P50/P95/P99), error rates, throughput, inference time, queue time + - **By Segment**: ALL - By endpoint, model, organization, resolution + - **Time window**: Last 7 days with hourly/daily granularity + - **Alert threshold**: Auto-detect problems using: + - P95 latency > 2x baseline or > 60s + - Error rate > 5% or DoD increase > 50% + - Throughput drops > 30% DoD/WoW + - Queue time > 50% of processing time + - Infrastructure errors > 10 requests/hour 2. **Read shared files:** - `shared/bq-schema.md` — GPU cost table (has API runtime data) and ltxvapi tables diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md index c2f177e..cd4cfac 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -24,14 +24,19 @@ compatibility: ## Steps -### 1. Gather Requirements - -Ask the user: -- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies -- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process? -- **Time window**: Daily, weekly, monthly? How far back? -- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns -- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline) +### 1. 
Run Comprehensive Analysis + +**Automatically analyze ALL cost metrics to find problems:** +- **What to monitor**: ALL - Total costs, cost by product (API/Studio), cost by endpoint/model/org/process, cost efficiency, GPU utilization, billing type breakdown +- **By Segment**: ALL - LTX API (by endpoint/model/org) and LTX Studio (by process/workspace), by billing type, by GPU type +- **Time window**: Last 30 days with 7-day rolling baseline +- **Alert threshold**: Auto-detect problems using: + - Z-score > 2 (cost anomaly) + - DoD cost spike > 30% vs 7-day avg + - WoW cost increase > 20% + - Cost per request > 20% above baseline + - GPU utilization < 50% for 3+ days + - Failure cost > $500/day ### 2. Read Shared Knowledge diff --git a/agents/monitoring/enterprise/SKILL.md b/agents/monitoring/enterprise/SKILL.md index 2cf15ce..02a0b57 100644 --- a/agents/monitoring/enterprise/SKILL.md +++ b/agents/monitoring/enterprise/SKILL.md @@ -22,11 +22,17 @@ tags: [monitoring, enterprise, accounts, contracts] ## Steps -1. **Gather requirements from user:** - - Which enterprise org(s) to monitor (or all) - - Alert threshold based on historical usage of each account (e.g., "usage drops > 30% vs their baseline", "MAU below their 30-day average", "< 50% of contracted quota") - - Time window (weekly, monthly) - - Notification channel (Slack, email, Linear issue) +1. 
**Run Comprehensive Analysis:** + - **Monitor**: ALL enterprise orgs automatically + - **What to monitor**: ALL - Account usage (DAU/WAU/MAU), token consumption vs quota, user activation, video/image generations, power user engagement + - **By Segment**: ALL enterprise accounts (apply McCann split, exclude Lightricks/Popular Pays) + - **Time window**: Last 30 days with org-specific historical baselines + - **Alert threshold**: Auto-detect problems using: + - DoD/WoW usage drops > 30% vs org's baseline + - MAU below org's 30-day average + - Token consumption < 50% of contracted quota (underutilization) + - Power user drops > 20% within org + - Zero activity for 7+ consecutive days 2. **Read shared files:** - `shared/product-context.md` — LTX products, enterprise business model, user types diff --git a/agents/monitoring/revenue/SKILL.md b/agents/monitoring/revenue/SKILL.md index 07f45b4..9783752 100644 --- a/agents/monitoring/revenue/SKILL.md +++ b/agents/monitoring/revenue/SKILL.md @@ -24,11 +24,16 @@ tags: [monitoring, revenue, subscriptions] ## Steps -1. **Gather requirements from user:** - - Which revenue metric to monitor - - Alert threshold (e.g., "drop > 10%", "< $X per day") - - Time window (daily, weekly, monthly) - - Notification channel (Slack, email) +1. **Run Comprehensive Analysis:** + - **What to monitor**: ALL - MRR, ARR, daily revenue, new subscriptions, cancellations, churns, renewals, refunds, tier changes (upgrades/downgrades), enterprise contract values + - **By Segment**: ALL - By tier (Free/Lite/Standard/Pro/Enterprise), by plan type (self-serve/contract/pilot) + - **Time window**: Last 30 days with 7-day rolling baseline + - **Alert threshold**: Auto-detect problems using: + - Revenue drop > 15% DoD or > 10% WoW + - Churn rate > 5% or increase > 2x baseline + - Refund rate > 3% or spike > 50% DoD + - New subscriptions drop > 20% WoW + - Enterprise renewals < 90 days out 2. 
**Read shared files:**
   - `shared/bq-schema.md` — Subscription tables (ltxstudio_user_tiers_dates, etc.)

diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md
index 3a44db0..900f9ee 100644
--- a/agents/monitoring/usage/SKILL.md
+++ b/agents/monitoring/usage/SKILL.md
@@ -12,19 +12,34 @@ tags: [monitoring, usage, dau, generations, engagement]
 - "Monitor DAU/MAU/WAU"
 - "Alert on usage drops"
 - "Monitor generation volumes (image/video)"
-- "Track token consumption"
+- "Monitor token consumption"
 - "Detect usage anomalies or spikes"

 ## Steps

-### 1. Gather Requirements
+### 1. Run Comprehensive Analysis

-Ask the user:
-- **What to monitor**: DAU, generations, downloads, tokens, specific feature usage
-- **Scope**: All users, specific tier, paid vs free
-- **Time window**: Daily, weekly, monthly? How far back?
-- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, segmentation
-- **Alert threshold** (if setting up alerts): Absolute (< X users) or relative (drop > X% vs baseline)
+**Automatically analyze ALL metrics to find problems:**
+- **What to monitor**: DAU, Image Generations, Video Generations, Token Consumption
+- **By Segment**: ALL - Use full segmentation CTE from `shared/bq-schema.md` (Enterprise → Heavy → Paying → Free)
+- **Time window**: Last 30 days with 14-day rolling same-day-of-week baseline
+- **Alert method**: Compare today's value to rolling 14-day same-DOW average
+
+**Production Thresholds** (from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`):
+
+| Segment | DAU | Image Gens | Video Gens | Tokens | Notes |
+|---------|-----|------------|------------|--------|-------|
+| **Enterprise Contract** | -50% | -60% | Skip | -70% | Weekday only (Mon-Fri) |
+| **Enterprise Pilot** | -50% | -60% | Skip | -70% | Weekday only (Mon-Fri) |
+| **Heavy Users** | -25% | -30% | -30% | -25% | All days |
+| **Paying non-Enterprise** | -20% | -25% | -25% | -20% | All days |
+|
**Free** | -25% | -35% | -30% | -20% | All days |
+
+**Alert fires when:** `today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)`
+
+**Severity levels:**
+- **WARNING**: Drop exceeds threshold
+- **CRITICAL**: Drop exceeds 1.5x threshold

 ### 2. Read Shared Knowledge

@@ -41,52 +56,148 @@ Data Nuances:
 - Partitioned by `action_ts` (TIMESTAMP) — filter with `date(action_ts)` for date ranges
 - LT team already excluded (is_lt_team IS FALSE applied at table level)
 - Use `action_category = 'generations'` when counting video generations
-- User tier at action time: `griffin_tier_name_at_action`
+- By Segment: Use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) — Enterprise → Heavy → Paying → Free
+- **Enterprise usage patterns**: Strong weekday/weekend differences. For the Enterprise segment, use same-day-of-week comparisons (7-day lookback) instead of DoD on weekends. Calculate separate weekday/weekend baselines.

 ### 3. Write Monitoring SQL

-Based on user's request, write SQL to:
-- Query current usage metric
-- Calculate baseline (e.g., 7-day average, prior week)
-- Flag when deviation exceeds threshold
+Write SQL to query yesterday's metrics with a 14-day same-day-of-week baseline:
+
+```sql
+WITH daily_metrics AS (
+  SELECT
+    a.dt,
+    EXTRACT(DAYOFWEEK FROM a.dt) AS day_of_week,
+    CASE
+      WHEN e.lt_id IS NOT NULL AND e.account_type = 'Contract' THEN 'Enterprise Contract'
+      WHEN e.lt_id IS NOT NULL AND e.account_type = 'Pilot' THEN 'Enterprise Pilot'
+      WHEN h.lt_id IS NOT NULL THEN 'Heavy Users'
+      WHEN a.griffin_tier_name <> 'free' THEN 'Paying non-Enterprise'
+      ELSE 'Free'
+    END AS segment,
+    COUNT(DISTINCT a.lt_id) AS dau,
+    SUM(a.num_tokens_consumed) AS tokens,
+    SUM(a.num_generate_image) AS image_gens,
+    SUM(a.num_generate_video) AS video_gens
+  FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a
+  LEFT JOIN enterprise_users e ON a.lt_id = e.lt_id
+  LEFT JOIN heavy_users h ON a.lt_id = h.lt_id
+  WHERE a.dt >= DATE_SUB(CURRENT_DATE(),
INTERVAL 30 DAY)
+    AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+  GROUP BY a.dt, day_of_week, segment
+),
+aggregated_metrics AS (
+  SELECT
+    dt,
+    segment,
+    SUM(dau) AS dau,
+    SUM(tokens) AS tokens,
+    SUM(image_gens) AS image_gens,
+    SUM(video_gens) AS video_gens,
+    -- 14-day same-day-of-week baseline: the 2 prior same-DOW days cover the trailing 14 days
+    AVG(SUM(dau)) OVER (
+      PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt)
+      ORDER BY dt ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
+    ) AS dau_baseline_14d
+    -- ... repeat for other metrics
+  FROM daily_metrics
+  GROUP BY dt, segment
+)
+SELECT * FROM aggregated_metrics
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
+```

-**Common patterns:**
-- **Daily Usage**: COUNT(DISTINCT lt_id) by date
-- **By Segment**: Use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) — Enterprise → Heavy → Paying → Free
-- **Anomaly Detection**: Z-score calculation (metric - avg) / stddev
+**Key patterns:**
+- **14-day same-DOW baseline**: `AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) ...)`
+- **Segmentation**: Use full CTEs from `shared/bq-schema.md` (lines 441-516)
+- **Yesterday only**: Filter to `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)`

-### 4. Execute Query
+### 4. Execute Query and Run Monitoring

-Run query using:
+Run the query and save the output to CSV:
 ```bash
-bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
-
-"
+bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=csv \
+  < usage_monitoring.sql 2>/dev/null | grep -v "^Waiting" > usage_data_clean.csv
 ```

-### 5. Analyze Results
+Run the Python monitoring script:
+```bash
+python3 usage_monitoring_v2.py
+```

-**For usage trends:**
-- Compare current period vs baseline (7-day avg, prior week, prior month)
-- Calculate % change and flag significant shifts (>15-20%)
-- Check if changes are consistent across all tiers or specific to one segment
+The script:
+1. Reads the CSV data
+2. Checks each segment × metric against production thresholds
+3.
**Suppresses Enterprise alerts on weekends** (weekday-only) +4. **Skips Video Generations for Enterprise** (too volatile) +5. Flags **WARNING** (drop > threshold) or **CRITICAL** (drop > 1.5x threshold) -**For anomaly detection:** -- Flag days with Z-score > 2 (metric deviates > 2 std devs from rolling avg) -- Investigate root cause: product issue, marketing campaign, seasonal pattern +### 5. Analyze Results -**For segmentation:** -- Use full segmentation CTE from `shared/bq-schema.md` (Enterprise → Heavy → Paying → Free) -- Identify which segment is driving changes -- Validate with product team if unexpected patterns emerge +**When alerts fire:** +1. **Check severity**: CRITICAL requires immediate investigation, WARNING needs monitoring +2. **Identify segment**: Which user segment is affected? +3. **Check day-of-week**: Is this expected (e.g., weekend drop in Enterprise)? +4. **Investigate root cause**: + - For **Enterprise segments**: Drill down to organization level to see which clients drove the drop + - For **other segments**: Check tier distribution (Standard vs Pro vs Lite) + - Look for product issues, outages, or seasonal patterns + +**Root cause investigation SQL:** +```sql +-- For Enterprise: Org-level drill-down +SELECT + e.org, + e.account_type, + COUNT(DISTINCT a.lt_id) AS dau, + SUM(a.num_tokens_consumed) AS tokens, + SUM(a.num_generate_image) AS image_gens +FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a +JOIN enterprise_users e ON a.lt_id = e.lt_id +WHERE a.dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) +GROUP BY e.org, e.account_type +ORDER BY tokens DESC; +``` ### 6. 
Present Findings -Format results with: -- **Summary**: Key finding (e.g., "DAU dropped 25% yesterday") -- **Root cause**: What drove the change (e.g., "Pro tier users -40%, free users flat") -- **Breakdown**: Metrics by segment or day -- **Recommendation**: Action to take (investigate, alert team, monitor) +**Alert format:** + +``` +🔴 CRITICAL ALERTS (N): + • Segment - Metric + Current: X | Baseline: Y + Drop: Z% | Threshold: T% +``` + +**Root cause format:** + +For **Enterprise segments**, show client-level details: +``` +Enterprise Contract - Tokens Drop + +Clients Driving Drop: +- Novig: 1,914 tokens → 0 (-100%) +- Miroma: 27K tokens → 0 (-100%) + +Clients Driving Growth: +- McCann_Paris: 19K → 32K (+68%) + +Net Result: Only 1 out of 4 contract accounts active +``` + +For **other segments**, show summary: +``` +Heavy Users - Tokens Drop + +Standard tier users consuming 36% fewer tokens per user +(engagement drop, not churn) +``` + +**Recommended Actions:** +- **Immediate**: Contact account managers for stopped clients +- **Urgent**: Investigate large drops (>70%) +- **Monitor**: Track new/growing accounts for stability ### 7. 
Set Up Alert (if requested) @@ -107,10 +218,18 @@ See `references/query-templates.md` for production-ready SQL queries: | File | Read when | |------|-----------| +| `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` | **CRITICAL** - Production thresholds per segment (60-day analysis) | | `shared/bq-schema.md` | Table schema, segmentation queries, user tier columns | -| `shared/event-registry.yaml` | Finding event names for specific features | | `shared/metric-standards.md` | DAU/WAU/MAU definitions, generation metrics, usage patterns | -| `references/query-templates.md` | Production-ready SQL queries for usage monitoring | +| `shared/event-registry.yaml` | Finding event names for specific features | + +## Production Scripts + +| Script | Purpose | +|--------|---------| +| `usage_monitoring.sql` | BigQuery query to get yesterday's metrics with baselines | +| `usage_monitoring_v2.py` | Python script that checks thresholds and generates alerts | +| `investigate_root_cause.sql` | Organization-level drill-down for Enterprise segments | ## Rules @@ -139,17 +258,24 @@ See `references/query-templates.md` for production-ready SQL queries: ### Alerts -- **DO** set thresholds based on historical baseline, not absolute values -- **DO** include segment breakdown in alerts (which segment drove the change) -- **DO NOT** alert on day-to-day noise (< 10% variance is normal) +- **DO** use production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` +- **DO** suppress Enterprise alerts on weekends (weekday-only alerting) +- **DO** skip Video Generation alerts for Enterprise (too volatile, CV > 100%) +- **DO** include segment breakdown and root cause in alerts +- **DO** flag CRITICAL (drop > 1.5x threshold) vs WARNING (drop > threshold) +- **DO** investigate at organization level for Enterprise, tier level for others +- **DO NOT** use absolute thresholds — always compare to rolling 14-day 
same-DOW baseline +- **DO NOT** alert on expected patterns (e.g., Enterprise weekend drops) ## Anti-Patterns | Anti-Pattern | Why | Do this instead | |--------------|-----|-----------------| -| Counting video gens without `action_category = 'generations'` | `generate_video` appears in multiple categories | Always add `AND action_category = 'generations'` | -| Missing `date(action_ts)` filter | Partition is on TIMESTAMP, not DATE | Wrap in `date()` for date comparisons | -| Using simplified segmentation | Segmentation hierarchy must be respected | Use full segmentation CTE from `shared/bq-schema.md` (Enterprise → Heavy → Paying → Free) | -| Not using `NULLIF` in SAFE_DIVIDE denominators | STDDEV can be zero, causing division errors | Use `NULLIF(std_users_14d, 0)` | +| Alerting on Enterprise weekends | Weekend DAU is 6-7 (18% of weekday), too noisy | Skip Enterprise alerts on Sat/Sun | +| Alerting on Enterprise video gens | CV > 100%, single-user dominated | Skip video gen alerts for Enterprise | +| Using generic thresholds across segments | Each segment has different volatility | Use segment-specific thresholds from production file | +| Comparing Monday to Sunday | Day-of-week effects are huge | Use same-DOW baseline (Mon vs last Mon) | +| Simplified segmentation | Segmentation hierarchy must be respected | Use full segmentation CTE from `shared/bq-schema.md` | +| Alerting on 1-DAU segments | Single-user noise dominates signal | Production thresholds already account for this | +| Using Z-scores on small segments | Need 30+ data points for stable statistics | Use simple % drop vs baseline for Enterprise | | Filtering `is_lt_team IS FALSE` | Already filtered at table level | No filter needed | -| Comparing to absolute thresholds | Normal usage varies by season/day-of-week | Always compare vs baseline (7-day avg, prior week) | diff --git a/agents/monitoring/usage/investigate_root_cause.sql b/agents/monitoring/usage/investigate_root_cause.sql new file mode 100644 index 
0000000..8cf6506 --- /dev/null +++ b/agents/monitoring/usage/investigate_root_cause.sql @@ -0,0 +1,85 @@ +-- Root Cause Investigation - Organization-Level Drill-Down +-- For Enterprise Contract and Pilot alerts + +WITH ent_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' + ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) + END AS org + FROM `ltx-dwh-prod-processed.web.ltxstudio_users` + WHERE is_enterprise_user + AND current_customer_plan_type IN ('contract', 'pilot') + AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None') +), +enterprise_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN org + ELSE CONCAT(org, ' Pilot') + END AS org, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN 'Contract' + ELSE 'Pilot' + END AS account_type + FROM ent_users + WHERE org NOT IN ('Lightricks', 'Popular Pays', 'None') +), +org_metrics AS ( + SELECT + a.dt, + e.org, + e.account_type, + + -- Metrics + COUNT(DISTINCT a.lt_id) AS dau, + SUM(a.num_tokens_consumed) AS tokens, + SUM(a.num_generate_image) AS image_gens, + SUM(a.num_generate_video) AS video_gens, + SUM(a.num_generate_image) + SUM(a.num_generate_video) AS total_gens + + FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a + JOIN enterprise_users e ON a.lt_id = e.lt_id + WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) + AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + GROUP BY a.dt, e.org, e.account_type +) +SELECT + org, + account_type, + + -- Yesterday (March 7) + MAX(CASE 
WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN dau END) AS dau_yesterday, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN tokens END) AS tokens_yesterday, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN image_gens END) AS image_gens_yesterday, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN video_gens END) AS video_gens_yesterday, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN total_gens END) AS total_gens_yesterday, + + -- Same day last week (Feb 28) + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN dau END) AS dau_last_week, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN tokens END) AS tokens_last_week, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN image_gens END) AS image_gens_last_week, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN video_gens END) AS video_gens_last_week, + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN total_gens END) AS total_gens_last_week, + + -- Change % + SAFE_DIVIDE( + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN tokens END) - + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN tokens END), + NULLIF(MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN tokens END), 0) + ) * 100 AS tokens_wow_pct, + + SAFE_DIVIDE( + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) THEN total_gens END) - + MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN total_gens END), + NULLIF(MAX(CASE WHEN dt = DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY) THEN total_gens END), 0) + ) * 100 AS total_gens_wow_pct + +FROM org_metrics +GROUP BY org, account_type +ORDER BY account_type, tokens_yesterday DESC; diff --git a/agents/monitoring/usage/usage_monitoring.py b/agents/monitoring/usage/usage_monitoring.py new file mode 100644 index 0000000..315c678 --- /dev/null +++ b/agents/monitoring/usage/usage_monitoring.py @@ -0,0 
+1,216 @@ +#!/usr/bin/env python3 +""" +LTX Studio Usage Monitoring with Anomaly Scoring +Automatically detects problems across all segments and metrics +""" + +import numpy as np +from dataclasses import dataclass +from typing import List + + +@dataclass +class SignalResult: + name: str + raw_value: float + score: float # normalized 0–1 + fired: bool # did it cross its individual threshold? + description: str + + +@dataclass +class AnomalyResult: + anomaly_score: float # 0–1 composite + is_alert: bool + signals: List[SignalResult] + + def explain(self): + print(f"\n{'='*45}") + print(f" Anomaly Score : {self.anomaly_score:.2f} | Alert: {'🔴 YES' if self.is_alert else '🟢 NO'}") + print(f"{'='*45}") + for s in self.signals: + bar = "█" * int(s.score * 10) + "░" * (10 - int(s.score * 10)) + flag = "⚠️ " if s.fired else " " + print(f" {flag}{s.name:<28} [{bar}] {s.score:.2f}") + print(f" → {s.description}") + print(f"{'='*45}\n") + + +def score_zscore(z: float, fire_threshold=2.0, max_val=4.0) -> SignalResult: + score = min(abs(z) / max_val, 1.0) + return SignalResult( + name="Z-Score", + raw_value=z, + score=score, + fired=abs(z) > fire_threshold, + description=f"z={z:.2f} (threshold >{fire_threshold})" + ) + + +def score_dod(dod_pct: float, fire_threshold=20.0) -> SignalResult: + score = min(abs(dod_pct) / (fire_threshold * 2), 1.0) + direction = "▲" if dod_pct > 0 else "▼" + return SignalResult( + name="Day-over-Day Change", + raw_value=dod_pct, + score=score, + fired=abs(dod_pct) > fire_threshold, + description=f"{direction} {abs(dod_pct):.1f}% (threshold ±{fire_threshold}%)" + ) + + +def score_wow(wow_pct: float, fire_threshold=20.0, is_enterprise=True) -> SignalResult: + if not is_enterprise: + return SignalResult("Same-Day-of-Week Change", wow_pct, 0.0, False, "skipped (non-enterprise)") + score = min(abs(wow_pct) / (fire_threshold * 2), 1.0) + direction = "▲" if wow_pct > 0 else "▼" + return SignalResult( + name="Same-Day-of-Week Change", + raw_value=wow_pct, + 
score=score, + fired=abs(wow_pct) > fire_threshold, + description=f"{direction} {abs(wow_pct):.1f}% vs same day last week (threshold ±{fire_threshold}%)" + ) + + +def score_baseline(current: float, baseline: float, fire_threshold=0.85) -> SignalResult: + ratio = current / baseline if baseline > 0 else 1.0 + # only scores when below threshold, not above + shortfall = max(fire_threshold - ratio, 0) + score = min(shortfall / (1 - fire_threshold), 1.0) + return SignalResult( + name="14-Day Baseline Ratio", + raw_value=ratio, + score=score, + fired=ratio < fire_threshold, + description=f"current={current:.0f}, baseline={baseline:.0f}, ratio={ratio:.2f} (threshold <{fire_threshold})" + ) + + +@dataclass +class VolumeGates: + """ + Minimum thresholds that must be met before a signal is evaluated. + Prevents single-user noise from triggering alerts. + """ + min_dau: int = 10 # minimum active users for DAU-based alerts + min_tokens: int = 5_000 # minimum token volume + min_events: int = 50 # minimum events (gens, sessions, etc.) + min_baseline_days: int = 7 # need at least 7 days of history to use baseline + + +def compute_anomaly_score( + current_value: float, + baseline_14d: float, + historical_values: List[float], + dod_pct: float, + wow_pct: float, + is_enterprise: bool = True, + alert_threshold: float = 0.5, + weights: dict = None, + gates: VolumeGates = None, + # volume context so gates can make decisions + current_dau: int = None, + current_tokens: int = None, + current_events: int = None, +) -> AnomalyResult: + """ + Computes a composite anomaly score from multiple signals. + + Args: + current_value: Today's metric value + baseline_14d: 14-day rolling average + historical_values: Recent values used to calculate z-score (e.g. 
last 30 days) + dod_pct: Day-over-day % change + wow_pct: Same-day-of-week % change + is_enterprise: Whether to include WoW check + alert_threshold: Score above which we fire an alert (0–1) + weights: Optional custom weights per signal + gates: Minimum volume thresholds — signals skipped if not met + current_dau: Today's DAU (for gate evaluation) + current_tokens: Today's token count (for gate evaluation) + current_events: Today's event count (for gate evaluation) + + Returns: + AnomalyResult with composite score, alert flag, and per-signal breakdown + """ + if weights is None: + weights = {"zscore": 0.35, "dod": 0.20, "wow": 0.20, "baseline": 0.25} + if gates is None: + gates = VolumeGates() + + # ── Volume gate checks (metric-specific) ───────────────────────────────── + # For DAU metrics: check DAU + # For token metrics: check tokens OR DAU (if segment has users, tokens matter even if low) + # For event metrics: check events OR DAU + + dau_ok = current_dau is None or current_dau >= gates.min_dau + token_ok = current_tokens is None or current_tokens >= gates.min_tokens + event_ok = current_events is None or current_events >= gates.min_events + + # More lenient: if there are enough users, we care about their activity even if it's low + volume_ok = dau_ok or token_ok or event_ok + + def gated(signal_fn, *args, reason="low volume", **kwargs) -> SignalResult: + """Run signal only if volume gates pass, otherwise return a skipped result.""" + if not volume_ok: + s = signal_fn(*args, **kwargs) + return SignalResult(s.name, s.raw_value, 0.0, False, f"skipped ({reason})") + return signal_fn(*args, **kwargs) + + # ── Z-score needs enough history ───────────────────────────────────────── + history_ok = len(historical_values) >= gates.min_baseline_days + mean = np.mean(historical_values) if history_ok else 0 + std = np.std(historical_values) if history_ok else 0 + z = (current_value - mean) / std if (history_ok and std > 0) else 0.0 + + signals = [ + gated(score_zscore, z, 
reason="low volume"),
+        gated(score_dod, dod_pct),
+        gated(score_wow, wow_pct, is_enterprise=is_enterprise),
+        gated(score_baseline, current_value, baseline_14d),
+    ]
+
+    # Z-score is meaningless without enough history: mark it skipped so it
+    # drops out of the weighted average instead of diluting the score with 0.
+    if volume_ok and not history_ok:
+        s = signals[0]
+        signals[0] = SignalResult(s.name, s.raw_value, 0.0, False, "skipped (insufficient history)")
+
+    # ── Weighted average (skip disabled/gated signals) ───────────────────────
+    active = [(s, w) for s, w in zip(signals, weights.values()) if "skipped" not in s.description]
+    total_weight = sum(w for _, w in active)
+    anomaly_score = sum(s.score * w for s, w in active) / total_weight if total_weight > 0 else 0.0
+
+    return AnomalyResult(
+        anomaly_score=round(anomaly_score, 3),
+        is_alert=anomaly_score >= alert_threshold,
+        signals=signals
+    )
+
+
+# ── Example usage ────────────────────────────────────────────────────────────
+
+if __name__ == "__main__":
+    np.random.seed(42)
+    gates = VolumeGates(min_dau=10, min_tokens=5_000, min_events=50)
+
+    # ── Case 1: Novig-style — 1 DAU, tiny volume → should NOT alert ──────────
+    print("Case 1: Novig (1 DAU, zero activity → but too small to alert)")
+    result = compute_anomaly_score(
+        current_value=0, baseline_14d=1914,
+        historical_values=[1914] * 14,
+        dod_pct=-100.0, wow_pct=-100.0,
+        is_enterprise=True,
+        current_dau=1, current_tokens=0, current_events=0,
+        gates=gates
+    )
+    result.explain()
+
+    # ── Case 2: Jazz Side Pilot — real drop on meaningful volume → should alert
+    print("Case 2: Jazz Side Pilot (1,044 tokens vs 29,664 last week — real signal)")
+    history2 = list(np.random.normal(loc=25000, scale=2000, size=30))
+    result2 = compute_anomaly_score(
+        current_value=1044, baseline_14d=25000,
+        historical_values=history2,
+        dod_pct=-96.0, wow_pct=-96.0,
+        is_enterprise=True,
+        current_dau=15, current_tokens=1044, current_events=6,
+        gates=gates
+    )
+    result2.explain()
diff --git a/agents/monitoring/usage/usage_monitoring.sql b/agents/monitoring/usage/usage_monitoring.sql
new file mode 100644
index 0000000..d46db3f
--- /dev/null
+++ b/agents/monitoring/usage/usage_monitoring.sql
@@ -0,0 +1,179 @@
+-- LTX Studio Usage Monitoring Query
+-- Builds 30 days of history and returns yesterday's values for anomaly detection across all segments
+
+WITH ent_users AS (
+  SELECT DISTINCT
+    lt_id,
+    CASE
+      WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY'
+      WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris'
+      ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name)
+    END AS org
+  FROM `ltx-dwh-prod-processed.web.ltxstudio_users`
+  WHERE is_enterprise_user
+    AND current_customer_plan_type IN ('contract', 'pilot')
+    AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None')
+),
+enterprise_users AS (
+  SELECT DISTINCT
+    lt_id,
+    CASE
+      WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris')
+        THEN org
+      ELSE CONCAT(org, ' Pilot')
+    END AS org,
+    CASE
+      WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris')
+        THEN 'Contract'
+      ELSE 'Pilot'
+    END AS account_type
+  FROM ent_users
+  WHERE org NOT IN ('Lightricks', 'Popular Pays', 'None')
+),
+heavy_users AS (
+  SELECT DISTINCT
+    u.lt_id,
+    u.griffin_tier_name
+  FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` u
+  WHERE u.px_first_purchase_ts IS NOT NULL
+    AND u.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
+    AND u.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+    AND DATE(u.px_first_purchase_ts) < DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
+    AND u.griffin_tier_name NOT IN ('free', 'custom_plan')
+    AND LOWER(u.email_domain) NOT LIKE '%lightricks%'
+    AND u.num_tokens_consumed > 0
+  GROUP BY u.lt_id, u.griffin_tier_name
+  HAVING COUNT(DISTINCT DATE_TRUNC(u.dt, WEEK)) >= 4
+),
+daily_metrics AS (
+  SELECT
+    a.dt,
+    EXTRACT(DAYOFWEEK FROM a.dt) AS day_of_week,
+
+    -- Segment assignment
+    CASE
+      WHEN e.lt_id IS NOT NULL AND e.account_type = 'Contract' THEN 'Enterprise Contract'
+      WHEN e.lt_id IS NOT NULL AND e.account_type = 'Pilot' THEN 'Enterprise Pilot'
+      WHEN h.lt_id IS NOT NULL THEN 'Heavy Users'
+      WHEN a.griffin_tier_name <> 'free' THEN 'Paying non-Enterprise'
+      ELSE 'Free'
+    END AS segment,
+
+    -- Organization for Enterprise
+    e.org,
+
+    -- Metrics
+    COUNT(DISTINCT a.lt_id) AS dau,
+    SUM(a.num_tokens_consumed) AS tokens,
+    SUM(a.num_generate_image) AS image_gens,
+    SUM(a.num_generate_video) AS video_gens
+
+  FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a
+  LEFT JOIN enterprise_users e ON a.lt_id = e.lt_id
+  LEFT JOIN heavy_users h ON a.lt_id = h.lt_id
+  WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
+    AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+  GROUP BY a.dt, day_of_week, segment, e.org
+),
+aggregated_metrics AS (
+  SELECT
+    dt,
+    day_of_week,
+    segment,
+
+    -- Aggregate metrics
+    SUM(dau) AS dau,
+    SUM(tokens) AS tokens,
+    SUM(image_gens) AS image_gens,
+    SUM(video_gens) AS video_gens,
+
+    -- Previous day
+    LAG(SUM(dau), 1) OVER (PARTITION BY segment ORDER BY dt) AS dau_yesterday,
+    LAG(SUM(tokens), 1) OVER (PARTITION BY segment ORDER BY dt) AS tokens_yesterday,
+    LAG(SUM(image_gens), 1) OVER (PARTITION BY segment ORDER BY dt) AS image_gens_yesterday,
+    LAG(SUM(video_gens), 1) OVER (PARTITION BY segment ORDER BY dt) AS video_gens_yesterday,
+
+    -- Same day last week
+    LAG(SUM(dau), 7) OVER (PARTITION BY segment ORDER BY dt) AS dau_last_week,
+    LAG(SUM(tokens), 7) OVER (PARTITION BY segment ORDER BY dt) AS tokens_last_week,
+    LAG(SUM(image_gens), 7) OVER (PARTITION BY segment ORDER BY dt) AS image_gens_last_week,
+    LAG(SUM(video_gens), 7) OVER (PARTITION BY segment ORDER BY dt) AS video_gens_last_week,
+
+    -- 14-day baseline (separate weekday vs weekend buckets; handles Enterprise weekday patterns)
+    AVG(SUM(dau)) OVER (
+      PARTITION BY segment,
+        CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END
+      ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
+    ) AS dau_baseline_14d,
+    AVG(SUM(tokens)) OVER (
+      PARTITION BY segment,
+        CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END
+      ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
+    ) AS tokens_baseline_14d,
+    AVG(SUM(image_gens)) OVER (
+      PARTITION BY segment,
+        CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END
+      ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
+    ) AS image_gens_baseline_14d,
+    AVG(SUM(video_gens)) OVER (
+      PARTITION BY segment,
+        CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END
+      ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING
+    ) AS video_gens_baseline_14d
+
+  FROM daily_metrics
+  GROUP BY dt, day_of_week, segment
+)
+SELECT
+  dt,
+  segment,
+
+  -- Current values
+  dau,
+  tokens,
+  image_gens,
+  video_gens,
+
+  -- DoD comparison
+  dau_yesterday,
+  tokens_yesterday,
+  image_gens_yesterday,
+  video_gens_yesterday,
+
+  SAFE_DIVIDE(dau - dau_yesterday, NULLIF(dau_yesterday, 0)) * 100 AS dau_dod_pct,
+  SAFE_DIVIDE(tokens - tokens_yesterday, NULLIF(tokens_yesterday, 0)) * 100 AS tokens_dod_pct,
+  SAFE_DIVIDE(image_gens - image_gens_yesterday, NULLIF(image_gens_yesterday, 0)) * 100 AS image_gens_dod_pct,
+  SAFE_DIVIDE(video_gens - video_gens_yesterday, NULLIF(video_gens_yesterday, 0)) * 100 AS video_gens_dod_pct,
+
+  -- WoW comparison
+  dau_last_week,
+  tokens_last_week,
+  image_gens_last_week,
+  video_gens_last_week,
+
+  SAFE_DIVIDE(dau - dau_last_week, NULLIF(dau_last_week, 0)) * 100 AS dau_wow_pct,
+  SAFE_DIVIDE(tokens - tokens_last_week, NULLIF(tokens_last_week, 0)) * 100 AS tokens_wow_pct,
+  SAFE_DIVIDE(image_gens - image_gens_last_week, NULLIF(image_gens_last_week, 0)) * 100 AS image_gens_wow_pct,
+  SAFE_DIVIDE(video_gens - video_gens_last_week, NULLIF(video_gens_last_week, 0)) * 100 AS video_gens_wow_pct,
+
+  -- Baselines
+  dau_baseline_14d,
+  tokens_baseline_14d,
+  image_gens_baseline_14d,
+  video_gens_baseline_14d,
+
+  SAFE_DIVIDE(dau - dau_baseline_14d, NULLIF(dau_baseline_14d, 0)) * 100 AS dau_vs_baseline_pct,
+  SAFE_DIVIDE(tokens - tokens_baseline_14d, NULLIF(tokens_baseline_14d, 0)) * 100 AS tokens_vs_baseline_pct,
+  SAFE_DIVIDE(image_gens - image_gens_baseline_14d, NULLIF(image_gens_baseline_14d, 0)) * 100 AS image_gens_vs_baseline_pct,
+  SAFE_DIVIDE(video_gens - video_gens_baseline_14d, NULLIF(video_gens_baseline_14d, 0)) * 100 AS video_gens_vs_baseline_pct
+
+FROM aggregated_metrics
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)  -- Yesterday only
+ORDER BY
+  CASE segment
+    WHEN 'Enterprise Contract' THEN 1
+    WHEN 'Enterprise Pilot' THEN 2
+    WHEN 'Heavy Users' THEN 3
+    WHEN 'Paying non-Enterprise' THEN 4
+    WHEN 'Free' THEN 5
+  END;
diff --git a/agents/monitoring/usage/usage_monitoring_v2.py b/agents/monitoring/usage/usage_monitoring_v2.py
new file mode 100644
index 0000000..a7f88e9
--- /dev/null
+++ b/agents/monitoring/usage/usage_monitoring_v2.py
@@ -0,0 +1,234 @@
+#!/usr/bin/env python3
+"""
+LTX Studio Usage Monitoring - Production Thresholds
+Based on /Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md
+"""
+
+import csv
+import sys
+from datetime import datetime
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class SegmentThresholds:
+    """Threshold configuration per segment"""
+    name: str
+    dau_pct: float
+    image_gens_pct: float
+    video_gens_pct: Optional[float]  # None = skip alerting
+    tokens_pct: float
+    weekday_only: bool  # Only alert on Mon-Fri
+
+
+# Production thresholds from segment_alerting_thresholds.md
+THRESHOLDS = {
+    'Enterprise Contract': SegmentThresholds(
+        name='Enterprise Contract',
+        dau_pct=0.50,         # -50%
+        image_gens_pct=0.60,  # -60%
+        video_gens_pct=None,  # Skip
+        tokens_pct=0.70,      # -70%
+        weekday_only=True
+    ),
+    'Enterprise Pilot': SegmentThresholds(
+        name='Enterprise Pilot',
+        dau_pct=0.50,
+        image_gens_pct=0.60,
+        video_gens_pct=None,  # Skip
+        tokens_pct=0.70,
+        weekday_only=True
+    ),
+    'Free': SegmentThresholds(
+        name='Free',
+        dau_pct=0.25,
+        image_gens_pct=0.35,
+        video_gens_pct=0.30,
+        tokens_pct=0.20,
+        weekday_only=False
+    ),
+    'Heavy Users': SegmentThresholds(
+        name='Heavy Users',
+        dau_pct=0.25,
+        image_gens_pct=0.30,
+        video_gens_pct=0.30,
+        tokens_pct=0.25,
+        weekday_only=False
+    ),
+    'Paying non-Enterprise': SegmentThresholds(
+        name='Paying non-Enterprise',
+        dau_pct=0.20,
+        image_gens_pct=0.25,
+        video_gens_pct=0.25,
+        tokens_pct=0.20,
+        weekday_only=False
+    ),
+}
+
+
+@dataclass
+class Alert:
+    segment: str
+    metric: str
+    current_value: float
+    baseline_14d: float
+    drop_pct: float
+    threshold_pct: float
+    severity: str  # 'WARNING' or 'CRITICAL'
+
+
+def parse_float(val):
+    """Safely parse float from CSV, handle empty strings"""
+    if val == '' or val is None:
+        return 0.0
+    try:
+        return float(val)
+    except ValueError:
+        return 0.0
+
+
+def is_weekend(date_str: str) -> bool:
+    """Check if date is Saturday or Sunday"""
+    dt = datetime.strptime(date_str, '%Y-%m-%d')
+    return dt.weekday() >= 5  # 5=Saturday, 6=Sunday
+
+
+def check_metric(
+    segment: str,
+    metric: str,
+    current: float,
+    baseline: float,
+    threshold_pct: float,
+    date_str: str
+) -> Optional[Alert]:
+    """
+    Check if metric triggers alert.
+
+    Alert logic: today_value < rolling_14d_baseline * (1 - threshold_pct)
+    """
+    if baseline == 0:
+        return None
+
+    ratio = current / baseline
+    drop_pct = (1 - ratio) * 100  # Positive = drop, Negative = growth
+
+    # Alert if drop exceeds threshold
+    if ratio < (1 - threshold_pct):
+        # Determine severity
+        critical_threshold = threshold_pct * 1.5  # 1.5x threshold = CRITICAL
+        severity = 'CRITICAL' if ratio < (1 - critical_threshold) else 'WARNING'
+
+        return Alert(
+            segment=segment,
+            metric=metric,
+            current_value=current,
+            baseline_14d=baseline,
+            drop_pct=drop_pct,
+            threshold_pct=threshold_pct * 100,
+            severity=severity
+        )
+
+    return None
+
+
+def main():
+    # Read data from CSV
+    with open('/Users/dbeer/usage_data_clean.csv', 'r') as f:
+        reader = csv.DictReader(f)
+        rows = list(reader)
+
+    if not rows:
+        print("No data found!")
+        sys.exit(1)
+
+    date_str = rows[0]['dt']
+    is_weekend_day = is_weekend(date_str)
+
+    print("=" * 80)
+    print(f"  LTX STUDIO USAGE MONITORING - {date_str}")
+    print(f"  Day: {'WEEKEND' if is_weekend_day else 'WEEKDAY'}")
+    print("=" * 80)
+
+    alerts = []
+
+    for row in rows:
+        segment = row['segment']
+
+        if segment not in THRESHOLDS:
+            continue
+
+        thresholds = THRESHOLDS[segment]
+
+        # Skip weekend alerts for Enterprise segments
+        if thresholds.weekday_only and is_weekend_day:
+            continue
+
+        # Check each metric
+        metrics_to_check = {
+            'dau': thresholds.dau_pct,
+            'image_gens': thresholds.image_gens_pct,
+            'video_gens': thresholds.video_gens_pct,
+            'tokens': thresholds.tokens_pct,
+        }
+
+        for metric, threshold in metrics_to_check.items():
+            if threshold is None:  # Skip this metric
+                continue
+
+            current = parse_float(row[metric])
+            baseline = parse_float(row[f'{metric}_baseline_14d'])
+
+            # Skip if no baseline or current value
+            if baseline == 0 or current == 0:
+                continue
+
+            alert = check_metric(
+                segment=segment,
+                metric=metric,
+                current=current,
+                baseline=baseline,
+                threshold_pct=threshold,
+                date_str=date_str
+            )
+
+            if alert:
+                alerts.append(alert)
+
+    # Sort alerts by severity (CRITICAL first) then by drop %
+    alerts.sort(key=lambda x: (0 if x.severity == 'CRITICAL' else 1, -x.drop_pct))
+
+    if not alerts:
+        print("\n🟢 NO ALERTS - All metrics within thresholds\n")
+        return
+
+    print(f"\n🔴 {len(alerts)} ALERTS DETECTED\n")
+    print("=" * 80)
+
+    # Group by severity
+    critical_alerts = [a for a in alerts if a.severity == 'CRITICAL']
+    warning_alerts = [a for a in alerts if a.severity == 'WARNING']
+
+    if critical_alerts:
+        print(f"\n🔴 CRITICAL ALERTS ({len(critical_alerts)}):\n")
+        for alert in critical_alerts:
+            print(f"  • {alert.segment} - {alert.metric.replace('_', ' ').title()}")
+            print(f"    Current: {alert.current_value:,.0f} | Baseline: {alert.baseline_14d:,.0f}")
+            print(f"    Drop: {alert.drop_pct:.1f}% | Threshold: {alert.threshold_pct:.0f}%")
+            print()
+
+    if warning_alerts:
+        print(f"\n⚠️  WARNING ALERTS ({len(warning_alerts)}):\n")
+        for alert in warning_alerts:
+            print(f"  • {alert.segment} - {alert.metric.replace('_', ' ').title()}")
+            print(f"    Current: {alert.current_value:,.0f} | Baseline: {alert.baseline_14d:,.0f}")
+            print(f"    Drop: {alert.drop_pct:.1f}% | Threshold: {alert.threshold_pct:.0f}%")
+            print()
+
+    print("=" * 80)
+    print(f"Total: {len(critical_alerts)} CRITICAL, {len(warning_alerts)} WARNING")
+    print("=" * 80)
+
+
+if __name__ == "__main__":
+    main()

From 732ec762278975bbaeec11a628f4a5084a3722d9 Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:14:11 +0200
Subject: [PATCH 02/20] chore: remove experimental composite anomaly scoring

Keep only production monitoring implementation (usage_monitoring_v2.py)

---
 agents/monitoring/usage/usage_monitoring.py | 216 --------------------
 1 file changed, 216 deletions(-)
 delete mode 100644 agents/monitoring/usage/usage_monitoring.py

diff --git a/agents/monitoring/usage/usage_monitoring.py b/agents/monitoring/usage/usage_monitoring.py
deleted file mode 100644
index 315c678..0000000
--- a/agents/monitoring/usage/usage_monitoring.py
+++ /dev/null
@@ -1,216 +0,0 @@
-#!/usr/bin/env python3
-"""
-LTX Studio Usage Monitoring with Anomaly Scoring
-Automatically detects problems across all segments and metrics
-"""
-
-import numpy as np
-from dataclasses import dataclass
-from typing import List
-
-
-@dataclass
-class SignalResult:
-    name: str
-    raw_value: float
-    score: float  # normalized 0–1
-    fired: bool  # did it cross its individual threshold?
-    description: str
-
-
-@dataclass
-class AnomalyResult:
-    anomaly_score: float  # 0–1 composite
-    is_alert: bool
-    signals: List[SignalResult]
-
-    def explain(self):
-        print(f"\n{'='*45}")
-        print(f" Anomaly Score : {self.anomaly_score:.2f} | Alert: {'🔴 YES' if self.is_alert else '🟢 NO'}")
-        print(f"{'='*45}")
-        for s in self.signals:
-            bar = "█" * int(s.score * 10) + "░" * (10 - int(s.score * 10))
-            flag = "⚠️ " if s.fired else "   "
-            print(f" {flag}{s.name:<28} [{bar}] {s.score:.2f}")
-            print(f"     → {s.description}")
-        print(f"{'='*45}\n")
-
-
-def score_zscore(z: float, fire_threshold=2.0, max_val=4.0) -> SignalResult:
-    score = min(abs(z) / max_val, 1.0)
-    return SignalResult(
-        name="Z-Score",
-        raw_value=z,
-        score=score,
-        fired=abs(z) > fire_threshold,
-        description=f"z={z:.2f} (threshold >{fire_threshold})"
-    )
-
-
-def score_dod(dod_pct: float, fire_threshold=20.0) -> SignalResult:
-    score = min(abs(dod_pct) / (fire_threshold * 2), 1.0)
-    direction = "▲" if dod_pct > 0 else "▼"
-    return SignalResult(
-        name="Day-over-Day Change",
-        raw_value=dod_pct,
-        score=score,
-        fired=abs(dod_pct) > fire_threshold,
-        description=f"{direction} {abs(dod_pct):.1f}% (threshold ±{fire_threshold}%)"
-    )
-
-
-def score_wow(wow_pct: float, fire_threshold=20.0, is_enterprise=True) -> SignalResult:
-    if not is_enterprise:
-        return SignalResult("Same-Day-of-Week Change", wow_pct, 0.0, False, "skipped (non-enterprise)")
-    score = min(abs(wow_pct) / (fire_threshold * 2), 1.0)
-    direction = "▲" if wow_pct > 0 else "▼"
-    return SignalResult(
-        name="Same-Day-of-Week Change",
-        raw_value=wow_pct,
-        score=score,
-        fired=abs(wow_pct) > fire_threshold,
-        description=f"{direction} {abs(wow_pct):.1f}% vs same day last week (threshold ±{fire_threshold}%)"
-    )
-
-
-def score_baseline(current: float, baseline: float, fire_threshold=0.85) -> SignalResult:
-    ratio = current / baseline if baseline > 0 else 1.0
-    # only scores when below threshold, not above
-    shortfall = max(fire_threshold - ratio, 0)
-    score = min(shortfall / (1 - fire_threshold), 1.0)
-    return SignalResult(
-        name="14-Day Baseline Ratio",
-        raw_value=ratio,
-        score=score,
-        fired=ratio < fire_threshold,
-        description=f"current={current:.0f}, baseline={baseline:.0f}, ratio={ratio:.2f} (threshold <{fire_threshold})"
-    )
-
-
-@dataclass
-class VolumeGates:
-    """
-    Minimum thresholds that must be met before a signal is evaluated.
-    Prevents single-user noise from triggering alerts.
-    """
-    min_dau: int = 10  # minimum active users for DAU-based alerts
-    min_tokens: int = 5_000  # minimum token volume
-    min_events: int = 50  # minimum events (gens, sessions, etc.)
-    min_baseline_days: int = 7  # need at least 7 days of history to use baseline
-
-
-def compute_anomaly_score(
-    current_value: float,
-    baseline_14d: float,
-    historical_values: List[float],
-    dod_pct: float,
-    wow_pct: float,
-    is_enterprise: bool = True,
-    alert_threshold: float = 0.5,
-    weights: dict = None,
-    gates: VolumeGates = None,
-    # volume context so gates can make decisions
-    current_dau: int = None,
-    current_tokens: int = None,
-    current_events: int = None,
-) -> AnomalyResult:
-    """
-    Computes a composite anomaly score from multiple signals.
-
-    Args:
-        current_value: Today's metric value
-        baseline_14d: 14-day rolling average
-        historical_values: Recent values used to calculate z-score (e.g. last 30 days)
-        dod_pct: Day-over-day % change
-        wow_pct: Same-day-of-week % change
-        is_enterprise: Whether to include WoW check
-        alert_threshold: Score above which we fire an alert (0–1)
-        weights: Optional custom weights per signal
-        gates: Minimum volume thresholds — signals skipped if not met
-        current_dau: Today's DAU (for gate evaluation)
-        current_tokens: Today's token count (for gate evaluation)
-        current_events: Today's event count (for gate evaluation)
-
-    Returns:
-        AnomalyResult with composite score, alert flag, and per-signal breakdown
-    """
-    if weights is None:
-        weights = {"zscore": 0.35, "dod": 0.20, "wow": 0.20, "baseline": 0.25}
-    if gates is None:
-        gates = VolumeGates()
-
-    # ── Volume gate checks (metric-specific) ─────────────────────────────────
-    # For DAU metrics: check DAU
-    # For token metrics: check tokens OR DAU (if segment has users, tokens matter even if low)
-    # For event metrics: check events OR DAU
-
-    dau_ok = current_dau is None or current_dau >= gates.min_dau
-    token_ok = current_tokens is None or current_tokens >= gates.min_tokens
-    event_ok = current_events is None or current_events >= gates.min_events
-
-    # More lenient: if there are enough users, we care about their activity even if it's low
-    volume_ok = dau_ok or token_ok or event_ok
-
-    def gated(signal_fn, *args, reason="low volume", **kwargs) -> SignalResult:
-        """Run signal only if volume gates pass, otherwise return a skipped result."""
-        if not volume_ok:
-            s = signal_fn(*args, **kwargs)
-            return SignalResult(s.name, s.raw_value, 0.0, False, f"skipped ({reason})")
-        return signal_fn(*args, **kwargs)
-
-    # ── Z-score needs enough history ─────────────────────────────────────────
-    history_ok = len(historical_values) >= gates.min_baseline_days
-    mean = np.mean(historical_values) if history_ok else 0
-    std = np.std(historical_values) if history_ok else 0
-    z = (current_value - mean) / std if (history_ok and std > 0) else 0.0
-
-    signals = [
-        gated(score_zscore, z,
-              reason="low volume" if not volume_ok else "insufficient history" if not history_ok else ""),
-        gated(score_dod, dod_pct),
-        gated(score_wow, wow_pct, is_enterprise=is_enterprise),
-        gated(score_baseline, current_value, baseline_14d),
-    ]
-
-    # ── Weighted average (skip disabled/gated signals) ───────────────────────
-    active = [(s, w) for s, w in zip(signals, weights.values()) if "skipped" not in s.description]
-    total_weight = sum(w for _, w in active)
-    anomaly_score = sum(s.score * w for s, w in active) / total_weight if total_weight > 0 else 0.0
-
-    return AnomalyResult(
-        anomaly_score=round(anomaly_score, 3),
-        is_alert=anomaly_score >= alert_threshold,
-        signals=signals
-    )
-
-
-# ── Example usage ────────────────────────────────────────────────────────────
-
-if __name__ == "__main__":
-    np.random.seed(42)
-    history = list(np.random.normal(loc=1000, scale=50, size=30))
-    gates = VolumeGates(min_dau=10, min_tokens=5_000, min_events=50)
-
-    # ── Case 1: Novig-style — 1 DAU, tiny volume → should NOT alert ──────────
-    print("Case 1: Novig (1 DAU, zero activity → but too small to alert)")
-    result = compute_anomaly_score(
-        current_value=0, baseline_14d=1914,
-        historical_values=[1914]*14,
-        dod_pct=-100.0, wow_pct=-100.0,
-        is_enterprise=True,
-        current_dau=1, current_tokens=0, current_events=0,
-        gates=gates
-    )
-    result.explain()
-
-    # ── Case 2: Jazz Side Pilot — real drop on meaningful volume → should alert
-    print("Case 2: Jazz Side Pilot (1,044 tokens vs 29,664 last week — real signal)")
-    history2 = list(np.random.normal(loc=25000, scale=2000, size=30))
-    result2 = compute_anomaly_score(
-        current_value=1044, baseline_14d=25000,
-        historical_values=history2,
-        dod_pct=-96.0, wow_pct=-96.0,
-        is_enterprise=True,
-        current_dau=15, current_tokens=1044, current_events=6,
-        gates=gates
-    )
-    result2.explain()

From dda1743bd437fea4e6326df472bd5d042a79164e Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:29:22 +0200
Subject: [PATCH 03/20] feat: add production GPU cost thresholds to BE cost monitoring

Reference data-driven thresholds from 60-day analysis:
- Tier 1 High Priority: Idle cost spike, Inference cost spike, Idle-to-inference ratio
- Tier 2 Medium Priority: Failure rate, Cost-per-request drift, DoD cost jump
- Tier 3 Low Priority: Volume drop, Overhead spike
- Per-vertical thresholds for LTX API and LTX Studio

Co-Authored-By: Claude Sonnet 4.5

---
 agents/monitoring/be-cost/SKILL.md | 37 ++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md
index cd4cfac..0cd3beb 100644
--- a/agents/monitoring/be-cost/SKILL.md
+++ b/agents/monitoring/be-cost/SKILL.md
@@ -30,13 +30,35 @@ compatibility:
 - **What to monitor**: ALL - Total costs, cost by product (API/Studio), cost by endpoint/model/org/process, cost efficiency, GPU utilization, billing type breakdown
 - **By Segment**: ALL - LTX API (by endpoint/model/org) and LTX Studio (by process/workspace), by billing type, by GPU type
 - **Time window**: Last 30 days with 7-day rolling baseline
-- **Alert threshold**: Auto-detect problems using:
-  - Z-score > 2 (cost anomaly)
-  - DoD cost spike > 30% vs 7-day avg
-  - WoW cost increase > 20%
-  - Cost per request > 20% above baseline
-  - GPU utilization < 50% for 3+ days
-  - Failure cost > $500/day
+- **Alert method**: Compare today's value to statistical thresholds (avg+2σ for WARNING, avg+3σ for CRITICAL)
+
+**Production Thresholds** (from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`):
+
+#### Tier 1 — High Priority (daily monitoring)
+| Alert | Threshold | Signal |
+|-------|-----------|--------|
+| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown |
+| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer |
+| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) |
+
+#### Tier 2 — Medium Priority (daily monitoring)
+| Alert | Threshold | Signal |
+|-------|-----------|--------|
+| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue |
+| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep |
+| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach |
+
+#### Tier 3 — Low Priority (weekly review)
+| Alert | Threshold | Signal |
+|-------|-----------|--------|
+| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue |
+| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) |
+
+**Per-Vertical Thresholds:**
+- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30%
+- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15%
+
+**Alert fires when:** Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold
 
 ### 2. Read Shared Knowledge
 
@@ -125,6 +147,7 @@ For detailed table schema including all dimensions, columns, and cost calculatio
 
 | File | Read when |
 |------|-----------|
+| `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md` | **CRITICAL** - Production thresholds for LTX division (60-day analysis) |
 | `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations |
 | `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) |
 | `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) |

From 1467b249ecb3c5dc811af1ab6ad7b17560da51a9 Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:30:47 +0200
Subject: [PATCH 04/20] fix: GPU cost monitoring analyzes 3 days ago (not yesterday)

Cost data needs time to finalize, so analyze data from 3 days ago
instead of yesterday. Use DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) in
queries.
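The target-date logic described above can be sketched in Python (a minimal illustration; the helper name is hypothetical, and production queries apply the same offset in SQL via DATE_SUB):

```python
from datetime import date, timedelta

def target_analysis_date(today: date, lag_days: int = 3) -> date:
    # Mirrors DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY): cost data for the
    # most recent days is still settling, so monitoring targets today - 3.
    return today - timedelta(days=lag_days)

print(target_analysis_date(date(2026, 3, 8)))  # → 2026-03-05
```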
Co-Authored-By: Claude Sonnet 4.5

---
 agents/monitoring/be-cost/SKILL.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md
index 0cd3beb..9dff880 100644
--- a/agents/monitoring/be-cost/SKILL.md
+++ b/agents/monitoring/be-cost/SKILL.md
@@ -30,7 +30,8 @@ compatibility:
 - **What to monitor**: ALL - Total costs, cost by product (API/Studio), cost by endpoint/model/org/process, cost efficiency, GPU utilization, billing type breakdown
 - **By Segment**: ALL - LTX API (by endpoint/model/org) and LTX Studio (by process/workspace), by billing type, by GPU type
 - **Time window**: Last 30 days with 7-day rolling baseline
-- **Alert method**: Compare today's value to statistical thresholds (avg+2σ for WARNING, avg+3σ for CRITICAL)
+- **Alert timing**: Analyze data from **3 days ago** (cost data needs time to finalize)
+- **Alert method**: Compare to statistical thresholds (avg+2σ for WARNING, avg+3σ for CRITICAL)
 
 **Production Thresholds** (from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`):
 
@@ -159,6 +160,7 @@ For detailed table schema including all dimensions, columns, and cost calculatio
 ### Query Best Practices
 
 - **DO** always filter on `dt` partition column for performance
+- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date (cost data from 3 days ago)
 - **DO** filter `cost_category = 'inference'` for request-level analysis
 - **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis
 - **DO** include LT team requests only when analyzing total infrastructure spend or debugging

From f1525ae82fe5df894ea7c6a585870249329397ee Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:34:19 +0200
Subject: [PATCH 05/20] docs: add all shared files to revenue monitoring data sources

Include comprehensive shared knowledge files:
- product-context.md for business model and user types
- bq-schema.md for subscription tables and segmentation
- metric-standards.md for revenue metric definitions
- event-registry.yaml for feature-driven revenue analysis

Co-Authored-By: Claude Sonnet 4.5

---
 agents/monitoring/revenue/SKILL.md | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/agents/monitoring/revenue/SKILL.md b/agents/monitoring/revenue/SKILL.md
index 9783752..6c0cb4b 100644
--- a/agents/monitoring/revenue/SKILL.md
+++ b/agents/monitoring/revenue/SKILL.md
@@ -35,9 +35,13 @@ tags: [monitoring, revenue, subscriptions]
 - New subscriptions drop > 20% WoW
 - Enterprise renewals < 90 days out
 
-2. **Read shared files:**
-   - `shared/bq-schema.md` — Subscription tables (ltxstudio_user_tiers_dates, etc.)
-   - `shared/metric-standards.md` — Revenue metric definitions
+2. **Read Shared Knowledge:**
+
+Before writing SQL:
+- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context
+- **`shared/bq-schema.md`** — Subscription tables (ltxstudio_user_tiers_dates, ltxstudio_subscriptions, etc.), user segmentation queries
+- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV)
+- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-driven revenue)
 
 3. **Write monitoring SQL:**
    - Query current metric value
@@ -53,12 +57,14 @@ tags: [monitoring, revenue, subscriptions]
   - Document SQL in Hex or BigQuery scheduled query
   - Configure Slack webhook or notification
 
-## Reference files
+## Reference Files
 
 | File | Read when |
 |------|-----------|
-| `shared/bq-schema.md` | Writing SQL for subscription/revenue tables |
-| `shared/metric-standards.md` | Defining revenue metrics |
+| `shared/product-context.md` | Understanding LTX products, user types, business model, enterprise context |
+| `shared/bq-schema.md` | Writing SQL for subscription/revenue tables, user segmentation queries |
+| `shared/metric-standards.md` | Defining revenue metrics (MRR, ARR, churn, LTV, retention) |
+| `shared/event-registry.yaml` | Analyzing feature-driven revenue or usage patterns |
 
 ## Rules

From e5f64d26b575f3798e5da611a187f85c69a677f9 Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:39:48 +0200
Subject: [PATCH 06/20] docs: add all shared files to API runtime monitoring

Expand Step 2 to include all shared knowledge files:
- product-context.md for LTX products and API context
- bq-schema.md for API tables and GPU cost data
- metric-standards.md for performance metrics
- event-registry.yaml for event-driven metrics
- gpu-cost-query-templates.md for cost-related performance
- gpu-cost-analysis-patterns.md for cost analysis patterns

Co-Authored-By: Claude Sonnet 4.5

---
 agents/monitoring/api-runtime/SKILL.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/agents/monitoring/api-runtime/SKILL.md b/agents/monitoring/api-runtime/SKILL.md
index f481b0e..fcf2eec 100644
--- a/agents/monitoring/api-runtime/SKILL.md
+++ b/agents/monitoring/api-runtime/SKILL.md
@@ -35,9 +35,15 @@ tags: [monitoring, api, performance, latency, errors]
 - Queue time > 50% of processing time
 - Infrastructure errors > 10 requests/hour
 
-2. **Read shared files:**
-   - `shared/bq-schema.md` — GPU cost table (has API runtime data) and ltxvapi tables
-   - `shared/metric-standards.md` — Performance metric patterns
+2. **Read Shared Knowledge:**
+
+Before writing SQL:
+- **`shared/product-context.md`** — LTX products, user types, business model, API context
+- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
+- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
+- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
+- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
+- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)
 
 3. **Identify data source:**
    - For LTX API: Use `ltxvapi_api_requests_with_be_costs` or `gpu_request_attribution_and_cost`

From 9e94a600e65b6d4b5c50d28e158743b3579f02b3 Mon Sep 17 00:00:00 2001
From: Daniel Beer
Date: Sun, 8 Mar 2026 23:51:39 +0200
Subject: [PATCH 07/20] refactor: restructure usage monitor to 6-part Agent Skills format

Convert to Agent Skills spec structure:
1. Overview (Why?) - Problem context and solution
2. Requirements (What?) - Checklist of outcomes
3. Progress Tracker - Visual step indicator
4. Implementation Plan - Phases with progressive disclosure
5. Context & References - Files, scripts, data sources
6. Constraints & Done - DO/DO NOT rules and completion criteria

Changes:
- Applied progressive disclosure (PREFERRED patterns first)
- Consolidated Rules + Anti-Patterns into Constraints section
- Moved Reference Files + Production Scripts to Context section
- Added completion criteria
- Kept under 500 lines (402 lines total)

Co-Authored-By: Claude Sonnet 4.5

---
 agents/monitoring/usage/SKILL.md | 263 ++++++++++++++++---------------
 1 file changed, 132 insertions(+), 131 deletions(-)

diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md
index 900f9ee..f20fe88 100644
--- a/agents/monitoring/usage/SKILL.md
+++ b/agents/monitoring/usage/SKILL.md
@@ -1,31 +1,47 @@
 ---
 name: usage-monitor
-description: Monitor LTX Studio product usage metrics including image/video generations, downloads, tokens, active users (DAU/WAU/MAU), and engagement. Detects usage anomalies and alerts on drops.
-tags: [monitoring, usage, dau, generations, engagement]
+description: "Monitor LTX Studio product usage metrics with data-driven thresholds. Detects anomalies in DAU, generations, and token consumption. Use when: (1) detecting usage drops, (2) alerting on segment-specific problems, (3) investigating root causes of engagement changes."
+tags: [monitoring, usage, dau, generations, engagement, alerts]
 ---
 
 # Usage Monitor
 
-## When to use
+## 1. Overview (Why?)
 
-- "Monitor feature usage"
-- "Monitor DAU/MAU/WAU"
-- "Alert on usage drops"
-- "Monitor generation volumes (image/video)"
-- "Monitor token consumption"
-- "Detect usage anomalies or spikes"
+LTX Studio usage varies significantly by user segment and day-of-week. Enterprise accounts show strong weekday patterns (18% weekend usage), while Free users are stable across all days. Generic alerting thresholds produce false positives from normal variance.
 
-## Steps
+This skill provides **autonomous usage monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine problems while suppressing noise from expected patterns (Enterprise weekends, small segment variance, etc.).
 
-### 1. Run Comprehensive Analysis
+**Problem solved**: Detect usage drops that indicate product issues, enterprise churn risk, or engagement problems — without manual threshold tuning or false alarms.
 
-**Automatically analyze ALL metrics to find problems:**
-- **What to monitor**: DAU, Image Generations, Video Generations, Token Consumption
-- **By Segment**: ALL - Use full segmentation CTE from `shared/bq-schema.md` (Enterprise → Heavy → Paying → Free)
-- **Time window**: Last 30 days with 14-day rolling same-day-of-week baseline
-- **Alert method**: Compare today's value to rolling 14-day same-DOW average
+## 2. Requirements (What?)
 
-**Production Thresholds** (from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`):
+Monitor these outcomes autonomously:
+
+- [ ] DAU drops by segment (Enterprise Contract/Pilot, Heavy, Paying, Free)
+- [ ] Image generation volume changes
+- [ ] Video generation volume changes (skip for Enterprise - too volatile)
+- [ ] Token consumption trends
+- [ ] Alerts fire only when drops exceed segment-specific thresholds
+- [ ] Weekend alerts suppressed for Enterprise (weekday-only monitoring)
+- [ ] Root cause investigation identifies which orgs/tiers drove changes
+- [ ] Results formatted with severity (WARNING vs CRITICAL)
+
+## 3. Progress Tracker
+
+* [ ] Read production thresholds and shared knowledge
+* [ ] Write monitoring SQL with 14-day same-DOW baseline
+* [ ] Execute query and save results
+* [ ] Run Python alerting script
+* [ ] Analyze alerts by segment and severity
+* [ ] Investigate root cause (org-level for Enterprise, tier-level for others)
+* [ ] Present findings with recommended actions
+
+## 4. Implementation Plan
+
+### Phase 1: Read Production Thresholds
+
+Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`:
 
 | Segment | DAU | Image Gens | Video Gens | Tokens | Notes |
 |---------|-----|------------|------------|--------|-------|
@@ -35,33 +51,29 @@ tags: [monitoring, usage, dau, generations, engagement]
 | **Paying non-Enterprise** | -20% | -25% | -25% | -20% | All days |
 | **Free** | -25% | -35% | -30% | -20% | All days |
 
-**Alert fires when:** `today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)`
+**Alert logic**: `today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)`
 
-**Severity levels:**
-- **WARNING**: Drop exceeds threshold
-- **CRITICAL**: Drop exceeds 1.5x threshold
+**Severity**:
+- WARNING: Drop exceeds threshold
+- CRITICAL: Drop exceeds 1.5x threshold
 
-### 2. Read Shared Knowledge
+### Phase 2: Read Shared Knowledge
 
-Before writing SQL:
-- **`shared/bq-schema.md`** — ltxstudio_user_all_actions table schema and segmentation queries
-- **`shared/event-registry.yaml`** — Feature events and action names
-- **`shared/metric-standards.md`** — DAU/WAU/MAU, generation metrics, usage patterns
+Before writing SQL, read:
+- **`shared/bq-schema.md`** — Table schema, segmentation CTEs (lines 441-516)
+- **`shared/metric-standards.md`** — DAU/WAU/MAU, generation metrics
 - **`shared/product-context.md`** — LTX products, user types, business model
-- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related metrics)
-- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis workflows and benchmarks
+- **`shared/event-registry.yaml`** — Feature events and action names
 
 Data Nuances:
 - Table: `ltx-dwh-prod-processed.web.ltxstudio_user_all_actions`
 - Partitioned by `action_ts` (TIMESTAMP) — filter with `date(action_ts)` for date ranges
 - LT team already excluded (is_lt_team IS FALSE applied at table level)
 - Use `action_category = 'generations'`
when counting video generations -- By Segment: Use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) — Enterprise → Heavy → Paying → Free -- **Enterprise usage patterns**: Strong weekday/weekend differences. For Enterprise segment, use same-day-of-week comparisons (7-day lookback) instead of DoD on weekends. Calculate separate weekday/weekend baselines. +**Data nuances**: +- Table: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Partitioned by `dt` (DATE) — filter for performance +- LT team already excluded (no `is_lt_team` filter needed) +- Enterprise patterns: Strong weekday/weekend differences, use same-DOW comparisons -### 3. Write Monitoring SQL +### Phase 3: Write Monitoring SQL -Write SQL to query yesterday's metrics with 14-day same-day-of-week baseline: +✅ **PREFERRED: Use 14-day same-day-of-week baseline** ```sql WITH daily_metrics AS ( @@ -98,8 +110,12 @@ aggregated_metrics AS ( AVG(SUM(dau)) OVER ( PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS dau_baseline_14d - -- ... repeat for other metrics + ) AS dau_baseline_14d, + AVG(SUM(tokens)) OVER ( + PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) + ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING + ) AS tokens_baseline_14d + -- ... 
repeat for image_gens, video_gens FROM daily_metrics GROUP BY dt, segment ) @@ -107,45 +123,42 @@ SELECT * FROM aggregated_metrics WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); ``` -**Key patterns:** -- **14-day same-DOW baseline**: `AVG() OVER (PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) ...)` -- **Segmentation**: Use full CTEs from `shared/bq-schema.md` (lines 441-516) -- **Yesterday only**: Filter to `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)` +**Key patterns**: +- **Segmentation**: Use full CTE from `shared/bq-schema.md` (lines 441-516) +- **Baseline**: Partition by segment AND day-of-week to handle patterns +- **Time window**: Yesterday only (`DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)`) -### 4. Execute Query and Run Monitoring +### Phase 4: Execute Query and Run Monitoring -Run query and save to CSV: +Run query and save results: ```bash bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=csv \ < usage_monitoring.sql 2>/dev/null | grep -v "^Waiting" > usage_data_clean.csv ``` -Run Python monitoring script: +Run Python alerting script: ```bash python3 usage_monitoring_v2.py ``` -The script: -1. Reads the CSV data +**The script**: +1. Reads CSV data 2. Checks each segment × metric against production thresholds -3. **Suppresses Enterprise alerts on weekends** (weekday-only) -4. **Skips Video Generations for Enterprise** (too volatile) -5. Flags **WARNING** (drop > threshold) or **CRITICAL** (drop > 1.5x threshold) +3. Suppresses Enterprise alerts on weekends +4. Skips Video Generations for Enterprise (CV > 100%) +5. Flags WARNING (drop > threshold) or CRITICAL (drop > 1.5x threshold) + +### Phase 5: Analyze Results -### 5. Analyze Results +**When alerts fire**: -**When alerts fire:** 1. **Check severity**: CRITICAL requires immediate investigation, WARNING needs monitoring 2. **Identify segment**: Which user segment is affected? 3. **Check day-of-week**: Is this expected (e.g., weekend drop in Enterprise)? 4. 
**Investigate root cause**: - - For **Enterprise segments**: Drill down to organization level to see which clients drove the drop - - For **other segments**: Check tier distribution (Standard vs Pro vs Lite) - - Look for product issues, outages, or seasonal patterns -**Root cause investigation SQL:** +**For Enterprise segments** — Drill down to organization level: ```sql --- For Enterprise: Org-level drill-down SELECT e.org, e.account_type, @@ -159,10 +172,11 @@ GROUP BY e.org, e.account_type ORDER BY tokens DESC; ``` -### 6. Present Findings +**For other segments** — Check tier distribution (Standard vs Pro vs Lite) -**Alert format:** +### Phase 6: Present Findings +**Alert format**: ``` 🔴 CRITICAL ALERTS (N): • Segment - Metric @@ -170,9 +184,7 @@ ORDER BY tokens DESC; Drop: Z% | Threshold: T% ``` -**Root cause format:** - -For **Enterprise segments**, show client-level details: +**Root cause format (Enterprise)**: ``` Enterprise Contract - Tokens Drop @@ -186,7 +198,7 @@ Clients Driving Growth: Net Result: Only 1 out of 4 contract accounts active ``` -For **other segments**, show summary: +**Root cause format (Other segments)**: ``` Heavy Users - Tokens Drop @@ -194,88 +206,77 @@ Standard tier users consuming 36% fewer tokens per user (engagement drop, not churn) ``` -**Recommended Actions:** +**Recommended actions**: - **Immediate**: Contact account managers for stopped clients - **Urgent**: Investigate large drops (>70%) - **Monitor**: Track new/growing accounts for stability -### 7. Set Up Alert (if requested) +### Phase 7: Set Up Alert (Optional) For ongoing monitoring: 1. Save SQL query 2. Set up in BigQuery scheduled query or Hex Thread 3. Configure notification threshold -4. 
Route alerts to Slack channel (#product-alerts, #engineering-alerts) - -## Query Templates - -See `references/query-templates.md` for production-ready SQL queries: -- Daily Usage Dashboard -- Usage by Segment (with segmentation CTE guidance) -- Anomaly Detection (Z-Score) - -## Reference Files - -| File | Read when | -|------|-----------| -| `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` | **CRITICAL** - Production thresholds per segment (60-day analysis) | -| `shared/bq-schema.md` | Table schema, segmentation queries, user tier columns | -| `shared/metric-standards.md` | DAU/WAU/MAU definitions, generation metrics, usage patterns | -| `shared/event-registry.yaml` | Finding event names for specific features | - -## Production Scripts - -| Script | Purpose | -|--------|---------| -| `usage_monitoring.sql` | BigQuery query to get yesterday's metrics with baselines | -| `usage_monitoring_v2.py` | Python script that checks thresholds and generates alerts | -| `investigate_root_cause.sql` | Organization-level drill-down for Enterprise segments | +4. Route alerts to Slack (#product-alerts, #engineering-alerts) -## Rules +## 5. 
Context & References -### Query Best Practices +### Production Thresholds +- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`** — Production thresholds per segment (60-day analysis) -- **DO** always filter on `date(action_ts)` partition column for performance -- **DO** use `action_category = 'generations'` when counting video generations (to exclude clicks) -- **DO** use `NULLIF(x, 0)` when dividing by potentially-zero denominators -- **DO** exclude today's data: `date(action_ts) <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)` - -### Metric Calculation - -- **DO** report both press count AND output count for generations (see metric-standards.md section 2) -- **DO** use `action_name_detailed` for generation events when needed -- **DO** break down by `model_gen_type` for generation sub-type analysis (t2i, i2i, t2v, i2v, v2v) -- **DO NOT** filter `is_lt_team` (already excluded at table level) -- **DO NOT** mix image + video metrics — report separately - -### Analysis +### Shared Knowledge +- **`shared/bq-schema.md`** — Table schema, segmentation queries (lines 441-516) +- **`shared/metric-standards.md`** — DAU/WAU/MAU definitions, generation metrics +- **`shared/product-context.md`** — LTX products, user types, business model +- **`shared/event-registry.yaml`** — Feature events and action names -- **DO** compare against historical baseline (7-day avg, 14-day avg, prior period) -- **DO** flag anomalies with Z-score > 2 (deviates > 2 std devs from baseline) -- **DO** segment using full segmentation CTE from `shared/bq-schema.md` (Enterprise → Heavy → Paying → Free) -- **DO** use `SAFE_DIVIDE` with * 100 for percentages -- **DO** validate unusual patterns with product team before alerting +### Production Scripts +- **`usage_monitoring.sql`** — BigQuery query with 14-day same-DOW baselines +- **`usage_monitoring_v2.py`** — Python script that checks thresholds and generates alerts +- **`investigate_root_cause.sql`** — 
Organization-level drill-down for Enterprise segments + +### Data Source +Table: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Partitioned by `dt` (DATE) +- Key columns: `lt_id`, `griffin_tier_name`, `num_tokens_consumed`, `num_generate_image`, `num_generate_video` +- LT team already excluded at table level + +## 6. Constraints & Done + +### DO NOT + +- **DO NOT** use generic thresholds across segments — each has different volatility +- **DO NOT** alert on Enterprise weekends — weekend DAU is 6-7 (18% of weekday), too noisy +- **DO NOT** alert on Enterprise video gens — CV > 100%, single-user dominated +- **DO NOT** compare Monday to Sunday — day-of-week effects are huge +- **DO NOT** use simplified segmentation — hierarchy must be respected (Enterprise → Heavy → Paying → Free) +- **DO NOT** alert on 1-DAU segments — single-user noise dominates signal +- **DO NOT** use Z-scores on small segments — need 30+ data points for stable statistics +- **DO NOT** filter `is_lt_team IS FALSE` — already filtered at table level +- **DO NOT** use absolute thresholds — always compare to rolling 14-day same-DOW baseline -### Alerts +### DO -- **DO** use production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` +- **DO** always filter on `dt` partition column for performance +- **DO** use production thresholds from segment_alerting_thresholds.md - **DO** suppress Enterprise alerts on weekends (weekday-only alerting) -- **DO** skip Video Generation alerts for Enterprise (too volatile, CV > 100%) -- **DO** include segment breakdown and root cause in alerts +- **DO** skip Video Generation alerts for Enterprise +- **DO** use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) +- **DO** compare against 14-day same-day-of-week baseline - **DO** flag CRITICAL (drop > 1.5x threshold) vs WARNING (drop > threshold) - **DO** investigate at organization level for Enterprise, tier level for others -- **DO NOT** 
use absolute thresholds — always compare to rolling 14-day same-DOW baseline -- **DO NOT** alert on expected patterns (e.g., Enterprise weekend drops) - -## Anti-Patterns - -| Anti-Pattern | Why | Do this instead | -|--------------|-----|-----------------| -| Alerting on Enterprise weekends | Weekend DAU is 6-7 (18% of weekday), too noisy | Skip Enterprise alerts on Sat/Sun | -| Alerting on Enterprise video gens | CV > 100%, single-user dominated | Skip video gen alerts for Enterprise | -| Using generic thresholds across segments | Each segment has different volatility | Use segment-specific thresholds from production file | -| Comparing Monday to Sunday | Day-of-week effects are huge | Use same-DOW baseline (Mon vs last Mon) | -| Simplified segmentation | Segmentation hierarchy must be respected | Use full segmentation CTE from `shared/bq-schema.md` | -| Alerting on 1-DAU segments | Single-user noise dominates signal | Production thresholds already account for this | -| Using Z-scores on small segments | Need 30+ data points for stable statistics | Use simple % drop vs baseline for Enterprise | -| Filtering `is_lt_team IS FALSE` | Already filtered at table level | No filter needed | +- **DO** include segment breakdown and root cause in alerts +- **DO** validate unusual patterns with product team before alerting +- **DO** use `SAFE_DIVIDE` with * 100 for percentages +- **DO** report both press count AND output count for generations +- **DO** use `action_name_detailed` for generation events when needed + +### Completion Criteria + +✅ All metrics monitored (DAU, tokens, image gens, video gens) +✅ Production thresholds applied per segment +✅ Alerts fire with severity levels (WARNING/CRITICAL) +✅ Root cause investigation completed +✅ Findings presented with recommended actions +✅ Enterprise weekend suppression working +✅ Enterprise video gen alerts skipped From 3bf10075dc25649ceced2867a3ed183caa66c794 Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Sun, 8 Mar 2026 
23:56:45 +0200 Subject: [PATCH 08/20] refactor: restructure all monitoring skills to 6-part Agent Skills format Convert all monitoring agents to Agent Skills spec structure: 1. Overview (Why?) - Problem context and solution 2. Requirements (What?) - Checklist of outcomes 3. Progress Tracker - Visual step indicator 4. Implementation Plan - Phases with progressive disclosure 5. Context & References - Files, scripts, data sources 6. Constraints & Done - DO/DO NOT rules and completion criteria Updated skills: - be-cost-monitoring (314 lines) - GPU cost with production thresholds - revenue-monitor (272 lines) - Revenue/subscription monitoring - enterprise-monitor (341 lines) - Enterprise account health - api-runtime-monitor (359 lines) - API performance monitoring All skills: - Applied progressive disclosure (PREFERRED patterns first) - Consolidated Rules into Constraints section (DO/DO NOT) - Moved Reference Files to Context section - Added completion criteria - Kept under 500 lines per Agent Skills spec Co-Authored-By: Claude Sonnet 4.5 --- agents/monitoring/api-runtime/SKILL.md | 317 +++++++++++++++++++------ agents/monitoring/be-cost/SKILL.md | 189 +++++++++------ agents/monitoring/enterprise/SKILL.md | 316 ++++++++++++++++++------ agents/monitoring/revenue/SKILL.md | 259 +++++++++++++++----- 4 files changed, 809 insertions(+), 272 deletions(-) diff --git a/agents/monitoring/api-runtime/SKILL.md b/agents/monitoring/api-runtime/SKILL.md index fcf2eec..f3d579a 100644 --- a/agents/monitoring/api-runtime/SKILL.md +++ b/agents/monitoring/api-runtime/SKILL.md @@ -1,43 +1,59 @@ --- name: api-runtime-monitor -description: Monitors LTX API runtime performance, latency, error rates, and throughput. Alerts on performance degradation or errors. -tags: [monitoring, api, performance, latency, errors] +description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. 
Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org." +tags: [monitoring, api, performance, latency, errors, throughput] --- # API Runtime Monitor -## When to use +## 1. Overview (Why?) -- "Monitor API latency" -- "Alert on API errors" -- "Track API throughput" -- "Monitor inference time" -- "Alert on API performance degradation" +LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention. -## What it monitors +This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis. -- **Latency**: Request processing time, inference time, queue time -- **Error rates**: % of failed requests, error types, error sources -- **Throughput**: Requests per hour/day, by endpoint/model -- **Performance**: P50/P95/P99 latency, success rate -- **Utilization**: API usage by org, model, resolution +**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification. -## Steps +## 2. Requirements (What?) -1. 
**Run Comprehensive Analysis:** - - **What to monitor**: ALL - Latency (P50/P95/P99), error rates, throughput, inference time, queue time - - **By Segment**: ALL - By endpoint, model, organization, resolution - - **Time window**: Last 7 days with hourly/daily granularity - - **Alert threshold**: Auto-detect problems using: - - P95 latency > 2x baseline or > 60s - - Error rate > 5% or DoD increase > 50% - - Throughput drops > 30% DoD/WoW - - Queue time > 50% of processing time - - Infrastructure errors > 10 requests/hour +Monitor these outcomes autonomously: -2. **Read Shared Knowledge:** +- [ ] P95 latency spikes (> 2x baseline or > 60s) +- [ ] Error rate increases (> 5% or DoD increase > 50%) +- [ ] Throughput drops (> 30% DoD/WoW) +- [ ] Queue time excessive (> 50% of processing time) +- [ ] Infrastructure errors (> 10 requests/hour) +- [ ] Alerts include breakdown by endpoint, model, organization +- [ ] Results formatted by priority (infrastructure vs applicative errors) +- [ ] Findings routed to appropriate team (API team or Engineering) -Before writing SQL: +## 3. Progress Tracker + +* [ ] Read shared knowledge (schema, metrics, performance patterns) +* [ ] Identify data source (ltxvapi tables or GPU cost table) +* [ ] Write monitoring SQL with percentile calculations +* [ ] Execute query for target date range +* [ ] Analyze results by endpoint, model, organization +* [ ] Separate infrastructure vs applicative errors +* [ ] Present findings with performance breakdown +* [ ] Route alerts to appropriate team + +## 4. Implementation Plan + +### Phase 1: Read Alert Thresholds + +**Generic thresholds** (data-driven analysis pending): +- P95 latency > 2x baseline or > 60s +- Error rate > 5% or DoD increase > 50% +- Throughput drops > 30% DoD/WoW +- Queue time > 50% of processing time +- Infrastructure errors > 10 requests/hour + +[!IMPORTANT] These are generic thresholds. 
Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring). + +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: - **`shared/product-context.md`** — LTX products, user types, business model, API context - **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema - **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput) @@ -45,50 +61,203 @@ Before writing SQL: - **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance) - **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance) -3. **Identify data source:** - - For LTX API: Use `ltxvapi_api_requests_with_be_costs` or `gpu_request_attribution_and_cost` - - **Key columns explained:** - - `request_processing_time_ms`: Total time from request submission to completion - - `request_inference_time_ms`: GPU processing time (actual model inference) - - `request_queue_time_ms`: Time waiting in queue before processing starts - - `result`: Request outcome (success, failed, timeout, etc.) - - `error_type`: Classification of errors (infrastructure vs applicative) - - `endpoint`: API endpoint called (e.g., /generate, /upscale) - - `model_type`: Model used (ltxv2, retake, etc.) - - `org_name`: Customer organization making the request - -4. **Write monitoring SQL:** - - Query relevant performance metric - - Calculate percentiles (P50, P95, P99) for latency - - Calculate error rate (failed / total requests) - - Compare against baseline - -5. **Present to user:** - - Show SQL query - - Show example alert format with performance breakdown - - Confirm threshold values - -6. 
**Set up alert** (manual for now): - - Document SQL - - Configure notification to engineering team - -## Reference files - -| File | Read when | -|------|-----------| -| `shared/product-context.md` | LTX products and business context | -| `shared/bq-schema.md` | API tables and GPU cost table schema | -| `shared/metric-standards.md` | Performance metric patterns | -| `shared/event-registry.yaml` | Feature events (if analyzing event-driven metrics) | -| `shared/gpu-cost-query-templates.md` | GPU cost queries (if analyzing cost-related performance) | -| `shared/gpu-cost-analysis-patterns.md` | Cost analysis patterns (if analyzing cost-related performance) | - -## Rules - -- DO use APPROX_QUANTILES for percentile calculations (P50, P95, P99) -- DO separate errors by error_source (infrastructure vs applicative) -- DO filter by result = 'success' for success rate calculations -- DO break down by endpoint, model, and resolution for detailed analysis -- DO compare current performance against historical baseline -- DO alert engineering team for infrastructure errors, product team for applicative errors -- DO partition by dt for performance +**Data nuances**: +- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` +- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance +- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name` + +### Phase 3: Identify Data Source + +✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics** + +**Key columns**: +- `request_processing_time_ms`: Total time from request submission to completion +- `request_inference_time_ms`: GPU processing time (actual model inference) +- `request_queue_time_ms`: Time waiting in queue before processing starts +- `result`: Request outcome (success, failed, timeout, etc.) 
+- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
+- `endpoint`: API endpoint called (e.g., /generate, /upscale)
+- `model_type`: Model used (ltxv2, retake, etc.)
+- `org_name`: Customer organization making the request
+
+[!IMPORTANT] Verify column name: `error_type` vs `error_source` in actual schema
+
+### Phase 4: Write Monitoring SQL
+
+✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**
+
+```sql
+WITH api_metrics AS (
+  SELECT
+    DATE(action_ts) AS dt,
+    endpoint,
+    model_type,
+    org_name,
+    COUNT(*) AS total_requests,
+    COUNTIF(result = 'success') AS successful_requests,
+    COUNTIF(result != 'success') AS failed_requests,
+    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
+    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
+    AVG(request_queue_time_ms) AS avg_queue_time_ms,
+    AVG(request_inference_time_ms) AS avg_inference_time_ms
+  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
+  -- Pull 8 full days: yesterday (the alert day) plus 7 prior days for the
+  -- baseline. Cutting off at midnight (rather than CURRENT_TIMESTAMP minus
+  -- 1 day) keeps yesterday complete regardless of when the query runs.
+  WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))
+    AND action_ts < TIMESTAMP(CURRENT_DATE())
+  GROUP BY dt, endpoint, model_type, org_name
+),
+metrics_with_baseline AS (
+  SELECT
+    *,
+    -- org_name must be in the partition: api_metrics has one row per
+    -- endpoint x model x org x day, so a ROWS frame only counts days
+    -- when each partition holds at most one row per day.
+    AVG(p95_latency_ms) OVER (
+      PARTITION BY endpoint, model_type, org_name
+      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
+    ) AS p95_latency_baseline_7d,
+    AVG(error_rate_pct) OVER (
+      PARTITION BY endpoint, model_type, org_name
+      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
+    ) AS error_rate_baseline_7d
+  FROM api_metrics
+)
+SELECT * FROM metrics_with_baseline
+WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
+```
+
+**Key patterns**:
+- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
+- **Error rate**: `SAFE_DIVIDE(failed, total) *
100` +- **Baseline**: 7-day rolling average by endpoint and model +- **Time window**: Last 7 days (shorter than usage monitoring due to higher frequency data) + +### Phase 5: Execute Query + +Run query using: +```bash +bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " + +" +``` + +### Phase 6: Analyze Results + +**For latency trends**: +- Compare P95 latency vs baseline (7-day avg) +- Flag if P95 > 2x baseline or > 60s absolute +- Identify which endpoint/model/org drove spikes + +**For error rate analysis**: +- Compare error rate vs baseline +- Separate errors by `error_type`/`error_source` (infrastructure vs applicative) +- Flag if error rate > 5% or DoD increase > 50% + +**For throughput**: +- Track requests per hour/day by endpoint +- Flag throughput drops > 30% DoD/WoW +- Identify which endpoints lost traffic + +**For queue analysis**: +- Calculate queue time as % of total processing time +- Flag if queue time > 50% of processing time +- Indicates capacity/scaling issues + +### Phase 7: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video") +- **Root cause**: Which endpoint/model/org drove the issue +- **Breakdown**: Performance metrics by dimension +- **Error classification**: Infrastructure vs applicative errors +- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure) + +**Alert format**: +``` +⚠️ API PERFORMANCE ALERT: + • Endpoint: /v1/text-to-video + Model: ltxv2 + Metric: P95 Latency + Current: 85s | Baseline: 30s + Change: +183% + + Error rate: 8.2% (baseline: 2.1%) + Error type: Infrastructure + +Recommendation: Alert Engineering team for infrastructure issue +``` + +### Phase 8: Route Alert + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification by error type: + - Infrastructure errors → Engineering team + - Applicative errors → API/Product team +4. 
Include endpoint, model, and org details in alert + +## 5. Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products and API context +- **`shared/bq-schema.md`** — API tables and GPU cost table schema +- **`shared/metric-standards.md`** — Performance metric patterns +- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics) +- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance) +- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns + +### Data Sources + +**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` +- Partitioned by `action_ts` (TIMESTAMP) +- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name` + +**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Contains API runtime data but not primary source for performance metrics + +### Endpoints +Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate` + +### Models +Common models: `ltxv2`, `retake`, etc. + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** use absolute thresholds without baseline comparison +- **DO NOT** mix infrastructure and applicative errors in same alert +- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance +- **DO NOT** forget to separate errors by error type/source + +[!IMPORTANT] Verify column name in schema: `error_type` vs `error_source` + +### DO + +- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99) +- **DO** separate errors by error_source (infrastructure vs applicative) +- **DO** filter by `result = 'success'` for success rate calculations +- **DO** break down by endpoint, model, and organization for detailed analysis +- **DO** compare current performance against historical baseline (7-day rolling avg) +- **DO** alert engineering team for infrastructure errors +- **DO** alert product/API team for applicative errors +- **DO** partition on `action_ts` or `dt` for performance +- **DO** use `ltx-dwh-explore` as execution project +- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100` +- **DO** flag P95 latency > 2x baseline or > 60s +- **DO** flag error rate > 5% or DoD increase > 50% +- **DO** flag throughput drops > 30% DoD/WoW +- **DO** flag queue time > 50% of processing time +- **DO** flag infrastructure errors > 10 requests/hour +- **DO** include endpoint, model, org details in all alerts +- **DO** validate unusual patterns with API/Engineering team before alerting + +### Completion Criteria + +✅ All performance metrics monitored (latency, errors, throughput, queue time) +✅ Alerts fire with thresholds (generic pending production analysis) +✅ Endpoint/model/org breakdown provided +✅ Errors separated by type (infrastructure vs applicative) +✅ Findings routed to appropriate team +✅ Partition filtering applied for performance +✅ Column name verified (error_type vs error_source) diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md 
index 9dff880..fb5dd29 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -1,6 +1,6 @@ --- name: be-cost-monitoring -description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift. +description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org." tags: [monitoring, costs, gpu, infrastructure, alerts] compatibility: - BigQuery access (ltx-dwh-prod-processed) @@ -9,31 +9,46 @@ compatibility: # Backend Cost Monitoring -## When to use +## 1. Overview (Why?) -- "Monitor GPU costs" -- "Analyze cost trends (daily/weekly/monthly)" -- "Detect cost anomalies or spikes" -- "Break down API costs by endpoint/model/org" -- "Break down Studio costs by process/workspace" -- "Compare API vs Studio cost distribution" -- "Investigate cost-per-request efficiency drift" -- "Monitor GPU utilization and idle costs" -- "Alert on cost budget breaches" -- "Day-over-day or week-over-week cost comparisons" +LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day-of-week. Generic alerting thresholds miss true anomalies while flagging normal variance. -## Steps +This skill provides **autonomous cost monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. -### 1. 
Run Comprehensive Analysis +**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms. -**Automatically analyze ALL cost metrics to find problems:** -- **What to monitor**: ALL - Total costs, cost by product (API/Studio), cost by endpoint/model/org/process, cost efficiency, GPU utilization, billing type breakdown -- **By Segment**: ALL - LTX API (by endpoint/model/org) and LTX Studio (by process/workspace), by billing type, by GPU type -- **Time window**: Last 30 days with 7-day rolling baseline -- **Alert timing**: Analyze data from **3 days ago** (cost data needs time to finalize) -- **Alert method**: Compare to statistical thresholds (avg+2σ for WARNING, avg+3σ for CRITICAL) +**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize). -**Production Thresholds** (from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`): +## 2. Requirements (What?) + +Monitor these outcomes autonomously: + +- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown) +- [ ] Inference cost spikes (volume surge, costlier model, heavy customer) +- [ ] Idle-to-inference ratio degradation (utilization dropping) +- [ ] Failure rate spikes (wasted compute + service quality issue) +- [ ] Cost-per-request drift (model regression or resolution creep) +- [ ] Day-over-day cost jumps (early warning signals) +- [ ] Volume drops (potential outage) +- [ ] Overhead spikes (system anomalies) +- [ ] Alerts prioritized by tier (High/Medium/Low) +- [ ] Vertical-specific thresholds (API vs Studio) + +## 3. 
Progress Tracker + +* [ ] Read production thresholds and shared knowledge +* [ ] Select appropriate query template(s) +* [ ] Execute query for 3 days ago +* [ ] Analyze results by tier priority +* [ ] Identify root cause (endpoint, model, org, process) +* [ ] Present findings with cost breakdown +* [ ] Route alerts to appropriate team (API/Studio/Engineering) + +## 4. Implementation Plan + +### Phase 1: Read Production Thresholds + +Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`: #### Tier 1 — High Priority (daily monitoring) | Alert | Threshold | Signal | @@ -59,30 +74,33 @@ compatibility: - **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30% - **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15% -**Alert fires when:** Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold +**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold -### 2. 
Read Shared Knowledge +### Phase 2: Read Shared Knowledge -Before writing SQL: -- **`shared/product-context.md`** — LTX products and business context +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) - **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) -Key learnings: +**Data nuances**: - Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` - Partitioned by `dt` (DATE) — always filter for performance -- `cost_category`: inference (requests), idle, overhead, unused -- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only) -- For infrastructure cost: `SUM(row_cost)` across all categories +- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago) +- Cost categories: inference, idle, overhead, unused +- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference only) +- Infrastructure cost = `SUM(row_cost)` across all categories + +### Phase 3: Select Query Template -### 3. Run Query Templates +✅ **PREFERRED: Run all 11 templates for comprehensive overview** -**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview. +If user didn't specify, run all templates from `shared/gpu-cost-query-templates.md`: -**If user specified specific analysis:** Select appropriate template: +**If user specified specific analysis**, select appropriate template: | User asks... 
| Use template | |-------------|-------------| @@ -96,9 +114,7 @@ Key learnings: | "Cost efficiency by model" | Cost per Request by Model | | "Which orgs cost most?" | API Cost by Organization | -See `shared/gpu-cost-query-templates.md` for all 11 query templates. - -### 4. Execute Query +### Phase 4: Execute Query Run query using: ```bash @@ -107,86 +123,107 @@ bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " " ``` -Or use BigQuery console with project `ltx-dwh-explore`. +**Or** use BigQuery console with project `ltx-dwh-explore` -### 5. Analyze Results +**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` -**For cost trends:** +### Phase 5: Analyze Results + +**For cost trends**: - Compare current period vs baseline (7-day avg, prior week, prior month) - Calculate % change and flag significant shifts (>15-20%) -**For anomaly detection:** +**For anomaly detection**: - Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg) - Investigate root cause: specific endpoint/model/org, error rate spike, billing type change -**For breakdowns:** +**For breakdowns**: - Identify top cost drivers (endpoint, model, org, process) - Calculate cost per request to spot efficiency issues - Check failure costs (wasted spend on errors) -### 6. Present Findings +### Phase 6: Present Findings Format results with: -- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday") +- **Summary**: Key finding (e.g., "GPU costs spiked 45% 3 days ago") - **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%") -- **Breakdown**: Top contributors by dimension +- **Breakdown**: Top contributors by dimension (endpoint, model, org, process) - **Recommendation**: Action to take (investigate org X, optimize model Y, alert team) +- **Priority tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low) -### 7. 
Set Up Alert (if requested) +### Phase 7: Set Up Alert (Optional) For ongoing monitoring: 1. Save SQL query 2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold -4. Route alerts to Slack channel or Linear issue +3. Configure notification threshold by tier +4. Route alerts to appropriate team: + - API cost spikes → API team + - Studio cost spikes → Studio team + - Infrastructure issues → Engineering team -## Schema Reference +## 5. Context & References -For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`. +### Production Thresholds +- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for LTX division (60-day analysis) -## Reference Files +### Shared Knowledge +- **`shared/product-context.md`** — LTX products and business context +- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) +- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) +- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries +- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) -| File | Read when | -|------|-----------| -| `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md` | **CRITICAL** - Production thresholds for LTX division (60-day analysis) | -| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations | -| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) | -| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) | -| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) | -| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, 
investigation playbooks | +### Data Source +Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `dt` (DATE) +- Division: `LTX` (LTX API + LTX Studio) +- Cost categories: inference (27%), idle (72%), overhead (0.3%) +- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name` -## Rules +### Query Templates +See `shared/gpu-cost-query-templates.md` for 11 production-ready queries -### Query Best Practices +## 6. Constraints & Done + +### DO NOT + +- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize) +- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting +- **DO NOT** mix inference and non-inference rows in same aggregation without filtering +- **DO NOT** use absolute thresholds — always compare to baseline (avg+2σ/avg+3σ) +- **DO NOT** skip partition filtering — always filter on `dt` for performance + +### DO - **DO** always filter on `dt` partition column for performance -- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date (cost data from 3 days ago) +- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date - **DO** filter `cost_category = 'inference'` for request-level analysis -- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis -- **DO** include LT team requests only when analyzing total infrastructure spend or debugging +- **DO** exclude Lightricks team with `is_lt_team IS FALSE` for customer-facing cost analysis +- **DO** include LT team only when analyzing total infrastructure spend or debugging - **DO** use `ltx-dwh-explore` as execution project - **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero -- **DO** compare against baseline (7-day avg, prior period) for trends -- **DO** round cost values to 2 decimal places for 
readability - -### Cost Calculation - - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request - **DO** use `SUM(row_cost)` across all rows for total infrastructure cost -- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting) -- **DO NOT** mix inference and non-inference rows in same aggregation without filtering - -### Analysis - +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** round cost values to 2 decimal places for readability - **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs) - **DO** investigate failure costs (wasted spend on errors) - **DO** break down by endpoint/model for API, by process for Studio - **DO** check cost per request trends to spot efficiency degradation - **DO** validate results against total infrastructure spend - -### Alerts - - **DO** set thresholds based on historical baseline, not absolute values - **DO** alert engineering team for cost spikes > 30% vs baseline - **DO** include cost breakdown and root cause in alerts - **DO** route API cost alerts to API team, Studio alerts to Studio team + +### Completion Criteria + +✅ All cost categories analyzed (idle, inference, overhead) +✅ Production thresholds applied by tier (High/Medium/Low) +✅ Alerts prioritized and routed to appropriate teams +✅ Root cause investigation completed (endpoint/model/org/process) +✅ Findings presented with cost breakdown and recommendations +✅ Query uses 3-day lookback (not yesterday) +✅ Partition filtering applied for performance diff --git a/agents/monitoring/enterprise/SKILL.md b/agents/monitoring/enterprise/SKILL.md index 02a0b57..39c2403 100644 --- a/agents/monitoring/enterprise/SKILL.md +++ b/agents/monitoring/enterprise/SKILL.md @@ -1,73 +1,255 @@ --- name: enterprise-monitor -description: Monitors enterprise account health, usage, and contract compliance.
Alerts on low engagement, quota breaches, or churn risk. -tags: [monitoring, enterprise, accounts, contracts] +description: "Monitor enterprise account health, usage, and contract compliance. Detects low engagement, quota breaches, and churn risk. Use when: (1) detecting enterprise account usage drops, (2) alerting on inactive accounts, (3) investigating power user engagement changes." +tags: [monitoring, enterprise, accounts, contracts, churn-risk] --- # Enterprise Monitor -## When to use - -- "Monitor enterprise account usage" -- "Monitor enterprise churn risk" -- "Alert when enterprise account is inactive" - -## What it monitors - -- **Account usage**: DAU, WAU, MAU per enterprise org -- **Token consumption**: Usage vs contracted quota, historical consumption trends -- **User activation**: % of seats active -- **Engagement**: Video generations, image generations, downloads per org -- **Churn signals**: Declining usage, inactive users - -## Steps - -1. **Run Comprehensive Analysis:** - - **Monitor**: ALL enterprise orgs automatically - - **What to monitor**: ALL - Account usage (DAU/WAU/MAU), token consumption vs quota, user activation, video/image generations, power user engagement - - **By Segment**: ALL enterprise accounts (apply McCann split, exclude Lightricks/Popular Pays) - - **Time window**: Last 30 days with org-specific historical baselines - - **Alert threshold**: Auto-detect problems using: - - DoD/WoW usage drops > 30% vs org's baseline - - MAU below org's 30-day average - - Token consumption < 50% of contracted quota (underutilization) - - Power user drops > 20% within org - - Zero activity for 7+ consecutive days - -2. 
**Read shared files:** - - `shared/product-context.md` — LTX products, enterprise business model, user types - - `shared/bq-schema.md` — Enterprise user segmentation queries - - `shared/metric-standards.md` — Enterprise metrics, quota tracking - - `shared/event-registry.yaml` — Feature events (if analyzing engagement) - - `shared/gpu-cost-query-templates.md` — GPU cost queries (if analyzing infrastructure costs) - - `shared/gpu-cost-analysis-patterns.md` — Cost analysis patterns (if analyzing infrastructure costs) - -3. **Identify enterprise users:** - - Use enterprise segmentation CTE from bq-schema.md (lines 441-461) - - Apply McCann split (McCann_NY vs McCann_Paris) - - Exclude Lightricks and Popular Pays - -4. **Write monitoring SQL:** - - Query org-level usage metrics - - Set baseline for each org based on their historical usage (e.g., 30-day average, 90-day trend) - - Compare current usage against org-specific baseline or contracted quota - - Flag orgs below threshold or showing decline - - Flag meaningful drops for power users (users with top usage within each org) - -5. **Present to user:** - - Show SQL query - - Show example alert format with org name and metrics - - Confirm threshold values and alert logic - -6. **Set up alert** (manual for now): - - Document SQL - - Configure notification to customer success team - -## Rules - -- DO use EXACT enterprise segmentation CTE from bq-schema.md without modification -- DO apply McCann split (McCann_NY vs McCann_Paris) -- DO exclude Lightricks and Popular Pays from enterprise orgs -- DO break out pilot vs contracted accounts -- DO NOT alert on free/self-serve users — this agent is enterprise-only -- DO include org name in alert for easy customer success follow-up +## 1. Overview (Why?) + +Enterprise accounts (contract and pilot) represent high-value customers with negotiated quotas and specific engagement patterns. 
Unlike self-serve users, enterprise usage should be monitored per-organization with org-specific baselines, since each has different team sizes, use cases, and contract terms. + +This skill provides **autonomous enterprise account monitoring** that detects declining usage, underutilization of quotas, inactive periods, and power user drops — all of which signal churn risk or engagement problems that require customer success intervention. + +**Problem solved**: Identify enterprise churn risk early through usage signals — before contracts end or accounts go completely inactive — with org-level root cause analysis. + +## 2. Requirements (What?) + +Monitor these outcomes autonomously: + +- [ ] DAU/WAU/MAU drops per enterprise org (> 30% vs org baseline) +- [ ] Token consumption vs contracted quota (underutilization < 50%) +- [ ] User activation (% of seats active) +- [ ] Video/image generation engagement per org +- [ ] Power user drops within org (> 20% decline) +- [ ] Zero activity for 7+ consecutive days +- [ ] Alerts include org name for customer success follow-up +- [ ] Pilot vs contract accounts separated +- [ ] McCann split applied (McCann_NY vs McCann_Paris) +- [ ] Lightricks and Popular Pays excluded + +## 3. Progress Tracker + +* [ ] Read shared knowledge (enterprise segmentation, schema, metrics) +* [ ] Identify enterprise users with segmentation CTE +* [ ] Write monitoring SQL with org-specific baselines +* [ ] Execute query for target date range +* [ ] Analyze results by org and account type +* [ ] Identify power user drops within each org +* [ ] Present findings with org-level details +* [ ] Route alerts to customer success team + +## 4. 
Implementation Plan + +### Phase 1: Read Alert Thresholds + +**Generic thresholds** (data-driven analysis pending): +- DoD/WoW usage drops > 30% vs org's baseline +- MAU below org's 30-day average +- Token consumption < 50% of contracted quota (underutilization) +- Power user drops > 20% within org +- Zero activity for 7+ consecutive days + +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on org-specific analysis. + +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events (if analyzing engagement) + +**Data nuances**: +- Use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- Apply McCann split: `McCann_NY` vs `McCann_Paris` +- Exclude: `Lightricks`, `Popular Pays`, `None` +- Contract accounts: Indegene, HearWell_BeWell, Novig, Cylndr Studios, Miroma, Deriv, McCann_Paris +- Pilot accounts: All other enterprise orgs + +### Phase 3: Identify Enterprise Users + +✅ **PREFERRED: Use exact segmentation CTE from bq-schema.md** + +```sql +WITH ent_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' + ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) + END AS org + FROM `ltx-dwh-prod-processed.web.ltxstudio_users` + WHERE is_enterprise_user + AND current_customer_plan_type IN ('contract', 'pilot') + AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 
'None') +), +enterprise_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN org + ELSE CONCAT(org, ' Pilot') + END AS org, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN 'Contract' + ELSE 'Pilot' + END AS account_type + FROM ent_users + WHERE org NOT IN ('Lightricks', 'Popular Pays', 'None') +) +``` + +### Phase 4: Write Monitoring SQL + +✅ **PREFERRED: Monitor all enterprise orgs with org-specific baselines** + +```sql +-- Assumes the enterprise_users CTE from Phase 3 is prepended to this query +WITH org_metrics AS ( + SELECT + a.dt, + e.org, + e.account_type, + COUNT(DISTINCT a.lt_id) AS dau, + SUM(a.num_tokens_consumed) AS tokens, + SUM(a.num_generate_image) AS image_gens, + SUM(a.num_generate_video) AS video_gens + FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a + JOIN enterprise_users e ON a.lt_id = e.lt_id + WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + GROUP BY a.dt, e.org, e.account_type +), +metrics_with_baseline AS ( + SELECT + *, + AVG(dau) OVER ( + PARTITION BY org + ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING + ) AS dau_baseline_30d, + AVG(tokens) OVER ( + PARTITION BY org + ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING + ) AS tokens_baseline_30d + FROM org_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Org-specific baselines**: Each org compared to its own 30-day average +- **McCann split**: Separate McCann_NY and McCann_Paris +- **Account type**: Contract vs Pilot + +### Phase 5: Analyze Results + +**For usage trends**: +- Compare org's current usage vs their 30-day baseline +- Flag orgs with DoD/WoW drops > 30% vs baseline +- Flag orgs with MAU below their 30-day average + +**For quota analysis**: +- Compare token consumption vs contracted quota +- Flag
underutilization (< 50% of quota) +- Identify orgs approaching or exceeding quota + +**For engagement**: +- Track power users within each org (top 20% by token usage) +- Flag power user drops > 20% within org +- Flag zero activity for 7+ consecutive days + +### Phase 6: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "Novig enterprise account inactive for 7 days") +- **Org details**: Org name, account type (Contract/Pilot), baseline usage +- **Metrics**: DAU, tokens, generations vs baseline +- **Recommendation**: Customer success action (reach out, investigate, adjust quota) + +**Alert format**: +``` +⚠️ ENTERPRISE ALERT: + • Org: Novig (Contract) + Metric: Token consumption + Current: 0 tokens | Baseline: 27K/day + Drop: -100% | Zero activity for 7 days + +Recommendation: Contact account manager immediately +``` + +### Phase 7: Route Alert + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification for customer success team +4. Include org name and account manager contact in alert + +## 5. 
Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events for engagement analysis + +### Data Sources +- **Users table**: `ltx-dwh-prod-processed.web.ltxstudio_users` +- **Usage table**: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Key columns: `lt_id`, `enterprise_name_at_purchase`, `current_enterprise_name`, `organization_name`, `current_customer_plan_type` + +### Enterprise Orgs + +**Contract accounts**: +- Indegene +- HearWell_BeWell +- Novig +- Cylndr Studios +- Miroma +- Deriv +- McCann_Paris + +**Pilot accounts**: All other enterprise orgs (suffixed with " Pilot") + +**Excluded**: Lightricks, Popular Pays, None + +## 6. Constraints & Done + +### DO NOT + +- **DO NOT** modify enterprise segmentation CTE — use exact version from bq-schema.md +- **DO NOT** alert on free/self-serve users — this agent is enterprise-only +- **DO NOT** combine McCann_NY and McCann_Paris — keep them separate +- **DO NOT** include Lightricks or Popular Pays in enterprise monitoring +- **DO NOT** use generic baselines — each org compared to its own historical usage + +### DO + +- **DO** use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- **DO** apply McCann split (McCann_NY vs McCann_Paris) +- **DO** exclude Lightricks, Popular Pays, and None from enterprise orgs +- **DO** break out pilot vs contracted accounts +- **DO** include org name in alert for customer success follow-up +- **DO** use org-specific baselines (each org's 30-day average) +- **DO** flag DoD/WoW usage drops > 30% vs org baseline +- **DO** flag MAU below org's 30-day average +- **DO** flag token consumption < 50% of contracted quota +- **DO** flag power user drops > 20% within org +- 
**DO** flag zero activity for 7+ consecutive days +- **DO** route alerts to customer success team with org details +- **DO** validate unusual patterns with customer success before alerting + +### Completion Criteria + +✅ All enterprise orgs monitored (Contract and Pilot) +✅ Org-specific baselines applied +✅ McCann split applied (NY vs Paris) +✅ Lightricks and Popular Pays excluded +✅ Alerts include org name and account type +✅ Usage drops, quota issues, and engagement drops detected +✅ Findings routed to customer success team diff --git a/agents/monitoring/revenue/SKILL.md b/agents/monitoring/revenue/SKILL.md index 6c0cb4b..804d688 100644 --- a/agents/monitoring/revenue/SKILL.md +++ b/agents/monitoring/revenue/SKILL.md @@ -1,75 +1,224 @@ --- name: revenue-monitor -description: Monitors revenue metrics, tracks subscription changes, and alerts on revenue anomalies or threshold breaches. -tags: [monitoring, revenue, subscriptions] +description: "Monitor LTX Studio revenue metrics and subscription changes. Detects revenue anomalies, churn spikes, and refund issues. Use when: (1) detecting revenue drops, (2) alerting on churn rate changes, (3) investigating subscription tier movements." +tags: [monitoring, revenue, subscriptions, churn, mrr] --- # Revenue Monitor -## When to use +## 1. Overview (Why?) -- "Monitor revenue trends" -- "Alert when revenue drops" -- "Track subscription churn" -- "Monitor MRR/ARR changes" -- "Alert on refund spikes" +LTX Studio revenue varies by tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot). Revenue monitoring requires tracking both top-line metrics (MRR, ARR) and operational indicators (churn, refunds, tier movements) to detect problems early. -## What it monitors +This skill provides **autonomous revenue monitoring** with alerting on revenue drops, churn rate spikes, refund increases, and subscription changes. It monitors all segments and identifies which tiers or plan types drive changes.
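The drop-vs-baseline alerting used across these monitors follows a common two-tier pattern: WARNING when the drop against baseline exceeds the segment threshold, CRITICAL at 1.5x that threshold. A minimal Python sketch of the classification (function and argument names are illustrative, not part of the production scripts):

```python
def classify_drop(current, baseline, threshold_pct):
    """Two-tier severity for a metric drop vs. its rolling baseline.

    threshold_pct is the segment-specific drop threshold (e.g. 0.15 for
    a 15% revenue DoD drop). Returns None when no alert should fire.
    """
    if not baseline:  # SAFE_DIVIDE analogue: no baseline, no alert
        return None
    drop = (baseline - current) / baseline
    if drop > 1.5 * threshold_pct:
        return "CRITICAL"
    if drop > threshold_pct:
        return "WARNING"
    return None

# 20% drop vs a 15% threshold -> WARNING; 30% drop (> 22.5%) -> CRITICAL
print(classify_drop(80, 100, 0.15))   # WARNING
print(classify_drop(70, 100, 0.15))   # CRITICAL
```

The same helper applies to any monitored metric (MRR, churned subscriptions, DAU), since each is compared against its own rolling baseline.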
-- **Revenue metrics**: MRR, ARR, daily revenue -- **Subscription metrics**: New subscriptions, cancellations, churns, renewals -- **Refunds**: Refund rate, refund amount -- **Tier changes**: Upgrades, downgrades -- **Enterprise contracts**: Contract value, renewals +**Problem solved**: Detect revenue problems, churn risk, and subscription health issues before they compound — with segment-level root cause analysis. -## Steps +## 2. Requirements (What?) -1. **Run Comprehensive Analysis:** - - **What to monitor**: ALL - MRR, ARR, daily revenue, new subscriptions, cancellations, churns, renewals, refunds, tier changes (upgrades/downgrades), enterprise contract values - - **By Segment**: ALL - By tier (Free/Lite/Standard/Pro/Enterprise), by plan type (self-serve/contract/pilot) - - **Time window**: Last 30 days with 7-day rolling baseline - - **Alert threshold**: Auto-detect problems using: - - Revenue drop > 15% DoD or > 10% WoW - - Churn rate > 5% or increase > 2x baseline - - Refund rate > 3% or spike > 50% DoD - - New subscriptions drop > 20% WoW - - Enterprise renewals < 90 days out +Monitor these outcomes autonomously: -2. 
**Read Shared Knowledge:** +- [ ] Revenue drops by segment (tier, plan type) +- [ ] MRR and ARR trend changes +- [ ] Churn rate increases above baseline +- [ ] Refund rate spikes +- [ ] New subscription volume drops +- [ ] Tier movements (upgrades vs downgrades) +- [ ] Enterprise contract renewals approaching (< 90 days) +- [ ] Alerts fire only when changes exceed thresholds +- [ ] Root cause identifies which tiers/plans drove changes -Before writing SQL: -- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context -- **`shared/bq-schema.md`** — Subscription tables (ltxstudio_user_tiers_dates, ltxstudio_subscriptions, etc.), user segmentation queries -- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV) -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-driven revenue) +## 3. Progress Tracker + +* [ ] Read shared knowledge (schema, metrics, business context) +* [ ] Write monitoring SQL with baseline comparisons +* [ ] Execute query for target date range +* [ ] Analyze results by segment (tier, plan type) +* [ ] Identify root cause (which segments drove changes) +* [ ] Present findings with severity and recommendations +* [ ] Set up ongoing alerts (if requested) -3. **Write monitoring SQL:** - - Query current metric value - - Compare against historical baseline or threshold - - Flag anomalies or breaches +## 4. Implementation Plan -4. **Present to user:** - - Show SQL query - - Show example alert format - - Confirm threshold values +### Phase 1: Read Alert Thresholds -5. 
**Set up alert** (manual for now): - - Document SQL in Hex or BigQuery scheduled query - - Configure Slack webhook or notification +**Generic thresholds** (data-driven analysis pending): +- Revenue drop > 15% DoD or > 10% WoW +- Churn rate > 5% or increase > 2x baseline +- Refund rate > 3% or spike > 50% DoD +- New subscriptions drop > 20% WoW +- Enterprise renewals < 90 days out -## Reference Files +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on 60-day analysis (similar to usage/GPU cost monitoring). -| File | Read when | -|------|-----------| -| `shared/product-context.md` | Understanding LTX products, user types, business model, enterprise context | -| `shared/bq-schema.md` | Writing SQL for subscription/revenue tables, user segmentation queries | -| `shared/metric-standards.md` | Defining revenue metrics (MRR, ARR, churn, LTV, retention) | -| `shared/event-registry.yaml` | Analyzing feature-driven revenue or usage patterns | +### Phase 2: Read Shared Knowledge -## Rules +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables (ltxstudio_user_tiers_dates, ltxstudio_subscriptions, etc.), user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV) +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-driven revenue) -- DO use LTX Studio subscription tables from bq-schema.md -- DO exclude is_lt_team unless explicitly requested -- DO validate thresholds with user before setting alerts -- DO NOT hardcode dates — use rolling windows -- DO account for timezone differences in daily revenue calculations +**Data nuances**: +- Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type` +- Exclude `is_lt_team` unless explicitly 
requested +- Revenue calculations vary by plan type (self-serve vs contract/pilot) + +### Phase 3: Write Monitoring SQL + +✅ **PREFERRED: Monitor all segments with baseline comparisons** + +```sql +WITH revenue_metrics AS ( + SELECT + dt, + griffin_tier_name, + plan_type, + COUNT(DISTINCT lt_id) AS active_subscribers, + SUM(CASE WHEN subscription_start_date = dt THEN 1 ELSE 0 END) AS new_subs, + SUM(CASE WHEN subscription_end_date = dt THEN 1 ELSE 0 END) AS churned_subs, + SUM(mrr_amount) AS total_mrr, + SUM(arr_amount) AS total_arr + FROM subscription_table + WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + AND dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + AND is_lt_team IS FALSE + GROUP BY dt, griffin_tier_name, plan_type +), +metrics_with_baseline AS ( + SELECT + *, + AVG(total_mrr) OVER ( + PARTITION BY griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS mrr_baseline_7d, + AVG(churned_subs) OVER ( + PARTITION BY griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS churn_baseline_7d + FROM revenue_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Segmentation**: By tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot) +- **Baseline**: 7-day rolling average for comparison +- **Time window**: Yesterday only + +### Phase 4: Execute Query + +Run query using: +```bash +bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " + +" +``` + +### Phase 5: Analyze Results + +**For revenue trends**: +- Compare current period vs baseline (7-day avg, prior week, prior month) +- Calculate % change and flag significant shifts (>15% DoD, >10% WoW) +- Identify which tiers/plans drove changes + +**For churn analysis**: +- Calculate churn rate = churned_subs / active_subscribers +- Compare to baseline and flag if > 5% or increase > 2x baseline +- Identify which 
tiers have highest churn + +**For subscription health**: +- Track new subscription volume by tier +- Monitor upgrade vs downgrade ratios +- Flag enterprise renewal dates < 90 days out + +### Phase 6: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "MRR dropped 18% yesterday") +- **Root cause**: Which segment drove the change (e.g., "Standard tier churned 12 users") +- **Breakdown**: Metrics by tier and plan type +- **Recommendation**: Action to take (investigate churn reason, contact at-risk accounts) + +**Alert format**: +``` +⚠️ REVENUE ALERT: + • Metric: MRR Drop + Current: $X | Baseline: $Y + Change: -Z% + Segment: Standard tier, self-serve + +Recommendation: Investigate recent Standard tier churns +``` + +### Phase 7: Set Up Alert (Optional) + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification threshold +4. Route alerts to Revenue/Growth team + +## 5. Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables, user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV, retention) +- **`shared/event-registry.yaml`** — Feature events for feature-driven revenue analysis + +### Data Sources +Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type`, `mrr_amount`, `arr_amount` +- Filter: `is_lt_team IS FALSE` for customer revenue + +### Tiers +- Free (no revenue) +- Lite (lowest paid tier) +- Standard (mid-tier) +- Pro (high-tier) +- Enterprise (contract/pilot) + +### Plan Types +- Self-serve (automated subscription) +- Contract (enterprise contract) +- Pilot (enterprise pilot) + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** include is_lt_team users unless explicitly requested +- **DO NOT** hardcode dates — use rolling windows +- **DO NOT** use absolute thresholds — compare to baseline +- **DO NOT** mix plan types without proper segmentation +- **DO NOT** ignore timezone differences in daily revenue calculations + +### DO + +- **DO** use LTX Studio subscription tables from bq-schema.md +- **DO** exclude is_lt_team unless explicitly requested +- **DO** validate thresholds with user before setting alerts +- **DO** use rolling windows (7-day, 30-day baselines) +- **DO** account for timezone differences in daily revenue calculations +- **DO** segment by tier AND plan type for root cause analysis +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** calculate churn rate = churned / active subscribers +- **DO** flag MRR drops > 15% DoD or > 10% WoW +- **DO** flag churn rate > 5% or increase > 2x baseline +- **DO** flag refund rate > 3% or spike > 50% DoD +- **DO** flag new subscription drops > 20% WoW +- **DO** track enterprise renewal dates < 90 days out +- **DO** include segment breakdown in all alerts +- **DO** validate unusual patterns with Revenue/Growth team before alerting + +### Completion Criteria + +✅ All revenue metrics monitored (MRR, ARR, churn, refunds, new subs) +✅ Alerts fire with thresholds (generic pending production analysis) +✅ Segment-level root cause identified +✅ Findings presented with recommendations +✅ Timezone handling applied to daily revenue +✅ Enterprise renewals tracked From ab1f8c2522bf1952c30b032067de3fb6607a39ac Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 10:04:25 +0200 Subject: [PATCH 09/20] refactor: combine SQL and Python into single usage monitor script MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Changes: - Replace separate SQL + Python files with combined usage_monitor.py - Update from percentage thresholds to 
statistical 3σ method - Embed SQL query as string in Python script - Execute BigQuery directly (no CSV intermediate) - Add --date parameter for flexible date selection - Update SKILL.md to reference combined script Benefits: - Single file to maintain (353 lines vs 413 lines) - No intermediate CSV files needed - Easier to run and schedule - Self-contained SQL + alerting logic --- agents/monitoring/usage/SKILL.md | 208 +++++++----- agents/monitoring/usage/usage_monitor.py | 303 ++++++++++++++++++ agents/monitoring/usage/usage_monitoring.sql | 179 ----------- .../monitoring/usage/usage_monitoring_v2.py | 234 -------------- 4 files changed, 433 insertions(+), 491 deletions(-) create mode 100644 agents/monitoring/usage/usage_monitor.py delete mode 100644 agents/monitoring/usage/usage_monitoring.sql delete mode 100644 agents/monitoring/usage/usage_monitoring_v2.py diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index f20fe88..bbb107c 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -10,9 +10,9 @@ tags: [monitoring, usage, dau, generations, engagement, alerts] LTX Studio usage varies significantly by user segment and day-of-week. Enterprise accounts show strong weekday patterns (18% weekend usage), while Free users are stable across all days. Generic alerting thresholds produce false positives from normal variance. -This skill provides **autonomous usage monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine problems while suppressing noise from expected patterns (Enterprise weekends, small segment variance, etc.). +This skill provides **autonomous usage monitoring** using statistical anomaly detection. It compares today's metrics against the last 10 same-day-of-week data points (e.g., last 10 Mondays) and alerts when values deviate by 3 standard deviations from the mean. 
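+The same-day-of-week 3σ rule can be sketched in a few lines of plain Python (a minimal illustration only, not the production `usage_monitor.py`; the sample DAU history below is invented for demonstration):
+
+```python
+from statistics import mean, stdev
+
+def is_anomaly(current: float, history: list[float], z_threshold: float = 3.0) -> bool:
+    """Flag `current` when it deviates from the same-day-of-week history
+    by more than `z_threshold` standard deviations.
+
+    `history` holds the last 10 same-DOW values (e.g., the last 10 Mondays).
+    """
+    mu = mean(history)
+    sigma = stdev(history)  # sample stddev, matching BigQuery's STDDEV
+    if sigma == 0:  # flat history: z-score is undefined, so never alert
+        return False
+    return abs(current - mu) / sigma > z_threshold
+
+# Ten Mondays of DAU for a stable segment (mean 1000, stddev ~13)
+mondays = [980, 1010, 995, 1020, 1005, 990, 1015, 1000, 985, 1000]
+
+print(is_anomaly(1002, mondays))  # a normal Monday
+print(is_anomaly(500, mondays))   # a severe drop, well beyond 3 sigma
+```
+
+Note that with only 10 points the sample standard deviation is itself noisy, which is one reason the skill suppresses segments that lack enough history.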
-**Problem solved**: Detect usage drops that indicate product issues, enterprise churn risk, or engagement problems — without manual threshold tuning or false alarms. +**Problem solved**: Detect usage drops that indicate product issues, enterprise churn risk, or engagement problems — using statistical thresholds that adapt to each segment's variance patterns. ## 2. Requirements (What?) @@ -29,33 +29,33 @@ Monitor these outcomes autonomously: ## 3. Progress Tracker -* [ ] Read production thresholds and shared knowledge -* [ ] Write monitoring SQL with 14-day same-DOW baseline +* [ ] Read shared knowledge (schema, metrics, segmentation) +* [ ] Write monitoring SQL with last 10 same-DOW statistics * [ ] Execute query and save results -* [ ] Run Python alerting script -* [ ] Analyze alerts by segment and severity +* [ ] Run Python alerting script (3 std dev threshold) +* [ ] Analyze alerts by segment * [ ] Investigate root cause (org-level for Enterprise, tier-level for others) * [ ] Present findings with recommended actions ## 4. Implementation Plan -### Phase 1: Read Production Thresholds +### Phase 1: Statistical Anomaly Detection Method -Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`: +**Approach**: Compare today's metric against the last 10 same-day-of-week data points using statistical thresholds. -| Segment | DAU | Image Gens | Video Gens | Tokens | Notes | -|---------|-----|------------|------------|--------|-------| -| **Enterprise Contract** | -50% | -60% | Skip | -70% | Weekday only (Mon-Fri) | -| **Enterprise Pilot** | -50% | -60% | Skip | -70% | Weekday only (Mon-Fri) | -| **Heavy Users** | -25% | -30% | -30% | -25% | All days | -| **Paying non-Enterprise** | -20% | -25% | -25% | -20% | All days | -| **Free** | -25% | -35% | -30% | -20% | All days | +**Alert logic**: +1. Collect last 10 same-day-of-week values (e.g., if today is Monday, get last 10 Mondays) +2. 
Calculate mean (μ) and standard deviation (σ) for each segment × metric +3. Alert if: `|today_value - μ| > 3σ` (3 standard deviations) -**Alert logic**: `today_value < rolling_14d_same_dow_avg * (1 - threshold_pct)` +**Why 3 standard deviations?** +- Captures true anomalies (99.7% of normal data falls within 3σ) +- Adapts to each segment's natural variance +- Reduces false positives from expected fluctuations -**Severity**: -- WARNING: Drop exceeds threshold -- CRITICAL: Drop exceeds 1.5x threshold +**Exceptions**: +- **Enterprise weekends**: Suppress alerts on Sat/Sun (too few data points, 18% of weekday usage) +- **Enterprise video gens**: Skip (CV > 100%, single-user dominated, insufficient stability for statistical thresholds) ### Phase 2: Read Shared Knowledge @@ -73,7 +73,7 @@ Before writing SQL, read: ### Phase 3: Write Monitoring SQL -✅ **PREFERRED: Use 14-day same-day-of-week baseline** +✅ **PREFERRED: Calculate mean and std dev from last 10 same-day-of-week** ```sql WITH daily_metrics AS ( @@ -94,68 +94,107 @@ WITH daily_metrics AS ( FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a LEFT JOIN enterprise_users e ON a.lt_id = e.lt_id LEFT JOIN heavy_users h ON a.lt_id = h.lt_id - WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 70 DAY) -- Need 70 days to get 10 same-DOW AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) GROUP BY a.dt, day_of_week, segment ), -aggregated_metrics AS ( +same_dow_last_10 AS ( SELECT dt, + day_of_week, segment, - SUM(dau) AS dau, - SUM(tokens) AS tokens, - SUM(image_gens) AS image_gens, - SUM(video_gens) AS video_gens, - -- 14-day same-day-of-week baseline - AVG(SUM(dau)) OVER ( - PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS dau_baseline_14d, - AVG(SUM(tokens)) OVER ( - PARTITION BY segment, EXTRACT(DAYOFWEEK FROM dt) - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS 
tokens_baseline_14d - -- ... repeat for image_gens, video_gens + dau, + tokens, + image_gens, + video_gens, + -- Get last 10 same-DOW values (excluding today) + ARRAY_AGG(dau) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS dau_last_10, + ARRAY_AGG(tokens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS tokens_last_10, + ARRAY_AGG(image_gens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS image_gens_last_10, + ARRAY_AGG(video_gens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS video_gens_last_10 FROM daily_metrics - GROUP BY dt, segment +), +stats AS ( + SELECT + dt, + segment, + dau, + tokens, + image_gens, + video_gens, + -- Calculate mean and stddev from last 10 same-DOW + (SELECT AVG(x) FROM UNNEST(dau_last_10) AS x) AS dau_mean, + (SELECT STDDEV(x) FROM UNNEST(dau_last_10) AS x) AS dau_stddev, + (SELECT AVG(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_mean, + (SELECT STDDEV(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_stddev, + (SELECT AVG(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_mean, + (SELECT STDDEV(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_stddev, + (SELECT AVG(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_mean, + (SELECT STDDEV(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_stddev + FROM same_dow_last_10 ) -SELECT * FROM aggregated_metrics +SELECT * FROM stats WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); ``` **Key patterns**: - **Segmentation**: Use full CTE from `shared/bq-schema.md` (lines 441-516) -- **Baseline**: Partition by segment AND day-of-week to handle patterns -- **Time window**: Yesterday only (`DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)`) +- **Last 10 same-DOW**: Use `ARRAY_AGG` with window frame to collect last 10 values +- **Statistics**: Calculate mean 
(AVG) and standard deviation (STDDEV) from array +- **Time window**: 70 days lookback (to ensure 10 same-DOW data points available) -### Phase 4: Execute Query and Run Monitoring +### Phase 4: Execute Monitoring -Run query and save results: +Run the combined monitoring script: ```bash -bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=csv \ - < usage_monitoring.sql 2>/dev/null | grep -v "^Waiting" > usage_data_clean.csv +# Monitor yesterday (default) +python3 usage_monitor.py + +# Monitor specific date +python3 usage_monitor.py --date 2026-03-05 ``` -Run Python alerting script: +**Prerequisites**: ```bash -python3 usage_monitoring_v2.py +pip install google-cloud-bigquery ``` **The script**: -1. Reads CSV data -2. Checks each segment × metric against production thresholds -3. Suppresses Enterprise alerts on weekends -4. Skips Video Generations for Enterprise (CV > 100%) -5. Flags WARNING (drop > threshold) or CRITICAL (drop > 1.5x threshold) +1. Executes BigQuery SQL with last 10 same-DOW statistical calculations +2. Retrieves mean (μ) and standard deviation (σ) for each segment × metric +3. Calculates z-score: `z = (current - μ) / σ` +4. Alerts if `|z| > 3` (3 standard deviations from mean) +5. Suppresses Enterprise alerts on weekends (insufficient weekend data points) +6. Skips Video Generations for Enterprise (CV > 100%, single-user dominated) +7. Flags WARNING (|z| > 3) or CRITICAL (|z| > 4.5) +8. Outputs formatted alerts with mean, stddev, z-score, and % change ### Phase 5: Analyze Results **When alerts fire**: -1. **Check severity**: CRITICAL requires immediate investigation, WARNING needs monitoring +1. **Check severity**: + - CRITICAL (|z| > 4.5): Immediate investigation required + - WARNING (|z| > 3): Monitor closely, investigate if persists 2. **Identify segment**: Which user segment is affected? 3. **Check day-of-week**: Is this expected (e.g., weekend drop in Enterprise)? -4. **Investigate root cause**: +4. 
**Validate statistical significance**: + - Is the standard deviation reasonable? (Not too small, causing false positives) + - Are there enough historical data points? (Need 10 same-DOW values) + - Is the mean representative? (No major outliers in last 10 values) +5. **Investigate root cause**: **For Enterprise segments** — Drill down to organization level: ```sql @@ -180,8 +219,15 @@ ORDER BY tokens DESC; ``` 🔴 CRITICAL ALERTS (N): • Segment - Metric - Current: X | Baseline: Y - Drop: Z% | Threshold: T% + Current: X | Mean (μ): Y | Std Dev (σ): Z + Z-score: W (|W| > 3σ threshold) + Drop: N% from mean + +⚠️ WARNING ALERTS (N): + • Segment - Metric + Current: X | Mean (μ): Y | Std Dev (σ): Z + Z-score: W (|W| > 3σ threshold) + Drop: N% from mean ``` **Root cause format (Enterprise)**: @@ -207,22 +253,25 @@ Standard tier users consuming 36% fewer tokens per user ``` **Recommended actions**: -- **Immediate**: Contact account managers for stopped clients -- **Urgent**: Investigate large drops (>70%) -- **Monitor**: Track new/growing accounts for stability +- **CRITICAL alerts (|z| > 4.5)**: Immediate investigation, contact account managers, check for infrastructure issues +- **WARNING alerts (|z| > 3)**: Monitor for persistence, investigate if alert repeats next day +- **Large absolute drops**: Even if within 3σ, very large absolute value changes warrant investigation +- **Enterprise segments**: Drill down to organization level to identify specific client changes ### Phase 7: Set Up Alert (Optional) For ongoing monitoring: -1. Save SQL query +1. Save SQL query with statistical calculations 2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold +3. Configure Python script to run daily and check 3σ threshold 4. Route alerts to Slack (#product-alerts, #engineering-alerts) +5. Monitor false positive rate and adjust threshold if needed (3σ = 99.7% confidence) ## 5. 
Context & References -### Production Thresholds -- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md`** — Production thresholds per segment (60-day analysis) +### Statistical Thresholds +- **3 standard deviations (3σ)** — Captures 99.7% of normal variance, auto-adapts to each segment's patterns +- **Historical reference**: `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` (percentage-based thresholds from 60-day analysis, now replaced by statistical method) ### Shared Knowledge - **`shared/bq-schema.md`** — Table schema, segmentation queries (lines 441-516) @@ -231,8 +280,7 @@ For ongoing monitoring: - **`shared/event-registry.yaml`** — Feature events and action names ### Production Scripts -- **`usage_monitoring.sql`** — BigQuery query with 14-day same-DOW baselines -- **`usage_monitoring_v2.py`** — Python script that checks thresholds and generates alerts +- **`usage_monitor.py`** — Combined script: SQL query + 3σ alerting logic + formatted output - **`investigate_root_cause.sql`** — Organization-level drill-down for Enterprise segments ### Data Source @@ -245,38 +293,42 @@ Table: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` ### DO NOT -- **DO NOT** use generic thresholds across segments — each has different volatility -- **DO NOT** alert on Enterprise weekends — weekend DAU is 6-7 (18% of weekday), too noisy -- **DO NOT** alert on Enterprise video gens — CV > 100%, single-user dominated -- **DO NOT** compare Monday to Sunday — day-of-week effects are huge +- **DO NOT** use generic thresholds across segments — statistical method auto-adapts to each segment's variance +- **DO NOT** alert on Enterprise weekends — weekend DAU is 6-7 (18% of weekday), too few data points for stable statistics +- **DO NOT** alert on Enterprise video gens — CV > 100%, single-user dominated, insufficient stability for statistical thresholds +- **DO NOT** compare different days of week — day-of-week 
effects are huge, always use same-DOW comparisons - **DO NOT** use simplified segmentation — hierarchy must be respected (Enterprise → Heavy → Paying → Free) -- **DO NOT** alert on 1-DAU segments — single-user noise dominates signal -- **DO NOT** use Z-scores on small segments — need 30+ data points for stable statistics +- **DO NOT** alert on segments with <10 historical same-DOW data points — insufficient for stable mean/stddev +- **DO NOT** use standard deviation when σ is near-zero — causes false positives from noise - **DO NOT** filter `is_lt_team IS FALSE` — already filtered at table level -- **DO NOT** use absolute thresholds — always compare to rolling 14-day same-DOW baseline +- **DO NOT** use absolute thresholds — always compare using statistical baselines (last 10 same-DOW) ### DO - **DO** always filter on `dt` partition column for performance -- **DO** use production thresholds from segment_alerting_thresholds.md +- **DO** use 3 standard deviations (3σ) as alert threshold — captures 99.7% of normal variance - **DO** suppress Enterprise alerts on weekends (weekday-only alerting) - **DO** skip Video Generation alerts for Enterprise - **DO** use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) -- **DO** compare against 14-day same-day-of-week baseline -- **DO** flag CRITICAL (drop > 1.5x threshold) vs WARNING (drop > threshold) +- **DO** compare against last 10 same-day-of-week data points (not all days) +- **DO** calculate mean (μ) and standard deviation (σ) from ARRAY_AGG of last 10 same-DOW values +- **DO** flag CRITICAL (|z| > 4.5) vs WARNING (|z| > 3) based on z-score severity - **DO** investigate at organization level for Enterprise, tier level for others -- **DO** include segment breakdown and root cause in alerts +- **DO** include mean, stddev, z-score, and root cause in alerts - **DO** validate unusual patterns with product team before alerting - **DO** use `SAFE_DIVIDE` with * 100 for percentages - **DO** report both press count 
AND output count for generations - **DO** use `action_name_detailed` for generation events when needed +- **DO** ensure 70-day lookback to guarantee 10 same-DOW data points available ### Completion Criteria ✅ All metrics monitored (DAU, tokens, image gens, video gens) -✅ Production thresholds applied per segment -✅ Alerts fire with severity levels (WARNING/CRITICAL) +✅ Statistical thresholds (3σ) applied per segment × metric × day-of-week +✅ Mean (μ) and standard deviation (σ) calculated from last 10 same-DOW +✅ Alerts fire with severity levels based on z-score (WARNING: |z| > 3, CRITICAL: |z| > 4.5) ✅ Root cause investigation completed -✅ Findings presented with recommended actions +✅ Findings presented with mean, stddev, z-score, and recommended actions ✅ Enterprise weekend suppression working ✅ Enterprise video gen alerts skipped +✅ 70-day lookback ensures 10 same-DOW data points available diff --git a/agents/monitoring/usage/usage_monitor.py b/agents/monitoring/usage/usage_monitor.py new file mode 100644 index 0000000..4793e78 --- /dev/null +++ b/agents/monitoring/usage/usage_monitor.py @@ -0,0 +1,303 @@ +#!/usr/bin/env python3 +""" +LTX Studio Usage Monitor - Statistical Anomaly Detection +Combines SQL query + alerting logic in one file +Uses last 10 same-day-of-week data points with 3σ threshold +""" + +import sys +from datetime import datetime, date, timedelta +from dataclasses import dataclass +from typing import Optional, List +from google.cloud import bigquery + + +# SQL Query - Statistical monitoring with last 10 same-DOW +MONITORING_QUERY = """ +WITH ent_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' + ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) + END AS org + FROM 
`ltx-dwh-prod-processed.web.ltxstudio_users` + WHERE is_enterprise_user + AND current_customer_plan_type IN ('contract', 'pilot') + AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None') +), +enterprise_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN org + ELSE CONCAT(org, ' Pilot') + END AS org, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN 'Contract' + ELSE 'Pilot' + END AS account_type + FROM ent_users + WHERE org NOT IN ('Lightricks', 'Popular Pays', 'None') +), +heavy_users AS ( + SELECT DISTINCT + u.lt_id, + u.griffin_tier_name + FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` u + WHERE u.px_first_purchase_ts IS NOT NULL + AND u.dt >= DATE_SUB(@target_date, INTERVAL 28 DAY) + AND u.dt <= DATE_SUB(@target_date, INTERVAL 1 DAY) + AND DATE(u.px_first_purchase_ts) < DATE_SUB(@target_date, INTERVAL 28 DAY) + AND u.griffin_tier_name NOT IN ('free', 'custom_plan') + AND LOWER(u.email_domain) NOT LIKE '%lightricks%' + AND u.num_tokens_consumed > 0 + GROUP BY u.lt_id, u.griffin_tier_name + HAVING COUNT(DISTINCT DATE_TRUNC(u.dt, WEEK)) >= 4 +), +daily_metrics AS ( + SELECT + a.dt, + EXTRACT(DAYOFWEEK FROM a.dt) AS day_of_week, + CASE + WHEN e.lt_id IS NOT NULL AND e.account_type = 'Contract' THEN 'Enterprise Contract' + WHEN e.lt_id IS NOT NULL AND e.account_type = 'Pilot' THEN 'Enterprise Pilot' + WHEN h.lt_id IS NOT NULL THEN 'Heavy Users' + WHEN a.griffin_tier_name <> 'free' THEN 'Paying non-Enterprise' + ELSE 'Free' + END AS segment, + COUNT(DISTINCT a.lt_id) AS dau, + SUM(a.num_tokens_consumed) AS tokens, + SUM(a.num_generate_image) AS image_gens, + SUM(a.num_generate_video) AS video_gens + FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a + LEFT JOIN enterprise_users e ON a.lt_id = 
e.lt_id + LEFT JOIN heavy_users h ON a.lt_id = h.lt_id + WHERE a.dt >= DATE_SUB(@target_date, INTERVAL 70 DAY) + AND a.dt <= DATE_SUB(@target_date, INTERVAL 1 DAY) + GROUP BY a.dt, day_of_week, segment +), +same_dow_last_10 AS ( + SELECT + dt, + day_of_week, + segment, + dau, + tokens, + image_gens, + video_gens, + ARRAY_AGG(dau) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS dau_last_10, + ARRAY_AGG(tokens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS tokens_last_10, + ARRAY_AGG(image_gens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS image_gens_last_10, + ARRAY_AGG(video_gens) OVER ( + PARTITION BY segment, day_of_week + ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING + ) AS video_gens_last_10 + FROM daily_metrics +), +stats AS ( + SELECT + dt, + segment, + dau, + tokens, + image_gens, + video_gens, + (SELECT AVG(x) FROM UNNEST(dau_last_10) AS x) AS dau_mean, + (SELECT STDDEV(x) FROM UNNEST(dau_last_10) AS x) AS dau_stddev, + (SELECT AVG(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_mean, + (SELECT STDDEV(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_stddev, + (SELECT AVG(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_mean, + (SELECT STDDEV(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_stddev, + (SELECT AVG(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_mean, + (SELECT STDDEV(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_stddev + FROM same_dow_last_10 +) +SELECT * FROM stats +WHERE dt = DATE_SUB(@target_date, INTERVAL 1 DAY) +ORDER BY + CASE segment + WHEN 'Enterprise Contract' THEN 1 + WHEN 'Enterprise Pilot' THEN 2 + WHEN 'Heavy Users' THEN 3 + WHEN 'Paying non-Enterprise' THEN 4 + WHEN 'Free' THEN 5 + END; +""" + + +@dataclass +class Alert: + segment: str + metric: str + current_value: float + mean: float + stddev: float + z_score: float + 
drop_pct: float + severity: str + + +def is_weekend(target_date: date) -> bool: + """Check if date is Saturday or Sunday""" + return target_date.weekday() >= 5 + + +def check_metric( + segment: str, + metric: str, + current: float, + mean: float, + stddev: float +) -> Optional[Alert]: + """Check if metric triggers alert using 3σ threshold""" + # Skip only when there is no usable baseline (empty or flat history). + # A zero current value must still alert: it is the most severe drop. + if mean == 0 or stddev == 0: + return None + + z_score = (current - mean) / stddev + drop_pct = ((current - mean) / mean) * 100 + + if abs(z_score) > 3: + severity = 'CRITICAL' if abs(z_score) > 4.5 else 'WARNING' + return Alert( + segment=segment, + metric=metric, + current_value=current, + mean=mean, + stddev=stddev, + z_score=z_score, + drop_pct=drop_pct, + severity=severity + ) + return None + + +# Configuration +WEEKDAY_ONLY_SEGMENTS = ['Enterprise Contract', 'Enterprise Pilot'] +SKIP_VIDEO_GENS = ['Enterprise Contract', 'Enterprise Pilot'] + + +def run_monitoring(target_date: Optional[date] = None): + """Run monitoring for the day before target_date (defaults to today, i.e. monitor yesterday). Returns (alerts, monitored_date).""" + if target_date is None: + target_date = date.today() + + monitored_date = target_date - timedelta(days=1) + is_weekend_day = is_weekend(monitored_date) + + print("=" * 80) + print(f" LTX STUDIO USAGE MONITORING - {monitored_date}") + print(f" Day: {'WEEKEND' if is_weekend_day else 'WEEKDAY'}") + print(f" Method: 3 Standard Deviations (3σ) from last 10 same-day-of-week") + print("=" * 80) + + # Execute BigQuery + client = bigquery.Client(project='ltx-dwh-explore') + job_config = bigquery.QueryJobConfig( + query_parameters=[ + bigquery.ScalarQueryParameter('target_date', 'DATE', target_date) + ] + ) + + print("\n⏳ Running BigQuery query...") + query_job = client.query(MONITORING_QUERY, job_config=job_config) + results = query_job.result() + print(f"✅ Query complete ({query_job.total_bytes_processed:,} bytes processed)\n") + + alerts = [] + + for row in results: + segment = row.segment + + # Skip weekend alerts for Enterprise + if
segment in WEEKDAY_ONLY_SEGMENTS and is_weekend_day: + continue + + # Check each metric + metrics = ['dau', 'tokens', 'image_gens', 'video_gens'] + + for metric in metrics: + # Skip video gens for Enterprise + if metric == 'video_gens' and segment in SKIP_VIDEO_GENS: + continue + + current = row[metric] or 0 + mean = row[f'{metric}_mean'] or 0 + stddev = row[f'{metric}_stddev'] or 0 + + alert = check_metric(segment, metric, current, mean, stddev) + if alert: + alerts.append(alert) + + return alerts, monitored_date + + +def print_alerts(alerts: List[Alert], monitored_date: date): + """Print formatted alerts""" + if not alerts: + print("\n🟢 NO ALERTS - All metrics within 3σ threshold\n") + return + + # Sort by severity then |z-score| + alerts.sort(key=lambda x: (0 if x.severity == 'CRITICAL' else 1, -abs(x.z_score))) + + print(f"\n🔴 {len(alerts)} ALERTS DETECTED\n") + print("=" * 80) + + critical = [a for a in alerts if a.severity == 'CRITICAL'] + warning = [a for a in alerts if a.severity == 'WARNING'] + + if critical: + print(f"\n🔴 CRITICAL ALERTS ({len(critical)}):\n") + for alert in critical: + print(f" • {alert.segment} - {alert.metric.replace('_', ' ').title()}") + print(f" Current: {alert.current_value:,.0f} | Mean (μ): {alert.mean:,.0f} | Std Dev (σ): {alert.stddev:,.0f}") + print(f" Z-score: {alert.z_score:.2f} (|z| > 3σ threshold)") + print(f" Change: {alert.drop_pct:+.1f}% from mean") + print() + + if warning: + print(f"\n⚠️ WARNING ALERTS ({len(warning)}):\n") + for alert in warning: + print(f" • {alert.segment} - {alert.metric.replace('_', ' ').title()}") + print(f" Current: {alert.current_value:,.0f} | Mean (μ): {alert.mean:,.0f} | Std Dev (σ): {alert.stddev:,.0f}") + print(f" Z-score: {alert.z_score:.2f} (|z| > 3σ threshold)") + print(f" Change: {alert.drop_pct:+.1f}% from mean") + print() + + print("=" * 80) + print(f"Total: {len(critical)} CRITICAL, {len(warning)} WARNING") + print("=" * 80) + print("\nNote: 3σ threshold captures 99.7% of normal 
variance") + print(" CRITICAL: |z| > 4.5, WARNING: 3 < |z| ≤ 4.5") + + +def main(): + """Main entry point""" + # Parse command line args (optional: --date YYYY-MM-DD) + target_date = None + if len(sys.argv) > 2 and sys.argv[1] == '--date': + target_date = datetime.strptime(sys.argv[2], '%Y-%m-%d').date() + + try: + alerts, monitored_date = run_monitoring(target_date) + print_alerts(alerts, monitored_date) + except Exception as e: + print(f"\n❌ Error: {e}", file=sys.stderr) + sys.exit(1) + + +if __name__ == "__main__": + main() diff --git a/agents/monitoring/usage/usage_monitoring.sql b/agents/monitoring/usage/usage_monitoring.sql deleted file mode 100644 index d46db3f..0000000 --- a/agents/monitoring/usage/usage_monitoring.sql +++ /dev/null @@ -1,179 +0,0 @@ --- LTX Studio Usage Monitoring Query --- Gets last 30 days of data for anomaly detection across all segments - -WITH ent_users AS ( - SELECT DISTINCT - lt_id, - CASE - WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' - WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' - ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) - END AS org - FROM `ltx-dwh-prod-processed.web.ltxstudio_users` - WHERE is_enterprise_user - AND current_customer_plan_type IN ('contract', 'pilot') - AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None') -), -enterprise_users AS ( - SELECT DISTINCT - lt_id, - CASE - WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') - THEN org - ELSE CONCAT(org, ' Pilot') - END AS org, - CASE - WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') - THEN 'Contract' - ELSE 'Pilot' - END AS account_type - FROM ent_users - WHERE org NOT IN ('Lightricks', 
'Popular Pays', 'None') -), -heavy_users AS ( - SELECT DISTINCT - u.lt_id, - u.griffin_tier_name - FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` u - WHERE u.px_first_purchase_ts IS NOT NULL - AND u.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY) - AND u.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) - AND DATE(u.px_first_purchase_ts) < DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY) - AND u.griffin_tier_name NOT IN ('free', 'custom_plan') - AND LOWER(u.email_domain) NOT LIKE '%lightricks%' - AND u.num_tokens_consumed > 0 - GROUP BY u.lt_id, u.griffin_tier_name - HAVING COUNT(DISTINCT DATE_TRUNC(u.dt, WEEK)) >= 4 -), -daily_metrics AS ( - SELECT - a.dt, - EXTRACT(DAYOFWEEK FROM a.dt) AS day_of_week, - - -- Segment assignment - CASE - WHEN e.lt_id IS NOT NULL AND e.account_type = 'Contract' THEN 'Enterprise Contract' - WHEN e.lt_id IS NOT NULL AND e.account_type = 'Pilot' THEN 'Enterprise Pilot' - WHEN h.lt_id IS NOT NULL THEN 'Heavy Users' - WHEN a.griffin_tier_name <> 'free' THEN 'Paying non-Enterprise' - ELSE 'Free' - END AS segment, - - -- Organization for Enterprise - e.org, - - -- Metrics - COUNT(DISTINCT a.lt_id) AS dau, - SUM(a.num_tokens_consumed) AS tokens, - SUM(a.num_generate_image) AS image_gens, - SUM(a.num_generate_video) AS video_gens - - FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a - LEFT JOIN enterprise_users e ON a.lt_id = e.lt_id - LEFT JOIN heavy_users h ON a.lt_id = h.lt_id - WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) - AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) - GROUP BY a.dt, day_of_week, segment, e.org -), -aggregated_metrics AS ( - SELECT - dt, - day_of_week, - segment, - - -- Aggregate metrics - SUM(dau) AS dau, - SUM(tokens) AS tokens, - SUM(image_gens) AS image_gens, - SUM(video_gens) AS video_gens, - - -- Previous day - LAG(SUM(dau), 1) OVER (PARTITION BY segment ORDER BY dt) AS dau_yesterday, - LAG(SUM(tokens), 1) OVER (PARTITION BY segment ORDER BY dt) AS tokens_yesterday, - 
LAG(SUM(image_gens), 1) OVER (PARTITION BY segment ORDER BY dt) AS image_gens_yesterday, - LAG(SUM(video_gens), 1) OVER (PARTITION BY segment ORDER BY dt) AS video_gens_yesterday, - - -- Same day last week - LAG(SUM(dau), 7) OVER (PARTITION BY segment ORDER BY dt) AS dau_last_week, - LAG(SUM(tokens), 7) OVER (PARTITION BY segment ORDER BY dt) AS tokens_last_week, - LAG(SUM(image_gens), 7) OVER (PARTITION BY segment ORDER BY dt) AS image_gens_last_week, - LAG(SUM(video_gens), 7) OVER (PARTITION BY segment ORDER BY dt) AS video_gens_last_week, - - -- 14-day baseline (weekday vs weekend for Enterprise) - AVG(SUM(dau)) OVER ( - PARTITION BY segment, - CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS dau_baseline_14d, - AVG(SUM(tokens)) OVER ( - PARTITION BY segment, - CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS tokens_baseline_14d, - AVG(SUM(image_gens)) OVER ( - PARTITION BY segment, - CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS image_gens_baseline_14d, - AVG(SUM(video_gens)) OVER ( - PARTITION BY segment, - CASE WHEN EXTRACT(DAYOFWEEK FROM dt) BETWEEN 2 AND 6 THEN 'weekday' ELSE 'weekend' END - ORDER BY dt ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING - ) AS video_gens_baseline_14d - - FROM daily_metrics - GROUP BY dt, day_of_week, segment -) -SELECT - dt, - segment, - - -- Current values - dau, - tokens, - image_gens, - video_gens, - - -- DoD comparison - dau_yesterday, - tokens_yesterday, - image_gens_yesterday, - video_gens_yesterday, - - SAFE_DIVIDE(dau - dau_yesterday, NULLIF(dau_yesterday, 0)) * 100 AS dau_dod_pct, - SAFE_DIVIDE(tokens - tokens_yesterday, NULLIF(tokens_yesterday, 0)) * 100 AS tokens_dod_pct, - SAFE_DIVIDE(image_gens - image_gens_yesterday, 
NULLIF(image_gens_yesterday, 0)) * 100 AS image_gens_dod_pct, - SAFE_DIVIDE(video_gens - video_gens_yesterday, NULLIF(video_gens_yesterday, 0)) * 100 AS video_gens_dod_pct, - - -- WoW comparison - dau_last_week, - tokens_last_week, - image_gens_last_week, - video_gens_last_week, - - SAFE_DIVIDE(dau - dau_last_week, NULLIF(dau_last_week, 0)) * 100 AS dau_wow_pct, - SAFE_DIVIDE(tokens - tokens_last_week, NULLIF(tokens_last_week, 0)) * 100 AS tokens_wow_pct, - SAFE_DIVIDE(image_gens - image_gens_last_week, NULLIF(image_gens_last_week, 0)) * 100 AS image_gens_wow_pct, - SAFE_DIVIDE(video_gens - video_gens_last_week, NULLIF(video_gens_last_week, 0)) * 100 AS video_gens_wow_pct, - - -- Baselines - dau_baseline_14d, - tokens_baseline_14d, - image_gens_baseline_14d, - video_gens_baseline_14d, - - SAFE_DIVIDE(dau - dau_baseline_14d, NULLIF(dau_baseline_14d, 0)) * 100 AS dau_vs_baseline_pct, - SAFE_DIVIDE(tokens - tokens_baseline_14d, NULLIF(tokens_baseline_14d, 0)) * 100 AS tokens_vs_baseline_pct, - SAFE_DIVIDE(image_gens - image_gens_baseline_14d, NULLIF(image_gens_baseline_14d, 0)) * 100 AS image_gens_vs_baseline_pct, - SAFE_DIVIDE(video_gens - video_gens_baseline_14d, NULLIF(video_gens_baseline_14d, 0)) * 100 AS video_gens_vs_baseline_pct - -FROM aggregated_metrics -WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) -- Yesterday only -ORDER BY - CASE segment - WHEN 'Enterprise Contract' THEN 1 - WHEN 'Enterprise Pilot' THEN 2 - WHEN 'Heavy Users' THEN 3 - WHEN 'Paying non-Enterprise' THEN 4 - WHEN 'Free' THEN 5 - END; diff --git a/agents/monitoring/usage/usage_monitoring_v2.py b/agents/monitoring/usage/usage_monitoring_v2.py deleted file mode 100644 index a7f88e9..0000000 --- a/agents/monitoring/usage/usage_monitoring_v2.py +++ /dev/null @@ -1,234 +0,0 @@ -#!/usr/bin/env python3 -""" -LTX Studio Usage Monitoring - Production Thresholds -Based on /Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md -""" - -import csv -import sys -from 
datetime import datetime -from dataclasses import dataclass -from typing import Optional - - -@dataclass -class SegmentThresholds: - """Threshold configuration per segment""" - name: str - dau_pct: float - image_gens_pct: float - video_gens_pct: Optional[float] # None = skip alerting - tokens_pct: float - weekday_only: bool # Only alert on Mon-Fri - - -# Production thresholds from segment_alerting_thresholds.md -THRESHOLDS = { - 'Enterprise Contract': SegmentThresholds( - name='Enterprise Contract', - dau_pct=0.50, # -50% - image_gens_pct=0.60, # -60% - video_gens_pct=None, # Skip - tokens_pct=0.70, # -70% - weekday_only=True - ), - 'Enterprise Pilot': SegmentThresholds( - name='Enterprise Pilot', - dau_pct=0.50, - image_gens_pct=0.60, - video_gens_pct=None, # Skip - tokens_pct=0.70, - weekday_only=True - ), - 'Free': SegmentThresholds( - name='Free', - dau_pct=0.25, - image_gens_pct=0.35, - video_gens_pct=0.30, - tokens_pct=0.20, - weekday_only=False - ), - 'Heavy Users': SegmentThresholds( - name='Heavy Users', - dau_pct=0.25, - image_gens_pct=0.30, - video_gens_pct=0.30, - tokens_pct=0.25, - weekday_only=False - ), - 'Paying non-Enterprise': SegmentThresholds( - name='Paying non-Enterprise', - dau_pct=0.20, - image_gens_pct=0.25, - video_gens_pct=0.25, - tokens_pct=0.20, - weekday_only=False - ), -} - - -@dataclass -class Alert: - segment: str - metric: str - current_value: float - baseline_14d: float - drop_pct: float - threshold_pct: float - severity: str # 'WARNING' or 'CRITICAL' - - -def parse_float(val): - """Safely parse float from CSV, handle empty strings""" - if val == '' or val is None: - return 0.0 - try: - return float(val) - except ValueError: - return 0.0 - - -def is_weekend(date_str: str) -> bool: - """Check if date is Saturday or Sunday""" - dt = datetime.strptime(date_str, '%Y-%m-%d') - return dt.weekday() >= 5 # 5=Saturday, 6=Sunday - - -def check_metric( - segment: str, - metric: str, - current: float, - baseline: float, - threshold_pct: 
float, - date_str: str -) -> Optional[Alert]: - """ - Check if metric triggers alert. - - Alert logic: today_value < rolling_14d_baseline * (1 - threshold_pct) - """ - if baseline == 0: - return None - - ratio = current / baseline - drop_pct = (1 - ratio) * 100 # Positive = drop, Negative = growth - - # Alert if drop exceeds threshold - if ratio < (1 - threshold_pct): - # Determine severity - critical_threshold = threshold_pct * 1.5 # 1.5x threshold = CRITICAL - severity = 'CRITICAL' if ratio < (1 - critical_threshold) else 'WARNING' - - return Alert( - segment=segment, - metric=metric, - current_value=current, - baseline_14d=baseline, - drop_pct=drop_pct, - threshold_pct=threshold_pct * 100, - severity=severity - ) - - return None - - -def main(): - # Read data from CSV - with open('/Users/dbeer/usage_data_clean.csv', 'r') as f: - reader = csv.DictReader(f) - rows = list(reader) - - if not rows: - print("No data found!") - sys.exit(1) - - date_str = rows[0]['dt'] - is_weekend_day = is_weekend(date_str) - - print("=" * 80) - print(f" LTX STUDIO USAGE MONITORING - {date_str}") - print(f" Day: {'WEEKEND' if is_weekend_day else 'WEEKDAY'}") - print("=" * 80) - - alerts = [] - - for row in rows: - segment = row['segment'] - - if segment not in THRESHOLDS: - continue - - thresholds = THRESHOLDS[segment] - - # Skip weekend alerts for Enterprise segments - if thresholds.weekday_only and is_weekend_day: - continue - - # Check each metric - metrics_to_check = { - 'dau': thresholds.dau_pct, - 'image_gens': thresholds.image_gens_pct, - 'video_gens': thresholds.video_gens_pct, - 'tokens': thresholds.tokens_pct, - } - - for metric, threshold in metrics_to_check.items(): - if threshold is None: # Skip this metric - continue - - current = parse_float(row[metric]) - baseline = parse_float(row[f'{metric}_baseline_14d']) - - # Skip if no baseline or current value - if baseline == 0 or current == 0: - continue - - alert = check_metric( - segment=segment, - metric=metric, - 
current=current, - baseline=baseline, - threshold_pct=threshold, - date_str=date_str - ) - - if alert: - alerts.append(alert) - - # Sort alerts by severity (CRITICAL first) then by drop % - alerts.sort(key=lambda x: (0 if x.severity == 'CRITICAL' else 1, -x.drop_pct)) - - if not alerts: - print("\n🟢 NO ALERTS - All metrics within thresholds\n") - return - - print(f"\n🔴 {len(alerts)} ALERTS DETECTED\n") - print("=" * 80) - - # Group by severity - critical_alerts = [a for a in alerts if a.severity == 'CRITICAL'] - warning_alerts = [a for a in alerts if a.severity == 'WARNING'] - - if critical_alerts: - print(f"\n🔴 CRITICAL ALERTS ({len(critical_alerts)}):\n") - for alert in critical_alerts: - print(f" • {alert.segment} - {alert.metric.replace('_', ' ').title()}") - print(f" Current: {alert.current_value:,.0f} | Baseline: {alert.baseline_14d:,.0f}") - print(f" Drop: {alert.drop_pct:.1f}% | Threshold: {alert.threshold_pct:.0f}%") - print() - - if warning_alerts: - print(f"\n⚠️ WARNING ALERTS ({len(warning_alerts)}):\n") - for alert in warning_alerts: - print(f" • {alert.segment} - {alert.metric.replace('_', ' ').title()}") - print(f" Current: {alert.current_value:,.0f} | Baseline: {alert.baseline_14d:,.0f}") - print(f" Drop: {alert.drop_pct:.1f}% | Threshold: {alert.threshold_pct:.0f}%") - print() - - print("=" * 80) - print(f"Total: {len(critical_alerts)} CRITICAL, {len(warning_alerts)} WARNING") - print("=" * 80) - - -if __name__ == "__main__": - main() From e031e24816a10aa15bde01b9bf47c3a288c1afc0 Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 10:23:22 +0200 Subject: [PATCH 10/20] Update usage monitor to detect spikes in both directions Change focus from 'detecting drops' to 'detecting data spikes' to reflect that the statistical method detects anomalies in both directions: - Increases (e.g., +46.6% token spike on 2026-03-09) - Decreases (e.g., churn, engagement drops) Updated: - Overview: Problem solved now mentions both increases and decreases - 
Requirements: Changed 'drops' to 'spikes (increases or decreases)' - Description: Changed 'detecting usage drops' to 'detecting usage anomalies' --- agents/monitoring/usage/SKILL.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index bbb107c..07d8960 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -1,6 +1,6 @@ --- name: usage-monitor -description: "Monitor LTX Studio product usage metrics with data-driven thresholds. Detects anomalies in DAU, generations, and token consumption. Use when: (1) detecting usage drops, (2) alerting on segment-specific problems, (3) investigating root causes of engagement changes." +description: "Monitor LTX Studio product usage metrics with statistical anomaly detection. Detects data spikes (increases or decreases) in DAU, generations, and token consumption. Use when: (1) detecting usage anomalies, (2) alerting on segment-specific changes, (3) investigating root causes of engagement shifts." tags: [monitoring, usage, dau, generations, engagement, alerts] --- @@ -12,17 +12,17 @@ LTX Studio usage varies significantly by user segment and day-of-week. Enterpris This skill provides **autonomous usage monitoring** using statistical anomaly detection. It compares today's metrics against the last 10 same-day-of-week data points (e.g., last 10 Mondays) and alerts when values deviate by 3 standard deviations from the mean. -**Problem solved**: Detect usage drops that indicate product issues, enterprise churn risk, or engagement problems — using statistical thresholds that adapt to each segment's variance patterns. +**Problem solved**: Detect data spikes in usage — both increases and decreases — that indicate significant changes in user behavior, product adoption, feature launches, enterprise churn risk, or engagement shifts. Uses statistical thresholds that adapt to each segment's variance patterns. ## 2. 
Requirements (What?) Monitor these outcomes autonomously: -- [ ] DAU drops by segment (Enterprise Contract/Pilot, Heavy, Paying, Free) -- [ ] Image generation volume changes +- [ ] DAU spikes (increases or decreases) by segment (Enterprise Contract/Pilot, Heavy, Paying, Free) +- [ ] Image generation volume changes (both increases and decreases) - [ ] Video generation volume changes (skip for Enterprise - too volatile) -- [ ] Token consumption trends -- [ ] Alerts fire only when drops exceed segment-specific thresholds +- [ ] Token consumption trends (spikes up or down) +- [ ] Alerts fire when values deviate beyond 3σ (increases or decreases) - [ ] Weekend alerts suppressed for Enterprise (weekday-only monitoring) - [ ] Root cause investigation identifies which orgs/tiers drove changes - [ ] Results formatted with severity (WARNING vs CRITICAL) From 79e29229fd0dbe5dc439330b7fa8564ba35433ed Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 10:30:09 +0200 Subject: [PATCH 11/20] Remove detailed usage patterns from Overview Drop paragraph about segment/day-of-week variance details. Keep Overview focused on the solution (statistical anomaly detection) rather than the detailed problem context. --- agents/monitoring/usage/SKILL.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index 07d8960..3b38403 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -8,8 +8,6 @@ tags: [monitoring, usage, dau, generations, engagement, alerts] ## 1. Overview (Why?) -LTX Studio usage varies significantly by user segment and day-of-week. Enterprise accounts show strong weekday patterns (18% weekend usage), while Free users are stable across all days. Generic alerting thresholds produce false positives from normal variance. - This skill provides **autonomous usage monitoring** using statistical anomaly detection. 
It compares today's metrics against the last 10 same-day-of-week data points (e.g., last 10 Mondays) and alerts when values deviate by 3 standard deviations from the mean. **Problem solved**: Detect data spikes in usage — both increases and decreases — that indicate significant changes in user behavior, product adoption, feature launches, enterprise churn risk, or engagement shifts. Uses statistical thresholds that adapt to each segment's variance patterns. From f7716959b76a9c1a2362e6de4f81a72b726d5de1 Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 10:54:06 +0200 Subject: [PATCH 12/20] Refactor usage monitor skill: remove duplication, apply progressive disclosure MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major changes: - Reduced from 335 lines to 182 lines (-45%) - Removed duplicate SQL query (already in usage_monitor.py) - Removed duplicate alert format examples - Consolidated overlapping phases (Phase 4 + 5 → Phase 4) - Simplified DO/DO NOT section (removed repetitive rules) - Applied progressive disclosure (method → run → analyze → present) - Kept only essential information, reference scripts for details Benefits: - Clearer, more focused instructions - Less maintenance (single source of truth in Python script) - Easier to scan and understand - Follows Agent Skills spec better (<500 lines, minimal duplication) --- agents/monitoring/usage/SKILL.md | 336 +++++++++---------------------- 1 file changed, 93 insertions(+), 243 deletions(-) diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index 3b38403..2d0eea7 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -28,305 +28,155 @@ Monitor these outcomes autonomously: ## 3. 
Progress Tracker * [ ] Read shared knowledge (schema, metrics, segmentation) -* [ ] Write monitoring SQL with last 10 same-DOW statistics -* [ ] Execute query and save results -* [ ] Run Python alerting script (3 std dev threshold) +* [ ] Run monitoring script for target date * [ ] Analyze alerts by segment * [ ] Investigate root cause (org-level for Enterprise, tier-level for others) * [ ] Present findings with recommended actions ## 4. Implementation Plan -### Phase 1: Statistical Anomaly Detection Method +### Phase 1: Understand the Statistical Method -**Approach**: Compare today's metric against the last 10 same-day-of-week data points using statistical thresholds. +**Alert logic**: `|today_value - μ| > 3σ` -**Alert logic**: -1. Collect last 10 same-day-of-week values (e.g., if today is Monday, get last 10 Mondays) -2. Calculate mean (μ) and standard deviation (σ) for each segment × metric -3. Alert if: `|today_value - μ| > 3σ` (3 standard deviations) +Where: +- μ (mean) = average of last 10 same-day-of-week values +- σ (stddev) = standard deviation of last 10 same-day-of-week values +- z-score = (today - μ) / σ **Why 3 standard deviations?** -- Captures true anomalies (99.7% of normal data falls within 3σ) -- Adapts to each segment's natural variance +- Captures 99.7% of normal variance (true anomalies only) +- Auto-adapts to each segment's natural patterns - Reduces false positives from expected fluctuations +**Severity levels**: +- WARNING: `3 < |z| ≤ 4.5` +- CRITICAL: `|z| > 4.5` + **Exceptions**: -- **Enterprise weekends**: Suppress alerts on Sat/Sun (too few data points, 18% of weekday usage) -- **Enterprise video gens**: Skip (CV > 100%, single-user dominated, insufficient stability for statistical thresholds) +- Enterprise weekends: Suppress (too few data points) +- Enterprise video gens: Skip (CV > 100%, single-user dominated) ### Phase 2: Read Shared Knowledge -Before writing SQL, read: -- **`shared/bq-schema.md`** — Table schema, segmentation CTEs 
(lines 441-516) +Before running monitoring, reference: +- **`shared/bq-schema.md`** — Segmentation CTEs (lines 441-516), table schema - **`shared/metric-standards.md`** — DAU/WAU/MAU, generation metrics - **`shared/product-context.md`** — LTX products, user types, business model -- **`shared/event-registry.yaml`** — Feature events and action names - -**Data nuances**: -- Table: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` -- Partitioned by `dt` (DATE) — filter for performance -- LT team already excluded (no `is_lt_team` filter needed) -- Enterprise patterns: Strong weekday/weekend differences, use same-DOW comparisons - -### Phase 3: Write Monitoring SQL - -✅ **PREFERRED: Calculate mean and std dev from last 10 same-day-of-week** - -```sql -WITH daily_metrics AS ( - SELECT - a.dt, - EXTRACT(DAYOFWEEK FROM a.dt) AS day_of_week, - CASE - WHEN e.lt_id IS NOT NULL AND e.account_type = 'Contract' THEN 'Enterprise Contract' - WHEN e.lt_id IS NOT NULL AND e.account_type = 'Pilot' THEN 'Enterprise Pilot' - WHEN h.lt_id IS NOT NULL THEN 'Heavy Users' - WHEN a.griffin_tier_name <> 'free' THEN 'Paying non-Enterprise' - ELSE 'Free' - END AS segment, - COUNT(DISTINCT a.lt_id) AS dau, - SUM(a.num_tokens_consumed) AS tokens, - SUM(a.num_generate_image) AS image_gens, - SUM(a.num_generate_video) AS video_gens - FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a - LEFT JOIN enterprise_users e ON a.lt_id = e.lt_id - LEFT JOIN heavy_users h ON a.lt_id = h.lt_id - WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 70 DAY) -- Need 70 days to get 10 same-DOW - AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) - GROUP BY a.dt, day_of_week, segment -), -same_dow_last_10 AS ( - SELECT - dt, - day_of_week, - segment, - dau, - tokens, - image_gens, - video_gens, - -- Get last 10 same-DOW values (excluding today) - ARRAY_AGG(dau) OVER ( - PARTITION BY segment, day_of_week - ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING - ) AS dau_last_10, - ARRAY_AGG(tokens) OVER ( - 
PARTITION BY segment, day_of_week - ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING - ) AS tokens_last_10, - ARRAY_AGG(image_gens) OVER ( - PARTITION BY segment, day_of_week - ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING - ) AS image_gens_last_10, - ARRAY_AGG(video_gens) OVER ( - PARTITION BY segment, day_of_week - ORDER BY dt ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING - ) AS video_gens_last_10 - FROM daily_metrics -), -stats AS ( - SELECT - dt, - segment, - dau, - tokens, - image_gens, - video_gens, - -- Calculate mean and stddev from last 10 same-DOW - (SELECT AVG(x) FROM UNNEST(dau_last_10) AS x) AS dau_mean, - (SELECT STDDEV(x) FROM UNNEST(dau_last_10) AS x) AS dau_stddev, - (SELECT AVG(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_mean, - (SELECT STDDEV(x) FROM UNNEST(tokens_last_10) AS x) AS tokens_stddev, - (SELECT AVG(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_mean, - (SELECT STDDEV(x) FROM UNNEST(image_gens_last_10) AS x) AS image_gens_stddev, - (SELECT AVG(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_mean, - (SELECT STDDEV(x) FROM UNNEST(video_gens_last_10) AS x) AS video_gens_stddev - FROM same_dow_last_10 -) -SELECT * FROM stats -WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); -``` -**Key patterns**: -- **Segmentation**: Use full CTE from `shared/bq-schema.md` (lines 441-516) -- **Last 10 same-DOW**: Use `ARRAY_AGG` with window frame to collect last 10 values -- **Statistics**: Calculate mean (AVG) and standard deviation (STDDEV) from array -- **Time window**: 70 days lookback (to ensure 10 same-DOW data points available) +**Key data source**: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Partitioned by `dt` (DATE) +- Key columns: `lt_id`, `griffin_tier_name`, `num_tokens_consumed`, `num_generate_image`, `num_generate_video` +- LT team already excluded at table level + +### Phase 3: Run Monitoring -### Phase 4: Execute Monitoring +Execute the combined monitoring script: -Run the combined monitoring 
script: ```bash +# Install dependency (one-time) +pip install google-cloud-bigquery + # Monitor yesterday (default) python3 usage_monitor.py # Monitor specific date -python3 usage_monitor.py --date 2026-03-05 +python3 usage_monitor.py --date 2026-03-09 ``` -**Prerequisites**: -```bash -pip install google-cloud-bigquery -``` +**What the script does**: +1. Executes BigQuery SQL with last 10 same-DOW calculations (70-day lookback) +2. Uses `ARRAY_AGG` with window frames to collect last 10 values +3. Calculates mean and stddev from arrays +4. Computes z-scores for each segment × metric +5. Alerts when `|z| > 3` +6. Suppresses Enterprise weekend alerts and video gen alerts +7. Outputs formatted results with mean, stddev, z-score -**The script**: -1. Executes BigQuery SQL with last 10 same-DOW statistical calculations -2. Retrieves mean (μ) and standard deviation (σ) for each segment × metric -3. Calculates z-score: `z = (current - μ) / σ` -4. Alerts if `|z| > 3` (3 standard deviations from mean) -5. Suppresses Enterprise alerts on weekends (insufficient weekend data points) -6. Skips Video Generations for Enterprise (CV > 100%, single-user dominated) -7. Flags WARNING (|z| > 3) or CRITICAL (|z| > 4.5) -8. Outputs formatted alerts with mean, stddev, z-score, and % change +**See**: `usage_monitor.py` for complete SQL query and alerting logic. -### Phase 5: Analyze Results +### Phase 4: Analyze Results **When alerts fire**: -1. **Check severity**: - - CRITICAL (|z| > 4.5): Immediate investigation required - - WARNING (|z| > 3): Monitor closely, investigate if persists +1. **Check severity**: CRITICAL (|z| > 4.5) requires immediate action, WARNING (|z| > 3) needs monitoring 2. **Identify segment**: Which user segment is affected? -3. **Check day-of-week**: Is this expected (e.g., weekend drop in Enterprise)? -4. **Validate statistical significance**: - - Is the standard deviation reasonable? 
(Not too small, causing false positives) - - Are there enough historical data points? (Need 10 same-DOW values) - - Is the mean representative? (No major outliers in last 10 values) -5. **Investigate root cause**: - -**For Enterprise segments** — Drill down to organization level: -```sql -SELECT - e.org, - e.account_type, - COUNT(DISTINCT a.lt_id) AS dau, - SUM(a.num_tokens_consumed) AS tokens, - SUM(a.num_generate_image) AS image_gens -FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a -JOIN enterprise_users e ON a.lt_id = e.lt_id -WHERE a.dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) -GROUP BY e.org, e.account_type -ORDER BY tokens DESC; -``` - -**For other segments** — Check tier distribution (Standard vs Pro vs Lite) - -### Phase 6: Present Findings - -**Alert format**: +3. **Validate significance**: + - Is stddev reasonable? (Not too small causing false positives) + - Are there 10+ historical same-DOW data points? + - Are there outliers in the last 10 values skewing the mean? +4. 
**Investigate root cause**: + - **Enterprise**: Drill down to organization level with `investigate_root_cause.sql` + - **Other segments**: Check tier distribution (Standard vs Pro vs Lite) + +**Example alert output**: ``` -🔴 CRITICAL ALERTS (N): - • Segment - Metric - Current: X | Mean (μ): Y | Std Dev (σ): Z - Z-score: W (|W| > 3σ threshold) - Drop: N% from mean - -⚠️ WARNING ALERTS (N): - • Segment - Metric - Current: X | Mean (μ): Y | Std Dev (σ): Z - Z-score: W (|W| > 3σ threshold) - Drop: N% from mean +⚠️ WARNING ALERTS (2): + • Free - Tokens + Current: 4,497,947 | Mean (μ): 3,068,455 | Std Dev (σ): 426,074 + Z-score: 3.36 (|z| > 3σ threshold) + Change: +46.6% from mean ``` -**Root cause format (Enterprise)**: -``` -Enterprise Contract - Tokens Drop - -Clients Driving Drop: -- Novig: 1,914 tokens → 0 (-100%) -- Miroma: 27K tokens → 0 (-100%) - -Clients Driving Growth: -- McCann_Paris: 19K → 32K (+68%) - -Net Result: Only 1 out of 4 contract accounts active -``` - -**Root cause format (Other segments)**: -``` -Heavy Users - Tokens Drop +### Phase 5: Present Findings -Standard tier users consuming 36% fewer tokens per user -(engagement drop, not churn) -``` - -**Recommended actions**: -- **CRITICAL alerts (|z| > 4.5)**: Immediate investigation, contact account managers, check for infrastructure issues -- **WARNING alerts (|z| > 3)**: Monitor for persistence, investigate if alert repeats next day -- **Large absolute drops**: Even if within 3σ, very large absolute value changes warrant investigation -- **Enterprise segments**: Drill down to organization level to identify specific client changes +Format findings with: +- **Summary**: Which segments alerted and direction (increase/decrease) +- **Severity**: CRITICAL or WARNING +- **Statistical details**: Current value, mean, stddev, z-score, % change +- **Root cause**: For Enterprise, identify which orgs drove the change +- **Recommended actions**: + - CRITICAL: Immediate investigation, contact account managers + - 
WARNING: Monitor for persistence (alert repeats next day?) + - Positive spikes: Investigate feature launches, product changes + - Negative spikes: Investigate churn events, product issues -### Phase 7: Set Up Alert (Optional) +### Phase 6: Set Up Ongoing Monitoring (Optional) -For ongoing monitoring: -1. Save SQL query with statistical calculations -2. Set up in BigQuery scheduled query or Hex Thread -3. Configure Python script to run daily and check 3σ threshold -4. Route alerts to Slack (#product-alerts, #engineering-alerts) -5. Monitor false positive rate and adjust threshold if needed (3σ = 99.7% confidence) +For scheduled monitoring: +1. Deploy `usage_monitor.py` to run daily via cron or scheduled query +2. Route alerts to Slack (#product-alerts, #engineering-alerts) +3. Monitor false positive rate and adjust threshold if needed (3σ = 99.7% confidence) ## 5. Context & References ### Statistical Thresholds -- **3 standard deviations (3σ)** — Captures 99.7% of normal variance, auto-adapts to each segment's patterns -- **Historical reference**: `/Users/dbeer/workspace/dwh-data-model-transforms/queries/segment_alerting_thresholds.md` (percentage-based thresholds from 60-day analysis, now replaced by statistical method) +- **3σ method** captures 99.7% of normal variance, auto-adapts to each segment ### Shared Knowledge -- **`shared/bq-schema.md`** — Table schema, segmentation queries (lines 441-516) -- **`shared/metric-standards.md`** — DAU/WAU/MAU definitions, generation metrics -- **`shared/product-context.md`** — LTX products, user types, business model -- **`shared/event-registry.yaml`** — Feature events and action names +- **`shared/bq-schema.md`** — Segmentation queries, table schema +- **`shared/metric-standards.md`** — Metric definitions +- **`shared/product-context.md`** — LTX products, user types ### Production Scripts -- **`usage_monitor.py`** — Combined script: SQL query + 3σ alerting logic + formatted output -- **`investigate_root_cause.sql`** — 
Organization-level drill-down for Enterprise segments - -### Data Source -Table: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` -- Partitioned by `dt` (DATE) -- Key columns: `lt_id`, `griffin_tier_name`, `num_tokens_consumed`, `num_generate_image`, `num_generate_video` -- LT team already excluded at table level +- **`usage_monitor.py`** — Combined SQL query + 3σ alerting logic (303 lines) +- **`investigate_root_cause.sql`** — Organization-level drill-down for Enterprise ## 6. Constraints & Done ### DO NOT -- **DO NOT** use generic thresholds across segments — statistical method auto-adapts to each segment's variance -- **DO NOT** alert on Enterprise weekends — weekend DAU is 6-7 (18% of weekday), too few data points for stable statistics -- **DO NOT** alert on Enterprise video gens — CV > 100%, single-user dominated, insufficient stability for statistical thresholds -- **DO NOT** compare different days of week — day-of-week effects are huge, always use same-DOW comparisons -- **DO NOT** use simplified segmentation — hierarchy must be respected (Enterprise → Heavy → Paying → Free) -- **DO NOT** alert on segments with <10 historical same-DOW data points — insufficient for stable mean/stddev -- **DO NOT** use standard deviation when σ is near-zero — causes false positives from noise -- **DO NOT** filter `is_lt_team IS FALSE` — already filtered at table level -- **DO NOT** use absolute thresholds — always compare using statistical baselines (last 10 same-DOW) +- **DO NOT** use simplified segmentation — use exact CTEs from `shared/bq-schema.md` (lines 441-516) +- **DO NOT** alert on Enterprise weekends or video gens — see the Phase 1 exceptions (weekend data too sparse, video CV > 100%) +- **DO NOT** compare different days of week — always use same-DOW comparisons +- **DO NOT** use absolute thresholds — always use statistical baselines ### DO -- **DO** always filter on `dt` partition column for performance -- **DO** use 3 standard deviations (3σ) as alert threshold — captures 99.7% of normal variance -- **DO**
suppress Enterprise alerts on weekends (weekday-only alerting) -- **DO** skip Video Generation alerts for Enterprise -- **DO** use full segmentation CTE from `shared/bq-schema.md` (lines 441-516) -- **DO** compare against last 10 same-day-of-week data points (not all days) -- **DO** calculate mean (μ) and standard deviation (σ) from ARRAY_AGG of last 10 same-DOW values -- **DO** flag CRITICAL (|z| > 4.5) vs WARNING (|z| > 3) based on z-score severity -- **DO** investigate at organization level for Enterprise, tier level for others -- **DO** include mean, stddev, z-score, and root cause in alerts -- **DO** validate unusual patterns with product team before alerting -- **DO** use `SAFE_DIVIDE` with * 100 for percentages -- **DO** report both press count AND output count for generations -- **DO** use `action_name_detailed` for generation events when needed -- **DO** ensure 70-day lookback to guarantee 10 same-DOW data points available +- **DO** filter on `dt` partition column for performance +- **DO** use 3σ as alert threshold (99.7% confidence) +- **DO** calculate mean and stddev from last 10 same-DOW via `ARRAY_AGG` +- **DO** ensure 70-day lookback for 10+ same-DOW data points +- **DO** flag CRITICAL (|z| > 4.5) vs WARNING (|z| > 3) +- **DO** investigate at org-level for Enterprise, tier-level for others +- **DO** include mean, stddev, z-score in all alerts +- **DO** validate unusual patterns with product team ### Completion Criteria -✅ All metrics monitored (DAU, tokens, image gens, video gens) -✅ Statistical thresholds (3σ) applied per segment × metric × day-of-week -✅ Mean (μ) and standard deviation (σ) calculated from last 10 same-DOW -✅ Alerts fire with severity levels based on z-score (WARNING: |z| > 3, CRITICAL: |z| > 4.5) -✅ Root cause investigation completed -✅ Findings presented with mean, stddev, z-score, and recommended actions -✅ Enterprise weekend suppression working -✅ Enterprise video gen alerts skipped -✅ 70-day lookback ensures 10 same-DOW data 
points available +✅ Script executed for target date +✅ Alerts fire with statistical details (mean, stddev, z-score) +✅ Severity levels applied correctly (WARNING/CRITICAL) +✅ Root cause investigation completed for alerts +✅ Findings presented with recommended actions +✅ Enterprise weekend suppression and video gen skip working From 88257fad541fe136d8f125e8e74c0cf888cb233d Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 10:59:05 +0200 Subject: [PATCH 13/20] Address PR review comments 1. Change date example from specific date to 'yesterday' 2. Remove '(skip for Enterprise - too volatile)' from video generation line The exception details are still preserved in Phase 1 where they belong. --- agents/monitoring/usage/SKILL.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index 2d0eea7..275911b 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -18,7 +18,7 @@ Monitor these outcomes autonomously: - [ ] DAU spikes (increases or decreases) by segment (Enterprise Contract/Pilot, Heavy, Paying, Free) - [ ] Image generation volume changes (both increases and decreases) -- [ ] Video generation volume changes (skip for Enterprise - too volatile) +- [ ] Video generation volume changes - [ ] Token consumption trends (spikes up or down) - [ ] Alerts fire when values deviate beyond 3σ (increases or decreases) - [ ] Weekend alerts suppressed for Enterprise (weekday-only monitoring) @@ -81,7 +81,7 @@ pip install google-cloud-bigquery python3 usage_monitor.py # Monitor specific date -python3 usage_monitor.py --date 2026-03-09 +python3 usage_monitor.py --date yesterday ``` **What the script does**: From 27dfadb50551def5bd42fbf55cb094ea5cab14de Mon Sep 17 00:00:00 2001 From: Assaf Hay Eden Date: Tue, 10 Mar 2026 10:52:12 +0000 Subject: [PATCH 14/20] =?UTF-8?q?Update=20usage=20monitor:=202=CF=83=20thr?= 
=?UTF-8?q?eshold,=20NOTICE=20severity,=20remove=20video=20gen=20exception?= =?UTF-8?q?s?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Lower alert threshold from 3σ to 2σ with new NOTICE severity (2 < |z| ≤ 3) - Compare yesterday's metrics instead of today's - Remove enterprise video gen exceptions - Add event-registry.yaml to shared knowledge - Remove Phase 6, Context & References section - Add Free segment to tier distribution checks Co-Authored-By: Claude Opus 4.6 --- agents/monitoring/usage/SKILL.md | 64 +++++++++++--------------------- 1 file changed, 21 insertions(+), 43 deletions(-) diff --git a/agents/monitoring/usage/SKILL.md b/agents/monitoring/usage/SKILL.md index 275911b..3a5ae63 100644 --- a/agents/monitoring/usage/SKILL.md +++ b/agents/monitoring/usage/SKILL.md @@ -1,6 +1,6 @@ --- name: usage-monitor -description: "Monitor LTX Studio product usage metrics with statistical anomaly detection. Detects data spikes (increases or decreases) in DAU, generations, and token consumption. Use when: (1) detecting usage anomalies, (2) alerting on segment-specific changes, (3) investigating root causes of engagement shifts." +description: "Monitor LTX Studio product usage metrics with statistical anomaly detection. Detects data spikes (increases or decreases) in DAU, generations, and token consumption. Use when: (1) daily monitoring and detecting usage anomalies, (2) alerting on segment-specific changes, (3) investigating root causes of engagement shifts." tags: [monitoring, usage, dau, generations, engagement, alerts] --- @@ -8,7 +8,7 @@ tags: [monitoring, usage, dau, generations, engagement, alerts] ## 1. Overview (Why?) -This skill provides **autonomous usage monitoring** using statistical anomaly detection. It compares today's metrics against the last 10 same-day-of-week data points (e.g., last 10 Mondays) and alerts when values deviate by 3 standard deviations from the mean. 
+This skill provides **autonomous usage monitoring** using statistical anomaly detection. It compares yesterday's metrics against the last 10 same-day-of-week data points (e.g., last 10 Mondays) and alerts when values deviate by 2 standard deviations from the mean. **Problem solved**: Detect data spikes in usage — both increases and decreases — that indicate significant changes in user behavior, product adoption, feature launches, enterprise churn risk, or engagement shifts. Uses statistical thresholds that adapt to each segment's variance patterns. @@ -20,10 +20,10 @@ Monitor these outcomes autonomously: - [ ] Image generation volume changes (both increases and decreases) - [ ] Video generation volume changes - [ ] Token consumption trends (spikes up or down) -- [ ] Alerts fire when values deviate beyond 3σ (increases or decreases) +- [ ] Alerts fire when values deviate beyond 2σ (increases or decreases) - [ ] Weekend alerts suppressed for Enterprise (weekday-only monitoring) - [ ] Root cause investigation identifies which orgs/tiers drove changes -- [ ] Results formatted with severity (WARNING vs CRITICAL) +- [ ] Results formatted with severity (NOTICE vs WARNING vs CRITICAL) ## 3. 
Progress Tracker @@ -37,25 +37,25 @@ Monitor these outcomes autonomously: ### Phase 1: Understand the Statistical Method -**Alert logic**: `|today_value - μ| > 3σ` +**Alert logic**: `|yesterday_value - μ| > 2σ` Where: - μ (mean) = average of last 10 same-day-of-week values - σ (stddev) = standard deviation of last 10 same-day-of-week values -- z-score = (today - μ) / σ +- z-score = (yesterday - μ) / σ -**Why 3 standard deviations?** -- Captures 99.7% of normal variance (true anomalies only) +**Why 2 standard deviations?** +- Captures 95.4% of normal variance - Auto-adapts to each segment's natural patterns -- Reduces false positives from expected fluctuations +- Balances early detection with false positive reduction **Severity levels**: +- NOTICE: `2 < |z| ≤ 3` - WARNING: `3 < |z| ≤ 4.5` - CRITICAL: `|z| > 4.5` **Exceptions**: - Enterprise weekends: Suppress (too few data points) -- Enterprise video gens: Skip (CV > 100%, single-user dominated) ### Phase 2: Read Shared Knowledge @@ -63,6 +63,7 @@ Before running monitoring, reference: - **`shared/bq-schema.md`** — Segmentation CTEs (lines 441-516), table schema - **`shared/metric-standards.md`** — DAU/WAU/MAU, generation metrics - **`shared/product-context.md`** — LTX products, user types, business model +- **`shared/event-registry.yaml`** — Known events per feature, types, status **Key data source**: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` - Partitioned by `dt` (DATE) @@ -89,8 +90,8 @@ python3 usage_monitor.py --date yesterday 2. Uses `ARRAY_AGG` with window frames to collect last 10 values 3. Calculates mean and stddev from arrays 4. Computes z-scores for each segment × metric -5. Alerts when `|z| > 3` -6. Suppresses Enterprise weekend alerts and video gen alerts +5. Alerts when `|z| > 2` +6. Suppresses Enterprise weekend alerts 7. Outputs formatted results with mean, stddev, z-score **See**: `usage_monitor.py` for complete SQL query and alerting logic. 
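[Editor's note, outside the patch: the alerting logic this patch describes — mean and stddev from the last 10 same-DOW values, a 2σ trigger, tiered severity — can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual `usage_monitor.py`; the function name is hypothetical, and population stddev is one possible reading of "standard deviation of last 10 values".]

```python
import statistics

def classify(yesterday_value, last_10_same_dow):
    """Classify yesterday's metric against the last 10 same-day-of-week values.

    Returns (z_score, severity); severity is None when |z| <= 2.
    Thresholds mirror the SKILL: NOTICE (2 < |z| <= 3),
    WARNING (3 < |z| <= 4.5), CRITICAL (|z| > 4.5).
    """
    mu = statistics.mean(last_10_same_dow)
    sigma = statistics.pstdev(last_10_same_dow)  # population stddev (a choice)
    if sigma == 0:  # near-zero stddev would produce false positives; skip
        return 0.0, None
    z = (yesterday_value - mu) / sigma
    if abs(z) > 4.5:
        return z, "CRITICAL"
    if abs(z) > 3:
        return z, "WARNING"
    if abs(z) > 2:
        return z, "NOTICE"
    return z, None
```

For example, with a baseline of nine days at 100 and one at 110 (μ = 101, σ = 3), a value of 80 yields z = −7.0 and CRITICAL severity.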
@@ -99,7 +100,7 @@ python3 usage_monitor.py --date yesterday **When alerts fire**: -1. **Check severity**: CRITICAL (|z| > 4.5) requires immediate action, WARNING (|z| > 3) needs monitoring +1. **Check severity**: CRITICAL (|z| > 4.5) requires immediate action, WARNING (|z| > 3) needs monitoring, NOTICE (|z| > 2) is informational only 2. **Identify segment**: Which user segment is affected? 3. **Validate significance**: - Is stddev reasonable? (Not too small causing false positives) @@ -107,7 +108,7 @@ python3 usage_monitor.py --date yesterday - Are there outliers in the last 10 values skewing the mean? 4. **Investigate root cause**: - **Enterprise**: Drill down to organization level with `investigate_root_cause.sql` - - **Other segments**: Check tier distribution (Standard vs Pro vs Lite) + - **Other segments**: Check tier distribution (Standard vs Pro vs Lite vs Free) **Example alert output**: ``` @@ -122,7 +123,7 @@ python3 usage_monitor.py --date yesterday Format findings with: - **Summary**: Which segments alerted and direction (increase/decrease) -- **Severity**: CRITICAL or WARNING +- **Severity**: CRITICAL, WARNING, or NOTICE - **Statistical details**: Current value, mean, stddev, z-score, % change - **Root cause**: For Enterprise, identify which orgs drove the change - **Recommended actions**: @@ -131,47 +132,24 @@ Format findings with: - Positive spikes: Investigate feature launches, product changes - Negative spikes: Investigate churn events, product issues -### Phase 6: Set Up Ongoing Monitoring (Optional) - -For scheduled monitoring: -1. Deploy `usage_monitor.py` to run daily via cron or scheduled query -2. Route alerts to Slack (#product-alerts, #engineering-alerts) -3. Monitor false positive rate and adjust threshold if needed (3σ = 99.7% confidence) - -## 5. 
Context & References - -### Statistical Thresholds -- **3σ method** captures 99.7% of normal variance, auto-adapts to each segment - -### Shared Knowledge -- **`shared/bq-schema.md`** — Segmentation queries, table schema -- **`shared/metric-standards.md`** — Metric definitions -- **`shared/product-context.md`** — LTX products, user types - -### Production Scripts -- **`usage_monitor.py`** — Combined SQL query + 3σ alerting logic (303 lines) -- **`investigate_root_cause.sql`** — Organization-level drill-down for Enterprise - -## 6. Constraints & Done +## 5. Constraints & Done ### DO NOT - **DO NOT** use simplified segmentation — use exact CTEs from `shared/bq-schema.md` (lines 441-516) -- **DO NOT** alert on Enterprise weekends or video gens — exceptions apply +- **DO NOT** alert on Enterprise weekends — exceptions apply - **DO NOT** compare different days of week — always use same-DOW comparisons - **DO NOT** use absolute thresholds — always use statistical baselines ### DO - **DO** filter on `dt` partition column for performance -- **DO** use 3σ as alert threshold (99.7% confidence) +- **DO** use 2σ as the alert threshold (95.4% confidence) for NOTICE-level alerts, 3σ for warnings - **DO** calculate mean and stddev from last 10 same-DOW via `ARRAY_AGG` - **DO** ensure 70-day lookback for 10+ same-DOW data points -- **DO** flag CRITICAL (|z| > 4.5) vs WARNING (|z| > 3) +- **DO** flag CRITICAL (|z| > 4.5) vs WARNING (|z| > 3) vs NOTICE (|z| > 2) - **DO** investigate at org-level for Enterprise, tier-level for others - **DO** include mean, stddev, z-score in all alerts -- **DO** validate unusual patterns with product team - ### Completion Criteria ✅ Script executed for target date @@ -179,4 +157,4 @@ For scheduled monitoring: ✅ Severity levels applied correctly (WARNING/CRITICAL) ✅ Root cause investigation completed for alerts ✅ Findings presented with recommended actions -✅ Enterprise weekend suppression and video gen skip working +✅ Enterprise weekend suppression working 
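[Editor's note, outside the patch: both the 3σ and 2σ versions of the skill rely on a same-day-of-week baseline — the last 10 same-DOW values, guaranteed by a 70-day lookback. A minimal sketch of that baseline collection, assuming a date-keyed history of daily metric values; the helper name is hypothetical, and production code does this in SQL via `ARRAY_AGG` with window frames.]

```python
import datetime

def same_dow_baseline(history, target_date, n=10):
    """Return the last n values sharing target_date's day of week.

    history: dict mapping datetime.date -> metric value.
    Stepping back in 7-day increments keeps the comparison same-DOW;
    a 70-day lookback yields 10 such points.
    """
    values = []
    d = target_date - datetime.timedelta(days=7)
    while d in history and len(values) < n:
        values.append(history[d])
        d -= datetime.timedelta(days=7)
    return values
```

Because the walk is in 7-day steps, weekday/weekend effects never mix — a Monday is only ever compared to prior Mondays.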
From db9816dd34f664586b95cd1703b894b5725260cb Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 20:32:09 +0200 Subject: [PATCH 15/20] feat: add production GPU cost thresholds and 3-day lookback Major updates: - Add production thresholds from 60-day statistical analysis - Tier-based alerting: Tier 1 (High), Tier 2 (Medium), Tier 3 (Low) - Change timing: analyze data from 3 days ago (cost data needs time to finalize) - Restructure to 6-part Agent Skills format - Add per-vertical thresholds (API vs Studio) Tier 1 alerts: - Idle cost spike > $15,600/day - Inference cost spike > $5,743/day - Idle-to-inference ratio > 4:1 Benefits: - Data-driven thresholds (not guesses) - Prioritized alerts by tier - Proper timing (3-day lookback) - Clear structure (6-part spec) --- agents/monitoring/be-cost/SKILL.md | 211 +++++++++++++++++++---------- 1 file changed, 139 insertions(+), 72 deletions(-) diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md index c2f177e..fb5dd29 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -1,6 +1,6 @@ --- name: be-cost-monitoring -description: Monitor and analyze backend GPU costs for LTX API and LTX Studio. Use when analyzing cost trends, detecting anomalies, breaking down costs by endpoint/model/org/process, monitoring utilization, or investigating cost efficiency drift. +description: "Monitor backend GPU costs for LTX API and LTX Studio with data-driven thresholds. Detects cost anomalies and utilization issues. Use when: (1) detecting cost spikes, (2) analyzing idle vs inference costs, (3) investigating efficiency drift by endpoint/model/org." tags: [monitoring, costs, gpu, infrastructure, alerts] compatibility: - BigQuery access (ltx-dwh-prod-processed) @@ -9,52 +9,98 @@ compatibility: # Backend Cost Monitoring -## When to use +## 1. Overview (Why?) 
-- "Monitor GPU costs" -- "Analyze cost trends (daily/weekly/monthly)" -- "Detect cost anomalies or spikes" -- "Break down API costs by endpoint/model/org" -- "Break down Studio costs by process/workspace" -- "Compare API vs Studio cost distribution" -- "Investigate cost-per-request efficiency drift" -- "Monitor GPU utilization and idle costs" -- "Alert on cost budget breaches" -- "Day-over-day or week-over-week cost comparisons" +LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day-of-week. Generic alerting thresholds miss true anomalies while flagging normal variance. -## Steps +This skill provides **autonomous cost monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. -### 1. Gather Requirements +**Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms. -Ask the user: -- **What to monitor**: Total costs, cost by product/feature, cost efficiency, utilization, anomalies -- **Scope**: LTX API, LTX Studio, or both? Specific endpoint/org/process? -- **Time window**: Daily, weekly, monthly? How far back? -- **Analysis type**: Trends, comparisons (DoD/WoW), anomaly detection, breakdowns -- **Alert threshold** (if setting up alerts): Absolute ($X/day) or relative (spike > X% vs baseline) +**Critical timing**: Analyzes data from **3 days ago** (cost data needs time to finalize). -### 2. Read Shared Knowledge +## 2. Requirements (What?) 
-Before writing SQL: -- **`shared/product-context.md`** — LTX products and business context +Monitor these outcomes autonomously: + +- [ ] Idle cost spikes (over-provisioning or traffic drop without scaledown) +- [ ] Inference cost spikes (volume surge, costlier model, heavy customer) +- [ ] Idle-to-inference ratio degradation (utilization dropping) +- [ ] Failure rate spikes (wasted compute + service quality issue) +- [ ] Cost-per-request drift (model regression or resolution creep) +- [ ] Day-over-day cost jumps (early warning signals) +- [ ] Volume drops (potential outage) +- [ ] Overhead spikes (system anomalies) +- [ ] Alerts prioritized by tier (High/Medium/Low) +- [ ] Vertical-specific thresholds (API vs Studio) + +## 3. Progress Tracker + +* [ ] Read production thresholds and shared knowledge +* [ ] Select appropriate query template(s) +* [ ] Execute query for 3 days ago +* [ ] Analyze results by tier priority +* [ ] Identify root cause (endpoint, model, org, process) +* [ ] Present findings with cost breakdown +* [ ] Route alerts to appropriate team (API/Studio/Engineering) + +## 4. 
Implementation Plan + +### Phase 1: Read Production Thresholds + +Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`: + +#### Tier 1 — High Priority (daily monitoring) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown | +| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer | +| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) | + +#### Tier 2 — Medium Priority (daily monitoring) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue | +| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep | +| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach | + +#### Tier 3 — Low Priority (weekly review) +| Alert | Threshold | Signal | +|-------|-----------|--------| +| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue | +| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) | + +**Per-Vertical Thresholds:** +- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD change > 30% +- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15% + +**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold + +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) - **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL 
queries - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) -Key learnings: +**Data nuances**: - Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` - Partitioned by `dt` (DATE) — always filter for performance -- `cost_category`: inference (requests), idle, overhead, unused -- Total cost = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference rows only) -- For infrastructure cost: `SUM(row_cost)` across all categories +- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago) +- Cost categories: inference, idle, overhead, unused +- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference only) +- Infrastructure cost = `SUM(row_cost)` across all categories -### 3. Run Query Templates +### Phase 3: Select Query Template -**If user didn't specify anything:** Run all 11 query templates from `shared/gpu-cost-query-templates.md` to provide comprehensive cost overview. +✅ **PREFERRED: Run all 11 templates for comprehensive overview** -**If user specified specific analysis:** Select appropriate template: +If user didn't specify, run all templates from `shared/gpu-cost-query-templates.md`: + +**If user specified specific analysis**, select appropriate template: | User asks... | Use template | |-------------|-------------| @@ -68,9 +114,7 @@ Key learnings: | "Cost efficiency by model" | Cost per Request by Model | | "Which orgs cost most?" | API Cost by Organization | -See `shared/gpu-cost-query-templates.md` for all 11 query templates. - -### 4. Execute Query +### Phase 4: Execute Query Run query using: ```bash @@ -79,84 +123,107 @@ bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " " ``` -Or use BigQuery console with project `ltx-dwh-explore`. 
+**Or** use BigQuery console with project `ltx-dwh-explore` + +**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` -### 5. Analyze Results +### Phase 5: Analyze Results -**For cost trends:** +**For cost trends**: - Compare current period vs baseline (7-day avg, prior week, prior month) - Calculate % change and flag significant shifts (>15-20%) -**For anomaly detection:** +**For anomaly detection**: - Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg) - Investigate root cause: specific endpoint/model/org, error rate spike, billing type change -**For breakdowns:** +**For breakdowns**: - Identify top cost drivers (endpoint, model, org, process) - Calculate cost per request to spot efficiency issues - Check failure costs (wasted spend on errors) -### 6. Present Findings +### Phase 6: Present Findings Format results with: -- **Summary**: Key finding (e.g., "GPU costs spiked 45% yesterday") +- **Summary**: Key finding (e.g., "GPU costs spiked 45% 3 days ago") - **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%") -- **Breakdown**: Top contributors by dimension +- **Breakdown**: Top contributors by dimension (endpoint, model, org, process) - **Recommendation**: Action to take (investigate org X, optimize model Y, alert team) +- **Priority tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low) -### 7. Set Up Alert (if requested) +### Phase 7: Set Up Alert (Optional) For ongoing monitoring: 1. Save SQL query 2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold -4. Route alerts to Slack channel or Linear issue +3. Configure notification threshold by tier +4. Route alerts to appropriate team: + - API cost spikes → API team + - Studio cost spikes → Studio team + - Infrastructure issues → Engineering team -## Schema Reference +## 5. 
Context & References -For detailed table schema including all dimensions, columns, and cost calculations, see `references/schema-reference.md`. +### Production Thresholds +- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for LTX division (60-day analysis) -## Reference Files +### Shared Knowledge +- **`shared/product-context.md`** — LTX products and business context +- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) +- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) +- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries +- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) + +### Data Source +Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `dt` (DATE) +- Division: `LTX` (LTX API + LTX Studio) +- Cost categories: inference (27%), idle (72%), overhead (0.3%) +- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name` + +### Query Templates +See `shared/gpu-cost-query-templates.md` for 11 production-ready queries -| File | Read when | -|------|-----------| -| `references/schema-reference.md` | GPU cost table dimensions, columns, and cost calculations | -| `shared/bq-schema.md` | Understanding GPU cost table schema (lines 418-615) | -| `shared/metric-standards.md` | GPU cost metric SQL patterns (section 13) | -| `shared/gpu-cost-query-templates.md` | Selecting query template for analysis (11 production-ready queries) | -| `shared/gpu-cost-analysis-patterns.md` | Interpreting results, workflows, benchmarks, investigation playbooks | +## 6. 
Constraints & Done -## Rules +### DO NOT -### Query Best Practices +- **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize) +- **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting +- **DO NOT** mix inference and non-inference rows in same aggregation without filtering +- **DO NOT** use absolute thresholds — always compare to baseline (avg+2σ/avg+3σ) +- **DO NOT** skip partition filtering — always filter on `dt` for performance + +### DO - **DO** always filter on `dt` partition column for performance +- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date - **DO** filter `cost_category = 'inference'` for request-level analysis -- **DO** exclude Lightricks team requests with `is_lt_team IS FALSE` for customer-facing cost analysis -- **DO** include LT team requests only when analyzing total infrastructure spend or debugging +- **DO** exclude Lightricks team with `is_lt_team IS FALSE` for customer-facing cost analysis +- **DO** include LT team only when analyzing total infrastructure spend or debugging - **DO** use `ltx-dwh-explore` as execution project - **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero -- **DO** compare against baseline (7-day avg, prior period) for trends -- **DO** round cost values to 2 decimal places for readability - -### Cost Calculation - - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request - **DO** use `SUM(row_cost)` across all rows for total infrastructure cost -- **DO NOT** sum row_cost + attributed_* across all cost_categories (double-counting) -- **DO NOT** mix inference and non-inference rows in same aggregation without filtering - -### Analysis - +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** round cost values to 2 decimal places for readability - **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 
std devs) - **DO** investigate failure costs (wasted spend on errors) - **DO** break down by endpoint/model for API, by process for Studio - **DO** check cost per request trends to spot efficiency degradation - **DO** validate results against total infrastructure spend - -### Alerts - - **DO** set thresholds based on historical baseline, not absolute values - **DO** alert engineering team for cost spikes > 30% vs baseline - **DO** include cost breakdown and root cause in alerts - **DO** route API cost alerts to API team, Studio alerts to Studio team + +### Completion Criteria + +✅ All cost categories analyzed (idle, inference, overhead) +✅ Production thresholds applied by tier (High/Medium/Low) +✅ Alerts prioritized and routed to appropriate teams +✅ Root cause investigation completed (endpoint/model/org/process) +✅ Findings presented with cost breakdown and recommendations +✅ Query uses 3-day lookback (not yesterday) +✅ Partition filtering applied for performance From dad369b7bd6e8bace8782410fcb2f6adf692bf7b Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 20:33:55 +0200 Subject: [PATCH 16/20] feat: restructure revenue monitor to 6-part Agent Skills format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major updates: - Restructure to 6-part Agent Skills format (Overview → Constraints) - Add all shared knowledge files to Phase 2 references - Add comprehensive revenue metrics (MRR, ARR, churn, refunds, new subs) - Add segment-level analysis (tier + plan type) - Add baseline comparisons (7-day rolling average) - Add alert thresholds (generic, pending production analysis) Monitoring coverage: - Revenue drops by segment - MRR and ARR trends - Churn rate increases - Refund rate spikes - New subscription volume - Tier movements (upgrades/downgrades) - Enterprise contract renewals Benefits: - Comprehensive monitoring across all revenue metrics - Segment-level root cause analysis - Clear structure (6-part spec) - 
Baseline-driven alerting --- agents/monitoring/revenue/SKILL.md | 250 +++++++++++++++++++++++------ 1 file changed, 205 insertions(+), 45 deletions(-) diff --git a/agents/monitoring/revenue/SKILL.md b/agents/monitoring/revenue/SKILL.md index 07f45b4..804d688 100644 --- a/agents/monitoring/revenue/SKILL.md +++ b/agents/monitoring/revenue/SKILL.md @@ -1,64 +1,224 @@ --- name: revenue-monitor -description: Monitors revenue metrics, tracks subscription changes, and alerts on revenue anomalies or threshold breaches. -tags: [monitoring, revenue, subscriptions] +description: "Monitor LTX Studio revenue metrics and subscription changes. Detects revenue anomalies, churn spikes, and refund issues. Use when: (1) detecting revenue drops, (2) alerting on churn rate changes, (3) investigating subscription tier movements." +tags: [monitoring, revenue, subscriptions, churn, mrr] --- # Revenue Monitor -## When to use +## 1. Overview (Why?) -- "Monitor revenue trends" -- "Alert when revenue drops" -- "Track subscription churn" -- "Monitor MRR/ARR changes" -- "Alert on refund spikes" +LTX Studio revenue varies by tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot). Revenue monitoring requires tracking both top-line metrics (MRR, ARR) and operational indicators (churn, refunds, tier movements) to detect problems early. -## What it monitors +This skill provides **autonomous revenue monitoring** with alerting on revenue drops, churn rate spikes, refund increases, and subscription changes. It monitors all segments and identifies which tiers or plan types drive changes. 
-- **Revenue metrics**: MRR, ARR, daily revenue -- **Subscription metrics**: New subscriptions, cancellations, churns, renewals -- **Refunds**: Refund rate, refund amount -- **Tier changes**: Upgrades, downgrades -- **Enterprise contracts**: Contract value, renewals +**Problem solved**: Detect revenue problems, churn risk, and subscription health issues before they compound — with segment-level root cause analysis. -## Steps +## 2. Requirements (What?) -1. **Gather requirements from user:** - - Which revenue metric to monitor - - Alert threshold (e.g., "drop > 10%", "< $X per day") - - Time window (daily, weekly, monthly) - - Notification channel (Slack, email) +Monitor these outcomes autonomously: -2. **Read shared files:** - - `shared/bq-schema.md` — Subscription tables (ltxstudio_user_tiers_dates, etc.) - - `shared/metric-standards.md` — Revenue metric definitions +- [ ] Revenue drops by segment (tier, plan type) +- [ ] MRR and ARR trend changes +- [ ] Churn rate increases above baseline +- [ ] Refund rate spikes +- [ ] New subscription volume drops +- [ ] Tier movements (upgrades vs downgrades) +- [ ] Enterprise contract renewals approaching (< 90 days) +- [ ] Alerts fire only when changes exceed thresholds +- [ ] Root cause identifies which tiers/plans drove changes -3. **Write monitoring SQL:** - - Query current metric value - - Compare against historical baseline or threshold - - Flag anomalies or breaches +## 3. Progress Tracker -4. **Present to user:** - - Show SQL query - - Show example alert format - - Confirm threshold values +* [ ] Read shared knowledge (schema, metrics, business context) +* [ ] Write monitoring SQL with baseline comparisons +* [ ] Execute query for target date range +* [ ] Analyze results by segment (tier, plan type) +* [ ] Identify root cause (which segments drove changes) +* [ ] Present findings with severity and recommendations +* [ ] Set up ongoing alerts (if requested) -5. 
**Set up alert** (manual for now): - - Document SQL in Hex or BigQuery scheduled query - - Configure Slack webhook or notification +## 4. Implementation Plan -## Reference files +### Phase 1: Read Alert Thresholds -| File | Read when | -|------|-----------| -| `shared/bq-schema.md` | Writing SQL for subscription/revenue tables | -| `shared/metric-standards.md` | Defining revenue metrics | +**Generic thresholds** (data-driven analysis pending): +- Revenue drop > 15% DoD or > 10% WoW +- Churn rate > 5% or increase > 2x baseline +- Refund rate > 3% or spike > 50% DoD +- New subscriptions drop > 20% WoW +- Enterprise renewals < 90 days out -## Rules +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on 60-day analysis (similar to usage/GPU cost monitoring). -- DO use LTX Studio subscription tables from bq-schema.md -- DO exclude is_lt_team unless explicitly requested -- DO validate thresholds with user before setting alerts -- DO NOT hardcode dates — use rolling windows -- DO account for timezone differences in daily revenue calculations +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables (ltxstudio_user_tiers_dates, ltxstudio_subscriptions, etc.), user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV) +- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-driven revenue) + +**Data nuances**: +- Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type` +- Exclude `is_lt_team` unless explicitly requested +- Revenue calculations vary by plan type (self-serve vs contract/pilot) + +### Phase 3: Write Monitoring SQL + +✅ **PREFERRED: Monitor all segments with baseline comparisons** + 
+```sql +WITH revenue_metrics AS ( + SELECT + dt, + griffin_tier_name, + plan_type, + COUNT(DISTINCT lt_id) AS active_subscribers, + SUM(CASE WHEN subscription_start_date = dt THEN 1 ELSE 0 END) AS new_subs, + SUM(CASE WHEN subscription_end_date = dt THEN 1 ELSE 0 END) AS churned_subs, + SUM(mrr_amount) AS total_mrr, + SUM(arr_amount) AS total_arr + FROM subscription_table + WHERE dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + AND dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + AND is_lt_team IS FALSE + GROUP BY dt, griffin_tier_name, plan_type +), +metrics_with_baseline AS ( + SELECT + *, + AVG(total_mrr) OVER ( + PARTITION BY griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS mrr_baseline_7d, + AVG(churned_subs) OVER ( + PARTITION BY griffin_tier_name, plan_type + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS churn_baseline_7d + FROM revenue_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Segmentation**: By tier (Free/Lite/Standard/Pro/Enterprise) and plan type (self-serve/contract/pilot) +- **Baseline**: 7-day rolling average for comparison +- **Time window**: Yesterday only + +### Phase 4: Execute Query + +Run query using: +```bash +bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " + +" +``` + +### Phase 5: Analyze Results + +**For revenue trends**: +- Compare current period vs baseline (7-day avg, prior week, prior month) +- Calculate % change and flag significant shifts (>15% DoD, >10% WoW) +- Identify which tiers/plans drove changes + +**For churn analysis**: +- Calculate churn rate = churned_subs / active_subscribers +- Compare to baseline and flag if > 5% or increase > 2x baseline +- Identify which tiers have highest churn + +**For subscription health**: +- Track new subscription volume by tier +- Monitor upgrade vs downgrade ratios +- Flag enterprise renewal dates < 90 days out + +### 
Phase 6: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "MRR dropped 18% yesterday") +- **Root cause**: Which segment drove the change (e.g., "Standard tier churned 12 users") +- **Breakdown**: Metrics by tier and plan type +- **Recommendation**: Action to take (investigate churn reason, contact at-risk accounts) + +**Alert format**: +``` +⚠️ REVENUE ALERT: + • Metric: MRR Drop + Current: $X | Baseline: $Y + Change: -Z% + Segment: Standard tier, self-serve + +Recommendation: Investigate recent Standard tier churns +``` + +### Phase 7: Set Up Alert (Optional) + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification threshold +4. Route alerts to Revenue/Growth team + +## 5. Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, user types, business model, enterprise context +- **`shared/bq-schema.md`** — Subscription tables, user segmentation queries +- **`shared/metric-standards.md`** — Revenue metric definitions (MRR, ARR, churn, LTV, retention) +- **`shared/event-registry.yaml`** — Feature events for feature-driven revenue analysis + +### Data Sources +Tables: `ltxstudio_user_tiers_dates`, `ltxstudio_subscriptions` +- Key columns: `lt_id`, `griffin_tier_name`, `subscription_start_date`, `subscription_end_date`, `plan_type`, `mrr_amount`, `arr_amount` +- Filter: `is_lt_team IS FALSE` for customer revenue + +### Tiers +- Free (no revenue) +- Lite (lowest paid tier) +- Standard (mid-tier) +- Pro (high-tier) +- Enterprise (contract/pilot) + +### Plan Types +- Self-serve (automated subscription) +- Contract (enterprise contract) +- Pilot (enterprise pilot) + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** include is_lt_team users unless explicitly requested +- **DO NOT** hardcode dates — use rolling windows +- **DO NOT** use absolute thresholds — compare to baseline +- **DO NOT** mix plan types without proper segmentation +- **DO NOT** ignore timezone differences in daily revenue calculations + +### DO + +- **DO** use LTX Studio subscription tables from bq-schema.md +- **DO** exclude is_lt_team unless explicitly requested +- **DO** validate thresholds with user before setting alerts +- **DO** use rolling windows (7-day, 30-day baselines) +- **DO** account for timezone differences in daily revenue calculations +- **DO** segment by tier AND plan type for root cause analysis +- **DO** compare against baseline (7-day avg, prior period) for trends +- **DO** calculate churn rate = churned / active subscribers +- **DO** flag MRR drops > 15% DoD or > 10% WoW +- **DO** flag churn rate > 5% or increase > 2x baseline +- **DO** flag refund rate > 3% or spike > 50% DoD +- **DO** flag new subscription drops > 20% WoW +- **DO** track enterprise renewal dates < 90 days out +- **DO** include segment breakdown in all alerts +- **DO** validate unusual patterns with Revenue/Growth team before alerting + +### Completion Criteria + +✅ All revenue metrics monitored (MRR, ARR, churn, refunds, new subs) +✅ Alerts fire with thresholds (generic pending production analysis) +✅ Segment-level root cause identified +✅ Findings presented with recommendations +✅ Timezone handling applied to daily revenue +✅ Enterprise renewals tracked From ecb38f9c14ee397dc24788e77e1d61589ef70757 Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 20:36:14 +0200 Subject: [PATCH 17/20] feat: restructure enterprise monitor to 6-part Agent Skills format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major updates: - Restructure to 6-part Agent Skills format (Overview → Constraints) - Emphasize EXACT segmentation 
CTE usage from bq-schema.md (no modifications) - Add org-specific baseline comparisons (each org vs its 30-day average) - Add McCann split logic (McCann_NY vs McCann_Paris) - Add Contract vs Pilot account separation - Add power user tracking within orgs (top 20% by token usage) - Add quota monitoring and underutilization detection Enterprise-specific monitoring: - DAU/WAU/MAU drops per org (> 30% vs org baseline) - Token consumption vs contracted quota (< 50% utilization) - User activation (% of seats active) - Video/image generation engagement per org - Power user drops (> 20% decline) - Zero activity for 7+ consecutive days Benefits: - Org-specific baselines (not generic thresholds) - Churn risk detection - Customer success actionable alerts - Exact segmentation compliance --- agents/monitoring/enterprise/SKILL.md | 310 +++++++++++++++++++++----- 1 file changed, 249 insertions(+), 61 deletions(-) diff --git a/agents/monitoring/enterprise/SKILL.md b/agents/monitoring/enterprise/SKILL.md index 2cf15ce..39c2403 100644 --- a/agents/monitoring/enterprise/SKILL.md +++ b/agents/monitoring/enterprise/SKILL.md @@ -1,67 +1,255 @@ --- name: enterprise-monitor -description: Monitors enterprise account health, usage, and contract compliance. Alerts on low engagement, quota breaches, or churn risk. -tags: [monitoring, enterprise, accounts, contracts] +description: "Monitor enterprise account health, usage, and contract compliance. Detects low engagement, quota breaches, and churn risk. Use when: (1) detecting enterprise account usage drops, (2) alerting on inactive accounts, (3) investigating power user engagement changes." 
+tags: [monitoring, enterprise, accounts, contracts, churn-risk] --- # Enterprise Monitor -## When to use - -- "Monitor enterprise account usage" -- "Monitor enterprise churn risk" -- "Alert when enterprise account is inactive" - -## What it monitors - -- **Account usage**: DAU, WAU, MAU per enterprise org -- **Token consumption**: Usage vs contracted quota, historical consumption trends -- **User activation**: % of seats active -- **Engagement**: Video generations, image generations, downloads per org -- **Churn signals**: Declining usage, inactive users - -## Steps - -1. **Gather requirements from user:** - - Which enterprise org(s) to monitor (or all) - - Alert threshold based on historical usage of each account (e.g., "usage drops > 30% vs their baseline", "MAU below their 30-day average", "< 50% of contracted quota") - - Time window (weekly, monthly) - - Notification channel (Slack, email, Linear issue) - -2. **Read shared files:** - - `shared/product-context.md` — LTX products, enterprise business model, user types - - `shared/bq-schema.md` — Enterprise user segmentation queries - - `shared/metric-standards.md` — Enterprise metrics, quota tracking - - `shared/event-registry.yaml` — Feature events (if analyzing engagement) - - `shared/gpu-cost-query-templates.md` — GPU cost queries (if analyzing infrastructure costs) - - `shared/gpu-cost-analysis-patterns.md` — Cost analysis patterns (if analyzing infrastructure costs) - -3. **Identify enterprise users:** - - Use enterprise segmentation CTE from bq-schema.md (lines 441-461) - - Apply McCann split (McCann_NY vs McCann_Paris) - - Exclude Lightricks and Popular Pays - -4. 
**Write monitoring SQL:** - - Query org-level usage metrics - - Set baseline for each org based on their historical usage (e.g., 30-day average, 90-day trend) - - Compare current usage against org-specific baseline or contracted quota - - Flag orgs below threshold or showing decline - - Flag meaningful drops for power users (users with top usage within each org) - -5. **Present to user:** - - Show SQL query - - Show example alert format with org name and metrics - - Confirm threshold values and alert logic - -6. **Set up alert** (manual for now): - - Document SQL - - Configure notification to customer success team - -## Rules - -- DO use EXACT enterprise segmentation CTE from bq-schema.md without modification -- DO apply McCann split (McCann_NY vs McCann_Paris) -- DO exclude Lightricks and Popular Pays from enterprise orgs -- DO break out pilot vs contracted accounts -- DO NOT alert on free/self-serve users — this agent is enterprise-only -- DO include org name in alert for easy customer success follow-up +## 1. Overview (Why?) + +Enterprise accounts (contract and pilot) represent high-value customers with negotiated quotas and specific engagement patterns. Unlike self-serve users, enterprise usage should be monitored per-organization with org-specific baselines, since each has different team sizes, use cases, and contract terms. + +This skill provides **autonomous enterprise account monitoring** that detects declining usage, underutilization of quotas, inactive periods, and power user drops — all of which signal churn risk or engagement problems that require customer success intervention. + +**Problem solved**: Identify enterprise churn risk early through usage signals — before contracts end or accounts go completely inactive — with org-level root cause analysis. + +## 2. Requirements (What?) 
+ +Monitor these outcomes autonomously: + +- [ ] DAU/WAU/MAU drops per enterprise org (> 30% vs org baseline) +- [ ] Token consumption vs contracted quota (underutilization < 50%) +- [ ] User activation (% of seats active) +- [ ] Video/image generation engagement per org +- [ ] Power user drops within org (> 20% decline) +- [ ] Zero activity for 7+ consecutive days +- [ ] Alerts include org name for customer success follow-up +- [ ] Pilot vs contract accounts separated +- [ ] McCann split applied (McCann_NY vs McCann_Paris) +- [ ] Lightricks and Popular Pays excluded + +## 3. Progress Tracker + +* [ ] Read shared knowledge (enterprise segmentation, schema, metrics) +* [ ] Identify enterprise users with segmentation CTE +* [ ] Write monitoring SQL with org-specific baselines +* [ ] Execute query for target date range +* [ ] Analyze results by org and account type +* [ ] Identify power user drops within each org +* [ ] Present findings with org-level details +* [ ] Route alerts to customer success team + +## 4. Implementation Plan + +### Phase 1: Read Alert Thresholds + +**Generic thresholds** (data-driven analysis pending): +- DoD/WoW usage drops > 30% vs org's baseline +- MAU below org's 30-day average +- Token consumption < 50% of contracted quota (underutilization) +- Power user drops > 20% within org +- Zero activity for 7+ consecutive days + +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on org-specific analysis. 
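The generic rules above reduce to a few comparisons against each org's own baseline. A minimal sketch, assuming hypothetical field names (`dau_baseline_30d`, `tokens_30d`, `contracted_quota`, `days_inactive` are illustrative, not a confirmed production schema):

```python
# Hedged sketch of the generic enterprise thresholds listed above.
def evaluate_org(m: dict) -> list:
    alerts = []
    baseline = m.get("dau_baseline_30d") or 0
    if baseline > 0:
        # usage drop > 30% vs the org's own 30-day average
        drop_pct = (baseline - m["dau"]) / baseline * 100
        if drop_pct > 30:
            alerts.append(f"usage drop {drop_pct:.0f}% vs 30-day baseline")
    quota = m.get("contracted_quota") or 0
    if quota > 0 and m.get("tokens_30d", 0) < 0.5 * quota:
        # underutilization: < 50% of contracted quota
        alerts.append("token consumption < 50% of contracted quota")
    if m.get("days_inactive", 0) >= 7:
        alerts.append("zero activity for 7+ consecutive days")
    return alerts
```

For example, an org at 2 DAU against a 10-DAU baseline that has burned only 100 of 1,000 contracted tokens trips both the usage-drop and underutilization alerts.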
+ +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events (if analyzing engagement) + +**Data nuances**: +- Use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- Apply McCann split: `McCann_NY` vs `McCann_Paris` +- Exclude: `Lightricks`, `Popular Pays`, `None` +- Contract accounts: Indegene, HearWell_BeWell, Novig, Cylndr Studios, Miroma, Deriv, McCann_Paris +- Pilot accounts: All other enterprise orgs + +### Phase 3: Identify Enterprise Users + +✅ **PREFERRED: Use exact segmentation CTE from bq-schema.md** + +```sql +WITH ent_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) = 'McCann_NY' THEN 'McCann_NY' + WHEN COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) LIKE '%McCann%' THEN 'McCann_Paris' + ELSE COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) + END AS org + FROM `ltx-dwh-prod-processed.web.ltxstudio_users` + WHERE is_enterprise_user + AND current_customer_plan_type IN ('contract', 'pilot') + AND COALESCE(enterprise_name_at_purchase, current_enterprise_name, organization_name) NOT IN ('Lightricks', 'Popular Pays', 'None') +), +enterprise_users AS ( + SELECT DISTINCT + lt_id, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN org + ELSE CONCAT(org, ' Pilot') + END AS org, + CASE + WHEN org IN ('Indegene', 'HearWell_BeWell', 'Novig', 'Cylndr Studios', 'Miroma', 'Deriv', 'McCann_Paris') + THEN 'Contract' + ELSE 'Pilot' + END AS account_type + FROM ent_users + WHERE org NOT IN ('Lightricks', 'Popular 
Pays', 'None') +) +``` + +### Phase 4: Write Monitoring SQL + +✅ **PREFERRED: Monitor all enterprise orgs with org-specific baselines** + +```sql +WITH org_metrics AS ( + SELECT + a.dt, + e.org, + e.account_type, + COUNT(DISTINCT a.lt_id) AS dau, + SUM(a.num_tokens_consumed) AS tokens, + SUM(a.num_generate_image) AS image_gens, + SUM(a.num_generate_video) AS video_gens + FROM `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` a + JOIN enterprise_users e ON a.lt_id = e.lt_id + WHERE a.dt >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) + AND a.dt <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) + GROUP BY a.dt, e.org, e.account_type +), +metrics_with_baseline AS ( + SELECT + *, + AVG(dau) OVER ( + PARTITION BY org + ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING + ) AS dau_baseline_30d, + AVG(tokens) OVER ( + PARTITION BY org + ORDER BY dt ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING + ) AS tokens_baseline_30d + FROM org_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Org-specific baselines**: Each org compared to its own 30-day average +- **McCann split**: Separate McCann_NY and McCann_Paris +- **Account type**: Contract vs Pilot + +### Phase 5: Analyze Results + +**For usage trends**: +- Compare org's current usage vs their 30-day baseline +- Flag orgs with DoD/WoW drops > 30% vs baseline +- Flag orgs with MAU below their 30-day average + +**For quota analysis**: +- Compare token consumption vs contracted quota +- Flag underutilization (< 50% of quota) +- Identify orgs approaching or exceeding quota + +**For engagement**: +- Track power users within each org (top 20% by token usage) +- Flag power user drops > 20% within org +- Flag zero activity for 7+ consecutive days + +### Phase 6: Present Findings + +Format results with: +- **Summary**: Key finding (e.g., "Novig enterprise account inactive for 7 days") +- **Org details**: Org name, account type (Contract/Pilot), baseline usage +- 
**Metrics**: DAU, tokens, generations vs baseline +- **Recommendation**: Customer success action (reach out, investigate, adjust quota) + +**Alert format**: +``` +⚠️ ENTERPRISE ALERT: + • Org: Novig (Contract) + Metric: Token consumption + Current: 0 tokens | Baseline: 27K/day + Drop: -100% | Zero activity for 7 days + +Recommendation: Contact account manager immediately +``` + +### Phase 7: Route Alert + +For ongoing monitoring: +1. Save SQL query +2. Set up in BigQuery scheduled query or Hex Thread +3. Configure notification for customer success team +4. Include org name and account manager contact in alert + +## 5. Context & References + +### Shared Knowledge +- **`shared/product-context.md`** — LTX products, enterprise business model, user types +- **`shared/bq-schema.md`** — Enterprise user segmentation queries (lines 441-461) +- **`shared/metric-standards.md`** — Enterprise metrics, quota tracking +- **`shared/event-registry.yaml`** — Feature events for engagement analysis + +### Data Sources +- **Users table**: `ltx-dwh-prod-processed.web.ltxstudio_users` +- **Usage table**: `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date` +- Key columns: `lt_id`, `enterprise_name_at_purchase`, `current_enterprise_name`, `organization_name`, `current_customer_plan_type` + +### Enterprise Orgs + +**Contract accounts**: +- Indegene +- HearWell_BeWell +- Novig +- Cylndr Studios +- Miroma +- Deriv +- McCann_Paris + +**Pilot accounts**: All other enterprise orgs (suffixed with " Pilot") + +**Excluded**: Lightricks, Popular Pays, None + +## 6. 
Constraints & Done + +### DO NOT + +- **DO NOT** modify enterprise segmentation CTE — use exact version from bq-schema.md +- **DO NOT** alert on free/self-serve users — this agent is enterprise-only +- **DO NOT** combine McCann_NY and McCann_Paris — keep them separate +- **DO NOT** include Lightricks or Popular Pays in enterprise monitoring +- **DO NOT** use generic baselines — each org compared to its own historical usage + +### DO + +- **DO** use EXACT enterprise segmentation CTE from bq-schema.md (lines 441-461) without modification +- **DO** apply McCann split (McCann_NY vs McCann_Paris) +- **DO** exclude Lightricks, Popular Pays, and None from enterprise orgs +- **DO** break out pilot vs contracted accounts +- **DO** include org name in alert for customer success follow-up +- **DO** use org-specific baselines (each org's 30-day average) +- **DO** flag DoD/WoW usage drops > 30% vs org baseline +- **DO** flag MAU below org's 30-day average +- **DO** flag token consumption < 50% of contracted quota +- **DO** flag power user drops > 20% within org +- **DO** flag zero activity for 7+ consecutive days +- **DO** route alerts to customer success team with org details +- **DO** validate unusual patterns with customer success before alerting + +### Completion Criteria + +✅ All enterprise orgs monitored (Contract and Pilot) +✅ Org-specific baselines applied +✅ McCann split applied (NY vs Paris) +✅ Lightricks and Popular Pays excluded +✅ Alerts include org name and account type +✅ Usage drops, quota issues, and engagement drops detected +✅ Findings routed to customer success team From 618140cf898150ec622583ca93bbc5d61a10c711 Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Tue, 10 Mar 2026 20:42:51 +0200 Subject: [PATCH 18/20] feat: restructure API runtime monitor to 6-part Agent Skills format MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Major updates: - Restructure to 6-part Agent Skills format (Overview → Constraints) - Add 
all 6 shared knowledge files to Phase 2 references - Add detailed data source explanation (ltxvapi tables and GPU cost table) - Add percentile calculations (P50/P95/P99 latency) - Add error type separation (infrastructure vs applicative) - Add baseline comparisons (7-day rolling average) - Add alert routing by error type (Engineering vs API/Product team) Performance monitoring coverage: - P95 latency spikes (> 2x baseline or > 60s) - Error rate increases (> 5% or DoD > 50%) - Throughput drops (> 30% DoD/WoW) - Queue time issues (> 50% of processing time) - Infrastructure errors (> 10 requests/hour) Benefits: - Comprehensive API performance monitoring - Endpoint/model/org breakdown for root cause - Error type routing to appropriate teams - Clear structure (6-part spec) - Baseline-driven alerting --- agents/monitoring/api-runtime/SKILL.md | 335 +++++++++++++++++++------ 1 file changed, 257 insertions(+), 78 deletions(-) diff --git a/agents/monitoring/api-runtime/SKILL.md b/agents/monitoring/api-runtime/SKILL.md index 84ece4a..f3d579a 100644 --- a/agents/monitoring/api-runtime/SKILL.md +++ b/agents/monitoring/api-runtime/SKILL.md @@ -1,84 +1,263 @@ --- name: api-runtime-monitor -description: Monitors LTX API runtime performance, latency, error rates, and throughput. Alerts on performance degradation or errors. -tags: [monitoring, api, performance, latency, errors] +description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org." 
+tags: [monitoring, api, performance, latency, errors, throughput] --- # API Runtime Monitor -## When to use - -- "Monitor API latency" -- "Alert on API errors" -- "Track API throughput" -- "Monitor inference time" -- "Alert on API performance degradation" - -## What it monitors - -- **Latency**: Request processing time, inference time, queue time -- **Error rates**: % of failed requests, error types, error sources -- **Throughput**: Requests per hour/day, by endpoint/model -- **Performance**: P50/P95/P99 latency, success rate -- **Utilization**: API usage by org, model, resolution - -## Steps - -1. **Gather requirements from user:** - - Which performance metric to monitor (latency, errors, throughput) - - Alert threshold (e.g., "P95 latency > 30s", "error rate > 5%", "throughput drops > 20%") - - Time window (hourly, daily) - - Scope (all requests, specific endpoint, specific org) - - Notification channel - -2. **Read shared files:** - - `shared/bq-schema.md` — GPU cost table (has API runtime data) and ltxvapi tables - - `shared/metric-standards.md` — Performance metric patterns - -3. **Identify data source:** - - For LTX API: Use `ltxvapi_api_requests_with_be_costs` or `gpu_request_attribution_and_cost` - - **Key columns explained:** - - `request_processing_time_ms`: Total time from request submission to completion - - `request_inference_time_ms`: GPU processing time (actual model inference) - - `request_queue_time_ms`: Time waiting in queue before processing starts - - `result`: Request outcome (success, failed, timeout, etc.) - - `error_type`: Classification of errors (infrastructure vs applicative) - - `endpoint`: API endpoint called (e.g., /generate, /upscale) - - `model_type`: Model used (ltxv2, retake, etc.) - - `org_name`: Customer organization making the request - -4. 
**Write monitoring SQL:** - - Query relevant performance metric - - Calculate percentiles (P50, P95, P99) for latency - - Calculate error rate (failed / total requests) - - Compare against baseline - -5. **Present to user:** - - Show SQL query - - Show example alert format with performance breakdown - - Confirm threshold values - -6. **Set up alert** (manual for now): - - Document SQL - - Configure notification to engineering team - -## Reference files - -| File | Read when | -|------|-----------| -| `shared/product-context.md` | LTX products and business context | -| `shared/bq-schema.md` | API tables and GPU cost table schema | -| `shared/metric-standards.md` | Performance metric patterns | -| `shared/event-registry.yaml` | Feature events (if analyzing event-driven metrics) | -| `shared/gpu-cost-query-templates.md` | GPU cost queries (if analyzing cost-related performance) | -| `shared/gpu-cost-analysis-patterns.md` | Cost analysis patterns (if analyzing cost-related performance) | - -## Rules - -- DO use APPROX_QUANTILES for percentile calculations (P50, P95, P99) -- DO separate errors by error_source (infrastructure vs applicative) -- DO filter by result = 'success' for success rate calculations -- DO break down by endpoint, model, and resolution for detailed analysis -- DO compare current performance against historical baseline -- DO alert engineering team for infrastructure errors, product team for applicative errors -- DO partition by dt for performance +## 1. Overview (Why?) + +LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention. 
+ +This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis. + +**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification. + +## 2. Requirements (What?) + +Monitor these outcomes autonomously: + +- [ ] P95 latency spikes (> 2x baseline or > 60s) +- [ ] Error rate increases (> 5% or DoD increase > 50%) +- [ ] Throughput drops (> 30% DoD/WoW) +- [ ] Queue time excessive (> 50% of processing time) +- [ ] Infrastructure errors (> 10 requests/hour) +- [ ] Alerts include breakdown by endpoint, model, organization +- [ ] Results formatted by priority (infrastructure vs applicative errors) +- [ ] Findings routed to appropriate team (API team or Engineering) + +## 3. Progress Tracker + +* [ ] Read shared knowledge (schema, metrics, performance patterns) +* [ ] Identify data source (ltxvapi tables or GPU cost table) +* [ ] Write monitoring SQL with percentile calculations +* [ ] Execute query for target date range +* [ ] Analyze results by endpoint, model, organization +* [ ] Separate infrastructure vs applicative errors +* [ ] Present findings with performance breakdown +* [ ] Route alerts to appropriate team + +## 4. Implementation Plan + +### Phase 1: Read Alert Thresholds + +**Generic thresholds** (data-driven analysis pending): +- P95 latency > 2x baseline or > 60s +- Error rate > 5% or DoD increase > 50% +- Throughput drops > 30% DoD/WoW +- Queue time > 50% of processing time +- Infrastructure errors > 10 requests/hour + +[!IMPORTANT] These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring). 
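The latency and error-rate rules above can be expressed as a small check. A minimal sketch: the threshold values are copied from the generic list in Phase 1, but the function and its arguments are illustrative, not an existing module:

```python
# Hedged sketch of the generic API performance thresholds listed above.
def check_api_health(p95_ms, p95_baseline_ms, error_rate_pct, error_rate_baseline_pct):
    alerts = []
    # P95 latency > 2x baseline, or > 60s absolute
    if p95_ms > 2 * p95_baseline_ms or p95_ms > 60_000:
        alerts.append("p95 latency spike")
    # Error rate > 5% absolute, or > 50% increase over baseline
    if error_rate_pct > 5 or (
        error_rate_baseline_pct > 0
        and (error_rate_pct - error_rate_baseline_pct) / error_rate_baseline_pct > 0.5
    ):
        alerts.append("error rate increase")
    return alerts
```

The example alert later in this skill (85s P95 vs a 30s baseline, 8.2% errors vs 2.1%) trips both rules.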
+ +### Phase 2: Read Shared Knowledge + +Before writing SQL, read: +- **`shared/product-context.md`** — LTX products, user types, business model, API context +- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema +- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput) +- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics) +- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance) +- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance) + +**Data nuances**: +- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` +- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` +- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance +- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name` + +### Phase 3: Identify Data Source + +✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics** + +**Key columns**: +- `request_processing_time_ms`: Total time from request submission to completion +- `request_inference_time_ms`: GPU processing time (actual model inference) +- `request_queue_time_ms`: Time waiting in queue before processing starts +- `result`: Request outcome (success, failed, timeout, etc.) +- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative) +- `endpoint`: API endpoint called (e.g., /generate, /upscale) +- `model_type`: Model used (ltxv2, retake, etc.) 
+- `org_name`: Customer organization making the request + +[!IMPORTANT] Verify column name: `error_type` vs `error_source` in actual schema + +### Phase 4: Write Monitoring SQL + +✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons** + +```sql +WITH api_metrics AS ( + SELECT + DATE(action_ts) AS dt, + endpoint, + model_type, + org_name, + COUNT(*) AS total_requests, + COUNTIF(result = 'success') AS successful_requests, + COUNTIF(result != 'success') AS failed_requests, + SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct, + APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms, + APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms, + APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms, + AVG(request_queue_time_ms) AS avg_queue_time_ms, + AVG(request_inference_time_ms) AS avg_inference_time_ms + FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs` + WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)) + AND action_ts < TIMESTAMP(CURRENT_DATE()) + GROUP BY dt, endpoint, model_type, org_name +), +metrics_with_baseline AS ( + SELECT + *, + AVG(p95_latency_ms) OVER ( + PARTITION BY endpoint, model_type, org_name + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS p95_latency_baseline_7d, + AVG(error_rate_pct) OVER ( + PARTITION BY endpoint, model_type, org_name + ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING + ) AS error_rate_baseline_7d + FROM api_metrics +) +SELECT * FROM metrics_with_baseline +WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY); +``` + +**Key patterns**: +- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99 +- **Error rate**: `SAFE_DIVIDE(failed, total) * 100` +- **Baseline**: 7-day rolling average by endpoint, model, and org (partition windows must include `org_name` to match the GROUP BY grain, otherwise rows from different orgs mix into one baseline) +- **Time window**: Trailing 8 complete days (7-day baseline plus a full yesterday; shorter than usage monitoring due to higher-frequency data) + +### Phase 5: Execute Query + +Run query using:
+```bash
+bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
+
+"
+```
+
+### Phase 6: Analyze Results
+
+**For latency trends**:
+- Compare P95 latency vs baseline (7-day avg)
+- Flag if P95 > 2x baseline or > 60s absolute
+- Identify which endpoint/model/org drove spikes
+
+**For error rate analysis**:
+- Compare error rate vs baseline
+- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
+- Flag if error rate > 5% or DoD increase > 50%
+
+**For throughput**:
+- Track requests per hour/day by endpoint
+- Flag throughput drops > 30% DoD/WoW
+- Identify which endpoints lost traffic
+
+**For queue analysis**:
+- Calculate queue time as % of total processing time
+- Flag if queue time > 50% of processing time
+- Indicates capacity/scaling issues
+
+### Phase 7: Present Findings
+
+Format results with:
+- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
+- **Root cause**: Which endpoint/model/org drove the issue
+- **Breakdown**: Performance metrics by dimension
+- **Error classification**: Infrastructure vs applicative errors
+- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)
+
+**Alert format**:
+```
+⚠️ API PERFORMANCE ALERT:
+  • Endpoint: /v1/text-to-video
+    Model: ltxv2
+    Metric: P95 Latency
+    Current: 85s | Baseline: 30s
+    Change: +183%
+
+    Error rate: 8.2% (baseline: 2.1%)
+    Error type: Infrastructure
+
+Recommendation: Alert Engineering team for infrastructure issue
+```
+
+### Phase 8: Route Alert
+
+For ongoing monitoring:
+1. Save SQL query
+2. Set up in BigQuery scheduled query or Hex Thread
+3. Configure notification by error type:
+   - Infrastructure errors → Engineering team
+   - Applicative errors → API/Product team
+4. Include endpoint, model, and org details in alert
+
+## 5. Context & References
+
+### Shared Knowledge
+- **`shared/product-context.md`** — LTX products and API context
+- **`shared/bq-schema.md`** — API tables and GPU cost table schema
+- **`shared/metric-standards.md`** — Performance metric patterns
+- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
+- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
+- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns
+
+### Data Sources
+
+**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
+- Partitioned by `action_ts` (TIMESTAMP)
+- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`
+
+**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
+- Contains API runtime data but is not the primary source for performance metrics
+
+### Endpoints
+Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`
+
+### Models
+Common models: `ltxv2`, `retake`, etc.
+
+## 6. Constraints & Done
+
+### DO NOT
+
+- **DO NOT** use absolute thresholds without baseline comparison
+- **DO NOT** mix infrastructure and applicative errors in the same alert
+- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
+- **DO NOT** forget to separate errors by error type/source
+
+[!IMPORTANT] Verify column name in schema: `error_type` vs `error_source`
+
+### DO
+
+- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
+- **DO** separate errors by error_source (infrastructure vs applicative)
+- **DO** filter by `result = 'success'` for success rate calculations
+- **DO** break down by endpoint, model, and organization for detailed analysis
+- **DO** compare current performance against historical baseline (7-day rolling avg)
+- **DO** alert engineering team for infrastructure errors
+- **DO** alert product/API team for applicative errors
+- **DO** partition on `action_ts` or `dt` for performance
+- **DO** use `ltx-dwh-explore` as execution project
+- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
+- **DO** flag P95 latency > 2x baseline or > 60s
+- **DO** flag error rate > 5% or DoD increase > 50%
+- **DO** flag throughput drops > 30% DoD/WoW
+- **DO** flag queue time > 50% of processing time
+- **DO** flag infrastructure errors > 10 requests/hour
+- **DO** include endpoint, model, org details in all alerts
+- **DO** validate unusual patterns with API/Engineering team before alerting
+
+### Completion Criteria
+
+✅ All performance metrics monitored (latency, errors, throughput, queue time)
+✅ Alerts fire with defined thresholds (generic values pending production calibration)
+✅ Endpoint/model/org breakdown provided
+✅ Errors separated by type (infrastructure vs applicative)
+✅ Findings routed to appropriate team
+✅ Partition filtering applied for performance
+✅ Column name verified (`error_type` vs `error_source`)

From a77114e0b6c1d2052a7e9aa6b9933a486eba138f Mon Sep 17 00:00:00 2001
From: Daniel
Beer Date: Wed, 11 Mar 2026 11:47:12 +0200 Subject: [PATCH 19/20] refactor: streamline BE cost monitoring skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Remove duplications (data nuances, analysis patterns, routing) - Consolidate 7 phases into 3 clean phases - Apply statistical 2σ method like usage monitor - Reduce from 200 to 111 lines (-44.5%) - Update progress tracker to match 3-phase structure - Clean up DO rules (19 → 7 essential rules) Co-Authored-By: Claude Sonnet 4.5 --- agents/monitoring/be-cost/SKILL.md | 198 ++++++----------------------- 1 file changed, 40 insertions(+), 158 deletions(-) diff --git a/agents/monitoring/be-cost/SKILL.md b/agents/monitoring/be-cost/SKILL.md index fb5dd29..88f486e 100644 --- a/agents/monitoring/be-cost/SKILL.md +++ b/agents/monitoring/be-cost/SKILL.md @@ -11,9 +11,7 @@ compatibility: ## 1. Overview (Why?) -LTX GPU infrastructure spending is dominated by idle costs (~72% of total), with actual inference representing only ~27%. Cost patterns vary significantly by vertical (API vs Studio) and day-of-week. Generic alerting thresholds miss true anomalies while flagging normal variance. - -This skill provides **autonomous cost monitoring** with data-driven thresholds derived from 60-day statistical analysis. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. +This skill provides **autonomous cost monitoring** using statistical anomaly detection. It detects genuine cost problems (autoscaler issues, efficiency regression, volume surges) while suppressing noise from expected patterns. **Problem solved**: Detect cost spikes, utilization degradation, and efficiency drift that indicate infrastructure problems or wasteful spending — without manual threshold tuning or false alarms. @@ -36,194 +34,78 @@ Monitor these outcomes autonomously: ## 3. 
Progress Tracker -* [ ] Read production thresholds and shared knowledge -* [ ] Select appropriate query template(s) -* [ ] Execute query for 3 days ago -* [ ] Analyze results by tier priority +* [ ] Read shared knowledge (schema, query templates, analysis patterns) +* [ ] Execute monitoring query for 3 days ago +* [ ] Calculate z-scores and classify severity (NOTICE/WARNING/CRITICAL) * [ ] Identify root cause (endpoint, model, org, process) -* [ ] Present findings with cost breakdown -* [ ] Route alerts to appropriate team (API/Studio/Engineering) +* [ ] Present findings with statistical details and recommendations ## 4. Implementation Plan -### Phase 1: Read Production Thresholds - -Production thresholds from `/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`: - -#### Tier 1 — High Priority (daily monitoring) -| Alert | Threshold | Signal | -|-------|-----------|--------| -| **Idle cost spike** | > $15,600/day | Autoscaler over-provisioning or traffic drop without scaledown | -| **Inference cost spike** | > $5,743/day | Volume surge, costlier model, or heavy new customer | -| **Idle-to-inference ratio** | > 4:1 | GPU utilization degrading (baseline ~2.7:1) | - -#### Tier 2 — Medium Priority (daily monitoring) -| Alert | Threshold | Signal | -|-------|-----------|--------| -| **Failure rate spike** | > 20.4% overall | Wasted compute + service quality issue | -| **Cost-per-request drift** | API > $0.70, Studio > $0.46 | Model regression or duration/resolution creep | -| **DoD cost jump** | API > 30%, Studio > 15% | Early warning before absolute thresholds breach | - -#### Tier 3 — Low Priority (weekly review) -| Alert | Threshold | Signal | -|-------|-----------|--------| -| **Volume drop** | < 18,936 requests/day | Possible outage or upstream feed issue | -| **Overhead spike** | > $138/day | System overhead anomaly (review weekly) | - -**Per-Vertical Thresholds:** -- **LTX API**: Total daily > $5,555, Failure rate > 5.7%, DoD 
change > 30% -- **LTX Studio**: Total daily > $11,928, Failure rate > 22.6%, DoD change > 15% - -**Alert logic**: Cost metric exceeds WARNING (avg+2σ) or CRITICAL (avg+3σ) threshold +### Phase 1: Read Shared Knowledge -### Phase 2: Read Shared Knowledge - -Before writing SQL, read: -- **`shared/product-context.md`** — LTX products, user types, business model +Before analyzing, read: - **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) -- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) - **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries - **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows and benchmarks -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) - -**Data nuances**: -- Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` -- Partitioned by `dt` (DATE) — always filter for performance -- **Target date**: `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` (cost data from 3 days ago) -- Cost categories: inference, idle, overhead, unused -- Total cost per request = `row_cost + attributed_idle_cost + attributed_overhead_cost` (inference only) -- Infrastructure cost = `SUM(row_cost)` across all categories -### Phase 3: Select Query Template +**Statistical Method**: Last 10 same-day-of-week comparison with 2σ threshold -✅ **PREFERRED: Run all 11 templates for comprehensive overview** +**Alert logic**: `|current_value - μ| > 2σ` -If user didn't specify, run all templates from `shared/gpu-cost-query-templates.md`: +Where: +- **μ (mean)** = average of last 10 same-day-of-week values +- **σ (stddev)** = standard deviation of last 10 same-day-of-week values +- **z-score** = (current - μ) / σ -**If user specified specific analysis**, select appropriate template: +**Severity levels**: +- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor +- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate +- **CRITICAL**: `|z| > 4.5` - Extreme 
anomaly, immediate action -| User asks... | Use template | -|-------------|-------------| -| "What are total costs?" | Daily Total Cost (API + Studio) | -| "Break down API costs" | API Cost by Endpoint & Model | -| "Break down Studio costs" | Studio Cost by Process | -| "Yesterday vs day before" | Day-over-Day Comparison | -| "This week vs last week" | Week-over-Week Comparison | -| "Detect cost spikes" | Anomaly Detection (Z-Score) | -| "GPU utilization breakdown" | Utilization by Cost Category | -| "Cost efficiency by model" | Cost per Request by Model | -| "Which orgs cost most?" | API Cost by Organization | +### Phase 2: Execute Monitoring Query -### Phase 4: Execute Query +Run query for **3 days ago** (cost data needs time to finalize): -Run query using: ```bash bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty " - + " ``` -**Or** use BigQuery console with project `ltx-dwh-explore` - -**Critical**: Always filter on `dt = DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` - -### Phase 5: Analyze Results - -**For cost trends**: -- Compare current period vs baseline (7-day avg, prior week, prior month) -- Calculate % change and flag significant shifts (>15-20%) - -**For anomaly detection**: -- Flag days with Z-score > 2 (cost or volume deviates > 2 std devs from rolling avg) -- Investigate root cause: specific endpoint/model/org, error rate spike, billing type change - -**For breakdowns**: -- Identify top cost drivers (endpoint, model, org, process) -- Calculate cost per request to spot efficiency issues -- Check failure costs (wasted spend on errors) - -### Phase 6: Present Findings - -Format results with: -- **Summary**: Key finding (e.g., "GPU costs spiked 45% 3 days ago") -- **Root cause**: What drove the change (e.g., "LTX API /v1/text-to-video requests +120%") -- **Breakdown**: Top contributors by dimension (endpoint, model, org, process) -- **Recommendation**: Action to take (investigate org X, optimize model Y, alert team) -- **Priority 
tier**: Tier 1 (High), Tier 2 (Medium), or Tier 3 (Low) - -### Phase 7: Set Up Alert (Optional) - -For ongoing monitoring: -1. Save SQL query -2. Set up in BigQuery scheduled query or Hex Thread -3. Configure notification threshold by tier -4. Route alerts to appropriate team: - - API cost spikes → API team - - Studio cost spikes → Studio team - - Infrastructure issues → Engineering team - -## 5. Context & References - -### Production Thresholds -- **`/Users/dbeer/workspace/dwh-data-model-transforms/queries/gpu_cost_alerting_thresholds.md`** — Production thresholds for LTX division (60-day analysis) - -### Shared Knowledge -- **`shared/product-context.md`** — LTX products and business context -- **`shared/bq-schema.md`** — GPU cost table schema (lines 418-615) -- **`shared/metric-standards.md`** — GPU cost metric patterns (section 13) -- **`shared/gpu-cost-query-templates.md`** — 11 production-ready SQL queries -- **`shared/gpu-cost-analysis-patterns.md`** — Analysis workflows, benchmarks, investigation playbooks -- **`shared/event-registry.yaml`** — Feature events (if analyzing feature-level costs) +✅ **PREFERRED: Run all 11 templates for comprehensive overview** -### Data Source -Table: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost` -- Partitioned by `dt` (DATE) -- Division: `LTX` (LTX API + LTX Studio) -- Cost categories: inference (27%), idle (72%), overhead (0.3%) -- Key columns: `cost_category`, `row_cost`, `attributed_idle_cost`, `attributed_overhead_cost`, `result`, `endpoint`, `model_type`, `org_name`, `process_name` +### Phase 3: Analyze and Present Results -### Query Templates -See `shared/gpu-cost-query-templates.md` for 11 production-ready queries +Format findings with: +- **Summary**: Key finding (e.g., "Idle costs spiked 45% on March 5") +- **Severity**: NOTICE / WARNING / CRITICAL +- **Z-score**: Statistical deviation from baseline +- **Root cause**: Top contributors by endpoint/model/org/process +- **Recommendation**: Action to 
take, team to notify -## 6. Constraints & Done +## 5. Constraints & Done ### DO NOT - **DO NOT** analyze yesterday's data — use 3 days ago (cost data needs time to finalize) - **DO NOT** sum row_cost + attributed_* across all cost_categories — causes double-counting -- **DO NOT** mix inference and non-inference rows in same aggregation without filtering -- **DO NOT** use absolute thresholds — always compare to baseline (avg+2σ/avg+3σ) -- **DO NOT** skip partition filtering — always filter on `dt` for performance +- **DO NOT** skip partition filtering — always filter on `dt` ### DO -- **DO** always filter on `dt` partition column for performance -- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as the target date +- **DO** use `DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)` as target date - **DO** filter `cost_category = 'inference'` for request-level analysis -- **DO** exclude Lightricks team with `is_lt_team IS FALSE` for customer-facing cost analysis -- **DO** include LT team only when analyzing total infrastructure spend or debugging -- **DO** use `ltx-dwh-explore` as execution project -- **DO** calculate cost per request with `SAFE_DIVIDE` to avoid division by zero +- **DO** exclude LT team with `is_lt_team IS FALSE` for customer-facing cost +- **DO** use execution project `ltx-dwh-explore` - **DO** sum all three cost columns (row_cost + attributed_idle + attributed_overhead) for fully loaded cost per request -- **DO** use `SUM(row_cost)` across all rows for total infrastructure cost -- **DO** compare against baseline (7-day avg, prior period) for trends -- **DO** round cost values to 2 decimal places for readability -- **DO** flag anomalies with Z-score > 2 (cost or volume deviation > 2 std devs) -- **DO** investigate failure costs (wasted spend on errors) -- **DO** break down by endpoint/model for API, by process for Studio -- **DO** check cost per request trends to spot efficiency degradation -- **DO** validate results against total infrastructure spend -- 
**DO** set thresholds based on historical baseline, not absolute values -- **DO** alert engineering team for cost spikes > 30% vs baseline -- **DO** include cost breakdown and root cause in alerts -- **DO** route API cost alerts to API team, Studio alerts to Studio team +- **DO** flag anomalies with Z-score > 2 (statistical threshold) +- **DO** route alerts: API team (API costs), Studio team (Studio costs), Engineering (infrastructure) ### Completion Criteria -✅ All cost categories analyzed (idle, inference, overhead) -✅ Production thresholds applied by tier (High/Medium/Low) -✅ Alerts prioritized and routed to appropriate teams -✅ Root cause investigation completed (endpoint/model/org/process) -✅ Findings presented with cost breakdown and recommendations -✅ Query uses 3-day lookback (not yesterday) -✅ Partition filtering applied for performance +✅ Query executed for 3 days ago with partition filtering +✅ Statistical analysis applied (2σ threshold) +✅ Alerts classified by severity (NOTICE/WARNING/CRITICAL) +✅ Root cause identified (endpoint/model/org/process) +✅ Results presented with cost breakdown and z-scores From 2002c8510a576bf9f909f09e27d809d9a16afdff Mon Sep 17 00:00:00 2001 From: Daniel Beer Date: Thu, 19 Mar 2026 15:03:35 +0200 Subject: [PATCH 20/20] Add usage monitor README documentation --- agents/monitoring/usage/README.md | 184 ++++++++++++++++++++++++++++++ 1 file changed, 184 insertions(+) create mode 100644 agents/monitoring/usage/README.md diff --git a/agents/monitoring/usage/README.md b/agents/monitoring/usage/README.md new file mode 100644 index 0000000..2fff6c2 --- /dev/null +++ b/agents/monitoring/usage/README.md @@ -0,0 +1,184 @@ +# Usage Monitor + +Autonomous usage monitoring for LTX Studio using statistical anomaly detection. 
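The 2σ/3σ/4.5σ severity bands that both the be-cost and usage monitors rely on can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual monitor code: the function name and the flat-baseline guard are assumptions, but the band boundaries match the NOTICE/WARNING/CRITICAL definitions above.

```python
from statistics import mean, pstdev

def classify_severity(current, baseline_values):
    """Classify a metric against its last 10 same-day-of-week values.

    Returns (z_score, severity) where severity is None when |z| <= 2.
    """
    mu = mean(baseline_values)
    sigma = pstdev(baseline_values)
    if sigma == 0:
        # Flat baseline: z-score is undefined, so skip alerting.
        return 0.0, None
    z = (current - mu) / sigma
    if abs(z) > 4.5:
        return z, "CRITICAL"   # extreme anomaly, immediate action
    if abs(z) > 3:
        return z, "WARNING"    # significant anomaly, investigate
    if abs(z) > 2:
        return z, "NOTICE"     # minor deviation, monitor
    return z, None             # within 2 sigma of normal variance
```

For example, with a baseline averaging 100 and σ of roughly 1.7, a current value of 110 lands deep in the CRITICAL band, while 100.5 produces no alert at all.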
+
+## Overview
+
+Detects **data spikes** (both increases and decreases) in user engagement metrics by comparing yesterday's values against the last 10 same-day-of-week data points using a 2 standard deviation (2σ) threshold.
+
+**Problem solved:** Early detection of significant changes in user behavior, product adoption, feature launches, enterprise churn risk, or engagement shifts.
+
+## Quick Start
+
+```bash
+# Install dependency (one-time)
+pip3 install google-cloud-bigquery
+
+# Monitor yesterday (default)
+python3 usage_monitor.py
+
+# Monitor specific date
+python3 usage_monitor.py --date 2026-03-09
+
+# Show help
+python3 usage_monitor.py --help
+```
+
+## What It Monitors
+
+### Segments (prioritized)
+1. **Enterprise Contract** - Contracted enterprise accounts (weekdays only)
+2. **Enterprise Pilot** - Pilot enterprise accounts (weekdays only)
+3. **Heavy Users** - Active 4+ weeks, paying, consuming tokens
+4. **Paying non-Enterprise** - Paid tiers excluding enterprise
+5. **Free** - Free tier users
+
+### Metrics per Segment
+- **DAU** (Daily Active Users)
+- **Token consumption**
+- **Image generations**
+- **Video generations** (skipped for Enterprise - too volatile)
+
+## Statistical Method
+
+**Alert Logic:** `|yesterday_value - μ| > 2σ`
+
+Where:
+- **μ (mean)** = average of last 10 same-day-of-week values
+- **σ (stddev)** = standard deviation of last 10 same-day-of-week values
+- **z-score** = (yesterday - μ) / σ
+
+**Why same-day-of-week?**
+- Monday usage differs from Friday usage
+- Compares a Monday to the last 10 Mondays (not to all days)
+- Accounts for weekly seasonality
+
+**Why 2 standard deviations?**
+- Captures 95.4% of normal variance
+- Auto-adapts to each segment's natural patterns
+- Balances early detection with false positive reduction
+
+**Severity Levels:**
+- **NOTICE**: `2 < |z| ≤ 3` - Minor deviation, monitor
+- **WARNING**: `3 < |z| ≤ 4.5` - Significant anomaly, investigate
+- **CRITICAL**: `|z| > 4.5` - Extreme anomaly, immediate action
+
+## Example Output
+
+```
+================================================================================
+  LTX STUDIO USAGE MONITORING - 2026-03-09
+  Day: WEEKDAY
+  Method: 2 Standard Deviations (2σ) from last 10 same-day-of-week
+================================================================================
+
+⏳ Running BigQuery query...
+✅ Query complete (12,345,678 bytes processed)
+
+🔴 3 ALERTS DETECTED
+
+================================================================================
+
+⚠️ WARNING ALERTS (2):
+
+  • Free - Tokens
+    Current: 4,497,947 | Mean (μ): 3,068,455 | Std Dev (σ): 426,074
+    Z-score: 3.36 (3σ < |z| ≤ 4.5σ)
+    Change: +46.6% from mean
+
+  • Heavy Users - Tokens
+    Current: 2,157,891 | Mean (μ): 1,246,532 | Std Dev (σ): 263,847
+    Z-score: 3.45 (3σ < |z| ≤ 4.5σ)
+    Change: +73.0% from mean
+
+ℹ️ NOTICE ALERTS (1):
+
+  • Paying non-Enterprise - Dau
+    Current: 1,234 | Mean (μ): 1,450 | Std Dev (σ): 89
+    Z-score: -2.43 (2σ < |z| ≤ 3σ)
+    Change: -14.9% from mean
+
+================================================================================
+Total: 0 CRITICAL, 2 WARNING, 1 NOTICE
+================================================================================
+
+Note: 2σ threshold captures 95.4% of normal variance
+      CRITICAL: |z| > 4.5, WARNING: 3 < |z| ≤ 4.5, NOTICE: 2 < |z| ≤ 3
+```
+
+## Root Cause Investigation
+
+When alerts fire for **Enterprise** segments, use `investigate_root_cause.sql` to drill down to organization level:
+
+```bash
+# Copy SQL to BigQuery console or bq CLI
+# Shows week-over-week change by organization
+```
+
+For other segments (Heavy, Paying, Free), investigate:
+- Tier distribution (Standard vs Pro vs Lite)
+- Feature launches or product changes
+- Marketing campaigns or promotional activity
+
+## Exception Handling
+
+**Enterprise weekends are suppressed:**
+- Enterprise Contract and Pilot alerts only fire on weekdays (Mon-Fri)
+- Weekend usage is too sparse and volatile for statistical reliability
+- Prevents false positives from low weekend activity
+
+## Files
+
+- **`SKILL.md`** - Agent instructions and workflow documentation
+- **`usage_monitor.py`** - Combined SQL + alerting logic (303 lines)
+- **`investigate_root_cause.sql`** - Org-level drill-down for Enterprise alerts
+- **`README.md`** - This file
+
+## Interpreting Results
+
+**Positive spikes (increase):**
+- Feature launches driving adoption
+- Marketing campaigns succeeding
+- Product improvements resonating
+- Enterprise expansion
+
+**Negative spikes (decrease):**
+- Churn events
+- Product issues or bugs
+- Competitive threats
+- Enterprise account going dormant
+
+**For CRITICAL alerts:**
+- Immediate investigation required
+- Contact account managers (if Enterprise)
+- Check for system outages or data pipeline issues
+
+**For WARNING alerts:**
+- Monitor for persistence (repeats next day?)
+- Investigate root cause during business hours
+
+**For NOTICE alerts:**
+- Track trend over next few days
+- Document for pattern analysis
+
+## Integration
+
+**Future enhancements:**
+- Slack notifications for CRITICAL/WARNING alerts
+- Linear issue creation for persistent alerts
+- Historical alert log and trend analysis
+- Automated root cause recommendations
+
+## Requirements
+
+- Python 3.7+
+- `google-cloud-bigquery` package
+- BigQuery access to `ltx-dwh-prod-processed` dataset
+- Execution project: `ltx-dwh-explore`
+
+## Data Source
+
+- **Table:** `ltx-dwh-prod-processed.web.ltxstudio_agg_user_date`
+- **Partitioned by:** `dt` (DATE)
+- **Lookback window:** 70 days (ensures 10+ same-DOW data points)
+- **LT team excluded:** Already filtered at table level
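The 70-day lookback and same-day-of-week baseline described above can be sketched as a small Python helper. This is a hedged illustration under stated assumptions, not the actual `usage_monitor.py` implementation: the function name and the dict-based `history` structure are invented for the example, but the logic (last 10 prior occurrences of the target's weekday, then μ and σ) follows the Statistical Method section.

```python
from datetime import date, timedelta
from statistics import mean, pstdev

def same_dow_baseline(history, target, n=10):
    """Collect the last `n` same-day-of-week values strictly before `target`.

    history: dict mapping datetime.date -> daily metric value.
    A 70-day lookback guarantees 10 prior same-weekday points.
    Returns (mu, sigma, values) for the z-score comparison.
    """
    values = []
    d = target - timedelta(days=7)  # previous occurrence of the same weekday
    while d in history and len(values) < n:
        values.append(history[d])
        d -= timedelta(days=7)
    return mean(values), pstdev(values), values
```

Monitoring 2026-03-09 (a Monday) then reduces to collecting the 10 preceding Mondays and computing `(current - mu) / sigma` against the severity bands.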