Changes from all commits
25 commits
9dab446
feat: implement production usage monitoring with data-driven thresholds
DanielB945 Mar 8, 2026
732ec76
chore: remove experimental composite anomaly scoring
DanielB945 Mar 8, 2026
dda1743
feat: add production GPU cost thresholds to BE cost monitoring
DanielB945 Mar 8, 2026
1467b24
fix: GPU cost monitoring analyzes 3 days ago (not yesterday)
DanielB945 Mar 8, 2026
f1525ae
docs: add all shared files to revenue monitoring data sources
DanielB945 Mar 8, 2026
e5f64d2
docs: add all shared files to API runtime monitoring
DanielB945 Mar 8, 2026
9e94a60
refactor: restructure usage monitor to 6-part Agent Skills format
DanielB945 Mar 8, 2026
3bf1007
refactor: restructure all monitoring skills to 6-part Agent Skills fo…
DanielB945 Mar 8, 2026
ab1f8c2
refactor: combine SQL and Python into single usage monitor script
DanielB945 Mar 10, 2026
e031e24
Update usage monitor to detect spikes in both directions
DanielB945 Mar 10, 2026
79e2922
Remove detailed usage patterns from Overview
DanielB945 Mar 10, 2026
f771695
Refactor usage monitor skill: remove duplication, apply progressive d…
DanielB945 Mar 10, 2026
88257fa
Address PR review comments
DanielB945 Mar 10, 2026
27dfadb
Update usage monitor: 2σ threshold, NOTICE severity, remove video gen…
AssafHayEden Mar 10, 2026
db9816d
feat: add production GPU cost thresholds and 3-day lookback
DanielB945 Mar 10, 2026
dad369b
feat: restructure revenue monitor to 6-part Agent Skills format
DanielB945 Mar 10, 2026
ecb38f9
feat: restructure enterprise monitor to 6-part Agent Skills format
DanielB945 Mar 10, 2026
618140c
feat: restructure API runtime monitor to 6-part Agent Skills format
DanielB945 Mar 10, 2026
a77114e
refactor: streamline BE cost monitoring skill
DanielB945 Mar 11, 2026
4f9ee74
Merge BE cost monitoring agent (PR #43)
DanielB945 Mar 19, 2026
d994a36
Merge API runtime monitoring agent (PR #44)
DanielB945 Mar 19, 2026
ee9a25f
Merge enterprise account monitoring agent (PR #45)
DanielB945 Mar 19, 2026
39787ad
Merge revenue monitoring agent (PR #46)
DanielB945 Mar 19, 2026
e869600
Merge production usage monitoring (PR #38) - resolved conflicts
DanielB945 Mar 19, 2026
2002c85
Add usage monitor README documentation
DanielB945 Mar 19, 2026
335 changes: 257 additions & 78 deletions agents/monitoring/api-runtime/SKILL.md
@@ -1,84 +1,263 @@
---
name: api-runtime-monitor
description: Monitors LTX API runtime performance, latency, error rates, and throughput. Alerts on performance degradation or errors.
tags: [monitoring, api, performance, latency, errors]
description: "Monitor LTX API runtime performance, latency, error rates, and throughput. Detects performance degradation and errors. Use when: (1) detecting API latency issues, (2) alerting on error rate spikes, (3) investigating throughput drops by endpoint/model/org."
tags: [monitoring, api, performance, latency, errors, throughput]
---

# API Runtime Monitor

## When to use

- "Monitor API latency"
- "Alert on API errors"
- "Track API throughput"
- "Monitor inference time"
- "Alert on API performance degradation"

## What it monitors

- **Latency**: Request processing time, inference time, queue time
- **Error rates**: % of failed requests, error types, error sources
- **Throughput**: Requests per hour/day, by endpoint/model
- **Performance**: P50/P95/P99 latency, success rate
- **Utilization**: API usage by org, model, resolution

## Steps

1. **Gather requirements from user:**
- Which performance metric to monitor (latency, errors, throughput)
- Alert threshold (e.g., "P95 latency > 30s", "error rate > 5%", "throughput drops > 20%")
- Time window (hourly, daily)
- Scope (all requests, specific endpoint, specific org)
- Notification channel

2. **Read shared files:**
- `shared/bq-schema.md` — GPU cost table (has API runtime data) and ltxvapi tables
- `shared/metric-standards.md` — Performance metric patterns

3. **Identify data source:**
- For LTX API: Use `ltxvapi_api_requests_with_be_costs` or `gpu_request_attribution_and_cost`
- **Key columns explained:**
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

4. **Write monitoring SQL:**
- Query relevant performance metric
- Calculate percentiles (P50, P95, P99) for latency
- Calculate error rate (failed / total requests)
- Compare against baseline

5. **Present to user:**
- Show SQL query
- Show example alert format with performance breakdown
- Confirm threshold values

6. **Set up alert** (manual for now):
- Document SQL
- Configure notification to engineering team

## Reference files

| File | Read when |
|------|-----------|
| `shared/product-context.md` | LTX products and business context |
| `shared/bq-schema.md` | API tables and GPU cost table schema |
| `shared/metric-standards.md` | Performance metric patterns |
| `shared/event-registry.yaml` | Feature events (if analyzing event-driven metrics) |
| `shared/gpu-cost-query-templates.md` | GPU cost queries (if analyzing cost-related performance) |
| `shared/gpu-cost-analysis-patterns.md` | Cost analysis patterns (if analyzing cost-related performance) |

## Rules

- DO use APPROX_QUANTILES for percentile calculations (P50, P95, P99)
- DO separate errors by error_source (infrastructure vs applicative)
- DO filter by result = 'success' for success rate calculations
- DO break down by endpoint, model, and resolution for detailed analysis
- DO compare current performance against historical baseline
- DO alert engineering team for infrastructure errors, product team for applicative errors
- DO partition by dt for performance
## 1. Overview (Why?)

LTX API performance varies by endpoint, model, and customer organization. Latency issues, error rate spikes, and throughput drops can indicate infrastructure problems, model regressions, or customer-specific issues that require engineering intervention.

This skill provides **autonomous API runtime monitoring** that detects performance degradation (P95 latency spikes), error rate increases, throughput drops, and queue time issues — with breakdown by endpoint, model, and organization for root cause analysis.

**Problem solved**: Detect API performance problems and errors before they impact customer experience — with segment-level (endpoint/model/org) root cause identification.

## 2. Requirements (What?)

Monitor these outcomes autonomously:

- [ ] P95 latency spikes (> 2x baseline or > 60s)
- [ ] Error rate increases (> 5% or DoD increase > 50%)
- [ ] Throughput drops (> 30% DoD/WoW)
- [ ] Queue time excessive (> 50% of processing time)
- [ ] Infrastructure errors (> 10 requests/hour)
- [ ] Alerts include breakdown by endpoint, model, organization
- [ ] Results formatted by priority (infrastructure vs applicative errors)
- [ ] Findings routed to appropriate team (API team or Engineering)

## 3. Progress Tracker

* [ ] Read shared knowledge (schema, metrics, performance patterns)
* [ ] Identify data source (ltxvapi tables or GPU cost table)
* [ ] Write monitoring SQL with percentile calculations
* [ ] Execute query for target date range
* [ ] Analyze results by endpoint, model, organization
* [ ] Separate infrastructure vs applicative errors
* [ ] Present findings with performance breakdown
* [ ] Route alerts to appropriate team

## 4. Implementation Plan

### Phase 1: Read Alert Thresholds

**Generic thresholds** (data-driven analysis pending):
- P95 latency > 2x baseline or > 60s
- Error rate > 5% or DoD increase > 50%
- Throughput drops > 30% DoD/WoW
- Queue time > 50% of processing time
- Infrastructure errors > 10 requests/hour

> [!IMPORTANT]
> These are generic thresholds. Consider creating production thresholds based on endpoint/model-specific analysis (similar to usage/GPU cost monitoring).
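The generic thresholds above can be sketched as a single check over one day's metrics. This is a minimal illustration, assuming a row dict shaped like the Phase 4 query output; field names mirror that query, and the helper itself is hypothetical:

```python
# Sketch of the generic Phase 1 threshold checks against one
# endpoint/model/org row. Field names follow the Phase 4 query output;
# the queue ratio uses queue/(queue + inference) as an approximation
# of queue time as a share of processing time.

def evaluate_thresholds(row: dict) -> list[str]:
    """Return the list of triggered alert reasons for one metrics row."""
    alerts = []
    p95 = row["p95_latency_ms"]
    baseline_p95 = row.get("p95_latency_baseline_7d")
    if (baseline_p95 and p95 > 2 * baseline_p95) or p95 > 60_000:
        alerts.append("P95 latency > 2x baseline or > 60s")
    err = row["error_rate_pct"]
    baseline_err = row.get("error_rate_baseline_7d")
    if err > 5 or (baseline_err and (err - baseline_err) / baseline_err > 0.5):
        alerts.append("Error rate > 5% or DoD increase > 50%")
    total_ms = row["avg_queue_time_ms"] + row["avg_inference_time_ms"]
    if total_ms > 0 and row["avg_queue_time_ms"] / total_ms > 0.5:
        alerts.append("Queue time > 50% of processing time")
    return alerts
```

A row breaching several thresholds yields several reasons, so one alert can carry the full picture.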

### Phase 2: Read Shared Knowledge

Before writing SQL, read:
- **`shared/product-context.md`** — LTX products, user types, business model, API context
- **`shared/bq-schema.md`** — GPU cost table (has API runtime data), ltxvapi tables, API request schema
- **`shared/metric-standards.md`** — Performance metric patterns (latency, error rates, throughput)
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns (if analyzing cost-related performance)

**Data nuances**:
- Primary table: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Alternative: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Partitioned by `action_ts` (TIMESTAMP) or `dt` (DATE) — filter for performance
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

### Phase 3: Identify Data Source

✅ **PREFERRED: Use ltxvapi_api_requests_with_be_costs for API runtime metrics**

**Key columns**:
- `request_processing_time_ms`: Total time from request submission to completion
- `request_inference_time_ms`: GPU processing time (actual model inference)
- `request_queue_time_ms`: Time waiting in queue before processing starts
- `result`: Request outcome (success, failed, timeout, etc.)
- `error_type` or `error_source`: Classification of errors (infrastructure vs applicative)
- `endpoint`: API endpoint called (e.g., /generate, /upscale)
- `model_type`: Model used (ltxv2, retake, etc.)
- `org_name`: Customer organization making the request

> [!IMPORTANT]
> Verify column name: `error_type` vs `error_source` in actual schema

### Phase 4: Write Monitoring SQL

✅ **PREFERRED: Calculate percentiles and error rates with baseline comparisons**

```sql
WITH api_metrics AS (
  SELECT
    DATE(action_ts) AS dt,
    endpoint,
    model_type,
    org_name,
    COUNT(*) AS total_requests,
    COUNTIF(result = 'success') AS successful_requests,
    COUNTIF(result != 'success') AS failed_requests,
    SAFE_DIVIDE(COUNTIF(result != 'success'), COUNT(*)) * 100 AS error_rate_pct,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(50)] AS p50_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(95)] AS p95_latency_ms,
    APPROX_QUANTILES(request_processing_time_ms, 100)[OFFSET(99)] AS p99_latency_ms,
    AVG(request_queue_time_ms) AS avg_queue_time_ms,
    AVG(request_inference_time_ms) AS avg_inference_time_ms
  FROM `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
  -- 8 full days so the target day has a complete 7-day baseline;
  -- exclude today's partial data
  WHERE action_ts >= TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY))
    AND action_ts < TIMESTAMP(CURRENT_DATE())
  GROUP BY dt, endpoint, model_type, org_name
),
metrics_with_baseline AS (
  SELECT
    *,
    -- partition must include org_name: rows are grouped per org, so a window
    -- over endpoint/model alone would mix one org's days with another's
    AVG(p95_latency_ms) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS p95_latency_baseline_7d,
    AVG(error_rate_pct) OVER (
      PARTITION BY endpoint, model_type, org_name
      ORDER BY dt ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
    ) AS error_rate_baseline_7d
  FROM api_metrics
)
SELECT * FROM metrics_with_baseline
WHERE dt = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY);
```

**Key patterns**:
- **Percentiles**: Use `APPROX_QUANTILES` for P50/P95/P99
- **Error rate**: `SAFE_DIVIDE(failed, total) * 100`
- **Baseline**: rolling average over up to 7 preceding days, by endpoint, model, and org
- **Time window**: last 8 full days — 7 baseline days plus the target day (shorter than usage monitoring due to higher-frequency data)

### Phase 5: Execute Query

Run query using:
```bash
bq --project_id=ltx-dwh-explore query --use_legacy_sql=false --format=pretty "
<query>
"
```

### Phase 6: Analyze Results

**For latency trends**:
- Compare P95 latency vs baseline (7-day avg)
- Flag if P95 > 2x baseline or > 60s absolute
- Identify which endpoint/model/org drove spikes

**For error rate analysis**:
- Compare error rate vs baseline
- Separate errors by `error_type`/`error_source` (infrastructure vs applicative)
- Flag if error rate > 5% or DoD increase > 50%

**For throughput**:
- Track requests per hour/day by endpoint
- Flag throughput drops > 30% DoD/WoW
- Identify which endpoints lost traffic

**For queue analysis**:
- Calculate queue time as % of total processing time
- Flag if queue time > 50% of processing time
- Indicates capacity/scaling issues
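The throughput check above reduces to a day-over-day comparison per endpoint. A minimal sketch, with an illustrative input shape (endpoint → request count per day):

```python
# Flag endpoints whose request count dropped more than 30% day-over-day.
def throughput_drops(today: dict[str, int], yesterday: dict[str, int],
                     threshold: float = 0.30) -> dict[str, float]:
    """Map endpoint -> fractional drop, for endpoints breaching the threshold."""
    flagged = {}
    for endpoint, prev in yesterday.items():
        if prev == 0:
            continue  # no baseline traffic to compare against
        drop = (prev - today.get(endpoint, 0)) / prev
        if drop > threshold:
            flagged[endpoint] = round(drop, 3)
    return flagged
```

The same shape works for WoW by feeding weekly totals instead of daily ones.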

### Phase 7: Present Findings

Format results with:
- **Summary**: Key finding (e.g., "P95 latency spiked to 85s for /v1/text-to-video")
- **Root cause**: Which endpoint/model/org drove the issue
- **Breakdown**: Performance metrics by dimension
- **Error classification**: Infrastructure vs applicative errors
- **Recommendation**: Route to API team (applicative) or Engineering team (infrastructure)

**Alert format**:
```
⚠️ API PERFORMANCE ALERT:
• Endpoint: /v1/text-to-video
Model: ltxv2
Metric: P95 Latency
Current: 85s | Baseline: 30s
Change: +183%

Error rate: 8.2% (baseline: 2.1%)
Error type: Infrastructure

Recommendation: Alert Engineering team for infrastructure issue
```
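The alert format above can be rendered mechanically from one breaching row. A sketch, assuming flattened field names (seconds rather than ms, a resolved `error_source` column); the renderer itself is illustrative:

```python
# Render the alert format from one breaching row. Field names are
# assumptions layered on the Phase 4 query output, not a fixed schema.
def format_alert(row: dict) -> str:
    change_pct = (row["p95_latency_s"] / row["baseline_p95_s"] - 1) * 100
    team = "Engineering" if row["error_source"] == "Infrastructure" else "API/Product"
    return (
        "⚠️ API PERFORMANCE ALERT:\n"
        f"• Endpoint: {row['endpoint']}\n"
        f"  Model: {row['model_type']}\n"
        f"  Metric: P95 Latency\n"
        f"  Current: {row['p95_latency_s']}s | Baseline: {row['baseline_p95_s']}s\n"
        f"  Change: {change_pct:+.0f}%\n\n"
        f"Error rate: {row['error_rate_pct']}% (baseline: {row['baseline_error_rate_pct']}%)\n"
        f"Error type: {row['error_source']}\n\n"
        f"Recommendation: Alert {team} team"
    )
```

Keeping the renderer separate from the threshold logic lets the same alert body go to Slack, email, or a ticket without duplicating the rules.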

### Phase 8: Route Alert

For ongoing monitoring:
1. Save SQL query
2. Set up in BigQuery scheduled query or Hex Thread
3. Configure notification by error type:
- Infrastructure errors → Engineering team
- Applicative errors → API/Product team
4. Include endpoint, model, and org details in alert
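Step 3's routing rule is a small lookup. A sketch with placeholder channel names (the real channels are not specified in this doc):

```python
# Route by error classification; channel names are placeholders.
ROUTES = {
    "infrastructure": "engineering-team",
    "applicative": "api-product-team",
}

def route_alert(error_source: str) -> str:
    """Default unknown classifications to the API/Product team for triage."""
    return ROUTES.get(error_source.lower(), "api-product-team")
```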

## 5. Context & References

### Shared Knowledge
- **`shared/product-context.md`** — LTX products and API context
- **`shared/bq-schema.md`** — API tables and GPU cost table schema
- **`shared/metric-standards.md`** — Performance metric patterns
- **`shared/event-registry.yaml`** — Feature events (if analyzing event-driven metrics)
- **`shared/gpu-cost-query-templates.md`** — GPU cost queries (if analyzing cost-related performance)
- **`shared/gpu-cost-analysis-patterns.md`** — Cost analysis patterns

### Data Sources

**Primary table**: `ltx-dwh-prod-processed.web.ltxvapi_api_requests_with_be_costs`
- Partitioned by `action_ts` (TIMESTAMP)
- Key columns: `request_processing_time_ms`, `request_inference_time_ms`, `request_queue_time_ms`, `result`, `endpoint`, `model_type`, `org_name`

**Alternative**: `ltx-dwh-prod-processed.gpu_costs.gpu_request_attribution_and_cost`
- Contains API runtime data but is not the primary source for performance metrics

### Endpoints
Common endpoints: `/v1/text-to-video`, `/v1/image-to-video`, `/v1/upscale`, `/generate`

### Models
Common models: `ltxv2`, `retake`, etc.

## 6. Constraints & Done

### DO NOT

- **DO NOT** use absolute thresholds without baseline comparison
- **DO NOT** mix infrastructure and applicative errors in same alert
- **DO NOT** skip partition filtering — always filter on `action_ts` or `dt` for performance
- **DO NOT** forget to separate errors by error type/source

> [!IMPORTANT]
> Verify column name in schema: `error_type` vs `error_source`

### DO

- **DO** use `APPROX_QUANTILES` for percentile calculations (P50, P95, P99)
- **DO** separate errors by error_source (infrastructure vs applicative)
- **DO** filter by `result = 'success'` for success rate calculations
- **DO** break down by endpoint, model, and organization for detailed analysis
- **DO** compare current performance against historical baseline (7-day rolling avg)
- **DO** alert engineering team for infrastructure errors
- **DO** alert product/API team for applicative errors
- **DO** partition on `action_ts` or `dt` for performance
- **DO** use `ltx-dwh-explore` as execution project
- **DO** calculate error rate with `SAFE_DIVIDE(failed, total) * 100`
- **DO** flag P95 latency > 2x baseline or > 60s
- **DO** flag error rate > 5% or DoD increase > 50%
- **DO** flag throughput drops > 30% DoD/WoW
- **DO** flag queue time > 50% of processing time
- **DO** flag infrastructure errors > 10 requests/hour
- **DO** include endpoint, model, org details in all alerts
- **DO** validate unusual patterns with API/Engineering team before alerting

### Completion Criteria

✅ All performance metrics monitored (latency, errors, throughput, queue time)
✅ Alerts fire with documented thresholds (generic for now, pending production analysis)
✅ Endpoint/model/org breakdown provided
✅ Errors separated by type (infrastructure vs applicative)
✅ Findings routed to appropriate team
✅ Partition filtering applied for performance
✅ Column name verified (error_type vs error_source)