Production AI pipeline monitoring: root cause detection, anomaly alerts, regression guard, and Gemini-powered recommendations.

    pip install failure-forensics

    from failure_forensics import trace

    @trace(step="retrieval", version="v1")
    def my_retrieval_function(query):
        # your code here
        pass

A self-hosted, zero-cost LLM pipeline observability tool that gives you root cause detection, anomaly alerts, A/B reporting, and a live terminal dashboard, without sending your data to any third-party service.
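
To make the decorator idea concrete, here is a minimal sketch of how a tracing decorator of this kind could work. It is not the package's actual implementation: the name `trace_sketch`, the record fields, and the default log path are assumptions chosen to mirror the `data/logs/requests.jsonl` layout described in this README.

```python
import functools
import json
import time
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical default, modeled on the data/logs/requests.jsonl path.
LOG_PATH = Path("data/logs/requests.jsonl")


def trace_sketch(step, version, log_path=LOG_PATH):
    """Append one JSONL record per call: step, version, status, latency."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Field names here are illustrative, not the project's schema.
            record = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "step": step,
                "version": version,
                "status": "ok",
                "error": None,
            }
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                record["status"] = "failed"
                record["error"] = f"{type(exc).__name__}: {exc}"
                raise  # re-raise so the pipeline still sees the failure
            finally:
                # Runs on success and on failure, so every call is logged.
                record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
                path = Path(log_path)
                path.parent.mkdir(parents=True, exist_ok=True)
                with path.open("a", encoding="utf-8") as fh:
                    fh.write(json.dumps(record) + "\n")

        return wrapper

    return decorator
```

Writing the record in a `finally` block means a failing step is logged before the exception propagates, which is what makes later root cause analysis possible.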
| | Failure Forensics | LangSmith | Braintrust |
|---|---|---|---|
| Cost | Free | Paid tiers | Paid tiers |
| Data privacy | Stays on your machine | Sent to cloud | Sent to cloud |
| Customization | Full control | Limited | Limited |
| Slack alerts | Built-in | Premium only | Premium only |
| A/B reporting | Built-in | Basic | Basic |
| Circuit breaker / trend | Built-in | Not available | Not available |
Failure Forensics is designed for teams who need production-grade observability without vendor lock-in.
Every pipeline run passes through a structured logging and analysis layer:

    Pipeline Step → logger.py → requests.jsonl
                                       │
                      ┌────────────────┴────────────────┐
                      │                                 │
                forensics.py                       pattern.py
           (root cause detection)            (time series + anomaly)
                      │                                 │
                versioning.py                      baseline.py
            (v1 vs v2 comparison)            (7-day moving average)
                      │                                 │
                ab_report.py                        alerts.py
           (A/B comparison table)            (Slack / console alert)
                      └────────────────┬────────────────┘
                                       │
                                 dashboard.py
                          (ASCII terminal dashboard)

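
The right-hand branch of the diagram (time series, then 7-day moving average and trend) can be sketched roughly as follows. This is an illustration only: the record fields `timestamp` and `status`, the function names, and the `tolerance` value are assumptions, and the real `pattern.py` and `baseline.py` may compute these differently.

```python
from collections import defaultdict


def daily_failure_rates(records):
    """Group run records by day and return {day: failure_rate}, sorted by day."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for rec in records:
        day = rec["timestamp"][:10]  # assumes ISO 8601 timestamps (YYYY-MM-DD...)
        totals[day] += 1
        if rec["status"] == "failed":
            failures[day] += 1
    return {day: failures[day] / totals[day] for day in sorted(totals)}


def moving_average(values, window=7):
    """Mean of the last `window` values (or of all values if fewer exist)."""
    recent = values[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def trend(values, window=7, tolerance=0.02):
    """Compare the latest moving average with the one from a day earlier.

    `tolerance` is a hypothetical dead band that keeps small wobbles STABLE.
    """
    if len(values) < 2:
        return "STABLE"
    delta = moving_average(values, window) - moving_average(values[:-1], window)
    if delta > tolerance:
        return "DEGRADING"
    if delta < -tolerance:
        return "IMPROVING"
    return "STABLE"
```

A rising failure rate counts as DEGRADING here because the tracked metric is a failure rate, so higher is worse.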

    failure-forensics/
    ├── src/
    │   ├── logger.py           # Logs every pipeline step to JSONL
    │   ├── forensics.py        # Root cause detection (5 categories)
    │   ├── pattern.py          # Time-series failure rate + anomaly detection
    │   ├── baseline.py         # 7-day moving average + trend (IMPROVING/STABLE/DEGRADING)
    │   ├── alerts.py           # Slack webhook + console alerts
    │   ├── versioning.py       # Per-version failure rate stats
    │   ├── ab_report.py        # A/B comparison report (table + JSON)
    │   └── dashboard.py        # ASCII bar chart terminal dashboard
    ├── data/
    │   └── logs/
    │       └── requests.jsonl  # All pipeline logs (gitignored)
    ├── tests/
    │   └── test_forensics.py   # 8 unit tests
    ├── config.py               # Thresholds, Slack URL, step limits
    ├── main.py                 # 5-scenario demo runner
    ├── simulate.py             # Realistic test data generator (100 runs, anomaly day)
    └── requirements.txt


    git clone https://github.com/jasstt/failure-forensics.git
    cd failure-forensics
    pip install -r requirements.txt

Edit config.py:

    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

If left empty, all alerts print to the console.
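
The Slack-or-console behavior described above could look something like the sketch below. The function name `send_alert` and its return values are hypothetical. The project lists `requests` for webhook calls; this sketch swaps in the standard library's `urllib.request` so it runs without extra dependencies.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = ""  # empty means console output, as in config.py


def send_alert(message, webhook_url=SLACK_WEBHOOK_URL, timeout=5):
    """Post to Slack when a webhook URL is set; otherwise print to the console.

    Returns "slack" or "console" so callers can see which channel was used.
    """
    if not webhook_url:
        print(f"[ALERT] {message}")
        return "console"
    # Slack incoming webhooks accept a JSON body with a "text" field.
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return "slack"
    except OSError as exc:  # URLError and socket timeouts subclass OSError
        # Fall back to the console so an alert is never silently lost.
        print(f"[ALERT] {message} (Slack delivery failed: {exc})")
        return "console"
```

Falling back to the console on a delivery error is a design choice of this sketch, not something the README states about `alerts.py`.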

    python main.py

This runs 5 scenarios:
- Simulation: generates 100 realistic pipeline runs (2 prompt versions, anomaly day)
- Root cause analysis: detects the failing step and assigns a category
- 7-day pattern report: failure rate per day + step breakdown + anomaly check
- A/B report: `prompt_v1` vs `prompt_v2` with per-step improvement table
- Terminal dashboard: live ASCII bar charts, trend, top 5 failed runs
Run the tests with `python tests/test_forensics.py` and `python tests/test_advanced.py`.

| Layer | Feature | Technology |
|---|---|---|
| 1 | Automatic recommendation engine | Rule-based |
| 2 | AI-assisted failure analysis | Gemini 2.5 Pro |
| 3 | Automatic eval set expansion | Frequency analysis |
| 4 | Prompt optimization explanation | Gemini 2.5 Pro |
| 5 | Regression guard | Baseline comparison |
Scenario 6: Regression Guard
Runs an automatic regression check before a new prompt (v3) is deployed:

    REGRESSION CHECK - v3
    Baseline (v2): 11.0% failure rate
    New (v3): 24.5% failure rate
    Delta: +13.5pp → REGRESSION_DETECTED ❌

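
The check above boils down to a percentage-point comparison against a baseline. A minimal sketch, assuming hypothetical cut-offs of 5pp for a warning and 10pp for a detected regression (the README does not state the real thresholds, and `check_regression` is an illustrative name):

```python
def check_regression(baseline_rate, candidate_rate, warn_pp=5.0, fail_pp=10.0):
    """Compare two failure rates (in percent) and return (delta_pp, verdict).

    warn_pp and fail_pp are assumed values, not taken from the project.
    """
    delta_pp = round(candidate_rate - baseline_rate, 1)
    if delta_pp >= fail_pp:
        verdict = "REGRESSION_DETECTED"
    elif delta_pp >= warn_pp:
        verdict = "WARNING"
    else:
        verdict = "PASS"
    return delta_pp, verdict


# Numbers from the sample output: v2 baseline 11.0%, v3 candidate 24.5%.
delta, verdict = check_regression(baseline_rate=11.0, candidate_rate=24.5)
print(f"Delta: {delta:+.1f}pp -> {verdict}")
```

Working in percentage points rather than relative change keeps the verdict stable when the baseline failure rate is small.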
| Layer | Test | Result |
|---|---|---|
| 1 - Recommender | Category → recommendation mapping | ✅ PASS |
| 2 - LLM Analyzer | Gemini fallback | ✅ PASS |
| 3 - Eval Collector | Duplicate prevention | ✅ PASS |
| 4 - Prompt Optimizer | A/B explanation (v2: +10pp) | ✅ PASS |
| 5 - Regression Guard | DETECTED + PASS scenarios | ✅ PASS |
- A/B: prompt_v2 shows a 10pp improvement over v1
- Regression Guard: blocked the v3 deploy with a WARNING at a +6pp delta
- Eval Collector: 5 new eval candidates collected automatically
- LLM Analyzer: falls back cleanly to the rule-based engine when Gemini is unavailable
| Feature | Result |
|---|---|
| Unit Tests | 8/8 PASS ✅ |
| Root cause categories | 5 types (RETRIEVAL_QUALITY, RERANKER_FAILURE, LLM_HALLUCINATION, CITATION_MISS, API_ERROR) |
| Anomaly detection | 20pp delta threshold: flags when today's rate exceeds the 7-day average by more than 20pp |
| A/B comparison | v2: 11.5pp improvement over v1 (22.5% → 11.0% failure rate) |
| Trend analysis | IMPROVING / STABLE / DEGRADING based on 7-day moving average |
| Slack integration | Webhook ready: fires on rate threshold, anomaly, or 3 consecutive failures |
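
The A/B figure in the table is a plain difference of per-version failure rates. A small sketch of that arithmetic, assuming each log record carries hypothetical `version` and `status` fields (the function names are illustrative, not those of `versioning.py` or `ab_report.py`):

```python
from collections import Counter


def failure_rate_by_version(records):
    """Return {version: failure rate in percent} from run records."""
    totals, failures = Counter(), Counter()
    for rec in records:
        totals[rec["version"]] += 1
        if rec["status"] == "failed":
            failures[rec["version"]] += 1
    return {v: 100.0 * failures[v] / totals[v] for v in totals}


def improvement_pp(rates, old, new):
    """Percentage-point drop in failure rate going from `old` to `new`."""
    return round(rates[old] - rates[new], 1)


# Rates as reported in the results table above.
reported = {"prompt_v1": 22.5, "prompt_v2": 11.0}
print(improvement_pp(reported, "prompt_v1", "prompt_v2"))  # 11.5
```

A positive result means the newer version fails less often than the older one.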
| Parameter | Default | Description |
|---|---|---|
| `FAILURE_RATE_THRESHOLD` | `0.25` | Alert fires above this failure rate |
| `ANOMALY_THRESHOLD` | `0.20` | Flag if today exceeds 7-day avg by this delta |
| `SLACK_WEBHOOK_URL` | `""` | Empty = console output |
| `CONSECUTIVE_FAILURE_THRESHOLD` | `3` | Alert after N consecutive step failures |
| `STEP_THRESHOLDS` | see config | Per-step max acceptable failure rate |
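
As a rough illustration of how these thresholds could combine, the sketch below evaluates the three alert conditions using the default values from the table. The function name `alert_reasons`, the reason labels, and the `"failed"` status string are assumptions made for this example.

```python
# Defaults copied from the configuration table above.
FAILURE_RATE_THRESHOLD = 0.25
ANOMALY_THRESHOLD = 0.20
CONSECUTIVE_FAILURE_THRESHOLD = 3


def alert_reasons(today_rate, seven_day_avg, recent_statuses):
    """Return the list of alert conditions that currently hold.

    Rates are fractions in [0, 1]; recent_statuses is ordered oldest first.
    """
    reasons = []
    if today_rate > FAILURE_RATE_THRESHOLD:
        reasons.append("FAILURE_RATE")
    if today_rate - seven_day_avg > ANOMALY_THRESHOLD:
        reasons.append("ANOMALY")
    # Count the unbroken run of failures at the end of the status list.
    streak = 0
    for status in reversed(recent_statuses):
        if status != "failed":
            break
        streak += 1
    if streak >= CONSECUTIVE_FAILURE_THRESHOLD:
        reasons.append("CONSECUTIVE_FAILURES")
    return reasons
```

For example, a day at 27.3% against a 16% average would trip only the rate threshold here, since the gap of about 11pp is below the 20pp anomaly delta.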
| Category | Trigger |
|---|---|
| `RETRIEVAL_QUALITY` | Retrieval step fails: no results, low score |
| `RERANKER_FAILURE` | Reranker can't parse LLM response or times out |
| `LLM_HALLUCINATION` | Generation returns empty or uncited response |
| `CITATION_MISS` | Answer produced but no source citations found |
| `API_ERROR` | Timeout, 429 rate limit, 503 service unavailable |
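
One way to read the table is as a small rule set keyed on the failing step and the error text. The sketch below is a guess at such rules, not the logic in `forensics.py`: the step names are taken from the dashboard output, while the precedence of `API_ERROR`, the reranker timeout exception, and the `UNKNOWN` fallback are assumptions.

```python
def classify_failure(step, error=""):
    """Map a failed step and its error text to one of the five categories."""
    text = error.lower()
    # Assumed precedence: HTTP 429/503 count as API errors for any step.
    if "429" in text or "503" in text:
        return "API_ERROR"
    # The table attributes reranker timeouts to RERANKER_FAILURE,
    # so timeouts are treated as API errors only for other steps.
    if "timeout" in text and step != "reranking":
        return "API_ERROR"
    by_step = {
        "retrieval": "RETRIEVAL_QUALITY",
        "reranking": "RERANKER_FAILURE",
        "generation": "LLM_HALLUCINATION",
        "citation": "CITATION_MISS",
    }
    # "UNKNOWN" is a hypothetical fallback for steps the table does not cover.
    return by_step.get(step, "UNKNOWN")
```

Keeping the rules in a plain dictionary makes it easy to add a step without touching the error-text checks.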

    ═══════════════════════════════════════════════════════════
    🔬 FAILURE FORENSICS - Terminal Dashboard
    ═══════════════════════════════════════════════════════════
    FAILURE RATE CHART FOR THE LAST 7 DAYS
    2026-06-03    13.0%
    2026-06-07    27.3% ⚠️
    2026-06-10    12.0%
    FAILURE BREAKDOWN BY STEP
    retrieval     38.0% (38/100 failed)
    reranking     13.0% (13/100 failed)
    generation    10.0% (10/100 failed)
    citation       6.0% (6/100 failed)
    ⚡ ANOMALY: ✅ Normal: Today (12.0%) vs 7-day avg (16.2%)
    TREND: ➡️ STABLE - Moving Avg: 16.0%

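
For readers curious how bar rows like these can be drawn, here is a minimal sketch of fixed-width ASCII bars. The fill characters, widths, and function names are assumptions for illustration and may not match what `dashboard.py` prints.

```python
def render_bar(rate, width=30, fill="#", empty="."):
    """Draw a fixed-width bar for a rate given as a fraction in [0, 1]."""
    rate = max(0.0, min(1.0, rate))  # clamp out-of-range input
    filled = round(rate * width)
    return "[" + fill * filled + empty * (width - filled) + "]"


def render_row(label, rate, width=30):
    """One dashboard line: padded label, bar, and the rate as a percentage."""
    return f"{label:<12}{render_bar(rate, width)} {rate * 100:5.1f}%"


for name, rate in [("retrieval", 0.38), ("reranking", 0.13)]:
    print(render_row(name, rate, width=20))
```

Clamping the rate first keeps the bar length constant even if a caller passes a bad value.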
- Python standard library: `json`, `collections`, `datetime`, `threading`
- requests: Slack webhook HTTP calls
- python-dotenv: environment variable management
No heavy dependencies. No cloud. No API keys required.
- FastAPI REST endpoint for remote log ingestion
- HTML report export
- PostgreSQL backend for large-scale log storage
- Multi-pipeline support (compare RAG vs fine-tuned model)
- Email alerts as alternative to Slack