diff --git a/docs/etl/ETL_AS_IS_ANALYSIS.md b/docs/etl/ETL_AS_IS_ANALYSIS.md
new file mode 100644
index 0000000..75e7715
--- /dev/null
+++ b/docs/etl/ETL_AS_IS_ANALYSIS.md
@@ -0,0 +1,392 @@
+# ETL As-Is Analysis (Code as Source of Truth)
+Generated from repository code inspection of /etl (recursive).
+
+## 0. Inventory
+| File | Type | Primary responsibility | Entrypoint | Key dependencies |
+| --- | --- | --- | --- | --- |
+| etl/1_download_jira_tickets.py | script | Download and flatten Apache BookKeeper Jira issues into a raw CSV for downstream ETL. | CLI via `python etl/1_download_jira_tickets.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `requests`, `pandas`, `logging`. |
+| etl/2_download_github_prs.py | script | Download GitHub PR list + per-PR details (reviews/checks/statuses), with caching and incremental mode, into a raw CSV. | CLI via `python etl/2_download_github_prs.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `requests`, `pandas`, `concurrent.futures`, `logging`. |
+| etl/3_clean_and_merge.py | script | Clean Jira and PR raw CSVs, derive phase timestamps/durations, and merge into a single dataset. | CLI via `python etl/3_clean_and_merge.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `logging`, `re`. |
+| etl/4_summarize_and_plot.py | script | Produce summary counts and a ticket-type pie chart from merged ticket/PR data. | CLI via `python etl/4_summarize_and_plot.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `matplotlib`, `logging`. |
+| etl/5_estimate_parameters.py | script | Compute arrival/throughput stats and phase duration summaries; export parameter CSVs and backlog plot. | CLI via `python etl/5_estimate_parameters.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `numpy`, `matplotlib`, `logging`. |
+| etl/6_diagnose_and_plot_tickets.py | script | Print per-ticket diagnostics and plot resolution time distribution. | CLI via `python etl/6_diagnose_and_plot_tickets.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `matplotlib`, `logging`. |
+| etl/7_fit_distributions.py | script | Fit candidate distributions to phase durations and export fit statistics/plots. | CLI via `python etl/7_fit_distributions.py` (main guard). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `numpy`, `scipy`, `matplotlib`, `logging`. |
+| etl/8_export_fit_summary.py | script/cli | Convert per-stage fit statistics into a compact `fit_summary.csv` for simulation use. | CLI via `python etl/8_export_fit_summary.py` (argparse). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `numpy`, `argparse`, `logging`. |
+| etl/9_enrich_feedback_cols.py | script/cli | Enrich merged dataset with review/CI feedback signals and inferred developer/tester fields. | CLI via `python etl/9_enrich_feedback_cols.py` (argparse). | Internal: `path_config.PROJECT_ROOT`; External: `pandas`, `numpy`, `logging`. |
+| etl/X_exponentiality_diagnostics.py | script | Run exploratory distribution diagnostics (KDE, fits, QQ) for resolution time hours. | Executed on import or `python etl/X_exponentiality_diagnostics.py` (no main guard). | External: `pandas`, `numpy`, `scipy`, `matplotlib`, `logging`. |
+| etl/assignee_date.py | script | Parse Jira issue changelog JSON to extract assignment date and close date to CSV. | CLI via `python etl/assignee_date.py` (main guard). | External: `json`, `csv`, `datetime`, `pathlib`. |
+
+## 1. Big Picture
+The ETL in `/etl` is a pipeline of stand-alone scripts that fetch raw Jira and GitHub data, clean and merge it into a ticket-centric dataset, derive phase durations, and then compute descriptive statistics and distribution fits for downstream simulation/analysis. The core data flow: Jira issues and GitHub PR details are downloaded into raw CSVs, cleaned and merged on Jira key, then augmented/diagnosed/fit in subsequent scripts. Configuration is mostly file-path constants rooted at `path_config.PROJECT_ROOT`, plus environment variables for GitHub token/QPS behavior. Execution is not handled by a single orchestrated runner; the scripts are invoked manually, in order (download → clean/merge → summarize/estimate/fit/enrich as needed). The pipeline makes extensive use of filesystem outputs under `etl/output/` (CSV, PNG, logs), so re-runs depend on the presence and freshness of those artifacts.
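+
+Since no orchestrator exists in the repository, the following minimal runner sketch only makes the implied ordering explicit; the `PIPELINE` list and the runner itself are hypothetical, not part of `/etl`:
+
+```python
+# Hypothetical runner making the manual execution order explicit; not part of /etl.
+import subprocess
+import sys
+
+PIPELINE = [
+    "etl/1_download_jira_tickets.py",  # Jira ingest
+    "etl/2_download_github_prs.py",    # GitHub PR ingest
+    "etl/3_clean_and_merge.py",        # clean + merge on Jira key
+    "etl/4_summarize_and_plot.py",     # descriptive summaries
+    "etl/5_estimate_parameters.py",    # arrival/throughput parameters
+    "etl/7_fit_distributions.py",      # distribution fits
+    "etl/8_export_fit_summary.py",     # compact fit summary
+    "etl/9_enrich_feedback_cols.py",   # feedback enrichment
+]
+
+for script in PIPELINE:
+    print(f"--- running {script} ---")
+    # Scripts are stand-alone; they communicate only via files under etl/output/.
+    subprocess.run([sys.executable, script], check=True)
+```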
+
+## 2. Per-File Deep Dive (high-level first)
+
+### etl/1_download_jira_tickets.py
+**Role in ETL:** Downloads Jira issues for the BOOKKEEPER project from Apache’s Jira API, flattens nested fields, and writes a raw CSV used by downstream cleaning and merges. It is the canonical Jira data-ingest step for the ETL. It also logs counts and basic assignee coverage for debugging.
+**Inputs:** Jira REST API `/rest/api/2/search` (network), JQL query for project BOOKKEEPER, pagination parameters. No local input files.
+**Outputs:** `etl/output/csv/jira_issues_raw.csv` (flattened CSV), `output/logs/download_jira_tickets.log`.
+**Main steps:**
+1. Configure logging and output paths rooted at `PROJECT_ROOT`.
+2. Paginate Jira issues via `startAt`/`maxResults` and a fixed JQL order.
+3. Retry failed requests with backoff for transient errors.
+4. Flatten nested JSON with `pandas.json_normalize` using dot-separated column names.
+5. Select preferred columns (key, fields.*), falling back to all flattened columns if missing.
+6. Write CSV output and log assignee coverage statistics.
+**Key functions/classes:**
+- `_jira_get`: GET with retries and backoff for Jira API calls.
+- `download_all_issues`: paginated fetch loop over Jira issues.
+- `main`: orchestrates download, flatten, select columns, and write CSV.
+**Important logic & edge cases:**
+- If no issues are downloaded, it still writes an empty CSV with headers so downstream steps can run.
+- Only a subset of fields is requested (fields list), so other Jira data is not captured.
+**Error handling & fallbacks:**
+- Retries for 429/5xx with backoff; non-retryable errors raise `RuntimeError`.
+- If preferred columns are missing, it writes all flattened columns.
+**Side effects:** Writes CSV/logs; makes HTTP requests to Jira.
+**Performance notes:** Pagination defaults to 1000 issues per call with a hard cap of 200 batches.
+**How to run:** `python etl/1_download_jira_tickets.py`.
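+
+A condensed sketch of the pagination-plus-flatten pattern described above. The helper names mirror `_jira_get`/`download_all_issues`, but the signatures, the JQL string, and the retry policy shown here are illustrative assumptions:
+
+```python
+# Illustrative sketch of paginated Jira fetch + json_normalize flattening.
+import time
+import requests
+import pandas as pd
+
+JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
+
+def _jira_get(params, retries=3):
+    """GET with simple exponential backoff on transient errors (429/5xx)."""
+    for attempt in range(retries):
+        resp = requests.get(JIRA_SEARCH, params=params, timeout=30)
+        if resp.status_code in (429, 500, 502, 503, 504):
+            time.sleep(2 ** attempt)
+            continue
+        resp.raise_for_status()
+        return resp.json()
+    raise RuntimeError("Jira API kept failing after retries")
+
+def download_all_issues(jql, page_size=1000, max_batches=200):
+    issues, start_at = [], 0
+    for _ in range(max_batches):  # hard batch cap, as in the real script
+        data = _jira_get({"jql": jql, "startAt": start_at, "maxResults": page_size})
+        batch = data.get("issues", [])
+        issues.extend(batch)
+        if not batch or start_at + page_size >= data.get("total", 0):
+            break
+        start_at += page_size
+    return issues
+
+issues = download_all_issues("project = BOOKKEEPER ORDER BY created ASC")
+# Dot-separated columns (e.g. fields.status.name) match the raw CSV contract.
+df = pd.json_normalize(issues, sep=".")
+```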
+
+### etl/2_download_github_prs.py
+**Role in ETL:** Downloads GitHub pull requests for `apache/bookkeeper`, along with per-PR details (reviews, check-runs, combined statuses), and exports a raw CSV for merging. It implements multi-token rotation, caching with ETags, concurrency, and incremental updates to reduce API usage.
+**Inputs:** GitHub REST API endpoints for PR list, PR reviews, check-runs, and commit statuses. Tokens from `etl/env/github_tokens.env` and/or the `GITHUB_TOKENS` env var. Optionally a prior `github_prs_raw.csv` for incremental reuse.
+**Outputs:** `etl/output/csv/github_prs_raw.csv` (raw PRs + derived review/CI fields), `output/logs/download_github_prs.log`, cache file `etl/cache/github_http_cache.json`.
+**Main steps:**
+1. Initialize logging, token pool, HTTP session, and cache.
+2. Discover the last PR page (via the Link header) and fetch PR list pages concurrently.
+3. Optionally reuse unchanged PR rows from the prior CSV (incremental mode).
+4. For remaining PRs, fetch reviews, check-runs, and combined status in parallel.
+5. Derive review/CI summary signals and assemble row fields.
+6. Write the combined CSV and log quick stats.
+**Key functions/classes:**
+- `TokenPool`: rotates tokens on rate limit, tracks remaining/reset.
+- `_req_get`: GET with QPS throttling, ETag cache, retries, and token rotation.
+- `_list_all_prs_concurrent`: concurrent pagination fetch.
+- `_process_one_pr`: fetch per-PR details and assemble output row.
+- `main`: orchestrates the entire download and write.
+**Important logic & edge cases:**
+- Incremental mode reuses previous rows if `updated_at` is unchanged, skipping detail calls.
+- 304 responses reuse cached bodies; if the cache entry is missing, it retries without the ETag.
+- If no tokens are available, it proceeds unauthenticated (low rate limits).
+**Error handling & fallbacks:**
+- Transient errors (429/5xx) are retried with backoff or `Retry-After`.
+- A rate-limit 403 triggers token rotation or a sleep until reset.
+- Errors in per-PR processing are logged and skipped; the run continues.
+**Side effects:** Writes CSV/logs, persists the cache, and makes many network calls.
+**Performance notes:** Uses concurrency (two ThreadPoolExecutors) and pooled HTTP connections with QPS throttling.
+**How to run:** `python etl/2_download_github_prs.py`.
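+
+The ETag/token-rotation behavior can be sketched as follows. This is a simplification: the real `TokenPool` and `_req_get` also track remaining/reset quotas, QPS throttling, and `Retry-After`, and the signatures here are assumptions:
+
+```python
+# Simplified sketch of conditional (ETag) GET with token rotation on rate limit.
+import requests
+
+class TokenPool:
+    """Rotate through GitHub tokens when the current one is rate-limited."""
+    def __init__(self, tokens):
+        self.tokens, self.idx = list(tokens), 0
+    def current(self):
+        return self.tokens[self.idx % len(self.tokens)] if self.tokens else None
+    def rotate(self):
+        self.idx += 1
+
+cache = {}  # url -> {"etag": ..., "body": ...}; persisted to JSON in the real script
+
+def _req_get(url, pool, session, attempts=3):
+    if attempts == 0:
+        raise RuntimeError("rate-limited on every token")
+    headers = {}
+    if pool.current():
+        headers["Authorization"] = f"token {pool.current()}"
+    if url in cache:
+        headers["If-None-Match"] = cache[url]["etag"]  # conditional request
+    resp = session.get(url, headers=headers, timeout=30)
+    if resp.status_code == 304:            # unchanged: reuse the cached body
+        return cache[url]["body"]
+    if resp.status_code == 403:            # rate limit: rotate token and retry
+        pool.rotate()
+        return _req_get(url, pool, session, attempts - 1)
+    resp.raise_for_status()
+    body = resp.json()
+    if resp.headers.get("ETag"):
+        cache[url] = {"etag": resp.headers["ETag"], "body": body}
+    return body
+
+# Usage: _req_get(url, TokenPool(["ghp_..."]), requests.Session())
+```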
+
+### etl/3_clean_and_merge.py
+**Role in ETL:** Cleans the raw Jira issue and GitHub PR datasets, merges them on Jira key, and derives phase timestamps/durations for development, review, and testing. This script produces the canonical merged dataset used by subsequent analysis steps.
+**Inputs:** `etl/output/csv/jira_issues_raw.csv`, `etl/output/csv/github_prs_raw.csv`.
+**Outputs:** `etl/output/csv/jira_issues_clean.csv`, `etl/output/csv/github_prs_clean.csv`, `etl/output/csv/tickets_prs_merged.csv`, `etl/output/logs/clean_and_merge.log`.
+**Main steps:**
+1. Load the Jira and PR raw CSVs.
+2. Clean Jira tickets: drop duplicate keys, filter out unwanted resolutions, normalize timestamps.
+3. Clean PRs: extract the Jira key from title/body, normalize timestamps, handle missing `merged_at`.
+4. Merge tickets and PRs on Jira key (left join).
+5. Derive phase timestamps (dev/review/test) using PR-created and PR-merged/closed dates.
+6. Calculate phase durations and total resolution time in days.
+7. Write the cleaned and merged CSVs.
+**Key functions/classes:**
+- `extract_jira_key`: regex extraction of `BOOKKEEPER-`-style issue keys.
+- `clean_tickets`: dedupe/filter/normalize Jira data.
+- `clean_prs`: extract key + parse PR timestamps.
+- `derive_phase_times`: aggregate PR timestamps and compute durations.
+**Important logic & edge cases:**
+- Review end uses `merged_at` if available, otherwise `closed_at` as a proxy.
+- Duration values are set to `NaN` for missing timestamps or negative deltas.
+- If a column is missing, it logs a warning and continues (fail-soft).
+**Error handling & fallbacks:**
+- Soft failure: missing columns are logged and left as NaN.
+- No explicit try/except around CSV reads; exceptions propagate.
+**Side effects:** Writes multiple CSVs and logs.
+**Performance notes:** GroupBy aggregation per ticket key; otherwise straightforward pandas operations.
+**How to run:** `python etl/3_clean_and_merge.py`.
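+
+A minimal sketch of the phase-derivation rules above, assuming the column names from the data contracts in section 4; the real `derive_phase_times` covers all three phases, while this shows only the review phase:
+
+```python
+# Minimal sketch of derive_phase_times-style aggregation (review phase only).
+import numpy as np
+import pandas as pd
+
+def derive_review_times(prs: pd.DataFrame) -> pd.DataFrame:
+    """Aggregate PR timestamps per Jira key and compute a duration in days."""
+    prs = prs.copy()
+    # Review end prefers merged_at and falls back to closed_at as a proxy.
+    prs["review_end"] = prs["merged_at"].fillna(prs["closed_at"])
+    agg = prs.groupby("jira_key").agg(
+        first_pr_created=("created_at", "min"),
+        last_review_end=("review_end", "max"),
+    )
+    days = (agg["last_review_end"] - agg["first_pr_created"]).dt.total_seconds() / 86400.0
+    # Missing timestamps or negative deltas become NaN (fail-soft).
+    agg["review_duration_days"] = days.where(days >= 0, np.nan)
+    return agg.reset_index()
+
+prs = pd.DataFrame({
+    "jira_key": ["BOOKKEEPER-1", "BOOKKEEPER-1"],
+    "created_at": pd.to_datetime(["2021-01-01", "2021-01-05"]),
+    "merged_at": pd.to_datetime(["2021-01-10", None]),
+    "closed_at": pd.to_datetime(["2021-01-10", "2021-01-07"]),
+})
+print(derive_review_times(prs))
+```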
+
+### etl/4_summarize_and_plot.py
+**Role in ETL:** Provides simple descriptive statistics and a ticket-type pie chart from the merged dataset. Primarily for reporting and high-level summaries.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv`.
+**Outputs:** `etl/output/png/distribuzione_ticket_tipo.png`, `etl/output/csv/statistiche_riassuntive.csv`, `etl/output/logs/summarize_and_plot.log`.
+**Main steps:**
+1. Load the merged CSV.
+2. Compute counts by issue type and print them.
+3. Identify reopened, in-progress, and closed-without-PR tickets.
+4. Build and save a pie chart with a legend.
+5. Export the summary table as CSV.
+**Key functions/classes:** None (inline script).
+**Important logic & edge cases:**
+- The pie chart uses a legend instead of labels to avoid overlap.
+- Assumes columns like `fields.issuetype.name` and `fields.status.name` exist.
+**Error handling & fallbacks:**
+- No explicit error handling; missing columns would raise at runtime.
+**Side effects:** Writes PNG, CSV, and logs.
+**Performance notes:** O(n) aggregations; no heavy computation.
+**How to run:** `python etl/4_summarize_and_plot.py`.
+
+### etl/5_estimate_parameters.py
+**Role in ETL:** Computes global arrival/throughput metrics and per-phase duration summaries, exporting parameters needed for modeling or simulation. Also generates a backlog-over-time plot from created/resolution dates.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv`.
+**Outputs:** `etl/output/csv/phase_durations_wide.csv`, `etl/output/csv/phase_summary_stats.csv`, `etl/output/csv/parameter_estimates.csv`, `etl/output/png/backlog_over_time.png`, `etl/output/logs/estimate_parameters.log`.
+**Main steps:**
+1. Load the merged CSV and parse creation/resolution timestamps.
+2. Estimate inter-arrival times and the arrival rate (tickets/day).
+3. Compute global resolution-time statistics (mean/median days).
+4. Compute the backlog time series (open tickets per day) and plot it.
+5. Export per-phase duration columns (wide form) and summary stats.
+6. Export overall parameter estimates for downstream use.
+**Key functions/classes:**
+- `summarize_phase`: returns count, NaN share, mean, median, std, quartiles, min/max for a phase column.
+**Important logic & edge cases:**
+- The arrival rate is computed as the inverse of the mean inter-arrival time; if the mean is zero/NaN, the rate is NaN.
+- Backlog computation iterates daily across the full timeline (potentially large).
+- Warns if phase duration columns are missing (expects output from `3_clean_and_merge.py`).
+**Error handling & fallbacks:**
+- CSV load errors cause SystemExit.
+- Missing columns log warnings and skip the related computations.
+**Side effects:** Writes multiple CSVs/PNG and logs.
+**Performance notes:** Backlog computation loops over each day in range and scans all tickets, which can be O(n × days).
+**How to run:** `python etl/5_estimate_parameters.py`.
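+
+The arrival-rate estimate reduces to the inverse of the mean inter-arrival gap. A sketch using the `fields.created` column from the merged-CSV contract (the exact parsing options are assumptions):
+
+```python
+# Sketch of the arrival-rate estimate: inverse of the mean inter-arrival time.
+import numpy as np
+import pandas as pd
+
+df = pd.read_csv("etl/output/csv/tickets_prs_merged.csv")
+created = pd.to_datetime(df["fields.created"], errors="coerce", utc=True).sort_values()
+
+# Gaps in days between consecutive ticket creations.
+inter_arrival_days = created.diff().dt.total_seconds().div(86400.0).dropna()
+
+mean_gap = inter_arrival_days.mean()
+# The rate is NaN when the mean gap is zero or undefined, mirroring the script.
+arrival_rate_per_day = 1.0 / mean_gap if mean_gap and not np.isnan(mean_gap) else np.nan
+print(f"arrival rate ≈ {arrival_rate_per_day:.3f} tickets/day")
+```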
+
+### etl/6_diagnose_and_plot_tickets.py
+**Role in ETL:** Produces per-ticket diagnostic output and a histogram of resolution times to identify data quality issues. It is a verbose diagnostic utility rather than a core pipeline step.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv`.
+**Outputs:** `etl/output/png/distribuzione_resolution_times_0_10000.png`, `etl/output/logs/diagnose_tickets.log`, plus extensive console output.
+**Main steps:**
+1. Load the merged CSV and parse key timestamps if present.
+2. Compute resolution time in hours (if possible).
+3. Print each ticket’s key fields and flags for common inconsistencies.
+4. Plot a histogram of resolution times between 0 and 10,000 hours.
+5. Log warnings for detected inconsistencies.
+**Key functions/classes:** None (inline script).
+**Important logic & edge cases:**
+- Checks for missing keys, missing creation dates, resolution before creation, negative durations, and closed tickets without resolution dates.
+- The histogram uses a dynamic bin count based on sample size.
+**Error handling & fallbacks:**
+- CSV load errors cause exit(1).
+- If there is no resolution-time data, the histogram is skipped with a warning.
+**Side effects:** Writes PNG/logs; prints per-ticket details to stdout.
+**Performance notes:** Per-row iteration over the entire dataset; could be expensive on large datasets.
+**How to run:** `python etl/6_diagnose_and_plot_tickets.py`.
+
+### etl/7_fit_distributions.py
+**Role in ETL:** Fits candidate distributions (lognormal, Weibull, exponential, normal) to phase duration series and exports fit diagnostics and plots. Also produces a `fit_summary.csv` for downstream simulation configuration.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv`.
+**Outputs:**
+- `etl/output/csv/distribution_fit_stats.csv` (legacy resolution-time fit)
+- `etl/output/csv/distribution_fit_stats_development.csv`
+- `etl/output/csv/distribution_fit_stats_review.csv`
+- `etl/output/csv/distribution_fit_stats_testing.csv`
+- `etl/output/csv/fit_summary.csv`
+- `etl/output/png/confronto_fit_*.png`
+- `etl/output/logs/fit_distributions.log`
+**Main steps:**
+1. Load the merged CSV and compute `resolution_time_days` if missing.
+2. For legacy resolution time, fit distributions and export stats/plot.
+3. For each stage (development/review/testing), fit distributions on duration days.
+4. Compute MSE against KDE, KS p-values, AIC/BIC, and plausibility.
+5. Write per-stage stats CSVs and comparison plots.
+6. Write a compact `fit_summary.csv` using the best fit per stage.
+**Key functions/classes:**
+- `_fit_distribution_set`: runs KDE-based curve fitting and metrics for candidate distributions.
+- `_to_fit_summary_row`: maps the winner to SciPy naming/params.
+- `_mean_std_from_params`, `_plausible`, `_ks_aic_bic`: helper metrics.
+**Important logic & edge cases:**
+- Data is filtered to non-negative, finite values and capped at 10 years (`MAX_DAYS`).
+- Requires at least 10 data points; otherwise fitting is skipped.
+- The best fit is chosen by minimum MSE against the KDE curve.
+**Error handling & fallbacks:**
+- CSV load errors cause SystemExit.
+- If no fits are possible, it logs an error and does not output `fit_summary.csv`.
+**Side effects:** Writes multiple CSVs/PNGs/logs.
+**Performance notes:** KDE and curve fitting can be expensive; uses a 1000-point grid for each fit.
+**How to run:** `python etl/7_fit_distributions.py`.
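+
+A compressed sketch of the per-stage fitting loop: filter the sample, build a KDE reference curve on a 1000-point grid, fit each candidate, and pick the winner by MSE against the KDE. The candidate mapping, the `max_days` value, and the function signature are assumptions:
+
+```python
+# Sketch of KDE-referenced distribution fitting with MSE-based winner selection.
+import numpy as np
+from scipy import stats
+
+CANDIDATES = {
+    "lognorm": stats.lognorm,
+    "weibull_min": stats.weibull_min,
+    "expon": stats.expon,
+    "norm": stats.norm,
+}
+
+def fit_stage(durations_days, max_days=3650.0):
+    data = np.asarray(durations_days, dtype=float)
+    data = data[np.isfinite(data) & (data >= 0) & (data <= max_days)]
+    if len(data) < 10:  # minimum-sample guard, as in the script
+        return None
+    grid = np.linspace(data.min(), data.max(), 1000)
+    kde_pdf = stats.gaussian_kde(data)(grid)
+    results = {}
+    for name, dist in CANDIDATES.items():
+        params = dist.fit(data)
+        mse = float(np.mean((dist.pdf(grid, *params) - kde_pdf) ** 2))
+        ks_p = float(stats.kstest(data, name, args=params).pvalue)
+        results[name] = {"params": params, "mse_vs_kde": mse, "ks_pvalue": ks_p}
+    best = min(results, key=lambda n: results[n]["mse_vs_kde"])
+    return best, results
+
+sample = stats.lognorm(s=0.8, scale=5.0).rvs(size=200, random_state=0)
+print(fit_stage(sample)[0])  # typically "lognorm" on this synthetic sample
+```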
+
+### etl/8_export_fit_summary.py
+**Role in ETL:** Converts per-stage distribution fit statistics into a compact `fit_summary.csv` with SciPy-compatible parameter names and minimal fields for simulation. It is primarily a post-processing helper for `7_fit_distributions.py` outputs.
+**Inputs:** `etl/output/csv/distribution_fit_stats_<stage>.csv` files (or override paths). CLI options include stage selection and plausible-only filtering.
+**Outputs:** `etl/output/csv/fit_summary.csv`, `etl/output/logs/export_fit_summary.log`.
+**Main steps:**
+1. Parse CLI args for base directory, output CSV, stages, and optional overrides.
+2. For each stage, locate the matching distribution fit CSV.
+3. Parse the `Parametri` column into numeric lists.
+4. Choose the best-fit row (lowest MAE/MSE, then AIC/BIC).
+5. Map labels to SciPy distribution names and parameter fields.
+6. Write a compact summary CSV.
+**Key functions/classes:**
+- `parse_params`: robust parsing of `Parametri` strings into float lists.
+- `choose_winner`: selection by metric/AIC/BIC.
+- `map_to_scipy_row`: maps Italian distribution labels to SciPy naming.
+**Important logic & edge cases:**
+- Supports aliases for stage names (dev/development, test/testing, etc.).
+- Fails with SystemExit if required columns are missing or no CSV is found.
+**Error handling & fallbacks:**
+- Hard-fails on missing columns, missing files, or no plausible rows after filtering.
+**Side effects:** Writes CSV/logs.
+**Performance notes:** Minimal; just CSV I/O and sorting.
+**How to run:** `python etl/8_export_fit_summary.py --stages dev review testing` (defaults are set).
+
+### etl/9_enrich_feedback_cols.py
+**Role in ETL:** Enriches the merged dataset with derived feedback/capacity signals (review rounds, rework flag, CI fail→fix) and inferred developer/tester identifiers. It updates the existing merged CSV or writes to a specified output.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv` by default (CLI overridable).
+**Outputs:** `etl/output/csv/tickets_prs_merged.csv` by default (overwrites input) and `output/logs/enrich_feedback.log`.
+**Main steps:**
+1. Load the merged CSV from the CLI path.
+2. Derive `review_rounds` and `review_rework_flag` from numeric or string review signals.
+3. Derive `ci_failed_then_fix` from check/status histories or boolean flags.
+4. Infer `dev_user` and `tester` from prioritized columns.
+5. Log coverage statistics and missing signals.
+6. Write the enriched CSV.
+**Key functions/classes:**
+- `enrich`: central enrichment logic for review/CI/dev/tester columns.
+- `_to_listish`, `_has_fail_then_success`: parse and interpret list-like status fields.
+**Important logic & edge cases:**
+- For review rounds, numeric counters take precedence; otherwise, string states containing `CHANGES_REQUESTED` imply rework.
+- CI fail→fix requires a failure-like token preceding a success-like token.
+- If no candidate columns exist, it logs warnings and omits the derived columns.
+**Error handling & fallbacks:**
+- Relies on best-effort heuristics and logs when signals cannot be derived.
+- No try/except around the CSV load; errors propagate.
+**Side effects:** Writes CSV/logs; overwrites the input by default.
+**Performance notes:** Column-wise operations; expected to scale linearly.
+**How to run:** `python etl/9_enrich_feedback_cols.py --in-csv <path> --out-csv <path>`.
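+
+The fail→fix rule (a failure-like token must precede a success-like token) can be sketched as below; the token sets and the `_to_listish` fallback behavior are assumptions, and only the function names mirror the script:
+
+```python
+# Sketch of the CI fail->fix heuristic over serialized status lists.
+import ast
+
+FAIL_TOKENS = {"failure", "failed", "error", "timed_out"}
+SUCCESS_TOKENS = {"success", "passed"}
+
+def _to_listish(value):
+    """Best-effort parse of a list serialized into a CSV cell."""
+    if isinstance(value, list):
+        return value
+    try:
+        parsed = ast.literal_eval(str(value))
+        return list(parsed) if isinstance(parsed, (list, tuple)) else []
+    except (ValueError, SyntaxError):
+        return []
+
+def _has_fail_then_success(states):
+    seen_failure = False
+    for state in (str(s).lower() for s in _to_listish(states)):
+        if state in FAIL_TOKENS:
+            seen_failure = True
+        elif state in SUCCESS_TOKENS and seen_failure:
+            return True  # a failure was later followed by a success
+    return False
+
+print(_has_fail_then_success("['failure', 'success']"))  # True
+print(_has_fail_then_success("['success', 'failure']"))  # False
+```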
+
+### etl/X_exponentiality_diagnostics.py
+**Role in ETL:** Runs a diagnostic suite to test whether resolution times are exponential or heavy-tailed, using KDE, distribution fits, CDF/QQ plots, and KS tests. This appears to be an exploratory analysis script rather than a pipeline step.
+**Inputs:** `etl/output/csv/tickets_prs_merged.csv` (relative path `./output/csv/tickets_prs_merged.csv` from `etl/`).
+**Outputs:** `./output/png/diagnostic_*.png`, `../simulation/output/logs/exponentiality_diag.log`, stdout prints of skewness, kurtosis, and KS p-values.
+**Main steps:**
+1. Load the merged CSV and compute resolution time in hours.
+2. Filter resolution times to 0–10,000 hours.
+3. Fit exponential, normal, lognormal, and Weibull distributions.
+4. Plot a histogram with KDE + fitted PDFs.
+5. Plot empirical vs. theoretical CDFs.
+6. Generate QQ plots for each distribution.
+7. Plot log-survival curves for tail behavior.
+8. Print skewness, kurtosis, and KS p-values.
+**Key functions/classes:** None (top-level script).
+**Important logic & edge cases:**
+- Runs immediately on import (no main guard).
+- Exits if fewer than 10 valid data points remain.
+**Error handling & fallbacks:**
+- Minimal; relies on try/except around each distribution fit.
+**Side effects:** Writes PNGs/logs, prints to stdout.
+**Performance notes:** KDE and multiple fits can be expensive; uses 1000-point grids.
+**How to run:** `python etl/X_exponentiality_diagnostics.py` (from the `etl/` directory, to resolve relative paths).
+
+### etl/assignee_date.py
+**Role in ETL:** Extracts assignment and close dates from a Jira search JSON (including changelog histories) and writes a CSV. This is an auxiliary extractor and is not directly wired into the main ETL pipeline.
+**Inputs:** `etl/search.json` in the same directory (fixed filename).
+**Outputs:** `etl/output/search_output.csv` (relative to the script directory).
+**Main steps:**
+1. Load `search.json` containing Jira issues and changelog.
+2. For each issue, find the latest assignee-change date (or the creation date if none).
+3. For each issue, find the last transition to a “done” status (Closed/Done/Resolved), falling back to the resolution date.
+4. Write a CSV with issue key, assignment date, and close date.
+**Key functions/classes:**
+- `parse_iso`: normalizes Jira timestamp strings into `datetime`.
+- `get_assignment_date`: chooses the last assignee change or the creation date.
+- `get_close_date`: chooses the last done-status change or the resolution date.
+**Important logic & edge cases:**
+- Changelog parsing is purely history-based; if the history is absent, it falls back.
+- Assumes specific status names (`Closed`, `Done`, `Resolved`).
+**Error handling & fallbacks:**
+- If `search.json` is missing, it prints an error and returns.
+- Parsing errors return None and fall back to raw strings where applicable.
+**Side effects:** Writes CSV; prints status to stdout.
+**Performance notes:** Linear in the number of issues and history entries.
+**How to run:** `python etl/assignee_date.py`.
+
+## 3. Cross-Cutting Concerns
+- **Logging strategy:** Every script configures logging locally, typically to both a file under `etl/output/logs` (or `output/logs`) and stdout. There is no shared logging configuration, so format and log location vary slightly per script.
+- **Config strategy:** Paths are mostly derived from `path_config.PROJECT_ROOT`. The GitHub downloader additionally uses environment variables for concurrency, QPS, token locations, and incremental mode. Other scripts use hard-coded filenames relative to `PROJECT_ROOT`.
+- **Schema/contracts:** The core contract is `tickets_prs_merged.csv`, which includes Jira fields (e.g., `fields.created`, `fields.resolutiondate`) and PR fields (e.g., `created_at`, `closed_at`, `reviews_count`). There is no explicit schema validation; scripts assume columns exist and log warnings when missing (see the sketch after this list).
+- **Idempotency:** Download scripts overwrite their outputs; clean/merge and downstream analysis overwrite outputs as well. Re-running is mostly safe but can change results if upstream data has changed. Incremental mode in the GitHub downloader reuses old rows by `updated_at`.
+- **Determinism & randomness:** No randomness in the ETL scripts. Outcomes depend on live API responses and on the current CSVs.
+- **External integrations:** Jira API and GitHub API (network). GitHub credentials come from `etl/env/github_tokens.env` or the `GITHUB_TOKENS` env var; Jira uses anonymous access. No databases.
+- **Hard-coded paths and assumptions:** Many scripts assume execution from the repo root (or `etl/`) and specific relative output directories. `X_exponentiality_diagnostics.py` writes to `../simulation/output/...`, which is outside `etl` and assumes a particular working directory. `assignee_date.py` assumes `search.json` in the same folder.
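+
+Because the merged-CSV contract is implicit, a shared guard of the kind the scripts currently lack might look like the following; this helper is hypothetical and does not exist in `/etl`:
+
+```python
+# Hypothetical column guard; /etl today has no shared schema validation.
+import logging
+import pandas as pd
+
+REQUIRED_MERGED_COLUMNS = [
+    "fields.created",
+    "fields.resolutiondate",
+    "fields.issuetype.name",
+    "fields.status.name",
+]
+
+def check_contract(df: pd.DataFrame, required=REQUIRED_MERGED_COLUMNS) -> bool:
+    """Log (rather than raise) on missing columns, matching the fail-soft style."""
+    missing = [col for col in required if col not in df.columns]
+    if missing:
+        logging.warning("merged CSV is missing expected columns: %s", missing)
+    return not missing
+```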
+
+## 4. Data Contracts Map
+- **`etl/output/csv/jira_issues_raw.csv`**
+  - Producer: `etl/1_download_jira_tickets.py`
+  - Consumer: `etl/3_clean_and_merge.py`
+  - Location: `etl/output/csv/jira_issues_raw.csv`
+  - Minimal schema: `key`, `fields.*` (specifically `fields.created`, `fields.resolutiondate`, `fields.issuetype.name`, `fields.status.name`, and assignee subfields if present). Fields are flattened with dot notation; exact columns depend on the Jira API response.
+- **`etl/output/csv/github_prs_raw.csv`**
+  - Producer: `etl/2_download_github_prs.py`
+  - Consumer: `etl/3_clean_and_merge.py`
+  - Location: `etl/output/csv/github_prs_raw.csv`
+  - Minimal schema: `number`, `title`, `body` (optional), `created_at`, `updated_at`, `closed_at`, `merged_at` (optional), review and CI fields (`reviews_count`, `requested_changes_count`, `pull_request_review_states`, `check_runs_conclusions`, `combined_status_states`). Some fields are JSON-serialized lists.
+- **`etl/output/csv/jira_issues_clean.csv`**
+  - Producer: `etl/3_clean_and_merge.py`
+  - Consumer: None explicitly in `/etl` (intermediate output).
+  - Location: `etl/output/csv/jira_issues_clean.csv`
+  - Minimal schema: `key`, `fields.created`, `fields.resolutiondate`, `fields.resolution.name` (if available), plus cleaned/derived aliases.
+- **`etl/output/csv/github_prs_clean.csv`**
+  - Producer: `etl/3_clean_and_merge.py`
+  - Consumer: None explicitly in `/etl` (intermediate output).
+  - Location: `etl/output/csv/github_prs_clean.csv`
+  - Minimal schema: `jira_key`, `created_at`, `updated_at`, `closed_at`, `merged_at` (if available).
+- **`etl/output/csv/tickets_prs_merged.csv`**
+  - Producer: `etl/3_clean_and_merge.py`
+  - Consumers: `etl/4_summarize_and_plot.py`, `etl/5_estimate_parameters.py`, `etl/6_diagnose_and_plot_tickets.py`, `etl/7_fit_distributions.py`, `etl/9_enrich_feedback_cols.py`, `etl/X_exponentiality_diagnostics.py`.
+  - Location: `etl/output/csv/tickets_prs_merged.csv`
+  - Minimal schema: Jira columns (`fields.created`, `fields.resolutiondate`, `fields.issuetype.name`, `fields.status.name`), PR columns (`created_at`, `closed_at`, `merged_at` optionally), and derived phase durations (`dev_duration_days`, `review_duration_days`, `test_duration_days`) where produced.
+- **`etl/output/csv/phase_durations_wide.csv`**
+  - Producer: `etl/5_estimate_parameters.py`
+  - Consumer: None explicitly in `/etl`.
+  - Location: `etl/output/csv/phase_durations_wide.csv`
+  - Minimal schema: `key` (if present), `dev_duration_days`, `review_duration_days`, `test_duration_days`.
+- **`etl/output/csv/phase_summary_stats.csv`**
+  - Producer: `etl/5_estimate_parameters.py`
+  - Consumer: None explicitly in `/etl`.
+  - Location: `etl/output/csv/phase_summary_stats.csv`
+  - Minimal schema: columns from `summarize_phase` (phase, count, mean, median, std, quantiles).
+- **`etl/output/csv/parameter_estimates.csv`**
+  - Producer: `etl/5_estimate_parameters.py`
+  - Consumer: None explicitly in `/etl`.
+  - Location: `etl/output/csv/parameter_estimates.csv`
+  - Minimal schema: `arrival_rate_per_day`, `mean_resolution_time_days`, `median_resolution_time_days`, `throughput_monthly_mean`.
+- **`etl/output/csv/distribution_fit_stats*.csv`**
+  - Producer: `etl/7_fit_distributions.py`
+  - Consumer: `etl/8_export_fit_summary.py`
+  - Location: `etl/output/csv/distribution_fit_stats*.csv`
+  - Minimal schema: `Distribuzione`, `Parametri`, `MSE_KDE_PDF` (or MAE), plus KS/AIC/BIC if present.
+- **`etl/output/csv/fit_summary.csv`**
+  - Producer: `etl/7_fit_distributions.py` or `etl/8_export_fit_summary.py`
+  - Consumer: Not visible in `/etl`; likely used by simulation (outside scope).
+  - Location: `etl/output/csv/fit_summary.csv`
+  - Minimal schema: `stage`, `dist`, and distribution parameters (e.g., `s`, `loc`, `scale`, `mu`, `sigma`).
+- **`etl/output/png/*.png`**
+  - Producers: `etl/4_summarize_and_plot.py`, `etl/5_estimate_parameters.py`, `etl/6_diagnose_and_plot_tickets.py`, `etl/7_fit_distributions.py`, `etl/X_exponentiality_diagnostics.py`.
+  - Consumers: None in `/etl` (reporting artifacts).
+  - Location: `etl/output/png/` (except the diagnostics script, which also writes to `./output/png` and `../simulation/output/png`).
+
+## 5. Execution Graph
+```
+[etl/1_download_jira_tickets.py] --(writes jira_issues_raw.csv)--> [etl/3_clean_and_merge.py]
+[etl/2_download_github_prs.py] --(writes github_prs_raw.csv)--> [etl/3_clean_and_merge.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/4_summarize_and_plot.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/5_estimate_parameters.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/6_diagnose_and_plot_tickets.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/7_fit_distributions.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/9_enrich_feedback_cols.py]
+[etl/3_clean_and_merge.py] --(writes tickets_prs_merged.csv)--> [etl/X_exponentiality_diagnostics.py]
+
+[etl/7_fit_distributions.py] --(writes distribution_fit_stats_*.csv)--> [etl/8_export_fit_summary.py]
+
+# Import/call edges
+(path_config.PROJECT_ROOT) <- imported by most scripts for path resolution
+```
+
+## 6. Risks / Bugs / Smells (Code-Based)
+- **Working directory assumptions:** `X_exponentiality_diagnostics.py` uses relative paths (`./output/...` and `../simulation/output/...`) and runs at import time. Running it from a different directory will break path resolution or write to unexpected locations.
+- **No schema validation on the merged dataset:** Downstream scripts assume columns exist (e.g., `fields.issuetype.name`, `fields.status.name`). Missing columns will raise runtime errors without graceful handling (notably in `4_summarize_and_plot.py`).
+- **Potentially expensive backlog computation:** `5_estimate_parameters.py` computes the backlog by iterating over each day and scanning all rows, which can be O(n × days) and slow for large datasets.
+- **Silent drop of “bad” Jira resolutions:** `3_clean_and_merge.py` filters out tickets with certain resolution names; this can materially change totals but is not configurable or parameterized.
+- **Token rotation logic with anonymous fallback:** The GitHub downloader proceeds without auth tokens, which may lead to very slow or incomplete downloads; it logs warnings but does not block.
+- **Incremental mode risk:** `2_download_github_prs.py` reuses previous rows if `updated_at` matches; if derived fields change without `updated_at` changing (unlikely but possible), stale data may persist.
+
+## 7. Open Questions (only those truly blocked by missing context)
+- **Exact schemas of Jira and GitHub API responses** (which fields appear in the CSVs): depends on API responses and permissions. To verify, inspect a generated `jira_issues_raw.csv` and `github_prs_raw.csv` after running the download scripts.
+- **Presence of PR review/CI fields:** These fields depend on GitHub API availability and the repository’s CI configuration. To verify, run `2_download_github_prs.py` and inspect the columns in the output CSV.
+- **Upstream artifacts for `assignee_date.py`:** It requires a `search.json` file with changelog history, but this file is not produced by any script in `/etl`. To verify, locate or generate `etl/search.json` from the Jira API with `expand=changelog`.