Merged
168 changes: 117 additions & 51 deletions README.md
@@ -5,9 +5,11 @@ A CLI tool to identify pull request outliers in GitHub repositories using Z-scor
## Features

- **Fetch & Store**: retrieve PR data from GitHub (with rate-limit handling) and store it in a local SQLite database.
- **Outlier Detection**: Z-score analysis across multiple metrics — additions, deletions, changed files, comments, review duration, code churn, and comment density.
- **Classify**: Z-score analysis across multiple metrics — additions, deletions, changed files, comments, review duration, code churn, and comment density.
- **Baseline window**: define a historical measurement period so recent PRs are evaluated against an independent baseline rather than skewing their own statistics.
- **Primary-branch filter**: focus analysis on PRs that were not merged into the primary branch (e.g. feature-to-feature or abandoned branches).
- **Flexible output**: view results as a terminal table or export to JSON/CSV.
- **Deferred output**: when processing multiple repositories, results for all repos are printed together after all processing completes, with a summary of any repos that could not be classified.

## Installation

@@ -21,7 +23,7 @@ uv sync

## Usage

The tool works in two steps: **fetch** data, then **detect-outliers**.
The tool has three commands: **fetch**, **classify**, and **fetch-and-classify**.

### 1. Configure GitHub Token

@@ -34,17 +36,19 @@ Without a token the GitHub API rate limit is very low.
### 2. `fetch` — retrieve and store PR data

```bash
# Fetch PRs merged in the last 30 days (default) for a specific repo
# Fetch PRs created in the last 30 days (default) for a specific repo
uv run review-classify fetch --repo owner/repo

# Fetch PRs for an entire organization
uv run review-classify fetch --org your-org

# Fetch PRs within a specific date range
uv run review-classify fetch --repo owner/repo --start 2024-01-01 --end 2024-06-30
uv run review-classify fetch --repo owner/repo \
--collate-start 2024-01-01 --collate-end 2024-06-30

# Clear existing data before fetching
uv run review-classify fetch --repo owner/repo --reset-db --start 2024-01-01
uv run review-classify fetch --repo owner/repo \
--reset-db --collate-start 2024-01-01

# Run fetching using a TOML configuration file
uv run review-classify fetch --config config.toml
@@ -55,25 +59,30 @@ uv run review-classify fetch --config config.toml
| `--repo` / `-r` | GitHub repository (owner/repo). Can be specified multiple times. |
| `--org` / `-o` | GitHub organization. Fetches all repositories in the org. Can be specified multiple times. |
| `--config` / `-c` | Path to a TOML config file defining multiple repositories/organizations. |
| `--start` / `-s` | Start date for PR range (YYYY-MM-DD). Defaults to 30 days ago. |
| `--end` / `-e` | End date for PR range (YYYY-MM-DD). |
| `--collate-start` | Start date for PR collation range (YYYY-MM-DD). Defaults to 30 days ago. |
| `--collate-end` | End date for PR collation range (YYYY-MM-DD). |
| `--reset-db` | Delete all stored data before fetching. |
| `--verbose` / `-v` | Print progress details. |

### 3. `detect-outliers` — find unusual PRs
### 3. `classify` — find unusual PRs

Operates on data already fetched with `fetch`. Results for all repositories are printed together after all repos have been processed.

```bash
# Detect outliers across all stored PRs for a repo
uv run review-classify detect-outliers --repo owner/repo
# Classify all stored PRs for a repo
uv run review-classify classify --repo owner/repo

# Detect outliers for an entire organization
uv run review-classify detect-outliers --org your-org
# Classify PRs for an entire organization
uv run review-classify classify --org your-org

# Stricter threshold (fewer, more extreme outliers)
uv run review-classify detect-outliers --repo owner/repo --threshold 3.0
uv run review-classify classify --repo owner/repo --threshold 3.0

# Export to JSON
uv run review-classify detect-outliers --repo owner/repo --format json > outliers.json
uv run review-classify classify --repo owner/repo --format json > outliers.json

# Exclude PRs merged into the primary branch (main/master)
uv run review-classify classify --repo owner/repo --exclude-primary-merged
```

| Option | Description |
@@ -84,69 +93,126 @@ uv run review-classify detect-outliers --repo owner/repo --format json > outlier
| `--threshold` / `-t` | Z-score threshold for flagging an outlier. Default: `2.0`. |
| `--min-samples` | Minimum number of PRs required for analysis. Default: `30`. |
| `--format` / `-f` | Output format: `table` (default), `json`, or `csv`. |
| `--classify-start` | Start of the baseline measurement window (YYYY-MM-DD). |
| `--classify-end` | End of the baseline measurement window (YYYY-MM-DD). |
| `--start` | Start of the classification window (YYYY-MM-DD). |
| `--end` | End of the classification window (YYYY-MM-DD). |
| `--exclude-primary-merged` | Exclude PRs whose base branch is `main` or `master`. |
| `--verbose` / `-v` | Print progress details. |

#### Baseline window (`--classify-start` / `--classify-end`)
#### Classification window (`--start` / `--end`)

By default all stored PRs feed both the baseline statistics and the outlier evaluation. This is problematic: an unusually large PR inflates the mean and standard deviation it is measured against, masking itself as normal.

Use `--classify-start` and `--classify-end` to define a historical baseline window. Statistics are computed from PRs merged **within** that window; only PRs merged **after** `--classify-end` are evaluated and reported.
Use `--start` and `--end` to define the classification window, which serves as the historical baseline. Statistics are computed from PRs merged **within** that window; only PRs merged **after** `--end` are evaluated and reported.

```
[--classify-start ────────── --classify-end] >classify-end
baseline start baseline end PRs evaluated here
[--start ──────────────── --end] >end
baseline start baseline end PRs evaluated here
```
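
The self-masking effect described above can be illustrated with a minimal Python sketch. The numbers are made up, and the use of the sample standard deviation is an assumption; the tool's actual statistics code is not shown in this README.

```python
from statistics import mean, stdev


def z_score(value, sample):
    """Z-score of `value` against the mean and sample std-dev of `sample`."""
    sigma = stdev(sample)
    return (value - mean(sample)) / sigma if sigma else 0.0


# Hypothetical "additions" counts for PRs merged inside the baseline window.
baseline = [120, 95, 140, 110, 130, 105, 125, 115]
# A much larger PR merged after the window closes.
candidate = 600

# Pooled: the candidate contributes to the statistics it is judged against.
pooled = z_score(candidate, baseline + [candidate])
# Independent: statistics come only from the baseline window.
independent = z_score(candidate, baseline)

print(f"pooled z = {pooled:.2f}, independent z = {independent:.2f}")
```

With pooled statistics a single value can never score above (n - 1)/√n for n PRs (about 2.67 for nine), however large it is, so a run with `--threshold 3.0` would miss this PR. Scored against the independent baseline, the same PR lands far above the threshold.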

```bash
# Use Jan–Jun 2024 as the baseline; evaluate PRs merged after 2024-06-30
uv run review-classify detect-outliers --repo owner/repo \
--classify-start 2024-01-01 \
--classify-end 2024-06-30
uv run review-classify classify --repo owner/repo \
--start 2024-01-01 \
--end 2024-06-30

# Same, with stricter threshold and JSON output
uv run review-classify detect-outliers --repo owner/repo \
--classify-start 2024-01-01 \
--classify-end 2024-06-30 \
uv run review-classify classify --repo owner/repo \
--start 2024-01-01 \
--end 2024-06-30 \
--threshold 2.5 \
--format json > outliers.json
```

#### Per-repository analysis
#### Excluding primary-branch PRs (`--exclude-primary-merged`)

Pass `--exclude-primary-merged` to restrict analysis to PRs that were **not** merged into `main` or `master`. This is useful for focusing on PRs that target feature or release branches, or that may have been abandoned.

```bash
uv run review-classify classify --repo owner/repo --exclude-primary-merged
```
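
As a rough illustration of what this filter does, here is a minimal Python sketch. The `PullRequest` type and its `base_branch` field are hypothetical names chosen for the example, not the tool's actual data model.

```python
from dataclasses import dataclass

# Branch names treated as "primary", per the option description.
PRIMARY_BRANCHES = {"main", "master"}


@dataclass
class PullRequest:
    # Hypothetical record; the real schema may differ.
    number: int
    base_branch: str


def exclude_primary_merged(prs):
    """Keep only PRs whose base branch is not a primary branch."""
    return [pr for pr in prs if pr.base_branch not in PRIMARY_BRANCHES]


prs = [
    PullRequest(1, "main"),
    PullRequest(2, "feature/login"),
    PullRequest(3, "master"),
    PullRequest(4, "release/1.2"),
]
kept = exclude_primary_merged(prs)
print([pr.number for pr in kept])
```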

### 4. `fetch-and-classify` — fetch and classify in one step

Combines both steps. If PR data already exists in the local database for a repository, the fetch is skipped automatically. Use `--reset-db` to force a fresh fetch.

```bash
# Fetch (if needed) and classify in one command
uv run review-classify fetch-and-classify --repo owner/repo

# With explicit date ranges for both collation and classification
uv run review-classify fetch-and-classify --repo owner/repo \
--collate-start 2024-01-01 --collate-end 2024-12-31 \
--start 2024-01-01 --end 2024-06-30

# Force a fresh fetch even if data already exists
uv run review-classify fetch-and-classify --repo owner/repo --reset-db

# Exclude primary-branch PRs from the classification
uv run review-classify fetch-and-classify --repo owner/repo \
--exclude-primary-merged
```

| Option | Description |
| --- | --- |
| `--repo` / `-r` | GitHub repository (owner/repo). Can be specified multiple times. |
| `--org` / `-o` | GitHub organization. Can be specified multiple times. |
| `--config` / `-c` | Path to a TOML config file. |
| `--collate-start` | Start date for PR collation range (YYYY-MM-DD). |
| `--collate-end` | End date for PR collation range (YYYY-MM-DD). |
| `--start` | Start of the classification window (YYYY-MM-DD). |
| `--end` | End of the classification window (YYYY-MM-DD). |
| `--threshold` / `-t` | Z-score threshold for flagging an outlier. Default: `2.0`. |
| `--min-samples` | Minimum number of PRs required for analysis. Default: `30`. |
| `--format` / `-f` | Output format: `table` (default), `json`, or `csv`. |
| `--exclude-primary-merged` | Exclude PRs whose base branch is `main` or `master`. |
| `--reset-db` | Delete existing data and force a fresh fetch. |
| `--verbose` / `-v` | Print progress details. |
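
The skip-if-present behaviour of `fetch-and-classify` could be sketched roughly as follows. The `pull_requests` table, its `repo` column, and the function names are assumptions made for illustration; the tool's real SQLite schema is not documented here.

```python
import sqlite3


def has_stored_prs(conn, repo):
    """Return True if at least one PR row exists for `repo`."""
    row = conn.execute(
        "SELECT 1 FROM pull_requests WHERE repo = ? LIMIT 1", (repo,)
    ).fetchone()
    return row is not None


def should_fetch(conn, repo, reset_db=False):
    """Fetch when forced via reset_db, or when nothing is stored yet."""
    return reset_db or not has_stored_prs(conn, repo)


# In-memory database with a hypothetical schema and one stored PR.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pull_requests (repo TEXT, number INTEGER)")
conn.execute("INSERT INTO pull_requests VALUES ('owner/repo-a', 1)")

print(should_fetch(conn, "owner/repo-a"))                 # data exists
print(should_fetch(conn, "owner/repo-b"))                 # nothing stored
print(should_fetch(conn, "owner/repo-a", reset_db=True))  # forced
```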

### Per-repository analysis

Outlier detection is always **scoped to a single repository**. When you target multiple repositories (via `--org`, multiple `--repo` flags, or a config file), each repository is analysed independently:

1. **Baseline statistics** — mean and standard deviation for every metric are computed from that repository's own merged PRs (optionally restricted to the baseline window).
1. **Baseline statistics** — mean and standard deviation for every metric are computed from that repository's own merged PRs (optionally restricted to the classification window).
2. **Z-scores** — each PR is scored against its own repository's statistics, not a cross-repository pool.
3. **Isolation** — a PR in `owner/repo-a` is never compared against PRs from `owner/repo-b`.

This means thresholds adapt to each project's natural pace and size. A large PR in a small, infrequently-updated repository is judged against that repository's history, not the (potentially very different) norms of a busier sibling repository in the same organisation.
This means thresholds adapt to each project's natural pace and size.

```
repo-a ──► stats(repo-a) ──► z-scores(repo-a PRs)
repo-b ──► stats(repo-b) ──► z-scores(repo-b PRs)
(independent)
```
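
The per-repository scoping above can be sketched in Python as below. This is an illustration under assumptions: PRs are plain dicts with hypothetical `repo`, `number`, and `additions` keys, and the sample standard deviation is used.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscores_by_repo(prs, metric="additions"):
    """Score each PR against the statistics of its own repository only."""
    grouped = defaultdict(list)
    for pr in prs:
        grouped[pr["repo"]].append(pr)

    scores = {}
    for repo, repo_prs in grouped.items():
        values = [pr[metric] for pr in repo_prs]
        if len(values) < 2:
            continue  # stdev needs at least two data points
        mu, sigma = mean(values), stdev(values)
        for pr in repo_prs:
            z = (pr[metric] - mu) / sigma if sigma else 0.0
            scores[(repo, pr["number"])] = z
    return scores


prs = [
    {"repo": "owner/repo-a", "number": n, "additions": a}
    for n, a in [(1, 10), (2, 12), (3, 11), (4, 50)]
] + [
    {"repo": "owner/repo-b", "number": n, "additions": a}
    for n, a in [(1, 1000), (2, 1100), (3, 900), (4, 1050)]
]
scores = zscores_by_repo(prs)
for key, z in sorted(scores.items()):
    print(key, round(z, 2))
```

In this made-up data, the 50-line PR in `owner/repo-a` scores well above that repository's mean, even though it is tiny next to every PR in `owner/repo-b`.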

### Deferred output

When processing multiple repositories, per-repo results are **not** printed as they are produced. Instead:

- After all repositories have been processed, results for every successfully classified repo are printed.
- Repositories that could not be classified (insufficient data, no PRs found, etc.) are listed in a summary block on stderr at the end.
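
The deferred-output pattern described above might look like this in Python. The `run_all` function, the `classify` callable, and the use of `ValueError` to signal an unclassifiable repo are all assumptions for the sake of the sketch.

```python
import io
import sys


def run_all(repos, classify, out=sys.stdout, err=sys.stderr):
    """Process every repo first, then print results and a failure summary."""
    results, skipped = {}, {}
    for repo in repos:
        try:
            results[repo] = classify(repo)
        except ValueError as exc:  # e.g. insufficient data
            skipped[repo] = str(exc)

    # Nothing is printed until all repositories have been processed.
    for repo, outliers in results.items():
        print(f"{repo}: {len(outliers)} outlier(s)", file=out)
    if skipped:
        print("Could not classify:", file=err)
        for repo, reason in skipped.items():
            print(f"  {repo}: {reason}", file=err)
    return results, skipped


def fake_classify(repo):
    """Stand-in classifier that fails for one repository."""
    if repo == "owner/repo-b":
        raise ValueError("insufficient data")
    return [101, 102]


out, err = io.StringIO(), io.StringIO()
results, skipped = run_all(
    ["owner/repo-a", "owner/repo-b"], fake_classify, out, err
)
print(out.getvalue(), end="")
print(err.getvalue(), end="")
```

Keeping results on stdout and the summary on stderr means a redirect such as `--format json > outliers.json` would capture only the results.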

### End-to-end example

```bash
# 1. Fetch a full year of history as the baseline
# Option A — two explicit steps
uv run review-classify fetch --repo owner/repo \
--start 2024-01-01 --end 2024-12-31
--collate-start 2024-01-01 --collate-end 2024-12-31

# 2. Evaluate PRs from January 2025 against that baseline
uv run review-classify detect-outliers --repo owner/repo \
--classify-start 2024-01-01 \
--classify-end 2024-12-31 \
uv run review-classify classify --repo owner/repo \
--start 2024-01-01 \
--end 2024-12-31 \
--format table

# Option B — single combined command
uv run review-classify fetch-and-classify --repo owner/repo \
--collate-start 2024-01-01 --collate-end 2024-12-31 \
--start 2024-01-01 --end 2024-12-31
```

## Configuration file

Both `fetch` and `detect-outliers` accept `--config <file.toml>` as an alternative to passing `--repo` / `--org` flags. The file is TOML and supports three sections:
`fetch`, `classify`, and `fetch-and-classify` all accept `--config <file.toml>` as an alternative to passing `--repo` / `--org` flags. The file is TOML and supports three sections:

| Section | Purpose |
| --- | --- |
@@ -160,12 +226,12 @@ Both `fetch` and `detect-outliers` accept `--config <file.toml>` as an alternati
# config.toml

[defaults]
start = "2024-01-01"
end = "2024-12-31"
threshold = 2.0
min_samples = 30
classify_start = "2024-01-01"
classify_end = "2024-06-30"
collate_start = "2024-01-01"
collate_end = "2024-12-31"
threshold = 2.0
min_samples = 30
start = "2024-01-01"
end = "2024-06-30"

# Individual repositories ─────────────────────────────────────────────────────

@@ -174,11 +240,11 @@ name = "owner/repo-a"
# inherits all [defaults]

[[repositories]]
name = "owner/repo-b"
start = "2024-06-01" # overrides [defaults] start
threshold = 2.5 # stricter outlier threshold for this repo
classify_start = "2024-06-01"
classify_end = "2024-09-30"
name = "owner/repo-b"
collate_start = "2024-06-01" # overrides [defaults] collate_start
threshold = 2.5 # stricter outlier threshold for this repo
start = "2024-06-01"
end = "2024-09-30"

# Organizations ───────────────────────────────────────────────────────────────

@@ -188,9 +254,9 @@ name = "my-org"
exclude_repos = ["my-org/archived-repo", "my-org/fork-only"]

[[organizations]]
name = "another-org"
start = "2024-03-01"
min_samples = 20
name = "another-org"
collate_start = "2024-03-01"
min_samples = 20
```
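
The inheritance hinted at by the comments in the example (`inherits all [defaults]`, `overrides [defaults] collate_start`) can be sketched as a simple dictionary merge. This assumes the TOML has already been parsed into plain dicts and that per-entry keys simply win over `[defaults]`; the function name is hypothetical and the tool's actual precedence rules may be more detailed.

```python
def effective_settings(defaults, entry):
    """Start from [defaults], then let the entry's own keys override."""
    merged = dict(defaults)
    merged.update(entry)
    return merged


# Values copied from the example config above.
defaults = {
    "collate_start": "2024-01-01",
    "collate_end": "2024-12-31",
    "threshold": 2.0,
    "min_samples": 30,
    "start": "2024-01-01",
    "end": "2024-06-30",
}
repo_b = {
    "name": "owner/repo-b",
    "collate_start": "2024-06-01",
    "threshold": 2.5,
    "start": "2024-06-01",
    "end": "2024-09-30",
}

settings = effective_settings(defaults, repo_b)
for key in sorted(settings):
    print(f"{key} = {settings[key]!r}")
```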

### Key rules