-
Notifications
You must be signed in to change notification settings - Fork 1
Probe concurrency guardrails and operations #43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| # PaperScout architecture — concurrency | ||
|
|
||
| This document describes what runs on the asyncio event loop, what runs in threads, and how to add new data sources without introducing races. | ||
|
|
||
| ## Event loop (async, cooperative single-thread) | ||
|
|
||
| These components share one thread and the main event loop. They may await I/O but must not block the loop with synchronous work: | ||
|
|
||
| - **`Scheduler.run_forever` / `poll_once`** — orchestrates index refresh, probing, and notifications. | ||
| - **`WG21Index.refresh`** — fetches and parses wg21.link index (httpx async). | ||
| - **`ISOProber.run_cycle` / `_probe_one`** — concurrent HEAD probes via `asyncio.gather` and an httpx async client. | ||
| - **Slack Bolt handlers** — run on Bolt’s thread; they should not read mutable source state directly (use snapshots or health callbacks). | ||
|
|
||
| `ISOProber._stats` is updated from many coroutines in one `run_cycle()`. This is safe on the event loop because asyncio never preempts between awaits. A `threading.Lock` guards `_stats` as defense-in-depth if code is ever called from a worker thread by mistake. | ||
|
|
||
| ## Threads | ||
|
|
||
| | Thread | Role | | ||
| |--------|------| | ||
| | **Health server** (`health.py`) | Serves `GET /health`; reads `len(index.papers)` via a callback and scheduler snapshot fields. | | ||
| | **MessageQueue sender** (`scout.py`) | Drains Slack post queue with rate limiting. | | ||
| | **`run_blocking_io` / `asyncio.to_thread`** | Runs blocking psycopg2 calls (e.g. `UserWatchlist.matches_for_users`) off the loop. | | ||
|
|
||
| ## Concurrency rules | ||
|
|
||
| When adding or changing code: | ||
|
|
||
| 1. **Use `run_blocking_io()` (or `asyncio.to_thread`) only for pure blocking I/O** with no shared in-process mutable state. The function should use its own DB connection from the pool. | ||
| 2. **Never access `ISOProber._stats`, `WG21Index.papers`, `WG21Index._max_rev`, or other source internals from a thread.** Read them only on the event loop, or use lock-protected snapshots (`snapshot_stats()`). | ||
| 3. **`WG21Index.papers` is replaced wholesale on every `refresh()`** — do not mutate the dict in place. Assign a new dict from `_parse_and_index()`. | ||
| 4. **New HTTP data sources** should follow the async pattern (`httpx` + coroutines on the loop), like `WG21Index` and `ISOProber`. The optional open-std.org scraper in `sources.py` is a future extension point: if integrated, either keep it async on the loop or isolate it in a thread with no shared mutable state. | ||
|
|
||
| ## Related docs | ||
|
|
||
| - [probe-operations.md](probe-operations.md) — production probe volume, tuning, troubleshooting. | ||
| - [probe-performance.md](probe-performance.md) — synthetic CI benchmark (mock server, not isocpp.org). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| # ISO probe operations guide | ||
|
|
||
| This document describes the **production** isocpp.org HEAD probe cycle: normal volume, timing, configuration, and what to do when something degrades. For the **synthetic CI benchmark** (mock server), see [probe-performance.md](probe-performance.md). | ||
|
|
||
| ## Normal envelope (default settings) | ||
|
|
||
| At default configuration (~4,000 active P-numbers, `HTTP_CONCURRENCY=20`, 30-minute poll interval): | ||
|
|
||
| | Metric | Typical value | | ||
| |--------|----------------| | ||
| | HEAD requests per cycle | **~1,600–2,000** | | ||
| | Wall-clock duration | **~8–10 s** (depends on isocpp.org latency) | | ||
| | Hot probes | Watchlist + frontier window + recent index papers — every cycle | | ||
| | Cold probes | **1 / `COLD_CYCLE_DIVISOR`** of the cold pool per cycle (default 48 → full tail in ~24 h) | | ||
|
|
||
| Each cycle logs **`PROBE-START`** (human-readable) and **`PROBE-CYCLE-SUMMARY`** (JSON) with `cycle_requests`, `hot_probes`, `cold_probes`, `cycle_duration_s`, and per-outcome counts. | ||
|
|
||
| Hot/cold split is **not a fixed ratio** — it varies with frontier size, watchlist, and index dates. Use `hot_probes` / `cold_probes` in the summary line to see the split for a given cycle. | ||
|
|
||
| ## Key settings | ||
|
|
||
| > **Note:** Some issue templates refer to `POLL_INTERVAL_SECONDS`; the actual environment variable is **`POLL_INTERVAL_MINUTES`**. | ||
|
|
||
| | Variable | Default | Operational effect | Recommended range | | ||
| |----------|---------|-------------------|-------------------| | ||
| | `HTTP_CONCURRENCY` | `20` | Max concurrent HEAD requests (semaphore) | **10–20**; lower if you see 429s or high `errors` | | ||
| | `POLL_INTERVAL_MINUTES` | `30` | Target time between poll cycles | **30** default; **increase** if cycles routinely overrun the interval | | ||
| | `POLL_OVERRUN_COOLDOWN_SECONDS` | `300` | Minimum sleep after a cycle longer than the interval | **300** default; prevents tight loops when work backs up | | ||
| | `COLD_CYCLE_DIVISOR` | `48` | Cold pool sliced across N cycles (48 × 30 min ≈ 24 h) | Raise to spread load; lower for faster cold coverage | | ||
| | `ALERT_MODIFIED_HOURS` | `24` | Only alert on hits with recent `Last-Modified` | — | | ||
|
|
||
| See [`.env.example`](../.env.example) and the README env tables for frontier, hot, and cold tuning (`FRONTIER_WINDOW_*`, `HOT_LOOKBACK_MONTHS`, etc.). | ||
|
|
||
| ## Degradation signals | ||
|
|
||
| | Signal | What it means | | ||
| |--------|----------------| | ||
| | `cycle_duration_s` > `POLL_INTERVAL_MINUTES` × 60 | Cycle **overrun** — scheduler applies `POLL_OVERRUN_COOLDOWN_SECONDS` before the next poll | | ||
| | No successful poll for **> 2 × `POLL_INTERVAL_MINUTES`** | Stale-poll ops alert (see `monitor.py`) — treat as incident | | ||
| | `errors / cycle_requests` **> 5%** for 2+ consecutive cycles | Network or upstream outage — check logs for `PROBE-ERR` | | ||
| | HTTP **429** from isocpp.org | Rate limiting — reduce concurrency and/or widen poll interval | | ||
|
|
||
| Parse **`PROBE-CYCLE-SUMMARY`** from logs (grep or log aggregator) for machine-readable fields: `hit_total`, `miss`, `errors`, `skipped_discovered`, `skipped_in_index`. | ||
|
|
||
| ## What to do if… | ||
|
|
||
| ### Cycle takes longer than the poll interval | ||
|
|
||
| 1. Grep logs for `PROBE-CYCLE-SUMMARY` — note `cycle_duration_s` and `cycle_requests`. | ||
| 2. If duration tracks request count, isocpp.org may be slow; consider lowering `HTTP_CONCURRENCY` slightly (less burst) or increasing `POLL_INTERVAL_MINUTES`. | ||
| 3. Confirm `POLL_OVERRUN_COOLDOWN_SECONDS` is in effect (see `SCHEDULER-SLEEP` lines). | ||
|
|
||
| ### Error rate exceeds 5% | ||
|
|
||
| 1. Compute `errors / cycle_requests` from the last few `PROBE-CYCLE-SUMMARY` lines. | ||
| 2. If sustained, check network and isocpp.org status; avoid raising `HTTP_CONCURRENCY` until errors drop. | ||
| 3. Review `PROBE-ERR` debug lines for `failure_category` (timeout vs network). | ||
|
|
||
| ### isocpp.org returns 429 | ||
|
|
||
| 1. Halve `HTTP_CONCURRENCY` (e.g. 20 → 10). | ||
| 2. Increase `POLL_INTERVAL_MINUTES` (e.g. 30 → 45) to reduce sustained request rate. | ||
| 3. Monitor `errors` and 429 patterns for several cycles before tuning back up. | ||
|
|
||
| ## Related documentation | ||
|
|
||
| - [README — Two-Frequency Probing Strategy](../README.md#two-frequency-probing-strategy) | ||
| - [handoff.md](handoff.md) — design rationale | ||
| - [architecture.md](architecture.md) — concurrency model | ||
| - [probe-performance.md](probe-performance.md) — CI benchmark and regression gate |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| """Helpers for running blocking I/O off the asyncio event loop.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import asyncio | ||
| from collections.abc import Callable | ||
| from typing import TypeVar | ||
|
|
||
| _T = TypeVar("_T") | ||
|
|
||
|
|
||
| async def run_blocking_io(fn: Callable[..., _T], /, *args, **kwargs) -> _T: | ||
| """Run *fn* in a worker thread without blocking the event loop. | ||
|
|
||
| Use only for pure blocking I/O (e.g. psycopg2 queries) that does **not** | ||
| touch shared in-process mutable state such as ``ISOProber._stats``, | ||
| ``WG21Index.papers``, or other source internals. Each call should use its | ||
| own database connection from the pool. | ||
|
|
||
| See ``docs/architecture.md`` for the full concurrency model. | ||
| """ | ||
| return await asyncio.to_thread(fn, *args, **kwargs) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.