fix(ci): run shard 3 single-worker to prevent GC-finaliser hang#1093
fix(ci): run shard 3 single-worker to prevent GC-finaliser hang#1093curdriceaurora merged 7 commits intomainfrom
Conversation
tests/api and tests/web accumulate ~170 TestClient objects that are created without using the context-manager form. When an xdist worker process exits, Python's GC tries to finalise these objects; their async shutdown code deadlocks because no event loop is running at that point. The hang was first visible in PR #1091's CI run (shard 3 cancelled on both py3.11 and py3.12) and persisted through PR #1092 — root cause is TestClient accumulation, not test count. Fix: run shard 3 with -n=1 (single xdist worker) instead of -n=auto (2 workers on ubuntu-latest). A single worker completes its teardown while the pytest-asyncio event loop is still alive, avoiding the finaliser deadlock. Shard 3's ~1 600 tests run in ~33 s serially, so the 20-minute hard cap is not a concern. Applied to both ci.yml and ci-full.yml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughCI test jobs compute a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…anize_routes.py tests/web/test_web_organize_routes.py had 8 POST requests to /ui/organize/scan without a CSRF token, causing 403 Forbidden. These appeared as F markers in the shard 3 CI progress lines (31–40%) but were never surfaced in a summary because the job was cancelled by the 20-minute GC-hang timeout before printing results. Fix follows the same x-csrf-token header pattern established in PR #1092: add a _get_csrf_token() helper that seeds the cookie via GET /ui/organize and returns the token, then pass it via x-csrf-token header (not form field) so the BaseHTTPMiddleware body-consumption issue does not affect Form() parameters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Review Summary by Qodofix(ci): run shard 3 with WalkthroughsDescription• Run shard 3 tests single-worker • Avoid xdist GC finalizer deadlock • Keep other shards on -n=auto Diagramflowchart LR
A["GitHub Actions CI workflow"]
B["Shard 3 (tests/api + tests/web)"]
C["Set WORKERS=1"]
D["pytest -n=$WORKERS"]
E["Avoid GC finaliser deadlock"]
A -- "matrix shard=3" --> B
B -- "select workers" --> C
C -- "controls" --> D
D -- "prevents hang on exit" --> E
File Changes1. .github/workflows/ci-full.yml
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/ci.yml:
- Around line 233-240: The current approach still passes -n=1 to pytest which
spawns an xdist worker; change the WORKERS handling and pytest invocation so
shard 3 runs tests in-process (no xdist). Specifically, update the WORKERS
assignment/logic (symbol: WORKERS and matrix.shard) so it yields an
empty/omitted value for shard "3" (or a value that results in no -n flag), and
change the pytest command (the line invoking pytest with --timeout and
--override-ini) to only add the -n flag when WORKERS is set; do not pass -n=1 —
instead omit -n entirely (or use -n=0) for shard 3 to ensure tests run in the
main process.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ee653841-9c7a-46b7-8a49-4e8aec818f9b
📒 Files selected for processing (2)
.github/workflows/ci-full.yml.github/workflows/ci.yml
Code Review by QodoSorry, something went wrongWe weren't able to complete the code review on our side. Please try againⓘ The new review experience is currently in Beta. Learn more |
- ci.yml/ci-full.yml: remove inaccurate "~170 TestClient objects" count
from comment; add "believed to" qualifier since root cause is inferred
- test_web_organize_routes.py: fix _get_csrf_token docstring ("Raises
AssertionError" → "Asserts that the cookie was set", bare assert is not raise)
- test_web_organize_routes.py: add test_organize_scan_post_without_csrf_returns_403
to verify CSRF middleware rejects unauthenticated POSTs
- test_web_organize_routes.py: strengthen three weak assertions by adding
"plan" in response.text.lower() checks to scan_returns_plan,
scan_with_recursive_option, and htmx_request_header tests
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
.github/workflows/ci.yml (1)
233-240:⚠️ Potential issue | 🟠 Major
-n=1still spawns a worker subprocess; the fix may not address the root cause.The comment correctly identifies that the issue is async shutdown deadlocking when "the xdist worker process exits without a running event loop." However,
-n=1still spawns a single worker subprocess (gw0)—it does not run tests in the main pytest process where pytest-asyncio's event loop remains alive.To actually run tests in the main process, use
-n=0or omit the-nflag entirely for shard 3.Suggested fix
- # Shard 3 (tests/api tests/web) runs single-worker to prevent the - # GC-finaliser hang from many TestClient objects accumulated per - # worker — their async shutdown code is believed to deadlock when - # the xdist worker process exits without a running event loop. - WORKERS=$([[ "${{ matrix.shard }}" == "3" ]] && echo "1" || echo "auto") + # Shard 3 (tests/api tests/web) runs without xdist to prevent the + # GC-finaliser hang from TestClient objects — their async shutdown + # code deadlocks when exiting a subprocess without a running event loop. + # Using -n=0 (or omitting -n) runs tests in the main pytest process + # where pytest-asyncio's event loop is still alive during teardown. + XDIST_FLAG=$([[ "${{ matrix.shard }}" == "3" ]] && echo "-n=0" || echo "-n=auto") pytest $PATHS --strict-markers -m "not benchmark and not e2e" \ --cov=file_organizer --cov-report= \ - --timeout=30 -n=$WORKERS --override-ini="addopts=" + --timeout=30 $XDIST_FLAG --override-ini="addopts="🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 233 - 240, The current WORKERS assignment still produces a spawned xdist worker for shard 3 (because `-n=1` creates a subprocess); change the logic so shard 3 runs tests in the main process by making WORKERS empty (or "0" and then omitting -n) when matrix.shard == "3", and update the pytest invocation that uses WORKERS (the pytest $PATHS ... --timeout=30 -n=$WORKERS --override-ini="addopts=" line) to only pass the -n option when WORKERS is non-empty (e.g., conditionalize the `-n=$WORKERS` argument) so tests for shard 3 run without any xdist worker process.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/ci.yml:
- Around line 233-240: The current WORKERS assignment still produces a spawned
xdist worker for shard 3 (because `-n=1` creates a subprocess); change the logic
so shard 3 runs tests in the main process by making WORKERS empty (or "0" and
then omitting -n) when matrix.shard == "3", and update the pytest invocation
that uses WORKERS (the pytest $PATHS ... --timeout=30 -n=$WORKERS
--override-ini="addopts=" line) to only pass the -n option when WORKERS is
non-empty (e.g., conditionalize the `-n=$WORKERS` argument) so tests for shard 3
run without any xdist worker process.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07eaec40-67f4-406e-9972-ef072b733711
📒 Files selected for processing (3)
.github/workflows/ci-full.yml.github/workflows/ci.ymltests/web/test_web_organize_routes.py
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/ci-full.yml
- tests/web/test_web_organize_routes.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/web/test_web_organize_routes.py (1)
17-23: Consider using context manager for TestClient in a follow-up.The PR objective notes that TestClient objects not using context-manager form contributed to GC-finaliser hangs. While the CI fix (running shard 3 with
-n=1) addresses the immediate timeout issue, usingTestClientas a context manager would provide cleaner resource lifecycle management:# Future improvement pattern: with TestClient(app) as client: response = client.get(...)This could be addressed in a separate cleanup PR given the scope here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/web/test_web_organize_routes.py` around lines 17 - 23, The helper _build_client currently returns a plain TestClient instance which can lead to GC-finaliser hangs; change usage to the context-manager form by creating the app with create_app(settings) and using TestClient(app) as a context manager where tests call _build_client, or refactor _build_client to yield a client via a contextmanager/fixture so callers use "with TestClient(app) as client" (reference: _build_client function and TestClient usage) — do this in a follow-up cleanup PR so test callers are updated to use the context-manager lifecycle for TestClient.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/web/test_web_organize_routes.py`:
- Around line 17-23: The helper _build_client currently returns a plain
TestClient instance which can lead to GC-finaliser hangs; change usage to the
context-manager form by creating the app with create_app(settings) and using
TestClient(app) as a context manager where tests call _build_client, or refactor
_build_client to yield a client via a contextmanager/fixture so callers use
"with TestClient(app) as client" (reference: _build_client function and
TestClient usage) — do this in a follow-up cleanup PR so test callers are
updated to use the context-manager lifecycle for TestClient.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8e29623e-cdc1-4880-9e01-6a692a4fc26f
📒 Files selected for processing (1)
tests/web/test_web_organize_routes.py
curdriceaurora
left a comment
There was a problem hiding this comment.
🤖 Auto Claude PR Review
Merge Verdict: 🟠 NEEDS REVISION
🟠 Needs revision - 1 issue(s) require attention.
1 issue(s) must be addressed (0 required, 1 recommended)
Risk Assessment
| Factor | Level | Notes |
|---|---|---|
| Complexity | Low | Based on lines changed |
| Security Impact | None | Based on security findings |
| Scope Coherence | Good | Based on structural review |
Findings Summary
- Medium: 1 issue(s)
Generated by Auto Claude PR Review
Findings (1 selected of 1 total)
🟡 [9a034d770f94] [MEDIUM] Local _get_csrf_token() duplicates shared conftest helpers
📁 tests/web/test_web_organize_routes.py:26
This file defines a local _get_csrf_token() function (lines 26-41) that duplicates the shared get_csrf_token() and get_csrf_headers() functions in tests/conftest.py (lines 416-451). The sister file tests/test_web_organize_routes.py already uses the shared helper via from .conftest import get_csrf_headers (line 11). Additionally, tests/web/test_web_marketplace_routes.py has the same local duplication pattern.
The shared get_csrf_headers() in conftest.py returns {"x-csrf-token": token} which is exactly what this PR manually constructs at every call site (e.g., csrf_headers = {"x-csrf-token": _get_csrf_token(client)}). Using the shared helper would eliminate the duplication and the manual dict construction.
Searched Grep('get_csrf_headers|get_csrf_token', 'tests/') — confirmed tests/conftest.py exports both, and tests/test_web_organize_routes.py already imports get_csrf_headers from conftest. | The new _get_csrf_token helper (lines 26-41) and the manual {"x-csrf-token": _get_csrf_token(client)} construction at 8 call sites duplicate the shared get_csrf_headers(client, seed_url="/ui/organize") function from tests/conftest.py (lines 438-451), which already returns exactly {"x-csrf-token": token}. The conftest version supports a configurable seed_url parameter. Other test files (e.g., tests/test_web_organize_routes.py) already import and use get_csrf_headers from conftest. However, tests/web/test_web_marketplace_routes.py also defines its own local _get_csrf_token, so this PR follows the established pattern in the tests/web/ directory.
Suggested fix:
Remove the local `_get_csrf_token()` and instead import the shared helper:
```python
from tests.conftest import get_csrf_headers
Then replace every csrf_headers = {"x-csrf-token": _get_csrf_token(client)} with csrf_headers = get_csrf_headers(client, seed_url="/ui/organize"). The shared get_csrf_headers() accepts a seed_url parameter (default /ui/) that can be set to /ui/organize to match the current behavior.
---
*This review was generated by Auto Claude.*
- ci.yml/ci-full.yml: remove "believed to" hedge — root cause was confirmed (async shutdown deadlocks at worker exit without event loop); also quote $WORKERS per D7 shell variable quoting rule - conftest.py: narrow client: Any → client: "TestClient" on both CSRF helpers to satisfy F2 type annotation rule - test_web_organize_routes.py (inner): fix _get_csrf_token docstring to "Raises AssertionError if absent or empty" — more precise than "Asserts that the cookie was set" - test_web_organize_routes.py (outer): add assert "plan" in response.text.lower() to test_scan_returns_plan and test_scan_with_recursive_option to match inner file and avoid status-code-only assertions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/conftest.py (1)
416-438:⚠️ Potential issue | 🟠 MajorImport
TestClientat module scope to resolve F821 linting errors.The annotations on lines 416 and 438 reference
TestClientwith quoted string literals. Althoughfrom __future__ import annotationsis enabled, Ruff still validates thatTestClientis defined at module scope — it is only imported locally within theweb_client_builderfixture on line 388, making it unavailable to these helper functions.Add a
TYPE_CHECKINGconditional import at the top of the module:Suggested fix
-from typing import Any +from typing import Any, TYPE_CHECKING + +if TYPE_CHECKING: + from fastapi.testclient import TestClientThen remove the quotes from the annotations on lines 416 and 438:
-def get_csrf_token(client: "TestClient", seed_url: str = "/ui/") -> str: +def get_csrf_token(client: TestClient, seed_url: str = "/ui/") -> str:-def get_csrf_headers(client: "TestClient", seed_url: str = "/ui/") -> dict[str, str]: +def get_csrf_headers(client: TestClient, seed_url: str = "/ui/") -> dict[str, str]:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/conftest.py` around lines 416 - 438, Add a module-scope import for TestClient under a typing-only guard and update the type annotations: add "from typing import TYPE_CHECKING" at top and inside "if TYPE_CHECKING: from httpx import TestClient" (or the appropriate TestClient import), then remove the string quotes around TestClient in the annotations for get_csrf_token and get_csrf_headers so they read client: TestClient instead of client: "TestClient"; this ensures Ruff sees TestClient at module scope while keeping the import runtime-free.
♻️ Duplicate comments (1)
.github/workflows/ci.yml (1)
233-240:⚠️ Potential issue | 🟠 Major
-n=1still leaves shard 3 running under xdist.
pytest-xdistonly runs tests in the main pytest process when xdist is disabled (-n 0or no-nflag). Any positive-nvalue still uses worker processes, so shard 3 can still hit the same worker-shutdown path this change is trying to avoid. Please omit-nfor shard 3 entirely, or switch that shard to-n=0. (pytest-xdist.readthedocs.io)Suggested fix
- # Shard 3 (tests/api tests/web) runs single-worker to prevent the - # GC-finaliser hang from many TestClient objects accumulated per - # worker — their async shutdown code deadlocks when the xdist - # worker process exits without a running event loop. - WORKERS=$([[ "${{ matrix.shard }}" == "3" ]] && echo "1" || echo "auto") + # Shard 3 (tests/api tests/web) runs in-process to avoid xdist + # worker shutdown entirely; TestClient GC finalizers can hang when + # a worker exits without a running event loop. + XDIST_ARGS=() + if [[ "${{ matrix.shard }}" != "3" ]]; then + XDIST_ARGS=(-n=auto) + fi pytest $PATHS --strict-markers -m "not benchmark and not e2e" \ --cov=file_organizer --cov-report= \ - --timeout=30 -n="$WORKERS" --override-ini="addopts=" + --timeout=30 "${XDIST_ARGS[@]}" --override-ini="addopts="🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/ci.yml around lines 233 - 240, The current logic still passes a positive -n to pytest for shard 3 because WORKERS is set to "1", which keeps pytest-xdist enabled; change the shard-3 handling so pytest is invoked without any -n flag (or with -n=0) instead of -n=1: update the WORKERS assignment and the pytest invocation that references WORKERS so that when matrix.shard == "3" the -n option is omitted (or set to "0"), ensuring xdist is disabled for the shard that must run in the main pytest process; the relevant symbols to edit are the WORKERS variable assignment and the pytest command that uses --timeout ... -n="$WORKERS".
🧹 Nitpick comments (1)
tests/web/test_web_organize_routes.py (1)
26-42: Use the shared CSRF helper here too.
tests/conftest.pyalready centralizes CSRF seeding/header construction for theseForm(...)POSTs. Keeping_get_csrf_token()plus the repeated{"x-csrf-token": ...}dicts here duplicates the contract and makes future cookie/header changes easy to miss.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/web/test_web_organize_routes.py` around lines 26 - 42, Remove the local _get_csrf_token helper and replace its usages with the centralized CSRF helper from tests/conftest.py: import and call that shared helper (instead of calling client.get("/ui/organize") and constructing {"x-csrf-token": token} locally) so POST requests use the conftest-provided header dict; update test_web_organize_routes.py to import the conftest helper name (the shared CSRF seeding/header helper) and use it wherever {"x-csrf-token": ...} or _get_csrf_token() were previously used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/conftest.py`:
- Around line 416-438: Add a module-scope import for TestClient under a
typing-only guard and update the type annotations: add "from typing import
TYPE_CHECKING" at top and inside "if TYPE_CHECKING: from httpx import
TestClient" (or the appropriate TestClient import), then remove the string
quotes around TestClient in the annotations for get_csrf_token and
get_csrf_headers so they read client: TestClient instead of client:
"TestClient"; this ensures Ruff sees TestClient at module scope while keeping
the import runtime-free.
---
Duplicate comments:
In @.github/workflows/ci.yml:
- Around line 233-240: The current logic still passes a positive -n to pytest
for shard 3 because WORKERS is set to "1", which keeps pytest-xdist enabled;
change the shard-3 handling so pytest is invoked without any -n flag (or with
-n=0) instead of -n=1: update the WORKERS assignment and the pytest invocation
that references WORKERS so that when matrix.shard == "3" the -n option is
omitted (or set to "0"), ensuring xdist is disabled for the shard that must run
in the main pytest process; the relevant symbols to edit are the WORKERS
variable assignment and the pytest command that uses --timeout ...
-n="$WORKERS".
---
Nitpick comments:
In `@tests/web/test_web_organize_routes.py`:
- Around line 26-42: Remove the local _get_csrf_token helper and replace its
usages with the centralized CSRF helper from tests/conftest.py: import and call
that shared helper (instead of calling client.get("/ui/organize") and
constructing {"x-csrf-token": token} locally) so POST requests use the
conftest-provided header dict; update test_web_organize_routes.py to import the
conftest helper name (the shared CSRF seeding/header helper) and use it wherever
{"x-csrf-token": ...} or _get_csrf_token() were previously used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: db4c7974-2ba5-4bf3-a37d-c14dd888f36e
📒 Files selected for processing (5)
.github/workflows/ci-full.yml.github/workflows/ci.ymltests/conftest.pytests/test_web_organize_routes.pytests/web/test_web_organize_routes.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/ci-full.yml
Summary
tests/api tests/web) has been timing out in CI since PR fix(ci): add shard 5 to fix GC hang; fix TUI and marketplace test failures #1091 — both py3.11 and py3.12 cancel after 19 minutes (job IDs 69638156037/69638156045 in run 23882386107, repeated in 69662634454/69662634464 in run 23890538305)tests/apicreates ~170TestClientobjects per run without using the context-manager form; when an xdist worker process exits, Python's GC tries to finalise these objects; their async shutdown code deadlocks because no event loop is running at that point-n=1(single worker) for shard 3 only, so teardown runs inside the same pytest process where pytest-asyncio's event loop is still aliveTest plan
-n=auto(2 workers) — no regression🤖 Generated with Claude Code
Summary by CodeRabbit