-
Notifications
You must be signed in to change notification settings - Fork 8
feat: cap deal/retrievals with abort signals #263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Adds end-to-end abort propagation and job-level timeout enforcement for deal/retrieval workflows so long-running jobs can be actively cancelled while improving metrics/logging around abort vs failure.
Changes:
- Introduces shared abort helpers (
abort-utils) and adoptsAbortSignalpropagation across deal/retrieval flows (including add-ons and IPNI polling). - Enforces per-job timeouts in the pg-boss job runner and records
handler_result="aborted"for timed-out executions. - Updates defaults/docs for job timeouts and HTTP request timeouts, plus adds/updates tests for abort behavior and error preservation.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/environment-variables.md | Documents new job timeout env vars and adds them to the quick reference. |
| apps/backend/src/retrieval/retrieval.service.ts | Propagates abort signals through retrieval execution and adjusts batch behavior on abort. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Updates tests for abort behavior and aligns Deal IDs with UUID strings. |
| apps/backend/src/retrieval-addons/types.ts | Extends retrieval test result shape with an optional aborted flag. |
| apps/backend/src/retrieval-addons/retrieval-addons.service.ts | Adds abort checks, uses shared abort-aware delay, and improves error capture for non-Error throws (partially). |
| apps/backend/src/retrieval-addons/retrieval-addons.service.spec.ts | Adds test ensuring non-Error throws are captured in execution results. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Documents handler_result semantics for jobs_completed_total. |
| apps/backend/src/jobs/jobs.service.ts | Enforces job-level timeouts via AbortController and reports aborted jobs distinctly. |
| apps/backend/src/jobs/jobs.service.spec.ts | Adds metrics and timeout-abort tests for deal/retrieval jobs; updates private-call signatures. |
| apps/backend/src/deal/deal.service.ts | Propagates abort signal into deal creation/upload/IPNI/retrieval checks; preserves non-Error error messages. |
| apps/backend/src/deal/deal.service.spec.ts | Adds coverage for preserving non-Error abort reasons through deal creation failure recording. |
| apps/backend/src/deal-addons/strategies/ipni.strategy.ts | Propagates abort signal through IPNI monitoring/polling and uses abort-aware delay. |
| apps/backend/src/deal-addons/interfaces/deal-addon.interface.ts | Extends onUploadComplete to accept an optional abort signal. |
| apps/backend/src/deal-addons/deal-addons.service.ts | Propagates abort signal through upload-complete add-on handlers and uses awaitWithAbort. |
| apps/backend/src/config/app.config.ts | Adds config schema + loader for job timeout env vars; reduces default HTTP request timeouts. |
| apps/backend/src/common/abort-utils.ts | Adds shared helpers: createAbortError, awaitWithAbort, and abort-aware delay. |
| apps/backend/src/common/abort-utils.spec.ts | Adds unit tests for abort utilities. |
| apps/backend/.env.example | Adds new job timeout vars and updates HTTP timeout defaults/comments. |
Comments suppressed due to low confidence (2)
apps/backend/src/retrieval-addons/retrieval-addons.service.ts:207
- When a retrieval promise rejects in
testAllRetrievalMethods, the recordederrorusesresult.reason?.message || "Unknown error". If a strategy throws a non-Error(the new spec covers this),.messagewill be undefined and the reason is lost. Preferresult.reason instanceof Error ? result.reason.message : String(result.reason)so execution results preserve the real failure details.
const executionResults: RetrievalExecutionResult[] = results.map((result, index) => {
if (result.status === "fulfilled") {
return result.value;
} else {
// Create failed result - retryCount unknown for catastrophic failures
return {
url: urlResults[index].url,
method: urlResults[index].method,
data: Buffer.alloc(0),
metrics: {
latency: 0,
ttfb: 0,
throughput: 0,
statusCode: 0,
timestamp: new Date(),
responseSize: 0,
},
success: false,
error: result.reason?.message || "Unknown error",
retryCount: undefined, // Unknown for catastrophic failures
};
apps/backend/src/retrieval/retrieval.service.ts:201
performAllRetrievalslogs "All retrievals failed" at error level for any thrown error, including aborts thrown viasignal.throwIfAborted(). This will create noisy failure logs for expected cancellations/timeouts. Consider skipping the error log (or downgrading to warn) whensignal?.abortedis true, similar to the batch-level handling above.
} catch (error) {
const errorMessage = error instanceof Error ? error.message : String(error);
this.logger.error(`All retrievals failed for ${deal.pieceCid}: ${errorMessage}`);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Adds end-to-end abort propagation with shared utilities, introduces job-level timeout enforcement for deal/retrieval jobs, and improves observability/testing around aborted runs and retrieval failures. Updates HTTP timeout defaults and documents new job timeout env vars.
Problem
Long-running deal/retrieval jobs and downstream steps lacked consistent abort handling, causing wasted work and incomplete observability when timeouts or cancellations occur. Abort reasons could be lost, and job metrics didn’t clearly distinguish aborts from failures.
Solution
abort-utilshelpers (createAbortError,awaitWithAbort,delay) with tests.AbortSignalthrough deal/retrieval flows, add‑ons, and IPNI polling; prevent new work on abort while preserving partial results.AbortController, recordshandler_result="aborted", and keeps success vs business failure semantics.abortedflag; errors preserve non‑Errorabort reasons.Notes
DEAL_JOB_TIMEOUT_SECONDS,RETRIEVAL_JOB_TIMEOUT_SECONDS(defaults 10m/5m) and docs updated.jobs_completed_totalhandler result values (success,aborted,error).Fixes #258