Skip to content

Codex/non blocking task failures#554

Merged
kelvinkipruto merged 7 commits intomainfrom
codex/non-blocking-task-failures
Feb 27, 2026
Merged

Codex/non blocking task failures#554
kelvinkipruto merged 7 commits intomainfrom
codex/non-blocking-task-failures

Conversation

@kelvinkipruto
Copy link
Contributor

@kelvinkipruto kelvinkipruto commented Feb 27, 2026

Description

This PR makes fetchAirtableDocuments non-blocking for per-record data issues (e.g. missing/unmapped political entity) while preserving task retry behavior for transient task-level failures.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Screenshots

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation

@vercel
Copy link

vercel bot commented Feb 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
PromiseTracker Ready Ready Preview, Comment Feb 27, 2026 9:52am

Request Review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86b9c9602d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@kelvinkipruto
Copy link
Contributor Author

@claude Review

@claude
Copy link

claude bot commented Feb 27, 2026

Claude finished @kelvinkipruto's task in 1m 48s —— View job


PR Review

  • Gathering context
  • Reading changed files
  • Analyzing error handling patterns
  • Writing review feedback

Code Review: Non-Blocking Task Failures

The overall intent is solid — preventing a single task failure from blocking the rest of a scheduled workflow is a meaningful reliability improvement. The item-level error isolation (inner catch blocks per-document/per-status) is well done and the partial failure logging is clear. However there are a few meaningful concerns:


P1 — Retry semantics silently disabled for transient failures

fetchAirtableDocuments has retries: 2 and downloadDocuments has retries: 2. Both now have outer catch blocks that return { output: { recoverableError: true } } instead of re-throwing. From Payload's perspective this is a successful task completion, so the retry mechanism will never trigger for these tasks.

This means if the Airtable API is temporarily rate-limited, there's a network hiccup, or the DB is momentarily unavailable — the task silently reports success with zero processing rather than retrying up to 2 more times as intended.

// fetchAirtableDocuments.ts:257-278 — this path suppresses retries
} catch (error) {
  // ...
  return {                        // ← Payload sees "success"
    output: {
      recoverableError: true,     // ← only in logs, not observed by retry logic
      error: errorMessage,
    },
  };
}

The same applies to fetchMeedanPromiseStatus.ts:101-121 and syncMeedanPromises.ts:50-72 (though those tasks don't have explicit retry configs).

Suggestion: Consider re-throwing for clearly transient errors (network failures, rate limits, DB connection errors) while returning recoverable for errors that represent expected partial outcomes. Or, if the decision is to fully embrace the "always continue" model, remove the retries configs from fetchAirtableDocuments and downloadDocuments to avoid the misleading implication that retries are happening. Fix this →


P1 — updatePromiseStatus silently skips on Meedan API outage

The config guard correctly hard-stops (throws) for missing credentials. But if fetchProjectMediaStatuses fails (Meedan API down, network error), the outer catch now returns { output: { updated: 0, recoverableError: true } }. This means a Meedan outage causes promise statuses to be silently skipped for the entire workflow run rather than being retried.

Unlike fetchPromiseStatuses and syncMeedanPromises, this task has no retries config, so skipping is the only outcome on failure. Worth evaluating whether a failed updatePromiseStatus should surface as a job failure or remain silent.


P2 — No differentiation between transient and permanent failures in outer catch

All runtime errors are treated identically in the outer catch blocks. There's no distinction between:

  • Transient errors (network timeout, API rate limit, DB connection refused) — which should trigger retries or surface as failures
  • Permanent/expected errors (malformed data, missing optional field) — which are legitimate "recoverable" cases

A common pattern to distinguish these is checking for specific error types or error codes before deciding whether to swallow or re-throw:

} catch (error) {
  if (isTransientError(error)) {
    throw error; // let retry logic handle it
  }
  // log and return recoverable for permanent errors
}

Positive observations

  • Partial failure handling in fetchAirtableDocuments (lines 226–249): Replacing throw new Error(...) when failedDocs.length > 0 with a warn log is the right call. Previously, 9 successful creates + 1 failure would throw and potentially cause re-processing of the successful 9 on retry.
  • updatePromiseStatus config refactor: Moving the Meedan config guard outside the try block (lines 34–41) cleanly separates hard configuration failures from soft runtime failures. This is a good pattern and matches the PR's stated guardrails.
  • Item-level error isolation: The per-document/per-status inner catch blocks in all tasks are well-structured, logging useful context (IDs, titles, airtableIDs) for debugging.
  • Consistent error message extraction: Standardizing on error instanceof Error ? error.message : String(error ?? "") across all catch blocks (the ?? "" handles null) is a minor but clean improvement.

Summary

The approach of non-blocking task failures is valid and improves workflow resilience. The main risk is that the retries configurations on fetchAirtableDocuments and downloadDocuments are now dead code — transient failures will be swallowed rather than retried. Whether that's acceptable depends on whether the retry behavior was relied upon. The PR is otherwise well-structured and the logging improvements make operational debugging easier.

@kelvinkipruto
Copy link
Contributor Author

@claude Review

@kelvinkipruto
Copy link
Contributor Author

@codex review

@claude
Copy link

claude bot commented Feb 27, 2026

Claude finished @kelvinkipruto's task in 3m 4s —— View job


PR Review

  • Gathering context
  • Reading changed files
  • Analyzing error handling patterns
  • Writing review feedback

Code Review: Non-Blocking Task Failures

Note on previous review: The previous Claude review described recoverableError: true outputs and retry suppression concerns, but those apply to an intermediate commit. The latest commit (781218c fix(tasks): restore retries and skip invalid Airtable docs) reverted that approach — most tasks still re-throw on failure. The review below reflects the current HEAD.


What this PR actually changes

The PR description says "Converted task-level catch blocks to log + return recoverable output" but the current code tells a different story. The real changes are:

  1. fetchAirtableDocuments.ts — the main behavioral change: partial document creation failures no longer cause the task to throw. A pre-validation loop + Map-based lookup now separates validation from creation, and the failedDocs.length > 0 path logs a warn instead of throwing.
  2. updatePromiseStatus.ts — Meedan config guard moved outside the try block to separate hard config failures from soft runtime failures.
  3. All tasksString(error ?? "") handles null/undefined error values consistently.

Most tasks still re-throw in their outer catch blocks, so retries: 2 on fetchAirtableDocuments and downloadDocuments is still effective (unlike what the previous review warned about).


P1 — Dead code path in Promise.allSettled mapper

fetchAirtableDocuments.ts:195–204

The pre-validation loop (lines 165–193) only places a doc into docsReadyToCreate if its politicalEntityAirtableID is non-null and entityByAirtableID.get(...) returns a truthy value. Yet inside the Promise.allSettled mapper, the entity is re-looked up and null is returned if not found:

const entity = politicalEntityAirtableID
  ? entityByAirtableID.get(politicalEntityAirtableID)
  : null;

if (!entity) {
  return null;  // ← dead code: can never be reached for docs in docsReadyToCreate
}

The null return value then triggers the "Missing related political entity" failure path at line 235. This is unreachable for any doc that passed validation, making it confusing dead code. The mapper can safely assume entity exists since validation already guaranteed it. Fix this →


P2 — PR description is stale

The description says "Converted task-level catch blocks to log + return recoverable output (instead of re-throwing)" but this was reverted in the latest commit. The actual non-blocking behavior is narrower: only partial doc creation failures in fetchAirtableDocuments are non-blocking. The "What Changed" section and bullet list in the PR body should be updated to reflect the current code, or readers will expect behavior that doesn't exist (e.g., silent swallowing of Meedan/Airtable outages).


P2 — findGlobal errors in updatePromiseStatus lose context

Moving payload.findGlobal(...) outside the try block (lines 34–40 in updatePromiseStatus.ts) means a transient DB error during settings fetch will no longer be caught by the task's own catch, so the requestedDocumentIds context is lost from the error log. It will still be caught by the withTaskTracing wrapper (as task.error with Sentry capture), so it's not a silent failure — but the structured error log will have less diagnostic context than before.

This is a minor observability tradeoff, not a blocking concern. Worth noting for the logging contract expectation.


Positive observations

  • Pre-validation + Map lookup in fetchAirtableDocuments: Separating validation from creation is the right approach. The old code passed invalid docs through Promise.allSettled and only reported errors afterward. Now invalid docs are caught eagerly, their Airtable status is updated immediately, and the docsReadyToCreate list is clean. The O(1) Map lookup over O(n) Array.find is a minor but clean perf improvement.

  • Partial failure fix is semantically correct: Previously, if 9 docs were created and 1 failed, the task threw — which would trigger retries: 2 and potentially re-create the 9 successful docs. The new logger.warn path correctly treats this as a partial success, leaving retries for true system failures.

  • withTaskTracing provides a safety net: The tracing wrapper (in utils.ts:147–187) wraps the entire handler in try/catch with Sentry capture, meaning any exception that escapes the task's own catch (including the moved findGlobal call) is still logged and reported.

  • String(error ?? "") consistency: Minor but correct improvement — String(null) returns "null" which is unhelpful in log output; String(null ?? "") returns "" which is cleaner.


Summary

The core change (partial doc failure → warn instead of throw) is correct and well-structured. The remaining tasks still throw on system failures as expected, so retry semantics are intact. The main actionable items are the dead code path in the Promise.allSettled mapper (P1) and updating the PR description to match the actual behavior.

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@kelvinkipruto
Copy link
Contributor Author

@claude Review

@claude
Copy link

claude bot commented Feb 27, 2026

Claude finished @kelvinkipruto's task in 2m 18s —— View job


PR Review

  • Gathering context
  • Reading changed files
  • Analyzing changes
  • Writing review feedback

Code Review: Non-Blocking Task Failures

Note: This review reflects the current HEAD (a5916f9) after the previous review feedback was addressed.


What this PR actually changes

  1. fetchAirtableDocuments.ts — the substantive behavioral change: a pre-validation loop now separates docs with missing/unmapped politicalEntity from valid docs before Promise.allSettled. Invalid docs get their Airtable status updated immediately (Failed: ...) and are added to failedDocs. Only docsReadyToCreate (validated docs) are passed to the creation step. After creation, if any docs in failedDocs, the old throw is replaced with logger.warn — partial failures are no longer blocking.

  2. updatePromiseStatus.tspayload.findGlobal and the Meedan config guard moved outside the try block, separating hard config errors from soft runtime failures.

  3. All other tasks (downloadDocuments, extractDocuments, extractPromises, fetchMeedanPromiseStatus, syncMeedanPromises) — only String(error ?? "") normalization (handles null gracefully in log output).

All outer catch blocks still throw error, so retries: 2 on fetchAirtableDocuments and downloadDocuments remains fully effective. The previous review concern about retries being disabled does not apply to the current code.


P1 — setDocumentStatus failure in pre-validation loop aborts remaining docs

fetchAirtableDocuments.ts:178,191 — the pre-validation loop calls await setDocumentStatus(doc.id, ...) synchronously for each failed doc. If setDocumentStatus throws (transient Airtable API error), the outer catch fires and the entire task fails — including all docs that would have been in docsReadyToCreate and processed successfully.

The creation results loop (line 248) has the same exposure, but it runs after creation. The pre-validation loop runs before — so a status-write failure for doc #1 in a batch of 100 would prevent docs #2–100 from ever being created.

The task's retries: 2 provides some protection, but on retry all docs will be re-evaluated. Consider whether setDocumentStatus failures in the validation loop should also be absorbed (logged, not thrown), to match the intent of the PR. Or at minimum, document this as a known trade-off. Fix this →


P2 — findGlobal DB errors in updatePromiseStatus lose requestedDocumentIds context

updatePromiseStatus.ts:34–36payload.findGlobal is now outside try. If the DB is transiently unavailable during settings fetch, the local catch (line 172) does not fire, so the error log at lines 175–180 (which includes requestedDocumentIds) is skipped. The withTaskTracing wrapper will still capture the error for Sentry, but with less structured context.

This is a minor observability tradeoff — acceptable given the withTaskTracing safety net, but worth documenting in code or noting in the PR description.


Positive observations

  • Pre-validation + Map is the right design: Eagerly filtering invalid docs before creation avoids the old problem where a single bad doc could abort 9 successful creates and re-trigger retries for all of them. The O(1) Map.get() over the previous Array.find() is a clean improvement.

  • Dead code removed in latest commit (a5916f9): The previous reviews correctly identified the entity re-lookup inside the Promise.allSettled mapper as dead code (since pre-validation already guaranteed entity presence). This has been cleaned up — the mapper now destructures { doc, entity } directly from docsReadyToCreate. Good follow-through.

  • Retry semantics intact: All tasks still re-throw in their outer catch blocks. The retries: 2 configs on fetchAirtableDocuments and downloadDocuments are effective. This PR does not regress retry behavior.

  • Config guard pattern is consistent: Both fetchAirtableDocuments (Airtable config) and updatePromiseStatus (Meedan config) now guard config outside try. This makes it clear that missing config is a hard failure, not a recoverable one.

  • String(error ?? "") across all tasks: Clean, handles null/undefined thrown values (which String(error) would render as "null" or "undefined" in logs).


Summary

The core change (pre-validation + partial failure warn instead of throw in fetchAirtableDocuments) is correct and well-structured. The remaining concern is the setDocumentStatus calls in the pre-validation loop being able to abort the entire batch on a transient Airtable error — which is at odds with the PR's stated intent of non-blocking per-record failures. All other changes are minor refinements. This is otherwise ready.

Remove redundant errorMessage variable declarations and directly construct error messages within logger.error calls. This reduces code verbosity and improves readability by eliminating unnecessary intermediate variables.
Remove unnecessary nullish coalescing with empty string when converting errors to strings. The String() constructor already handles null/undefined by converting them to "null"/"undefined", making the ?? "" redundant.
@kelvinkipruto kelvinkipruto merged commit 45f2b62 into main Feb 27, 2026
5 of 6 checks passed
@kelvinkipruto kelvinkipruto deleted the codex/non-blocking-task-failures branch February 27, 2026 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant