feat: partial failure visibility with completed_with_failures status#92
Conversation
When an action processes N items and some fail while others succeed, the action is now marked "completed_with_failures" instead of silently completing as "completed". This gives operators visibility into partial failures without halting downstream actions. Changes: - Storage backend: add input_snapshot param to set_disposition() and get_failed_items() method on abstract interface - SQLite: add input_snapshot column with schema migration for existing DBs - Result collector: persists JSON-serialized input snapshot for failed items - Executor: queries storage after success to detect partial failures - State manager: completed_with_failures as terminal status, circuit breaker ignores it (descendants run on successful records) - Tally: shows "N OK | M PARTIAL | S SKIP | K ERROR" - Level lines: yellow for levels with partial failures
- Extract _resolve_completion_status() helper and use it in both online and batch completion paths (fixes batch actions ignoring partial failures) - dependency.py: accept completed_with_failures for upstream checks (fixes downstream blocking on partial failures) - manifest.py: is_action_completed() and get_completed_actions() accept completed_with_failures (fixes unnecessary re-runs on resume)
Muizzkolapo
left a comment
There was a problem hiding this comment.
Review: Phase 2 — Partial Failure Visibility
What's addressed
All three issues from the audit are fixed:
| Issue | Status | How |
|---|---|---|
| ARCH-007 (#89) — batch path blind to partial failures | Fixed | Extracted _resolve_completion_status() shared helper, called from _handle_run_success(), _handle_batch_check(), and _handle_batch_check_async() |
ARCH-008 (#90) — dependency.py:167 blocks on completed_with_failures |
Fixed | Set check: in {"completed", "completed_with_failures"} |
ARCH-009 (#91) — manifest.py excludes completed_with_failures |
Fixed | Both is_action_completed() and get_completed_actions() updated |
The _resolve_completion_status() helper is the right call — single definition, used by all three completion paths. No more divergence surface for this check.
The six == "completed" → in {"completed", "completed_with_failures"} updates across executor.py, coordinator.py, execution_events.py, action_executor.py all look correct and consistent.
Storage layer, result collector snapshot persistence, state manager terminal sets, tally display — all clean.
Issues to fix
1. _verify_completion_status() hardcodes status="completed" — loses partial failure signal on resume
executor.py (current L212-216 in the PR):
return (
True,
ActionExecutionResult(
success=True, status="completed", metrics=ExecutionMetrics(duration=0.0)
),
)When a completed_with_failures action is verified on re-run (L555-562 / L617-624), the gate correctly enters — but the returned result hardcodes "completed". The caller returns this directly. So on resume, every completed_with_failures action silently upgrades to "completed", erasing the partial failure signal from downstream consumers (tally, tracker, coordinator).
Fix: Use the actual status from the state manager:
actual_status = self.deps.state_manager.get_status(action_name)
return (
True,
ActionExecutionResult(
success=True, status=actual_status, metrics=ExecutionMetrics(duration=0.0)
),
)Same fix needed at L227-232 (the fallback path when no storage backend exists).
2. BatchCompleteEvent hardcodes completed=1, failed=0 even on partial failure
executor.py L680-685 (and the async twin at L754-759):
fire_event(
BatchCompleteEvent(
...
completed=1,
failed=0,
...
)
)When final_status is completed_with_failures, this event still reports 0 failures. Any dashboard or log consumer watching BatchCompleteEvent will see clean completions when items actually failed.
Fix: Reflect the actual status:
fire_event(
BatchCompleteEvent(
...
completed=1 if final_status == "completed" else 0,
failed=0 if final_status == "completed" else 1,
...
)
)Or better, add a partial field to BatchCompleteEvent if it exists.
3. No tests for the actual completed_with_failures path
The test changes (test_executor_lifecycle.py, test_executor_events.py) mock get_failed_items to return [] — this prevents test breakage but doesn't test the new behavior. There's no test that verifies:
get_failed_items()returns items → action status iscompleted_with_failures(notcompleted)get_failed_items()raises → graceful fallback tocompleted- Batch completion with item failures →
completed_with_failures(the ARCH-007 fix) - Tally shows correct PARTIAL count when partial failures exist
- Level line is yellow (not green) for partial failures
Recommend at minimum adding to test_executor_lifecycle.py:
def test_handle_run_success_with_item_failures(mock_deps):
mock_deps.action_runner.storage_backend.get_failed_items.return_value = [
{"record_id": "rec-1", "disposition": "failed", "reason": "timeout"}
]
executor = ActionExecutor(mock_deps)
result = executor._handle_run_success(params, "/out", 1.0, None)
assert result.status == "completed_with_failures"
mock_deps.state_manager.update_status.assert_called_with(
"test_action", "completed_with_failures", record_limit=None, file_limit=None
)
def test_handle_batch_check_with_item_failures(mock_deps):
mock_deps.batch_manager.handle_batch_agent.return_value = ("/out", "completed")
mock_deps.action_runner.storage_backend.get_failed_items.return_value = [
{"record_id": "rec-1", "disposition": "failed", "reason": "api_error"}
]
executor = ActionExecutor(mock_deps)
result = executor._handle_batch_check("action_a", 0, config, start_time)
assert result.status == "completed_with_failures"Minor
tasks/todo.mdstill describes Phase 2 with theCollectionSummaryapproach (items 2.2, 2.3). The actual implementation is simpler — queryingget_failed_items()directly. Worth updating so the plan matches what shipped.
Verdict
The core architecture is right — shared helper, symmetric batch/online paths, all downstream status checks updated. Fix items 1-3 above before merge. Item 1 is a correctness bug (data loss on resume), item 2 is observability, item 3 is coverage.
WARNINGs fixed: - is_completed() now includes completed_with_failures (stale-completion verification works on resume) - Added PARTIAL to COLORS dict (tally coloring now works) - _resolve_completion_status exception path logs at WARNING not DEBUG - JSON truncation produces valid JSON with __truncated__ sentinel instead of mid-character slice - TOCTOU acknowledged (low impact, documented) MINORs fixed: - mark_action_completed() accepts status parameter - Migration ALTER TABLE logs at DEBUG on both branches - json.dumps failure logs at DEBUG - All test coverage gaps filled (13 new tests): _resolve_completion_status (4 paths), get_failed_items sentinel filtering, circuit breaker ignores partial, is_completed_with_failures, terminal sets, is_workflow_complete with partial, level coloring for partial failures
The set {"completed", "completed_with_failures"} was repeated as inline
literals in 9 locations across 6 files. Extracted as COMPLETED_STATUSES
and TERMINAL_STATUSES frozensets in state.py — single source of truth
when new statuses are added.
Summary
completed_with_failuresinstead of silently completing ascompletedN OK | M PARTIAL | S SKIP | K ERRORand level lines show yellow for partial failurescompleted_with_failures— downstream actions run on the successful recordsChanges
set_disposition()gainsinput_snapshotparam on abstract interface;get_failed_items()convenience method added; SQLite schema migration forinput_snapshotcolumninput_snapshot(fromsource_snapshotorinput_record) when writing FAILED dispositions_handle_run_success()queries storage for item-level failures after the runner returns; setscompleted_with_failureswhen partial failures detectedcompleted_with_failuresadded to terminal sets (get_pending_actions,is_workflow_done,is_workflow_complete)WorkflowCompleteEventgainsactions_partial; formatter includes PARTIAL count; level lines yellow for partial failurescompleted_with_failuresalongsidecompletedTest plan
ruff check .— clean (1 pre-existing issue in test_limits.py)ruff format --check— cleanpytest— 4279 passed, 2 skipped, 0 failurescompleted_with_failures→ tally shows PARTIAL → downstream actions still run on good records