feat: partial failure visibility with completed_with_failures status#92

Merged
Muizzkolapo merged 5 commits into main from feat/partial-failure-visibility-and-sidecar-manifest
Apr 2, 2026

Conversation

@Muizzkolapo
Owner

Summary

  • When an action processes N items and some fail while others succeed, the action is now marked completed_with_failures instead of silently completing as completed
  • Failed items have their input snapshot persisted in the disposition table (JSON-serialized, 10KB cap) — enables future retry functionality
  • The tally now shows N OK | M PARTIAL | S SKIP | K ERROR and level lines show yellow for partial failures
  • The circuit breaker correctly ignores completed_with_failures — downstream actions run on the successful records
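
The tally line described above can be sketched as a small formatter (a purely illustrative helper — the real display layer has its own function names and also colors PARTIAL yellow):

```python
def format_tally(ok, partial, skip, error):
    """Illustrative formatter for the tally line described in the summary;
    the real display layer also applies terminal colors (yellow PARTIAL)."""
    return f"{ok} OK | {partial} PARTIAL | {skip} SKIP | {error} ERROR"

print(format_tally(7, 2, 1, 0))  # → 7 OK | 2 PARTIAL | 1 SKIP | 0 ERROR
```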

Changes

  • Storage layer: set_disposition() gains input_snapshot param on abstract interface; get_failed_items() convenience method added; SQLite schema migration for input_snapshot column
  • Result collector: passes input_snapshot (from source_snapshot or input_record) when writing FAILED dispositions
  • Executor: _handle_run_success() queries storage for item-level failures after the runner returns; sets completed_with_failures when partial failures detected
  • State manager: completed_with_failures added to terminal sets (get_pending_actions, is_workflow_done, is_workflow_complete)
  • Display: WorkflowCompleteEvent gains actions_partial; formatter includes PARTIAL count; level lines yellow for partial failures
  • Status checks: 6 locations updated to accept completed_with_failures alongside completed

Test plan

  • ruff check . — no new issues (1 pre-existing issue in test_limits.py)
  • ruff format --check — clean
  • pytest — 4279 passed, 2 skipped, 0 failures
  • Manual: run workflow where 1 item fails → action shows completed_with_failures → tally shows PARTIAL → downstream actions still run on good records

When an action processes N items and some fail while others succeed,
the action is now marked "completed_with_failures" instead of silently
completing as "completed". This gives operators visibility into partial
failures without halting downstream actions.

Changes:
- Storage backend: add input_snapshot param to set_disposition() and
  get_failed_items() method on abstract interface
- SQLite: add input_snapshot column with schema migration for existing DBs
- Result collector: persists JSON-serialized input snapshot for failed items
- Executor: queries storage after success to detect partial failures
- State manager: completed_with_failures as terminal status, circuit
  breaker ignores it (descendants run on successful records)
- Tally: shows "N OK | M PARTIAL | S SKIP | K ERROR"
- Level lines: yellow for levels with partial failures
- Extract _resolve_completion_status() helper and use it in both
  online and batch completion paths (fixes batch actions ignoring
  partial failures)
- dependency.py: accept completed_with_failures for upstream checks
  (fixes downstream blocking on partial failures)
- manifest.py: is_action_completed() and get_completed_actions()
  accept completed_with_failures (fixes unnecessary re-runs on resume)

@Muizzkolapo Muizzkolapo left a comment


Review: Phase 2 — Partial Failure Visibility

What's addressed

All three issues from the audit are fixed:

| Issue | Status | How |
| --- | --- | --- |
| ARCH-007 (#89) — batch path blind to partial failures | Fixed | Extracted `_resolve_completion_status()` shared helper, called from `_handle_run_success()`, `_handle_batch_check()`, and `_handle_batch_check_async()` |
| ARCH-008 (#90) — dependency.py:167 blocks on completed_with_failures | Fixed | Set check: `in {"completed", "completed_with_failures"}` |
| ARCH-009 (#91) — manifest.py excludes completed_with_failures | Fixed | Both `is_action_completed()` and `get_completed_actions()` updated |

The _resolve_completion_status() helper is the right call — single definition, used by all three completion paths. No more divergence surface for this check.

The six `== "completed"` → `in {"completed", "completed_with_failures"}` updates across executor.py, coordinator.py, execution_events.py, and action_executor.py all look correct and consistent.

Storage layer, result collector snapshot persistence, state manager terminal sets, tally display — all clean.

Issues to fix

1. _verify_completion_status() hardcodes status="completed" — loses partial failure signal on resume

executor.py (current L212-216 in the PR):

```python
return (
    True,
    ActionExecutionResult(
        success=True, status="completed", metrics=ExecutionMetrics(duration=0.0)
    ),
)
```

When a completed_with_failures action is verified on re-run (L555-562 / L617-624), the gate correctly enters — but the returned result hardcodes "completed". The caller returns this directly. So on resume, every completed_with_failures action silently upgrades to "completed", erasing the partial failure signal from downstream consumers (tally, tracker, coordinator).

Fix: Use the actual status from the state manager:

```python
actual_status = self.deps.state_manager.get_status(action_name)
return (
    True,
    ActionExecutionResult(
        success=True, status=actual_status, metrics=ExecutionMetrics(duration=0.0)
    ),
)
```

Same fix needed at L227-232 (the fallback path when no storage backend exists).

2. BatchCompleteEvent hardcodes completed=1, failed=0 even on partial failure

executor.py L680-685 (and the async twin at L754-759):

```python
fire_event(
    BatchCompleteEvent(
        ...
        completed=1,
        failed=0,
        ...
    )
)
```

When final_status is completed_with_failures, this event still reports 0 failures. Any dashboard or log consumer watching BatchCompleteEvent will see clean completions when items actually failed.

Fix: Reflect the actual status:

```python
fire_event(
    BatchCompleteEvent(
        ...
        completed=1 if final_status == "completed" else 0,
        failed=0 if final_status == "completed" else 1,
        ...
    )
)
```

Or, better, add a partial field to BatchCompleteEvent so consumers can distinguish partial completions directly.
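
If the event is a plain dataclass, the proposed field could look like this — purely illustrative; the real BatchCompleteEvent and its existing fields live in the PR's event module:

```python
from dataclasses import dataclass

@dataclass
class BatchCompleteEvent:
    """Illustrative shape only; the real event carries more fields."""
    action: str
    completed: int
    failed: int
    partial: int = 0  # proposed: 1 when final status is completed_with_failures

evt = BatchCompleteEvent(action="action_a", completed=1, failed=0, partial=1)
print(evt.partial)  # → 1
```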

3. No tests for the actual completed_with_failures path

The test changes (test_executor_lifecycle.py, test_executor_events.py) mock get_failed_items to return [] — this prevents test breakage but doesn't test the new behavior. There's no test that verifies:

  • get_failed_items() returns items → action status is completed_with_failures (not completed)
  • get_failed_items() raises → graceful fallback to completed
  • Batch completion with item failures → completed_with_failures (the ARCH-007 fix)
  • Tally shows correct PARTIAL count when partial failures exist
  • Level line is yellow (not green) for partial failures

Recommend at minimum adding to test_executor_lifecycle.py:

```python
def test_handle_run_success_with_item_failures(mock_deps):
    mock_deps.action_runner.storage_backend.get_failed_items.return_value = [
        {"record_id": "rec-1", "disposition": "failed", "reason": "timeout"}
    ]
    executor = ActionExecutor(mock_deps)
    result = executor._handle_run_success(params, "/out", 1.0, None)
    assert result.status == "completed_with_failures"
    mock_deps.state_manager.update_status.assert_called_with(
        "test_action", "completed_with_failures", record_limit=None, file_limit=None
    )

def test_handle_batch_check_with_item_failures(mock_deps):
    mock_deps.batch_manager.handle_batch_agent.return_value = ("/out", "completed")
    mock_deps.action_runner.storage_backend.get_failed_items.return_value = [
        {"record_id": "rec-1", "disposition": "failed", "reason": "api_error"}
    ]
    executor = ActionExecutor(mock_deps)
    result = executor._handle_batch_check("action_a", 0, config, start_time)
    assert result.status == "completed_with_failures"
```

Minor

  • tasks/todo.md still describes Phase 2 with the CollectionSummary approach (items 2.2, 2.3). The actual implementation is simpler — querying get_failed_items() directly. Worth updating so the plan matches what shipped.

Verdict

The core architecture is right — shared helper, symmetric batch/online paths, all downstream status checks updated. Fix items 1-3 above before merge. Item 1 is a correctness bug (data loss on resume), item 2 is observability, item 3 is coverage.

Muizzkolapo and others added 3 commits April 2, 2026 11:46
WARNINGs fixed:
- is_completed() now includes completed_with_failures (stale-completion
  verification works on resume)
- Added PARTIAL to COLORS dict (tally coloring now works)
- _resolve_completion_status exception path logs at WARNING not DEBUG
- JSON truncation produces valid JSON with __truncated__ sentinel
  instead of mid-character slice
- TOCTOU acknowledged (low impact, documented)

MINORs fixed:
- mark_action_completed() accepts status parameter
- Migration ALTER TABLE logs at DEBUG on both branches
- json.dumps failure logs at DEBUG
- All test coverage gaps filled (13 new tests):
  _resolve_completion_status (4 paths), get_failed_items sentinel
  filtering, circuit breaker ignores partial, is_completed_with_failures,
  terminal sets, is_workflow_complete with partial, level coloring
  for partial failures
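
The sentinel-based truncation fixed above can be sketched as follows — the helper name is hypothetical; the 10KB cap comes from the PR description, and the `__truncated__` sentinel from the commit message:

```python
import json

SNAPSHOT_CAP = 10 * 1024  # 10KB cap on persisted input snapshots (per the PR)

def serialize_snapshot(record):
    """Sketch of sentinel-based truncation: an oversized snapshot is replaced
    with a small, still-valid JSON object instead of a mid-character slice.
    Helper name and exact sentinel payload are illustrative."""
    raw = json.dumps(record)
    if len(raw) <= SNAPSHOT_CAP:
        return raw
    return json.dumps({"__truncated__": True, "original_size": len(raw)})

big = {"payload": "x" * (SNAPSHOT_CAP + 1)}
print(json.loads(serialize_snapshot(big))["__truncated__"])  # → True
```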
The set {"completed", "completed_with_failures"} was repeated as inline
literals in 9 locations across 6 files. Extracted as COMPLETED_STATUSES
and TERMINAL_STATUSES frozensets in state.py — single source of truth
when new statuses are added.
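
A minimal sketch of the extracted constants — the two frozenset names come from the commit message, but the exact membership of TERMINAL_STATUSES beyond the completed pair is an assumption here:

```python
# Single source of truth, per the commit message; the "failed" member of
# TERMINAL_STATUSES is an assumption about the real set's contents.
COMPLETED_STATUSES = frozenset({"completed", "completed_with_failures"})
TERMINAL_STATUSES = COMPLETED_STATUSES | {"failed"}

def is_completed(status):
    """Membership check used in place of the nine inline set literals."""
    return status in COMPLETED_STATUSES

print(is_completed("completed_with_failures"))  # → True
```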
@Muizzkolapo Muizzkolapo merged commit f4a1b32 into main Apr 2, 2026
4 checks passed