@v1r3n v1r3n commented Feb 4, 2026

feat: Resilience for Multi-Homed Workers (Circuit Breaker + Poll Timeout)

Summary

This PR adds resilience mechanisms to the Multi-Homed Worker implementation (a worker that polls tasks from multiple Conductor servers simultaneously). With these changes, slow, unresponsive, or down servers no longer degrade the worker's performance or block it from processing tasks from healthy servers.

Key Changes

1. Resilience & Fault Tolerance

  • Circuit Breaker Pattern: Implemented a per-server circuit breaker.
    • Logic: If a server fails 3 consecutive times (_CIRCUIT_FAILURE_THRESHOLD), it is marked as "open" and skipped for 30 seconds (_CIRCUIT_RESET_SECONDS).
    • Benefit: Prevents wasted poll cycles and log spam when a server is down.
  • Poll Timeout: Added a hard timeout (_POLL_TIMEOUT_SECONDS = 5s) to the batch polling mechanism.
    • Logic: Uses asyncio.wait_for (Async) or future.result(timeout) (Sync) to ensure the poll loop never hangs indefinitely on a slow server.
    • Benefit: "Straggler" servers don't block task processing from healthy servers.
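
A minimal sketch of how the two mechanisms above could fit together on the async path. The three constants and their values come from this description; the `CircuitBreaker` class, the `poll_with_guard` helper, and the choice to fully close the circuit once the cool-down elapses are illustrative assumptions, not the code in this PR.

```python
import asyncio
import time

# Names and values as stated in the PR description.
_CIRCUIT_FAILURE_THRESHOLD = 3
_CIRCUIT_RESET_SECONDS = 30
_POLL_TIMEOUT_SECONDS = 5


class CircuitBreaker:
    """Hypothetical per-server breaker: opens after N consecutive failures."""

    def __init__(self, threshold=_CIRCUIT_FAILURE_THRESHOLD,
                 reset_seconds=_CIRCUIT_RESET_SECONDS, clock=time.monotonic):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """Return True if the server may be polled right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_seconds:
            # Assumption: after the cool-down the circuit closes completely.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


async def poll_with_guard(breaker, poll, timeout=_POLL_TIMEOUT_SECONDS):
    """Poll one server, skipping it while its circuit is open.

    `poll` is an assumed zero-argument coroutine function returning a list.
    """
    if not breaker.allow():
        return []
    try:
        tasks = await asyncio.wait_for(poll(), timeout=timeout)
    except Exception:
        # Timeouts and transport errors both count as a failed poll.
        breaker.record_failure()
        return []
    breaker.record_success()
    return tasks
```

In this sketch a successful poll resets the failure counter, so only consecutive failures open the circuit.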

2. Performance & Efficiency

  • Reusable Poll Executor:
    • For synchronous TaskRunner, we now maintain a persistent ThreadPoolExecutor specifically for polling, sized exactly to the number of servers.
    • Benefit: Eliminates thread creation/destruction overhead per poll cycle.
  • Optimal Capacity Distribution:
    • Polling capacity is now split evenly across servers, with the remainder accounted for, so the combined batch size never exceeds the worker's thread_count.
  • Single Server Optimization:
    • If only 1 server is configured, we bypass the executor overhead entirely and poll directly.
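
The three points above can be sketched for the synchronous path as follows. This is an illustration under stated assumptions rather than the PR's code: `distribute_capacity` and `MultiPoller` are hypothetical names, each poller is assumed to be a callable that takes a batch size and returns a list of tasks, and handing the remainder to the first servers is one possible way of "handling remainders".

```python
import time
from concurrent.futures import ThreadPoolExecutor


def distribute_capacity(total, server_count):
    """Split `total` slots across servers; the first `remainder` get one extra."""
    if server_count <= 0:
        raise ValueError("server_count must be positive")
    base, remainder = divmod(total, server_count)
    return [base + 1 if i < remainder else base for i in range(server_count)]


class MultiPoller:
    """Hypothetical fan-out poller with one long-lived executor."""

    def __init__(self, pollers, thread_count, timeout=5.0):
        if not pollers:
            raise ValueError("at least one poller is required")
        self.pollers = pollers
        self.thread_count = thread_count
        self.timeout = timeout
        # Created once and reused; not needed at all for a single server.
        self._executor = (
            ThreadPoolExecutor(max_workers=len(pollers))
            if len(pollers) > 1 else None
        )

    def poll_once(self):
        if self._executor is None:
            # Single-server path: call the poller directly.
            return list(self.pollers[0](self.thread_count))
        shares = distribute_capacity(self.thread_count, len(self.pollers))
        futures = [
            self._executor.submit(poller, share)
            for poller, share in zip(self.pollers, shares)
            if share > 0
        ]
        # One shared deadline so the whole cycle is bounded by `timeout`.
        deadline = time.monotonic() + self.timeout
        tasks = []
        for future in futures:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                tasks.extend(future.result(timeout=remaining))
            except Exception:
                continue  # a slow or failing server is skipped this cycle
        return tasks

    def close(self):
        if self._executor is not None:
            self._executor.shutdown(wait=False)
```

For example, `distribute_capacity(10, 3)` yields shares of 4, 3, and 3. Note that `future.result(timeout=...)` only stops waiting; the straggler's thread keeps running until its poll call returns.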

3. Thread Safety & Cleanup

  • Map Cleanup: Added robust cleanup logic for the _task_server_map (which routes task updates back to the correct server).
    • Explicit pop() operations ensure no memory leaks even for long-running or async tasks.
  • Async Handling: Verified and fixed handling for the ASYNC_TASK_RUNNING and TaskInProgress states so that map entries are kept while a task is still running and removed once it eventually completes.
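
The cleanup rule above can be illustrated with a small sketch. `TaskServerMap` and `route_update` are hypothetical stand-ins for the `_task_server_map` bookkeeping, and the lock is an assumption about how thread safety might be achieved. The idea is that an in-progress update looks up the server without removing the entry, while a terminal update pops it.

```python
import threading


class TaskServerMap:
    """Hypothetical task_id -> server_index map guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._map = {}

    def remember(self, task_id, server_index):
        with self._lock:
            self._map[task_id] = server_index

    def peek(self, task_id):
        """Look up the server without removing the entry."""
        with self._lock:
            return self._map.get(task_id)

    def release(self, task_id):
        """Remove and return the entry; a missing key is not an error."""
        with self._lock:
            return self._map.pop(task_id, None)

    def __len__(self):
        with self._lock:
            return len(self._map)


def route_update(task_map, task_id, still_running):
    """Return the server index to send this update to.

    Entries survive in-progress updates and are popped on the final one,
    so the map cannot grow without bound.
    """
    if still_running:
        return task_map.peek(task_id)
    return task_map.release(task_id)
```

Using `pop(task_id, None)` keeps the release idempotent, so a duplicate completion does not raise.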

4. Usability & Configuration

  • Environment Variables: Added Configuration.from_env_multi() to easily configure multi-homed workers via comma-separated values:
    export CONDUCTOR_SERVER_URL=https://server1/api,https://server2/api
    export CONDUCTOR_AUTH_KEY=key1,key2
    export CONDUCTOR_AUTH_SECRET=secret1,secret2
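
To show how comma-separated values like these could be paired up per server, here is a hedged sketch. It is not the implementation of `Configuration.from_env_multi()`: the `parse_multi_env` helper, its dictionary output, and its behavior on mismatched list lengths are all assumptions made for illustration.

```python
import os


def parse_multi_env(environ=None):
    """Hypothetical parser: one dict per server from comma-separated vars."""
    env = os.environ if environ is None else environ

    def split(name):
        raw = env.get(name, "")
        return [part.strip() for part in raw.split(",") if part.strip()]

    urls = split("CONDUCTOR_SERVER_URL")
    keys = split("CONDUCTOR_AUTH_KEY")
    secrets = split("CONDUCTOR_AUTH_SECRET")

    if len(keys) != len(secrets):
        raise ValueError("every auth key needs a matching secret")
    if keys and len(keys) != len(urls):
        # Assumption: credentials, when given, must line up with the URLs.
        raise ValueError("number of credentials must match number of URLs")
    if not keys:
        # No credentials configured: pair each URL with empty credentials.
        keys = [None] * len(urls)
        secrets = [None] * len(urls)

    return [
        {"url": url, "key": key, "secret": secret}
        for url, key, secret in zip(urls, keys, secrets)
    ]
```

With the three exports above, this sketch would return two entries, pairing `https://server1/api` with `key1`/`secret1` and `https://server2/api` with `key2`/`secret2`.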
