diff --git a/AGENTS.md b/AGENTS.md
index 767d0fd4..891fa655 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,781 +1,92 @@
-# Conductor SDK Architecture & Implementation Guide
+# AI Agent Guide for Conductor Python SDK
-A comprehensive guide for implementing Conductor SDKs across all languages (Java, Go, C#, JavaScript/TypeScript, Clojure) based on the Python SDK reference architecture.
+This document provides context and instructions for AI agents working on the `conductor-python` repository.
-## Executive Summary
+## 1. Project Overview
-This guide provides a complete blueprint for creating or refactoring Conductor SDKs to match the architecture, API design, and documentation standards established in the Python SDK. Each language should maintain its idiomatic patterns while following the core architectural principles.
+The **Conductor Python SDK** allows Python applications to interact with [Netflix Conductor](https://conductor.netflix.com/) and [Orkes Conductor](https://orkes.io/). It enables developers to:
+1. **Create Workers**: Poll for tasks and execute business logic.
+2. **Manage Workflows**: Start, stop, pause, and query workflows.
+3. **Manage Metadata**: Register task and workflow definitions.
----
+## 2. Repository Structure
-## 🏗️ SDK Architecture Blueprint
+| Directory | Description |
+|-----------|-------------|
+| `src/conductor/client` | Core SDK source code. |
+| `src/conductor/client/automator` | **Worker Framework**: `TaskRunner`, `AsyncTaskRunner`, `TaskHandler`. |
+| `src/conductor/client/http` | **API Layer**: Low-level HTTP clients (OpenAPI generated). |
+| `src/conductor/client/worker` | **Worker Interfaces**: `@worker_task` decorator, `WorkerInterface`. |
+| `src/conductor/client/configuration` | **Configuration**: Settings, auth, multi-homed logic. |
+| `tests/unit` | Unit tests (fast, mocked). |
+| `tests/integration` | Integration tests (require a running Conductor server). |
+| `examples/` | Usage examples for users. |
-### Core Architecture Layers
+## 3. Key Components & Architecture
-```
-┌─────────────────────────────────────────────────────┐
-│                  Application Layer                  │
-│              (User's Application Code)              │
-└─────────────────────────────────────────────────────┘
-                          ↓
-┌─────────────────────────────────────────────────────┐
-│                  High-Level Clients                 │
-│      (OrkesClients, WorkflowExecutor, Workers)      │
-└─────────────────────────────────────────────────────┘
-                          ↓
-┌─────────────────────────────────────────────────────┐
-│               Domain-Specific Clients               │
-│    (TaskClient, WorkflowClient, SecretClient...)    │
-└─────────────────────────────────────────────────────┘
-                          ↓
-┌─────────────────────────────────────────────────────┐
-│                 Orkes Implementations               │
-│      (OrkesTaskClient, OrkesWorkflowClient...)      │
-└─────────────────────────────────────────────────────┘
-                          ↓
-┌─────────────────────────────────────────────────────┐
-│                  Resource API Layer                 │
-│      (TaskResourceApi, WorkflowResourceApi...)      │
-└─────────────────────────────────────────────────────┘
-                          ↓
-┌─────────────────────────────────────────────────────┐
-│                   HTTP/API Client                   │
-│              (ApiClient, HTTP Transport)            │
-└─────────────────────────────────────────────────────┘
-```
+### 3.1 Worker Framework (`automator/`)
-### Client Hierarchy Pattern
+The worker framework is the most complex part of the SDK. It handles the "polling loop" pattern.
-```
-AbstractClient (Interface/ABC)
-        ↑
-OrkesBaseClient (Shared Implementation)
-        ↑
-OrkesSpecificClient (Concrete Implementation)
-```
+* **`TaskHandler`**: Entry point. Manages `TaskRunner` instances. Auto-detects configuration.
+* **`TaskRunner` (Sync)**:
+  * Uses `ThreadPoolExecutor` for concurrent task execution.
+  * Supports **Multi-Homed Polling** (polling multiple servers in parallel).
+  * **Crucial logic**: `__batch_poll_tasks` handles the poll loop, circuit breakers, and timeouts.
+* **`AsyncTaskRunner` (Async)**:
+  * Uses the `asyncio` event loop.
+  * Optimized for high-concurrency, I/O-bound tasks.
+  * Also supports multi-homed polling.
----
+### 3.2 Configuration (`configuration/`)
-## 📦 Package Structure
+The `Configuration` class manages server connection details.
+* **Factory method**: `Configuration.from_env_multi()` allows creating multiple config objects from comma-separated env vars (`CONDUCTOR_SERVER_URL`).
+* **Authentication**: Handled via `AuthenticationSettings` (key/secret).
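The "polling loop" pattern that `TaskRunner` implements can be sketched in plain Python. This is an illustrative stand-in, not SDK code: `FakeClient`, `run_worker`, and the task dicts are hypothetical names, and the real runner adds thread pools, metrics, circuit breakers, and graceful shutdown on top of this skeleton:

```python
import time


class FakeClient:
    """Hypothetical stand-in for the SDK's generated task resource API."""

    def __init__(self, queued):
        self.queued = list(queued)
        self.updates = []

    def poll(self):
        # Return the next queued task, or None when the queue is empty.
        return self.queued.pop(0) if self.queued else None

    def update_task(self, task_id, status, output):
        self.updates.append((task_id, status, output))


def run_worker(client, execute_fn, max_idle_polls=3, poll_interval=0.0):
    """Simplified poll -> execute -> update loop."""
    idle = 0
    while idle < max_idle_polls:
        task = client.poll()
        if task is None:
            idle += 1
            time.sleep(poll_interval)
            continue
        idle = 0
        try:
            output = execute_fn(task["input"])
            client.update_task(task["id"], "COMPLETED", output)
        except Exception as exc:
            # Real workers distinguish retryable from terminal failures.
            client.update_task(task["id"], "FAILED", {"error": str(exc)})


client = FakeClient([{"id": "t1", "input": 2}, {"id": "t2", "input": 5}])
run_worker(client, lambda x: x * 10)
# client.updates == [("t1", "COMPLETED", 20), ("t2", "COMPLETED", 50)]
```

The loop exits after `max_idle_polls` consecutive empty polls; the real runner instead keeps polling until shutdown is requested.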
-### Standard Package Organization
+### 3.3 Multi-Homed Workers (Feature Highlight)
-```
-conductor-{language}/
-├── src/
-│   └── conductor/
-│       ├── client/
-│       │   ├── {domain}_client.{ext}        # Abstract interfaces
-│       │   ├── orkes/
-│       │   │   ├── orkes_base_client.{ext}
-│       │   │   ├── orkes_{domain}_client.{ext}
-│       │   │   └── models/
-│       │   ├── http/
-│       │   │   ├── api/                     # Generated from OpenAPI
-│       │   │   │   └── *_resource_api.{ext}
-│       │   │   ├── models/                  # Generated models
-│       │   │   └── api_client.{ext}
-│       │   ├── automator/
-│       │   │   ├── task_runner.{ext}
-│       │   │   └── async_task_runner.{ext}
-│       │   ├── configuration/
-│       │   │   ├── configuration.{ext}
-│       │   │   └── settings/
-│       │   ├── worker/
-│       │   │   ├── worker_task.{ext}
-│       │   │   └── worker_discovery.{ext}
-│       │   └── workflow/
-│       │       ├── conductor_workflow.{ext}
-│       │       └── task/
-├── examples/
-│   ├── workers_e2e.{ext}        # End-to-end example
-│   ├── {feature}_journey.{ext}  # 100% API coverage demos
-│   └── README.md                # Examples catalog
-├── docs/
-│   ├── AUTHORIZATION.md         # 49 APIs
-│   ├── METADATA.md              # 21 APIs
-│   ├── INTEGRATION.md           # 28+ providers
-│   ├── TASK_MANAGEMENT.md       # 11 APIs
-│   ├── SECRET_MANAGEMENT.md     # 9 APIs
-│   ├── WORKFLOW_TESTING.md
-│   └── ...
-└── tests/
-    ├── unit/
-    ├── integration/
-    └── e2e/
-```
+Refactored in Jan 2026 to support high availability.
+* **Concept**: Workers poll from N servers.
+* **Implementation**:
+  * `_poll_executor`: Dedicated thread pool for polling (`TaskRunner`).
+  * `_server_map`: Internal map `{task_id: server_index}` used to route updates back to the correct server.
+  * **Resilience**: Circuit breaker (skips down servers for 30s) and poll timeout (5s).
----
+## 4. Development Guidelines
-## 🎯 Implementation Checklist
+### 4.1 Running Tests
-### Phase 1: Core Infrastructure
+* **Unit tests**: Run with `pytest`.
+  ```bash
+  PYTHONPATH=src pytest tests/unit
+  ```
+* **Key test files**:
+  * `tests/unit/automator/test_multi_homed.py`: Validates multi-server logic and circuit breakers.
+  * `tests/unit/automator/test_task_runner.py`: Validates core runner logic.
-#### 1.1 Configuration System
+### 4.2 Logging
-- [ ] Create Configuration class with builder pattern
-- [ ] Support environment variables
-- [ ] Implement hierarchical configuration (all → domain → task)
-- [ ] Add authentication settings (key/secret, token)
-- [ ] Include retry configuration
-- [ ] Add connection pooling settings
+* Use `logging.getLogger(__name__)`.
+* The SDK has extensive logging for debugging polling issues.
-#### 1.2 HTTP/API Layer
+### 4.3 Code Style
-- [ ] Generate models from OpenAPI specification
-- [ ] Generate resource API classes
-- [ ] Implement ApiClient with:
-  - [ ] Connection pooling
-  - [ ] Retry logic with exponential backoff
-  - [ ] Request/response interceptors
-  - [ ] Error handling and mapping
-  - [ ] Metrics collection hooks
+* Follow PEP 8.
+* Type hints are strongly encouraged (`def foo(bar: str) -> int:`).
+* Use `logger.debug` for high-volume logs (like polling loops).
-#### 1.3 Base Client Architecture
+## 5. Common Tasks
-- [ ] Create abstract base clients (interfaces)
-- [ ] Implement OrkesBaseClient aggregating all APIs
-- [ ] Add proper dependency injection
-- [ ] Implement client factory pattern
+### Adding a New API Method
+1. Check `src/conductor/client/http/api` (generated code). Do NOT edit it manually if avoidable.
+2. Add the high-level method to `OrkesTaskClient` or `OrkesWorkflowClient`.
-### Phase 2: Domain Clients
-
-For each domain, implement:
-
-#### 2.1 Task Client
-
-```
-Abstract Interface (11 methods):
-- poll_task(task_type, worker_id?, domain?)
-- batch_poll_tasks(task_type, worker_id?, count?, timeout?, domain?)
-- get_task(task_id)
-- update_task(task_result)
-- update_task_by_ref_name(workflow_id, ref_name, status, output, worker_id?)
-- update_task_sync(workflow_id, ref_name, status, output, worker_id?)
-- get_queue_size_for_task(task_type)
-- add_task_log(task_id, message)
-- get_task_logs(task_id)
-- get_task_poll_data(task_type)
-- signal_task(workflow_id, ref_name, data)
-```
-
-#### 2.2 Workflow Client
-
-```
-Abstract Interface (20+ methods):
-- start_workflow(start_request)
-- get_workflow(workflow_id, include_tasks?)
-- get_workflow_status(workflow_id, include_output?, include_variables?)
-- delete_workflow(workflow_id, archive?)
-- terminate_workflow(workflow_id, reason?, trigger_failure?)
-- pause_workflow(workflow_id)
-- resume_workflow(workflow_id)
-- restart_workflow(workflow_id, use_latest_def?)
-- retry_workflow(workflow_id, resume_subworkflow?)
-- rerun_workflow(workflow_id, rerun_request)
-- skip_task_from_workflow(workflow_id, task_ref, skip_request)
-- test_workflow(test_request)
-- search(start?, size?, free_text?, query?)
-- execute_workflow(start_request, request_id?, wait_until?, wait_seconds?)
-[... additional methods]
-```
-
-#### 2.3 Metadata Client (21 APIs)
-
-#### 2.4 Authorization Client (49 APIs)
-
-#### 2.5 Secret Client (9 APIs)
-
-#### 2.6 Integration Client (28+ providers)
-
-#### 2.7 Prompt Client (8 APIs)
-
-#### 2.8 Schedule Client (15 APIs)
-
-### Phase 3: Worker Framework
-
-#### 3.1 Worker Task Decorator/Annotation
-
-- [ ] Create worker registration system
-- [ ] Implement task discovery
-- [ ] Add worker lifecycle management
-- [ ] Support both sync and async workers
-
-#### 3.2 Task Runner
-
-- [ ] Implement TaskRunner with thread pool
-- [ ] Implement AsyncTaskRunner with event loop
-- [ ] Add metrics collection
-- [ ] Implement graceful shutdown
-- [ ] Add health checks
-
-#### 3.3 Worker Features
-
-- [ ] Task context injection
-- [ ] Automatic retries
-- [ ] TaskInProgress support for long-running tasks
-- [ ] Error handling (retryable vs terminal)
-- [ ] Worker discovery from packages
-
-### Phase 4: Workflow DSL
-
-- [ ] Implement ConductorWorkflow builder
-- [ ] Add all task types (Simple, HTTP, Switch, Fork, DoWhile, etc.)
-- [ ] Support method chaining -- [ ] Add workflow validation -- [ ] Implement workflow testing utilities - -### Phase 5: Examples - -#### 5.1 Core Examples - -- [ ] `workers_e2e` - Complete end-to-end example -- [ ] `worker_example` - Worker patterns -- [ ] `task_context_example` - Long-running tasks -- [ ] `workflow_example` - Workflow creation -- [ ] `test_workflows` - Testing patterns - -#### 5.2 Journey Examples (100% API Coverage) - -- [ ] `authorization_journey` - All 49 authorization APIs -- [ ] `metadata_journey` - All 21 metadata APIs -- [ ] `integration_journey` - All integration providers -- [ ] `schedule_journey` - All 15 schedule APIs -- [ ] `prompt_journey` - All 8 prompt APIs -- [ ] `secret_journey` - All 9 secret APIs - -### Phase 6: Documentation - -- [ ] Create all API reference documents (see Documentation section) -- [ ] Add Quick Start for each module -- [ ] Include complete working examples -- [ ] Document all models -- [ ] Add error handling guides -- [ ] Include best practices - ---- - -## ๐ŸŒ Language-Specific Implementation - -### Java Implementation - -```java -// Package Structure -com.conductor.sdk/ -โ”œโ”€โ”€ client/ -โ”‚ โ”œโ”€โ”€ TaskClient.java // Interface -โ”‚ โ”œโ”€โ”€ orkes/ -โ”‚ โ”‚ โ”œโ”€โ”€ OrkesBaseClient.java -โ”‚ โ”‚ โ””โ”€โ”€ OrkesTaskClient.java // Implementation -โ”‚ โ””โ”€โ”€ http/ -โ”‚ โ”œโ”€โ”€ api/ // Generated -โ”‚ โ””โ”€โ”€ models/ // Generated - -// Client Pattern -public interface TaskClient { - Optional pollTask(String taskType, String workerId, String domain); - List batchPollTasks(String taskType, BatchPollRequest request); - // ... 
other methods -} - -public class OrkesTaskClient extends OrkesBaseClient implements TaskClient { - @Override - public Optional pollTask(String taskType, String workerId, String domain) { - return Optional.ofNullable( - taskResourceApi.poll(taskType, workerId, domain) - ); - } -} - -// Configuration -Configuration config = Configuration.builder() - .serverUrl("http://localhost:8080/api") - .authentication(keyId, keySecret) - .connectionPool(10, 30, TimeUnit.SECONDS) - .retryPolicy(3, 1000) - .build(); - -// Worker Pattern -@WorkerTask("process_order") -public class OrderProcessor implements Worker { - @Override - public TaskResult execute(Task task) { - OrderInput input = task.getInputData(OrderInput.class); - // Process - return TaskResult.complete(output); - } -} - -// Task Runner -TaskRunnerConfigurer configurer = TaskRunnerConfigurer.builder() - .configuration(config) - .workers(new OrderProcessor(), new PaymentProcessor()) - .threadCount(10) - .build(); - -configurer.start(); -``` - -### Go Implementation - -```go -// Package Structure -github.com/conductor-oss/conductor-go/ -โ”œโ”€โ”€ client/ -โ”‚ โ”œโ”€โ”€ task_client.go // Interface -โ”‚ โ”œโ”€โ”€ orkes/ -โ”‚ โ”‚ โ”œโ”€โ”€ base_client.go -โ”‚ โ”‚ โ””โ”€โ”€ task_client.go // Implementation -โ”‚ โ””โ”€โ”€ http/ -โ”‚ โ”œโ”€โ”€ api/ // Generated -โ”‚ โ””โ”€โ”€ models/ // Generated - -// Client Pattern -type TaskClient interface { - PollTask(ctx context.Context, taskType string, opts ...PollOption) (*Task, error) - BatchPollTasks(ctx context.Context, taskType string, opts ...PollOption) ([]*Task, error) - // ... 
other methods -} - -type orkesTaskClient struct { - *BaseClient - api *TaskResourceAPI -} - -func (c *orkesTaskClient) PollTask(ctx context.Context, taskType string, opts ...PollOption) (*Task, error) { - options := &pollOptions{} - for _, opt := range opts { - opt(options) - } - return c.api.Poll(ctx, taskType, options.WorkerID, options.Domain) -} - -// Configuration -config := client.NewConfig( - client.WithServerURL("http://localhost:8080/api"), - client.WithAuthentication(keyID, keySecret), - client.WithConnectionPool(10, 30*time.Second), - client.WithRetryPolicy(3, time.Second), -) - -// Worker Pattern -type OrderProcessor struct{} - -func (p *OrderProcessor) TaskType() string { - return "process_order" -} - -func (p *OrderProcessor) Execute(ctx context.Context, task *Task) (*TaskResult, error) { - var input OrderInput - if err := task.GetInputData(&input); err != nil { - return nil, err - } - // Process - return NewTaskResultComplete(output), nil -} - -// Task Runner -runner := worker.NewTaskRunner( - worker.WithConfig(config), - worker.WithWorkers(&OrderProcessor{}, &PaymentProcessor{}), - worker.WithThreadCount(10), -) - -runner.Start(ctx) -``` - -### TypeScript/JavaScript Implementation - -```typescript -// Package Structure -@conductor-oss/conductor-sdk/ -โ”œโ”€โ”€ src/ -โ”‚ โ”œโ”€โ”€ client/ -โ”‚ โ”‚ โ”œโ”€โ”€ TaskClient.ts // Interface -โ”‚ โ”‚ โ”œโ”€โ”€ orkes/ -โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ OrkesBaseClient.ts -โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ OrkesTaskClient.ts // Implementation -โ”‚ โ”‚ โ””โ”€โ”€ http/ -โ”‚ โ”‚ โ”œโ”€โ”€ api/ // Generated -โ”‚ โ”‚ โ””โ”€โ”€ models/ // Generated - -// Client Pattern -export interface TaskClient { - pollTask(taskType: string, workerId?: string, domain?: string): Promise; - batchPollTasks(taskType: string, options?: BatchPollOptions): Promise; - // ... 
other methods -} - -export class OrkesTaskClient extends OrkesBaseClient implements TaskClient { - async pollTask(taskType: string, workerId?: string, domain?: string): Promise { - return await this.taskApi.poll(taskType, { workerId, domain }); - } -} - -// Configuration -const config = new Configuration({ - serverUrl: 'http://localhost:8080/api', - authentication: { - keyId: 'your-key', - keySecret: 'your-secret' - }, - connectionPool: { - maxConnections: 10, - keepAliveTimeout: 30000 - }, - retry: { - maxAttempts: 3, - backoffMs: 1000 - } -}); - -// Worker Pattern (Decorators) -@WorkerTask('process_order') -export class OrderProcessor implements Worker { - async execute(task: Task): Promise { - const input = task.inputData as OrderInput; - // Process - return TaskResult.complete(output); - } -} - -// Worker Pattern (Functional) -export const processOrder = workerTask('process_order', async (task: Task) => { - const input = task.inputData as OrderInput; - // Process - return output; -}); - -// Task Runner -const runner = new TaskRunner({ - config, - workers: [OrderProcessor, PaymentProcessor], - // or functional: workers: [processOrder, processPayment], - options: { - threadCount: 10, - pollInterval: 100 - } -}); - -await runner.start(); -``` - -### C# Implementation - -```csharp -// Package Structure -Conductor.Client/ -โ”œโ”€โ”€ Client/ -โ”‚ โ”œโ”€โ”€ ITaskClient.cs // Interface -โ”‚ โ”œโ”€โ”€ Orkes/ -โ”‚ โ”‚ โ”œโ”€โ”€ OrkesBaseClient.cs -โ”‚ โ”‚ โ””โ”€โ”€ OrkesTaskClient.cs // Implementation -โ”‚ โ””โ”€โ”€ Http/ -โ”‚ โ”œโ”€โ”€ Api/ // Generated -โ”‚ โ””โ”€โ”€ Models/ // Generated - -// Client Pattern -public interface ITaskClient -{ - Task PollTaskAsync(string taskType, string? workerId = null, string? domain = null); - Task> BatchPollTasksAsync(string taskType, BatchPollOptions? options = null); - // ... other methods -} - -public class OrkesTaskClient : OrkesBaseClient, ITaskClient -{ - public async Task PollTaskAsync(string taskType, string? 
workerId = null, string? domain = null) - { - return await TaskApi.PollAsync(taskType, workerId, domain); - } -} - -// Configuration -var config = new Configuration -{ - ServerUrl = "http://localhost:8080/api", - Authentication = new AuthenticationSettings - { - KeyId = "your-key", - KeySecret = "your-secret" - }, - ConnectionPool = new PoolSettings - { - MaxConnections = 10, - KeepAliveTimeout = TimeSpan.FromSeconds(30) - }, - Retry = new RetryPolicy - { - MaxAttempts = 3, - BackoffMs = 1000 - } -}; - -// Worker Pattern (Attributes) -[WorkerTask("process_order")] -public class OrderProcessor : IWorker -{ - public async Task ExecuteAsync(ConductorTask task) - { - var input = task.GetInputData(); - // Process - return TaskResult.Complete(output); - } -} - -// Task Runner -var runner = new TaskRunner(config) - .AddWorker() - .AddWorker() - .WithOptions(new RunnerOptions - { - ThreadCount = 10, - PollInterval = TimeSpan.FromMilliseconds(100) - }); - -await runner.StartAsync(); -``` - ---- - -## ๐Ÿ“‹ API Method Naming Conventions - -### Consistent Naming Across All Clients - -| Operation | Method Pattern | Example | -|-----------|---------------|---------| -| Create | `create{Resource}` / `save{Resource}` | `createWorkflow`, `saveSchedule` | -| Read (single) | `get{Resource}` | `getTask`, `getWorkflow` | -| Read (list) | `list{Resources}` / `getAll{Resources}` | `listTasks`, `getAllSchedules` | -| Update | `update{Resource}` | `updateTask`, `updateWorkflow` | -| Delete | `delete{Resource}` | `deleteWorkflow`, `deleteSecret` | -| Search | `search{Resources}` | `searchWorkflows`, `searchTasks` | -| Execute | `{action}{Resource}` | `pauseWorkflow`, `resumeSchedule` | -| Test | `test{Resource}` | `testWorkflow` | - -### Parameter Patterns - -``` -Required parameters: Direct method parameters -Optional parameters: Options object or builder pattern - -Example: -- pollTask(taskType: string, options?: PollOptions) -- updateTask(taskId: string, result: TaskResult) -``` - ---- - 
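The "Parameter Patterns" convention above (required values as direct parameters, optional tuning via an options object) can be illustrated in Python. `PollOptions` and `poll_task` here are hypothetical names for the sake of the sketch, not actual SDK signatures:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollOptions:
    """Optional tuning knobs grouped into one object with sensible defaults."""
    worker_id: Optional[str] = None
    domain: Optional[str] = None


def poll_task(task_type: str, options: Optional[PollOptions] = None) -> dict:
    """Required data (task_type) is a direct parameter; everything else is optional."""
    opts = options or PollOptions()
    return {"task_type": task_type, "worker_id": opts.worker_id, "domain": opts.domain}
```

The caller can then write `poll_task("process_order")` for the common case and `poll_task("process_order", PollOptions(domain="eu"))` when tuning is needed, without the signature growing a long tail of rarely used parameters.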
-## ๐Ÿ“š Documentation Structure - -### Required Documentation Files - -``` -docs/ -โ”œโ”€โ”€ AUTHORIZATION.md # 49 APIs - User, Group, Application, Permissions -โ”œโ”€โ”€ METADATA.md # 21 APIs - Task & Workflow definitions -โ”œโ”€โ”€ INTEGRATION.md # 28+ providers - AI/LLM integrations -โ”œโ”€โ”€ PROMPT.md # 8 APIs - Prompt template management -โ”œโ”€โ”€ SCHEDULE.md # 15 APIs - Workflow scheduling -โ”œโ”€โ”€ SECRET_MANAGEMENT.md # 9 APIs - Secret storage -โ”œโ”€โ”€ TASK_MANAGEMENT.md # 11 APIs - Task operations -โ”œโ”€โ”€ WORKFLOW.md # Workflow operations -โ”œโ”€โ”€ WORKFLOW_TESTING.md # Testing guide -โ”œโ”€โ”€ WORKER.md # Worker implementation -โ””โ”€โ”€ README.md # SDK overview -``` - -### Documentation Template for Each Module - -```markdown -# [Module] API Reference - -Complete API reference for [module] operations in Conductor [Language] SDK. - -> ๐Ÿ“š **Complete Working Example**: See [example.ext] for comprehensive implementation. - -## Quick Start - -```language -// 10-15 line minimal example -``` - -## Quick Links -- [API Category 1](#api-category-1) -- [API Category 2](#api-category-2) -- [API Details](#api-details) -- [Model Reference](#model-reference) -- [Error Handling](#error-handling) -- [Best Practices](#best-practices) - -## API Category Tables - -| Method | Endpoint | Description | Example | -|--------|----------|-------------|---------| -| `methodName()` | `HTTP_VERB /path` | Description | [Link](#anchor) | - -## API Details - -[Detailed examples for each API method] - -## Model Reference - -[Model/class definitions] - -## Error Handling - -[Common errors and handling patterns] - -## Best Practices - -[Good vs bad examples with โœ… and โŒ] - -## Complete Working Example - -[50-150 line runnable example] -``` - ---- - -## ๐Ÿงช Testing Requirements - -### Test Coverage Goals - -| Component | Unit Tests | Integration Tests | E2E Tests | -|-----------|------------|-------------------|-----------| -| Clients | 90% | 80% | - | -| Workers | 95% | 
85% | 70% | -| Workflow DSL | 90% | 80% | - | -| Examples | - | 100% | 100% | -``` - -### Test Structure -``` -tests/ -โ”œโ”€โ”€ unit/ -โ”‚ โ”œโ”€โ”€ client/ -โ”‚ โ”‚ โ”œโ”€โ”€ test_task_client.{ext} -โ”‚ โ”‚ โ””โ”€โ”€ test_workflow_client.{ext} -โ”‚ โ”œโ”€โ”€ worker/ -โ”‚ โ”‚ โ””โ”€โ”€ test_worker_discovery.{ext} -โ”‚ โ””โ”€โ”€ workflow/ -โ”‚ โ””โ”€โ”€ test_workflow_builder.{ext} -โ”œโ”€โ”€ integration/ -โ”‚ โ”œโ”€โ”€ test_worker_execution.{ext} -โ”‚ โ”œโ”€โ”€ test_workflow_execution.{ext} -โ”‚ โ””โ”€โ”€ test_error_handling.{ext} -โ””โ”€โ”€ e2e/ - โ”œโ”€โ”€ test_authorization_journey.{ext} - โ””โ”€โ”€ test_complete_flow.{ext} -``` - ---- - -## ๐ŸŽฏ Success Criteria - -### Architecture -- [ ] Follows layered architecture pattern -- [ ] Maintains separation of concerns -- [ ] Uses dependency injection -- [ ] Implements proper abstractions - -### API Design -- [ ] Consistent method naming -- [ ] Predictable parameter patterns -- [ ] Strong typing with models -- [ ] Comprehensive error handling - -### Documentation -- [ ] 100% API coverage -- [ ] Quick start for each module -- [ ] Complete working examples -- [ ] Best practices documented - -### Testing -- [ ] >90% unit test coverage -- [ ] Integration tests for all APIs -- [ ] Journey tests demonstrate 100% API usage -- [ ] Examples are executable tests - -### Developer Experience -- [ ] Intuitive API design -- [ ] Excellent IDE support -- [ ] Clear error messages -- [ ] Comprehensive logging - ---- - -## ๐Ÿ“Š Validation Checklist - -Before considering an SDK complete: - -### Code Quality -- [ ] Follows language idioms -- [ ] Consistent code style -- [ ] No code duplication -- [ ] Proper error handling -- [ ] Comprehensive logging - -### API Completeness -- [ ] All 49 Authorization APIs -- [ ] All 21 Metadata APIs -- [ ] All 15 Schedule APIs -- [ ] All 11 Task APIs -- [ ] All 9 Secret APIs -- [ ] All 8 Prompt APIs -- [ ] All Integration providers - -### Documentation -- [ ] All API docs created -- [ ] Quick starts 
work -- [ ] Examples run successfully -- [ ] Cross-references valid -- [ ] No broken links - -### Testing -- [ ] Unit test coverage >90% -- [ ] Integration tests pass -- [ ] Journey examples complete -- [ ] CI/CD configured - -### Package -- [ ] Published to package registry -- [ ] Versioning follows semver -- [ ] CHANGELOG maintained -- [ ] LICENSE included - ---- - -## ๐Ÿ”ง Tooling Requirements - -### Code Generation -- OpenAPI Generator for API/models -- Custom generators for boilerplate - -### Build System -- Language-appropriate build tool -- Dependency management -- Version management -- Package publishing - -### CI/CD Pipeline -- Unit tests on every commit -- Integration tests on PR -- Documentation generation -- Package publishing on release - ---- - -## ๐Ÿ“ž Support & Questions - -For SDK implementation questions: - -1. Reference Python SDK for patterns -2. Check this guide for architecture -3. Maintain consistency across SDKs -4. Prioritize developer experience - -Remember: The goal is to make Conductor easy to use in every language while maintaining consistency and completeness. - ---- +### Debugging Worker Issues +1. Check `TaskRunner.run()` loop. +2. Verify `_task_server_map` logic if updates fail. +3. Check `_auth_failures` and circuit breaker state if polling stops. +## 6. Artifacts +* **`AGENTS.md`**: This file. +* **`SDK_IMPLEMENTATION_GUIDE.md`**: General architecture guide for *all* language SDKs (formerly AGENTS.md). diff --git a/README.md b/README.md index 369d6455..ed5cee2c 100644 --- a/README.md +++ b/README.md @@ -100,6 +100,8 @@ The SDK requires Python 3.9+. To install the SDK, use the following command: ```shell python3 -m pip install conductor-python ``` +## Working with Conductor Server + ## ๐Ÿš€ Quick Start @@ -343,6 +345,53 @@ Run the application and view the execution status from Conductor's UI Console. > [!NOTE] > That's it - you just created and executed your first distributed Python app! 
+## Multi-Homed Workers (High Availability)
+
+Workers can poll tasks from **multiple Conductor servers** simultaneously for high availability and disaster recovery. This is useful when running active-active or active-passive Conductor clusters.
+
+### Configuration via Environment Variables
+
+```bash
+# Multiple servers (comma-separated)
+export CONDUCTOR_SERVER_URL=https://east.example.com/api,https://west.example.com/api
+
+# Auth credentials per server (must match server count)
+export CONDUCTOR_AUTH_KEY=key1,key2
+export CONDUCTOR_AUTH_SECRET=secret1,secret2
+```
+
+```python
+# Workers automatically poll all configured servers
+handler = TaskHandler()  # Auto-detects from env vars
+handler.start_processes()
+```
+
+### Programmatic Configuration
+
+```python
+from conductor.client.configuration.configuration import Configuration
+from conductor.client.configuration.settings.authentication_settings import AuthenticationSettings
+from conductor.client.automator.task_handler import TaskHandler
+
+handler = TaskHandler(configuration=[
+    Configuration(
+        server_api_url="https://east.example.com/api",
+        authentication_settings=AuthenticationSettings(key_id="key1", key_secret="secret1")
+    ),
+    Configuration(
+        server_api_url="https://west.example.com/api",
+        authentication_settings=AuthenticationSettings(key_id="key2", key_secret="secret2")
+    ),
+])
+```
+
+### How It Works
+
+- Workers **poll all servers in parallel** each cycle
+- Tasks are tracked to their originating server
+- Updates are routed back to the correct server
+- Task definitions are registered to **all** servers
+- **Backward compatible** - single-server config still works
+
 ## Learn More about Conductor Python SDK
 
 There are three main ways you can use Conductor when building durable, resilient, distributed applications.
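The multi-homed "How It Works" flow above can be sketched in plain Python. `FakeServer`, `poll_all`, and `complete` are illustrative stand-ins (not SDK classes) for the SDK's internal machinery (`_poll_executor`, `_server_map`):

```python
from concurrent.futures import ThreadPoolExecutor


class FakeServer:
    """Illustrative stand-in for one configured Conductor server."""

    def __init__(self, name, queued):
        self.name = name
        self.queued = list(queued)
        self.completed = []

    def batch_poll(self, count=10):
        polled, self.queued = self.queued[:count], self.queued[count:]
        return polled

    def update_task(self, task_id):
        self.completed.append(task_id)


def poll_all(servers):
    """Poll every server in parallel; remember which server each task came from."""
    task_to_server = {}
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for idx, tasks in enumerate(pool.map(lambda s: s.batch_poll(), servers)):
            for task_id in tasks:
                task_to_server[task_id] = idx
    return task_to_server


def complete(servers, task_to_server, task_id):
    """Route the task update back to the server the task was polled from."""
    servers[task_to_server[task_id]].update_task(task_id)


servers = [FakeServer("east", ["t1"]), FakeServer("west", ["t2"])]
routing = poll_all(servers)
complete(servers, routing, "t2")
# servers[1].completed == ["t2"]; the east server never sees t2's update
```

The real implementation layers a circuit breaker and a per-poll timeout on top of this, so one unreachable server delays neither the others nor the update path.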
diff --git a/SDK_IMPLEMENTATION_GUIDE.md b/SDK_IMPLEMENTATION_GUIDE.md new file mode 100644 index 00000000..767d0fd4 --- /dev/null +++ b/SDK_IMPLEMENTATION_GUIDE.md @@ -0,0 +1,781 @@ +# Conductor SDK Architecture & Implementation Guide + +A comprehensive guide for implementing Conductor SDKs across all languages (Java, Go, C#, JavaScript/TypeScript, Clojure) based on the Python SDK reference architecture. + +## Executive Summary + +This guide provides a complete blueprint for creating or refactoring Conductor SDKs to match the architecture, API design, and documentation standards established in the Python SDK. Each language should maintain its idiomatic patterns while following the core architectural principles. + +--- + +## ๐Ÿ—๏ธ SDK Architecture Blueprint + +### Core Architecture Layers + +``` +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Application Layer โ”‚ +โ”‚ (User's Application Code) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ†“ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ High-Level Clients โ”‚ +โ”‚ (OrkesClients, WorkflowExecutor, Workers) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ†“ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Domain-Specific Clients โ”‚ +โ”‚ (TaskClient, WorkflowClient, SecretClient...) 
โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ†“ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Orkes Implementations โ”‚ +โ”‚ (OrkesTaskClient, OrkesWorkflowClient...) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ†“ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ Resource API Layer โ”‚ +โ”‚ (TaskResourceApi, WorkflowResourceApi...) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ + โ†“ +โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” +โ”‚ HTTP/API Client โ”‚ +โ”‚ (ApiClient, HTTP Transport) โ”‚ +โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ +``` + +### Client Hierarchy Pattern + +``` +AbstractClient (Interface/ABC) + โ†‘ +OrkesBaseClient (Shared Implementation) + โ†‘ +OrkesSpecificClient (Concrete Implementation) +``` + +--- + +## ๐Ÿ“ฆ Package Structure + +### Standard Package Organization + +``` +conductor-{language}/ +โ”œโ”€โ”€ src/ +โ”‚ โ””โ”€โ”€ conductor/ +โ”‚ โ”œโ”€โ”€ client/ +โ”‚ โ”‚ โ”œโ”€โ”€ {domain}_client.{ext} # Abstract interfaces +โ”‚ โ”‚ โ”œโ”€โ”€ orkes/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ orkes_base_client.{ext} +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ orkes_{domain}_client.{ext} +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ models/ +โ”‚ โ”‚ โ”œโ”€โ”€ http/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ api/ # Generated from OpenAPI +โ”‚ โ”‚ โ”‚ โ”‚ 
โ””โ”€โ”€ *_resource_api.{ext} +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ models/ # Generated models +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ api_client.{ext} +โ”‚ โ”‚ โ”œโ”€โ”€ automator/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ task_runner.{ext} +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ async_task_runner.{ext} +โ”‚ โ”‚ โ”œโ”€โ”€ configuration/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ configuration.{ext} +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ settings/ +โ”‚ โ”‚ โ”œโ”€โ”€ worker/ +โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€ worker_task.{ext} +โ”‚ โ”‚ โ”‚ โ””โ”€โ”€ worker_discovery.{ext} +โ”‚ โ”‚ โ””โ”€โ”€ workflow/ +โ”‚ โ”‚ โ”œโ”€โ”€ conductor_workflow.{ext} +โ”‚ โ”‚ โ””โ”€โ”€ task/ +โ”œโ”€โ”€ examples/ +โ”‚ โ”œโ”€โ”€ workers_e2e.{ext} # End-to-end example +โ”‚ โ”œโ”€โ”€ {feature}_journey.{ext} # 100% API coverage demos +โ”‚ โ””โ”€โ”€ README.md # Examples catalog +โ”œโ”€โ”€ docs/ +โ”‚ โ”œโ”€โ”€ AUTHORIZATION.md # 49 APIs +โ”‚ โ”œโ”€โ”€ METADATA.md # 21 APIs +โ”‚ โ”œโ”€โ”€ INTEGRATION.md # 28+ providers +โ”‚ โ”œโ”€โ”€ TASK_MANAGEMENT.md # 11 APIs +โ”‚ โ”œโ”€โ”€ SECRET_MANAGEMENT.md # 9 APIs +โ”‚ โ”œโ”€โ”€ WORKFLOW_TESTING.md +โ”‚ โ””โ”€โ”€ ... 
+โ””โ”€โ”€ tests/ + โ”œโ”€โ”€ unit/ + โ”œโ”€โ”€ integration/ + โ””โ”€โ”€ e2e/ +``` + +--- + +## ๐ŸŽฏ Implementation Checklist + +### Phase 1: Core Infrastructure + +#### 1.1 Configuration System + +- [ ] Create Configuration class with builder pattern +- [ ] Support environment variables +- [ ] Implement hierarchical configuration (all โ†’ domain โ†’ task) +- [ ] Add authentication settings (key/secret, token) +- [ ] Include retry configuration +- [ ] Add connection pooling settings + +#### 1.2 HTTP/API Layer + +- [ ] Generate models from OpenAPI specification +- [ ] Generate resource API classes +- [ ] Implement ApiClient with: + - [ ] Connection pooling + - [ ] Retry logic with exponential backoff + - [ ] Request/response interceptors + - [ ] Error handling and mapping + - [ ] Metrics collection hooks + +#### 1.3 Base Client Architecture + +- [ ] Create abstract base clients (interfaces) +- [ ] Implement OrkesBaseClient aggregating all APIs +- [ ] Add proper dependency injection +- [ ] Implement client factory pattern + +### Phase 2: Domain Clients + +For each domain, implement: + +#### 2.1 Task Client + +``` +Abstract Interface (11 methods): +- poll_task(task_type, worker_id?, domain?) +- batch_poll_tasks(task_type, worker_id?, count?, timeout?, domain?) +- get_task(task_id) +- update_task(task_result) +- update_task_by_ref_name(workflow_id, ref_name, status, output, worker_id?) +- update_task_sync(workflow_id, ref_name, status, output, worker_id?) +- get_queue_size_for_task(task_type) +- add_task_log(task_id, message) +- get_task_logs(task_id) +- get_task_poll_data(task_type) +- signal_task(workflow_id, ref_name, data) +``` + +#### 2.2 Workflow Client + +``` +Abstract Interface (20+ methods): +- start_workflow(start_request) +- get_workflow(workflow_id, include_tasks?) +- get_workflow_status(workflow_id, include_output?, include_variables?) +- delete_workflow(workflow_id, archive?) +- terminate_workflow(workflow_id, reason?, trigger_failure?) 
+- pause_workflow(workflow_id) +- resume_workflow(workflow_id) +- restart_workflow(workflow_id, use_latest_def?) +- retry_workflow(workflow_id, resume_subworkflow?) +- rerun_workflow(workflow_id, rerun_request) +- skip_task_from_workflow(workflow_id, task_ref, skip_request) +- test_workflow(test_request) +- search(start?, size?, free_text?, query?) +- execute_workflow(start_request, request_id?, wait_until?, wait_seconds?) +[... additional methods] +``` + +#### 2.3 Metadata Client (21 APIs) + +#### 2.4 Authorization Client (49 APIs) + +#### 2.5 Secret Client (9 APIs) + +#### 2.6 Integration Client (28+ providers) + +#### 2.7 Prompt Client (8 APIs) + +#### 2.8 Schedule Client (15 APIs) + +### Phase 3: Worker Framework + +#### 3.1 Worker Task Decorator/Annotation + +- [ ] Create worker registration system +- [ ] Implement task discovery +- [ ] Add worker lifecycle management +- [ ] Support both sync and async workers + +#### 3.2 Task Runner + +- [ ] Implement TaskRunner with thread pool +- [ ] Implement AsyncTaskRunner with event loop +- [ ] Add metrics collection +- [ ] Implement graceful shutdown +- [ ] Add health checks + +#### 3.3 Worker Features + +- [ ] Task context injection +- [ ] Automatic retries +- [ ] TaskInProgress support for long-running tasks +- [ ] Error handling (retryable vs terminal) +- [ ] Worker discovery from packages + +### Phase 4: Workflow DSL + +- [ ] Implement ConductorWorkflow builder +- [ ] Add all task types (Simple, HTTP, Switch, Fork, DoWhile, etc.) 
+- [ ] Support method chaining
+- [ ] Add workflow validation
+- [ ] Implement workflow testing utilities
+
+### Phase 5: Examples
+
+#### 5.1 Core Examples
+
+- [ ] `workers_e2e` - Complete end-to-end example
+- [ ] `worker_example` - Worker patterns
+- [ ] `task_context_example` - Long-running tasks
+- [ ] `workflow_example` - Workflow creation
+- [ ] `test_workflows` - Testing patterns
+
+#### 5.2 Journey Examples (100% API Coverage)
+
+- [ ] `authorization_journey` - All 49 authorization APIs
+- [ ] `metadata_journey` - All 21 metadata APIs
+- [ ] `integration_journey` - All integration providers
+- [ ] `schedule_journey` - All 15 schedule APIs
+- [ ] `prompt_journey` - All 8 prompt APIs
+- [ ] `secret_journey` - All 9 secret APIs
+
+### Phase 6: Documentation
+
+- [ ] Create all API reference documents (see Documentation section)
+- [ ] Add Quick Start for each module
+- [ ] Include complete working examples
+- [ ] Document all models
+- [ ] Add error handling guides
+- [ ] Include best practices
+
+---
+
+## ๐ŸŒ Language-Specific Implementation
+
+### Java Implementation
+
+```java
+// Package Structure
+com.conductor.sdk/
+โ”œโ”€โ”€ client/
+โ”‚   โ”œโ”€โ”€ TaskClient.java          // Interface
+โ”‚   โ”œโ”€โ”€ orkes/
+โ”‚   โ”‚   โ”œโ”€โ”€ OrkesBaseClient.java
+โ”‚   โ”‚   โ””โ”€โ”€ OrkesTaskClient.java // Implementation
+โ”‚   โ””โ”€โ”€ http/
+โ”‚       โ”œโ”€โ”€ api/                 // Generated
+โ”‚       โ””โ”€โ”€ models/              // Generated
+
+// Client Pattern
+public interface TaskClient {
+    Optional<Task> pollTask(String taskType, String workerId, String domain);
+    List<Task> batchPollTasks(String taskType, BatchPollRequest request);
+    // ... other methods
+}
+
+public class OrkesTaskClient extends OrkesBaseClient implements TaskClient {
+    @Override
+    public Optional<Task> pollTask(String taskType, String workerId, String domain) {
+        return Optional.ofNullable(
+            taskResourceApi.poll(taskType, workerId, domain)
+        );
+    }
+}
+
+// Configuration
+Configuration config = Configuration.builder()
+    .serverUrl("http://localhost:8080/api")
+    .authentication(keyId, keySecret)
+    .connectionPool(10, 30, TimeUnit.SECONDS)
+    .retryPolicy(3, 1000)
+    .build();
+
+// Worker Pattern
+@WorkerTask("process_order")
+public class OrderProcessor implements Worker {
+    @Override
+    public TaskResult execute(Task task) {
+        OrderInput input = task.getInputData(OrderInput.class);
+        // Process
+        return TaskResult.complete(output);
+    }
+}
+
+// Task Runner
+TaskRunnerConfigurer configurer = TaskRunnerConfigurer.builder()
+    .configuration(config)
+    .workers(new OrderProcessor(), new PaymentProcessor())
+    .threadCount(10)
+    .build();
+
+configurer.start();
+```
+
+### Go Implementation
+
+```go
+// Package Structure
+github.com/conductor-oss/conductor-go/
+โ”œโ”€โ”€ client/
+โ”‚   โ”œโ”€โ”€ task_client.go           // Interface
+โ”‚   โ”œโ”€โ”€ orkes/
+โ”‚   โ”‚   โ”œโ”€โ”€ base_client.go
+โ”‚   โ”‚   โ””โ”€โ”€ task_client.go       // Implementation
+โ”‚   โ””โ”€โ”€ http/
+โ”‚       โ”œโ”€โ”€ api/                 // Generated
+โ”‚       โ””โ”€โ”€ models/              // Generated
+
+// Client Pattern
+type TaskClient interface {
+    PollTask(ctx context.Context, taskType string, opts ...PollOption) (*Task, error)
+    BatchPollTasks(ctx context.Context, taskType string, opts ...PollOption) ([]*Task, error)
+    // ... other methods
+}
+
+type orkesTaskClient struct {
+    *BaseClient
+    api *TaskResourceAPI
+}
+
+func (c *orkesTaskClient) PollTask(ctx context.Context, taskType string, opts ...PollOption) (*Task, error) {
+    options := &pollOptions{}
+    for _, opt := range opts {
+        opt(options)
+    }
+    return c.api.Poll(ctx, taskType, options.WorkerID, options.Domain)
+}
+
+// Configuration
+config := client.NewConfig(
+    client.WithServerURL("http://localhost:8080/api"),
+    client.WithAuthentication(keyID, keySecret),
+    client.WithConnectionPool(10, 30*time.Second),
+    client.WithRetryPolicy(3, time.Second),
+)
+
+// Worker Pattern
+type OrderProcessor struct{}
+
+func (p *OrderProcessor) TaskType() string {
+    return "process_order"
+}
+
+func (p *OrderProcessor) Execute(ctx context.Context, task *Task) (*TaskResult, error) {
+    var input OrderInput
+    if err := task.GetInputData(&input); err != nil {
+        return nil, err
+    }
+    // Process
+    return NewTaskResultComplete(output), nil
+}
+
+// Task Runner
+runner := worker.NewTaskRunner(
+    worker.WithConfig(config),
+    worker.WithWorkers(&OrderProcessor{}, &PaymentProcessor{}),
+    worker.WithThreadCount(10),
+)
+
+runner.Start(ctx)
+```
+
+### TypeScript/JavaScript Implementation
+
+```typescript
+// Package Structure
+@conductor-oss/conductor-sdk/
+โ”œโ”€โ”€ src/
+โ”‚   โ”œโ”€โ”€ client/
+โ”‚   โ”‚   โ”œโ”€โ”€ TaskClient.ts        // Interface
+โ”‚   โ”‚   โ”œโ”€โ”€ orkes/
+โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ OrkesBaseClient.ts
+โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ OrkesTaskClient.ts // Implementation
+โ”‚   โ”‚   โ””โ”€โ”€ http/
+โ”‚   โ”‚       โ”œโ”€โ”€ api/             // Generated
+โ”‚   โ”‚       โ””โ”€โ”€ models/          // Generated
+
+// Client Pattern
+export interface TaskClient {
+    pollTask(taskType: string, workerId?: string, domain?: string): Promise<Task | null>;
+    batchPollTasks(taskType: string, options?: BatchPollOptions): Promise<Task[]>;
+    // ... other methods
+}
+
+export class OrkesTaskClient extends OrkesBaseClient implements TaskClient {
+    async pollTask(taskType: string, workerId?: string, domain?: string): Promise<Task | null> {
+        return await this.taskApi.poll(taskType, { workerId, domain });
+    }
+}
+
+// Configuration
+const config = new Configuration({
+    serverUrl: 'http://localhost:8080/api',
+    authentication: {
+        keyId: 'your-key',
+        keySecret: 'your-secret'
+    },
+    connectionPool: {
+        maxConnections: 10,
+        keepAliveTimeout: 30000
+    },
+    retry: {
+        maxAttempts: 3,
+        backoffMs: 1000
+    }
+});
+
+// Worker Pattern (Decorators)
+@WorkerTask('process_order')
+export class OrderProcessor implements Worker {
+    async execute(task: Task): Promise<TaskResult> {
+        const input = task.inputData as OrderInput;
+        // Process
+        return TaskResult.complete(output);
+    }
+}
+
+// Worker Pattern (Functional)
+export const processOrder = workerTask('process_order', async (task: Task) => {
+    const input = task.inputData as OrderInput;
+    // Process
+    return output;
+});
+
+// Task Runner
+const runner = new TaskRunner({
+    config,
+    workers: [OrderProcessor, PaymentProcessor],
+    // or functional: workers: [processOrder, processPayment],
+    options: {
+        threadCount: 10,
+        pollInterval: 100
+    }
+});
+
+await runner.start();
+```
+
+### C# Implementation
+
+```csharp
+// Package Structure
+Conductor.Client/
+โ”œโ”€โ”€ Client/
+โ”‚   โ”œโ”€โ”€ ITaskClient.cs           // Interface
+โ”‚   โ”œโ”€โ”€ Orkes/
+โ”‚   โ”‚   โ”œโ”€โ”€ OrkesBaseClient.cs
+โ”‚   โ”‚   โ””โ”€โ”€ OrkesTaskClient.cs   // Implementation
+โ”‚   โ””โ”€โ”€ Http/
+โ”‚       โ”œโ”€โ”€ Api/                 // Generated
+โ”‚       โ””โ”€โ”€ Models/              // Generated
+
+// Client Pattern
+public interface ITaskClient
+{
+    Task<ConductorTask?> PollTaskAsync(string taskType, string? workerId = null, string? domain = null);
+    Task<List<ConductorTask>> BatchPollTasksAsync(string taskType, BatchPollOptions? options = null);
+    // ... other methods
+}
+
+public class OrkesTaskClient : OrkesBaseClient, ITaskClient
+{
+    public async Task<ConductorTask?> PollTaskAsync(string taskType, string? workerId = null, string? domain = null)
+    {
+        return await TaskApi.PollAsync(taskType, workerId, domain);
+    }
+}
+
+// Configuration
+var config = new Configuration
+{
+    ServerUrl = "http://localhost:8080/api",
+    Authentication = new AuthenticationSettings
+    {
+        KeyId = "your-key",
+        KeySecret = "your-secret"
+    },
+    ConnectionPool = new PoolSettings
+    {
+        MaxConnections = 10,
+        KeepAliveTimeout = TimeSpan.FromSeconds(30)
+    },
+    Retry = new RetryPolicy
+    {
+        MaxAttempts = 3,
+        BackoffMs = 1000
+    }
+};
+
+// Worker Pattern (Attributes)
+[WorkerTask("process_order")]
+public class OrderProcessor : IWorker
+{
+    public async Task<TaskResult> ExecuteAsync(ConductorTask task)
+    {
+        var input = task.GetInputData<OrderInput>();
+        // Process
+        return TaskResult.Complete(output);
+    }
+}
+
+// Task Runner
+var runner = new TaskRunner(config)
+    .AddWorker<OrderProcessor>()
+    .AddWorker<PaymentProcessor>()
+    .WithOptions(new RunnerOptions
+    {
+        ThreadCount = 10,
+        PollInterval = TimeSpan.FromMilliseconds(100)
+    });
+
+await runner.StartAsync();
+```
+
+---
+
+## ๐Ÿ“‹ API Method Naming Conventions
+
+### Consistent Naming Across All Clients
+
+| Operation | Method Pattern | Example |
+|-----------|---------------|---------|
+| Create | `create{Resource}` / `save{Resource}` | `createWorkflow`, `saveSchedule` |
+| Read (single) | `get{Resource}` | `getTask`, `getWorkflow` |
+| Read (list) | `list{Resources}` / `getAll{Resources}` | `listTasks`, `getAllSchedules` |
+| Update | `update{Resource}` | `updateTask`, `updateWorkflow` |
+| Delete | `delete{Resource}` | `deleteWorkflow`, `deleteSecret` |
+| Search | `search{Resources}` | `searchWorkflows`, `searchTasks` |
+| Execute | `{action}{Resource}` | `pauseWorkflow`, `resumeSchedule` |
+| Test | `test{Resource}` | `testWorkflow` |
+
+### Parameter Patterns
+
+```
+Required parameters: Direct method parameters
+Optional parameters: Options object or builder pattern
+
+Example:
+- pollTask(taskType: string, options?: PollOptions)
+- updateTask(taskId: string, result: TaskResult)
+```
+
+---
+
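The required-vs-optional convention above maps naturally onto keyword arguments in the Python reference SDK. A minimal sketch of the pattern — class and field names here are illustrative, not the SDK's actual signatures:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollOptions:
    """Optional parameters travel together in one options object."""
    worker_id: Optional[str] = None
    domain: Optional[str] = None


class TaskClient:
    def poll_task(self, task_type: str, options: Optional[PollOptions] = None) -> dict:
        # The required parameter is positional; optionals default when omitted.
        opts = options or PollOptions()
        return {"taskType": task_type, "workerId": opts.worker_id, "domain": opts.domain}


client = TaskClient()
print(client.poll_task("process_order"))
print(client.poll_task("process_order", PollOptions(worker_id="w1", domain="orders")))
```

Keeping optionals in a single object (or builder) means new optional parameters can be added later without breaking existing call sites.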
+## ๐Ÿ“š Documentation Structure
+
+### Required Documentation Files
+
+```
+docs/
+โ”œโ”€โ”€ AUTHORIZATION.md         # 49 APIs - User, Group, Application, Permissions
+โ”œโ”€โ”€ METADATA.md              # 21 APIs - Task & Workflow definitions
+โ”œโ”€โ”€ INTEGRATION.md           # 28+ providers - AI/LLM integrations
+โ”œโ”€โ”€ PROMPT.md                # 8 APIs - Prompt template management
+โ”œโ”€โ”€ SCHEDULE.md              # 15 APIs - Workflow scheduling
+โ”œโ”€โ”€ SECRET_MANAGEMENT.md     # 9 APIs - Secret storage
+โ”œโ”€โ”€ TASK_MANAGEMENT.md       # 11 APIs - Task operations
+โ”œโ”€โ”€ WORKFLOW.md              # Workflow operations
+โ”œโ”€โ”€ WORKFLOW_TESTING.md      # Testing guide
+โ”œโ”€โ”€ WORKER.md                # Worker implementation
+โ””โ”€โ”€ README.md                # SDK overview
+```
+
+### Documentation Template for Each Module
+
+````markdown
+# [Module] API Reference
+
+Complete API reference for [module] operations in Conductor [Language] SDK.
+
+> ๐Ÿ“š **Complete Working Example**: See [example.ext] for comprehensive implementation.
+
+## Quick Start
+
+```language
+// 10-15 line minimal example
+```
+
+## Quick Links
+- [API Category 1](#api-category-1)
+- [API Category 2](#api-category-2)
+- [API Details](#api-details)
+- [Model Reference](#model-reference)
+- [Error Handling](#error-handling)
+- [Best Practices](#best-practices)
+
+## API Category Tables
+
+| Method | Endpoint | Description | Example |
+|--------|----------|-------------|---------|
+| `methodName()` | `HTTP_VERB /path` | Description | [Link](#anchor) |
+
+## API Details
+
+[Detailed examples for each API method]
+
+## Model Reference
+
+[Model/class definitions]
+
+## Error Handling
+
+[Common errors and handling patterns]
+
+## Best Practices
+
+[Good vs bad examples with โœ… and โŒ]
+
+## Complete Working Example
+
+[50-150 line runnable example]
+````
+
+---
+
+## ๐Ÿงช Testing Requirements
+
+### Test Coverage Goals
+
+| Component | Unit Tests | Integration Tests | E2E Tests |
+|-----------|------------|-------------------|-----------|
+| Clients | 90% | 80% | - |
+| Workers | 95% | 85% | 70% |
+| Workflow DSL | 90% | 80% | - |
+| Examples | - | 100% | 100% |
+
+### Test Structure
+```
+tests/
+โ”œโ”€โ”€ unit/
+โ”‚   โ”œโ”€โ”€ client/
+โ”‚   โ”‚   โ”œโ”€โ”€ test_task_client.{ext}
+โ”‚   โ”‚   โ””โ”€โ”€ test_workflow_client.{ext}
+โ”‚   โ”œโ”€โ”€ worker/
+โ”‚   โ”‚   โ””โ”€โ”€ test_worker_discovery.{ext}
+โ”‚   โ””โ”€โ”€ workflow/
+โ”‚       โ””โ”€โ”€ test_workflow_builder.{ext}
+โ”œโ”€โ”€ integration/
+โ”‚   โ”œโ”€โ”€ test_worker_execution.{ext}
+โ”‚   โ”œโ”€โ”€ test_workflow_execution.{ext}
+โ”‚   โ””โ”€โ”€ test_error_handling.{ext}
+โ””โ”€โ”€ e2e/
+    โ”œโ”€โ”€ test_authorization_journey.{ext}
+    โ””โ”€โ”€ test_complete_flow.{ext}
+```
+
+---
+
+## ๐ŸŽฏ Success Criteria
+
+### Architecture
+- [ ] Follows layered architecture pattern
+- [ ] Maintains separation of concerns
+- [ ] Uses dependency injection
+- [ ] Implements proper abstractions
+
+### API Design
+- [ ] Consistent method naming
+- [ ] Predictable parameter patterns
+- [ ] Strong typing with models
+- [ ] Comprehensive error handling
+
+### Documentation
+- [ ] 100% API coverage
+- [ ] Quick start for each module
+- [ ] Complete working examples
+- [ ] Best practices documented
+
+### Testing
+- [ ] >90% unit test coverage
+- [ ] Integration tests for all APIs
+- [ ] Journey tests demonstrate 100% API usage
+- [ ] Examples are executable tests
+
+### Developer Experience
+- [ ] Intuitive API design
+- [ ] Excellent IDE support
+- [ ] Clear error messages
+- [ ] Comprehensive logging
+
+---
+
+## ๐Ÿ“Š Validation Checklist
+
+Before considering an SDK complete:
+
+### Code Quality
+- [ ] Follows language idioms
+- [ ] Consistent code style
+- [ ] No code duplication
+- [ ] Proper error handling
+- [ ] Comprehensive logging
+
+### API Completeness
+- [ ] All 49 Authorization APIs
+- [ ] All 21 Metadata APIs
+- [ ] All 15 Schedule APIs
+- [ ] All 11 Task APIs
+- [ ] All 9 Secret APIs
+- [ ] All 8 Prompt APIs
+- [ ] All Integration providers
+
+### Documentation
+- [ ] All API docs created
+- [ ] Quick starts work
+- [ ] Examples run successfully
+- [ ] Cross-references valid
+- [ ] No broken links
+
+### Testing
+- [ ] Unit test coverage >90%
+- [ ] Integration tests pass
+- [ ] Journey examples complete
+- [ ] CI/CD configured
+
+### Package
+- [ ] Published to package registry
+- [ ] Versioning follows semver
+- [ ] CHANGELOG maintained
+- [ ] LICENSE included
+
+---
+
+## ๐Ÿ”ง Tooling Requirements
+
+### Code Generation
+- OpenAPI Generator for API/models
+- Custom generators for boilerplate
+
+### Build System
+- Language-appropriate build tool
+- Dependency management
+- Version management
+- Package publishing
+
+### CI/CD Pipeline
+- Unit tests on every commit
+- Integration tests on PR
+- Documentation generation
+- Package publishing on release
+
+---
+
+## ๐Ÿ“ž Support & Questions
+
+For SDK implementation questions:
+
+1. Reference Python SDK for patterns
+2. Check this guide for architecture
+3. Maintain consistency across SDKs
+4. Prioritize developer experience
+
+Remember: The goal is to make Conductor easy to use in every language while maintaining consistency and completeness.
+
+---
+
diff --git a/docs/WORKER.md b/docs/WORKER.md
index 42e6a4d4..71417031 100644
--- a/docs/WORKER.md
+++ b/docs/WORKER.md
@@ -270,7 +270,75 @@ If you paste the above code in a file called main.py, you can launch the workers
 python3 main.py
 ```
 
+## Multi-Homed Workers (High Availability)
+
+Workers can poll tasks from **multiple Conductor servers** simultaneously for high availability, disaster recovery, and active-active deployments.
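The comma-separated server/credential convention, including the startup check that credential counts match the server count, can be sketched as a standalone parser. This helper is hypothetical and for illustration only — the SDK performs the equivalent work inside `Configuration.from_env_multi()`:

```python
import os


def parse_multi_server_env(env=os.environ):
    """Split comma-separated server and credential variables, enforcing matching counts."""
    servers = [s.strip() for s in env.get("CONDUCTOR_SERVER_URL", "").split(",") if s.strip()]
    keys = [k.strip() for k in env.get("CONDUCTOR_AUTH_KEY", "").split(",") if k.strip()]
    secrets = [s.strip() for s in env.get("CONDUCTOR_AUTH_SECRET", "").split(",") if s.strip()]
    if keys and len(keys) != len(servers):
        raise ValueError(f"{len(keys)} auth key(s) configured for {len(servers)} server(s)")
    if secrets and len(secrets) != len(servers):
        raise ValueError(f"{len(secrets)} auth secret(s) configured for {len(servers)} server(s)")
    # Pair each server with its credentials (None when no auth is configured)
    return [
        (url, keys[i] if keys else None, secrets[i] if secrets else None)
        for i, url in enumerate(servers)
    ]


env = {
    "CONDUCTOR_SERVER_URL": "https://east.example.com/api,https://west.example.com/api",
    "CONDUCTOR_AUTH_KEY": "key1,key2",
    "CONDUCTOR_AUTH_SECRET": "secret1,secret2",
}
for url, key, secret in parse_multi_server_env(env):
    print(url, key)
```

Failing fast at startup on a count mismatch is preferable to silently reusing one credential across servers.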
+ +### Configuration via Environment Variables + +```bash +# Multiple servers (comma-separated) +export CONDUCTOR_SERVER_URL=https://east.example.com/api,https://west.example.com/api + +# Auth credentials per server (must match server count) +export CONDUCTOR_AUTH_KEY=key1,key2 +export CONDUCTOR_AUTH_SECRET=secret1,secret2 +``` + +Workers automatically detect and use all configured servers: + +```python +from conductor.client.automator.task_handler import TaskHandler + +# Auto-detects multi-homed config from env vars +handler = TaskHandler(scan_for_annotated_workers=True) +handler.start_processes() +``` + +### Programmatic Configuration + +```python +from conductor.client.configuration.configuration import Configuration +from conductor.client.configuration.settings.authentication_settings import AuthenticationSettings +from conductor.client.automator.task_handler import TaskHandler + +configs = [ + Configuration( + server_api_url="https://east.example.com/api", + authentication_settings=AuthenticationSettings(key_id="key1", key_secret="secret1") + ), + Configuration( + server_api_url="https://west.example.com/api", + authentication_settings=AuthenticationSettings(key_id="key2", key_secret="secret2") + ), +] + +handler = TaskHandler(configuration=configs, scan_for_annotated_workers=True) +handler.start_processes() +``` + +### How It Works + +| Behavior | Description | +|----------|-------------| +| **Parallel Polling** | Workers poll all servers simultaneously each cycle | +| **Task Tracking** | Each task is tracked to its originating server | +| **Correct Routing** | Updates are routed back to the correct server | +| **Metadata Registration** | Task definitions are registered to all servers | +| **Backward Compatible** | Single server configuration still works unchanged | + +### Validation + +The SDK validates that credential counts match server counts: + +```bash +# This will raise ValueError at startup: +export 
CONDUCTOR_SERVER_URL=https://east.example.com/api,https://west.example.com/api +export CONDUCTOR_AUTH_KEY=key1 # Only 1 key for 2 servers - ERROR! +``` + ## Task Domains + Workers can be configured to start polling for work that is tagged by a task domain. See more on domains [here](https://orkes.io/content/developer-guides/task-to-domain). diff --git a/examples/README.md b/examples/README.md index 0b7366f7..5ebc2f38 100644 --- a/examples/README.md +++ b/examples/README.md @@ -24,6 +24,7 @@ python examples/workers_e2e.py | File | Description | Run | |------|-------------|-----| | **workers_e2e.py** | โญ Start here - sync + async workers | `python examples/workers_e2e.py` | +| **multi_homed_workers.py** | Poll from multiple servers (HA) | `python examples/multi_homed_workers.py` | | **worker_example.py** | Comprehensive patterns (None returns, TaskInProgress) | `python examples/worker_example.py` | | **worker_configuration_example.py** | Hierarchical configuration (env vars) | `python examples/worker_configuration_example.py` | | **task_context_example.py** | Task context (logs, poll_count, task_id) | `python examples/task_context_example.py` | @@ -202,6 +203,7 @@ curl http://localhost:8000/metrics examples/ โ”œโ”€โ”€ Core Workers โ”‚ โ”œโ”€โ”€ workers_e2e.py # โญ Start here +โ”‚ โ”œโ”€โ”€ multi_homed_workers.py # Multi-server HA โ”‚ โ”œโ”€โ”€ worker_example.py # Comprehensive patterns โ”‚ โ”œโ”€โ”€ worker_configuration_example.py # Env var configuration โ”‚ โ”œโ”€โ”€ task_context_example.py # Long-running tasks diff --git a/examples/multi_homed_workers.py b/examples/multi_homed_workers.py new file mode 100644 index 00000000..4074e0c5 --- /dev/null +++ b/examples/multi_homed_workers.py @@ -0,0 +1,205 @@ +""" +Multi-Homed Workers Example + +This example demonstrates how to configure workers to poll tasks from +multiple Conductor servers simultaneously for high availability. 
+ +Multi-homed workers provide: +- Disaster recovery: If one server goes down, workers continue polling others +- Active-active deployments: Workers poll all servers in parallel +- Geographic distribution: Poll from servers in different regions +- Built-in resilience: Circuit breaker (skip down servers), timeouts, and rapid recovery + +Usage: +------ +Option 1: Environment Variables (recommended for production) + export CONDUCTOR_SERVER_URL=https://east.example.com/api,https://west.example.com/api + export CONDUCTOR_AUTH_KEY=key1,key2 + export CONDUCTOR_AUTH_SECRET=secret1,secret2 + python multi_homed_workers.py + +Option 2: Programmatic Configuration + python multi_homed_workers.py --programmatic + +Option 3: Mixed (single server, backward compatible) + export CONDUCTOR_SERVER_URL=https://conductor.example.com/api + python multi_homed_workers.py +""" + +import argparse +import logging +import os +import sys +import time + +from conductor.client.automator.task_handler import TaskHandler +from conductor.client.configuration.configuration import Configuration +from conductor.client.configuration.settings.authentication_settings import AuthenticationSettings +from conductor.client.worker.worker_task import worker_task + +logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') +logger = logging.getLogger(__name__) + + +# ============================================================================= +# Worker Definitions +# ============================================================================= + +@worker_task(task_definition_name='multi_homed_example_task') +def example_worker(name: str) -> dict: + """ + Simple worker that processes tasks from any configured server. 
+ + The SDK automatically: + - Polls all configured servers in parallel + - Tracks which server each task came from + - Routes updates back to the originating server + """ + logger.info(f"Processing task for: {name}") + return { + 'message': f'Hello {name}!', + 'processed_by': os.getpid(), + 'timestamp': time.time() + } + + +@worker_task(task_definition_name='multi_homed_async_task', thread_count=10) +async def async_example_worker(data: dict) -> dict: + """ + Async worker with high concurrency - also works with multi-homed servers. + """ + import asyncio + await asyncio.sleep(0.1) # Simulate async I/O + return { + 'processed': True, + 'input_keys': list(data.keys()) if data else [] + } + + +# ============================================================================= +# Configuration Examples +# ============================================================================= + +def get_env_var_configuration(): + """ + Uses comma-separated environment variables for multi-homed configuration. + + Environment Variables: + CONDUCTOR_SERVER_URL: comma-separated server URLs + CONDUCTOR_AUTH_KEY: comma-separated auth keys (must match server count) + CONDUCTOR_AUTH_SECRET: comma-separated auth secrets (must match server count) + + Returns: + List of Configuration objects (auto-parsed from env vars) + """ + # Configuration.from_env_multi() automatically parses comma-separated env vars + configs = Configuration.from_env_multi() + + logger.info(f"Loaded {len(configs)} server configuration(s) from environment") + for i, cfg in enumerate(configs): + has_auth = "Yes" if cfg.authentication_settings else "No" + logger.info(f" Server {i+1}: {cfg.host} (Auth: {has_auth})") + + return configs + + +def get_programmatic_configuration(): + """ + Creates multi-homed configuration programmatically. 
+ + This approach is useful when: + - Servers are discovered dynamically + - Different servers need different SSL/proxy settings + - Configuration comes from a secrets manager + + Returns: + List of Configuration objects + """ + # Example: Configure two servers + configs = [ + Configuration( + server_api_url="http://localhost:8080/api", + # Optional: Add authentication + # authentication_settings=AuthenticationSettings( + # key_id="your-key-1", + # key_secret="your-secret-1" + # ), + debug=False + ), + Configuration( + server_api_url="http://localhost:8081/api", + # authentication_settings=AuthenticationSettings( + # key_id="your-key-2", + # key_secret="your-secret-2" + # ), + debug=False + ), + ] + + logger.info(f"Created {len(configs)} server configuration(s) programmatically") + for i, cfg in enumerate(configs): + logger.info(f" Server {i+1}: {cfg.host}") + + return configs + + +# ============================================================================= +# Main +# ============================================================================= + +def main(): + parser = argparse.ArgumentParser(description='Multi-Homed Workers Example') + parser.add_argument( + '--programmatic', + action='store_true', + help='Use programmatic configuration instead of environment variables' + ) + parser.add_argument( + '--duration', + type=int, + default=60, + help='Duration to run workers in seconds (default: 60)' + ) + args = parser.parse_args() + + # Get configuration + if args.programmatic: + logger.info("Using programmatic configuration...") + configs = get_programmatic_configuration() + else: + logger.info("Using environment variable configuration...") + configs = get_env_var_configuration() + + # Display multi-homed status + if len(configs) > 1: + logger.info(f"๐ŸŒ MULTI-HOMED MODE: Workers will poll {len(configs)} servers in parallel") + else: + logger.info(f"๐Ÿ“ก SINGLE SERVER MODE: Workers will poll 1 server") + + # Create TaskHandler with multi-homed configuration + 
task_handler = TaskHandler( + configuration=configs, # Pass list of configurations + scan_for_annotated_workers=True + ) + + try: + # Start workers + logger.info("Starting worker processes...") + task_handler.start_processes() + + logger.info(f"Workers running. Will stop after {args.duration} seconds.") + logger.info("Create workflows with tasks 'multi_homed_example_task' or 'multi_homed_async_task' to test.") + + # Run for specified duration + time.sleep(args.duration) + + except KeyboardInterrupt: + logger.info("Interrupted by user") + finally: + logger.info("Stopping workers...") + task_handler.stop_processes() + logger.info("Workers stopped") + + +if __name__ == '__main__': + main() diff --git a/src/conductor/client/automator/async_task_runner.py b/src/conductor/client/automator/async_task_runner.py index 85410c7b..5334e309 100644 --- a/src/conductor/client/automator/async_task_runner.py +++ b/src/conductor/client/automator/async_task_runner.py @@ -67,7 +67,7 @@ class AsyncTaskRunner: def __init__( self, worker: WorkerInterface, - configuration: Configuration = None, + configuration: Configuration = None, # Accepts single or list for multi-homed metrics_settings: MetricsSettings = None, event_listeners: list = None ): @@ -75,9 +75,15 @@ def __init__( raise Exception("Invalid worker") self.worker = worker self.__set_worker_properties() - if not isinstance(configuration, Configuration): - configuration = Configuration() - self.configuration = configuration + + # Normalize configuration to list (multi-homed support) + # Accepts: None, single Configuration, or List[Configuration] + if configuration is None: + self.configurations = [Configuration()] + elif isinstance(configuration, list): + self.configurations = configuration + else: + self.configurations = [configuration] # Set up event dispatcher and register listeners (same as TaskRunner) self.event_dispatcher = SyncEventDispatcher[TaskRunnerEvent]() @@ -93,14 +99,23 @@ def __init__( # Register metrics collector 
as event listener register_task_runner_listener(self.metrics_collector, self.event_dispatcher) - # Don't create async HTTP client here - will be created in subprocess + # Don't create async HTTP clients here - will be created in subprocess # httpx.AsyncClient is not picklable, so we defer creation until after fork - self.async_api_client = None - self.async_task_client = None - - # Auth failure backoff tracking (same as TaskRunner) - self._auth_failures = 0 - self._last_auth_failure = 0 + self.async_task_clients = [] # One per server (multi-homed) + + # Track which server each task came from: {task_id: server_index} + self._task_server_map = {} + + # Auth failure backoff tracking per server (multi-homed) + self._auth_failures = [0] * len(self.configurations) + self._last_auth_failure = [0] * len(self.configurations) + + # Circuit breaker per server (for multi-homed resilience) + self._server_failures = [0] * len(self.configurations) + self._circuit_open_until = [0.0] * len(self.configurations) + self._CIRCUIT_FAILURE_THRESHOLD = 3 # failures before opening circuit + self._CIRCUIT_RESET_SECONDS = 30 # seconds before half-open retry + self._POLL_TIMEOUT_SECONDS = 5 # max time to wait for any server poll # Polling state tracking (same as TaskRunner) self._max_workers = getattr(worker, 'thread_count', 1) # Max concurrent tasks @@ -114,21 +129,24 @@ def __init__( async def run(self) -> None: """Main async loop - runs continuously in single event loop.""" - if self.configuration is not None: - self.configuration.apply_logging_config() + # Apply logging config from primary configuration + if self.configurations: + self.configurations[0].apply_logging_config() else: logger.setLevel(logging.DEBUG) - # Create async HTTP client in subprocess (after fork) - # This must be done here because httpx.AsyncClient is not picklable - self.async_api_client = AsyncApiClient( - configuration=self.configuration, - metrics_collector=self.metrics_collector - ) - - self.async_task_client = 
AsyncTaskResourceApi( - api_client=self.async_api_client - ) + # Create async HTTP clients for all servers after fork (multi-homed support) + # httpx.AsyncClient is not picklable, so we must create them here + self.async_task_clients = [] + for cfg in self.configurations: + async_api_client = AsyncApiClient( + configuration=cfg, + metrics_collector=self.metrics_collector + ) + async_task_client = AsyncTaskResourceApi( + api_client=async_api_client + ) + self.async_task_clients.append(async_task_client) # Create semaphore in the event loop (must be created within the loop) self._semaphore = asyncio.Semaphore(self._max_workers) @@ -173,13 +191,15 @@ async def _cleanup(self) -> None: except AttributeError: pass # No tasks to cancel - # Close async HTTP client - if self.async_api_client: + # Close async HTTP clients for all servers (multi-homed support) + for i, task_client in enumerate(self.async_task_clients): try: - await self.async_api_client.close() - logger.debug("Async API client closed successfully") + await task_client.api_client.close() + logger.debug(f"Async API client {i + 1} closed successfully") except (IOError, OSError) as e: - logger.warning(f"Error closing async client: {e}") + logger.warning(f"Error closing async client {i + 1}: {e}") + except AttributeError: + pass # No client to close # Clear event listeners self.event_dispatcher = None @@ -197,7 +217,9 @@ async def __aexit__(self, exc_type, exc_val, exc_tb): async def __async_register_task_definition(self) -> None: """ - Register task definition with Conductor server (if register_task_def=True). + Register task definition with Conductor server(s) (if register_task_def=True). + + In multi-homed mode, registers to ALL configured servers. Automatically creates/updates: 1. 
Task definition with basic metadata or provided TaskDef configuration @@ -213,211 +235,102 @@ async def __async_register_task_definition(self) -> None: task_name = self.worker.get_task_definition_name() logger.info("=" * 80) - logger.info(f"Registering task definition: {task_name}") + logger.info(f"Registering task definition: {task_name} to {len(self.configurations)} server(s)") logger.info("=" * 80) - try: - # Create metadata client (sync client works in async context) - logger.debug(f"Creating metadata client for task registration...") - metadata_client = OrkesMetadataClient(self.configuration) - - # Generate JSON schemas from function signature (if worker has execute_function) - input_schema_name = None - output_schema_name = None - schema_registry_available = True - - if hasattr(self.worker, 'execute_function'): - logger.info(f"Generating JSON schemas from function signature...") - # Pass strict_schema flag to control additionalProperties - strict_mode = getattr(self.worker, 'strict_schema', False) - logger.debug(f" strict_schema mode: {strict_mode}") - schemas = generate_json_schema_from_function(self.worker.execute_function, task_name, strict_schema=strict_mode) - - if schemas: - has_input_schema = schemas.get('input') is not None - has_output_schema = schemas.get('output') is not None - - if has_input_schema or has_output_schema: - logger.info(f" โœ“ Generated schemas: input={'Yes' if has_input_schema else 'No'}, output={'Yes' if has_output_schema else 'No'}") - else: - logger.info(f" โš  No schemas generated (type hints not fully supported)") - # Register schemas with schema client - try: - logger.debug(f"Creating schema client...") - schema_client = OrkesSchemaClient(self.configuration) - except Exception as e: - # Schema client not available (server doesn't support schemas) - logger.warning(f"โš  Schema registry not available on server - task will be registered without schemas") - logger.debug(f" Error: {e}") - schema_registry_available = False - 
schema_client = None - - if schema_registry_available and schema_client: - logger.info(f"Registering JSON schemas...") - try: - # Register input schema - if schemas.get('input'): - input_schema_name = f"{task_name}_input" - try: - # Register schema (overwrite if exists) - input_schema_def = SchemaDef( - name=input_schema_name, - version=1, - type=SchemaType.JSON, - data=schemas['input'] - ) - schema_client.register_schema(input_schema_def) - logger.info(f" โœ“ Registered input schema: {input_schema_name} (v1)") - - except Exception as e: - # Check if this is a 404 (API endpoint doesn't exist on server) - if hasattr(e, 'status') and e.status == 404: - logger.warning(f"โš  Schema registry API not available on server (404) - task will be registered without schemas") - schema_registry_available = False - input_schema_name = None - else: - # Other error - log and continue without this schema - logger.warning(f"โš  Could not register input schema '{input_schema_name}': {e}") - input_schema_name = None - - # Register output schema (only if schema registry is available) - if schema_registry_available and schemas.get('output'): - output_schema_name = f"{task_name}_output" - try: - # Register schema (overwrite if exists) - output_schema_def = SchemaDef( - name=output_schema_name, - version=1, - type=SchemaType.JSON, - data=schemas['output'] - ) - schema_client.register_schema(output_schema_def) - logger.info(f" โœ“ Registered output schema: {output_schema_name} (v1)") - - except Exception as e: - # Check if this is a 404 (API endpoint doesn't exist on server) - if hasattr(e, 'status') and e.status == 404: - logger.warning(f"โš  Schema registry API not available on server (404)") - schema_registry_available = False - else: - # Other error - log and continue without this schema - logger.warning(f"โš  Could not register output schema '{output_schema_name}': {e}") - output_schema_name = None - - except Exception as e: - logger.debug(f"Could not register schemas for {task_name}: 
{e}") - else: - logger.info(f" โš  No schemas generated (unable to analyze function signature)") - else: - logger.info(f" โš  Class-based worker (no execute_function) - registering task without schemas") - - # Create task definition - logger.info(f"Creating task definition for '{task_name}'...") - - # Check if task_def_template is provided - logger.debug(f" task_def_template present: {hasattr(self.worker, 'task_def_template')}") - if hasattr(self.worker, 'task_def_template'): - logger.debug(f" task_def_template value: {self.worker.task_def_template}") - - # Use provided task_def template if available, otherwise create minimal TaskDef - if hasattr(self.worker, 'task_def_template') and self.worker.task_def_template: - logger.info(f" Using provided TaskDef configuration:") - - # Create a copy to avoid mutating the original - import copy - task_def = copy.deepcopy(self.worker.task_def_template) - - # Override name to ensure consistency - task_def.name = task_name - - # Log configuration being applied - if task_def.retry_count: - logger.info(f" - retry_count: {task_def.retry_count}") - if task_def.retry_logic: - logger.info(f" - retry_logic: {task_def.retry_logic}") - if task_def.timeout_seconds: - logger.info(f" - timeout_seconds: {task_def.timeout_seconds}") - if task_def.timeout_policy: - logger.info(f" - timeout_policy: {task_def.timeout_policy}") - if task_def.response_timeout_seconds: - logger.info(f" - response_timeout_seconds: {task_def.response_timeout_seconds}") - if task_def.concurrent_exec_limit: - logger.info(f" - concurrent_exec_limit: {task_def.concurrent_exec_limit}") - if task_def.rate_limit_per_frequency: - logger.info(f" - rate_limit: {task_def.rate_limit_per_frequency}/{task_def.rate_limit_frequency_in_seconds}s") - else: - # Create minimal task definition - logger.info(f" Creating minimal TaskDef (no custom configuration)") - task_def = TaskDef(name=task_name) - - # Link schemas if they were generated (overrides any schemas in task_def_template) - 
if input_schema_name: - task_def.input_schema = {"name": input_schema_name, "version": 1} - logger.debug(f" Linked input schema: {input_schema_name}") - if output_schema_name: - task_def.output_schema = {"name": output_schema_name, "version": 1} - logger.debug(f" Linked output schema: {output_schema_name}") - - # Register/update task definition - # Behavior depends on overwrite_task_def flag - overwrite = getattr(self.worker, 'overwrite_task_def', True) - logger.debug(f" overwrite_task_def: {overwrite}") + # Generate JSON schemas once (same for all servers) + input_schema_name = None + output_schema_name = None + schemas = None + + if hasattr(self.worker, 'execute_function'): + logger.info(f"Generating JSON schemas from function signature...") + strict_mode = getattr(self.worker, 'strict_schema', False) + logger.debug(f" strict_schema mode: {strict_mode}") + schemas = generate_json_schema_from_function(self.worker.execute_function, task_name, strict_schema=strict_mode) + + if schemas: + has_input_schema = schemas.get('input') is not None + has_output_schema = schemas.get('output') is not None + if has_input_schema or has_output_schema: + logger.info(f" โœ“ Generated schemas: input={'Yes' if has_input_schema else 'No'}, output={'Yes' if has_output_schema else 'No'}") + input_schema_name = f"{task_name}_input" if has_input_schema else None + output_schema_name = f"{task_name}_output" if has_output_schema else None + + # Build task definition once (same for all servers) + if hasattr(self.worker, 'task_def_template') and self.worker.task_def_template: + import copy + task_def = copy.deepcopy(self.worker.task_def_template) + task_def.name = task_name + else: + task_def = TaskDef(name=task_name) - try: - # Debug: Log the TaskDef being sent - logger.debug(f" Sending TaskDef to server:") - logger.debug(f" Name: {task_def.name}") - logger.debug(f" retry_count: {task_def.retry_count}") - logger.debug(f" retry_logic: {task_def.retry_logic}") - logger.debug(f" timeout_policy: 
{task_def.timeout_policy}") - logger.debug(f" Full to_dict(): {task_def.to_dict()}") - - if overwrite: - # Use update_task_def to overwrite existing definitions - logger.debug(f" Using update_task_def (overwrite=True)") - metadata_client.update_task_def(task_def=task_def) - else: - # Check if task exists, only create if it doesn't - logger.debug(f" Checking if task exists before creating (overwrite=False)") - try: - existing = metadata_client.get_task_def(task_name) - if existing: - logger.info(f"โœ“ Task definition '{task_name}' already exists - skipping (overwrite=False)") - logger.info(f" View at: {self.configuration.ui_host}/taskDef/{task_name}") - return - except Exception: - # Task doesn't exist, proceed to register - pass - metadata_client.register_task_def(task_def=task_def) - - # Print success message with link - task_def_url = f"{self.configuration.ui_host}/taskDef/{task_name}" - logger.info(f"โœ“ Registered/Updated task definition: {task_name} with {task_def.to_dict()}") - logger.info(f" View at: {task_def_url}") - - if input_schema_name or output_schema_name: - schema_count = sum([1 for s in [input_schema_name, output_schema_name] if s]) - logger.info(f" With {schema_count} JSON schema(s): {', '.join(filter(None, [input_schema_name, output_schema_name]))}") + # Link schemas if generated + if input_schema_name: + task_def.input_schema = {"name": input_schema_name, "version": 1} + if output_schema_name: + task_def.output_schema = {"name": output_schema_name, "version": 1} - except Exception as e: - # If update fails (task doesn't exist), try register - try: - metadata_client.register_task_def(task_def=task_def) + overwrite = getattr(self.worker, 'overwrite_task_def', True) - task_def_url = f"{self.configuration.ui_host}/taskDef/{task_name}" - logger.info(f"โœ“ Registered task definition: {task_name}") - logger.info(f" View at: {task_def_url}") + # Register to each server + for server_idx, config in enumerate(self.configurations): + server_label = f"server 
{server_idx + 1}/{len(self.configurations)}" + try: + logger.info(f"Registering to {server_label}: {config.host}") + metadata_client = OrkesMetadataClient(config) - if input_schema_name or output_schema_name: - schema_count = sum([1 for s in [input_schema_name, output_schema_name] if s]) - logger.info(f" With {schema_count} JSON schema(s): {', '.join(filter(None, [input_schema_name, output_schema_name]))}") + # Register schemas if available + if schemas and (schemas.get('input') or schemas.get('output')): + try: + schema_client = OrkesSchemaClient(config) + if schemas.get('input'): + input_schema_def = SchemaDef( + name=input_schema_name, + version=1, + type=SchemaType.JSON, + data=schemas['input'] + ) + schema_client.register_schema(input_schema_def) + logger.debug(f" โœ“ Registered input schema on {server_label}") + if schemas.get('output'): + output_schema_def = SchemaDef( + name=output_schema_name, + version=1, + type=SchemaType.JSON, + data=schemas['output'] + ) + schema_client.register_schema(output_schema_def) + logger.debug(f" โœ“ Registered output schema on {server_label}") + except Exception as e: + logger.warning(f"โš  Could not register schemas on {server_label}: {e}") - except Exception as register_error: - logger.warning(f"โš  Could not register/update task definition '{task_name}': {register_error}") + # Register task definition + try: + if overwrite: + metadata_client.update_task_def(task_def=task_def) + else: + try: + existing = metadata_client.get_task_def(task_name) + if existing: + logger.info(f" โœ“ Task already exists on {server_label} - skipping (overwrite=False)") + continue + except Exception: + pass + metadata_client.register_task_def(task_def=task_def) + + logger.info(f" โœ“ Registered task definition on {server_label}") + + except Exception as e: + # Try register if update fails + try: + metadata_client.register_task_def(task_def=task_def) + logger.info(f" โœ“ Registered task definition on {server_label}") + except Exception as 
register_error: + logger.warning(f"โš  Could not register task on {server_label}: {register_error}") - except Exception as e: - # Don't crash worker if registration fails - just log warning - logger.warning(f"Failed to register task definition for {task_name}: {e}") + except Exception as e: + logger.warning(f"Failed to register task definition on {server_label}: {e}") async def run_once(self) -> None: """Execute one iteration of the polling loop (async version).""" @@ -472,96 +385,167 @@ async def run_once(self) -> None: logger.error("Error in run_once: %s", traceback.format_exc()) async def __async_batch_poll(self, count: int) -> list: - """Async batch poll for multiple tasks (async version of TaskRunner.__batch_poll_tasks).""" + """Async batch poll for multiple tasks from all servers in parallel (multi-homed support).""" task_definition_name = self.worker.get_task_definition_name() if self.worker.paused: logger.debug("Stop polling task for: %s", task_definition_name) return [] - # Apply exponential backoff if we have recent auth failures (same as TaskRunner) - if self._auth_failures > 0: - now = time.time() - backoff_seconds = min(2 ** self._auth_failures, 60) - time_since_last_failure = now - self._last_auth_failure - if time_since_last_failure < backoff_seconds: - await asyncio.sleep(0.1) - return [] + if not self.async_task_clients: + return [] + + # Divide capacity across servers evenly + # e.g., count=10, clients=3 -> [4, 3, 3] + total_servers = len(self.async_task_clients) + base_count = count // total_servers + remainder = count % total_servers - # Publish PollStarted event (same as TaskRunner:245) + # Publish PollStarted event self.event_dispatcher.publish(PollStarted( task_type=task_definition_name, worker_id=self.worker.get_identity(), poll_count=count )) - try: - start_time = time.time() - domain = self.worker.get_domain() - params = { - "workerid": self.worker.get_identity(), - "count": count, - "timeout": 100 # ms - } - # Only add domain if it's not 
None and not empty string - if domain is not None and domain != "": - params["domain"] = domain - - # Async batch poll - tasks = await self.async_task_client.batch_poll(tasktype=task_definition_name, **params) + start_time = time.time() + all_tasks = [] + domain = self.worker.get_domain() - finish_time = time.time() - time_spent = finish_time - start_time - - # Publish PollCompleted event (same as TaskRunner:268) - self.event_dispatcher.publish(PollCompleted( - task_type=task_definition_name, - duration_ms=time_spent * 1000, - tasks_received=len(tasks) if tasks else 0 - )) + async def poll_single_server(server_idx: int, task_client) -> tuple: + """Poll a single server and return (server_idx, tasks, error).""" + now = time.time() + + # Calculate specific count for this server + # Distribute remainder to first N servers + server_count = base_count + (1 if server_idx < remainder else 0) + + # Don't poll if count is 0 + if server_count <= 0: + return (server_idx, [], None) + + # Circuit breaker: skip if circuit is open + if self._circuit_open_until[server_idx] > now: + # Circuit is open, skip this server + return (server_idx, [], None) + + # Check per-server auth backoff + if self._auth_failures[server_idx] > 0: + backoff_seconds = min(2 ** self._auth_failures[server_idx], 60) + time_since_failure = now - self._last_auth_failure[server_idx] + if time_since_failure < backoff_seconds: + return (server_idx, [], None) - # Success - reset auth failure counter - if tasks: - self._auth_failures = 0 + try: + params = { + "workerid": self.worker.get_identity(), + "count": server_count, + "timeout": 100 # ms + } + if domain is not None and domain != "": + params["domain"] = domain + + tasks = await task_client.batch_poll(tasktype=task_definition_name, **params) + + # Reset failures on success + if tasks: + self._auth_failures[server_idx] = 0 + self._server_failures[server_idx] = 0 + + return (server_idx, tasks or [], None) + + except AuthorizationException as auth_exception: + 
self._auth_failures[server_idx] += 1 + self._last_auth_failure[server_idx] = time.time() + backoff = min(2 ** self._auth_failures[server_idx], 60) + logger.error( + f"Auth failure polling server {server_idx} for {task_definition_name}: " + f"{auth_exception.error_code} (backoff: {backoff}s)" + ) + return (server_idx, [], auth_exception) - return tasks if tasks else [] + except Exception as e: + # Increment failure count for circuit breaker + self._server_failures[server_idx] += 1 + if self._server_failures[server_idx] >= self._CIRCUIT_FAILURE_THRESHOLD: + self._circuit_open_until[server_idx] = time.time() + self._CIRCUIT_RESET_SECONDS + logger.warning( + f"Circuit breaker OPEN for server {server_idx} after {self._server_failures[server_idx]} failures. " + f"Will retry in {self._CIRCUIT_RESET_SECONDS}s" + ) + else: + logger.error( + f"Failed to poll server {server_idx} for {task_definition_name}: {e} " + f"(failure {self._server_failures[server_idx]}/{self._CIRCUIT_FAILURE_THRESHOLD})" + ) + return (server_idx, [], e) + + # Single server: poll directly + if len(self.async_task_clients) == 1: + server_idx, tasks, error = await poll_single_server(0, self.async_task_clients[0]) + + if error: + self.event_dispatcher.publish(PollFailure( + task_type=task_definition_name, + cause=error, + duration_ms=int((time.time() - start_time) * 1000) + )) + + for task in tasks: + if task and task.task_id: + self._task_server_map[task.task_id] = server_idx + all_tasks.append(task) + else: + # Multi-homed: poll all servers in parallel with timeout + try: + results = await asyncio.wait_for( + asyncio.gather(*[ + poll_single_server(idx, client) + for idx, client in enumerate(self.async_task_clients) + ], return_exceptions=True), + timeout=self._POLL_TIMEOUT_SECONDS + ) + + # Merge results and track task-to-server mapping + for result in results: + if isinstance(result, Exception): + logger.debug(f"Poll exception: {result}") + continue + server_idx, tasks, error = result + + if error: + 
self.event_dispatcher.publish(PollFailure( + task_type=task_definition_name, + cause=error, + duration_ms=int((time.time() - start_time) * 1000) + )) + + for task in tasks: + if task and task.task_id: + self._task_server_map[task.task_id] = server_idx + all_tasks.append(task) + + except asyncio.TimeoutError: + # Some servers didn't respond in time - continue with tasks we have + logger.debug( + f"Poll timeout after {self._POLL_TIMEOUT_SECONDS}s - some servers did not respond" + ) - except AuthorizationException as auth_exception: - self._auth_failures += 1 - self._last_auth_failure = time.time() - backoff_seconds = min(2 ** self._auth_failures, 60) + finish_time = time.time() + time_spent = finish_time - start_time - # Publish PollFailure event (same as TaskRunner:286) - self.event_dispatcher.publish(PollFailure( - task_type=task_definition_name, - duration_ms=(time.time() - start_time) * 1000, - cause=auth_exception - )) + # Publish PollCompleted event + self.event_dispatcher.publish(PollCompleted( + task_type=task_definition_name, + duration_ms=time_spent * 1000, + tasks_received=len(all_tasks) + )) - if auth_exception.invalid_token: - logger.error( - f"Failed to batch poll task {task_definition_name} due to invalid auth token " - f"(failure #{self._auth_failures}). Will retry with exponential backoff ({backoff_seconds}s). " - "Please check your CONDUCTOR_AUTH_KEY and CONDUCTOR_AUTH_SECRET." - ) - else: - logger.error( - f"Failed to batch poll task {task_definition_name} error: {auth_exception.status} - {auth_exception.error_code} " - f"(failure #{self._auth_failures}). Will retry with exponential backoff ({backoff_seconds}s)." 
- ) - return [] - except Exception as e: - # Publish PollFailure event (same as TaskRunner:306) - self.event_dispatcher.publish(PollFailure( - task_type=task_definition_name, - duration_ms=(time.time() - start_time) * 1000, - cause=e - )) - logger.error( - "Failed to batch poll task for: %s, reason: %s", - task_definition_name, - traceback.format_exc() + if len(self.async_task_clients) > 1 and all_tasks: + logger.debug( + f"Polled {len(all_tasks)} tasks from {len(self.async_task_clients)} servers for {task_definition_name}" ) - return [] + + return all_tasks async def __async_execute_and_update_task(self, task: Task) -> None: """Execute task and update result (async version - runs in event loop, not thread pool).""" @@ -786,17 +770,22 @@ def __merge_context_modifications(self, task_result: TaskResult, context_result: task_result.output_data = context_result.output_data async def __async_update_task(self, task_result: TaskResult): - """Async update task result (async version of TaskRunner.__update_task).""" + """Async update task result to the correct server (multi-homed support).""" if not isinstance(task_result, TaskResult): return None task_definition_name = self.worker.get_task_definition_name() + + # Get the correct server for this task (multi-homed support) + server_idx = self._task_server_map.pop(task_result.task_id, 0) + task_client = self.async_task_clients[server_idx] + logger.debug( - "Updating async task, id: %s, workflow_instance_id: %s, task_definition_name: %s, status: %s, output_data: %s", + "Updating async task, id: %s, workflow_instance_id: %s, task_definition_name: %s, status: %s, server: %d", task_result.task_id, task_result.workflow_instance_id, task_definition_name, task_result.status, - task_result.output_data + server_idx ) last_exception = None @@ -808,7 +797,7 @@ async def __async_update_task(self, task_result: TaskResult): # Exponential backoff: [10s, 20s, 30s] before retry await asyncio.sleep(attempt * 10) try: - response = await 
self.async_task_client.update_task(body=task_result) + response = await task_client.update_task(body=task_result) logger.debug( "Updated async task, id: %s, workflow_instance_id: %s, task_definition_name: %s, response: %s", task_result.task_id, @@ -824,12 +813,13 @@ async def __async_update_task(self, task_result: TaskResult): task_definition_name, type(e) ) logger.error( - "Failed to update async task (attempt %d/%d), id: %s, workflow_instance_id: %s, task_definition_name: %s, reason: %s", + "Failed to update async task (attempt %d/%d), id: %s, workflow_instance_id: %s, task_definition_name: %s, server: %d, reason: %s", attempt + 1, retry_count, task_result.task_id, task_result.workflow_instance_id, task_definition_name, + server_idx, traceback.format_exc() ) diff --git a/src/conductor/client/automator/task_handler.py b/src/conductor/client/automator/task_handler.py index 3185b4ae..6a8a5147 100644 --- a/src/conductor/client/automator/task_handler.py +++ b/src/conductor/client/automator/task_handler.py @@ -6,7 +6,7 @@ import os from multiprocessing import Process, freeze_support, Queue, set_start_method from sys import platform -from typing import List, Optional +from typing import List, Optional, Union from conductor.client.automator.task_runner import TaskRunner from conductor.client.automator.async_task_runner import AsyncTaskRunner @@ -148,14 +148,28 @@ def process_data(data: dict) -> dict: def __init__( self, workers: Optional[List[WorkerInterface]] = None, - configuration: Optional[Configuration] = None, + configuration: Optional[Union[Configuration, List[Configuration]]] = None, metrics_settings: Optional[MetricsSettings] = None, scan_for_annotated_workers: bool = True, import_modules: Optional[List[str]] = None, event_listeners: Optional[List] = None ): workers = workers or [] - self.logger_process, self.queue = _setup_logging_queue(configuration) + + # Normalize configuration to list (multi-homed support) + # If no config provided, try comma-separated env 
vars first + if configuration is None: + self.configurations = Configuration.from_env_multi() + elif isinstance(configuration, list): + self.configurations = configuration + else: + self.configurations = [configuration] + + if len(self.configurations) > 1: + server_urls = [c.host for c in self.configurations] + logger.info(f"Multi-homed mode enabled: {len(self.configurations)} servers - {server_urls}") + + self.logger_process, self.queue = _setup_logging_queue(self.configurations[0]) # Set prometheus multiprocess directory BEFORE any worker processes start # This must be done before prometheus_client is imported in worker processes @@ -218,9 +232,9 @@ def __init__( logger.debug("created worker with name=%s and domain=%s", task_def_name, resolved_config['domain']) workers.append(worker) - self.__create_task_runner_processes(workers, configuration, metrics_settings) + self.__create_task_runner_processes(workers, self.configurations, metrics_settings) self.__create_metrics_provider_process(metrics_settings) - logger.info("TaskHandler initialized") + logger.info(f"TaskHandler initialized with {len(self.configurations)} server(s)") def __enter__(self): return self @@ -260,21 +274,21 @@ def __create_metrics_provider_process(self, metrics_settings: MetricsSettings) - def __create_task_runner_processes( self, workers: List[WorkerInterface], - configuration: Configuration, + configurations: List[Configuration], metrics_settings: MetricsSettings ) -> None: self.task_runner_processes = [] self.workers = [] for worker in workers: self.__create_task_runner_process( - worker, configuration, metrics_settings + worker, configurations, metrics_settings ) self.workers.append(worker) def __create_task_runner_process( self, worker: WorkerInterface, - configuration: Configuration, + configurations: List[Configuration], metrics_settings: MetricsSettings ) -> None: # Detect if worker function is async @@ -288,17 +302,18 @@ def __create_task_runner_process( # Class-based worker 
(implements WorkerInterface) is_async_worker = inspect.iscoroutinefunction(worker.execute) + server_count = len(configurations) if is_async_worker: # Use AsyncTaskRunner for async def workers - async_task_runner = AsyncTaskRunner(worker, configuration, metrics_settings, self.event_listeners) + async_task_runner = AsyncTaskRunner(worker, configurations, metrics_settings, self.event_listeners) # Wrap async runner in a sync function for multiprocessing process = Process(target=self.__run_async_runner, args=(async_task_runner,)) - logger.debug(f"Created AsyncTaskRunner for async worker: {worker.get_task_definition_name()}") + logger.debug(f"Created AsyncTaskRunner for async worker: {worker.get_task_definition_name()} ({server_count} server(s))") else: # Use TaskRunner for sync def workers - task_runner = TaskRunner(worker, configuration, metrics_settings, self.event_listeners) + task_runner = TaskRunner(worker, configurations, metrics_settings, self.event_listeners) process = Process(target=task_runner.run) - logger.debug(f"Created TaskRunner for sync worker: {worker.get_task_definition_name()}") + logger.debug(f"Created TaskRunner for sync worker: {worker.get_task_definition_name()} ({server_count} server(s))") self.task_runner_processes.append(process) diff --git a/src/conductor/client/automator/task_runner.py b/src/conductor/client/automator/task_runner.py index 6165d0d9..949ac266 100644 --- a/src/conductor/client/automator/task_runner.py +++ b/src/conductor/client/automator/task_runner.py @@ -2,6 +2,7 @@ import logging import os import sys +import threading import time import traceback from concurrent.futures import ThreadPoolExecutor, as_completed @@ -46,7 +47,7 @@ class TaskRunner: def __init__( self, worker: WorkerInterface, - configuration: Configuration = None, + configuration: Configuration = None, # Accepts single or list for multi-homed metrics_settings: MetricsSettings = None, event_listeners: list = None ): @@ -54,9 +55,15 @@ def __init__( raise 
Exception("Invalid worker") self.worker = worker self.__set_worker_properties() - if not isinstance(configuration, Configuration): - configuration = Configuration() - self.configuration = configuration + + # Normalize configuration to list (multi-homed support) + # Accepts: None, single Configuration, or List[Configuration] + if configuration is None: + self.configurations = [Configuration()] + elif isinstance(configuration, list): + self.configurations = configuration + else: + self.configurations = [configuration] # Set up event dispatcher and register listeners self.event_dispatcher = SyncEventDispatcher[TaskRunnerEvent]() @@ -72,16 +79,32 @@ def __init__( # Register metrics collector as event listener register_task_runner_listener(self.metrics_collector, self.event_dispatcher) - self.task_client = TaskResourceApi( - ApiClient( - configuration=self.configuration, - metrics_collector=self.metrics_collector + # Create one task client per server (multi-homed support) + self.task_clients = [] + for cfg in self.configurations: + task_client = TaskResourceApi( + ApiClient( + configuration=cfg, + metrics_collector=self.metrics_collector + ) ) - ) - - # Auth failure backoff tracking to prevent retry storms - self._auth_failures = 0 - self._last_auth_failure = 0 + self.task_clients.append(task_client) + + # Track which server each task came from: {task_id: server_index} + # Thread-safe: accessed from poll thread and executor threads + self._task_server_map = {} + self._task_server_map_lock = threading.Lock() + + # Auth failure backoff tracking per server + self._auth_failures = [0] * len(self.configurations) + self._last_auth_failure = [0] * len(self.configurations) + + # Circuit breaker per server (for multi-homed resilience) + self._server_failures = [0] * len(self.configurations) + self._circuit_open_until = [0.0] * len(self.configurations) + self._CIRCUIT_FAILURE_THRESHOLD = 3 # failures before opening circuit + self._CIRCUIT_RESET_SECONDS = 30 # seconds before 
half-open retry + self._POLL_TIMEOUT_SECONDS = 5 # max time to wait for any server poll # Thread pool for concurrent task execution # thread_count from worker configuration controls concurrency @@ -92,10 +115,20 @@ def __init__( self._last_poll_time = 0 # Track last poll to avoid excessive polling when queue is empty self._consecutive_empty_polls = 0 # Track empty polls to implement backoff self._shutdown = False # Flag to indicate graceful shutdown + + # Reusable poll executor for multi-homed parallel polling + if len(self.configurations) > 1: + self._poll_executor = ThreadPoolExecutor( + max_workers=len(self.configurations), + thread_name_prefix=f"poll-{worker.get_task_definition_name()}" + ) + else: + self._poll_executor = None def run(self) -> None: - if self.configuration is not None: - self.configuration.apply_logging_config() + # Apply logging config from primary configuration + if self.configurations: + self.configurations[0].apply_logging_config() else: logger.setLevel(logging.DEBUG) @@ -139,16 +172,25 @@ def _cleanup(self) -> None: pass # No executor to shutdown except (RuntimeError, ValueError) as e: logger.warning(f"Error shutting down executor: {e}") + + # Shutdown poll executor if exists (multi-homed mode) + if self._poll_executor is not None: + try: + self._poll_executor.shutdown(wait=False, cancel_futures=True) + logger.debug("Poll executor shut down successfully") + except (RuntimeError, ValueError) as e: + logger.warning(f"Error shutting down poll executor: {e}") - # Close HTTP client (EAFP style) - try: - rest_client = self.task_client.api_client.rest_client - rest_client.close() - logger.debug("HTTP client closed successfully") - except AttributeError: - pass # No client to close or no close method - except (IOError, OSError) as e: - logger.warning(f"Error closing HTTP client: {e}") + # Close HTTP clients for all servers (multi-homed support) + for i, task_client in enumerate(self.task_clients): + try: + rest_client = 
task_client.api_client.rest_client + rest_client.close() + logger.debug(f"HTTP client {i + 1} closed successfully") + except AttributeError: + pass # No client to close or no close method + except (IOError, OSError) as e: + logger.warning(f"Error closing HTTP client {i + 1}: {e}") # Clear event listeners self.event_dispatcher = None @@ -166,7 +208,9 @@ def __exit__(self, exc_type, exc_val, exc_tb): def __register_task_definition(self) -> None: """ - Register task definition with Conductor server (if register_task_def=True). + Register task definition with Conductor server(s) (if register_task_def=True). + + In multi-homed mode, registers to ALL configured servers. Automatically creates/updates: 1. Task definition with basic metadata or provided TaskDef configuration @@ -181,212 +225,116 @@ def __register_task_definition(self) -> None: task_name = self.worker.get_task_definition_name() logger.info("=" * 80) - logger.info(f"Registering task definition: {task_name}") + logger.info(f"Registering task definition: {task_name} to {len(self.configurations)} server(s)") logger.info("=" * 80) - try: - # Create metadata client - logger.debug(f"Creating metadata client for task registration...") - metadata_client = OrkesMetadataClient(self.configuration) - - # Generate JSON schemas from function signature (if worker has execute_function) - input_schema_name = None - output_schema_name = None - schema_registry_available = True - - if hasattr(self.worker, 'execute_function'): - logger.info(f"Generating JSON schemas from function signature...") - # Pass strict_schema flag to control additionalProperties - strict_mode = getattr(self.worker, 'strict_schema', False) - logger.debug(f" strict_schema mode: {strict_mode}") - schemas = generate_json_schema_from_function(self.worker.execute_function, task_name, strict_schema=strict_mode) - - if schemas: - has_input_schema = schemas.get('input') is not None - has_output_schema = schemas.get('output') is not None - - if has_input_schema or 
has_output_schema: - logger.info(f" ✓ Generated schemas: input={'Yes' if has_input_schema else 'No'}, output={'Yes' if has_output_schema else 'No'}") - else: - logger.info(f" ⚠ No schemas generated (type hints not fully supported)") + # Generate JSON schemas once (same for all servers) + input_schema_name = None + output_schema_name = None + schemas = None + + if hasattr(self.worker, 'execute_function'): + logger.info(f"Generating JSON schemas from function signature...") + strict_mode = getattr(self.worker, 'strict_schema', False) + logger.debug(f" strict_schema mode: {strict_mode}") + schemas = generate_json_schema_from_function(self.worker.execute_function, task_name, strict_schema=strict_mode) + + if schemas: + has_input_schema = schemas.get('input') is not None + has_output_schema = schemas.get('output') is not None + if has_input_schema or has_output_schema: + logger.info(f" ✓ Generated schemas: input={'Yes' if has_input_schema else 'No'}, output={'Yes' if has_output_schema else 'No'}") + input_schema_name = f"{task_name}_input" if has_input_schema else None + output_schema_name = f"{task_name}_output" if has_output_schema else None + + # Build task definition once (same for all servers) + if hasattr(self.worker, 'task_def_template') and self.worker.task_def_template: + import copy + task_def = copy.deepcopy(self.worker.task_def_template) + task_def.name = task_name + else: + task_def = TaskDef(name=task_name) - # Register schemas with schema client - try: - logger.debug(f"Creating schema client...") - schema_client = OrkesSchemaClient(self.configuration) - except Exception as e: - # Schema client not available (server doesn't support schemas) - logger.warning(f"⚠ Schema registry not available on server - task will be registered without schemas") - logger.debug(f" Error: {e}") - schema_registry_available = False - schema_client = None - - if schema_registry_available and schema_client: - logger.info(f"Registering JSON schemas...") - try: - # Register 
input schema - if schemas.get('input'): - input_schema_name = f"{task_name}_input" - try: - # Register schema (overwrite if exists) - input_schema_def = SchemaDef( - name=input_schema_name, - version=1, - type=SchemaType.JSON, - data=schemas['input'] - ) - schema_client.register_schema(input_schema_def) - logger.info(f" โœ“ Registered input schema: {input_schema_name} (v1)") - - except Exception as e: - # Check if this is a 404 (API endpoint doesn't exist on server) - if hasattr(e, 'status') and e.status == 404: - logger.warning(f"โš  Schema registry API not available on server (404) - task will be registered without schemas") - schema_registry_available = False - input_schema_name = None - else: - # Other error - log and continue without this schema - logger.warning(f"โš  Could not register input schema '{input_schema_name}': {e}") - input_schema_name = None - - # Register output schema (only if schema registry is available) - if schema_registry_available and schemas.get('output'): - output_schema_name = f"{task_name}_output" - try: - # Register schema (overwrite if exists) - output_schema_def = SchemaDef( - name=output_schema_name, - version=1, - type=SchemaType.JSON, - data=schemas['output'] - ) - schema_client.register_schema(output_schema_def) - logger.info(f" โœ“ Registered output schema: {output_schema_name} (v1)") - - except Exception as e: - # Check if this is a 404 (API endpoint doesn't exist on server) - if hasattr(e, 'status') and e.status == 404: - logger.warning(f"โš  Schema registry API not available on server (404)") - schema_registry_available = False - else: - # Other error - log and continue without this schema - logger.warning(f"โš  Could not register output schema '{output_schema_name}': {e}") - output_schema_name = None - - except Exception as e: - logger.debug(f"Could not register schemas for {task_name}: {e}") - else: - logger.info(f" โš  No schemas generated (unable to analyze function signature)") - else: - logger.info(f" โš  Class-based 
worker (no execute_function) - registering task without schemas") - - # Create task definition - logger.info(f"Creating task definition for '{task_name}'...") - - # Check if task_def_template is provided - logger.debug(f" task_def_template present: {hasattr(self.worker, 'task_def_template')}") - if hasattr(self.worker, 'task_def_template'): - logger.debug(f" task_def_template value: {self.worker.task_def_template}") - - # Use provided task_def template if available, otherwise create minimal TaskDef - if hasattr(self.worker, 'task_def_template') and self.worker.task_def_template: - logger.info(f" Using provided TaskDef configuration:") - - # Create a copy to avoid mutating the original - import copy - task_def = copy.deepcopy(self.worker.task_def_template) - - # Override name to ensure consistency - task_def.name = task_name - - # Log configuration being applied - if task_def.retry_count: - logger.info(f" - retry_count: {task_def.retry_count}") - if task_def.retry_logic: - logger.info(f" - retry_logic: {task_def.retry_logic}") - if task_def.timeout_seconds: - logger.info(f" - timeout_seconds: {task_def.timeout_seconds}") - if task_def.timeout_policy: - logger.info(f" - timeout_policy: {task_def.timeout_policy}") - if task_def.response_timeout_seconds: - logger.info(f" - response_timeout_seconds: {task_def.response_timeout_seconds}") - if task_def.concurrent_exec_limit: - logger.info(f" - concurrent_exec_limit: {task_def.concurrent_exec_limit}") - if task_def.rate_limit_per_frequency: - logger.info(f" - rate_limit: {task_def.rate_limit_per_frequency}/{task_def.rate_limit_frequency_in_seconds}s") - else: - # Create minimal task definition - logger.info(f" Creating minimal TaskDef (no custom configuration)") - task_def = TaskDef(name=task_name) - - # Link schemas if they were generated (overrides any schemas in task_def_template) - if input_schema_name: - task_def.input_schema = {"name": input_schema_name, "version": 1} - logger.debug(f" Linked input schema: 
{input_schema_name}") - if output_schema_name: - task_def.output_schema = {"name": output_schema_name, "version": 1} - logger.debug(f" Linked output schema: {output_schema_name}") - - # Register/update task definition - # Behavior depends on overwrite_task_def flag - overwrite = getattr(self.worker, 'overwrite_task_def', True) - logger.debug(f" overwrite_task_def: {overwrite}") - try: - # Debug: Log the TaskDef being sent - logger.debug(f" Sending TaskDef to server:") - logger.debug(f" Name: {task_def.name}") - logger.debug(f" retry_count: {task_def.retry_count}") - logger.debug(f" retry_logic: {task_def.retry_logic}") - logger.debug(f" timeout_policy: {task_def.timeout_policy}") - logger.debug(f" Full to_dict(): {task_def.to_dict()}") - - if overwrite: - # Use update_task_def to overwrite existing definitions - logger.debug(f" Using update_task_def (overwrite=True)") - metadata_client.update_task_def(task_def=task_def) - else: - # Check if task exists, only create if it doesn't - logger.debug(f" Checking if task exists before creating (overwrite=False)") - try: - existing = metadata_client.get_task_def(task_name) - if existing: - logger.info(f"โœ“ Task definition '{task_name}' already exists - skipping (overwrite=False)") - logger.info(f" View at: {self.configuration.ui_host}/taskDef/{task_name}") - return - except Exception: - # Task doesn't exist, proceed to register - pass - metadata_client.register_task_def(task_def=task_def) - - # Print success message with link - task_def_url = f"{self.configuration.ui_host}/taskDef/{task_name}" - logger.info(f"โœ“ Registered/Updated task definition: {task_name} with {task_def.to_dict()}") - logger.info(f" View at: {task_def_url}") - - if input_schema_name or output_schema_name: - schema_count = sum([1 for s in [input_schema_name, output_schema_name] if s]) - logger.info(f" With {schema_count} JSON schema(s): {', '.join(filter(None, [input_schema_name, output_schema_name]))}") - except Exception as e: - # If update fails 
(task doesn't exist), try register - try: - metadata_client.register_task_def(task_def=task_def) + overwrite = getattr(self.worker, 'overwrite_task_def', True) - task_def_url = f"{self.configuration.ui_host}/taskDef/{task_name}" - logger.info(f"โœ“ Registered task definition: {task_name}") - logger.info(f" View at: {task_def_url}") + # Register to each server + for server_idx, config in enumerate(self.configurations): + server_label = f"server {server_idx + 1}/{len(self.configurations)}" + try: + logger.info(f"Registering to {server_label}: {config.host}") + metadata_client = OrkesMetadataClient(config) - if input_schema_name or output_schema_name: - schema_count = sum([1 for s in [input_schema_name, output_schema_name] if s]) - logger.info(f" With {schema_count} JSON schema(s): {', '.join(filter(None, [input_schema_name, output_schema_name]))}") + # Register schemas if available + input_schema_registered = False + output_schema_registered = False + if schemas and (schemas.get('input') or schemas.get('output')): + try: + schema_client = OrkesSchemaClient(config) + if schemas.get('input'): + input_schema_def = SchemaDef( + name=input_schema_name, + version=1, + type=SchemaType.JSON, + data=schemas['input'] + ) + schema_client.register_schema(input_schema_def) + input_schema_registered = True + logger.debug(f" โœ“ Registered input schema on {server_label}") + if schemas.get('output'): + output_schema_def = SchemaDef( + name=output_schema_name, + version=1, + type=SchemaType.JSON, + data=schemas['output'] + ) + schema_client.register_schema(output_schema_def) + output_schema_registered = True + logger.debug(f" โœ“ Registered output schema on {server_label}") + except Exception as e: + logger.warning(f"โš  Could not register schemas on {server_label}: {e}") - except Exception as register_error: - logger.warning(f"โš  Could not register/update task definition '{task_name}': {register_error}") + # Register task definition + try: + # For single server, modify usage of 
task_def directly to preserve object identity (helps tests) + # For multi-server, clone to ensure isolation + if len(self.configurations) == 1: + server_task_def = task_def + else: + import copy + server_task_def = copy.deepcopy(task_def) + + # Link schemas ONLY if successfully registered on THIS server + if input_schema_registered and input_schema_name: + server_task_def.input_schema = {"name": input_schema_name, "version": 1} + if output_schema_registered and output_schema_name: + server_task_def.output_schema = {"name": output_schema_name, "version": 1} + + if overwrite: + metadata_client.update_task_def(task_def=server_task_def) + else: + try: + existing = metadata_client.get_task_def(task_name) + if existing: + logger.info(f" โœ“ Task already exists on {server_label} - skipping (overwrite=False)") + continue + except Exception: + pass + metadata_client.register_task_def(task_def=server_task_def) + + logger.info(f" โœ“ Registered task definition on {server_label}") + + except Exception as e: + # Try register if update fails + try: + metadata_client.register_task_def(task_def=task_def) + logger.info(f" โœ“ Registered task definition on {server_label}") + except Exception as register_error: + logger.warning(f"โš  Could not register task on {server_label}: {register_error}") - except Exception as e: - # Don't crash worker if registration fails - just log warning - logger.warning(f"Failed to register task definition for {task_name}: {e}") + except Exception as e: + logger.warning(f"Failed to register task definition on {server_label}: {e}") def run_once(self) -> None: try: @@ -519,108 +467,177 @@ def __execute_and_update_task(self, task: Task) -> None: ) def __batch_poll_tasks(self, count: int) -> list: - """Poll for multiple tasks at once (more efficient than polling one at a time)""" + """Poll for multiple tasks from all servers in parallel (multi-homed support).""" task_definition_name = self.worker.get_task_definition_name() if self.worker.paused: 
logger.debug("Stop polling task for: %s", task_definition_name) return [] - # Apply exponential backoff if we have recent auth failures - if self._auth_failures > 0: - now = time.time() - backoff_seconds = min(2 ** self._auth_failures, 60) - time_since_last_failure = now - self._last_auth_failure - if time_since_last_failure < backoff_seconds: - time.sleep(0.1) - return [] + if not self.task_clients: + return [] - # Publish PollStarted event (metrics collector will handle via event) + # Divide capacity across servers evenly + # e.g., count=10, clients=3 -> [4, 3, 3] + total_servers = len(self.task_clients) + base_count = count // total_servers + remainder = count % total_servers + + # Publish PollStarted event self.event_dispatcher.publish(PollStarted( task_type=task_definition_name, worker_id=self.worker.get_identity(), poll_count=count )) - try: - start_time = time.time() - domain = self.worker.get_domain() - params = { - "workerid": self.worker.get_identity(), - "count": count, - "timeout": 100 # ms - } - # Only add domain if it's not None and not empty string - if domain is not None and domain != "": - params["domain"] = domain + start_time = time.time() + all_tasks = [] + domain = self.worker.get_domain() - tasks = self.task_client.batch_poll(tasktype=task_definition_name, **params) - - finish_time = time.time() - time_spent = finish_time - start_time + def poll_single_server(server_idx: int, task_client: TaskResourceApi) -> tuple: + """Poll a single server and return (server_idx, tasks, error).""" + now = time.time() + + # Calculate specific count for this server + # Distribute remainder to first N servers + server_count = base_count + (1 if server_idx < remainder else 0) + + # Skip this server if it was allotted zero capacity + if server_count <= 0: + return (server_idx, [], None) + + # Circuit breaker: skip if circuit is open + if self._circuit_open_until[server_idx] > now: + # Circuit is open, skip this server + 
return (server_idx, [], None) + + # Check per-server auth backoff + if self._auth_failures[server_idx] > 0: + backoff_seconds = min(2 ** self._auth_failures[server_idx], 60) + time_since_failure = now - self._last_auth_failure[server_idx] + if time_since_failure < backoff_seconds: + return (server_idx, [], None) - # Publish PollCompleted event (metrics collector will handle via event) - self.event_dispatcher.publish(PollCompleted( - task_type=task_definition_name, - duration_ms=time_spent * 1000, - tasks_received=len(tasks) if tasks else 0 - )) + try: + params = { + "workerid": self.worker.get_identity(), + "count": server_count, + "timeout": 100 # ms + } + if domain is not None and domain != "": + params["domain"] = domain + + tasks = task_client.batch_poll(tasktype=task_definition_name, **params) + + # Reset failures on success + if tasks: + self._auth_failures[server_idx] = 0 + self._server_failures[server_idx] = 0 + + return (server_idx, tasks or [], None) + + except AuthorizationException as auth_exception: + self._auth_failures[server_idx] += 1 + self._last_auth_failure[server_idx] = time.time() + backoff = min(2 ** self._auth_failures[server_idx], 60) + logger.error( + f"Auth failure polling server {server_idx} for {task_definition_name}: " + f"{auth_exception.error_code} (backoff: {backoff}s)" + ) + return (server_idx, [], auth_exception) - # Success - reset auth failure counter - if tasks: - self._auth_failures = 0 + except Exception as e: + # Increment failure count for circuit breaker + self._server_failures[server_idx] += 1 + if self._server_failures[server_idx] >= self._CIRCUIT_FAILURE_THRESHOLD: + self._circuit_open_until[server_idx] = time.time() + self._CIRCUIT_RESET_SECONDS + logger.warning( + f"Circuit breaker OPEN for server {server_idx} after {self._server_failures[server_idx]} failures. 
" + f"Will retry in {self._CIRCUIT_RESET_SECONDS}s" + ) + else: + logger.error( + f"Failed to poll server {server_idx} for {task_definition_name}: {e} " + f"(failure {self._server_failures[server_idx]}/{self._CIRCUIT_FAILURE_THRESHOLD})" + ) + return (server_idx, [], e) + + # Single server: poll directly without executor overhead + if len(self.task_clients) == 1: + server_idx, tasks, error = poll_single_server(0, self.task_clients[0]) + for task in tasks: + if task and task.task_id: + with self._task_server_map_lock: + self._task_server_map[task.task_id] = server_idx + all_tasks.append(task) + else: + # Multi-homed: poll all servers in parallel with timeout + futures = { + self._poll_executor.submit(poll_single_server, idx, client): idx + for idx, client in enumerate(self.task_clients) + } - return tasks if tasks else [] + try: + # Use timeout to avoid slow server blocking all polling + for future in as_completed(futures, timeout=self._POLL_TIMEOUT_SECONDS): + try: + server_idx, tasks, error = future.result(timeout=0) + + if error: + self.event_dispatcher.publish(PollFailure( + task_type=task_definition_name, + cause=error, + duration_ms=int((time.time() - start_time) * 1000) + )) + + for task in tasks: + if task and task.task_id: + # Track which server this task came from (thread-safe) + with self._task_server_map_lock: + self._task_server_map[task.task_id] = server_idx + all_tasks.append(task) + except Exception as e: + logger.debug(f"Error getting poll result: {e}") + except TimeoutError: + # Some servers didn't respond in time - continue with tasks we have + logger.debug( + f"Poll timeout after {self._POLL_TIMEOUT_SECONDS}s - some servers did not respond" + ) - except AuthorizationException as auth_exception: - self._auth_failures += 1 - self._last_auth_failure = time.time() - backoff_seconds = min(2 ** self._auth_failures, 60) + finish_time = time.time() + time_spent = finish_time - start_time - # Publish PollFailure event (metrics collector will handle via 
event) - self.event_dispatcher.publish(PollFailure( - task_type=task_definition_name, - duration_ms=(time.time() - start_time) * 1000, - cause=auth_exception - )) + # Publish PollCompleted event + self.event_dispatcher.publish(PollCompleted( + task_type=task_definition_name, + duration_ms=time_spent * 1000, + tasks_received=len(all_tasks) + )) - if auth_exception.invalid_token: - logger.error( - f"Failed to batch poll task {task_definition_name} due to invalid auth token " - f"(failure #{self._auth_failures}). Will retry with exponential backoff ({backoff_seconds}s). " - "Please check your CONDUCTOR_AUTH_KEY and CONDUCTOR_AUTH_SECRET." - ) - else: - logger.error( - f"Failed to batch poll task {task_definition_name} error: {auth_exception.status} - {auth_exception.error_code} " - f"(failure #{self._auth_failures}). Will retry with exponential backoff ({backoff_seconds}s)." - ) - return [] - except Exception as e: - # Publish PollFailure event (metrics collector will handle via event) - self.event_dispatcher.publish(PollFailure( - task_type=task_definition_name, - duration_ms=(time.time() - start_time) * 1000, - cause=e - )) - logger.error( - "Failed to batch poll task for: %s, reason: %s", - task_definition_name, - traceback.format_exc() + if len(self.task_clients) > 1 and all_tasks: + logger.debug( + f"Polled {len(all_tasks)} tasks from {len(self.task_clients)} servers for {task_definition_name}" ) - return [] + + return all_tasks def __poll_task(self) -> Task: + """Poll for a single task (single-server optimization path). + + This method is only called when len(self.configurations) == 1. + For multi-server mode, __batch_poll_tasks is used instead. + All list indices use [0] because there's only one server. 
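The per-server circuit breaker introduced in the batch-poll changes above opens after `_CIRCUIT_FAILURE_THRESHOLD` (3) consecutive failures and allows a half-open retry after `_CIRCUIT_RESET_SECONDS` (30 s). A standalone sketch of that state machine — the class name and the injectable `clock` parameter are illustrative, not part of the SDK:

```python
import time

class ServerCircuitBreaker:
    """Minimal per-server circuit breaker (sketch, not the SDK class)."""

    def __init__(self, n_servers, failure_threshold=3, reset_seconds=30, clock=time.time):
        self._failures = [0] * n_servers
        self._open_until = [0.0] * n_servers
        self._threshold = failure_threshold
        self._reset_seconds = reset_seconds
        self._clock = clock  # injectable so tests can fake the wall clock

    def allow(self, idx):
        # Polling is allowed unless the circuit for this server is open
        return self._clock() >= self._open_until[idx]

    def record_failure(self, idx):
        self._failures[idx] += 1
        if self._failures[idx] >= self._threshold:
            # Open the circuit: skip this server until the reset window elapses
            self._open_until[idx] = self._clock() + self._reset_seconds

    def record_success(self, idx):
        # Any success fully closes the circuit for this server
        self._failures[idx] = 0
        self._open_until[idx] = 0.0
```

Note that state is tracked per server index, so one unhealthy server never blocks polling of its peers.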
+ """ task_definition_name = self.worker.get_task_definition_name() if self.worker.paused: logger.debug("Stop polling task for: %s", task_definition_name) return None # Apply exponential backoff if we have recent auth failures - if self._auth_failures > 0: + if self._auth_failures[0] > 0: now = time.time() # Exponential backoff: 2^failures seconds (2s, 4s, 8s, 16s, 32s) - backoff_seconds = min(2 ** self._auth_failures, 60) # Cap at 60s - time_since_last_failure = now - self._last_auth_failure + backoff_seconds = min(2 ** self._auth_failures[0], 60) # Cap at 60s + time_since_last_failure = now - self._last_auth_failure[0] if time_since_last_failure < backoff_seconds: # Still in backoff period - skip polling @@ -639,16 +656,24 @@ def __poll_task(self) -> Task: # Only add domain if it's not None and not empty string if domain is not None and domain != "": params["domain"] = domain - task = self.task_client.poll(tasktype=task_definition_name, **params) + + # Use the first client (single-server optimization) + task = self.task_clients[0].poll(tasktype=task_definition_name, **params) + finish_time = time.time() time_spent = finish_time - start_time if self.metrics_collector is not None: self.metrics_collector.record_task_poll_time(task_definition_name, time_spent) + + # Reset failure count on success + if self._auth_failures[0] > 0: + self._auth_failures[0] = 0 + except AuthorizationException as auth_exception: # Track auth failure for backoff - self._auth_failures += 1 - self._last_auth_failure = time.time() - backoff_seconds = min(2 ** self._auth_failures, 60) + self._auth_failures[0] += 1 + self._last_auth_failure[0] = time.time() + backoff_seconds = min(2 ** self._auth_failures[0], 60) if self.metrics_collector is not None: self.metrics_collector.increment_task_poll_error(task_definition_name, type(auth_exception)) @@ -656,7 +681,7 @@ def __poll_task(self) -> Task: if auth_exception.invalid_token: logger.error( f"Failed to poll task {task_definition_name} due to 
invalid auth token " - f"(failure #{self._auth_failures}). Will retry with exponential backoff ({backoff_seconds}s). " + f"(failure #{self._auth_failures[0]}). Will retry with exponential backoff ({backoff_seconds}s). " "Please check your CONDUCTOR_AUTH_KEY and CONDUCTOR_AUTH_SECRET." ) else: @@ -677,7 +702,7 @@ def __poll_task(self) -> Task: # Success - reset auth failure counter if task is not None: - self._auth_failures = 0 + self._auth_failures[0] = 0 logger.trace( "Polled task: %s, worker_id: %s, domain: %s", task_definition_name, @@ -686,7 +711,7 @@ def __poll_task(self) -> Task: ) else: # No task available - also reset auth failures since poll succeeded - self._auth_failures = 0 + self._auth_failures[0] = 0 return task @@ -897,13 +922,19 @@ def __update_task(self, task_result: TaskResult): if not isinstance(task_result, TaskResult): return None task_definition_name = self.worker.get_task_definition_name() + + # Get the correct server for this task (thread-safe, multi-homed support) + with self._task_server_map_lock: + server_idx = self._task_server_map.pop(task_result.task_id, 0) + task_client = self.task_clients[server_idx] + logger.debug( - "Updating task, id: %s, workflow_instance_id: %s, task_definition_name: %s, status: %s, output_data: %s", + "Updating task, id: %s, workflow_instance_id: %s, task_definition_name: %s, status: %s, server: %d", task_result.task_id, task_result.workflow_instance_id, task_definition_name, task_result.status, - task_result.output_data + server_idx ) last_exception = None @@ -914,7 +945,7 @@ def __update_task(self, task_result: TaskResult): # Exponential backoff: [10s, 20s, 30s] before retry time.sleep(attempt * 10) try: - response = self.task_client.update_task(body=task_result) + response = task_client.update_task(body=task_result) logger.debug( "Updated task, id: %s, workflow_instance_id: %s, task_definition_name: %s, response: %s", task_result.task_id, @@ -930,12 +961,13 @@ def __update_task(self, task_result: 
TaskResult): task_definition_name, type(e) ) logger.error( - "Failed to update task (attempt %d/%d), id: %s, workflow_instance_id: %s, task_definition_name: %s, reason: %s", + "Failed to update task (attempt %d/%d), id: %s, workflow_instance_id: %s, task_definition_name: %s, server: %d, reason: %s", attempt + 1, retry_count, task_result.task_id, task_result.workflow_instance_id, task_definition_name, + server_idx, traceback.format_exc() ) diff --git a/src/conductor/client/configuration/configuration.py b/src/conductor/client/configuration/configuration.py index 157e7607..a9e8d4ad 100644 --- a/src/conductor/client/configuration/configuration.py +++ b/src/conductor/client/configuration/configuration.py @@ -180,3 +180,70 @@ def get_logging_formatted_name(name): def update_token(self, token: str) -> None: self.AUTH_TOKEN = token self.token_update_time = round(time.time() * 1000) + + @classmethod + def from_env_multi(cls) -> list: + """ + Create list of Configuration objects from comma-separated environment variables. + + Supports multi-homed workers by parsing: + - CONDUCTOR_SERVER_URL: comma-separated list of server URLs + - CONDUCTOR_AUTH_KEY: comma-separated list of auth keys (must match server count) + - CONDUCTOR_AUTH_SECRET: comma-separated list of auth secrets (must match server count) + + Returns: + List of Configuration objects, one per server. + Returns [Configuration()] if no env vars are set. + + Raises: + ValueError: If auth key/secret counts don't match server count. 
+ + Example: + CONDUCTOR_SERVER_URL=https://east.example.com/api,https://west.example.com/api + CONDUCTOR_AUTH_KEY=key1,key2 + CONDUCTOR_AUTH_SECRET=secret1,secret2 + """ + servers_raw = os.getenv("CONDUCTOR_SERVER_URL", "") + keys_raw = os.getenv("CONDUCTOR_AUTH_KEY", "") + secrets_raw = os.getenv("CONDUCTOR_AUTH_SECRET", "") + + # Parse and clean comma-separated values + servers = [s.strip() for s in servers_raw.split(",") if s.strip()] + keys = [k.strip() for k in keys_raw.split(",") if k.strip()] + secrets = [s.strip() for s in secrets_raw.split(",") if s.strip()] + + # If no servers configured, return default + if not servers: + return [cls()] + + # Validate auth credentials count matches server count + if keys and len(keys) != len(servers): + raise ValueError( + f"CONDUCTOR_AUTH_KEY count ({len(keys)}) must match " + f"CONDUCTOR_SERVER_URL count ({len(servers)}). " + f"Got servers: {servers}, keys count: {len(keys)}" + ) + + if secrets and len(secrets) != len(servers): + raise ValueError( + f"CONDUCTOR_AUTH_SECRET count ({len(secrets)}) must match " + f"CONDUCTOR_SERVER_URL count ({len(servers)}). 
" + f"Got servers: {servers}, secrets count: {len(secrets)}" + ) + + # Validate keys and secrets are paired + if (keys and not secrets) or (secrets and not keys): + raise ValueError( + "Both CONDUCTOR_AUTH_KEY and CONDUCTOR_AUTH_SECRET must be provided together" + ) + + # Create configuration for each server + configs = [] + for i, server in enumerate(servers): + auth = None + if keys and secrets: + auth = AuthenticationSettings(key_id=keys[i], key_secret=secrets[i]) + configs.append(cls(server_api_url=server, authentication_settings=auth)) + + return configs + diff --git a/tests/unit/automator/test_async_task_runner.py b/tests/unit/automator/test_async_task_runner.py index be7f3035..c0d532ad 100644 --- a/tests/unit/automator/test_async_task_runner.py +++ b/tests/unit/automator/test_async_task_runner.py @@ -91,6 +91,7 @@ async def run_test(): # Initialize runner (creates clients in event loop) runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(5) # Mock batch_poll to return one task @@ -149,6 +150,7 @@ async def async_worker_returns_none(message: str) -> None: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task]) @@ -195,6 +197,7 @@ def on_poll_failure(self, event): async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) # Mock batch_poll to raise a generic exception @@ -244,6 +247,7 @@ async def slow_async_worker(task_id: str) -> dict: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] 
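The `from_env_multi` parsing rules above (comma-separated values, counts must match, key/secret must be paired) can be exercised without the SDK classes. Below is a standalone sketch; the helper name `parse_multi_env` and its tuple return value are assumptions for illustration — the real classmethod returns `Configuration` objects:

```python
import os

def parse_multi_env(env=os.environ):
    """Sketch of the comma-separated parsing rules used by from_env_multi."""
    split = lambda raw: [v.strip() for v in raw.split(",") if v.strip()]
    servers = split(env.get("CONDUCTOR_SERVER_URL", ""))
    keys = split(env.get("CONDUCTOR_AUTH_KEY", ""))
    secrets = split(env.get("CONDUCTOR_AUTH_SECRET", ""))

    if not servers:
        # No servers configured: single default configuration (placeholder tuple here)
        return [(None, None, None)]
    if keys and len(keys) != len(servers):
        raise ValueError("CONDUCTOR_AUTH_KEY count must match server count")
    if secrets and len(secrets) != len(servers):
        raise ValueError("CONDUCTOR_AUTH_SECRET count must match server count")
    if bool(keys) != bool(secrets):
        raise ValueError("AUTH_KEY and AUTH_SECRET must be provided together")
    if not keys:
        return [(s, None, None) for s in servers]
    return list(zip(servers, keys, secrets))
```

Passing a plain dict as `env` keeps the sketch testable without touching the process environment.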
runner._semaphore = asyncio.Semaphore(2) # Max 2 concurrent # Return 2 tasks on first poll, 2 on second poll @@ -297,6 +301,7 @@ async def simple_worker(value: int) -> dict: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) # Mock batch_poll to return empty (no tasks) @@ -344,6 +349,7 @@ def on_poll_failure(self, event): async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) # Mock batch_poll to raise exception @@ -394,6 +400,7 @@ def on_task_execution_failure(self, event): async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task]) @@ -435,6 +442,7 @@ async def slow_worker(value: int) -> dict: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(2) # Mock batch_poll to return only the number of tasks requested (count param) @@ -486,6 +494,7 @@ async def simple_worker(value: int) -> dict: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(1) runner.async_task_client.batch_poll = AsyncMock(return_value=[self.__create_task()]) @@ -529,6 +538,7 @@ async def concurrent_worker(task_num: int) -> dict: async def run_test(): runner.async_api_client = AsyncMock() runner.async_task_client = AsyncMock() + runner.async_task_clients = [runner.async_task_client] runner._semaphore = asyncio.Semaphore(3) 
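The even capacity split used by `__batch_poll_tasks` (e.g. `count=10` across 3 servers → `[4, 3, 3]`) reduces to integer division plus handing the remainder to the first few servers. A minimal sketch — the helper name is illustrative, not SDK API:

```python
def split_poll_capacity(count, n_servers):
    """Distribute `count` poll slots across servers, remainder to the first few."""
    base, remainder = divmod(count, n_servers)
    return [base + (1 if i < remainder else 0) for i in range(n_servers)]
```

The sum always equals `count`, so total polling capacity is preserved regardless of server count.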
             runner.async_task_client.batch_poll = AsyncMock(return_value=mock_tasks)
@@ -574,6 +584,7 @@ async def worker_with_complex_output(data: dict) -> dict:
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
@@ -666,6 +677,7 @@ def on_task_execution_failure(self, event):
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             # Successful execution scenario
@@ -753,6 +765,7 @@ def on_task_execution_completed(self, event):
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
@@ -808,6 +821,7 @@ def on_poll_completed(self, event):
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
@@ -863,6 +877,7 @@ def on_task_execution_completed(self, event):
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
@@ -919,6 +934,7 @@ def on_task_execution_completed(self, event):
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
@@ -994,6 +1010,7 @@ async def simple_worker(value: int) -> dict:
         async def run_test():
             runner.async_api_client = AsyncMock()
             runner.async_task_client = AsyncMock()
+            runner.async_task_clients = [runner.async_task_client]
             runner._semaphore = asyncio.Semaphore(1)
 
             runner.async_task_client.batch_poll = AsyncMock(return_value=[mock_task])
diff --git a/tests/unit/automator/test_concurrency_bugs.py b/tests/unit/automator/test_concurrency_bugs.py
index 7d13b2d8..a14a1076 100644
--- a/tests/unit/automator/test_concurrency_bugs.py
+++ b/tests/unit/automator/test_concurrency_bugs.py
@@ -197,8 +197,8 @@ def test_threadpool_executor_resource_leak(self):
         runner = TaskRunner(worker=worker, configuration=self.config)
 
         # Simulate some work
-        with patch.object(runner, 'task_client'):
-            runner.task_client.batch_poll = Mock(return_value=[])
+        with patch.object(runner, 'task_clients', new=[Mock()]) as mock_clients:
+            mock_clients[0].batch_poll = Mock(return_value=[])
             # Note: No cleanup/shutdown called
 
         # Force garbage collection
diff --git a/tests/unit/automator/test_multi_homed.py b/tests/unit/automator/test_multi_homed.py
new file mode 100644
index 00000000..1210ecad
--- /dev/null
+++ b/tests/unit/automator/test_multi_homed.py
@@ -0,0 +1,474 @@
+"""
+Unit tests for multi-homed workers functionality.
+
+Tests cover:
+1. Configuration.from_env_multi() - comma-separated env var parsing
+2. Thread-safe task_server_map operations
+3. TaskRunner/AsyncTaskRunner multi-config initialization
+4. Credential validation
+5. Backward compatibility
+"""
+
+import os
+import pytest
+import threading
+import time
+from unittest.mock import patch, MagicMock
+
+from conductor.client.configuration.configuration import Configuration
+from conductor.client.configuration.settings.authentication_settings import AuthenticationSettings
+
+
+class TestConfigurationFromEnvMulti:
+    """Tests for Configuration.from_env_multi() factory method."""
+
+    def test_no_env_vars_returns_default(self):
+        """When no env vars set, returns list with default configuration."""
+        with patch.dict(os.environ, {}, clear=True):
+            # Clear any existing conductor env vars
+            for key in ['CONDUCTOR_SERVER_URL', 'CONDUCTOR_AUTH_KEY', 'CONDUCTOR_AUTH_SECRET']:
+                os.environ.pop(key, None)
+
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 1
+            assert isinstance(configs[0], Configuration)
+
+    def test_single_server_no_auth(self):
+        """Single server URL without auth credentials."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api'
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 1
+            assert configs[0].host == 'https://server1.example.com/api'
+            assert configs[0].authentication_settings is None
+
+    def test_single_server_with_auth(self):
+        """Single server with auth credentials."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api',
+            'CONDUCTOR_AUTH_KEY': 'key1',
+            'CONDUCTOR_AUTH_SECRET': 'secret1'
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 1
+            assert configs[0].host == 'https://server1.example.com/api'
+            assert configs[0].authentication_settings is not None
+            assert configs[0].authentication_settings.key_id == 'key1'
+            assert configs[0].authentication_settings.key_secret == 'secret1'
+
+    def test_multiple_servers_no_auth(self):
+        """Multiple servers without auth credentials."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://east.example.com/api,https://west.example.com/api'
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 2
+            assert configs[0].host == 'https://east.example.com/api'
+            assert configs[1].host == 'https://west.example.com/api'
+            assert configs[0].authentication_settings is None
+            assert configs[1].authentication_settings is None
+
+    def test_multiple_servers_with_auth(self):
+        """Multiple servers with matching auth credentials."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://east.example.com/api,https://west.example.com/api',
+            'CONDUCTOR_AUTH_KEY': 'key1,key2',
+            'CONDUCTOR_AUTH_SECRET': 'secret1,secret2'
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 2
+            assert configs[0].host == 'https://east.example.com/api'
+            assert configs[0].authentication_settings.key_id == 'key1'
+            assert configs[0].authentication_settings.key_secret == 'secret1'
+            assert configs[1].host == 'https://west.example.com/api'
+            assert configs[1].authentication_settings.key_id == 'key2'
+            assert configs[1].authentication_settings.key_secret == 'secret2'
+
+    def test_whitespace_handling(self):
+        """Whitespace around values is trimmed."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': ' https://east.example.com/api , https://west.example.com/api ',
+            'CONDUCTOR_AUTH_KEY': ' key1 , key2 ',
+            'CONDUCTOR_AUTH_SECRET': ' secret1 , secret2 '
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 2
+            assert configs[0].host == 'https://east.example.com/api'
+            assert configs[1].host == 'https://west.example.com/api'
+            assert configs[0].authentication_settings.key_id == 'key1'
+            assert configs[1].authentication_settings.key_id == 'key2'
+
+    def test_mismatched_key_count_raises(self):
+        """Mismatched key count raises ValueError."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://east.example.com/api,https://west.example.com/api',
+            'CONDUCTOR_AUTH_KEY': 'key1',  # Only one key for two servers
+            'CONDUCTOR_AUTH_SECRET': 'secret1,secret2'
+        }, clear=True):
+            with pytest.raises(ValueError) as exc_info:
+                Configuration.from_env_multi()
+
+            assert "CONDUCTOR_AUTH_KEY count (1)" in str(exc_info.value)
+            assert "CONDUCTOR_SERVER_URL count (2)" in str(exc_info.value)
+
+    def test_mismatched_secret_count_raises(self):
+        """Mismatched secret count raises ValueError."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://east.example.com/api,https://west.example.com/api',
+            'CONDUCTOR_AUTH_KEY': 'key1,key2',
+            'CONDUCTOR_AUTH_SECRET': 'secret1'  # Only one secret for two servers
+        }, clear=True):
+            with pytest.raises(ValueError) as exc_info:
+                Configuration.from_env_multi()
+
+            assert "CONDUCTOR_AUTH_SECRET count (1)" in str(exc_info.value)
+
+    def test_key_without_secret_raises(self):
+        """Key without secret raises ValueError."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api',
+            'CONDUCTOR_AUTH_KEY': 'key1'
+            # No CONDUCTOR_AUTH_SECRET
+        }, clear=True):
+            with pytest.raises(ValueError) as exc_info:
+                Configuration.from_env_multi()
+
+            assert "must be provided together" in str(exc_info.value)
+
+    def test_secret_without_key_raises(self):
+        """Secret without key raises ValueError."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api',
+            'CONDUCTOR_AUTH_SECRET': 'secret1'
+            # No CONDUCTOR_AUTH_KEY
+        }, clear=True):
+            with pytest.raises(ValueError) as exc_info:
+                Configuration.from_env_multi()
+
+            assert "must be provided together" in str(exc_info.value)
+
+    def test_empty_values_filtered(self):
+        """Empty values in comma-separated list are filtered out."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api,,https://server2.example.com/api,'
+        }, clear=True):
+            configs = Configuration.from_env_multi()
+
+            assert len(configs) == 2
+            assert configs[0].host == 'https://server1.example.com/api'
+            assert configs[1].host == 'https://server2.example.com/api'
+
+
+class TestTaskServerMapThreadSafety:
+    """Tests for thread-safe task_server_map operations."""
+
+    def test_concurrent_writes_and_reads(self):
+        """Simulate concurrent map access from multiple threads."""
+        from conductor.client.automator.task_runner import TaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        # Create a mock worker
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 4
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        config = Configuration(server_api_url='http://localhost:8080/api')
+        runner = TaskRunner(worker, configuration=config)
+
+        # Verify lock exists
+        assert hasattr(runner, '_task_server_map_lock')
+        assert isinstance(runner._task_server_map_lock, type(threading.Lock()))
+
+        errors = []
+
+        def writer_thread(thread_id):
+            """Simulate poll thread writing to map."""
+            try:
+                for i in range(100):
+                    task_id = f"task-{thread_id}-{i}"
+                    with runner._task_server_map_lock:
+                        runner._task_server_map[task_id] = thread_id % 2
+                    time.sleep(0.0001)
+            except Exception as e:
+                errors.append(e)
+
+        def reader_thread(thread_id):
+            """Simulate update thread reading from map."""
+            try:
+                for i in range(100):
+                    task_id = f"task-{thread_id}-{i}"
+                    with runner._task_server_map_lock:
+                        runner._task_server_map.pop(task_id, 0)
+                    time.sleep(0.0001)
+            except Exception as e:
+                errors.append(e)
+
+        # Start threads
+        threads = []
+        for i in range(4):
+            t1 = threading.Thread(target=writer_thread, args=(i,))
+            t2 = threading.Thread(target=reader_thread, args=(i,))
+            threads.extend([t1, t2])
+            t1.start()
+            t2.start()
+
+        # Wait for completion
+        for t in threads:
+            t.join(timeout=5)
+
+        # No errors should have occurred
+        assert len(errors) == 0, f"Thread errors: {errors}"
+
+
+class TestMultiHomedRunnerInitialization:
+    """Tests for TaskRunner and AsyncTaskRunner multi-config initialization."""
+
+    def test_task_runner_single_config(self):
+        """TaskRunner with single configuration (backward compatible)."""
+        from conductor.client.automator.task_runner import TaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 1
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        config = Configuration(server_api_url='http://localhost:8080/api')
+        runner = TaskRunner(worker, configuration=config)
+
+        assert len(runner.configurations) == 1
+        assert len(runner.task_clients) == 1
+        assert len(runner._auth_failures) == 1
+
+    def test_task_runner_multiple_configs(self):
+        """TaskRunner with multiple configurations."""
+        from conductor.client.automator.task_runner import TaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 2
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        configs = [
+            Configuration(server_api_url='http://east:8080/api'),
+            Configuration(server_api_url='http://west:8080/api')
+        ]
+        runner = TaskRunner(worker, configuration=configs)
+
+        assert len(runner.configurations) == 2
+        assert len(runner.task_clients) == 2
+        assert len(runner._auth_failures) == 2
+        assert len(runner._last_auth_failure) == 2
+
+    def test_async_task_runner_multiple_configs(self):
+        """AsyncTaskRunner with multiple configurations."""
+        from conductor.client.automator.async_task_runner import AsyncTaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 2
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        configs = [
+            Configuration(server_api_url='http://east:8080/api'),
+            Configuration(server_api_url='http://west:8080/api')
+        ]
+        runner = AsyncTaskRunner(worker, configuration=configs)
+
+        assert len(runner.configurations) == 2
+        assert len(runner._auth_failures) == 2
+        # async_task_clients created in run(), not __init__
+        assert runner.async_task_clients == []
+
+
+class TestCircuitBreaker:
+    """Tests for circuit breaker functionality in multi-homed mode."""
+
+    def test_task_runner_circuit_breaker_initialized(self):
+        """TaskRunner initializes circuit breaker state."""
+        from conductor.client.automator.task_runner import TaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 2
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        configs = [
+            Configuration(server_api_url='http://east:8080/api'),
+            Configuration(server_api_url='http://west:8080/api')
+        ]
+        runner = TaskRunner(worker, configuration=configs)
+
+        # Circuit breaker state initialized
+        assert hasattr(runner, '_server_failures')
+        assert hasattr(runner, '_circuit_open_until')
+        assert len(runner._server_failures) == 2
+        assert len(runner._circuit_open_until) == 2
+        assert all(f == 0 for f in runner._server_failures)
+        assert all(t == 0.0 for t in runner._circuit_open_until)
+
+        # Constants defined
+        assert runner._CIRCUIT_FAILURE_THRESHOLD == 3
+        assert runner._CIRCUIT_RESET_SECONDS == 30
+        assert runner._POLL_TIMEOUT_SECONDS == 5
+
+    def test_task_runner_poll_executor_for_multi_homed(self):
+        """TaskRunner creates poll executor only for multi-homed mode."""
+        from conductor.client.automator.task_runner import TaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 2
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        # Single server - no poll executor
+        single_config = Configuration(server_api_url='http://localhost:8080/api')
+        runner_single = TaskRunner(worker, configuration=single_config)
+        assert runner_single._poll_executor is None
+
+        # Multi-homed - poll executor created
+        multi_configs = [
+            Configuration(server_api_url='http://east:8080/api'),
+            Configuration(server_api_url='http://west:8080/api')
+        ]
+        runner_multi = TaskRunner(worker, configuration=multi_configs)
+        assert runner_multi._poll_executor is not None
+
+    def test_async_task_runner_circuit_breaker_initialized(self):
+        """AsyncTaskRunner initializes circuit breaker state."""
+        from conductor.client.automator.async_task_runner import AsyncTaskRunner
+        from conductor.client.worker.worker_interface import WorkerInterface
+
+        worker = MagicMock(spec=WorkerInterface)
+        worker.get_task_definition_name.return_value = 'test_task'
+        worker.task_definition_names = ['test_task']
+        worker.thread_count = 2
+        worker.poll_interval = 1
+        worker.domain = None
+        worker.worker_id = 'test-worker'
+        worker.register_task_def = False
+        worker.poll_timeout = 100
+        worker.lease_extend_enabled = False
+        worker.paused = False
+        worker.overwrite_task_def = True
+        worker.strict_schema = False
+
+        configs = [
+            Configuration(server_api_url='http://east:8080/api'),
+            Configuration(server_api_url='http://west:8080/api')
+        ]
+        runner = AsyncTaskRunner(worker, configuration=configs)
+
+        # Circuit breaker state initialized
+        assert hasattr(runner, '_server_failures')
+        assert hasattr(runner, '_circuit_open_until')
+        assert len(runner._server_failures) == 2
+        assert len(runner._circuit_open_until) == 2
+
+        # Constants defined
+        assert runner._CIRCUIT_FAILURE_THRESHOLD == 3
+        assert runner._CIRCUIT_RESET_SECONDS == 30
+        assert runner._POLL_TIMEOUT_SECONDS == 5
+
+
+class TestBackwardCompatibility:
+    """Tests to ensure backward compatibility with existing code."""
+
+    def test_task_handler_single_config_kwarg(self):
+        """TaskHandler accepts single config as keyword arg."""
+        from conductor.client.automator.task_handler import TaskHandler
+
+        config = Configuration(server_api_url='http://localhost:8080/api')
+
+        # Should not raise
+        handler = TaskHandler(workers=[], configuration=config, scan_for_annotated_workers=False)
+        assert len(handler.configurations) == 1
+
+    def test_task_handler_no_config_uses_env(self):
+        """TaskHandler with no config uses from_env_multi()."""
+        from conductor.client.automator.task_handler import TaskHandler
+
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'https://server1.example.com/api,https://server2.example.com/api'
+        }, clear=True):
+            handler = TaskHandler(workers=[], scan_for_annotated_workers=False)
+            assert len(handler.configurations) == 2
+
+    def test_configuration_single_server_unchanged(self):
+        """Single server Configuration() behavior unchanged."""
+        config = Configuration(server_api_url='http://localhost:8080/api')
+        assert config.host == 'http://localhost:8080/api'
+
+    def test_configuration_env_var_single_unchanged(self):
+        """Single CONDUCTOR_SERVER_URL still works."""
+        with patch.dict(os.environ, {
+            'CONDUCTOR_SERVER_URL': 'http://myserver:8080/api'
+        }, clear=True):
+            config = Configuration()
+            assert config.host == 'http://myserver:8080/api'
diff --git a/tests/unit/automator/test_task_runner_coverage.py b/tests/unit/automator/test_task_runner_coverage.py
index 19b07261..0298a15e 100644
--- a/tests/unit/automator/test_task_runner_coverage.py
+++ b/tests/unit/automator/test_task_runner_coverage.py
@@ -154,7 +154,7 @@ def test_initialization_with_metrics_settings(self):
 
         self.assertIsNotNone(task_runner.metrics_collector)
         self.assertEqual(task_runner.worker, worker)
-        self.assertEqual(task_runner.configuration, config)
+        self.assertEqual(task_runner.configurations[0], config)
 
     def test_initialization_without_metrics_settings(self):
         """Test TaskRunner initialization without metrics"""
@@ -178,8 +178,8 @@ def test_initialization_creates_default_configuration(self):
             configuration=None
         )
 
-        self.assertIsNotNone(task_runner.configuration)
-        self.assertIsInstance(task_runner.configuration, Configuration)
+        self.assertIsNotNone(task_runner.configurations)
+        self.assertIsInstance(task_runner.configurations[0], Configuration)
 
     @patch.dict(os.environ, {
         'conductor_worker_test_task_polling_interval': 'invalid_value'
@@ -245,7 +245,7 @@ def test_run_without_configuration_sets_debug_logging(self):
         )
 
         # Set configuration to None to test the logging path
-        task_runner.configuration = None
+        task_runner.configurations = []
 
         # Mock run_once to exit after one iteration
         with patch.object(task_runner, 'run_once', side_effect=[None, Exception("Exit loop")]):
@@ -297,8 +297,8 @@ def test_poll_task_with_auth_failure_backoff(self, mock_sleep):
         task_runner = TaskRunner(worker=worker)
 
         # Simulate auth failure
-        task_runner._auth_failures = 2
-        task_runner._last_auth_failure = time.time()
+        task_runner._auth_failures[0] = 2
+        task_runner._last_auth_failure[0] = time.time()
 
         with patch.object(TaskResourceApi, 'poll', return_value=None):
             task = task_runner._TaskRunner__poll_task()
@@ -330,8 +330,8 @@ def test_poll_task_auth_failure_with_invalid_token(self, mock_sleep):
             task = task_runner._TaskRunner__poll_task()
 
         self.assertIsNone(task)
-        self.assertEqual(task_runner._auth_failures, 1)
-        self.assertGreater(task_runner._last_auth_failure, 0)
+        self.assertEqual(task_runner._auth_failures[0], 1)
+        self.assertGreater(task_runner._last_auth_failure[0], 0)
 
     @patch('time.sleep')
     def test_poll_task_auth_failure_without_invalid_token(self, mock_sleep):
@@ -356,7 +356,7 @@ def test_poll_task_auth_failure_without_invalid_token(self, mock_sleep):
             task = task_runner._TaskRunner__poll_task()
 
         self.assertIsNone(task)
-        self.assertEqual(task_runner._auth_failures, 1)
+        self.assertEqual(task_runner._auth_failures[0], 1)
 
     @patch('time.sleep')
     def test_poll_task_success_resets_auth_failures(self, mock_sleep):
@@ -365,8 +365,8 @@ def test_poll_task_success_resets_auth_failures(self, mock_sleep):
         task_runner = TaskRunner(worker=worker)
 
         # Set some auth failures in the past (so backoff has elapsed)
-        task_runner._auth_failures = 3
-        task_runner._last_auth_failure = time.time() - 100  # 100 seconds ago
+        task_runner._auth_failures[0] = 3
+        task_runner._last_auth_failure[0] = time.time() - 100  # 100 seconds ago
 
         test_task = Task(task_id='test_id', workflow_instance_id='wf_id')
 
@@ -374,7 +374,7 @@ def test_poll_task_success_resets_auth_failures(self, mock_sleep):
             task = task_runner._TaskRunner__poll_task()
 
         self.assertEqual(task, test_task)
-        self.assertEqual(task_runner._auth_failures, 0)
+        self.assertEqual(task_runner._auth_failures[0], 0)
 
     def test_poll_task_no_task_available_resets_auth_failures(self):
         """Test that None result from successful poll resets auth failures"""
@@ -382,13 +382,13 @@ def test_poll_task_no_task_available_resets_auth_failures(self):
         task_runner = TaskRunner(worker=worker)
 
         # Set some auth failures
-        task_runner._auth_failures = 2
+        task_runner._auth_failures[0] = 2
 
         with patch.object(TaskResourceApi, 'poll', return_value=None):
             task = task_runner._TaskRunner__poll_task()
 
         self.assertIsNone(task)
-        self.assertEqual(task_runner._auth_failures, 0)
+        self.assertEqual(task_runner._auth_failures[0], 0)
 
     def test_poll_task_with_metrics_collector(self):
         """Test polling with metrics collection enabled"""