fix: harden orchestrator startup and tool dispatch timeouts#353
Closed
l50 wants to merge 25 commits into
Closed
Conversation
Renovate skips forks by default. l50/ares is the production target for this workflow run, so opt in via RENOVATE_FORK_PROCESSING=enabled.
| datasource | package | from | to | | ----------- | ----------------------- | ------ | ------ | | github-tags | actions/upload-artifact | v7.0.0 | v7.0.1 |
**Key Changes:** - Added optional remote cracking mode that delegates hashcat jobs to an HTTP service when configured - Implemented authenticated job submission, polling, timeout handling, and potfile retrieval for remote jobs - Preserved local hashcat execution as the default path when remote service configuration is absent - Scoped remote execution to simple wordlist attacks so service-owned GPU and wordlist resources remain isolated **Added:** - Remote hashcat client module - Adds HTTP integration for submitting jobs, polling job status, retrieving cracked results, handling bearer authentication, and normalizing local wordlist paths to remote-safe basenames - Remote service configuration support - Enables remote mode through HASHCAT_SERVICE_URL and requires HASHCAT_TOKEN for authenticated requests - Remote result handling - Returns crackd logs, potfile contents, remote errors, exit codes, and timeout failures through the existing ToolOutput structure **Changed:** - Hashcat cracking flow - Updates crack_with_hashcat to check for remote service configuration first and delegate to the remote backend when available, while keeping the existing local hashcat behavior unchanged otherwise
**Added:** - Renovate package rule to automerge patch and minor Cargo, Ansible Galaxy, Galaxy collection, and pre-commit updates via PR - .github/renovate.json5
| datasource | package | from | to | | ---------- | ---------------- | ------ | ------ | | crate | local-ip-address | 0.6.12 | 0.6.13 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ---------- | ---------- | ------- | ------- | | crate | serde_json | 1.0.149 | 1.0.150 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ---------- | ------------ | ------ | ------ | | pypi | ansible-core | 2.20.5 | 2.21.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ----------------- | ------------- | ----- | ----- | | galaxy-collection | ansible.posix | 2.1.0 | 2.2.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ---------- | ------- | ----- | ----- | | crate | sqlx | 0.8.6 | 0.9.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ----------------- | ----------------- | ------ | ------ | | galaxy-collection | community.general | 12.6.1 | 13.0.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
**Removed:** - Removed the temporary Renovate allowedVersions cap that blocked opentelemetry Rust crates from updating to 0.32 and later versions
* fix: assert safety for dynamic sqlx history queries **Changed:** - Wrapped dynamically assembled history queries with `AssertSqlSafe` so sqlx accepts SQL built from static fragments with bound user values - `ares-cli/src/history` - Documented and applied the same safety assertion to credential hash search queries that construct placeholder lists dynamically - `ares-core/src/persistent_store/queries/credentials.rs` * build: update windows-sys lockfile dependency
| datasource | package | from | to | | ---------- | ---------------------------------- | ------ | ------ | | crate | opentelemetry | 0.31.0 | 0.32.0 | | crate | opentelemetry-otlp | 0.31.1 | 0.32.0 | | crate | opentelemetry-semantic-conventions | 0.31.0 | 0.32.0 | | crate | opentelemetry_sdk | 0.31.0 | 0.32.0 | | crate | tracing-opentelemetry | 0.32.1 | 0.33.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
chore(deps): update returntocorp/semgrep docker digest to 9349edb
chore(deps): update actions/upload-artifact action to v7.0.1
…latformAutomerge enables GH auto-merge on PR creation
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from | to | | ----------- | -------------------- | ------- | ------- | | github-tags | github/codeql-action | v4.35.4 | v4.36.0 | Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
…-26) Four independent failures observed during a live op against the GOAD lab. Each is addressed in isolation; no behavioral changes outside the failing paths. 1. orchestrator double-init of tracing subscriber `ares --redis-url <url> orchestrator` panicked with `SetGlobalDefaultError`. main.rs detected the orchestrator subcommand by checking `args().nth(1)`, but global flags like --redis-url shift the subcommand later in argv, so main initialized telemetry as "ares-cli" and orchestrator::run reinitialized as "ares-orchestrator", panicking. ares-core/telemetry/init now uses `try_init()` and returns a no-op guard on a second call; ares-cli/main now scans all argv for the subcommand name. Both layers must work because either alone leaves a footgun for the next caller. Regression test in ares-core. 2. NATS tool dispatch timeout too short for nmap recon async-nats defaults `request_timeout` to 10s. The dispatcher's outer `tokio::time::timeout(1500s, client.request(...))` never got a chance to apply because the underlying NATS client raised `request timed out: deadline has elapsed` first. Bumped the NATS client request_timeout to 30 minutes at connection time and added a per-tool floor table so slow recon tools (nmap_scan, smb_sweep, password_spray, ...) get a minimum deadline that exceeds their worst-case runtime regardless of any tighter dispatcher default an operator might configure. The floor is a minimum, never a cap. 7. stale JetStream consumer blocks orchestrator restart The result-demux consumer on `ARES_TASKS` was ephemeral with no durable name; a SIGKILLed orchestrator left it behind, and the next pod hit `filtered consumer not unique on workqueue stream (10100)` on `create_consumer`. Switched to a deterministic durable name (`ares-orch-result-demux`), delete-then-create on startup, and a `inactive_threshold` of 5 min as a server-side safety net for the case where the delete itself races a pod kill. No backward-compat shim for the old ephemeral consumer — workqueue semantics make the migration idempotent. 9. org-restricted model defaults silently 403 every task `task red:multi` defaults MODEL=gpt-5.2, which OpenAI restricts to allowlisted orgs. Without access every task in the op tipped over with a non-actionable "authentication failed" string from ares-llm. ares-llm openai provider now classifies 403 + org-restriction phrasing as `LlmError::AuthError` (was generic `ApiError`), augments the error message with pointers to `OPENAI_ORG_ID` and `ARES_LLM_MODEL`, and the orchestrator runs a 1-token preflight ping right after provider creation so the failure surfaces once at startup instead of N times per submitted op. `ARES_LLM_PREFLIGHT_SKIP=1` opts out for air-gapped tests.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #353 +/- ##
==========================================
- Coverage 80.03% 78.77% -1.26%
==========================================
Files 433 418 -15
Lines 125577 118527 -7050
==========================================
- Hits 100500 93375 -7125
- Misses 25077 25152 +75
🚀 New features to boost your workflow:
|
Contributor
Author
|
Moving this to my fork — tracking in l50#27 instead. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Four independent failures observed during a live op against the GOAD lab. Each is addressed in isolation; no behavioral changes outside the failing paths.
Bug 1 — orchestrator double-init of tracing subscriber.
ares --redis-url <url> orchestratorpanicked withSetGlobalDefaultError. main.rs detected the orchestrator subcommand by checkingargs().nth(1), but global flags shift the subcommand later in argv, so main initialized telemetry as ares-cli and orchestrator::run reinitialized as ares-orchestrator. ares-core/telemetry/init now uses try_init() and returns a no-op guard on a second call; ares-cli/main scans all argv for the subcommand name.Bug 2 — NATS tool dispatch timeout too short for nmap recon. async-nats defaults request_timeout to 10s. The dispatcher's outer tokio::time::timeout(1500s, client.request(...)) never got to apply because the NATS client raised `request timed out: deadline has elapsed` first. Bumped client request_timeout to 30 minutes at connection time and added a per-tool floor table so slow recon tools get a minimum deadline that exceeds their worst-case runtime.
Bug 7 — stale JetStream consumer blocks orchestrator restart. The result-demux consumer on ARES_TASKS was ephemeral with no durable name; a SIGKILLed orchestrator left it behind, and the next pod hit `filtered consumer not unique on workqueue stream (10100)` on create_consumer. Switched to a deterministic durable name (ares-orch-result-demux), delete-then-create on startup, and an inactive_threshold of 5 min as a server-side safety net.
Bug 9 — org-restricted model defaults silently 403 every task. task red:multi defaults MODEL=gpt-5.2, which OpenAI restricts to allowlisted orgs. Without access every task tipped over with a non-actionable auth string. ares-llm openai provider now classifies 403 + org-restriction phrasing as LlmError::AuthError, augments the message with pointers to OPENAI_ORG_ID and ARES_LLM_MODEL, and the orchestrator runs a 1-token preflight ping right after provider creation so the failure surfaces once at startup instead of N times per submitted op. ARES_LLM_PREFLIGHT_SKIP=1 opts out.
Test plan