Skip to content

fix: harden orchestrator startup and tool dispatch timeouts#353

Closed
l50 wants to merge 25 commits into
dreadnode:mainfrom
l50:fix/orchestrator-bringup-bugs
Closed

fix: harden orchestrator startup and tool dispatch timeouts#353
l50 wants to merge 25 commits into
dreadnode:mainfrom
l50:fix/orchestrator-bringup-bugs

Conversation

@l50
Copy link
Copy Markdown
Contributor

@l50 l50 commented May 27, 2026

Summary

Four independent failures observed during a live op against the GOAD lab. Each is addressed in isolation; no behavioral changes outside the failing paths.

Bug 1 — orchestrator double-init of tracing subscriber. ares --redis-url <url> orchestrator panicked with SetGlobalDefaultError. main.rs detected the orchestrator subcommand by checking args().nth(1), but global flags shift the subcommand later in argv, so main initialized telemetry as ares-cli and orchestrator::run reinitialized as ares-orchestrator. ares-core/telemetry/init now uses try_init() and returns a no-op guard on a second call; ares-cli/main scans all argv for the subcommand name.

Bug 2 — NATS tool dispatch timeout too short for nmap recon. async-nats defaults request_timeout to 10s. The dispatcher's outer tokio::time::timeout(1500s, client.request(...)) never got to apply because the NATS client raised `request timed out: deadline has elapsed` first. Bumped client request_timeout to 30 minutes at connection time and added a per-tool floor table so slow recon tools get a minimum deadline that exceeds their worst-case runtime.

Bug 7 — stale JetStream consumer blocks orchestrator restart. The result-demux consumer on ARES_TASKS was ephemeral with no durable name; a SIGKILLed orchestrator left it behind, and the next pod hit `filtered consumer not unique on workqueue stream (10100)` on create_consumer. Switched to a deterministic durable name (ares-orch-result-demux), delete-then-create on startup, and an inactive_threshold of 5 min as a server-side safety net.

Bug 9 — org-restricted model defaults silently 403 every task. task red:multi defaults MODEL=gpt-5.2, which OpenAI restricts to allowlisted orgs. Without access every task tipped over with a non-actionable auth string. ares-llm openai provider now classifies 403 + org-restriction phrasing as LlmError::AuthError, augments the message with pointers to OPENAI_ORG_ID and ARES_LLM_MODEL, and the orchestrator runs a 1-token preflight ping right after provider creation so the failure surfaces once at startup instead of N times per submitted op. ARES_LLM_PREFLIGHT_SKIP=1 opts out.

Test plan

  • cargo test --workspace --all-features (6,235 tests pass)
  • pre-commit run --all-files passes
  • Manual: live op against GOAD lab will exercise all four code paths once images rebuild and roll out

l50 and others added 25 commits May 21, 2026 14:31
Renovate skips forks by default. l50/ares is the production target for
this workflow run, so opt in via RENOVATE_FORK_PROCESSING=enabled.
| datasource  | package                 | from   | to     |
| ----------- | ----------------------- | ------ | ------ |
| github-tags | actions/upload-artifact | v7.0.0 | v7.0.1 |
**Key Changes:**

- Added optional remote cracking mode that delegates hashcat jobs to an HTTP service when configured
- Implemented authenticated job submission, polling, timeout handling, and potfile retrieval for remote jobs
- Preserved local hashcat execution as the default path when remote service configuration is absent
- Scoped remote execution to simple wordlist attacks so service-owned GPU and wordlist resources remain isolated

**Added:**

- Remote hashcat client module - Adds HTTP integration for submitting jobs, polling job status, retrieving cracked results, handling bearer authentication, and normalizing local wordlist paths to remote-safe basenames
- Remote service configuration support - Enables remote mode through HASHCAT_SERVICE_URL and requires HASHCAT_TOKEN for authenticated requests
- Remote result handling - Returns crackd logs, potfile contents, remote errors, exit codes, and timeout failures through the existing ToolOutput structure

**Changed:**

- Hashcat cracking flow - Updates crack_with_hashcat to check for remote service configuration first and delegate to the remote backend when available, while keeping the existing local hashcat behavior unchanged otherwise
**Added:**

- Renovate package rule to automerge patch and minor Cargo, Ansible Galaxy, Galaxy collection, and pre-commit updates via PR - .github/renovate.json5
| datasource | package          | from   | to     |
| ---------- | ---------------- | ------ | ------ |
| crate      | local-ip-address | 0.6.12 | 0.6.13 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package    | from    | to      |
| ---------- | ---------- | ------- | ------- |
| crate      | serde_json | 1.0.149 | 1.0.150 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package      | from   | to     |
| ---------- | ------------ | ------ | ------ |
| pypi       | ansible-core | 2.20.5 | 2.21.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource        | package       | from  | to    |
| ----------------- | ------------- | ----- | ----- |
| galaxy-collection | ansible.posix | 2.1.0 | 2.2.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource | package | from  | to    |
| ---------- | ------- | ----- | ----- |
| crate      | sqlx    | 0.8.6 | 0.9.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource        | package           | from   | to     |
| ----------------- | ----------------- | ------ | ------ |
| galaxy-collection | community.general | 12.6.1 | 13.0.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
**Removed:**

- Removed the temporary Renovate allowedVersions cap that blocked opentelemetry Rust crates from updating to 0.32 and later versions
* fix: assert safety for dynamic sqlx history queries

**Changed:**

- Wrapped dynamically assembled history queries with `AssertSqlSafe` so sqlx accepts SQL built from static fragments with bound user values - `ares-cli/src/history`
- Documented and applied the same safety assertion to credential hash search queries that construct placeholder lists dynamically - `ares-core/src/persistent_store/queries/credentials.rs`

* build: update windows-sys lockfile dependency
| datasource | package                            | from   | to     |
| ---------- | ---------------------------------- | ------ | ------ |
| crate      | opentelemetry                      | 0.31.0 | 0.32.0 |
| crate      | opentelemetry-otlp                 | 0.31.1 | 0.32.0 |
| crate      | opentelemetry-semantic-conventions | 0.31.0 | 0.32.0 |
| crate      | opentelemetry_sdk                  | 0.31.0 | 0.32.0 |
| crate      | tracing-opentelemetry              | 0.32.1 | 0.33.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
chore(deps): update returntocorp/semgrep docker digest to 9349edb
chore(deps): update actions/upload-artifact action to v7.0.1
…latformAutomerge enables GH auto-merge on PR creation
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
| datasource  | package              | from    | to      |
| ----------- | -------------------- | ------- | ------- |
| github-tags | github/codeql-action | v4.35.4 | v4.36.0 |

Co-authored-by: ares-renovate[bot] <286782180+ares-renovate[bot]@users.noreply.github.com>
…-26)

Four independent failures observed during a live op against the GOAD lab.
Each is addressed in isolation; no behavioral changes outside the failing
paths.

1. orchestrator double-init of tracing subscriber
   `ares --redis-url <url> orchestrator` panicked with
   `SetGlobalDefaultError`. main.rs detected the orchestrator subcommand by
   checking `args().nth(1)`, but global flags like --redis-url shift the
   subcommand later in argv, so main initialized telemetry as "ares-cli"
   and orchestrator::run reinitialized as "ares-orchestrator", panicking.
   ares-core/telemetry/init now uses `try_init()` and returns a no-op guard
   on a second call; ares-cli/main now scans all argv for the subcommand
   name. Both layers must work because either alone leaves a footgun for
   the next caller. Regression test in ares-core.

2. NATS tool dispatch timeout too short for nmap recon
   async-nats defaults `request_timeout` to 10s. The dispatcher's outer
   `tokio::time::timeout(1500s, client.request(...))` never got a chance
   to apply because the underlying NATS client raised `request timed out:
   deadline has elapsed` first. Bumped the NATS client request_timeout to
   30 minutes at connection time and added a per-tool floor table so slow
   recon tools (nmap_scan, smb_sweep, password_spray, ...) get a minimum
   deadline that exceeds their worst-case runtime regardless of any
   tighter dispatcher default an operator might configure. The floor is a
   minimum, never a cap.

7. stale JetStream consumer blocks orchestrator restart
   The result-demux consumer on `ARES_TASKS` was ephemeral with no
   durable name; a SIGKILLed orchestrator left it behind, and the next
   pod hit `filtered consumer not unique on workqueue stream (10100)` on
   `create_consumer`. Switched to a deterministic durable name
   (`ares-orch-result-demux`), delete-then-create on startup, and a
   `inactive_threshold` of 5 min as a server-side safety net for the
   case where the delete itself races a pod kill. No backward-compat
   shim for the old ephemeral consumer — workqueue semantics make the
   migration idempotent.

9. org-restricted model defaults silently 403 every task
   `task red:multi` defaults MODEL=gpt-5.2, which OpenAI restricts to
   allowlisted orgs. Without access every task in the op tipped over with
   a non-actionable "authentication failed" string from ares-llm.
   ares-llm openai provider now classifies 403 + org-restriction phrasing
   as `LlmError::AuthError` (was generic `ApiError`), augments the error
   message with pointers to `OPENAI_ORG_ID` and `ARES_LLM_MODEL`, and the
   orchestrator runs a 1-token preflight ping right after provider
   creation so the failure surfaces once at startup instead of N times
   per submitted op. `ARES_LLM_PREFLIGHT_SKIP=1` opts out for air-gapped
   tests.
@dreadnode-renovate-bot dreadnode-renovate-bot Bot added area/pre-commit Changes made to pre-commit hooks area/github Changes made to GitHub Actions workflows labels May 27, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 39.13043% with 140 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.77%. Comparing base (db1c4ee) to head (08576d0).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
ares-tools/src/cracker/remote.rs 4.90% 97 Missing ⚠️
ares-core/src/telemetry/init.rs 62.79% 16 Missing ⚠️
ares-cli/src/orchestrator/task_queue.rs 0.00% 15 Missing ⚠️
...s-core/src/persistent_store/queries/credentials.rs 0.00% 3 Missing ⚠️
ares-cli/src/history/search.rs 0.00% 2 Missing ⚠️
ares-cli/src/main.rs 0.00% 2 Missing ⚠️
ares-cli/src/history/cost.rs 0.00% 1 Missing ⚠️
ares-cli/src/history/list.rs 0.00% 1 Missing ⚠️
...c/orchestrator/tool_dispatcher/redis_dispatcher.rs 0.00% 1 Missing ⚠️
ares-core/src/nats.rs 75.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #353      +/-   ##
==========================================
- Coverage   80.03%   78.77%   -1.26%     
==========================================
  Files         433      418      -15     
  Lines      125577   118527    -7050     
==========================================
- Hits       100500    93375    -7125     
- Misses      25077    25152      +75     
Files with missing lines Coverage Δ
ares-cli/src/orchestrator/tool_dispatcher/mod.rs 96.80% <100.00%> (+0.43%) ⬆️
ares-llm/src/provider/openai.rs 68.18% <100.00%> (+6.64%) ⬆️
ares-cli/src/history/cost.rs 0.00% <0.00%> (ø)
ares-cli/src/history/list.rs 0.00% <0.00%> (ø)
...c/orchestrator/tool_dispatcher/redis_dispatcher.rs 40.17% <0.00%> (ø)
ares-core/src/nats.rs 76.20% <75.00%> (+0.12%) ⬆️
ares-tools/src/cracker.rs 88.28% <66.66%> (-0.20%) ⬇️
ares-cli/src/history/search.rs 0.00% <0.00%> (ø)
ares-cli/src/main.rs 0.00% <0.00%> (ø)
...s-core/src/persistent_store/queries/credentials.rs 0.00% <0.00%> (ø)
... and 3 more

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@l50
Copy link
Copy Markdown
Contributor Author

l50 commented May 27, 2026

Moving this to my fork — tracking in l50#27 instead.

@l50 l50 closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/github Changes made to GitHub Actions workflows area/pre-commit Changes made to pre-commit hooks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant