diff --git a/.claude/agents/python-ares-expert.md b/.claude/agents/python-ares-expert.md
deleted file mode 100644
index 4663ce3e..00000000
--- a/.claude/agents/python-ares-expert.md
+++ /dev/null
@@ -1,131 +0,0 @@
----
-name: python-ares-expert
-description: Expert on the Python ares codebase at ../ares (src/ares/). Use when you need to understand Python ares architecture, look up how something works in Python, find equivalent implementations, or answer questions about the original Python system before porting to Rust.
-tools: Read, Glob, Grep, Bash
-model: sonnet
----
-
-You are an expert on the **Python ares codebase** located at `/Users/l/dreadnode/ares`. Your job is to answer questions about the Python implementation accurately by reading the actual source code.
-
-## Project Overview
-
-Ares is an autonomous security operations multi-agent system with:
-
-- **Red Team**: LLM-powered penetration testing with coordinator/worker architecture
-- **Blue Team**: SOC alert investigation and threat hunting
-
-Built on the Dreadnode Agent SDK, rigging (LLM framework), and MITRE ATT&CK.
-
-## Codebase Layout
-
-```
-/Users/l/dreadnode/ares/
- src/ares/
- core/ # Core framework
- dispatcher/ # Task dispatcher (routing, throttling, result processing, publishing)
- worker/ # Worker agent (_worker.py, operations.py, prompts.py, dc_resolution.py)
- orchestrator/ # Orchestrator (_orchestrator.py)
- factories/ # Agent factories (red_agents.py, blue_factory.py)
- replay/ # Deterministic replay
- persistent_store/ # Persistent storage
- blue_dispatcher/ # Blue team dispatcher
- blue_worker/ # Blue team worker
- models.py # ALL data models (Credential, Host, Hash, Target, SharedRedTeamState, etc.)
- config.py # Configuration loading
- state_backend.py # Redis state backend (red team)
- blue_state_backend.py # Redis state backend (blue team)
- task_queue.py # Redis task queue (red team)
- blue_task_queue.py # Redis task queue (blue team)
- redis_client.py # Redis client wrapper
- recovery.py # Checkpoint/recovery
- persistence.py # State serialization
- workflows.py # Credential expansion workflows
- engines.py # Question generation engines
- correlation.py # Red-Blue correlation
- evidence_validation.py # Evidence dedup/validation
- k8s_executor.py # Kubernetes pod execution
- lateral_analyzer.py # Graph-based lateral movement
- messages.py # Inter-agent messages
- orchestrator_client.py # Client for orchestrator communication
- orchestrator_service.py # Orchestrator service pod
- query_resilience.py # Query retry logic
- remote.py # Remote K8s execution
- templates.py # Jinja2 template loading
- tracing.py # OpenTelemetry tracing
- capability_registry.py # Agent capability registration
- context_manager.py # LLM context window management
- tool_retrieval.py # Dynamic tool loading
- circuit_breaker.py # Circuit breaker pattern
- tools/
- red/ # Red team tools
- credential_discovery/ # discovery.py, harvesting.py, cracking.py, pilfering.py
- reconnaissance.py # nmap, enum4linux, user/share enumeration
- orchestrator.py # Dispatch functions
- kerberos_attacks.py # Delegation, tickets, ADCS
- lateral_movement.py # psexec, wmi, smb, evil-winrm
- acl_attacks.py # bloodyAD, pywhisker, dacledit
- privilege_escalation.py
- coercion.py # PetitPotam, Coercer, relay
- cve_exploits.py
- reporting.py
- common.py
- blue/ # Blue team tools
- investigation.py, grafana.py, query_templates.py, observability.py, actions.py, learning.py
- shared/
- mitre.py # MITRE ATT&CK integration
- agents/
- red/ # Red team agents (dynamic via factories)
- blue/
- soc_investigator.py # SOC investigation orchestrator
- integrations/ # Third-party integrations
- reports/ # Report generation (investigation.py, redteam.py, blueteam.py)
- eval/ # Evaluation framework
- templates/ # Jinja2 prompt templates
- redteam/agents/ # Per-role agent prompts (orchestrator.md.jinja, recon.md.jinja, etc.)
- main.py # CLI entry point
- cli_ops.py # CLI operations (loot, status, inject, etc.)
- cli_blue_ops.py # Blue team CLI operations
- cli_history.py # CLI history
- tests/ # Test suite
- docs/
- codemap.md # Full codebase map
- red.md # Red team architecture (AUTHORITATIVE)
- blue.md # Blue team workflow
- config/
- multi-agent-production.yaml # Agent configurations
-```
-
-## Multi-Agent Architecture
-
-- **Orchestrator**: Central LLM coordinator, dispatches tasks, never executes tools directly
-- **Workers**: RECON, CREDENTIAL_ACCESS, CRACKER, ACL, PRIVESC, LATERAL, COERCION
-- **Communication**: Redis pub/sub + task queues
-- **State**: Write-through cache (memory + Redis persistence)
-- **Namespace**: `attack-simulation` in Kubernetes
-
-## Key Design Patterns
-
-1. **Write-through cache**: `SharedRedTeamState` in memory, persisted to Redis via `state_backend.py`
-2. **Task queue**: Redis-based with priority routing in `task_queue.py`
-3. **Result processing**: `dispatcher/result_processing.py` extracts credentials/hashes from tool output
-4. **Publishing**: `dispatcher/publishing.py` broadcasts discovered credentials to all agents
-5. **Recovery**: `recovery.py` can restore operation state from Redis checkpoints
-6. **Factory pattern**: `factories/red_agents.py` maps AgentRole -> toolsets (ROLE_TOOLSETS)
-
-## How to Answer Questions
-
-1. **Always read the actual source files** before answering - don't guess from the layout alone
-2. Start with the most relevant file based on the question
-3. For architecture questions, read `docs/red.md` and `docs/codemap.md`
-4. For model/data questions, read `src/ares/core/models.py`
-5. For tool implementations, read the specific file in `src/ares/tools/red/`
-6. For orchestration logic, read `src/ares/core/dispatcher/` and `src/ares/core/orchestrator/`
-7. Be precise: include file paths, function names, and line numbers
-8. When asked "how does X work", trace the full code path
-
-## Important Context
-
-- This codebase is being ported to Rust (the parent project at `/Users/l/dreadnode/ares-rust-cli/ares-rust/`)
-- Questions will often be about understanding the Python implementation to inform the Rust port
-- The Python codebase uses: rigging (LLM), loguru (logging), redis, kubernetes, cyclopts (CLI), pydantic (models)
-- Domain conventions: `contoso.local` (primary), `fabrikam.local` (secondary), `192.168.58.x` subnet
diff --git a/.taskfiles/ec2/Taskfile.yaml b/.taskfiles/ec2/Taskfile.yaml
index c8392930..fa621b86 100644
--- a/.taskfiles/ec2/Taskfile.yaml
+++ b/.taskfiles/ec2/Taskfile.yaml
@@ -36,9 +36,6 @@ vars:
ARES_REMOTE_BIN: '/usr/local/bin'
ARES_REMOTE_CONFIG: '/etc/ares/config.yaml'
ARES_LOG_DIR: '/var/log/ares'
- # Build config
- RUST_TARGET: '{{.RUST_TARGET | default "x86_64-unknown-linux-gnu"}}'
- BIN_DIR: 'target/{{.RUST_TARGET}}/{{.BUILD_PROFILE | default "dev-deploy"}}'
# Build tool: auto (cross on macOS due to aws-lc-sys, zigbuild on Linux), cross, zigbuild, cargo, remote
# remote: builds natively on EC2 (fastest for iteration, no cross-compilation)
BUILD_TOOL: '{{.BUILD_TOOL | default "auto"}}'
@@ -88,6 +85,7 @@ tasks:
desc: "Cross-compile Rust binaries and deploy to EC2 via S3 staging (usage: task ec2:deploy [EC2_NAME=ares-tools])"
silent: true
vars:
+ RUST_TARGET: '{{.RUST_TARGET | default "x86_64-unknown-linux-gnu"}}'
MAX_OPEN_FILES: '{{.MAX_OPEN_FILES | default "65536"}}'
CARGO_BUILD_JOBS: '{{.CARGO_BUILD_JOBS | default "0"}}'
S3_DEPLOY_PREFIX: 'ares-deploy'
@@ -161,21 +159,32 @@ tasks:
"aws s3 cp s3://" + $bucket + "/" + $prefix + "/ares-src.tar.gz /tmp/ares-src.tar.gz",
"tar -xzf /tmp/ares-src.tar.gz -C " + $build_dir,
"cd " + $build_dir + " && cargo build --profile dev-deploy -p ares-cli 2>&1",
- "cp " + $build_dir + "/target/dev-deploy/ares /usr/local/bin/ares && chmod +x /usr/local/bin/ares",
+ "SRC=" + $build_dir + "/target/dev-deploy/ares",
+ "if [ ! -f \"$SRC\" ]; then echo ERROR: build artifact missing at $SRC; exit 1; fi",
+ "BUILD_RAW=$(sha256sum \"$SRC\"); BUILD_SHA=${BUILD_RAW%% *}",
+ "echo Build SHA: $BUILD_SHA",
+ "install -m 755 \"$SRC\" /usr/local/bin/ares",
+ "DEPLOY_RAW=$(sha256sum /usr/local/bin/ares); DEPLOY_SHA=${DEPLOY_RAW%% *}",
+ "echo Deploy SHA: $DEPLOY_SHA",
+ "if [ \"$BUILD_SHA\" != \"$DEPLOY_SHA\" ]; then echo ERROR: deployed sha differs from build artifact build=$BUILD_SHA deploy=$DEPLOY_SHA; exit 1; fi",
"echo Deployed: && ls -lh /usr/local/bin/ares"
]}' > "$PARAMS_FILE"
+ # Clean cargo builds on a t3.medium can run 15-25 min: the cache from before
+ # an EC2 reboot may be gone, and incremental builds still need to relink.
+ # Allow 30 min total for both the SSM command itself and the local
+ # polling loop so we don't bail mid-build with an "InProgress" report.
CMD_ID=$(aws ssm send-command \
--profile "{{.EC2_PROFILE}}" \
--region "{{.EC2_REGION}}" \
--instance-ids "$INSTANCE_ID" \
--document-name "AWS-RunShellScript" \
--parameters "file://$PARAMS_FILE" \
- --timeout-seconds 600 \
+ --timeout-seconds 1800 \
--query "Command.CommandId" --output text)
- # Poll for completion (up to 10 minutes)
- for i in $(seq 1 300); do
+ # Poll for completion (up to 30 minutes)
+ for i in $(seq 1 900); do
STATUS=$(aws ssm get-command-invocation \
--profile "{{.EC2_PROFILE}}" \
--region "{{.EC2_REGION}}" \
@@ -234,11 +243,14 @@ tasks:
echo -e "{{.INFO}} Cross-compiling for {{.RUST_TARGET}} (profile: $PROFILE, jobs: {{.CARGO_BUILD_JOBS}})..."
- # zig 0.15+ cannot handle RLIM_INFINITY — it needs a concrete fd limit.
- CURRENT_FD_LIMIT=$(ulimit -n 2>/dev/null || echo "256")
- if [ "$CURRENT_FD_LIMIT" = "unlimited" ] || [ "$CURRENT_FD_LIMIT" -lt "{{.MAX_OPEN_FILES}}" ] 2>/dev/null; then
- ulimit -n {{.MAX_OPEN_FILES}} 2>/dev/null || ulimit -n 10240 2>/dev/null || ulimit -n 4096 2>/dev/null || true
- fi
+ # Zig 0.15+ rejects RLIM_INFINITY on the *hard* fd limit (returns
+ # ProcessFdQuotaExceeded mid-link). On macOS, default zsh/bash sessions
+ # have soft=1048576 but hard=RLIM_INFINITY (`getrlimit` returns INT64_MAX),
+ # so even a high soft limit isn't enough — Zig sees the unlimited hard
+ # limit and bails. `ulimit -n N` sets both soft and hard to N, which is
+ # exactly what we want. Always pin to a concrete value, regardless of
+ # the current setting.
+ ulimit -n {{.MAX_OPEN_FILES}} 2>/dev/null || ulimit -n 10240 2>/dev/null || ulimit -n 4096 2>/dev/null || true
JOBS="{{.CARGO_BUILD_JOBS}}"
if [ "$JOBS" = "0" ]; then
@@ -301,11 +313,25 @@ tasks:
fi
ls -lh "$BIN_PATH"
+ # Pin sha256 of what we're about to ship so the SSM deploy step can
+ # verify the binary that lands on /usr/local/bin/ares matches exactly.
+ # Without this, the cp can silently fail to overwrite (ETXTBSY, immutable
+ # attribute, symlink redirection, prior deploy race) and the task still
+ # reports success.
+ if command -v sha256sum >/dev/null 2>&1; then
+ BUILD_SHA=$(sha256sum "$BIN_PATH" | awk '{print $1}')
+ else
+ BUILD_SHA=$(shasum -a 256 "$BIN_PATH" | awk '{print $1}')
+ fi
+ echo -e "{{.INFO}} Build SHA: $BUILD_SHA"
+ mkdir -p target/.deploy
+ echo "$BUILD_SHA" > target/.deploy/ares.sha256
+
echo -e "{{.INFO}} Uploading binary to s3://{{.BCP_BUCKET}}/{{.S3_DEPLOY_PREFIX}}/..."
aws s3 cp "$BIN_PATH" "s3://{{.BCP_BUCKET}}/{{.S3_DEPLOY_PREFIX}}/ares" \
--profile "{{.EC2_PROFILE}}" --region "{{.EC2_REGION}}"
- echo -e "{{.SUCCESS}} Binary staged in S3"
+ echo -e "{{.SUCCESS}} Binary staged in S3 (sha=$BUILD_SHA)"
# Pull from S3 on EC2 via SSM + verify (skip for remote builds)
- |
@@ -326,11 +352,30 @@ tasks:
echo -e "{{.INFO}} Pulling binaries from S3 to $INSTANCE_ID..."
+ EXPECTED_SHA=""
+ if [ -f target/.deploy/ares.sha256 ]; then
+ EXPECTED_SHA=$(cat target/.deploy/ares.sha256)
+ fi
+
PARAMS_FILE=$(mktemp)
trap "rm -f $PARAMS_FILE" EXIT
- jq -n --arg bucket "{{.BCP_BUCKET}}" --arg prefix "{{.S3_DEPLOY_PREFIX}}" \
- '{"commands": ["set -e; aws s3 cp s3://" + $bucket + "/" + $prefix + "/ares /usr/local/bin/ares; chmod +x /usr/local/bin/ares; echo Deployed:; ls -lh /usr/local/bin/ares"]}' \
- > "$PARAMS_FILE"
+ jq -n \
+ --arg bucket "{{.BCP_BUCKET}}" \
+ --arg prefix "{{.S3_DEPLOY_PREFIX}}" \
+ --arg expected_sha "$EXPECTED_SHA" \
+ '{"commands": [
+ "set -ex",
+ "aws s3 cp s3://" + $bucket + "/" + $prefix + "/ares /tmp/ares.staged",
+ "STAGED_RAW=$(sha256sum /tmp/ares.staged); STAGED_SHA=${STAGED_RAW%% *}",
+ "echo Staged SHA: $STAGED_SHA",
+ "if [ -n \"" + $expected_sha + "\" ] && [ \"$STAGED_SHA\" != \"" + $expected_sha + "\" ]; then echo ERROR: S3 staged binary sha mismatch expected=" + $expected_sha + " staged=$STAGED_SHA; exit 1; fi",
+ "install -m 755 /tmp/ares.staged /usr/local/bin/ares",
+ "DEPLOY_RAW=$(sha256sum /usr/local/bin/ares); DEPLOY_SHA=${DEPLOY_RAW%% *}",
+ "echo Deploy SHA: $DEPLOY_SHA",
+ "if [ \"$STAGED_SHA\" != \"$DEPLOY_SHA\" ]; then echo ERROR: deployed sha differs from staged staged=$STAGED_SHA deploy=$DEPLOY_SHA; exit 1; fi",
+ "rm -f /tmp/ares.staged",
+ "echo Deployed: && ls -lh /usr/local/bin/ares"
+ ]}' > "$PARAMS_FILE"
CMD_ID=$(aws ssm send-command \
--profile "{{.EC2_PROFILE}}" \
@@ -1032,6 +1077,7 @@ tasks:
SECRETS_ID: '{{.SECRETS_ID | default "ares/api-keys"}}'
LLM_MODEL: '{{.LLM_MODEL | default ""}}'
FLUSH_REDIS: '{{.FLUSH_REDIS | default "true"}}'
+ OPERATION_ID: '{{.OPERATION_ID | default ""}}'
cmds:
- |
INSTANCE_ID=$(aws ec2 describe-instances \
@@ -1047,7 +1093,11 @@ tasks:
exit 1
fi
- OP_ID="op-$(date -u +%Y%m%d-%H%M%S)"
+ if [ -n "{{.OPERATION_ID}}" ]; then
+ OP_ID="{{.OPERATION_ID}}"
+ else
+ OP_ID="op-$(date -u +%Y%m%d-%H%M%S)"
+ fi
echo -e "{{.INFO}} Operation ID: $OP_ID"
# Build target IPs JSON array
@@ -1084,6 +1134,10 @@ tasks:
ANTHROPIC_KEY=$(echo "$SECRETS" | jq -r .ANTHROPIC_API_KEY)
GRAFANA_URL_VAL=$(echo "$SECRETS" | jq -r '.GRAFANA_URL // empty')
GRAFANA_TOKEN_VAL=$(echo "$SECRETS" | jq -r '.GRAFANA_SERVICE_ACCOUNT_TOKEN // empty')
+ LOKI_URL_VAL=$(echo "$SECRETS" | jq -r '.LOKI_URL // empty')
+ if [ -z "$LOKI_URL_VAL" ]; then
+ LOKI_URL_VAL="{{.LOKI_URL}}"
+ fi
DREADNODE_API_KEY=$(echo "$SECRETS" | jq -r '.DREADNODE_API_KEY // empty')
OTEL_TRACES_ENDPOINT="{{.OTEL_TRACES_ENDPOINT}}"
@@ -1101,6 +1155,9 @@ tasks:
ENV_FILE_CMD="$ENV_FILE_CMD; echo 'GRAFANA_SERVICE_ACCOUNT_TOKEN=${GRAFANA_TOKEN_VAL}' >> /etc/ares/env"
fi
fi
+ if [ -n "$LOKI_URL_VAL" ]; then
+ ENV_FILE_CMD="$ENV_FILE_CMD; echo 'LOKI_URL=${LOKI_URL_VAL}' >> /etc/ares/env"
+ fi
ENV_FILE_CMD="$ENV_FILE_CMD; echo 'ARES_DEPLOYMENT={{.EC2_DEPLOYMENT}}' >> /etc/ares/env"
ENV_FILE_CMD="$ENV_FILE_CMD; echo 'NATS_URL=nats://127.0.0.1:4222' >> /etc/ares/env"
# OTEL: send traces to Alloy OTLP gateway → Tempo via HTTP/protobuf
@@ -1120,6 +1177,7 @@ tasks:
export ANTHROPIC_API_KEY='${ANTHROPIC_KEY}'
export GRAFANA_URL='${GRAFANA_URL_VAL}'
export GRAFANA_SERVICE_ACCOUNT_TOKEN='${GRAFANA_TOKEN_VAL}'
+ export LOKI_URL='${LOKI_URL_VAL}'
export ARES_REDIS_URL=redis://127.0.0.1:6379
export NATS_URL=nats://127.0.0.1:4222
{{- if .LLM_MODEL}}
diff --git a/.taskfiles/ec2/scripts/launch-orchestrator.sh.tmpl b/.taskfiles/ec2/scripts/launch-orchestrator.sh.tmpl
index 3b98544a..202a618c 100755
--- a/.taskfiles/ec2/scripts/launch-orchestrator.sh.tmpl
+++ b/.taskfiles/ec2/scripts/launch-orchestrator.sh.tmpl
@@ -1,6 +1,11 @@
#!/bin/bash
-# Launch ares orchestrator with environment variables
-# Placeholders are substituted by the calling task via envsubst/sed
+# Launch ares orchestrator in its own systemd transient unit so it (and any
+# tool subprocesses it spawns) gets its own cgroup, separate from
+# amazon-ssm-agent.service. Otherwise everything launched by SSM
+# RunShellScript inherits SSM's cgroup and competes with it for memory —
+# resulting in CONSTRAINT_MEMCG OOM-kills regardless of OOMScoreAdjust.
+set -euo pipefail
+
export ARES_REDIS_URL=redis://127.0.0.1:6379
export NATS_URL=nats://127.0.0.1:4222
export RUST_LOG=info
@@ -14,6 +19,7 @@ export DREADNODE_WORKSPACE='__DREADNODE_WORKSPACE__'
export DREADNODE_PROJECT='__DREADNODE_PROJECT__'
export GRAFANA_SERVICE_ACCOUNT_TOKEN='__GRAFANA_TOKEN__'
export GRAFANA_URL='__GRAFANA_URL__'
+export LOKI_URL='__LOKI_URL__'
_llm_model='__ARES_LLM_MODEL__'
if [ -n "$_llm_model" ] && [ "$_llm_model" = "${_llm_model#__}" ]; then
export ARES_LLM_MODEL="$_llm_model"
@@ -26,13 +32,57 @@ if [ -n "$_blue_model" ] && [ "$_blue_model" = "${_blue_model#__}" ]; then
fi
export ARES_DEPLOYMENT='__ARES_DEPLOYMENT__'
export ARES_CONFIG=/etc/ares/config.yaml
+export ARES_MAX_CONCURRENT_TASKS=8
_otel_endpoint='__OTEL_TRACES_ENDPOINT__'
if [ -n "$_otel_endpoint" ] && [ "$_otel_endpoint" = "${_otel_endpoint#__}" ]; then
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT="$_otel_endpoint"
export OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'
export OTEL_RESOURCE_ATTRIBUTES='deployment.environment=staging,attack.team=red'
fi
+
+mkdir -p /var/log/ares
+
+# Stop any prior orchestrator (transient unit or stray nohup process).
+systemctl stop ares-orchestrator.service 2>/dev/null || true
+systemctl reset-failed ares-orchestrator.service 2>/dev/null || true
pkill -f 'ares orchestrator' 2>/dev/null || true
sleep 1
-nohup /usr/local/bin/ares orchestrator >/var/log/ares/orchestrator.log 2>&1 &
-echo "Orchestrator started (PID: $!)"
+
+# Spawn as a transient systemd service in system-ares.slice. --setenv=NAME
+# (no value) inherits from current environment, preserving quoting that
+# would otherwise be mangled by EnvironmentFile parsing of JSON payloads.
+exec systemd-run \
+ --unit=ares-orchestrator.service \
+ --slice=system-ares.slice \
+ --description="Ares Orchestrator (transient)" \
+ --collect \
+ --setenv=ARES_REDIS_URL \
+ --setenv=RUST_LOG \
+ --setenv=ARES_OPERATION_ID \
+ --setenv=OPENAI_API_KEY \
+ --setenv=ANTHROPIC_API_KEY \
+ --setenv=DREADNODE_API_KEY \
+ --setenv=DREADNODE_SERVER_URL \
+ --setenv=DREADNODE_ORGANIZATION \
+ --setenv=DREADNODE_WORKSPACE \
+ --setenv=DREADNODE_PROJECT \
+ --setenv=GRAFANA_SERVICE_ACCOUNT_TOKEN \
+ --setenv=GRAFANA_URL \
+ --setenv=LOKI_URL \
+ --setenv=ARES_LLM_MODEL \
+ --setenv=ARES_TOOL_DISPATCH \
+ --setenv=ARES_BLUE_ENABLED \
+ --setenv=ARES_BLUE_LLM_MODEL \
+ --setenv=ARES_DEPLOYMENT \
+ --setenv=ARES_CONFIG \
+ --setenv=ARES_MAX_CONCURRENT_TASKS \
+ --setenv=OTEL_EXPORTER_OTLP_TRACES_ENDPOINT \
+ --setenv=OTEL_EXPORTER_OTLP_PROTOCOL \
+ --setenv=OTEL_RESOURCE_ATTRIBUTES \
+ --property=StandardOutput=append:/var/log/ares/orchestrator.log \
+ --property=StandardError=append:/var/log/ares/orchestrator.log \
+ --property=OOMScoreAdjust=-500 \
+ --property=TasksMax=4096 \
+ --property=MemoryHigh=8G \
+ --property=MemoryMax=10G \
+ /usr/local/bin/ares orchestrator
diff --git a/.taskfiles/ec2/scripts/setup.sh b/.taskfiles/ec2/scripts/setup.sh
index 8c6961a3..549dd8a9 100755
--- a/.taskfiles/ec2/scripts/setup.sh
+++ b/.taskfiles/ec2/scripts/setup.sh
@@ -90,6 +90,46 @@ NATS_UNIT_EOF
echo "=== Creating directories ==="
mkdir -p /var/log/ares /etc/ares
+echo "=== Removing legacy ares-worker@ unit (renamed in PR #226) ==="
+if [ -f /etc/systemd/system/ares-worker@.service ]; then
+ for role in recon credential_access cracker acl privesc lateral coercion; do
+ systemctl disable --now "ares-worker@${role}.service" 2>/dev/null || true
+ done
+ rm -f /etc/systemd/system/ares-worker@.service
+fi
+
+echo "=== Creating system-ares.slice with global memory cap ==="
+cat >/etc/systemd/system/system-ares.slice <<'SLICE_EOF'
+[Unit]
+Description=Ares system slice (orchestrator + workers)
+Before=slices.target
+
+[Slice]
+MemoryMax=12G
+MemoryHigh=10G
+TasksMax=8192
+SLICE_EOF
+
+echo "=== Ensuring 4G swap file (OOM cushion) ==="
+if [ ! -f /swapfile ] || [ "$(stat -c%s /swapfile 2>/dev/null || echo 0)" -lt 4000000000 ]; then
+ swapoff /swapfile 2>/dev/null || true
+ rm -f /swapfile
+ fallocate -l 4G /swapfile || dd if=/dev/zero of=/swapfile bs=1M count=4096
+ chmod 600 /swapfile
+ mkswap /swapfile
+ swapon /swapfile
+ if ! grep -q '^/swapfile' /etc/fstab; then
+ echo '/swapfile none swap sw 0 0' >>/etc/fstab
+ fi
+fi
+
+echo "=== Tuning OOM behavior (oom_kill_allocating_task, swappiness) ==="
+cat >/etc/sysctl.d/90-ares.conf <<'SYSCTL_EOF'
+vm.oom_kill_allocating_task = 1
+vm.swappiness = 10
+SYSCTL_EOF
+sysctl -p /etc/sysctl.d/90-ares.conf >/dev/null
+
echo "=== Creating systemd worker template unit ==="
cat >/etc/systemd/system/ares@.service <<'UNIT_EOF'
[Unit]
@@ -124,6 +164,7 @@ TasksMax=256
[Install]
WantedBy=multi-user.target
UNIT_EOF
+systemctl daemon-reload
echo "=== Installing cracking tools ==="
if ! command -v hashcat >/dev/null 2>&1 || ! command -v john >/dev/null 2>&1; then
diff --git a/.taskfiles/red/Taskfile.yaml b/.taskfiles/red/Taskfile.yaml
index a1568fda..97228365 100644
--- a/.taskfiles/red/Taskfile.yaml
+++ b/.taskfiles/red/Taskfile.yaml
@@ -19,12 +19,13 @@ tasks:
# ===========================================================================
multi:
- desc: "Run multi-agent red team operation (usage: task red:multi [TARGET=dreadgoad] [DOMAIN=contoso.local] [TARGET_ENV=staging])"
+ desc: "Run multi-agent red team operation (usage: task red:multi [TARGET=dreadgoad] [DOMAIN=contoso.local] [TARGET_ENV=staging] [IPS=10.1.10.10,10.1.10.11])"
silent: true
vars:
OPERATION_ID: '{{.OPERATION_ID | default ""}}'
RESUME: '{{.RESUME | default "false"}}'
TARGET_ENV: '{{.TARGET_ENV | default "staging"}}'
+ IPS: '{{.IPS | default ""}}'
OPERATION_ID_COMPUTED:
sh: |
if [ -n "{{.OPERATION_ID}}" ]; then
@@ -71,6 +72,14 @@ tasks:
MODEL_OVERRIDE_ENV="ARES_MODEL_OVERRIDE={{.MODEL}}"
fi
+ # When IPS is supplied, target IPs directly and skip EC2 Name-tag resolution
+ # (the orchestrator pod has no `aws` CLI). Otherwise default to AWS lookup.
+ if [ -n "{{.IPS}}" ]; then
+ TARGET_FLAGS="--ips {{.IPS}}"
+ else
+ TARGET_FLAGS="--resolve-targets --aws-profile {{.TARGET_PROFILE}} --aws-region {{.TARGET_REGION}}"
+ fi
+
# CLI auto-loads .env if present, or use --secrets-from 1password
kubectl exec -i -n {{.K8S_NAMESPACE}} deploy/ares-orchestrator -- \
env $MODEL_OVERRIDE_ENV \
@@ -82,9 +91,7 @@ tasks:
GRAFANA_URL="{{.GRAFANA_URL}}" \
ares --redis-url "{{.REDIS_URL}}" ops submit \
"{{.TARGET}}" "{{.DOMAIN}}" \
- --resolve-targets \
- --aws-profile "{{.TARGET_PROFILE}}" \
- --aws-region "{{.TARGET_REGION}}" \
+ $TARGET_FLAGS \
--pin-active \
--operation-id "{{.OPERATION_ID_COMPUTED}}" \
--model "{{.MODEL}}" \
@@ -738,6 +745,7 @@ tasks:
BLUE_ENABLED: '{{.BLUE_ENABLED | default "0"}}'
BLUE_LLM_MODEL: '{{.BLUE_LLM_MODEL | default ""}}'
EC2_DEPLOYMENT: '{{.EC2_DEPLOYMENT | default "alpha-operator-range"}}'
+ STRATEGY: '{{.STRATEGY | default "comprehensive"}}'
RESOLVED_TARGETS:
sh: |
TARGET="{{.TARGET}}"
@@ -867,7 +875,7 @@ tasks:
# Build JSON payload for ARES_OPERATION_ID
TARGET_IPS_JSON=$(echo "{{.RESOLVED_TARGETS}}" | tr ',' '\n' | sed 's/^/"/;s/$/"/' | paste -sd, - | sed 's/^/[/;s/$/]/')
- ORCH_PAYLOAD="{\"operation_id\":\"{{.OPERATION_ID_COMPUTED}}\",\"target_domain\":\"{{.DOMAIN}}\",\"target_ips\":${TARGET_IPS_JSON},\"model\":\"{{.MODEL}}\"}"
+ ORCH_PAYLOAD="{\"operation_id\":\"{{.OPERATION_ID_COMPUTED}}\",\"target_domain\":\"{{.DOMAIN}}\",\"target_ips\":${TARGET_IPS_JSON},\"model\":\"{{.MODEL}}\",\"strategy\":\"{{.STRATEGY}}\"}"
# Build orchestrator launch script from template
ORCH_SCRIPT=$(mktemp)
@@ -882,6 +890,7 @@ tasks:
-e "s|__DREADNODE_PROJECT__|{{.DREADNODE_PROJECT}}|" \
-e "s|__GRAFANA_TOKEN__|${GRAFANA_SERVICE_ACCOUNT_TOKEN:-}|" \
-e "s|__GRAFANA_URL__|{{.GRAFANA_URL}}|" \
+ -e "s|__LOKI_URL__|{{.LOKI_URL}}|" \
-e "s|__ARES_LLM_MODEL__|{{.MODEL}}|" \
-e "s|__ARES_BLUE_ENABLED__|{{.BLUE_ENABLED}}|" \
-e "s|__ARES_BLUE_LLM_MODEL__|{{.BLUE_LLM_MODEL}}|" \
diff --git a/.taskfiles/remote/Taskfile.yaml b/.taskfiles/remote/Taskfile.yaml
index 8d0f8bb7..94c0893d 100644
--- a/.taskfiles/remote/Taskfile.yaml
+++ b/.taskfiles/remote/Taskfile.yaml
@@ -868,7 +868,15 @@ tasks:
desc: "Cross-compile Rust binaries for Linux (K8s pods)"
silent: true
vars:
- RUST_TARGET: '{{.RUST_TARGET | default "x86_64-unknown-linux-gnu"}}'
+ DETECTED_ARCH:
+ sh: |
+ arch=$(kubectl get nodes -n {{.K8S_NAMESPACE}} -o jsonpath='{.items[0].status.nodeInfo.architecture}' 2>/dev/null || true)
+ case "$arch" in
+ arm64) echo "aarch64-unknown-linux-gnu" ;;
+ amd64) echo "x86_64-unknown-linux-gnu" ;;
+ *) echo "x86_64-unknown-linux-gnu" ;;
+ esac
+ RUST_TARGET: '{{.RUST_TARGET | default .DETECTED_ARCH}}'
MAX_OPEN_FILES: '{{.MAX_OPEN_FILES | default "8192"}}'
CARGO_BUILD_JOBS: '{{.CARGO_BUILD_JOBS | default "4"}}'
cmds:
@@ -922,7 +930,15 @@ tasks:
desc: "Check if local Rust binaries match remote pods (usage: task remote:check [TEAM=red|blue|all])"
silent: true
vars:
- RUST_TARGET: '{{.RUST_TARGET | default "x86_64-unknown-linux-gnu"}}'
+ DETECTED_ARCH:
+ sh: |
+ arch=$(kubectl get nodes -n {{.K8S_NAMESPACE}} -o jsonpath='{.items[0].status.nodeInfo.architecture}' 2>/dev/null || true)
+ case "$arch" in
+ arm64) echo "aarch64-unknown-linux-gnu" ;;
+ amd64) echo "x86_64-unknown-linux-gnu" ;;
+ *) echo "x86_64-unknown-linux-gnu" ;;
+ esac
+ RUST_TARGET: '{{.RUST_TARGET | default .DETECTED_ARCH}}'
BIN_DIR: 'target/{{.RUST_TARGET}}/release'
REMOTE_BIN_DIR: '/usr/local/bin'
cmds:
@@ -1050,7 +1066,15 @@ tasks:
desc: "Deploy Rust binaries to K8s pods (usage: task remote:rust:deploy [TEAM=red|blue|all])"
silent: true
vars:
- RUST_TARGET: '{{.RUST_TARGET | default "x86_64-unknown-linux-gnu"}}'
+ DETECTED_ARCH:
+ sh: |
+ arch=$(kubectl get nodes -n {{.K8S_NAMESPACE}} -o jsonpath='{.items[0].status.nodeInfo.architecture}' 2>/dev/null || true)
+ case "$arch" in
+ arm64) echo "aarch64-unknown-linux-gnu" ;;
+ amd64) echo "x86_64-unknown-linux-gnu" ;;
+ *) echo "x86_64-unknown-linux-gnu" ;;
+ esac
+ RUST_TARGET: '{{.RUST_TARGET | default .DETECTED_ARCH}}'
BIN_DIR: 'target/{{.RUST_TARGET}}/release'
REMOTE_BIN_DIR: '/usr/local/bin'
preconditions:
diff --git a/.taskfiles/remote/orchestrator-wrapper-patch.json b/.taskfiles/remote/orchestrator-wrapper-patch.json
index 9ee1be92..67009f79 100644
--- a/.taskfiles/remote/orchestrator-wrapper-patch.json
+++ b/.taskfiles/remote/orchestrator-wrapper-patch.json
@@ -8,7 +8,7 @@
"op": "replace",
"path": "/spec/template/spec/containers/0/args",
"value": [
- "echo \"ares orchestrator queue dispatcher starting\" >&2\nwhile true; do\n OP_REQUEST=$(RUST_LOG=error ares ops claim-next --timeout 30 2>/dev/null | tail -n 1 || true)\n if [ -n \"$OP_REQUEST\" ]; then\n OP_ID=$(printf '%s\\n' \"$OP_REQUEST\" | sed -n 's/.*\"operation_id\"[[:space:]]*:[[:space:]]*\"\\([^\"]*\\)\".*/\\1/p')\n echo \"Starting operation: ${OP_ID:-unknown}\" >&2\n export ARES_OPERATION_ID=\"$OP_REQUEST\"\n ares orchestrator\n status=$?\n echo \"Operation ${OP_ID:-unknown} exited with status $status\" >&2\n fi\ndone"
+ "echo \"ares orchestrator queue dispatcher starting\" >&2\nwhile true; do\n OP_REQUEST=$(RUST_LOG=error ares ops claim-next --timeout 30 2>/dev/null | tail -n 1 || true)\n case \"$OP_REQUEST\" in *\"\\\"operation_id\\\"\"*) ;; *) OP_REQUEST=\"\" ;; esac\n if [ -n \"$OP_REQUEST\" ]; then\n OP_ID=$(printf '%s\\n' \"$OP_REQUEST\" | sed -n 's/.*\"operation_id\"[[:space:]]*:[[:space:]]*\"\\([^\"]*\\)\".*/\\1/p')\n if [ -z \"$OP_ID\" ]; then\n echo \"Skipping malformed op request\" >&2\n continue\n fi\n echo \"Starting operation: $OP_ID\" >&2\n export ARES_OPERATION_ID=\"$OP_REQUEST\"\n ares orchestrator\n status=$?\n echo \"Operation $OP_ID exited with status $status\" >&2\n fi\ndone"
]
}
]
diff --git a/Taskfile.yaml b/Taskfile.yaml
index 878b9d8b..a3ba0b72 100644
--- a/Taskfile.yaml
+++ b/Taskfile.yaml
@@ -26,6 +26,7 @@ includes:
LOG_DIR: '{{.LOG_DIR}}'
REPORT_DIR: '{{.REPORT_DIR}}'
GRAFANA_URL: '{{.GRAFANA_URL}}'
+ LOKI_URL: '{{.LOKI_URL}}'
DREADNODE_SERVER_URL: '{{.DREADNODE_SERVER_URL}}'
DREADNODE_ORGANIZATION: '{{.DREADNODE_ORGANIZATION}}'
DREADNODE_WORKSPACE: '{{.DREADNODE_WORKSPACE}}'
@@ -51,6 +52,7 @@ includes:
ARES_CONFIG: '{{.ARES_CONFIG}}'
OTEL_TRACES_ENDPOINT: '{{.OTEL_TRACES_ENDPOINT}}'
ALLOY_LOKI_ENDPOINT: '{{.ALLOY_LOKI_ENDPOINT}}'
+ LOKI_URL: '{{.LOKI_URL}}'
blue:
taskfile: .taskfiles/blue/Taskfile.yaml
optional: true
@@ -76,6 +78,7 @@ vars:
# MODEL: '{{.MODEL | default "claude-sonnet-4-5-20250929"}}'
MODEL: '{{.MODEL | default "gpt-5.2"}}'
GRAFANA_URL: '{{.GRAFANA_URL}}'
+ LOKI_URL: '{{.LOKI_URL}}'
POLL_INTERVAL: '{{.POLL_INTERVAL | default "30"}}'
MAX_STEPS_BLUE: '{{.MAX_STEPS_BLUE | default "50"}}'
MAX_STEPS_BLUE_ONCE: '{{.MAX_STEPS_BLUE_ONCE | default "15"}}' # ~15 min max for once mode
diff --git a/ansible/playbooks/ares/goad_attack_box.yml b/ansible/playbooks/ares/goad_attack_box.yml
index 15107d9a..f65d1f6e 100644
--- a/ansible/playbooks/ares/goad_attack_box.yml
+++ b/ansible/playbooks/ares/goad_attack_box.yml
@@ -32,7 +32,7 @@
alloy_deployment_name: "goad-attack-box"
alloy_server_id: ""
alloy_instance_id: ""
- alloy_loki_endpoint: "{{ alloy_loki_endpoint }}"
+ alloy_loki_endpoint: "{{ lookup('env', 'ALLOY_LOKI_ENDPOINT') | default('http://localhost:3100/loki/api/v1/push', true) }}"
alloy_version: "1.10.1"
# Python version
@@ -45,6 +45,12 @@
cracking_tools_gpu_support: true
cracking_tools_hashcat_from_source: true
cracking_tools_nvidia_opencl_icd: true
+ # Bake the kernel-mode NVIDIA driver + CUDA into the image. Without these,
+ # hashcat on g4dn (T4) reports "OpenCL platform not found" and falls back
+ # to john-on-CPU, which is too slow to feed credential cracks back into
+ # the orchestrator within an op's budget.
+ cracking_tools_install_nvidia_driver: true
+ cracking_tools_install_cuda_toolkit: true
cracking_tools_wordlists:
- rockyou
- seclists_passwords
@@ -113,9 +119,14 @@
changed_when: true
roles:
- # AWS infrastructure agents
+ # AWS infrastructure agents — skipped on non-AWS clouds because they
+ # require the EC2 instance metadata service (cloudwatch-agent's
+ # `fetch-config -m ec2` hits 169.254.169.254 and aborts the build
+ # on Azure).
- role: dreadnode.nimbus_range.aws_ssm_agent
+ when: cloud_provider | default('aws') == 'aws'
- role: dreadnode.nimbus_range.aws_cloudwatch_agent
+ when: cloud_provider | default('aws') == 'aws'
# Base Ares requirements
- role: dreadnode.nimbus_range.base
diff --git a/ansible/roles/base/README.md b/ansible/roles/base/README.md
index 6c13b679..a4449559 100644
--- a/ansible/roles/base/README.md
+++ b/ansible/roles/base/README.md
@@ -34,10 +34,9 @@ Base requirements for Ares AI agents
| `base_pip_packages.0` | str | python-dotenv | No description |
| `base_pip_packages.1` | str | rigging>=3.0 | No description |
| `base_pip_packages.2` | str | pydantic | No description |
-| `base_pip_packages.3` | str | asyncio | No description |
-| `base_pip_packages.4` | str | aiohttp>=3.13.4 | No description |
-| `base_pip_packages.5` | str | cryptography>=44.0.1 | No description |
-| `base_pip_packages.6` | str | requests>=2.33.0 | No description |
+| `base_pip_packages.3` | str | aiohttp>=3.13.4 | No description |
+| `base_pip_packages.4` | str | cryptography>=44.0.1 | No description |
+| `base_pip_packages.5` | str | requests>=2.33.0 | No description |
| `base_pip_externally_managed` | bool | False | No description |
| `base_pip_break_required` | bool | False | No description |
| `base_system_packages` | list | [] | No description |
@@ -140,7 +139,10 @@ Base requirements for Ares AI agents
- **Fail when break-system-packages is required but disabled** (ansible.builtin.fail) - Conditional
- **Fail when break-system-packages is required but unsupported by pip** (ansible.builtin.fail) - Conditional
- **Upgrade pip to latest (CVE fixes)** (ansible.builtin.command)
-- **Install Ares Python dependencies** (ansible.builtin.pip)
+- **Install Ares Python dependencies (with full log)** (ansible.builtin.shell)
+- **Show pip install log tail on failure** (ansible.builtin.command) - Conditional
+- **Print pip install tail** (ansible.builtin.debug) - Conditional
+- **Fail if pip install failed** (ansible.builtin.fail) - Conditional
- **Create Ares workspace directory** (ansible.builtin.file) - Conditional
### main.yml
diff --git a/ansible/roles/base/defaults/main.yml b/ansible/roles/base/defaults/main.yml
index 6588b5a0..e366f5da 100644
--- a/ansible/roles/base/defaults/main.yml
+++ b/ansible/roles/base/defaults/main.yml
@@ -28,11 +28,14 @@ base_rust_install_script: "https://sh.rustup.rs"
base_install_pipx: true
# Ares Python dependencies (installed via pip)
+# Do NOT add `asyncio` here — Python 3.4+ ships asyncio in the stdlib. The
+# PyPI `asyncio` package is a 2015-era stub that ships an `asyncio.py` into
+# site-packages, shadowing the stdlib module and breaking any import of
+# asyncio (including the rest of this pip install run on Python 3.13).
base_pip_packages:
- python-dotenv
- "rigging>=3.0"
- pydantic
- - asyncio
- "aiohttp>=3.13.4"
- "cryptography>=44.0.1"
- "requests>=2.33.0"
diff --git a/ansible/roles/base/tasks/linux.yml b/ansible/roles/base/tasks/linux.yml
index 62d42782..4b7350ab 100644
--- a/ansible/roles/base/tasks/linux.yml
+++ b/ansible/roles/base/tasks/linux.yml
@@ -142,16 +142,50 @@
become: true
changed_when: false
-- name: Install Ares Python dependencies
-  ansible.builtin.pip:
-    name: "{{ base_pip_packages }}"
-    state: present
-    executable: "{{ base_pip_executable }}"
-    extra_args: >-
-      {{ '--break-system-packages' if base_pip_break_required else '' }}
-      {{ '--ignore-installed' if ansible_facts['os_family'] == 'Debian' else '' }}
+# Run pip directly via shell so we can tee stdout+stderr to a log file. The
+# ansible.builtin.pip module captures output into a single `msg` field that
+# is too large for CloudWatch's per-event size limit on this dep tree
+# (rigging pulls 100+ transitives), so failures show up as a truncated stdout
+# with no stderr or rc visible. The tee'd log lets the next task surface the
+# real error.
+#
+# `--ignore-installed` is required: Kali ships several Python deps via apt
+# (python3-requests, python3-cryptography, python3-urllib3, python3-yaml).
+# apt-installed packages have no pip RECORD file, so pip's normal upgrade
+# path fails with `uninstall-no-record-file` ("The package was installed
+# by debian"). `--ignore-installed` skips uninstall and overwrites in place.
+- name: Install Ares Python dependencies (with full log)
+  ansible.builtin.shell:
+    cmd: |
+      set -o pipefail
+      {{ base_pip_executable }} install \
+        {{ '--break-system-packages' if base_pip_break_required else '' }} \
+        --ignore-installed \
+        --no-color \
+        {{ base_pip_packages | map('quote') | join(' ') }} \
+        2>&1 | tee /tmp/ares-pip-install.log
+    executable: /bin/bash
+  become: true
+  register: base_pip_install_result
+  changed_when: false
+  failed_when: false
+
+- name: Show pip install log tail on failure
+  ansible.builtin.command: tail -120 /tmp/ares-pip-install.log
   become: true
+  register: base_pip_install_tail
   changed_when: false
+  when: base_pip_install_result.rc != 0
+
+- name: Print pip install tail
+  ansible.builtin.debug:
+    var: base_pip_install_tail.stdout_lines
+  when: base_pip_install_result.rc != 0
+
+- name: Fail if pip install failed
+  ansible.builtin.fail:
+    msg: "pip install failed (rc={{ base_pip_install_result.rc }}); see tail above"
+  when: base_pip_install_result.rc != 0
- name: Create Ares workspace directory
ansible.builtin.file:
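
The install task relies on `set -o pipefail`: a pipeline normally reports the exit status of its last command (`tee` here), so without it a failed `pip install` would register `rc=0` and none of the `rc != 0` follow-up tasks would fire. A minimal sketch of that behaviour, assuming `bash` is on the PATH; `false` stands in for a failing install and the temp log file is illustrative only:

```shell
# Sketch only: `false` stands in for a failing `pip install`.
# The log is a throwaway temp file, not the role's /tmp/ares-pip-install.log.
log="$(mktemp)"

# Default behaviour: the pipeline reports tee's status, so the failure is masked.
rc_plain=$(bash -c 'false 2>&1 | tee "$1" >/dev/null; echo $?' _ "$log")

# With pipefail: the pipeline reports the rightmost non-zero exit status.
rc_pipefail=$(bash -c 'set -o pipefail; false 2>&1 | tee "$1" >/dev/null; echo $?' _ "$log")

echo "without pipefail: rc=$rc_plain"
echo "with pipefail:    rc=$rc_pipefail"
rm -f "$log"
```

This is also a plausible reason the task pins `executable: /bin/bash`: `pipefail` is not guaranteed to be available in a plain `/bin/sh`.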
diff --git a/ansible/roles/cracking_tools/README.md b/ansible/roles/cracking_tools/README.md
index 6c12b795..6400f577 100644
--- a/ansible/roles/cracking_tools/README.md
+++ b/ansible/roles/cracking_tools/README.md
@@ -53,6 +53,17 @@ Install and configure password cracking tools for Ares agents
| `cracking_tools_opencl_packages.1` | str | opencl-headers | No description |
| `cracking_tools_opencl_packages.2` | str | clinfo | No description |
| `cracking_tools_nvidia_opencl_icd` | bool | False | No description |
+| `cracking_tools_install_nvidia_driver` | bool | False | No description |
+| `cracking_tools_install_cuda_toolkit` | bool | False | No description |
+| `cracking_tools_nvidia_driver_packages` | list | [] | No description |
+| `cracking_tools_nvidia_driver_packages.0` | str | linux-headers-cloud-amd64 | No description |
+| `cracking_tools_nvidia_driver_packages.1` | str | dkms | No description |
+| `cracking_tools_nvidia_driver_packages.2` | str | firmware-misc-nonfree | No description |
+| `cracking_tools_nvidia_driver_packages.3` | str | nvidia-kernel-open-dkms | No description |
+| `cracking_tools_nvidia_driver_packages.4` | str | nvidia-driver-cuda | No description |
+| `cracking_tools_nvidia_driver_packages.5` | str | nvidia-opencl-icd | No description |
+| `cracking_tools_nvidia_cuda_toolkit_packages` | list | [] | No description |
+| `cracking_tools_nvidia_cuda_toolkit_packages.0` | str | nvidia-cuda-toolkit | No description |
| `cracking_tools_update_cache` | bool | True | No description |
## Tasks
@@ -94,9 +105,21 @@ Install and configure password cracking tools for Ares agents
- **Set DEBIAN_FRONTEND to noninteractive** (ansible.builtin.lineinfile) - Conditional
- **Update apt cache** (ansible.builtin.apt) - Conditional
- **Create wordlist directory** (ansible.builtin.file)
+- **Add NVIDIA CUDA apt repository (Kali ships 550.x which fails on kernel 6.19+)** (ansible.builtin.shell) - Conditional
+- **Install kernel headers and DKMS prerequisites** (ansible.builtin.apt) - Conditional
+- **Install NVIDIA driver and OpenCL runtime (with full log)** (ansible.builtin.shell) - Conditional
+- **Show NVIDIA install log tail on failure** (ansible.builtin.command) - Conditional
+- **Print NVIDIA install tail** (ansible.builtin.debug) - Conditional
+- **Dump DKMS make.log on failure** (ansible.builtin.shell) - Conditional
+- **Print DKMS make.log** (ansible.builtin.debug) - Conditional
+- **Fail if NVIDIA install failed** (ansible.builtin.fail) - Conditional
+- **Install NVIDIA CUDA toolkit** (ansible.builtin.apt) - Conditional
- **Install GPU support packages** (ansible.builtin.apt) - Conditional
- **Create OpenCL vendors directory** (ansible.builtin.file) - Conditional
- **Register NVIDIA OpenCL ICD** (ansible.builtin.copy) - Conditional
+- **Verify NVIDIA driver (non-fatal — no GPU on builder hosts)** (ansible.builtin.command) - Conditional
+- **Verify OpenCL platform discovery (non-fatal)** (ansible.builtin.command) - Conditional
+- **Show GPU/OpenCL detection summary** (ansible.builtin.debug) - Conditional
- **Ensure libgcc runtime is present for hashcat** (block) - Conditional
- **Install primary libgcc package** (ansible.builtin.apt)
- **Ensure libgcc static archive is present for hashcat** (block) - Conditional
diff --git a/ansible/roles/cracking_tools/defaults/main.yml b/ansible/roles/cracking_tools/defaults/main.yml
index 4fe3e9b7..af1d326c 100644
--- a/ansible/roles/cracking_tools/defaults/main.yml
+++ b/ansible/roles/cracking_tools/defaults/main.yml
@@ -50,4 +50,35 @@ cracking_tools_opencl_packages:
# Set to true when using nvidia/cuda base image to register NVIDIA OpenCL ICD
cracking_tools_nvidia_opencl_icd: false
+# Install the NVIDIA kernel-mode driver + OpenCL runtime on the host. Required
+# on bare-metal/AMI builds (g4dn etc.) where the Kali base image ships without
+# any NVIDIA bits — without this hashcat reports "OpenCL platform not found".
+# Leave false for container builds: the nvidia/cuda runtime base image
+# already provides libnvidia-opencl/libcuda, and the kernel module comes
+# from the host via nvidia-container-toolkit.
+cracking_tools_install_nvidia_driver: false
+# Install the full CUDA toolkit so hashcat can use the CUDA backend (faster
+# than OpenCL on T4/A10/etc.). Pulls ~3GB; only enable on AMI builds.
+cracking_tools_install_cuda_toolkit: false
+# Recommends are intentionally enabled — DKMS, libcuda1, and the kernel
+# module build chain come in via Recommends on Debian/Kali.
+# Kali AMIs ship a `+kali-cloud-amd64` kernel, which needs the `cloud` headers
+# meta-package. We pull the driver + open-source kernel module from NVIDIA's
+# CUDA Debian repo (added in tasks/linux.yml) because Kali's archive
+# nvidia-driver (550.163.01) does not build against kernel 6.19+.
+# `nvidia-kernel-open-dkms` is required for Turing+ (T4 included) on
+# modern kernels; legacy `nvidia-kernel-dkms` is a dead-end here. Pair it
+# with `nvidia-driver-cuda` (CUDA-only userspace) — the `cuda-drivers`
+# meta and full `nvidia-driver` both pull `nvidia-kernel-dkms` (closed
+# kernel module), which Conflicts with the open variant.
+cracking_tools_nvidia_driver_packages:
+ - linux-headers-cloud-amd64
+ - dkms
+ - firmware-misc-nonfree
+ - nvidia-kernel-open-dkms
+ - nvidia-driver-cuda
+ - nvidia-opencl-icd
+cracking_tools_nvidia_cuda_toolkit_packages:
+ - nvidia-cuda-toolkit
+
cracking_tools_update_cache: true
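
The headers meta-package in the list above has to match the kernel flavour that `uname -r` reports. A rough sketch of that mapping follows; the `headers_pkg_for` helper is hypothetical and not part of the role, and the non-cloud fallback `linux-headers-amd64` is an assumption about the generic Debian/Kali flavour:

```shell
# Hypothetical helper: map a kernel release string to a headers meta-package.
headers_pkg_for() {
  case "$1" in
    *cloud-amd64) echo "linux-headers-cloud-amd64" ;;
    *amd64)       echo "linux-headers-amd64" ;;   # assumed generic flavour
    *)            echo "unknown" ;;
  esac
}

# Report the package that would match the running kernel.
headers_pkg_for "$(uname -r)"
```

The cloud pattern is listed first because a cloud release string also ends in `amd64` and would otherwise fall through to the generic branch.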
diff --git a/ansible/roles/cracking_tools/tasks/linux.yml b/ansible/roles/cracking_tools/tasks/linux.yml
index 551746d3..f75a3371 100644
--- a/ansible/roles/cracking_tools/tasks/linux.yml
+++ b/ansible/roles/cracking_tools/tasks/linux.yml
@@ -24,6 +24,132 @@
mode: '0755'
become: true
+# Kali rolling ships kernel 6.19.x, which the Kali archive's NVIDIA driver
+# (550.163.01) cannot compile against — DKMS exits 2. NVIDIA's official
+# CUDA Debian repo carries 575+ which supports modern kernels and offers
+# `nvidia-kernel-open-dkms` (open-source kernel module) for Turing+ GPUs.
+# We add this repo first so the apt install below resolves to fresh
+# packages instead of the stale Kali ones.
+- name: Add NVIDIA CUDA apt repository (Kali ships 550.x which fails on kernel 6.19+)
+  ansible.builtin.shell: |
+    set -euxo pipefail
+    cd /tmp
+    curl -fsSLo cuda-keyring.deb \
+      https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
+    apt-get install -y ./cuda-keyring.deb
+    apt-get update -q
+    rm -f cuda-keyring.deb
+  args:
+    creates: /usr/share/keyrings/cuda-archive-keyring.gpg
+    executable: /bin/bash
+  become: true
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - ansible_facts['os_family'] == 'Debian'
+
+# Install kernel headers + dkms FIRST in their own apt transaction, so they
+# are fully configured before NVIDIA's dpkg postinst runs `dkms autoinstall`.
+# When mixed in a single apt-get call, dpkg may configure
+# `nvidia-kernel-open-dkms` before `linux-headers-cloud-amd64` finishes
+# setting up, and DKMS exits 2 because the headers aren't yet in place.
+- name: Install kernel headers and DKMS prerequisites
+  ansible.builtin.apt:
+    name:
+      - linux-headers-cloud-amd64
+      - dkms
+      - build-essential
+      - firmware-misc-nonfree
+    state: present
+    install_recommends: true
+  become: true
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - ansible_facts['os_family'] == 'Debian'
+
+# Driven through shell+tee instead of ansible.builtin.apt: the apt module
+# captures dpkg stderr but truncates large stdout (DKMS kernel-module build
+# errors land deep in apt-get's output, well after the cutoff). With tee we
+# can show the real error on failure.
+- name: Install NVIDIA driver and OpenCL runtime (with full log)
+  ansible.builtin.shell:
+    cmd: |
+      set -o pipefail
+      DEBIAN_FRONTEND=noninteractive apt-get install -y \
+        -o Dpkg::Options::=--force-confdef \
+        -o Dpkg::Options::=--force-confold \
+        -o APT::Install-Recommends=yes \
+        {{ cracking_tools_nvidia_driver_packages | map('quote') | join(' ') }} \
+        2>&1 | tee /tmp/ares-nvidia-install.log
+    executable: /bin/bash
+  become: true
+  register: cracking_tools_nvidia_install_result
+  changed_when: false
+  failed_when: false
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - ansible_facts['os_family'] == 'Debian'
+
+- name: Show NVIDIA install log tail on failure
+  ansible.builtin.command: tail -200 /tmp/ares-nvidia-install.log
+  become: true
+  register: cracking_tools_nvidia_install_tail
+  changed_when: false
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - cracking_tools_nvidia_install_result.rc | default(0) != 0
+
+- name: Print NVIDIA install tail
+  ansible.builtin.debug:
+    var: cracking_tools_nvidia_install_tail.stdout_lines
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - cracking_tools_nvidia_install_result.rc | default(0) != 0
+
+- name: Dump DKMS make.log on failure
+  ansible.builtin.shell: |
+    set -o pipefail
+    set +e
+    for f in /var/lib/dkms/nvidia/*/build/make.log; do
+      echo "==== $f ===="
+      tail -150 "$f" 2>&1 || true
+    done
+    echo "==== build env ===="
+    which gcc cc make 2>&1 || true
+    gcc --version 2>&1 || true
+    dpkg -l build-essential gcc make 2>&1 | tail -10 || true
+  args:
+    executable: /bin/bash
+  register: cracking_tools_dkms_make_log
+  changed_when: false
+  failed_when: false
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - cracking_tools_nvidia_install_result.rc | default(0) != 0
+
+- name: Print DKMS make.log
+  ansible.builtin.debug:
+    var: cracking_tools_dkms_make_log.stdout_lines
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - cracking_tools_nvidia_install_result.rc | default(0) != 0
+
+- name: Fail if NVIDIA install failed
+  ansible.builtin.fail:
+    msg: "NVIDIA driver install failed (rc={{ cracking_tools_nvidia_install_result.rc }}); see tail above"
+  when:
+    - cracking_tools_install_nvidia_driver | bool
+    - cracking_tools_nvidia_install_result.rc | default(0) != 0
+
+- name: Install NVIDIA CUDA toolkit
+  ansible.builtin.apt:
+    name: "{{ cracking_tools_nvidia_cuda_toolkit_packages }}"
+    state: present
+    install_recommends: true
+  become: true
+  when:
+    - cracking_tools_install_cuda_toolkit | bool
+    - ansible_facts['os_family'] == 'Debian'
+
- name: Install GPU support packages
ansible.builtin.apt:
name: "{{ cracking_tools_opencl_packages }}"
@@ -51,6 +177,33 @@
- cracking_tools_gpu_support | bool
- cracking_tools_nvidia_opencl_icd | default(false) | bool
+# nvidia-smi/clinfo will return non-zero on a CPU-only AMI builder (no GPU
+# attached) — that's expected. The check is purely informational so a logged
+# failure on the first GPU boot is easy to spot.
+- name: Verify NVIDIA driver (non-fatal — no GPU on builder hosts)
+  ansible.builtin.command: nvidia-smi
+  register: cracking_tools_nvidia_smi
+  changed_when: false
+  failed_when: false
+  when: cracking_tools_install_nvidia_driver | bool
+
+- name: Verify OpenCL platform discovery (non-fatal)
+  ansible.builtin.command: clinfo -l
+  register: cracking_tools_clinfo
+  changed_when: false
+  failed_when: false
+  when:
+    - cracking_tools_gpu_support | bool
+    - cracking_tools_install_nvidia_driver | bool
+
+- name: Show GPU/OpenCL detection summary
+  ansible.builtin.debug:
+    msg:
+      - "nvidia-smi rc={{ cracking_tools_nvidia_smi.rc | default('skipped') }}"
+      - "clinfo rc={{ cracking_tools_clinfo.rc | default('skipped') }}"
+      - "{{ cracking_tools_clinfo.stdout | default('clinfo not run') }}"
+  when: cracking_tools_install_nvidia_driver | bool
+
- name: Ensure libgcc runtime is present for hashcat
when:
- cracking_tools_install_hashcat
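
The verification tasks record a return code without ever failing the play. The same pattern in plain shell looks roughly like the sketch below. `probe` is a hypothetical helper, and reporting a missing binary as `not-installed` is an illustrative choice; in the playbook, `skipped` instead means the task's `when` condition was false.

```shell
# Hypothetical helper: run a check, print its return code, never fail the caller.
probe() {
  # $1 is the command name; the full argument list is what gets executed.
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$1 rc=not-installed"
    return 0
  fi
  "$@" >/dev/null 2>&1
  echo "$1 rc=$?"
  return 0
}

# On a CPU-only builder these are expected to report a non-zero or missing result.
probe nvidia-smi
probe clinfo -l
```

Because `probe` always returns 0, a caller running under `set -e` keeps going, which mirrors `failed_when: false` on the two verification tasks.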
diff --git a/ansible/roles/lateral_movement_tools/README.md b/ansible/roles/lateral_movement_tools/README.md
index 8d194ff0..690de5fd 100644
--- a/ansible/roles/lateral_movement_tools/README.md
+++ b/ansible/roles/lateral_movement_tools/README.md
@@ -118,7 +118,7 @@ Install and configure lateral movement and credential extraction tools for Ares
- **Create symlink for ffitarget.h in standard include path** (ansible.builtin.file) - Conditional
- **Install rubyzip gem for evil-winrm dependency** (community.general.gem) - Conditional
- **Install evil-winrm gem (Ubuntu only, Kali uses apt)** (community.general.gem) - Conditional
-- **Update vulnerable ruby gem dependencies (net-imap, resolv, rexml, uri, zlib)** (ansible.builtin.command) - Conditional
+- **Update vulnerable ruby gem dependencies (Ubuntu only - Kali patches via apt)** (ansible.builtin.command) - Conditional
- **Install pth-toolkit (Kali only - may not be available in all repos)** (ansible.builtin.apt) - Conditional
- **Warn if pth-toolkit installation failed** (ansible.builtin.debug) - Conditional
- **Install Impacket from source for lateral movement tools** (ansible.builtin.include_tasks) - Conditional
diff --git a/ansible/roles/lateral_movement_tools/tasks/linux.yml b/ansible/roles/lateral_movement_tools/tasks/linux.yml
index 5ca9c59f..3abc6318 100644
--- a/ansible/roles/lateral_movement_tools/tasks/linux.yml
+++ b/ansible/roles/lateral_movement_tools/tasks/linux.yml
@@ -229,12 +229,25 @@
- ansible_facts['distribution'] != 'Kali'
- lateral_movement_tools_install_evil_winrm
-- name: Update vulnerable ruby gem dependencies (net-imap, resolv, rexml, uri, zlib)
-  ansible.builtin.command: gem update net-imap resolv rexml uri zlib
+# `gem update` is skipped on Kali: evil-winrm ships via apt and Kali tracks
+# CVE patches for net-imap/rexml/uri/zlib through its `ruby-*` debs. On
+# AMI builders, `gem update` here also tends to SIGKILL (rc=-9) inside the
+# Image Builder runner regardless of `--no-document`, so we keep it
+# best-effort with `failed_when: false` and limit it to non-Kali Debian.
+- name: Update vulnerable ruby gem dependencies (Ubuntu only - Kali patches via apt)
+  ansible.builtin.command: gem update --no-document {{ item }}
   become: true
   changed_when: true
+  failed_when: false
+  loop:
+    - net-imap
+    - resolv
+    - rexml
+    - uri
+    - zlib
   when:
     - ansible_facts['os_family'] == 'Debian'
+    - ansible_facts['distribution'] != 'Kali'
     - lateral_movement_tools_install_evil_winrm
- name: Install pth-toolkit (Kali only - may not be available in all repos)