Blue Agent Documentation

Overview

The Ares Blue Agent is an autonomous SOC investigation system. It picks up Grafana alerts, queries Loki logs and Prometheus metrics for evidence, maps findings to MITRE ATT&CK, and writes investigation reports.

Key Capabilities:

Alert triage and multi-stage investigation (triage → causation → lateral → synthesis)
LogQL/PromQL query optimization with rate limiting and retry
Evidence extraction using the Pyramid of Pain framework
MITRE ATT&CK technique mapping and gap analysis
Lateral movement detection and scope expansion
Attack precursor identification (root cause analysis)
Historical investigation store for pattern matching and false-positive tracking
Red-Blue correlation to surface detection gaps
Markdown report generation with timeline, evidence inventory, and recommendations

Core Architecture

Main Components

Investigation Orchestrator

Location: ares-cli/src/orchestrator/blue/

The investigation orchestrator manages the full investigation lifecycle:

Coordinates LLM-powered investigation agents for Grafana alerts
Dispatches tasks to specialized sub-agents (triage, threat hunter, lateral analyst, escalation)
Chains follow-up investigations based on discovered evidence types
Enforces a hard timeout watchdog (1 min per step plus a 2 min buffer)
Generates partial reports on timeout
Handles investigation state persistence via Redis (task queues run on NATS JetStream)
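
The watchdog budget above is simple arithmetic. The following is a minimal sketch, assuming the budget scales linearly with the maximum number of agent steps. `WatchdogConfig` and `hard_timeout` are illustrative names, not the orchestrator's actual API.

```rust
use std::time::Duration;

/// Hypothetical settings mirroring "1 min per step plus a 2 min buffer".
pub struct WatchdogConfig {
    pub timeout_per_step: Duration,
    pub timeout_buffer: Duration,
}

impl Default for WatchdogConfig {
    fn default() -> Self {
        Self {
            timeout_per_step: Duration::from_secs(60),
            timeout_buffer: Duration::from_secs(120),
        }
    }
}

/// Total wall-clock budget before a watchdog would force a partial report.
/// Assumption: the budget is per-step time multiplied by the step cap, plus the buffer.
pub fn hard_timeout(max_steps: u32, cfg: &WatchdogConfig) -> Duration {
    cfg.timeout_per_step * max_steps + cfg.timeout_buffer
}
```

Under these assumptions, a run capped at 15 steps would get 15 * 60 + 120 = 1020 seconds before the watchdog fires.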

Blue Worker Task Loop

Location: ares-cli/src/worker/blue_task_loop.rs

Runs the worker-side investigation loop with:

Adaptive query limits based on alert severity and stage
Query optimization and duplicate detection
Rate limiting to prevent resource abuse
Automatic retry with exponential backoff
Resilience mechanisms for failed queries

Investigation State Model

Location: ares-core/src/models/blue.rs

The SharedBlueTeamState model tracks:

Investigation ID, alert context, current stage
Evidence inventory with pyramid level classification
Timeline of events with MITRE technique mappings
Investigative questions from question engines
Query execution log
Identified MITRE techniques and tactics
Queried hosts/users for scope tracking
Lateral movement graph
Attack synopsis and recommendations
Escalation status

Investigation Workflow

Investigation Stages

1. TRIAGE - "WHAT is happening?"

Initial alert analysis
First-level evidence gathering
IOC extraction (IPs, domains, hashes, processes)
Basic timeline construction
Query limit: 8 queries (12 for critical alerts)

2. CAUSATION - "WHY did it happen?"

Root cause analysis
Precursor attack identification
Attack chain reconstruction
Evidence validation and correlation
Query limit: 14 queries

3. LATERAL - "What is the SCOPE?"

Lateral movement detection
Impact assessment across hosts/users
Scope expansion to compromised assets
Connection graph construction
Query limit: 20 queries

4. SYNTHESIS - Report generation

Evidence consolidation
MITRE ATT&CK mapping
Pyramid of Pain assessment
Recommendations generation
Markdown report creation
Query limit: 20 queries

Investigation Stage Progression

Alert Detected
      ↓
  TRIAGE (query observability data)
      ↓
  CAUSATION (find root cause)
      ↓
  LATERAL (assess scope)
      ↓
  SYNTHESIS (generate report)
      ↓
Report Delivered
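
The linear progression above can be modeled as a small state machine. This is a sketch only: the `Stage` enum and its helpers are hypothetical names chosen for illustration, and the real orchestrator may allow skipping or repeating stages.

```rust
/// The four investigation stages, in the order shown above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Triage,
    Causation,
    Lateral,
    Synthesis,
}

impl Stage {
    /// The stage that follows this one, or None after synthesis.
    pub fn next_stage(self) -> Option<Stage> {
        match self {
            Stage::Triage => Some(Stage::Causation),
            Stage::Causation => Some(Stage::Lateral),
            Stage::Lateral => Some(Stage::Synthesis),
            Stage::Synthesis => None,
        }
    }
}

/// Lists the stages still to run, starting at `start` and ending at synthesis.
pub fn remaining_stages(start: Stage) -> Vec<Stage> {
    let mut stages = vec![start];
    let mut current = start;
    while let Some(next) = current.next_stage() {
        stages.push(next);
        current = next;
    }
    stages
}
```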

Toolsets

Investigation Tools

Location: ares-tools/src/ (blue feature)

Evidence Recording

record_evidence(
    evidence_type: EvidenceType,  // ip, domain, hash, process, file, user, etc.
    value: String,
    pyramid_level: i32,           // 1=Hash Values, 6=TTPs
    mitre_techniques: Vec<String>,
    confidence: f64,              // 0.0-1.0
    description: String,
    source_query: Option<String>
)

Evidence Types:

ip - IP addresses
domain - Domain names
hash - File hashes
process - Process names/paths
file - File paths
user - User accounts
service - Services/daemons
tool - Attack tools
malware - Malware families
technique - MITRE techniques
behavior - Attack behaviors

Pyramid of Pain Levels:

1. Hash Values (trivial to change)
2. IP Addresses
3. Domain Names
4. Network/Host Artifacts
5. Tools
6. TTPs (hard to change)
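
Because `record_evidence` takes `pyramid_level` as a plain integer, callers may want to validate it first. The sketch below assumes the levels are numbered 1 to 6 in the order listed above (matching the `1=Hash Values, 6=TTPs` comment) and that `confidence` must lie within 0.0 to 1.0. `PyramidLevel` and `validate_evidence` are hypothetical names, not part of the documented tool API.

```rust
use std::convert::TryFrom;

/// Pyramid of Pain levels; higher values are harder for an attacker to change.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum PyramidLevel {
    HashValues = 1,
    IpAddresses = 2,
    DomainNames = 3,
    Artifacts = 4,
    Tools = 5,
    Ttps = 6,
}

impl TryFrom<i32> for PyramidLevel {
    type Error = String;

    fn try_from(level: i32) -> Result<Self, Self::Error> {
        match level {
            1 => Ok(Self::HashValues),
            2 => Ok(Self::IpAddresses),
            3 => Ok(Self::DomainNames),
            4 => Ok(Self::Artifacts),
            5 => Ok(Self::Tools),
            6 => Ok(Self::Ttps),
            other => Err(format!("pyramid_level must be 1..=6, got {other}")),
        }
    }
}

/// Rejects out-of-range inputs before evidence would be recorded (illustrative check only).
pub fn validate_evidence(pyramid_level: i32, confidence: f64) -> Result<PyramidLevel, String> {
    if !(0.0..=1.0).contains(&confidence) {
        return Err(format!("confidence must be within 0.0..=1.0, got {confidence}"));
    }
    PyramidLevel::try_from(pyramid_level)
}
```

Deriving `Ord` on the enum makes comparisons such as "is this evidence at the Tools level or higher" a direct `>=` check.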

Timeline Management

add_timeline_event(
    timestamp: String,
    description: String,
    mitre_technique: Option<String>,
    evidence_ids: Vec<String>,
    severity: String  // info, low, medium, high, critical
)

Investigation Tracking

track_host_investigation(hostname: String)
track_user_investigation(username: String)

Completion Tools

complete_investigation(
    attack_synopsis: String,
    recommendations: Vec<String>,
    should_escalate: bool,
    escalation_reason: Option<String>
)

Finalizes the investigation with:

Attack summary and recommendations
Automatic response guidance extraction from alert annotations
Fallback synopsis generation from collected evidence
Investigation report generation trigger

Grafana Integration Tools

get_firing_alerts() -> Vec<Alert>
get_alert_history(alert_name, lookback_hours) -> Vec<Alert>
post_investigation_started(investigation_id, alert_name)
post_investigation_completed(investigation_id, report_url)

Features:

MCP connection management (60s timeout with fallback)
Multi-endpoint support for different Grafana versions
Automatic annotation creation on Grafana dashboards

Observability Tools

LokiTools - LogQL Queries

query_loki(
    logql: String,
    start_time: String,
    end_time: String,
    limit: i32            // defaults to 100
) -> Vec<LogLine>

Features:

Query validation and optimization
Regex error detection (catches empty-compatible patterns like .*)
Support for label matchers, line filters, and parsers
Result streaming with configurable line limits
Automatic time range adjustment on timeout

PrometheusTools - PromQL Queries

query_prometheus_instant(query: String, time: String)
query_prometheus_range(query: String, start: String, end: String, step: String)
get_metric_metadata(metric: String)

Query Template Tools

Pre-built LogQL queries optimized for detecting red team attack patterns:

Windows Event ID detection templates
Pattern-based filters for common attack techniques
Performance optimization (prefer |= over |~)
Optimized selectors to prevent Loki timeouts

Example templates:

Lateral movement detection (RDP, SMB, WMI, PSExec)
Privilege escalation events
Credential dumping patterns
Suspicious process execution
Network reconnaissance

Question Engine Tools

get_combined_questions() -> Vec<InvestigativeQuestion>

Generates investigative questions from three engines:

MITRE Navigator Engine
- Maps evidence to MITRE techniques
- Predicts follow-on techniques in attack chains
- Identifies tactic gaps in coverage
Pyramid Climber Engine
- Pushes investigation from IOCs toward TTPs
- Encourages evidence at higher pyramid levels
- Guides analysts toward actionable intelligence
Detection Recipes Engine
- Windows Security Event patterns
- Structured investigation workflows
- Event ID correlation patterns

Learning Tools

find_similar_investigations(
    alert_name: String,
    mitre_techniques: Vec<String>,
    severity: String
) -> Vec<Investigation>

Features:

Historical investigation lookup
Query effectiveness statistics
False positive pattern learning
Investigation pattern matching

MITRE Lookup Tools

Technique name resolution
Tactic mapping (Reconnaissance, Initial Access, Execution, etc.)
Attack lifecycle coverage analysis
Technique relationship mapping

Detection & Response Features

Alert Correlation

Location: ares-core/src/correlation/

The AlertCluster type groups related alerts using similarity scoring:

Similarity Factors:

Common hosts (40% weight)
Common users (30% weight)
Common IPs (20% weight)
Shared MITRE techniques (10% weight)
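
The weights above can be combined into a single score. The sketch below assumes each factor is measured as Jaccard overlap between the two alerts' entity sets. The source states only the weights, so the overlap measure, `AlertEntities`, `set_of`, and `similarity` are all assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical view of the entities extracted from one alert.
#[derive(Default)]
pub struct AlertEntities {
    pub hosts: HashSet<String>,
    pub users: HashSet<String>,
    pub ips: HashSet<String>,
    pub techniques: HashSet<String>,
}

/// Convenience helper for building a set from string literals.
pub fn set_of(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Jaccard overlap |a ∩ b| / |a ∪ b|; two empty sets count as no overlap.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Weighted similarity in 0.0..=1.0 using the documented factor weights.
pub fn similarity(a: &AlertEntities, b: &AlertEntities) -> f64 {
    0.40 * jaccard(&a.hosts, &b.hosts)
        + 0.30 * jaccard(&a.users, &b.users)
        + 0.20 * jaccard(&a.ips, &b.ips)
        + 0.10 * jaccard(&a.techniques, &b.techniques)
}
```

With this measure, two alerts that share a host but name different users score 0.40, which shows how heavily host overlap dominates the result.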

Features:

Time-window clustering
Extracts hosts, users, IPs, techniques from alert labels/annotations
Identifies campaign patterns across multiple alerts

Lateral Movement Analysis

Location: ares-core/src/state/

The LateralGraph tracks host-to-host connections and attack spread:

Connection Types:

SMB (file shares)
RDP (remote desktop)
WMI (Windows Management Instrumentation)
PSExec (remote execution)
SSH (secure shell)
WinRM (Windows Remote Management)
DCOM (Distributed COM)

Features:

Tracking of investigated vs. pending hosts
Pivot suggestions for scope expansion
Evidence linkage to connections
MITRE technique associations
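
To illustrate the investigated vs. pending distinction, here is a minimal sketch of a connection graph. It is not the actual `LateralGraph` implementation: `LateralGraphSketch`, its fields, and its methods are hypothetical, and evidence linkage and technique associations are left out.

```rust
use std::collections::{BTreeSet, HashSet};

/// Connection types from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum ConnectionType {
    Smb,
    Rdp,
    Wmi,
    PsExec,
    Ssh,
    WinRm,
    Dcom,
}

/// Hypothetical, stripped-down lateral movement graph.
#[derive(Default)]
pub struct LateralGraphSketch {
    edges: Vec<(String, String, ConnectionType)>,
    investigated: HashSet<String>,
}

impl LateralGraphSketch {
    /// Records a host-to-host connection of the given type.
    pub fn add_connection(&mut self, src: &str, dst: &str, kind: ConnectionType) {
        self.edges.push((src.to_string(), dst.to_string(), kind));
    }

    /// Marks a host as already investigated.
    pub fn mark_investigated(&mut self, host: &str) {
        self.investigated.insert(host.to_string());
    }

    /// Hosts seen in any connection that have not been investigated yet, sorted by name.
    pub fn pending_hosts(&self) -> Vec<String> {
        let mut pending = BTreeSet::new();
        for (src, dst, _) in &self.edges {
            for host in [src, dst] {
                if !self.investigated.contains(host) {
                    pending.insert(host.clone());
                }
            }
        }
        pending.into_iter().collect()
    }
}
```

The pending list is one plausible source of pivot suggestions: each entry is a host the attacker may have reached that no query has covered yet.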

Red-Blue Correlation

Location: ares-core/src/correlation/

Correlates red team activities with blue team detections to identify gaps:

Components:

RedTeamActivity - Captures red team attack actions
BlueTeamDetection - Records blue team alert/investigation results
CorrelationMatch - Links activities to detections
DetectionGap - Identifies undetected red team activities
CorrelationReport - Full correlation analysis
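
A detection gap is a red team activity with no matching blue team detection. The sketch below assumes that matching is done on MITRE technique ID plus host. The real correlation engine likely weighs more signals (time windows, match quality), and the struct fields shown here are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical, minimal shape of a red team action.
pub struct RedActivity {
    pub technique: String,
    pub target_host: String,
}

/// Hypothetical, minimal shape of a blue team detection.
pub struct BlueDetection {
    pub technique: String,
    pub host: String,
}

/// Returns the red activities that have no detection for the same technique on the same host.
pub fn detection_gaps<'a>(red: &'a [RedActivity], blue: &[BlueDetection]) -> Vec<&'a RedActivity> {
    // Index detections by (technique, host) for constant-time lookups.
    let detected: HashSet<(&str, &str)> = blue
        .iter()
        .map(|d| (d.technique.as_str(), d.host.as_str()))
        .collect();

    red.iter()
        .filter(|a| !detected.contains(&(a.technique.as_str(), a.target_host.as_str())))
        .collect()
}
```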

Match Quality Levels:

STRONG - Direct correlation with high confidence
GOOD - Clear correlation with supporting evidence
WEAK - Possible correlation with limited evidence
TENUOUS - Low confidence correlation

Evidence Validation

Location: ares-core/src/

Automatic validation of recorded evidence:

IOC extraction from query results
Validation against recent query results
Confidence adjustment based on validation status
Suggested IOCs from query data
Source query tracking for provenance

Query Resilience

Location: ares-core/src/

Ensures reliable query execution:

Automatic retry with exponential backoff
Timeout handling with time range reduction
Query result caching
Connection pooling
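
The first two mechanisms can be expressed as pure functions. This sketch assumes a doubling backoff with a cap, and a time window that is halved after each timeout. The source does not state the multiplier, the cap, or the reduction factor, so those choices and the function names are assumptions.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturating arithmetic keeps very large attempt counts from overflowing.
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Halves a query window after a timeout; returns None once the result would fall below `min`.
pub fn shrink_window(window: Duration, min: Duration) -> Option<Duration> {
    let halved = window / 2;
    if halved >= min {
        Some(halved)
    } else {
        None
    }
}
```

Returning `None` from `shrink_window` gives the caller a clear signal to stop retrying instead of issuing ever-smaller queries.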

Query Management

Adaptive Query Limits

Query limits scale based on alert severity and investigation stage:

Base Limits:

Normal alerts: 8 queries per investigation
Critical alerts: 12 queries per investigation

Stage-Based Limits:

Triage: 8 queries
Causation: 14 queries
Lateral: 20 queries
Synthesis: 20 queries

Bonus Queries:

+3 for finding evidence
+2 for reaching Pyramid level 4+ (Network/Host Artifacts, Tools, or TTPs)

Hard Limits:

Maximum 25 total queries
Maximum 2 runs of identical query (duplicate detection)
Free retries for queries returning 0 results
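
The limits above can be combined into a single budget calculation. The sketch below is one reading of those rules, not the worker's actual logic. It assumes the stage limit is the starting point, each bonus applies once, the bonus threshold is pyramid level 4 or higher, and the result is clamped to the hard maximum. Free retries for empty results are not modeled. All names are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    Triage,
    Causation,
    Lateral,
    Synthesis,
}

pub const HARD_MAX_QUERIES: u32 = 25;
pub const MAX_IDENTICAL_RUNS: u32 = 2;

/// Stage limit plus earned bonuses, never above the hard maximum.
pub fn query_limit(stage: Stage, critical: bool, found_evidence: bool, max_pyramid_level: u8) -> u32 {
    let mut limit = match stage {
        Stage::Triage if critical => 12,
        Stage::Triage => 8,
        Stage::Causation => 14,
        Stage::Lateral | Stage::Synthesis => 20,
    };
    if found_evidence {
        limit += 3; // bonus for finding evidence
    }
    if max_pyramid_level >= 4 {
        limit += 2; // bonus for reaching pyramid level 4 or higher
    }
    limit.min(HARD_MAX_QUERIES)
}

/// Counts runs per query text and rejects runs past the duplicate cap.
#[derive(Default)]
pub struct DuplicateGuard {
    runs: HashMap<String, u32>,
}

impl DuplicateGuard {
    /// Returns true and records the run if the query may still execute.
    pub fn allow(&mut self, query: &str) -> bool {
        let count = self.runs.entry(query.trim().to_string()).or_insert(0);
        if *count >= MAX_IDENTICAL_RUNS {
            return false;
        }
        *count += 1;
        true
    }
}
```

Under this reading, a critical triage that has found Tools-level evidence gets 12 + 3 + 2 = 17 queries, while any lateral-stage investigation with both bonuses is clamped to 25.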

LogQL Optimization

Prevents Broad Selectors:

# BAD - Too broad, causes timeouts
{job=~".+"}
{deployment=~".+"}

# GOOD - Specific labels
{job="eventlog"}
{deployment="windows-hosts"}

Filter Recommendations:

# PREFER: Fast string contains
{job="eventlog"} |= "4624"

# AVOID: Slow regex when unnecessary
{job="eventlog"} |~ "4624"

Best Practices:

Use specific label selectors (job, deployment, namespace)
Apply line filters (|=) before regex patterns (|~)
Limit time ranges for large datasets
Use streaming aggregations when possible
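
A pre-flight check for broad selectors could look like the sketch below. It is a string-based heuristic, not a LogQL parser and not the agent's actual validator. It inspects only the first `{...}` stream selector, flags matchers whose regex is exactly `.+` or `.*`, and assumes label values contain no commas or braces. The function name `broad_matchers` is hypothetical.

```rust
/// Returns the label matchers in the first stream selector that match any value,
/// for example `job=~".+"`.
pub fn broad_matchers(logql: &str) -> Vec<String> {
    // Extract the text between the first '{' and the first '}'.
    let selector = match (logql.find('{'), logql.find('}')) {
        (Some(open), Some(close)) if open < close => &logql[open + 1..close],
        _ => return Vec::new(),
    };

    selector
        .split(',')
        .map(str::trim)
        .filter(|matcher| {
            // Only regex matchers (=~) can be catch-all patterns.
            matcher
                .split_once("=~")
                .map(|(_, pattern)| matches!(pattern.trim().trim_matches('"'), ".+" | ".*"))
                .unwrap_or(false)
        })
        .map(str::to_string)
        .collect()
}
```

A caller could reject the query, or ask the agent to enumerate label values first, whenever the returned list is non-empty.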

Grafana Integration

MCP (Model Context Protocol) Integration

The blue agent uses MCP to connect to Grafana and access observability data:

Capabilities:

Grafana datasource discovery
Loki label name and value enumeration
Prometheus metric discovery
Alert rule management
Dashboard and panel access
Annotation creation and management
Multi-architecture image rendering

Setup: See Grafana MCP Setup for MCP server installation instructions.

Markdown Report Generation

Location: ares-core/src/reports/

Investigation reports include:

Executive Summary
- High-level findings
- Alert context and severity
- Key evidence summary
Timeline of Events
- Chronological attack progression
- Pyramid level indicators
- MITRE technique mappings
MITRE ATT&CK Mapping
- Identified techniques and tactics
- Tactical coverage analysis
- Attack lifecycle visualization
Pyramid of Pain Assessment
- IOC type distribution
- Progression toward TTPs
- Actionable intelligence rating
Evidence Inventory
- Complete evidence list with sources
- Confidence ratings
- Validation status
Scope Analysis
- Affected hosts and users
- Impacted services
- Lateral movement paths
Recommendations
- Immediate response actions
- Remediation steps
- Detection improvements
Appendix
- Raw query data
- Investigation metadata
- JSON export

Investigation Persistence

Completed investigations are stored for learning and reference:

Investigation store for historical lookup
Query effectiveness statistics
Pattern matching for similar cases
False positive tracking

Advanced Investigation Capabilities

Four Question Engines

The blue agent uses four mandatory question engines to guide investigations:

1. Precursor Attack Chain Engine

Identifies what came BEFORE the detected technique:

Analyzes MITRE attack phases
Identifies likely precursor techniques
Builds complete attack chains
Focuses on root cause analysis

2. MITRE Navigator Engine

Maps techniques and predicts progression:

Maps evidence to MITRE techniques
Predicts follow-on techniques
Identifies tactical gaps in coverage
Suggests techniques commonly seen together

3. Pyramid of Pain Climber Engine

Pushes investigation toward actionable intelligence:

Guides from IOCs (hashes, IPs) toward TTPs
Encourages evidence at higher pyramid levels
Focuses on attacker behaviors vs artifacts
Prioritizes hard-to-change indicators

4. Detection Recipes Engine

Provides structured investigation workflows:

Windows Event ID patterns
Event correlation sequences
Investigation checklists
Known attack patterns

Agent Instructions & Anti-Patterns

Critical Focus Areas:

Query efficiency: query → record evidence → complete (minimize query loops)
Use current time values (not stale alert timestamps)
Mandatory datasource discovery workflow
Label value enumeration to prevent timeouts
Immediate evidence recording after queries
Precursor investigation emphasis (root cause)
Lateral scope expansion for high/critical alerts

Anti-Patterns to Avoid:

Multiple queries without recording evidence
Broad regex patterns in label selectors
Long time ranges on high-cardinality data
Duplicate or redundant queries
Investigation without following question engines
Ignoring query result validation

Key Files Reference

| Component | Path |
| --- | --- |
| Blue Orchestrator | `ares-cli/src/orchestrator/blue/` |
| Blue Worker Task Loop | `ares-cli/src/worker/blue_task_loop.rs` |
| Blue CLI Commands | `ares-cli/src/blue/` |
| Core Models | `ares-core/src/models/` |
| State Management | `ares-core/src/state/` |
| Correlation Engine | `ares-core/src/correlation/` |
| Report Generation | `ares-core/src/reports/` |
| Tool Dispatch | `ares-tools/src/` |
| Configuration | `config/ares.yaml` |

Configuration

Investigation Configuration

Blue agent configuration lives under config/ (for example, config/ares.yaml):

blue_team:
  investigation:
    max_queries: 25  # Hard query limit
    timeout_per_step: 60  # Seconds per investigation step
    timeout_buffer: 120  # Extra seconds before hard timeout
    query_cache_ttl: 300  # Query cache TTL in seconds

  observability:
    loki_timeout: 30  # Loki query timeout
    prometheus_timeout: 30  # Prometheus query timeout
    default_log_limit: 100  # Default log line limit

  reporting:
    format: markdown  # Report format
    include_raw_data: true  # Include appendix with raw data
    export_json: true  # Export JSON alongside markdown

Usage

Prerequisites

API keys in .env or 1Password: ANTHROPIC_API_KEY, OPENAI_API_KEY, GRAFANA_SERVICE_ACCOUNT_TOKEN, DREADNODE_API_KEY
Grafana MCP configured (see Grafana MCP Usage)
Redis accessible (K8s in-cluster, or port-forwarded for local/EC2)
ares binary built (cargo build --release)

Quick Start

# 1. Start a blue investigation from the latest red team operation
task blue:once LATEST=true

# 2. Monitor progress
task blue:multi:status LATEST=true

# 3. View results
task blue:multi:evidence LATEST=true
task blue:multi:techniques LATEST=true
task blue:reports:consolidate LATEST=true

Taskfile Commands

All blue team tasks are invoked via task blue:<command>. Most accept OPERATION_ID=op-xxx or LATEST=true to identify the target.

Starting Investigations

# Single investigation from a red team operation (local execution)
task blue:once OPERATION_ID=op-xxx
task blue:once LATEST=true

# Single investigation from a red team operation (K8s remote)
task blue:once:remote LATEST=true

# Submit a specific alert JSON file
task blue:investigate ALERT=alert.json

# Continuous poll mode (re-checks every POLL_INTERVAL seconds)
task blue:poll

# Multi-agent investigation via K8s orchestrator
task blue:multi ALERT=alert.json
task blue:multi ALERT=alert.json INVESTIGATION_ID=inv-xxx MULTI_AGENT=true

# Multi-agent from red team operation (K8s remote)
task blue:multi:remote LATEST=true
task blue:multi:remote OPERATION_ID=op-xxx

Monitoring Investigations

# Investigation status
task blue:multi:status LATEST=true
task blue:multi:status INVESTIGATION_ID=inv-xxx

# Aggregate status for all investigations in an operation
task blue:multi:operation-status LATEST=true
task blue:multi:operation-status LATEST=true WATCH=10  # auto-refresh

# List all investigations
task blue:multi:list

# Runtime info
task blue:multi:runtime LATEST=true

# Triage decision audit trail
task blue:multi:triage-status LATEST=true

# Follow logs
task blue:multi:logs                          # orchestrator only
task blue:multi:logs ALL=true                 # all blue pods
task blue:multi:logs ROLE=threat-hunter       # specific role

Viewing Results

# Evidence collected (Pyramid of Pain items)
task blue:multi:evidence LATEST=true
task blue:multi:evidence LATEST=true JSON=true  # machine-readable

# MITRE ATT&CK techniques identified
task blue:multi:techniques LATEST=true

Reports

# Generate consolidated report from Redis state
task blue:reports:consolidate LATEST=true
task blue:reports:consolidate OPERATION_ID=op-xxx OUTPUT_DIR=./reports

# Export detection playbook (runs on red orchestrator pod)
task blue:playbook LATEST=true
task blue:playbook OPERATION_ID=op-xxx JSON=true

# List / view local reports
task blue:reports:list
task blue:reports:latest

# Clean up reports
task blue:reports:clean

Cleanup

# Delete a single investigation
task blue:multi:delete INVESTIGATION_ID=inv-xxx

# Delete an operation and all its investigations
task blue:multi:delete-operation OPERATION_ID=op-xxx

# Clean up investigations older than N hours
task blue:multi:cleanup MAX_AGE_HOURS=24
task blue:multi:cleanup ALL=true DRY_RUN=true  # preview before deleting

Direct CLI Commands

For environments without Taskfile, or when you need more control, use ares directly. Add --k8s <NAMESPACE> for K8s or --ec2 <NAME> for EC2 transport.

# Submit from red team operation alerts
ares blue from-operation --latest
ares --k8s attack-simulation blue from-operation op-xxx

# Submit a single alert
ares blue submit '{"alert_title":"Suspicious LSASS","severity":"high"}'

# Continuous poll mode
ares blue watch --poll-interval 30 --max-steps 50

# Investigation status and results
ares blue list
ares blue status --latest
ares blue evidence --latest
ares blue evidence --latest --json
ares blue techniques --latest
ares blue runtime --latest
ares blue triage-status --latest
ares blue operation-status --latest --watch 10

# Report generation
ares blue report --latest --output-dir ./reports
ares blue report --operation-id op-xxx --regenerate

# Cleanup
ares blue delete inv-xxx --force
ares blue delete-operation op-xxx --force
ares blue cleanup --max-age-hours 24 --all --force
ares blue cleanup --dry-run

EC2 Deployment

When running on EC2 instead of K8s, port-forward Redis first:

# Start SSM port-forward (Redis on localhost:16379)
task ec2:redis:forward EC2_NAME=ares-tools

# In another terminal, run blue commands with the forwarded Redis
ARES_REDIS_URL=redis://localhost:16379 ares blue from-operation --latest

Running Blue Alongside Red

Set BLUE_ENABLED=1 to start blue team investigations automatically when a red team operation runs:

task red:ec2:multi TARGET=dreadgoad DOMAIN=contoso.local BLUE_ENABLED=1

Taskfile Variables

| Variable | Default | Description |
| --- | --- | --- |
| `MODEL` | config file | LLM model override |
| `POLL_INTERVAL` | `30` | Seconds between poll cycles |
| `MAX_STEPS_BLUE` | `50` | Max agent steps (watch/poll mode) |
| `MAX_STEPS_BLUE_ONCE` | `15` | Max agent steps (once/investigate mode) |
| `GRAFANA_URL` | (none - must be set) | Grafana instance |
| `K8S_NAMESPACE` | `attack-simulation` | K8s namespace for remote commands |
| `REPORT_DIR` | `./reports` | Report output directory |
| `LOG_DIR` | `./logs` | Log output directory |

Summary

The Ares Blue Agent handles autonomous SOC investigation:

Picks up alerts from Grafana
Queries Loki and Prometheus with rate limiting and retry
Extracts evidence using the Pyramid of Pain framework
Maps to MITRE ATT&CK for tactical context and gap analysis
Identifies attack precursors to build complete attack chains
Detects lateral movement and expands investigation scope
Correlates related alerts to identify campaign patterns
Learns from past investigations
Generates reports with timelines, recommendations, and evidence
Posts annotations back to Grafana

The blue agent cuts investigation time by automating the triage-to-report pipeline. The Red-Blue correlation loop surfaces detection gaps that manual review tends to miss.
