Cisco-Talos · DavidJBianco · May 14, 2026 · May 14, 2026
diff --git a/TODO.md b/TODO.md
@@ -2,7 +2,7 @@
 
 **Status:** Phase 8.5 (Dual src/dst HostContext) COMPLETE; Pre-MVP quality fixes ongoing
 **Started:** 2026-03-11
-**Last Updated:** 2026-04-29
+**Last Updated:** 2026-05-14
 
 See [CHANGELOG.md](CHANGELOG.md) for detailed development history of completed phases.
 
@@ -241,7 +241,7 @@ Replaced manual per-emitter field coordination with SecurityEvent intermediate r
 - [x] **P1** Web application response/session realism follow-up — Added data-driven inbound `web_server` visitor profiles so human visitors consume `traffic_rates.web` as top-level actions, then fan out into required page assets/API calls through `site_maps.yaml`; crawler, health-check, API-client, and opportunistic-probe traffic now uses source-native configured request/status/User-Agent profiles. Static resource sizes are stable per host/path, human navigation and render fanout timing use `timing_profiles.yaml`, and docs/skill references now explain the budget and config ownership. Verification passed: focused web/timing/baseline tests (`107 passed, 1 skipped`), config-related tests (`64 passed`), `uv run eforge validate-config`, repo-wide Ruff checks/format checks, full normal `uv run pytest -q` (`3012 passed, 15 skipped`), and `git diff --check`.
 - [x] **P1** Well-synced network sensor timing follow-up — Replaced hardcoded multi-sensor Zeek +/-400ms skew plus broad path delay with a validated `network_sensor_observation` timing profile. The default `well_synced` profile keeps stable per-sensor clock skew within +/-1.5ms and per-flow capture/path delay within 50-2000us while preserving canonical packet/byte truth unless source-native observation variance is explicitly enabled. Verification passed with focused Zeek/timing tests, `uv run eforge validate-config`, repo-wide Ruff checks/format checks, full normal `uv run pytest -q` (`3012 passed, 15 skipped`), and `git diff --check`.
 - [x] **P1** Source identity and endpoint baseline realism sprint — completed TLS/X.509 issuer-compatible chain signatures, Sysmon Event 7 native third-party module identity, config-driven Windows scheduled-process timing, and DHCP registry emission policy tied to lease activity. Verified with `uv run eforge validate-config`, focused regressions, Ruff, normal pytest, and slow-inclusive pytest.
-- [ ] **P2** Endpoint/eCAR baseline variance follow-up — Loop 96 found workstation eCAR category volumes and Linux process lifecycle evidence too uniform and complete. The realistic endpoint observation-gap portion is now handled by named observation profiles; remaining work should focus on host/persona-specific volume variance, long-lived process state, and benign unmatched endpoint artifacts.
+- [x] **P2** Endpoint/eCAR baseline variance follow-up — addressed through the host/activity profile realism layer. Host family, role, persona, and stable per-host multipliers now shape endpoint, process, registry, scheduled-task, syslog, bash, eCAR, Windows, Zeek, firewall, IDS, web, and proxy rates; config-driven encoded PowerShell variants and benign endpoint texture reduce repeated per-host artifacts. Verification passed with focused host-activity/config/ASA/baseline tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v`, and slow-inclusive `uv run pytest -v --include-slow --no-cov` (`3057 passed, 1 skipped`).
 - [x] **Later architectural sprint: imperfect observation and source coverage** — implemented a training-friendly `complete` default plus overlay-compatible named observation profiles that apply deterministic source-level drop/delay/coverage semantics without modeling contradictions. The policy covers endpoint, network, proxy/web, firewall, IDS, Windows, Sysmon, Zeek, syslog, bash history, and eCAR source families, while ground truth preserves canonical truth and records source evidence status. Verification passed: focused observation/config/ground-truth tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v` (`3036 passed, 15 skipped`), and slow-inclusive `uv run pytest -v --include-slow` (`3050 passed, 1 skipped`).
 - [x] Full slow-suite regression cleanup after loop-65 merge — explicit-proxy storyline beacons now preserve authored hostname+destination IP pairs only when the storyline marks that pair as intentional, normal proxy-origin DNS resolution remains intact, and the parallel-generation LogonID assertion treats Type 7 unlock reuse as valid slice-of-time Windows behavior. Verified with targeted proxy/parallel tests, `uv run ruff check .`, `uv run ruff format --check .`, and `uv run pytest -v --include-slow` (`2875 passed, 23 skipped`).
   Detection Engineer blind review completed for the regenerated Loop 61 dataset at `scenarios/iteration-test/data`; reviewer verdict: Synthetic, 63/100 confidence. Main findings: one PROXY-01 sshd accepted-login lifecycle gap/self-source artifact and Windows 4648 explicit-credential caller PID/image provenance ambiguity around `WS-MCHEN-01`.
@@ -437,7 +437,7 @@ Data works but experienced analysts spot tells. Grouped by format for efficient
 - [x] Event 10 source/target pairs too narrow — fixed by widening `process_access_patterns.yaml` and seeded long-lived process actors. Verification audit output: 950 Event 10 records used 16 source/target pairs.
 - [x] Registry writer processes too narrow — fixed with key-family-aware writer selection. Verification audit output: Event 12/13 records used 12 writer process images and 1,968 unique TargetObject paths with 0 template artifacts.
 - [x] Event 7 residual attribution issues — tightened generic module/process matching and retained process-aware DLL materialization. Verification audit output: 380 Event 7 records used 42 unique ImageLoaded paths.
-- [ ] Cross-source distribution realism layer — defer until data-source reviews are complete. Independent Sysmon reviews found that field-level realism improved, but per-host event volumes and recipe selection remain too uniform. Design a deterministic host/activity profile layer derived from scenario facts (host type, roles, assigned_user, persona, services, stable seed) and use it to shape Sysmon, Windows Security, Zeek, syslog, firewall, web, proxy, and eCAR/EDR rates. Avoid implementing Sysmon-only profile logic unless needed as a narrow bug fix.
+- [x] Cross-source distribution realism layer — implemented a deterministic, overlay-capable host/activity profile layer derived from host family, roles, persona/risk, services, and stable per-host variance. Baseline generation now uses these profiles to scale Windows Security/Sysmon/eCAR, Zeek/network/web/proxy, Linux syslog/bash, firewall/ASA, IDS, auth, endpoint registry, scheduled process, and service-noise volumes without requiring scenario YAML changes.
 
 **Zeek:**
 - [x] Zeek DNS / network support log review — fixed DNS/TLS PTR coherence, added realistic TXT lookup variety, prevented CDN-hostname MX artifacts, increased file-server SMB target coverage, and made SSH pivot UIDs respect sensor visibility. Tests, docs, skills, and skill references updated where needed.
@@ -583,8 +583,8 @@ Data works but experienced analysts spot tells. Grouped by format for efficient
 - [x] Security: bound threat-detection deny timestamp tracking window to prevent unbounded memory/CPU growth
 - [x] ASA imperfect-observation realism — addressed by the general observation profile layer. `complete` preserves paired training-friendly firewall evidence, while non-default profiles can apply deterministic ASA source-family gaps that create realistic missing/partial firewall evidence without rewriting canonical truth.
 - [ ] ASA message type diversity limited to 106023/302013-16/305011-12 — missing 111008, 113004, 733100, 106001, 725001, 304001
-- [ ] ASA deny baseline burstiness/profile variance — defer to a general per-source activity profile rather than a one-off ASA fix. Current deny events are uniformly spaced (3-7s); real scans should have configurable burst/quiet periods, campaign-level cadence, and source-specific variance.
-- [ ] ASA deny metadata diversity — defer to a general field-distribution realism layer. Current deny events use `[0x0, 0x0]` hash values uniformly; a later profile should model when hashes remain zero vs vary by platform/message/context.
+- [x] ASA deny baseline burstiness/profile variance — fixed through host activity profiles and firewall-deny burst scheduling. Baseline denies now use deterministic burst/quiet periods and host/profile variance instead of uniform 3-7 second spacing.
+- [x] ASA deny metadata diversity — fixed by carrying deny hash metadata on canonical firewall context and rendering stable varied ASA hash values where appropriate instead of hardcoded `[0x0, 0x0]`.
 - [ ] Recognizable 45.33.32.x public IPs remain in built-in scan/attacker pools — the original `45.33.32.1` NAT PAT finding is stale, but code still uses `45.33.32.156` in scan/attacker pools. Move these values into data/config or replace them with less recognizable public-looking lab addresses during the broader public-IP/profile cleanup.
 
 **eCAR:**
@@ -598,10 +598,11 @@ Data works but experienced analysts spot tells. Grouped by format for efficient
 **Cross-Source / General:**
 - [x] Configurable cross-source evidence disagreement — implemented as named observation profiles with `complete` as the default. Non-default profiles can introduce deterministic dropped/delayed/filtered/out-of-window evidence across Zeek, web, proxy, firewall, IDS, Windows, Sysmon, syslog, bash history, and eCAR without contradictions or ambiguous rewrites; ground truth retains source evidence status for traceability.
 - [x] Cross-sensor timestamp precision identical to 15+ decimal places — microsecond jitter added in snort.py, windows.py, and storyline.py
-- [ ] **P2** Per-host-type event rate multiplier — Domain controllers generate ~50 events/hr but real DCs running AD/DNS/DFS/GPO produce thousands/hr. `system.type` is used for routing but never for volume scaling. Need `event_rate_multiplier` on System model (or implicit per-type defaults) applied in `_calculate_events_for_hour()` and `_generate_system_traffic()`. DCs should be 3-5x workstation baseline; file servers and web servers similarly elevated.
-- [ ] Configurable per-entity artifact variation — deferred to the general host/activity profile layer. Encoded PowerShell baseline noise is currently identical across hosts (same Get-Service blob); later profiles should derive stable per-host command variants, encoded payloads, tool versions, and operator habits.
-- [ ] Configurable per-host volume variance — deferred to the general host/activity profile layer. Workstation connection counts are suspiciously uniform (808-1068 range); later profiles should widen variance by role, persona, weekday, installed apps, and stable host-specific multipliers.
+- [x] **P2** Per-host-type event rate multiplier — implemented as implicit host/activity profile defaults rather than scenario YAML fields. Domain controllers, file servers, web servers, proxies, Linux servers, and workstations now receive role/family/persona-specific multipliers across baseline activity, auth, endpoint, network, and source-specific noise.
+- [x] Configurable per-entity artifact variation — implemented in the host/activity profile layer for baseline artifact texture, including stable per-host encoded PowerShell variants and profile-owned endpoint activity scaling.
+- [x] Configurable per-host volume variance — implemented via stable host/persona/role multipliers applied across major activity families so hosts no longer share narrow uniform volume bands by construction.
 - [ ] Configurable per-host/source log deployment coverage — observation profiles now support source-family gaps and host-scoped missingness multipliers, but explicit per-host source enablement/disablement remains future work. A later setting should model named host groups, disabled sensors, partial deployments, and collection windows when users need topology-level telemetry coverage differences rather than event-level missingness.
+- [ ] **P2** Generation speed and efficiency follow-up — Sprint 4 host/activity realism is functionally verified, but the slow-inclusive suite exposed that `pytest-cov` plus `tracemalloc` can make the medium dataset memory test pathological. A future sprint should profile generation without instrumentation noise, identify hot paths introduced by richer host activity/web fanout/firewall texture, and decide whether to optimize generation, mark the memory test `--no-cov`, or relax/update stale performance assertions.
 - [x] DNS IP pool reuse causes cross-provider resolution (CloudFront→Microsoft IPs, etc.) — domain-first selection ensures consistent domain→IP mapping via FORWARD_DNS
 - [x] AWS region mismatch between DNS PTR and SSL SNI for same IP — AWS hostname/PTR generation now derives a stable per-IP region/edge identity and PTR generation respects known forward hostname context.
 - [x] TLS volume clustering design — added data-driven TLS destination profiles with overlay support and `eforge validate-config` schema/tag checks. Auto-generated external TLS now uses weighted enterprise, certificate-infra, package-update, developer-tool, and long-tail browsing profiles with stable per-host preferences. Smoke output had 28,544 TLS SNI rows, 116 distinct names, top SNI share 5.5%, and top-5 share 18.0%.

diff --git a/commands/eforge/config.md b/commands/eforge/config.md
@@ -70,6 +70,7 @@ When writing to the overlay, files are partial — they contain ONLY the user's
 | Modify Windows auth realism | `windows_auth_realism.yaml` | (standalone — Security log auth timing and failed-logon profile knobs) |
 | Modify baseline auth noise | `auth_noise.yaml` | (standalone — stale scheduled-credential accounts and irregular recurrence timing) |
 | Modify endpoint background noise | `endpoint_noise.yaml` | (standalone — scheduled-process timing and DHCP registry emission policy) |
+| Modify host activity distribution | `host_activity_profiles.yaml` | (standalone — host/persona/role rate-family multipliers, firewall deny bursts, and artifact variants) |
 | Modify source observation coverage | `observation_profiles.yaml` | Scenario `observation_profile` selects the named profile; keep `complete` as the default training profile |
 | Modify causal/source timing | `timing_profiles.yaml` | (standalone — causal prerequisite, source latency, teardown, and Windows/Sysmon collision-spacing knobs) |
 | ~~Format definitions~~ | Not user-customizable | Engine internals — requires code changes |

diff --git a/commands/eforge/references/config-dependency-graph.md b/commands/eforge/references/config-dependency-graph.md
@@ -49,6 +49,14 @@ Each row is a file; columns show what it depends on and what depends on it.
 | depends on | nothing | Standalone rate table |
 | **depended on by** | Engine (runtime) | Drives all baseline traffic rate calculations (user activity, web top-level actions, DNS, SMB, Kerberos, LDAP, persona connections) |
 
+### host_activity_profiles.yaml
+| Direction | File | Relationship |
+|-----------|------|-------------|
+| depends on | scenario host metadata | Uses system type, roles, assigned users, primary systems, and user personas to resolve coarse activity multipliers |
+| depends on | `traffic_rates.yaml` | Multiplies resolved baseline rates after global intensity and scenario `baseline_activity.traffic_rates` overrides are applied |
+| **depended on by** | Engine (runtime) | Shapes host/persona/role baseline volume, endpoint noise, Linux/syslog shell activity, firewall deny bursts, IDS/ICMP rates, and encoded PowerShell artifact variation |
+| validated by | `eforge validate-config` | Enforces known rate-family names, ordered positive bounds, core host types, firewall deny burst settings, and artifact variant pools |
+
 ### web_session_profiles.yaml
 | Direction | File | Relationship |
 |-----------|------|-------------|

diff --git a/commands/eforge/references/config-host-activity.md b/commands/eforge/references/config-host-activity.md
@@ -15,9 +15,10 @@ Schema documentation for host-level activity config files. User customizations g
 5. [windows_auth_realism.yaml](#windows_auth_realismyaml)
 6. [auth_noise.yaml](#auth-noise-auth_noiseyaml)
 7. [endpoint_noise.yaml](#endpoint-noise-endpoint_noiseyaml)
-8. [observation_profiles.yaml](#observation-profiles-observation_profilesyaml)
-9. [timing_profiles.yaml](#timing_profilesyaml)
-10. [Domain Controller Baseline Activity](#domain-controller-baseline-activity)
+8. [host_activity_profiles.yaml](#host-activity-profiles-host_activity_profilesyaml)
+9. [observation_profiles.yaml](#observation-profiles-observation_profilesyaml)
+10. [timing_profiles.yaml](#timing_profilesyaml)
+11. [Domain Controller Baseline Activity](#domain-controller-baseline-activity)
 
 ---
 
@@ -350,6 +351,55 @@ registry_noise:
 
 ---
 
+## Host Activity Profiles (`host_activity_profiles.yaml`)
+
+Controls coarse host/persona/role volume multipliers for baseline realism. This layer is intentionally rate-family based rather than event-type based: it keeps scenario authors from managing per-emitter matrices while still making domain controllers, servers, workstations, sysadmins, developers, and exposed roles produce distinct volumes.
+
+```yaml
+rate_families:
+  default_bounds: [0.25, 6.0]
+  bounds:
+    windows_machine_auth: [0.5, 8.0]
+    firewall_deny: [0.4, 5.0]
+
+host_types:
+  workstation:
+    base_multiplier: 1.0
+    variance: [0.75, 1.35]
+    families:
+      inbound_network: 0.65
+  server:
+    base_multiplier: 1.8
+    variance: [0.85, 1.45]
+    families:
+      windows_service_process: 1.15
+  domain_controller:
+    base_multiplier: 4.0
+    variance: [0.9, 1.3]
+    families:
+      dc_kerberos: 1.5
+
+role_profiles:
+  web_server:
+    families:
+      inbound_network: 2.0
+      firewall_deny: 1.35
+
+persona_profiles:
+  sysadmin:
+    families:
+      linux_remote_admin: 1.45
+      windows_remote_admin: 1.35
+```
+
+Resolved multipliers apply after global intensity defaults and scenario `baseline_activity.traffic_rates` overrides. Use `traffic_rates.yaml` for global low/medium/high defaults; use `host_activity_profiles.yaml` when the rate should differ by host type, role, persona, or deterministic per-host variance.
+
+Valid rate families are: `user_activity`, `web`, `dns_interval`, `ntp`, `smb_interval`, `kerberos`, `ldap`, `persona_connections`, `role_network`, `inbound_network`, `windows_service_process`, `windows_registry`, `windows_scheduled_task`, `windows_remote_thread`, `windows_process_access`, `windows_module_load`, `windows_remote_admin`, `windows_service_logon`, `windows_machine_auth`, `dc_kerberos`, `linux_syslog`, `linux_remote_admin`, `linux_shell`, `firewall_deny`, `ids_alert`, and `icmp_monitoring`.
+
+`artifact_variants.powershell_encoded` provides data-driven benign encoded PowerShell payload templates and parameter pools. `firewall_deny` controls ASA deny burst windows, quiet periods, and mostly-zero metadata hash frequency. Run `eforge validate-config` after overlay changes; it rejects unknown rate-family names, missing core host types, inverted ranges, invalid probabilities, and empty artifact pools.
+
+---
+
 ## Observation Profiles (`observation_profiles.yaml`)
 
 Defines named source-observation profiles selected by scenario `observation_profile`. Keep `complete` as the default for training-friendly perfect source coverage and correlation. Use non-default profiles only when a scenario intentionally needs realistic source gaps or ingestion delays.