8 changes: 5 additions & 3 deletions TODO.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@

**Status:** Phase 8.5 (Dual src/dst HostContext) COMPLETE; Pre-MVP quality fixes ongoing
**Started:** 2026-03-11
-**Last Updated:** 2026-05-14
+**Last Updated:** 2026-05-15

See [CHANGELOG.md](CHANGELOG.md) for detailed development history of completed phases.

@@ -243,6 +243,8 @@ Replaced manual per-emitter field coordination with SecurityEvent intermediate r
- [x] **P1** Source identity and endpoint baseline realism sprint — completed TLS/X.509 issuer-compatible chain signatures, Sysmon Event 7 native third-party module identity, config-driven Windows scheduled-process timing, and DHCP registry emission policy tied to lease activity. Verified with `uv run eforge validate-config`, focused regressions, Ruff, normal pytest, and slow-inclusive pytest.
- [x] **P2** Endpoint/eCAR baseline variance follow-up — addressed through the host/activity profile realism layer. Host family, role, persona, and stable per-host multipliers now shape endpoint, process, registry, scheduled-task, syslog, bash, eCAR, Windows, Zeek, firewall, IDS, web, and proxy rates; config-driven encoded PowerShell variants and benign endpoint texture reduce repeated per-host artifacts. Verification passed with focused host-activity/config/ASA/baseline tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v`, and slow-inclusive `uv run pytest -v --include-slow --no-cov` (`3057 passed, 1 skipped`).
- [x] **Later architectural sprint: imperfect observation and source coverage** — implemented a training-friendly `complete` default plus overlay-compatible named observation profiles that apply deterministic source-level drop/delay/coverage semantics without modeling contradictions. The policy covers endpoint, network, proxy/web, firewall, IDS, Windows, Sysmon, Zeek, syslog, bash history, and eCAR source families, while ground truth preserves canonical truth and records source evidence status. Verification passed: focused observation/config/ground-truth tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v` (`3036 passed, 15 skipped`), and slow-inclusive `uv run pytest -v --include-slow` (`3050 passed, 1 skipped`).
- [x] Observation-aware automated eval and manifest — generation now writes `OBSERVATION_MANIFEST.json` beside ground truth, `eforge eval` loads it when present, coverage-style causality metrics report raw and observation-adjusted scores for expected non-visible evidence, and correctness/contradiction checks remain strict. Verification passed with config validation, Ruff checks/format checks, focused eval/manifest tests, and full normal `uv run pytest -v` (`3047 passed, 15 skipped`).
- [x] Post-host-activity score check — synced `dev`, cleaned up stale TODOs, regenerated/evaluated `scenarios/iteration-test` from the current iteration-test prompt with `enterprise_standard` observation, and ran one blind expert-panel review without entering another fix loop. Automated eval passed at `92.39` over `108,858` records; blind synthetic-confidence averaged `82.75`. Highest-leverage follow-ups are Linux SSH/syslog lifecycle ordering, Zeek observation-tree consistency, X.509 metadata coherence, Windows OS-build/local-SID identity, and static web asset manifests.
- [x] Full slow-suite regression cleanup after loop-65 merge — explicit-proxy storyline beacons now preserve authored hostname+destination IP pairs only when the storyline marks that pair as intentional, normal proxy-origin DNS resolution remains intact, and the parallel-generation LogonID assertion treats Type 7 unlock reuse as valid slice-of-time Windows behavior. Verified with targeted proxy/parallel tests, `uv run ruff check .`, `uv run ruff format --check .`, and `uv run pytest -v --include-slow` (`2875 passed, 23 skipped`).
- [x] Detection Engineer blind review completed for the regenerated Loop 61 dataset at `scenarios/iteration-test/data`; reviewer verdict: Synthetic, 63/100 confidence. Main findings: one PROXY-01 sshd accepted-login lifecycle gap/self-source artifact and Windows 4648 explicit-credential caller PID/image provenance ambiguity around `WS-MCHEN-01`.

@@ -279,7 +281,7 @@ Verification is complete: dedicated `tests/unit/test_world_model.py` coverage wa

- [x] **SUPERSEDED** Canonical emitter field provenance blind-review remaining findings from 78% synthetic review — superseded by later full-path storyline normalization, bash typo/path cleanup, proxy domain-class path/content profiles, and Sysmon follow-on ordering fixes. The still-current related work is now represented by web/session realism, imperfect observation/source coverage, and process lifecycle modeling TODOs.

-- [ ] Source-specific process lifecycle completeness modeling — deferred design item. Add a configurable telemetry coverage/profile layer that can model realistic Security/Sysmon/eCAR missingness, ingestion delay, audit-policy gaps, and endpoint coverage variance without ad hoc omissions in individual emitters. This should be part of the broader cross-source distribution realism layer, not a Windows-only workaround.
+- [x] **SUPERSEDED** Source-specific process lifecycle completeness modeling — the broad requirement is now covered by named observation profiles plus the host/activity profile layer. Observation profiles model deterministic source-family missingness/delay/coverage semantics for Security/Sysmon/eCAR and other sources, while host activity profiles add endpoint/source volume variance; the remaining narrower deployment-topology gap is tracked as configurable per-host/source log deployment coverage.

- [x] Open PR consolidation into `dev` — re-applied the storyline typing-cadence monotonicity fix from PR #81, folded Dependabot pytest/Pygments updates into the dev workflow, and added Dependabot configuration so future dependency PRs target `dev`.

@@ -601,7 +603,7 @@ Data works but experienced analysts spot tells. Grouped by format for efficient
- [x] **P2** Per-host-type event rate multiplier — implemented as implicit host/activity profile defaults rather than scenario YAML fields. Domain controllers, file servers, web servers, proxies, Linux servers, and workstations now receive role/family/persona-specific multipliers across baseline activity, auth, endpoint, network, and source-specific noise.
- [x] Configurable per-entity artifact variation — implemented in the host/activity profile layer for baseline artifact texture, including stable per-host encoded PowerShell variants and profile-owned endpoint activity scaling.
- [x] Configurable per-host volume variance — implemented via stable host/persona/role multipliers applied across major activity families so hosts no longer share narrow uniform volume bands by construction.
-- [ ] Configurable per-host/source log deployment coverage — observation profiles now support source-family gaps and host-scoped missingness multipliers, but explicit per-host source enablement/disablement remains future work. A later setting should model named host groups, disabled sensors, partial deployments, and collection windows when users need topology-level telemetry coverage differences rather than event-level missingness.
+- [ ] Configurable per-host/source log deployment coverage — observation profiles now support source-family gaps and host-scoped missingness multipliers, but explicit per-host source enablement/disablement remains future work. A later setting should model named host groups, disabled sensors, partial deployments, and collection windows when users need topology-level telemetry coverage differences rather than event-level missingness or host/activity volume variance.
- [ ] **P2** Generation speed and efficiency follow-up — Sprint 4 host/activity realism is functionally verified, but the slow-inclusive suite exposed that `pytest-cov` plus `tracemalloc` can make the medium dataset memory test pathological. A future sprint should profile generation without instrumentation noise, identify hot paths introduced by richer host activity/web fanout/firewall texture, and decide whether to optimize generation, mark the memory test `--no-cov`, or relax/update stale performance assertions.
- [x] DNS IP pool reuse causes cross-provider resolution (CloudFront→Microsoft IPs, etc.) — domain-first selection ensures consistent domain→IP mapping via FORWARD_DNS
- [x] AWS region mismatch between DNS PTR and SSL SNI for same IP — AWS hostname/PTR generation now derives a stable per-IP region/edge identity and PTR generation respects known forward hostname context.
2 changes: 1 addition & 1 deletion commands/eforge/config.md
@@ -71,7 +71,7 @@ When writing to the overlay, files are partial — they contain ONLY the user's
| Modify baseline auth noise | `auth_noise.yaml` | (standalone — stale scheduled-credential accounts and irregular recurrence timing) |
| Modify endpoint background noise | `endpoint_noise.yaml` | (standalone — scheduled-process timing and DHCP registry emission policy) |
| Modify host activity distribution | `host_activity_profiles.yaml` | (standalone — host/persona/role rate-family multipliers, firewall deny bursts, and artifact variants) |
-| Modify source observation coverage | `observation_profiles.yaml` | Scenario `observation_profile` selects the named profile; keep `complete` as the default training profile |
+| Modify source observation coverage | `observation_profiles.yaml` | Scenario `observation_profile` selects the named profile; generated `OBSERVATION_MANIFEST.json` lets eval account for expected gaps; keep `complete` as the default training profile |
| Modify causal/source timing | `timing_profiles.yaml` | (standalone — causal prerequisite, source latency, teardown, and Windows/Sysmon collision-spacing knobs) |
| ~~Format definitions~~ | Not user-customizable | Engine internals — requires code changes |
| ~~Evaluation rules~~ | Not user-customizable | Must match format definitions — requires code changes |
15 changes: 11 additions & 4 deletions commands/eforge/evaluate.md
@@ -36,6 +36,7 @@ scenarios/<scenario-name>/
scenario.yaml
ENVIRONMENT.md
GROUND_TRUTH.md
OBSERVATION_MANIFEST.json ← optional, generated for source-observation-aware eval
data/ ← this is the output_dir for eforge eval
```

@@ -65,6 +66,12 @@ Present a clear summary of the evaluation results. The report shows two tiers fo
- **Minimum** (hard gate): must pass or the dataset fails overall
- **Aspirational** (informational): a stretch target; failure here is noted but does not fail the dataset

If the scenario uses `observation_profile` other than `complete`, check whether the report says
the observation manifest was loaded. With a manifest, coverage-style causality sub-scores may be
adjusted for expected source gaps and will show a `raw` score when the adjusted score differs.
Do not describe this as a lowered threshold: visible contradictions, parseability failures,
source-native field mismatches, and missing evidence that the manifest marks `visible` or
`delayed` all remain real failures.

For each pillar, explain what the score means in practical terms:

**Pillar 1: Parseability (weight 0.30)**
@@ -81,11 +88,11 @@ For each pillar, explain what the score means in practical terms:

**Pillar 3: Causality (weight 0.25)**
- Causal Ordering: Are logon→process→logoff sequences correctly ordered? DNS before TCP? Kerberos TGT/TGS before domain logons?
-- Storyline Event Presence: Are all storyline events visible in at least one log source?
+- Storyline Event Presence: Are all expected-visible storyline events visible in at least one log source? For non-`complete` observation profiles with a manifest, source rows marked `dropped`, `filtered`, or `out_of_window` are excluded from this coverage denominator.
- Indicator Accuracy: Do traces carry the correct IPs, usernames, hostnames from the scenario?
-- Pivot Linkability: Can a hunter pivot between consecutive attack steps using shared field values?
-- Storyline Temporal Integrity: Are attack events in the right relative order at the right times?
-- Storyline Trace Coverage: For each expected log format on each involved host, does the storyline leave a trace?
+- Pivot Linkability: Can a hunter pivot between consecutive expected-visible attack steps using shared field values?
+- Storyline Temporal Integrity: Are expected-visible attack events in the right relative order at the right times?
+- Storyline Trace Coverage: For each expected-visible log format group on each involved host, does the storyline leave a trace?

**Pillar 4: Timing (weight 0.20)**
- Attack-Chain Timing: Do elapsed times between consecutive storyline steps fall within plausible bounds? Bounds come from `timing_bounds.yaml` — default 5s–2h, with per-action-type overrides (e.g., lateral movement: 30s–1h, exfiltration: 60s–24h). First matching keyword in the step activity wins.
2 changes: 1 addition & 1 deletion commands/eforge/references/config-dependency-graph.md
@@ -170,7 +170,7 @@ Each row is a file; columns show what it depends on and what depends on it.
| Direction | File | Relationship |
|-----------|------|-------------|
| depends on | scenario `observation_profile` | The scenario selects a named profile; the profile file owns source-level missingness/delay values |
-| **depended on by** | Event dispatcher, GROUND_TRUTH.md | Applies deterministic source-observation drops/delays after canonical state updates and reports source evidence status |
+| **depended on by** | Event dispatcher, GROUND_TRUTH.md, OBSERVATION_MANIFEST.json, `eforge eval` | Applies deterministic source-observation drops/delays after canonical state updates, reports source evidence status, and lets eval distinguish expected gaps from missing visible evidence |
| validated by | `eforge validate-config` and `eforge validate` | Config validation checks source-family names/ranges; scenario validation checks that the named profile exists |

### network_params.yaml
9 changes: 9 additions & 0 deletions commands/eforge/references/config-evaluation.md
@@ -21,6 +21,15 @@ Schema documentation for data quality evaluation rule files in `src/evidenceforg

Controls the two-tier acceptance model for `eforge eval`. Each sub-score has a **minimum** (hard gate: dataset fails if below) and an **aspirational** target (informational stretch goal). Pillar weights must sum to 1.0.

When a generated dataset includes `OBSERVATION_MANIFEST.json` beside `GROUND_TRUTH.md`,
`eforge eval` automatically applies observation-aware coverage scoring. Non-`complete`
profiles can adjust only coverage-style causality sub-scores (`event_presence`,
`pivot_linkability`, `temporal_integrity`, and `storyline_trace_coverage`) by excluding
evidence that the manifest marks `dropped`, `filtered`, or `out_of_window`. Source-native
correctness gates such as parseability, value plausibility, field agreement, and visible causal
ordering remain strict. Adjusted sub-scores expose `raw_score` in JSON and show `raw:<score>` in
the text report.
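
As a sketch of what an adjusted sub-score might look like in the JSON report — only `raw_score` is documented above; the other field names here are illustrative assumptions, not the verbatim report schema:

```json
{
  "sub_score": "event_presence",
  "score": 0.96,
  "raw_score": 0.88,
  "observation_adjusted": true
}
```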

### Structure

```yaml
5 changes: 5 additions & 0 deletions commands/eforge/references/config-host-activity.md
@@ -430,6 +430,11 @@ profiles:

Profiles are intentionally source-level, not event-type matrices. Scenario authors select a named profile; code owns safe source-native application semantics so new event types inherit their source-family default. Non-complete profiles may make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but must not create contradictory identifiers or field values across sources.

Generation writes `OBSERVATION_MANIFEST.json` beside `GROUND_TRUTH.md`. `eforge eval` uses this
sidecar to adjust only coverage-style causality scoring for expected missing evidence under
non-`complete` profiles. The raw score remains visible in the report, and source-native
correctness checks are not relaxed.

Valid source families are `windows_security`, `sysmon`, `ecar`, `syslog`, `bash_history`, `zeek`, `proxy`, `web`, `asa`, and `ids`. Run `eforge validate-config` after overlay changes; it rejects unknown source-family names, invalid probabilities, and inverted ranges. Run `eforge validate` on scenarios that use a non-default profile so unknown profile names are caught before generation.
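
A minimal overlay sketch under stated assumptions — the profile name, key names, and values below are illustrative, not the shipped schema; only the source-family names and the probability/range validation behavior are documented above:

```yaml
# Hypothetical observation-profile overlay — key names and values are illustrative
profiles:
  branch_office:
    sysmon:
      drop_probability: 0.10     # must be a valid probability or validate-config rejects it
      delay_seconds: [5, 120]    # ranges must not be inverted
    bash_history:
      drop_probability: 0.25
```

Run `eforge validate-config` after editing such an overlay, as described above.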

---
3 changes: 2 additions & 1 deletion commands/eforge/references/scenario-reference.md
@@ -405,7 +405,8 @@ training-friendly perfect source coverage and correlation. Non-default profiles
deterministic source-level missingness and source-native delays while preserving canonical truth:
they can make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but they
must not create contradictory users, PIDs, ports, hashes, UIDs, or session identifiers across
-sources. `GROUND_TRUTH.md` records source evidence status when a non-complete profile is used.
+sources. `GROUND_TRUTH.md` records source evidence status for instructors, and
+`OBSERVATION_MANIFEST.json` records the same source-observation contract for automated eval.
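
A scenario fragment selecting a named profile might look like this sketch — the `enterprise_standard` profile name appears elsewhere in this project, but treat it as an example rather than a guaranteed built-in:

```yaml
# scenario.yaml fragment — `complete` is the default training profile
observation_profile: enterprise_standard
```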

## Storyline

6 changes: 6 additions & 0 deletions docs/design/data-quality-prd.md
@@ -339,6 +339,12 @@ Every sub-score now has:

Thresholds are stored in `src/evidenceforge/config/evaluation/thresholds.yaml` for tuning without code changes. Calibration against purpose-built scenarios is deferred to a separate pass.

Datasets generated with non-`complete` observation profiles include `OBSERVATION_MANIFEST.json`.
When present, eval uses it to adjust coverage-style causality sub-scores for evidence that was
intentionally `dropped`, `filtered`, or `out_of_window`. Hard correctness gates remain strict:
observation profiles do not excuse parse failures, impossible values, source-native contradictions,
or evidence marked `visible`/`delayed` but missing from logs.
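
As a hedged sketch of the two-tier shape — the key names are assumed, not the verbatim `thresholds.yaml` schema, and the weight echoes the causality pillar weight cited elsewhere in this document:

```yaml
# Hypothetical thresholds.yaml fragment — structure illustrative
causality:
  weight: 0.25
  sub_scores:
    event_presence:
      minimum: 0.80       # hard gate: dataset fails below this
      aspirational: 0.95  # informational stretch target
```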

### Calibration Plan

Thresholds are currently judgment-based. After the restructure is stable, the plan is to design purpose-built calibration scenarios (known-good and known-bad), run `eforge eval` against them, and use the results to propose empirically grounded threshold values. Out of scope for v0.5.1.
8 changes: 8 additions & 0 deletions docs/reference/CUSTOMIZING_CONFIG.md
@@ -193,6 +193,14 @@ The `eforge eval` scoring rules are also YAML-based and can be tuned per-project

All eval config files live in `src/evidenceforge/config/evaluation/`. They are **not** overlaid from `.eforge/config/` — edit them in-place if you want project-specific tuning, or copy the package files into your project and set the `EFORGE_EVAL_CONFIG_DIR` environment variable to point to your copies.
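
A minimal sketch of the copy-and-point workflow described above — the source path is taken from this document, the destination directory name is arbitrary, and the `|| true` guard only keeps the sketch runnable outside a real checkout:

```shell
# Copy the packaged eval config into a project-local directory.
mkdir -p eval-config
cp -r src/evidenceforge/config/evaluation/. eval-config/ 2>/dev/null || true
# Point eforge eval at the project-local copy for this shell session.
export EFORGE_EVAL_CONFIG_DIR="$PWD/eval-config"
```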

Generated scenario directories may also include `OBSERVATION_MANIFEST.json` beside
`GROUND_TRUTH.md`. `eforge eval` loads this sidecar automatically when present. For
non-`complete` observation profiles, causality coverage metrics use the manifest to exclude
source evidence that was intentionally `dropped`, `filtered`, or `out_of_window`, while still
failing visible contradictions, parse errors, value mismatches, and missing evidence that the
manifest marks `visible` or `delayed`. Text and JSON reports keep the adjusted score and expose
the raw score for affected sub-scores.

For full schema documentation for each file, see the skill reference: `/eforge:references:config-evaluation`.

## Reference Documentation
3 changes: 2 additions & 1 deletion docs/reference/scenario-reference.md
@@ -405,7 +405,8 @@ training-friendly perfect source coverage and correlation. Non-default profiles
deterministic source-level missingness and source-native delays while preserving canonical truth:
they can make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but they
must not create contradictory users, PIDs, ports, hashes, UIDs, or session identifiers across
-sources. `GROUND_TRUTH.md` records source evidence status when a non-complete profile is used.
+sources. `GROUND_TRUTH.md` records source evidence status for instructors, and
+`OBSERVATION_MANIFEST.json` records the same source-observation contract for automated eval.

## Storyline
