Merged
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -97,7 +97,7 @@ For details on the overlay system, manual editing, and cross-file dependencies,

EvidenceForge creates multi-format security log datasets from YAML scenario definitions. You describe an environment (users, systems, network topology) and a storyline (attack events), and EvidenceForge generates temporally consistent logs across all formats simultaneously — complete with cross-referenced LogonIDs, PIDs, timestamps, and UIDs.

Every attack scenario includes a `GROUND_TRUTH.md` file documenting exactly what happened, when, and where — making the datasets immediately usable for threat hunting training.
Every generated scenario includes a `GROUND_TRUTH.md` file. Attack scenarios document exactly what happened, when, and where, while baseline-only scenarios explicitly document that no malicious events were generated.
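
A minimal scenario sketch of the YAML input the README describes — every field name here is an illustrative assumption, not the actual EvidenceForge schema:

```yaml
# Hypothetical scenario sketch — field names are assumptions, not the real schema
name: lateral-movement-demo
environment:
  users:
    - {name: alice, role: admin}
  systems:
    - {hostname: dc01, os: windows}
    - {hostname: web01, os: linux}
storyline:
  - time: "2024-01-15T02:13:00Z"
    technique: credential_dumping
    actor: alice
    target: dc01
```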

### Key Capabilities

@@ -106,7 +106,7 @@ Every attack scenario includes a `GROUND_TRUTH.md` file documenting exactly what
- **Realistic baseline noise** — 26 lateral movement patterns, process→network correlation, network-level red herrings, and 18 Linux syslog categories create noise that analysts must work through
- **OS-aware generation** — Windows systems produce Windows Event + Sysmon logs; Linux systems produce syslog + bash history
- **Network visibility modeling** — Define sensor placement (SPAN/TAP), direction, and monitored segments
- **Ground truth documentation** — Every attack scenario generates a GROUND_TRUTH.md with narrative, timeline, and IOCs
- **Ground truth documentation** — Every run generates a GROUND_TRUTH.md; attack scenarios include narrative, timeline, and IOCs
- **Parallel generation** — Threaded emitters write all formats simultaneously with temporal consistency
- **Scenario validation** — Cross-reference checking, uniqueness constraints, and network topology validation
- **Data quality evaluation** — 5-dimension scoring framework (23 sub-scores) with acceptance criteria
2 changes: 1 addition & 1 deletion TODO.md
@@ -334,7 +334,7 @@ Verification is complete: dedicated `tests/unit/test_world_model.py` coverage wa
- [x] Security: cap firewall deny baseline amplification (`deny_ratio`/hourly deny volume) to prevent scenario-driven local DoS — `NetworkSensor.deny_ratio` now enforces `<= 50.0`.
- [x] Security: prevent IPv6 scenario DoS in DNS AAAA fallback (`_ipv4_to_fake_ipv6` no longer evaluates for IPv6 destination IPs; AAAA uses mapped IPv6 or preserves IPv6 literal).
- [x] Security: bounded/pruned ActivityGenerator DNS cache (60s prune cadence, 600s TTL-horizon eviction, 50k hard cap) to prevent unbounded memory growth from unique `(src_ip, hostname)` keys.
- [ ] `eforge generate --force` overwrite can fail for scenarios that do not emit `GROUND_TRUTH.md` — explicit-proxy smoke testing exposed that replacing an existing output directory expects staged ground truth even when fresh no-storyline generation produced only `data/`. Decide whether no-storyline generation should always write an empty `GROUND_TRUTH.md` or overwrite swap should tolerate its absence.
- [x] `eforge generate --force` overwrite can fail for scenarios that do not emit `GROUND_TRUTH.md` — fixed the root contract so every successful generation emits a matched `data/`, `GROUND_TRUTH.md`, and `OBSERVATION_MANIFEST.json` sidecar set, including baseline-only scenarios. The CLI swap stays strict and now requires staged data, ground truth, and observation manifest before replacing old output. Verification passed with focused engine/CLI/ground-truth/manifest tests, `eforge validate-config`, Ruff checks, and full normal `uv run pytest -v` (`3051 passed, 15 skipped`).

- [x] **`uv.lock` not committed** — gitignored, so CI `setup-uv@v4` cache fails. Remove from `.gitignore` and commit.
- [x] **`eforge validate` can't find personas in dev mode** — works when installed (`eforge validate`) but not via `uv run eforge validate`. Blocks dev workflow.
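
The bounded DNS-cache item above (prune cadence, TTL-horizon eviction, hard cap) can be sketched as a generic bounded TTL cache — a simplified model under assumed names, not the actual `ActivityGenerator` code:

```python
import time

# Sketch of a bounded TTL cache in the spirit of the DNS-cache fix above.
# Class and parameter names are assumptions, not the real implementation.
class BoundedTTLCache:
    def __init__(self, ttl=600.0, prune_interval=60.0, max_entries=50_000):
        self._ttl = ttl
        self._prune_interval = prune_interval
        self._max_entries = max_entries
        self._store = {}          # key -> (value, expiry_timestamp)
        self._last_prune = 0.0

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        # Prune expired entries on a fixed cadence, not on every call
        if now - self._last_prune >= self._prune_interval:
            self._prune(now)
        # Hard cap: evict the entry closest to expiry before inserting
        if len(self._store) >= self._max_entries and key not in self._store:
            oldest = min(self._store, key=lambda k: self._store[k][1])
            del self._store[oldest]
        self._store[key] = (value, now + self._ttl)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[1] <= now:
            return None
        return entry[0]

    def _prune(self, now):
        for k in [k for k, (_, exp) in self._store.items() if exp <= now]:
            del self._store[k]
        self._last_prune = now
```

Passing `now` explicitly keeps the cache testable; production callers would rely on the `time.monotonic()` default.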
7 changes: 4 additions & 3 deletions commands/eforge/generate.md
@@ -93,7 +93,8 @@ Generation writes log files to a `data/` subdirectory alongside the scenario fil
scenarios/<scenario-name>/
scenario.yaml ← input
ENVIRONMENT.md ← created by /eforge scenario
GROUND_TRUTH.md ← generated (answer key)
GROUND_TRUTH.md ← generated answer key (empty for benign baseline-only runs)
OBSERVATION_MANIFEST.json ← generated source-observation sidecar
data/ ← generated log files
windows/
security.xml
@@ -104,14 +104,14 @@ scenarios/<scenario-name>/
...
```

If `data/`, `GROUND_TRUTH.md`, or `ENVIRONMENT.md` already exist, the CLI prompts before overwriting. Use `--force` to skip the prompt (for automation / AI use).
If generated output (`data/`, `GROUND_TRUTH.md`, or `OBSERVATION_MANIFEST.json`) already exists, the CLI prompts before overwriting. Use `--force` to skip the prompt (for automation / AI use). `ENVIRONMENT.md` is scenario-authored and is preserved.

### 3. Post-Generation

After successful generation:
- List the generated files and their sizes
- Check that expected formats were produced
- If the scenario had a storyline, note that `GROUND_TRUTH.md` was generated alongside the scenario file — this is the answer key containing the full attack timeline and IOCs
- Note that `GROUND_TRUTH.md` and `OBSERVATION_MANIFEST.json` were generated alongside the scenario file. For baseline-only runs, `GROUND_TRUTH.md` explicitly says no malicious events were generated.
- `ENVIRONMENT.md` (created by `/eforge scenario`) is already in the same directory — no copying needed
- Note that the causal expansion engine auto-generates prerequisite events (DNS lookups before connections, Kerberos TGT/TGS before logons, audit events from command patterns, etc.) — these appear in the logs but are not explicitly listed in the scenario YAML
- Summarize the output for the user
3 changes: 2 additions & 1 deletion commands/eforge/references/evidence-formats.md
@@ -10,7 +10,8 @@ This document lists every evidence type EvidenceForge can generate, where to fin

```
output/
GROUND_TRUTH.md # Attack narrative, timeline, IOCs
GROUND_TRUTH.md # Ground truth sidecar; empty for baseline-only runs
OBSERVATION_MANIFEST.json # Source-observation sidecar for eval
ENVIRONMENT.md # Student-facing environment description (created by /eforge scenario skill)
<hostname.domain>/ # Per-host directories (FQDN)
windows_event_security.xml # Windows Security channel events
9 changes: 5 additions & 4 deletions docs/design/PRD.md
@@ -36,7 +36,7 @@ The tool addresses the need for realistic, large-volume training datasets withou
- Schema validation for scenario files (Pydantic-based)
- Cross-reference validation (users, systems, personas, groups referenced correctly)
- Evaluation framework with concrete metrics (format compliance, consistency, statistical properties)
- Ground truth documentation (GROUND_TRUTH.md) for scenarios with malicious activity
- Ground truth documentation (GROUND_TRUTH.md) for every generated scenario
- Network topology and sensor placement modeling for traffic visibility
- Persona-based temporal activity distribution with configurable work hours, intensity, and risk profiles
- Comprehensive test coverage (95%+) with pytest
@@ -154,7 +154,7 @@ eforge generate SCENARIO_FILE [--output DIR] [--verbose] [--debug]
9. Write to organized directory structure with incremental flushing (10K event buffer)
10. Show progress with Rich progress bars (per-hour baseline, per-event storyline)
11. Log details to `generation.log` in output directory
12. Generate GROUND_TRUTH.md when malicious/suspicious activities are present
12. Generate GROUND_TRUTH.md and OBSERVATION_MANIFEST.json sidecars

#### Workflow 6: Evaluate Output
```bash
@@ -430,7 +430,8 @@ Generated logs are written to a timestamped output directory:
output/
scenario-name-YYYYMMDD-HHMMSS/
generation.log # Detailed generation log
GROUND_TRUTH.md # Attack ground truth (if malicious activity present)
GROUND_TRUTH.md # Ground truth sidecar (empty for baseline-only scenarios)
OBSERVATION_MANIFEST.json # Source-observation sidecar
windows_events.xml # Windows Event Logs
zeek_conn.log # Zeek connection logs
ecar.json # ECAR events
@@ -442,7 +443,7 @@ output/

**GROUND_TRUTH.md Format**

When a scenario includes malicious or suspicious activities (not baseline-only scenarios), the generator creates a GROUND_TRUTH.md file documenting the attack for training and evaluation purposes.
Every successful generation creates a GROUND_TRUTH.md file. Attack/red-herring scenarios document the narrative, timeline, and IOCs for training and evaluation; baseline-only scenarios explicitly state that no malicious events were generated.

```markdown
# Ground Truth: [Scenario Name]
3 changes: 2 additions & 1 deletion docs/reference/EVIDENCE_FORMATS.md
@@ -10,7 +10,8 @@ This document lists every evidence type EvidenceForge can generate, where to fin

```
output/
GROUND_TRUTH.md # Attack narrative, timeline, IOCs
GROUND_TRUTH.md # Ground truth sidecar; empty for baseline-only runs
OBSERVATION_MANIFEST.json # Source-observation sidecar for eval
ENVIRONMENT.md # Student-facing environment description (created by /eforge scenario skill)
<hostname.domain>/ # Per-host directories (FQDN)
windows_event_security.xml # Windows Security channel events
10 changes: 7 additions & 3 deletions src/evidenceforge/cli/commands.py
@@ -278,7 +278,7 @@ def generate(
console.print(f"\n[bold]Data directory:[/bold] {data_dir}")
console.print(f"[bold]Ground truth:[/bold] {ground_truth_dir / 'GROUND_TRUTH.md'}")

# Check for existing generated output (data/ and GROUND_TRUTH.md only).
# Check for existing generated output (data/ and generated sidecars only).
# ENVIRONMENT.md is authored by /eforge scenario, not the engine — never touch it.
existing = []
if data_dir.exists():
@@ -387,15 +387,19 @@ def progress_callback(event_type: str, data: dict) -> None:

# Transactional swap: backup old → install new → cleanup backup.
# If any step fails (including KeyboardInterrupt), old output is
# restored from backup. data/ and GROUND_TRUTH.md are always kept
# as a matched pair — partial preservation is never valid.
# restored from backup. data/ and generated sidecars are always kept
# as a matched set — partial preservation is never valid.
if staging_dir:
staged_gt = gen_gt_dir / "GROUND_TRUTH.md"
staged_manifest = gen_gt_dir / OBSERVATION_MANIFEST_FILENAME
if not gen_data_dir.exists():
raise RuntimeError("Staged data/ directory missing after generation")
if not staged_gt.exists():
raise RuntimeError("Staged GROUND_TRUTH.md missing after generation")
if not staged_manifest.exists():
raise RuntimeError(
f"Staged {OBSERVATION_MANIFEST_FILENAME} missing after generation"
)

# Clean up stale rollback dirs from prior killed runs
for stale in ground_truth_dir.glob(".eforge_rollback_*"):
29 changes: 16 additions & 13 deletions src/evidenceforge/generation/engine/core.py
@@ -119,7 +119,7 @@ def generate(self) -> None:
2. Generate baseline activity (hour-by-hour iteration)
3. Execute storyline events (if present)
4. Finalize and close emitters
5. Generate GROUND_TRUTH.md (if malicious activity present)
5. Generate GROUND_TRUTH.md and OBSERVATION_MANIFEST.json sidecars
"""
logger.info(f"Starting generation for scenario: {self.scenario.name}")

@@ -185,17 +185,20 @@ def generate(self) -> None:
self._finalize()
self._report_progress("phase_end", {"phase": "finalize"})

# Phase 5: Generate ground truth (if malicious activity or red herrings present)
if self.malicious_events or self.red_herring_events:
logger.info(
f"Generating GROUND_TRUTH.md with {len(self.malicious_events)} malicious events"
)
self._report_progress(
"phase_start",
{"phase": "ground_truth", "description": "Generating ground truth documentation"},
)
self._generate_ground_truth()
self._report_progress("phase_end", {"phase": "ground_truth"})
# Phase 5: Generate sidecars for every successful run. Baseline-only
# datasets still need an empty GROUND_TRUTH.md so CLI overwrite swaps
# can keep data and metadata as a matched pair.
logger.info(
"Generating GROUND_TRUTH.md with %d malicious events and %d red herrings",
len(self.malicious_events),
len(self.red_herring_events),
)
self._report_progress(
"phase_start",
{"phase": "ground_truth", "description": "Generating ground truth documentation"},
)
self._generate_ground_truth()
self._report_progress("phase_end", {"phase": "ground_truth"})

logger.info("Generation complete")

@@ -464,7 +467,7 @@ def _finalize(self) -> None:
logger.info("All emitters closed")

def _generate_ground_truth(self) -> None:
"""Generate GROUND_TRUTH.md documentation."""
"""Generate GROUND_TRUTH.md and observation manifest sidecars."""
from evidenceforge.events.observation_manifest import (
OBSERVATION_MANIFEST_FILENAME,
write_observation_manifest,
10 changes: 5 additions & 5 deletions src/evidenceforge/generation/ground_truth.py
@@ -509,34 +509,34 @@ def _format_iocs(self, iocs: dict[str, set]) -> str:
Returns:
Formatted IOC sections (Markdown)
"""
if not iocs:
if not iocs or not any(values for values in iocs.values()):
return "*No IOCs extracted.*\n"

sections = []

# Network IOCs
if "network" in iocs:
if iocs.get("network"):
sections.append("### Network IOCs\n")
for ioc in sorted(iocs["network"]):
sections.append(f"- {ioc}")
sections.append("")

# Process IOCs
if "processes" in iocs:
if iocs.get("processes"):
sections.append("### Process IOCs\n")
for ioc in sorted(iocs["processes"]):
sections.append(f"- {ioc}")
sections.append("")

# User IOCs
if "users" in iocs:
if iocs.get("users"):
sections.append("### User IOCs\n")
for ioc in sorted(iocs["users"]):
sections.append(f"- {ioc} (compromised account)")
sections.append("")

# File IOCs
if "files" in iocs:
if iocs.get("files"):
sections.append("### File IOCs\n")
for ioc in sorted(iocs["files"]):
sections.append(f"- {ioc}")
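
The hunk above swaps key-membership tests (`"network" in iocs`) for truthiness checks (`iocs.get("network")`) because a key can be present while mapping to an empty set. A standalone illustration of the distinction, with the same guard pattern as the fixed code:

```python
# A key that exists but maps to an empty set passes `in` but fails truthiness.
iocs = {"network": set(), "users": {"svc_backup"}}

assert "network" in iocs          # True: key exists...
assert not iocs.get("network")    # ...but the set is empty, so no section should render

# Guard used in the fixed _format_iocs: render only when some bucket is non-empty.
def has_any_iocs(iocs: dict[str, set]) -> bool:
    return bool(iocs) and any(iocs.values())

assert has_any_iocs(iocs)                    # "users" is non-empty
assert not has_any_iocs({"network": set()})  # all buckets empty
assert not has_any_iocs({})                  # no buckets at all
```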