8 changes: 5 additions & 3 deletions TODO.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@

**Status:** Phase 8.5 (Dual src/dst HostContext) COMPLETE; Pre-MVP quality fixes ongoing
**Started:** 2026-03-11
-**Last Updated:** 2026-05-14
+**Last Updated:** 2026-05-15

See [CHANGELOG.md](CHANGELOG.md) for detailed development history of completed phases.

@@ -243,6 +243,8 @@ Replaced manual per-emitter field coordination with SecurityEvent intermediate r
- [x] **P1** Source identity and endpoint baseline realism sprint — completed TLS/X.509 issuer-compatible chain signatures, Sysmon Event 7 native third-party module identity, config-driven Windows scheduled-process timing, and DHCP registry emission policy tied to lease activity. Verified with `uv run eforge validate-config`, focused regressions, Ruff, normal pytest, and slow-inclusive pytest.
- [x] **P2** Endpoint/eCAR baseline variance follow-up — addressed through the host/activity profile realism layer. Host family, role, persona, and stable per-host multipliers now shape endpoint, process, registry, scheduled-task, syslog, bash, eCAR, Windows, Zeek, firewall, IDS, web, and proxy rates; config-driven encoded PowerShell variants and benign endpoint texture reduce repeated per-host artifacts. Verification passed with focused host-activity/config/ASA/baseline tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v`, and slow-inclusive `uv run pytest -v --include-slow --no-cov` (`3057 passed, 1 skipped`).
- [x] **Later architectural sprint: imperfect observation and source coverage** — implemented a training-friendly `complete` default plus overlay-compatible named observation profiles that apply deterministic source-level drop/delay/coverage semantics without modeling contradictions. The policy covers endpoint, network, proxy/web, firewall, IDS, Windows, Sysmon, Zeek, syslog, bash history, and eCAR source families, while ground truth preserves canonical truth and records source evidence status. Verification passed: focused observation/config/ground-truth tests, `uv run eforge validate-config`, Ruff checks/format checks, full normal `uv run pytest -v` (`3036 passed, 15 skipped`), and slow-inclusive `uv run pytest -v --include-slow` (`3050 passed, 1 skipped`).
- [x] Observation-aware automated eval and manifest — generation now writes `OBSERVATION_MANIFEST.json` beside ground truth, `eforge eval` loads it when present, coverage-style causality metrics report raw and observation-adjusted scores for expected non-visible evidence, and correctness/contradiction checks remain strict. Verification passed with config validation, Ruff checks/format checks, focused eval/manifest tests, and full normal `uv run pytest -v` (`3047 passed, 15 skipped`).
- [x] Post-host-activity score check — synced `dev`, cleaned up stale TODOs, regenerated/evaluated `scenarios/iteration-test` from the current iteration-test prompt with `enterprise_standard` observation, and ran one blind expert-panel review without entering another fix loop. Automated eval passed at `92.39` over `108,858` records; blind synthetic-confidence averaged `82.75`. Highest-leverage follow-ups are Linux SSH/syslog lifecycle ordering, Zeek observation-tree consistency, X.509 metadata coherence, Windows OS-build/local-SID identity, and static web asset manifests.
- [x] Full slow-suite regression cleanup after loop-65 merge — explicit-proxy storyline beacons now preserve authored hostname+destination IP pairs only when the storyline marks that pair as intentional, normal proxy-origin DNS resolution remains intact, and the parallel-generation LogonID assertion treats Type 7 unlock reuse as valid slice-of-time Windows behavior. Verified with targeted proxy/parallel tests, `uv run ruff check .`, `uv run ruff format --check .`, and `uv run pytest -v --include-slow` (`2875 passed, 23 skipped`).
- [x] Detection Engineer blind review completed for the regenerated Loop 61 dataset at `scenarios/iteration-test/data`; reviewer verdict: Synthetic, 63/100 confidence. Main findings: one PROXY-01 sshd accepted-login lifecycle gap/self-source artifact and Windows 4648 explicit-credential caller PID/image provenance ambiguity around `WS-MCHEN-01`.

@@ -279,7 +281,7 @@ Verification is complete: dedicated `tests/unit/test_world_model.py` coverage wa

- [x] **SUPERSEDED** Canonical emitter field provenance blind-review remaining findings from 78% synthetic review — superseded by later full-path storyline normalization, bash typo/path cleanup, proxy domain-class path/content profiles, and Sysmon follow-on ordering fixes. The still-current related work is now represented by web/session realism, imperfect observation/source coverage, and process lifecycle modeling TODOs.

-- [ ] Source-specific process lifecycle completeness modeling — deferred design item. Add a configurable telemetry coverage/profile layer that can model realistic Security/Sysmon/eCAR missingness, ingestion delay, audit-policy gaps, and endpoint coverage variance without ad hoc omissions in individual emitters. This should be part of the broader cross-source distribution realism layer, not a Windows-only workaround.
+- [x] **SUPERSEDED** Source-specific process lifecycle completeness modeling — the broad requirement is now covered by named observation profiles plus the host/activity profile layer. Observation profiles model deterministic source-family missingness/delay/coverage semantics for Security/Sysmon/eCAR and other sources, while host activity profiles add endpoint/source volume variance; the remaining narrower deployment-topology gap is tracked as configurable per-host/source log deployment coverage.

- [x] Open PR consolidation into `dev` — re-applied the storyline typing-cadence monotonicity fix from PR #81, folded Dependabot pytest/Pygments updates into the dev workflow, and added Dependabot configuration so future dependency PRs target `dev`.

@@ -601,7 +603,7 @@ Data works but experienced analysts spot tells. Grouped by format for efficient
- [x] **P2** Per-host-type event rate multiplier — implemented as implicit host/activity profile defaults rather than scenario YAML fields. Domain controllers, file servers, web servers, proxies, Linux servers, and workstations now receive role/family/persona-specific multipliers across baseline activity, auth, endpoint, network, and source-specific noise.
- [x] Configurable per-entity artifact variation — implemented in the host/activity profile layer for baseline artifact texture, including stable per-host encoded PowerShell variants and profile-owned endpoint activity scaling.
- [x] Configurable per-host volume variance — implemented via stable host/persona/role multipliers applied across major activity families so hosts no longer share narrow uniform volume bands by construction.
-- [ ] Configurable per-host/source log deployment coverage — observation profiles now support source-family gaps and host-scoped missingness multipliers, but explicit per-host source enablement/disablement remains future work. A later setting should model named host groups, disabled sensors, partial deployments, and collection windows when users need topology-level telemetry coverage differences rather than event-level missingness.
+- [ ] Configurable per-host/source log deployment coverage — observation profiles now support source-family gaps and host-scoped missingness multipliers, but explicit per-host source enablement/disablement remains future work. A later setting should model named host groups, disabled sensors, partial deployments, and collection windows when users need topology-level telemetry coverage differences rather than event-level missingness or host/activity volume variance.
- [ ] **P2** Generation speed and efficiency follow-up — Sprint 4 host/activity realism is functionally verified, but the slow-inclusive suite exposed that `pytest-cov` plus `tracemalloc` can make the medium dataset memory test pathological. A future sprint should profile generation without instrumentation noise, identify hot paths introduced by richer host activity/web fanout/firewall texture, and decide whether to optimize generation, mark the memory test `--no-cov`, or relax/update stale performance assertions.
- [x] DNS IP pool reuse causes cross-provider resolution (CloudFront→Microsoft IPs, etc.) — domain-first selection ensures consistent domain→IP mapping via FORWARD_DNS
- [x] AWS region mismatch between DNS PTR and SSL SNI for same IP — AWS hostname/PTR generation now derives a stable per-IP region/edge identity and PTR generation respects known forward hostname context.
2 changes: 1 addition & 1 deletion commands/eforge/config.md
@@ -71,7 +71,7 @@ When writing to the overlay, files are partial — they contain ONLY the user's
| Modify baseline auth noise | `auth_noise.yaml` | (standalone — stale scheduled-credential accounts and irregular recurrence timing) |
| Modify endpoint background noise | `endpoint_noise.yaml` | (standalone — scheduled-process timing and DHCP registry emission policy) |
| Modify host activity distribution | `host_activity_profiles.yaml` | (standalone — host/persona/role rate-family multipliers, firewall deny bursts, and artifact variants) |
-| Modify source observation coverage | `observation_profiles.yaml` | Scenario `observation_profile` selects the named profile; keep `complete` as the default training profile |
+| Modify source observation coverage | `observation_profiles.yaml` | Scenario `observation_profile` selects the named profile; generated `OBSERVATION_MANIFEST.json` lets eval account for expected gaps; keep `complete` as the default training profile |
| Modify causal/source timing | `timing_profiles.yaml` | (standalone — causal prerequisite, source latency, teardown, and Windows/Sysmon collision-spacing knobs) |
| ~~Format definitions~~ | Not user-customizable | Engine internals — requires code changes |
| ~~Evaluation rules~~ | Not user-customizable | Must match format definitions — requires code changes |
15 changes: 11 additions & 4 deletions commands/eforge/evaluate.md
@@ -36,6 +36,7 @@ scenarios/<scenario-name>/
scenario.yaml
ENVIRONMENT.md
GROUND_TRUTH.md
OBSERVATION_MANIFEST.json ← optional, generated for source-observation-aware eval
data/ ← this is the output_dir for eforge eval
```

@@ -65,6 +66,12 @@ Present a clear summary of the evaluation results. The report shows two tiers fo
- **Minimum** (hard gate): must pass or the dataset fails overall
- **Aspirational** (informational): a stretch target; failure here is noted but does not fail the dataset

If the scenario uses `observation_profile` other than `complete`, check whether the report says
the observation manifest was loaded. With a manifest, coverage-style causality sub-scores may be
adjusted for expected source gaps and will show a `raw` score when the adjusted score differs.
Do not describe this as a lowered threshold: visible contradictions, parseability failures,
source-native field mismatches, and missing evidence that the manifest marks `visible` or
`delayed` all remain real failures.

For each pillar, explain what the score means in practical terms:

**Pillar 1: Parseability (weight 0.30)**
@@ -81,11 +88,11 @@ For each pillar, explain what the score means in practical terms:

**Pillar 3: Causality (weight 0.25)**
- Causal Ordering: Are logon→process→logoff sequences correctly ordered? DNS before TCP? Kerberos TGT/TGS before domain logons?
-- Storyline Event Presence: Are all storyline events visible in at least one log source?
+- Storyline Event Presence: Are all expected-visible storyline events visible in at least one log source? For non-`complete` observation profiles with a manifest, source rows marked `dropped`, `filtered`, or `out_of_window` are excluded from this coverage denominator.
- Indicator Accuracy: Do traces carry the correct IPs, usernames, hostnames from the scenario?
-- Pivot Linkability: Can a hunter pivot between consecutive attack steps using shared field values?
-- Storyline Temporal Integrity: Are attack events in the right relative order at the right times?
-- Storyline Trace Coverage: For each expected log format on each involved host, does the storyline leave a trace?
+- Pivot Linkability: Can a hunter pivot between consecutive expected-visible attack steps using shared field values?
+- Storyline Temporal Integrity: Are expected-visible attack events in the right relative order at the right times?
+- Storyline Trace Coverage: For each expected-visible log format group on each involved host, does the storyline leave a trace?

**Pillar 4: Timing (weight 0.20)**
- Attack-Chain Timing: Do elapsed times between consecutive storyline steps fall within plausible bounds? Bounds come from `timing_bounds.yaml` — default 5s–2h, with per-action-type overrides (e.g., lateral movement: 30s–1h, exfiltration: 60s–24h). First matching keyword in the step activity wins.
2 changes: 1 addition & 1 deletion commands/eforge/references/config-dependency-graph.md
@@ -170,7 +170,7 @@ Each row is a file; columns show what it depends on and what depends on it.
| Direction | File | Relationship |
|-----------|------|-------------|
| depends on | scenario `observation_profile` | The scenario selects a named profile; the profile file owns source-level missingness/delay values |
-| **depended on by** | Event dispatcher, GROUND_TRUTH.md | Applies deterministic source-observation drops/delays after canonical state updates and reports source evidence status |
+| **depended on by** | Event dispatcher, GROUND_TRUTH.md, OBSERVATION_MANIFEST.json, `eforge eval` | Applies deterministic source-observation drops/delays after canonical state updates, reports source evidence status, and lets eval distinguish expected gaps from missing visible evidence |
| validated by | `eforge validate-config` and `eforge validate` | Config validation checks source-family names/ranges; scenario validation checks that the named profile exists |

### network_params.yaml
9 changes: 9 additions & 0 deletions commands/eforge/references/config-evaluation.md
@@ -21,6 +21,15 @@ Schema documentation for data quality evaluation rule files in `src/evidenceforg

Controls the two-tier acceptance model for `eforge eval`. Each sub-score has a **minimum** (hard gate: dataset fails if below) and an **aspirational** target (informational stretch goal). Pillar weights must sum to 1.0.

When a generated dataset includes `OBSERVATION_MANIFEST.json` beside `GROUND_TRUTH.md`,
`eforge eval` automatically applies observation-aware coverage scoring. Non-`complete`
profiles can adjust only coverage-style causality sub-scores (`event_presence`,
`pivot_linkability`, `temporal_integrity`, and `storyline_trace_coverage`) by excluding
evidence that the manifest marks `dropped`, `filtered`, or `out_of_window`. Source-native
correctness gates such as parseability, value plausibility, field agreement, and visible causal
ordering remain strict. Adjusted sub-scores expose `raw_score` in JSON and show `raw:<score>` in
the text report.
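
As a sketch of what an adjusted sub-score might look like in the JSON report — only `raw_score` is documented above; the other field names here are illustrative assumptions, not the verbatim report schema:

```json
{
  "sub_score": "event_presence",
  "score": 0.96,
  "raw_score": 0.88,
  "observation_adjusted": true
}
```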

### Structure

```yaml
5 changes: 5 additions & 0 deletions commands/eforge/references/config-host-activity.md
@@ -430,6 +430,11 @@ profiles:

Profiles are intentionally source-level, not event-type matrices. Scenario authors select a named profile; code owns safe source-native application semantics so new event types inherit their source-family default. Non-complete profiles may make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but must not create contradictory identifiers or field values across sources.

Generation writes `OBSERVATION_MANIFEST.json` beside `GROUND_TRUTH.md`. `eforge eval` uses this
sidecar to adjust only coverage-style causality scoring for expected missing evidence under
non-`complete` profiles. The raw score remains visible in the report, and source-native
correctness checks are not relaxed.

Valid source families are `windows_security`, `sysmon`, `ecar`, `syslog`, `bash_history`, `zeek`, `proxy`, `web`, `asa`, and `ids`. Run `eforge validate-config` after overlay changes; it rejects unknown source-family names, invalid probabilities, and inverted ranges. Run `eforge validate` on scenarios that use a non-default profile so unknown profile names are caught before generation.
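
A minimal overlay sketch under stated assumptions — the profile name, key names, and values below are illustrative, not the shipped schema; only the source-family names and the probability/range validation behavior are documented above:

```yaml
# Hypothetical observation-profile overlay — key names and values are illustrative
profiles:
  branch_office:
    sysmon:
      drop_probability: 0.10     # must be a valid probability or validate-config rejects it
      delay_seconds: [5, 120]    # ranges must not be inverted
    bash_history:
      drop_probability: 0.25
```

Run `eforge validate-config` after editing such an overlay, as described above.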

---
3 changes: 2 additions & 1 deletion commands/eforge/references/scenario-reference.md
@@ -405,7 +405,8 @@ training-friendly perfect source coverage and correlation. Non-default profiles
deterministic source-level missingness and source-native delays while preserving canonical truth:
they can make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but they
must not create contradictory users, PIDs, ports, hashes, UIDs, or session identifiers across
-sources. `GROUND_TRUTH.md` records source evidence status when a non-complete profile is used.
+sources. `GROUND_TRUTH.md` records source evidence status for instructors, and
+`OBSERVATION_MANIFEST.json` records the same source-observation contract for automated eval.
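
A scenario fragment selecting a named profile might look like this sketch — the `enterprise_standard` profile name appears elsewhere in this project, but treat it as an example rather than a guaranteed built-in:

```yaml
# scenario.yaml fragment — `complete` is the default training profile
observation_profile: enterprise_standard
```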

## Storyline

6 changes: 6 additions & 0 deletions docs/design/data-quality-prd.md
@@ -339,6 +339,12 @@ Every sub-score now has:

Thresholds are stored in `src/evidenceforge/config/evaluation/thresholds.yaml` for tuning without code changes. Calibration against purpose-built scenarios is deferred to a separate pass.

Datasets generated with non-`complete` observation profiles include `OBSERVATION_MANIFEST.json`.
When present, eval uses it to adjust coverage-style causality sub-scores for evidence that was
intentionally `dropped`, `filtered`, or `out_of_window`. Hard correctness gates remain strict:
observation profiles do not excuse parse failures, impossible values, source-native contradictions,
or evidence marked `visible`/`delayed` but missing from logs.
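
As a hedged sketch of the two-tier shape — the key names are assumed, not the verbatim `thresholds.yaml` schema, and the weight echoes the causality pillar weight cited elsewhere in this document:

```yaml
# Hypothetical thresholds.yaml fragment — structure illustrative
causality:
  weight: 0.25
  sub_scores:
    event_presence:
      minimum: 0.80       # hard gate: dataset fails below this
      aspirational: 0.95  # informational stretch target
```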

### Calibration Plan

Thresholds are currently judgment-based. After the restructure is stable, the plan is to design purpose-built calibration scenarios (known-good and known-bad), run `eforge eval` against them, and use the results to propose empirically grounded threshold values. Out of scope for v0.5.1.
8 changes: 8 additions & 0 deletions docs/reference/CUSTOMIZING_CONFIG.md
@@ -193,6 +193,14 @@ The `eforge eval` scoring rules are also YAML-based and can be tuned per-project

All eval config files live in `src/evidenceforge/config/evaluation/`. They are **not** overlaid from `.eforge/config/` — edit them in-place if you want project-specific tuning, or copy the package files into your project and set the `EFORGE_EVAL_CONFIG_DIR` environment variable to point to your copies.
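
A minimal sketch of the copy-and-point workflow described above — the source path is taken from this document, the destination directory name is arbitrary, and the `|| true` guard only keeps the sketch runnable outside a real checkout:

```shell
# Copy the packaged eval config into a project-local directory.
mkdir -p eval-config
cp -r src/evidenceforge/config/evaluation/. eval-config/ 2>/dev/null || true
# Point eforge eval at the project-local copy for this shell session.
export EFORGE_EVAL_CONFIG_DIR="$PWD/eval-config"
```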

Generated scenario directories may also include `OBSERVATION_MANIFEST.json` beside
`GROUND_TRUTH.md`. `eforge eval` loads this sidecar automatically when present. For
non-`complete` observation profiles, causality coverage metrics use the manifest to exclude
source evidence that was intentionally `dropped`, `filtered`, or `out_of_window`, while still
failing visible contradictions, parse errors, value mismatches, and missing evidence that the
manifest marks `visible` or `delayed`. Text and JSON reports keep the adjusted score and expose
the raw score for affected sub-scores.

For full schema documentation for each file, see the skill reference: `/eforge:references:config-evaluation`.

## Reference Documentation
3 changes: 2 additions & 1 deletion docs/reference/scenario-reference.md
@@ -405,7 +405,8 @@ training-friendly perfect source coverage and correlation. Non-default profiles
deterministic source-level missingness and source-native delays while preserving canonical truth:
they can make evidence `visible`, `delayed`, `dropped`, `filtered`, or `out_of_window`, but they
must not create contradictory users, PIDs, ports, hashes, UIDs, or session identifiers across
-sources. `GROUND_TRUTH.md` records source evidence status when a non-complete profile is used.
+sources. `GROUND_TRUTH.md` records source evidence status for instructors, and
+`OBSERVATION_MANIFEST.json` records the same source-observation contract for automated eval.

## Storyline
