[codex] improve assessment-loop realism #163

Draft. DavidJBianco wants to merge 61 commits into dev from codex/eforge-assess-loops-1-10

DavidJBianco commented May 16, 2026

Summary

  • Continue the iterative EvidenceForge realism assessment work through Loop 30 on the draft PR branch.
  • Fix high-leverage blind-review findings across browser/process attribution, Linux endpoint cadence, Zeek multi-sensor timing, public DNS/X.509 realism, persistent HTTP transaction modeling, Zeek parent-flow accounting, Linux syslog texture, DNS/C2 cadence, systemd timer realism, and eCAR FLOW principal attribution.
  • Preserve per-loop assessment results in TODO.md, including automated eval scores, hard probes, blind reviewer synthetic-confidence scores, deliberation outcomes, and next recommended targets.

Why

The blind assessment loops kept surfacing concrete synthetic tells in generator-owned behavior. This branch fixes those root causes at the data/config, canonical event, timing, source-observation, and emitter layers so generated evidence agrees by construction and better matches source-native enterprise telemetry.

Validation

  • uv run eforge validate-config
  • uv run ruff check .
  • uv run ruff format --check .
  • uv run pytest --no-cov -q after each fix pass; latest full suite: 3162 passed, 37 skipped
  • Repeated iteration-test regeneration/evaluation through Loop 30
  • Latest Loop 30 quantitative eval: 95.99/100 across 76,333 records, all hard gates passing
  • Latest Loop 30 hard probe: 4,579/13,240 eCAR FLOW records now carry mixed principals, with zero pid=-1 principal leaks and zero failed-flow principal claims
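A hard probe of the kind described in the last bullet could look roughly like the sketch below. This is an illustration only, not the probe used on this branch: the record layout and the field names `object`, `principal`, `pid`, and `failed` are assumptions, and "carries a principal" is read here as "has a non-empty `principal` field".

```python
from typing import Any, Dict, Iterable, Mapping


def probe_flow_principals(records: Iterable[Mapping[str, Any]]) -> Dict[str, int]:
    """Count FLOW records that carry a principal, plus two kinds of leak.

    All field names are hypothetical; adapt them to the real eCAR schema.
    """
    stats = {"flows": 0, "with_principal": 0, "pid_leaks": 0, "failed_claims": 0}
    for rec in records:
        if rec.get("object") != "FLOW":
            continue  # only FLOW records are in scope for this probe
        stats["flows"] += 1
        if not rec.get("principal"):
            continue  # no principal claimed, so nothing can leak
        stats["with_principal"] += 1
        if rec.get("pid") == -1:
            stats["pid_leaks"] += 1  # principal attached to an unknown process
        if rec.get("failed"):
            stats["failed_claims"] += 1  # principal attached to a failed flow
    return stats
```

Under these assumptions, the Loop 30 result corresponds to `with_principal == 4579` out of `flows == 13240` (about 34.6%), with `pid_leaks` and `failed_claims` both at zero.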

Latest reviewer signal

Loop 30 blind-review synthetic-confidence scores:

  • Threat Hunter: 84
  • Detection Engineer: 62
  • Network Forensics: 68
  • Host/EDR: 39 after inversion from a Real verdict at confidence 61
  • Average: 63.25
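The arithmetic behind these numbers can be reproduced with a short sketch. The function name and the verdict strings are illustrative, and the inversion rule (100 minus the confidence for a Real verdict) is inferred from the Host/EDR line above rather than taken from the assessment tooling.

```python
def synthetic_confidence(verdict: str, confidence: int) -> int:
    """Map a reviewer verdict to a 0-100 'looks synthetic' score.

    Assumed rule: a Synthetic verdict keeps its confidence, while a
    Real verdict is inverted so that higher still means more synthetic.
    """
    if verdict == "Synthetic":
        return confidence
    if verdict == "Real":
        return 100 - confidence
    raise ValueError(f"unknown verdict: {verdict!r}")


scores = {
    "Threat Hunter": 84,
    "Detection Engineer": 62,
    "Network Forensics": 68,
    "Host/EDR": synthetic_confidence("Real", 61),  # 100 - 61 = 39
}
average = sum(scores.values()) / len(scores)  # 253 / 4 = 63.25
```

Lower is better here: a falling average means reviewers are less sure the data is synthetic.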

The next highest-leverage generator-owned finding is the verified DB bash-history/eCAR timing mismatch for the scp /tmp/mhs-archive.sql.gz command. Scenario-authored label legibility remains deferred unless scenario edits are explicitly authorized.
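One way to check for this kind of mismatch is sketched below. It is not the verification used in the review: it assumes the bash history file was written with timestamps enabled (bash then stores a `#<epoch>` line before each command), and that eCAR process events are available as hypothetical `(epoch_seconds, command_line)` pairs. The two-second tolerance is also an arbitrary choice.

```python
from typing import Iterator, List, Optional, Sequence, Tuple


def parse_bash_history(text: str) -> Iterator[Tuple[Optional[int], str]]:
    """Yield (epoch, command) pairs from a timestamped bash history file."""
    epoch: Optional[int] = None
    for line in text.splitlines():
        if line.startswith("#") and line[1:].isdigit():
            epoch = int(line[1:])  # timestamp line for the next command
        elif line.strip():
            yield epoch, line
            epoch = None  # a timestamp applies to one command only


def timing_mismatches(
    history_text: str,
    process_events: Sequence[Tuple[int, str]],
    needle: str,
    tolerance_s: int = 2,
) -> List[Tuple[int, str]]:
    """Return history entries containing `needle` with no nearby process event."""
    event_times = [ts for ts, cmd in process_events if needle in cmd]
    mismatches = []
    for epoch, command in parse_bash_history(history_text):
        if epoch is None or needle not in command:
            continue
        if not any(abs(epoch - ts) <= tolerance_s for ts in event_times):
            mismatches.append((epoch, command))
    return mismatches
```

Calling `timing_mismatches(history_text, events, "/tmp/mhs-archive.sql.gz")` would return an empty list once the two sources agree within the tolerance.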
