
Security audit mode: phased implementation plan #189

@lukeinglis

Description

Context

PR #167 adds a single security hygiene dimension (bandit + npm audit). @akashgit's comment proposed expanding this into a full security audit mode with multi-scanner support, sub-dimensions, a security agent, and structured reports.

This issue captures a phased plan to get there incrementally.

Design Decisions (Need Input)

These choices shape the implementation. Worth aligning on before Phase 2:

  1. Eval hierarchy placement: Security as a hygiene dimension (0.05-0.10 weight, always runs) vs. separate eval tier vs. dedicated mode only?
  2. Out-of-the-box scanners: bandit + npm audit are obvious. semgrep, trivy, git-secrets add coverage but complexity. Which are must-haves?
  3. Aggregation formula: Severity-weighted (1.0 - critical*0.4 - high*0.2 - medium*0.1 - low*0.05) vs. per-scanner average vs. language-prevalence weighted?
  4. Run cadence: Every eval cycle (thorough, slow) vs. on-demand via --mode security-audit (fast default cycles)?
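
For decision 3, the severity-weighted option can be sketched in a few lines. This is a minimal illustration of the formula as written above (function name and clamping behavior are assumptions, not part of PR #167):

```python
def aggregate_security_score(critical: int, high: int, medium: int, low: int) -> float:
    """Severity-weighted aggregation from option 3, clamped to [0, 1]."""
    score = 1.0 - critical * 0.4 - high * 0.2 - medium * 0.1 - low * 0.05
    return max(0.0, min(1.0, score))

# e.g. 1 high + 2 medium findings: 1.0 - 0.2 - 0.2 = 0.6
```

One property worth noting when comparing options: this formula saturates at 0.0 after roughly three criticals, so projects with many findings become indistinguishable unless a per-scanner average or prevalence weighting is layered on top.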

Phased Plan

Phase 0: Foundation (PR #167)

Merge current PR after addressing review findings. This gives us:

  • Single eval_security() hygiene dimension
  • bandit (Python) + npm audit (Node) support
  • Neutral fallback when no scanner detected
  • Weight 0.08 in hygiene tier

Remaining work on #167: fix bandit not-installed detection, update stale dimension counts in test_runner.py / runner.py / docs, handle silent tool failures, terminology consistency.

Phase 1: Scanner Abstraction (~800 LOC)

Create factory/security/ package with a pluggable scanner architecture:

```python
class SecurityScanner(Protocol):
    def detect(self, project_path: Path) -> bool: ...
    def run(self, project_path: Path) -> SecurityScanResult: ...
```

  • SecurityScanResult and SecurityIssue models in factory/models.py (severity, category, file, remediation)
  • Scanner registry with auto-detection
  • Refactor eval_security() to use the registry instead of inline bandit/npm logic
  • Implement BanditScanner, NpmAuditScanner
  • Add SemgrepScanner, TrivyScanner, GitSecretsScanner (pending decision on which are must-haves)

Phase 2: Sub-dimensions + CEO Awareness (~400 LOC)

Break the single "security" dimension into four categories:

  • security_dependencies: dependency scanning (npm audit, pip audit, cargo audit)
  • security_code: code pattern analysis (bandit, semgrep)
  • security_secrets: hardcoded secrets detection (git-secrets, detect-secrets)
  • security_permissions: file/function permission checks (custom rules)

CEO/Strategist integration:

  • Enhanced Strategist prompt to generate security-focused hypotheses when security score < threshold
  • Optional security policy guards in guards.py (e.g., check_no_secrets() as a hard gate that fails experiments introducing secrets)

Phase 3: Security Audit Mode (~500 LOC)

New --mode security-audit in CLI:

  1. Researcher: analyze security posture, identify threat surface
  2. Strategist: generate security-focused hypotheses (prioritized by severity)
  3. Builder: implement fixes
  4. Reviewer: enforce security policies on changes
  5. Evaluator: measure security score improvement

Optional: dedicated security_auditor agent role with specialized prompt, spawned by CEO or Researcher.

Phase 4: Dashboard + Docs (~400 LOC)

  • Security panel in dashboard: severity distribution, scanner coverage, resolved vs. outstanding issues
  • Cross-project security patterns view
  • Architecture docs, contributing guide updates, README section

Effort Estimates

| Phase | Scope | LOC | Time |
|-------|-------|-----|------|
| 0 | Single dimension (PR #167) | ~200 | Done |
| 1 | Scanner abstraction + registry | ~800 | 1-2 weeks |
| 2 | Sub-dimensions + CEO awareness | ~400 | 1 week |
| 3 | Security audit mode | ~500 | 1 week |
| 4 | Dashboard + docs | ~400 | 1 week |

Architecture Notes

  • Existing eval system is pluggable. Hygiene dimensions follow an auto-detect → subprocess → parse pattern, and security scanners fit it naturally.
  • Guards are hard gates. Security policies that are non-negotiable (no secrets in code) should be guards, not dimensions. Dimensions contribute to score; guards fail experiments.
  • CEO routes everything. For security to be non-negotiable, it needs either high eval weight, guard enforcement, or dedicated mode. Weight alone (0.05-0.10 in hygiene = 2.5-5% of total score) may not be enough to drive CEO prioritization.
  • Agents are stateless. Security context must be passed via task description or artifacts (.factory/security/scan_results.json).
  • Scanners can be slow. Phase 1 runs sequentially; parallel execution is a Phase 3+ optimization if needed.

Risks

| Risk | Mitigation |
|------|------------|
| Security scanners slow down every eval cycle | Optional skip flag, run on-demand in security mode, parallel execution later |
| False positives waste CEO/Builder time | Severity filtering, configurable thresholds per scanner |
| Low weight means CEO ignores security | Enforce via guards (hard gates) or dedicated mode |
| Too many scanner options = config complexity | Auto-detect sensible defaults per language |
