Context
PR #167 adds a single security hygiene dimension (bandit + npm audit). @akashgit's comment proposed expanding this into a full security audit mode with multi-scanner support, sub-dimensions, a security agent, and structured reports.
This issue captures a phased plan to get there incrementally.
Design Decisions (Need Input)
These choices shape the implementation. Worth aligning on before Phase 2:
- Eval hierarchy placement: Security as a hygiene dimension (0.05-0.10 weight, always runs) vs. separate eval tier vs. dedicated mode only?
- Out-of-the-box scanners: bandit + npm audit are obvious. semgrep, trivy, git-secrets add coverage but complexity. Which are must-haves?
- Aggregation formula: Severity-weighted (1.0 - critical*0.4 - high*0.2 - medium*0.1 - low*0.05) vs. per-scanner average vs. language-prevalence weighted?
- Run cadence: Every eval cycle (thorough, slow) vs. on-demand via --mode security-audit (fast default cycles)?
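To make the severity-weighted option concrete, here is a rough sketch. It assumes `critical`, `high`, `medium`, and `low` are issue counts and clamps the result at 0.0. Both the count interpretation and the clamping are assumptions, since the formula does not say what happens once penalties exceed 1.0. `severity_weighted_score` is a hypothetical name.

```python
from collections import Counter

# Penalty per issue, taken from the severity-weighted formula in this issue.
SEVERITY_PENALTY = {"critical": 0.4, "high": 0.2, "medium": 0.1, "low": 0.05}


def severity_weighted_score(severities: list[str]) -> float:
    """Return 1.0 minus the summed penalties, clamped so it never drops below 0.0.

    Assumption: each entry in `severities` is one finding, and unknown
    severity labels carry no penalty.
    """
    counts = Counter(s.lower() for s in severities)
    penalty = sum(SEVERITY_PENALTY.get(sev, 0.0) * n for sev, n in counts.items())
    return max(0.0, 1.0 - penalty)
```

Under this reading, three critical findings already drive the score to the floor, which may matter when comparing against the per-scanner average.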
Phased Plan
Phase 0: Foundation (PR #167)
Merge current PR after addressing review findings. This gives us:
- Single eval_security() hygiene dimension
- bandit (Python) + npm audit (Node) support
- Neutral fallback when no scanner detected
- Weight 0.08 in hygiene tier
Remaining work on #167: fix bandit not-installed detection, update stale dimension counts in test_runner.py / runner.py / docs, handle silent tool failures, terminology consistency.
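One possible approach to the not-installed detection and silent-failure items is to check for the executable before invoking it and to treat a tool that could not run as "no data" rather than as a pass. The sketch below is illustrative only: `run_tool`, `NEUTRAL_SCORE`, and the 0.5 value are hypothetical and not taken from PR #167.

```python
import shutil
import subprocess
from typing import Optional

NEUTRAL_SCORE = 0.5  # hypothetical neutral value; the real fallback may differ


def run_tool(cmd: list[str], timeout: int = 120) -> Optional[str]:
    """Run a scanner command and return its stdout, or None if it could not run.

    Returns None when the executable is missing, times out, fails to start,
    or prints nothing, so the caller can fall back to a neutral score
    instead of assuming success.
    """
    if shutil.which(cmd[0]) is None:
        return None  # tool not installed
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Scanners commonly exit non-zero when they find issues, so the exit code
    # alone is not treated as a failure here; empty output is.
    return proc.stdout if proc.stdout.strip() else None
```

A caller would map a `None` return to `NEUTRAL_SCORE` and record that the scanner was skipped, so the gap is visible rather than silent.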
Phase 1: Scanner Abstraction (~800 LOC)
Create factory/security/ package with a pluggable scanner architecture:

    class SecurityScanner(Protocol):
        def detect(self, project_path: Path) -> bool: ...
        def run(self, project_path: Path) -> SecurityScanResult: ...

- SecurityScanResult and SecurityIssue models in factory/models.py (severity, category, file, remediation)
- Scanner registry with auto-detection
- Refactor eval_security() to use the registry instead of inline bandit/npm logic
- Implement BanditScanner, NpmAuditScanner
- Add SemgrepScanner, TrivyScanner, GitSecretsScanner (pending decision on which are must-haves)
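The pieces listed for Phase 1 could fit together roughly as follows. This is a minimal sketch, not the planned implementation: the model fields follow the list above, while the `scanner` field, `register`, `applicable_scanners`, and the `package.json` detection rule are assumptions made for illustration. The `NpmAuditScanner` here is a placeholder that runs no real scan.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class SecurityIssue:
    severity: str
    category: str
    file: str
    remediation: str = ""


@dataclass
class SecurityScanResult:
    scanner: str  # assumed field: which scanner produced the result
    issues: list[SecurityIssue] = field(default_factory=list)


class SecurityScanner(Protocol):
    def detect(self, project_path: Path) -> bool: ...
    def run(self, project_path: Path) -> SecurityScanResult: ...


_REGISTRY: list[SecurityScanner] = []


def register(scanner: SecurityScanner) -> None:
    _REGISTRY.append(scanner)


def applicable_scanners(project_path: Path) -> list[SecurityScanner]:
    """Auto-detection: keep only scanners whose detect() accepts the project."""
    return [s for s in _REGISTRY if s.detect(project_path)]


class NpmAuditScanner:
    """Placeholder showing the detect() half; run() returns an empty result."""

    def detect(self, project_path: Path) -> bool:
        return (project_path / "package.json").is_file()

    def run(self, project_path: Path) -> SecurityScanResult:
        return SecurityScanResult(scanner="npm-audit")


register(NpmAuditScanner())
```

Because `SecurityScanner` is a `Protocol`, scanners only need matching method signatures and do not have to inherit from a base class, which keeps third-party scanners easy to plug in.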
Phase 2: Sub-dimensions + CEO Awareness (~400 LOC)
Break single "security" dimension into categories:
- security_dependencies: dependency scanning (npm audit, pip audit, cargo audit)
- security_code: code pattern analysis (bandit, semgrep)
- security_secrets: hardcoded secrets detection (git-secrets, detect-secrets)
- security_permissions: file/function permission checks (custom rules)
CEO/Strategist integration:
- Enhanced Strategist prompt to generate security-focused hypotheses when security score < threshold
- Optional security policy guards in guards.py (e.g., check_no_secrets() as a hard gate that fails experiments introducing secrets)
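A hard gate like `check_no_secrets()` might look something like the sketch below. The signature (a list of added lines in, a pass/fail flag plus offending lines out) and the three patterns are assumptions for illustration. A real guard would more likely delegate to git-secrets or detect-secrets than maintain its own regexes.

```python
import re

# Illustrative patterns only; dedicated tools use far broader rule sets.
_SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
    re.compile(r"(?i)\b(?:password|secret|api_key)\s*=\s*['\"][^'\"]{8,}['\"]"),
]


def check_no_secrets(added_lines: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, offending_lines) for lines an experiment introduces."""
    offending = [
        line for line in added_lines
        if any(p.search(line) for p in _SECRET_PATTERNS)
    ]
    return (not offending, offending)
```

Returning the offending lines alongside the boolean lets the guard fail the experiment and still tell the Builder what to fix.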
Phase 3: Security Audit Mode (~500 LOC)
New --mode security-audit in CLI:
- Researcher: analyze security posture, identify threat surface
- Strategist: generate security-focused hypotheses (prioritized by severity)
- Builder: implement fixes
- Reviewer: enforce security policies on changes
- Evaluator: measure security score improvement
Optional: dedicated security_auditor agent role with specialized prompt, spawned by CEO or Researcher.
Phase 4: Dashboard + Docs (~400 LOC)
- Security panel in dashboard: severity distribution, scanner coverage, resolved vs. outstanding issues
- Cross-project security patterns view
- Architecture docs, contributing guide updates, README section
Effort Estimates
| Phase | Scope | LOC | Time |
| --- | --- | --- | --- |
| 0 | Single dimension (PR #167) | ~200 | Done |
| 1 | Scanner abstraction + registry | ~800 | 1-2 weeks |
| 2 | Sub-dimensions + CEO awareness | ~400 | 1 week |
| 3 | Security audit mode | ~500 | 1 week |
| 4 | Dashboard + docs | ~400 | 1 week |
Architecture Notes
- Existing eval system is pluggable. Hygiene dimensions use auto-detect + subprocess + parse pattern. Security scanners fit this naturally.
- Guards are hard gates. Security policies that are non-negotiable (no secrets in code) should be guards, not dimensions. Dimensions contribute to score; guards fail experiments.
- CEO routes everything. For security to be non-negotiable, it needs either high eval weight, guard enforcement, or dedicated mode. Weight alone (0.05-0.10 in hygiene = 2.5-5% of total score) may not be enough to drive CEO prioritization.
- Agents are stateless. Security context must be passed via task description or artifacts (.factory/security/scan_results.json).
- Scanners can be slow. Phase 1 runs sequentially; parallel execution is a Phase 3+ optimization if needed.
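Since agents are stateless, one way to hand scan output from the Evaluator to a later agent is a small JSON artifact. The sketch below uses the `.factory/security/scan_results.json` path from the note above. The `{"results": [...]}` layout and both function names are assumptions.

```python
import json
from pathlib import Path

# Path taken from the architecture note; the JSON layout is an assumption.
ARTIFACT = Path(".factory/security/scan_results.json")


def write_scan_results(project_path: Path, results: list[dict]) -> Path:
    """Persist scan results so a later, stateless agent can read them."""
    target = project_path / ARTIFACT
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps({"results": results}, indent=2))
    return target


def read_scan_results(project_path: Path) -> list[dict]:
    """Load previously written results; an absent file means no scan has run."""
    target = project_path / ARTIFACT
    if not target.is_file():
        return []
    return json.loads(target.read_text())["results"]
```

Treating a missing file as an empty list keeps agents usable on projects that have never been scanned, at the cost of not distinguishing "no scan" from "no findings".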
Risks
| Risk | Mitigation |
| --- | --- |
| Security scanners slow down every eval cycle | Optional skip flag, run on-demand in security mode, parallel execution later |
| False positives waste CEO/Builder time | Severity filtering, configurable thresholds per scanner |
| Low weight means CEO ignores security | Enforce via guards (hard gates) or dedicated mode |
| Too many scanner options = config complexity | Auto-detect sensible defaults per language |
References