
Security audit mode: phased implementation plan #189

@lukeinglis

Description

Context

PR #167 adds a single security hygiene dimension (bandit + npm audit). @akashgit's comment proposed expanding this into a full security audit mode with multi-scanner support, sub-dimensions, a security agent, and structured reports.

This issue captures a phased plan to get there incrementally.

Design Decisions (Need Input)

These choices shape the implementation. Worth aligning on before Phase 2:

  1. Eval hierarchy placement: Security as a hygiene dimension (0.05-0.10 weight, always runs) vs. separate eval tier vs. dedicated mode only?
  2. Out-of-the-box scanners: bandit + npm audit are obvious. semgrep, trivy, git-secrets add coverage but complexity. Which are must-haves?
  3. Aggregation formula: Severity-weighted (1.0 - critical*0.4 - high*0.2 - medium*0.1 - low*0.05) vs. per-scanner average vs. language-prevalence weighted?
  4. Run cadence: Every eval cycle (thorough, slow) vs. on-demand via --mode security-audit (fast default cycles)?
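
For decision 3, the severity-weighted option can be sketched in a few lines. This is a minimal illustration of the formula as written above (function name and clamping behavior are assumptions, not part of PR #167):

```python
def aggregate_security_score(critical: int, high: int, medium: int, low: int) -> float:
    """Severity-weighted aggregation from option 3, clamped to [0, 1]."""
    score = 1.0 - critical * 0.4 - high * 0.2 - medium * 0.1 - low * 0.05
    return max(0.0, min(1.0, score))

# e.g. 1 high + 2 medium findings: 1.0 - 0.2 - 0.2 = 0.6
```

One property worth noting when comparing options: this formula saturates at 0.0 after roughly three criticals, so projects with many findings become indistinguishable unless a per-scanner average or prevalence weighting is layered on top.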

Phased Plan

Phase 0: Foundation (PR #167)

Merge current PR after addressing review findings. This gives us:

  • Single eval_security() hygiene dimension
  • bandit (Python) + npm audit (Node) support
  • Neutral fallback when no scanner detected
  • Weight 0.08 in hygiene tier

Remaining work on #167: fix bandit not-installed detection, update stale dimension counts in test_runner.py / runner.py / docs, handle silent tool failures, terminology consistency.

Phase 1: Scanner Abstraction (~800 LOC)

Create factory/security/ package with a pluggable scanner architecture:

```python
class SecurityScanner(Protocol):
    def detect(self, project_path: Path) -> bool: ...
    def run(self, project_path: Path) -> SecurityScanResult: ...
```

  • SecurityScanResult and SecurityIssue models in factory/models.py (severity, category, file, remediation)
  • Scanner registry with auto-detection
  • Refactor eval_security() to use the registry instead of inline bandit/npm logic
  • Implement BanditScanner, NpmAuditScanner
  • Add SemgrepScanner, TrivyScanner, GitSecretsScanner (pending decision on which are must-haves)

Phase 2: Sub-dimensions + CEO Awareness (~400 LOC)

Break the single "security" dimension into four categories:

  • security_dependencies: dependency scanning (npm audit, pip audit, cargo audit)
  • security_code: code pattern analysis (bandit, semgrep)
  • security_secrets: hardcoded secrets detection (git-secrets, detect-secrets)
  • security_permissions: file/function permission checks (custom rules)

CEO/Strategist integration:

  • Enhanced Strategist prompt to generate security-focused hypotheses when security score < threshold
  • Optional security policy guards in guards.py (e.g., check_no_secrets() as a hard gate that fails experiments introducing secrets)

Phase 3: Security Audit Mode (~500 LOC)

New --mode security-audit in CLI:

  1. Researcher: analyze security posture, identify threat surface
  2. Strategist: generate security-focused hypotheses (prioritized by severity)
  3. Builder: implement fixes
  4. Reviewer: enforce security policies on changes
  5. Evaluator: measure security score improvement

Optional: dedicated security_auditor agent role with specialized prompt, spawned by CEO or Researcher.

Phase 4: Dashboard + Docs (~400 LOC)

  • Security panel in dashboard: severity distribution, scanner coverage, resolved vs. outstanding issues
  • Cross-project security patterns view
  • Architecture docs, contributing guide updates, README section

Effort Estimates

| Phase | Scope | LOC | Time |
|-------|-------|-----|------|
| 0 | Single dimension (PR #167) | ~200 | Done |
| 1 | Scanner abstraction + registry | ~800 | 1-2 weeks |
| 2 | Sub-dimensions + CEO awareness | ~400 | 1 week |
| 3 | Security audit mode | ~500 | 1 week |
| 4 | Dashboard + docs | ~400 | 1 week |

Architecture Notes

  • Existing eval system is pluggable. Hygiene dimensions follow an auto-detect → subprocess → parse pattern, and security scanners fit it naturally.
  • Guards are hard gates. Security policies that are non-negotiable (no secrets in code) should be guards, not dimensions. Dimensions contribute to score; guards fail experiments.
  • CEO routes everything. For security to be non-negotiable, it needs either high eval weight, guard enforcement, or dedicated mode. Weight alone (0.05-0.10 in hygiene = 2.5-5% of total score) may not be enough to drive CEO prioritization.
  • Agents are stateless. Security context must be passed via task description or artifacts (.factory/security/scan_results.json).
  • Scanners can be slow. Phase 1 runs sequentially; parallel execution is a Phase 3+ optimization if needed.

Risks

| Risk | Mitigation |
|------|------------|
| Security scanners slow down every eval cycle | Optional skip flag, run on-demand in security mode, parallel execution later |
| False positives waste CEO/Builder time | Severity filtering, configurable thresholds per scanner |
| Low weight means CEO ignores security | Enforce via guards (hard gates) or dedicated mode |
| Too many scanner options = config complexity | Auto-detect sensible defaults per language |
