feat: add security hygiene eval dimension (bandit + npm audit) #167
lukeinglis wants to merge 1 commit into
Conversation
Add a 7th mandatory hygiene dimension that runs security scanners on detected sub-projects:

- Python: runs bandit with JSON output, counts issues
- Node.js: runs npm audit with JSON output, sums vulnerabilities
- Returns neutral score (0.5) when no scanner is detected

Rebalances HYGIENE_WEIGHTS to maintain a sum of 1.0 with the new dimension at 0.08 weight. Tests and coverage keep the highest weights.

Includes 7 tests covering clean scans, issue parsing, tool-not-found fallback, and score floor at zero. Updates existing tests for the new dimension count (6 -> 7, 11 -> 12 total with growth).

Closes akashgit#128, an item from the contributing.md "Good First Issues" table.

Signed-off-by: Luke Inglis <lukeinglis21@yahoo.com>
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   86.23%   86.25%   +0.01%
==========================================
  Files          45       45
  Lines        6307     6352      +45
==========================================
+ Hits         5439     5479      +40
- Misses        868      873       +5
==========================================
```

View full report in Codecov by Sentry.
akashgit
left a comment
Code Review: PR #167 — Security Hygiene Eval Dimension
Thorough review of all changes across factory/eval/hygiene.py, tests/eval/test_hygiene.py, and tests/eval/test_runner.py. Several issues found, including one blocker.
BLOCKER: Bandit "not installed" detection is broken
File: factory/eval/hygiene.py, eval_security() — bandit path
```python
if rc == 1 and "Command not found" in stderr:
    continue
```

The `_run_cmd` helper returns `"Command not found: {cmd[0]}"` via the FileNotFoundError handler (line 97). However, the command is `["python", "-m", "bandit", ...]`, and Python itself exists on the system. When bandit is not installed, `python -m bandit` returns:

- `rc = 1`
- `stderr = "No module named bandit"` (NOT `"Command not found"`)
- `stdout = ""`
The FileNotFoundError path in `_run_cmd` only fires if `python` itself is missing, so on any system with Python installed but without bandit this check never matches. The code will then try to parse the empty stdout, ending with `count = 0`. Since `rc != 0` and `count == 0`, neither the clean branch nor the issues branch fires, `ran_any` stays False for this sub-project, and the dimension silently falls back to neutral. In this specific case the end result happens to be correct by accident (neutral score when the tool is missing), but only because the fallthrough logic coincidentally does the right thing.
This becomes a real bug in a mixed project (Python + Node) where npm audit runs successfully but bandit is not installed: bandit's failure is silently swallowed, and you get partial results without any indication that Python security scanning was skipped.
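The failure shape described above is easy to confirm: invoking a missing module via `python -m` never raises FileNotFoundError at the subprocess level, because the interpreter itself is found. A small self-contained demonstration (the module name is deliberately fake):

```python
import subprocess
import sys

# Run `python -m <missing module>`. The process starts fine (Python
# exists), so there is no FileNotFoundError; instead we get a non-zero
# return code and "No module named ..." on stderr, with empty stdout.
proc = subprocess.run(
    [sys.executable, "-m", "definitely_not_a_real_module_xyz"],
    capture_output=True,
    text=True,
)
print(proc.returncode)                      # non-zero (1)
print("No module named" in proc.stderr)     # True
print(repr(proc.stdout))                    # ''
```

This is exactly why a check keyed on `"Command not found"` can never fire for the `python -m bandit` invocation.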
The test test_bandit_not_installed is also testing the wrong failure mode — it mocks _run_cmd to return "Command not found: bandit", which is what FileNotFoundError produces when bandit is the direct command. But the actual command is python -m bandit, so the real stderr would be "No module named bandit".
Fix: Check for both patterns:
```python
if rc != 0 and ("Command not found" in stderr or "No module named" in stderr):
    continue
```

And update the test to use the realistic mock:

```python
mock.return_value = (1, "", "No module named bandit")
```

ISSUE 1: Incomplete test_runner.py updates — missed assertions
File: tests/eval/test_runner.py
The PR updates >= 11 to >= 12 on lines 65 and 82, and adds "security" to the hygiene_names set on line 87. But it misses several updates in the same file:
- Line 10: docstring still says `"all 11 mandatory dimensions"`; should be 12
- Line 14: comment still says `# 6 hygiene + 5 growth = 11 mandatory`; should be `# 7 hygiene + 5 growth = 12 mandatory`
- Line 26: `assert len(result.results) >= 11`; should be `>= 12`
- Line 29: comment says `"beyond the 11"`; should be "beyond the 12"
- Line 40: comment says `# 11 mandatory + 2 project additions`; should be `# 12 mandatory + 2 project additions`
- Line 43: `assert len(result.results) >= 13`; should be `>= 14`
- Line 64: comment says `# All 11 mandatory should still be present`; should be 12
- Missing: `assert "security" in names` in `test_always_has_mandatory_dimensions` (lines 9-26); every other mandatory dimension is explicitly asserted, but security is not
While the tests still pass due to >=, this is an incomplete update that will confuse future contributors.
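A more robust pattern is to assert every mandatory name explicitly via set difference, so a regression names the missing dimension instead of just failing a count. A sketch (the `observed` list here is a stand-in; in the real test it would be built from `result.results`, whose exact shape is an assumption):

```python
# All seven hygiene dimension names mentioned in this review.
MANDATORY_HYGIENE = {
    "tests", "lint", "type_check", "coverage",
    "guard_patterns", "config_parser", "security",
}

# Stand-in for names collected from the runner's results.
observed = [
    "tests", "lint", "type_check", "coverage",
    "guard_patterns", "config_parser", "security", "extra_dim",
]

# Set difference reports exactly which mandatory dimensions are absent.
missing = MANDATORY_HYGIENE - set(observed)
assert not missing, f"mandatory dimensions missing: {sorted(missing)}"
```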
ISSUE 2: runner.py docstrings not updated
File: factory/eval/runner.py (not touched by PR, but needs updating)
Multiple docstrings and comments reference "6 hygiene dimensions" and "11 mandatory":
- Line 4: `- 6 hygiene dimensions (tests, lint, type_check, coverage, guard_patterns, config_parser)`; missing `security`
- Line 97: `Hygiene (mandatory 6 + eval/score.py additions)`; should be 7
- Line 126: `mandatory 11 dimensions`; should be 12
- Line 244: `Compute 6 mandatory hygiene dimensions`; should be 7
These docstrings are authoritative references that other agents (CEO, Strategist) and the runner itself rely on. Stale counts break the contract.
ISSUE 3: docs/ not updated
File: docs/architecture.md line 104 and docs/eval.md line 7
These documentation files reference "6 mandatory dimensions" and "6 dimensions" in the hygiene tier. The docs/eval.md table of hygiene dimensions doesn't include security. Since this PR introduces a new mandatory dimension, the documentation should reflect it.
ISSUE 4: Silent failure on tool errors
File: factory/eval/hygiene.py, eval_security()
When a security scanner runs but crashes (e.g., bandit internal error with rc=2, or npm audit with a corrupt lockfile returning malformed JSON), the code silently drops the result:
- The "Command not found" check doesn't match (rc != 1, or wrong stderr)
- JSON parsing fails or returns `count = 0`
- Neither `rc == 0 and count == 0` nor `count > 0` fires
- `ran_any` stays False for this sub-project
Compare with existing dimensions like eval_lint (line 206-217), which unconditionally sets ran_any = True on any non-zero rc, treating tool errors as "1 error." The security dimension should be consistent: if the tool exists and returns a non-zero rc that isn't "not installed," that should count as a failed scan, not be silently ignored.
Suggested approach: After the "Command not found" / "No module named" continue check, treat any rc != 0 and count == 0 case as a ran-but-failed scenario:
```python
if rc != 0 and count == 0:
    ran_any = True
    total_issues += 1
    details_parts.append(f"{sp.name}: scanner error (rc={rc})")
```

ISSUE 5: npm audit can return rc != 0 for non-vulnerability reasons
File: factory/eval/hygiene.py, eval_security() — npm audit path
npm audit returns rc=1 for multiple reasons beyond "vulnerabilities found":
- `ENOLOCK`: no lockfile present
- `EAUDITNOLOCK`: audit with no lockfile
- Network errors
The PR's check if rc == 1 and "Command not found" in stderr only handles the literal FileNotFoundError case. The ENOLOCK case (returns valid JSON with {"error": {...}} but no "metadata" key) happens to fall through gracefully (count=0, no branch fires), but this is accidental. A project with package.json but no lockfile will silently have its Node security scanning skipped without any indication.
Consider checking for the npm error JSON pattern explicitly and logging it in details.
ISSUE 6: Weight rebalancing proportionality claim is inaccurate
PR description claims weights are "scaled proportionally." Let me check:
| Dimension | Before | After | Ratio (After/Before) |
|---|---|---|---|
| tests | 0.30 | 0.28 | 0.933 |
| coverage | 0.25 | 0.23 | 0.920 |
| lint | 0.15 | 0.14 | 0.933 |
| type_check | 0.10 | 0.09 | 0.900 |
| guard_patterns | 0.10 | 0.09 | 0.900 |
| config_parser | 0.10 | 0.09 | 0.900 |
If truly proportional (factor = 0.92 = 1 - 0.08), tests should be 0.276 (rounded to 0.28, OK), coverage should be 0.23 (OK), lint should be 0.138 (rounded to 0.14, OK), type_check should be 0.092 (rounded to 0.09, OK). The rounding means the proportionality is approximate but acceptable. The sum is exactly 1.0. No fix needed here, but the claim of "proportional" in the PR description should note it's approximate due to rounding.
This is not blocking, just noting it for accuracy.
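The arithmetic above can be checked mechanically. A small verification snippet, using the before/after values from the table (the 0.08 security weight comes from the PR description):

```python
# Old and new hygiene weights as listed in the review's table.
before = {"tests": 0.30, "coverage": 0.25, "lint": 0.15,
          "type_check": 0.10, "guard_patterns": 0.10, "config_parser": 0.10}
after = {"tests": 0.28, "coverage": 0.23, "lint": 0.14,
         "type_check": 0.09, "guard_patterns": 0.09, "config_parser": 0.09,
         "security": 0.08}

# The new weights sum to exactly 1.0 (rounding guards float noise).
assert round(sum(after.values()), 10) == 1.0

# Each old weight, scaled by 1 - 0.08 = 0.92 and rounded to two
# decimals, reproduces the new weight: "proportional up to rounding".
for name, old in before.items():
    assert after[name] == round(old * 0.92, 2), name
```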
NIT 1: Redundant local import json in eval_guard_patterns
File: factory/eval/hygiene.py line 373
The PR adds import json at the top level (line 18 in the diff), but line 373 still has a local import json inside eval_guard_patterns. Since the PR is already modifying this file and adding the top-level import, the local import should be removed.
NIT 2: Dimension weight comment format inconsistency
File: factory/eval/hygiene.py
Existing dimensions use comment format like:

```python
# ── Dimension 1: tests (weight 0.30) ──────────────────────────────
```

The PR's comment says:

```python
# ── Dimension 7: security (weight 0.08) ─────────────────────────
```

The trailing dash run is shorter (53 chars vs 58 chars for other section headers). All other dimension headers use the same length. Minor, but worth fixing per team policy on consistency.
NIT 3: Inconsistent detail string formatting
File: factory/eval/hygiene.py, eval_security()
For Python: f"{sp.name}: {count} issues"
For Node: f"{sp.name}(js): {count} vulnerabilities"
"issues" vs "vulnerabilities" — should pick one term for consistency, or better: "{count} security findings" for both. The existing lint dimension uses "errors" consistently across all languages.
Summary
| Category | Count |
|---|---|
| BLOCKER | 1 (bandit not-installed detection) |
| ISSUE | 5 (test gaps, stale docstrings, docs, silent failures, npm edge cases) |
| NIT | 3 (redundant import, comment format, terminology) |
The core idea is sound and the implementation follows the existing patterns reasonably well, but the "not installed" detection bug needs to be fixed (it would cause confusing behavior in mixed Python+Node projects), and the incomplete dimension count updates across test_runner.py, runner.py, and docs/ need to be addressed. All findings should be resolved before merge per team policy.
akashgit
left a comment
Code Review
Finding 1: Bandit "not installed" detection is broken — BLOCKER
In factory/eval/hygiene.py, the check for bandit not being installed looks for "Command not found" in stderr. But python -m bandit when bandit isn't installed produces "No module named bandit" — so this check never matches. The corresponding test test_bandit_not_installed mocks the wrong failure mode.
The end result happens to be correct by accident (falls through to neutral score), but breaks in mixed Python+Node projects where the Node path would still run but the Python score contribution is wrong.
Fix: Check for "No module named" in stderr instead of "Command not found".
Finding 2: Hardcoded counts in test_runner.py not updated — ISSUE
Multiple lines in test_runner.py (lines 10, 14, 26, 29, 40, 43, 64) still hardcode 11 dimensions — should be 12 now. Also missing assert "security" in names from test_always_has_mandatory_dimensions.
Finding 3: Docstrings in runner.py not updated — ISSUE
Lines 4, 97, 126, 244 in factory/eval/runner.py still reference "6 hygiene" and "11 mandatory" dimensions. These should say "7 hygiene" and "12 mandatory".
Finding 4: Documentation not updated — ISSUE
docs/architecture.md (line 104) and docs/eval.md (line 7) reference "6 mandatory dimensions" without security. Adding a new eval dimension should update all docs.
Finding 5: Silent failure on tool errors — ISSUE
Scanner crashes (rc=2, malformed JSON from npm audit) are silently dropped instead of being treated as failures. This is inconsistent with how eval_lint handles the same scenario. A tool that crashes should produce a score of 0.0 with a diagnostic message, not silently return a neutral score.
Finding 6: npm audit non-vulnerability errors swallowed — ISSUE
ENOLOCK and other npm errors are silently swallowed without any diagnostic output. At minimum, include the error in the details field so users can diagnose why security scoring returned neutral.
Finding 7: Redundant import — NIT
eval_guard_patterns has a local import json that's now redundant since the PR adds json at the top level.
Finding 8: Terminology inconsistency — NIT
Python path uses "issues" while Node path uses "vulnerabilities". Pick one term and use it consistently.
Verdict: Do not merge. The bandit detection bug (Finding 1) is a correctness blocker. The hardcoded counts, stale docstrings, and missing docs (Findings 2-4) mean the PR is incomplete — adding a new eval dimension requires updating all references to dimension counts. Silent failure handling (Findings 5-6) is inconsistent with existing patterns.
Hey Luke — really like the direction here. Security as a first-class eval dimension is something we've been wanting, and the bandit + npm audit combo is a solid starting point. That said, I think there's an opportunity to think bigger. Rather than just a single eval dimension that runs two tools, what if we designed this as a proper security audit mode — almost like a security guard agent that can:
Think of it like: the current eval dimensions measure "is this code healthy?" — a security mode would answer "is this code safe?" It could be lightweight by default (just the quick checks) but extensible for users who want deeper scans. This would need some planning around:
Would love to see you sketch out a design doc or proposal for this. The fixes from the review still apply to whatever shape this takes, so worth addressing those in the meantime. But I'd hold off on merging this as-is until we've thought through the broader security story — it'll be easier to get the API right the first time than to change it later. Great instinct bringing this up though — keep going with it!
Summary
Adds a 7th mandatory hygiene dimension: security scanning. Listed as a "Good First Issue" in the contributing guide.
- Python: runs `bandit -r . -f json -q`, counts issues from JSON output
- Node.js: runs `npm audit --json`, sums vulnerabilities across severity levels
- Score: `max(0.0, 1.0 - issue_count * 0.1)`

Changes

- `factory/eval/hygiene.py`: add `eval_security()`, rebalance `HYGIENE_WEIGHTS` (sum still 1.0), update `compute_hygiene_results()`
- `tests/eval/test_hygiene.py`: 7 new tests for the security dimension
- `tests/eval/test_runner.py`: updated dimension counts

Weight rebalancing

All weights scaled proportionally to accommodate the new dimension:

Test plan

- New tests pass (`uv run pytest tests/eval/test_hygiene.py -v`)
- Full suite passes (`uv run pytest tests/ -x -q`)
- Weights sum to 1.0 (`test_weights_sum_to_one`)
- Manually checked `compute_hygiene_results()` output
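The scoring rule above can be sketched as a one-liner (the function name is illustrative, not the PR's actual identifier):

```python
def security_score(issue_count: int) -> float:
    # Each security finding deducts 0.1 from a perfect 1.0; the score
    # floors at zero rather than going negative.
    return max(0.0, 1.0 - issue_count * 0.1)


print(security_score(0))   # 1.0
print(security_score(5))   # 0.5
print(security_score(15))  # 0.0 (floored; 10+ findings all score zero)
```

One consequence of the linear-with-floor shape: any sub-project with ten or more findings scores identically, so very noisy scanners saturate the dimension quickly.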