feat: add security hygiene eval dimension (bandit + npm audit) #167
lukeinglis wants to merge 1 commit into
Conversation
Add a 7th mandatory hygiene dimension that runs security scanners on detected sub-projects:

- Python: runs bandit with JSON output, counts issues
- Node.js: runs npm audit with JSON output, sums vulnerabilities
- Returns neutral score (0.5) when no scanner is detected

Rebalances HYGIENE_WEIGHTS to maintain a sum of 1.0 with the new dimension at 0.08 weight. Tests and coverage keep the highest weights.

Includes 7 tests covering clean scans, issue parsing, tool-not-found fallback, and score floor at zero. Updates existing tests for the new dimension count (6 -> 7, 11 -> 12 total with growth).

Closes akashgit#128, an item from the contributing.md "Good First Issues" table.

Signed-off-by: Luke Inglis <lukeinglis21@yahoo.com>
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   86.23%   86.25%   +0.01%
==========================================
  Files          45       45
  Lines        6307     6352      +45
==========================================
+ Hits         5439     5479      +40
- Misses        868      873       +5
==========================================
```

View full report in Codecov by Sentry.
akashgit
left a comment
Code Review: PR #167 — Security Hygiene Eval Dimension
Thorough review of all changes across factory/eval/hygiene.py, tests/eval/test_hygiene.py, and tests/eval/test_runner.py. Several issues found, including one blocker.
BLOCKER: Bandit "not installed" detection is broken
File: factory/eval/hygiene.py, eval_security() — bandit path
```python
if rc == 1 and "Command not found" in stderr:
    continue
```

The `_run_cmd` helper returns `"Command not found: {cmd[0]}"` via the FileNotFoundError handler (line 97). However, the command is `["python", "-m", "bandit", ...]`, and Python itself exists on the system. When bandit is not installed, `python -m bandit` returns:

- `rc = 1`
- `stderr = "No module named bandit"` (NOT `"Command not found"`)
- `stdout = ""`
The FileNotFoundError path in `_run_cmd` only fires if `python` itself is missing, so on any system with Python installed but without bandit this check never matches. The code will then try to parse the empty stdout, ending with `count = 0`. Since `rc != 0` and `count == 0`, neither the clean branch nor the issues branch fires, `ran_any` stays False for this sub-project, and the dimension silently falls back to neutral. In this specific case the end result happens to be correct by accident (neutral score when the tool is missing), but only because the fallthrough logic coincidentally does the right thing.
This becomes a real bug in a mixed project (Python + Node) where npm audit runs successfully but bandit is not installed: bandit's failure is silently swallowed, and you get partial results without any indication that Python security scanning was skipped.
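The failure shape described above is easy to confirm: invoking a missing module via `python -m` never raises FileNotFoundError at the subprocess level, because the interpreter itself is found. A small self-contained demonstration (the module name is deliberately fake):

```python
import subprocess
import sys

# Run `python -m <missing module>`. The process starts fine (Python
# exists), so there is no FileNotFoundError; instead we get a non-zero
# return code and "No module named ..." on stderr, with empty stdout.
proc = subprocess.run(
    [sys.executable, "-m", "definitely_not_a_real_module_xyz"],
    capture_output=True,
    text=True,
)
print(proc.returncode)                      # non-zero (1)
print("No module named" in proc.stderr)     # True
print(repr(proc.stdout))                    # ''
```

This is exactly why a check keyed on `"Command not found"` can never fire for the `python -m bandit` invocation.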
The test test_bandit_not_installed is also testing the wrong failure mode — it mocks _run_cmd to return "Command not found: bandit", which is what FileNotFoundError produces when bandit is the direct command. But the actual command is python -m bandit, so the real stderr would be "No module named bandit".
Fix: Check for both patterns:
```python
if rc != 0 and ("Command not found" in stderr or "No module named" in stderr):
    continue
```

And update the test to use the realistic mock:

```python
mock.return_value = (1, "", "No module named bandit")
```

ISSUE 1: Incomplete test_runner.py updates — missed assertions
File: tests/eval/test_runner.py
The PR updates >= 11 to >= 12 on lines 65 and 82, and adds "security" to the hygiene_names set on line 87. But it misses several updates in the same file:
- Line 10: docstring still says `"all 11 mandatory dimensions"`; should be 12
- Line 14: comment still says `# 6 hygiene + 5 growth = 11 mandatory`; should be `# 7 hygiene + 5 growth = 12 mandatory`
- Line 26: `assert len(result.results) >= 11`; should be `>= 12`
- Line 29: comment says `"beyond the 11"`; should be "beyond the 12"
- Line 40: comment says `# 11 mandatory + 2 project additions`; should be `# 12 mandatory + 2 project additions`
- Line 43: `assert len(result.results) >= 13`; should be `>= 14`
- Line 64: comment says `# All 11 mandatory should still be present`; should be 12
- Missing: `assert "security" in names` in `test_always_has_mandatory_dimensions` (lines 9-26); every other mandatory dimension is explicitly asserted, but security is not
While the tests still pass due to >=, this is an incomplete update that will confuse future contributors.
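A more robust pattern is to assert every mandatory name explicitly via set difference, so a regression names the missing dimension instead of just failing a count. A sketch (the `observed` list here is a stand-in; in the real test it would be built from `result.results`, whose exact shape is an assumption):

```python
# All seven hygiene dimension names mentioned in this review.
MANDATORY_HYGIENE = {
    "tests", "lint", "type_check", "coverage",
    "guard_patterns", "config_parser", "security",
}

# Stand-in for names collected from the runner's results.
observed = [
    "tests", "lint", "type_check", "coverage",
    "guard_patterns", "config_parser", "security", "extra_dim",
]

# Set difference reports exactly which mandatory dimensions are absent.
missing = MANDATORY_HYGIENE - set(observed)
assert not missing, f"mandatory dimensions missing: {sorted(missing)}"
```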
ISSUE 2: runner.py docstrings not updated
File: factory/eval/runner.py (not touched by PR, but needs updating)
Multiple docstrings and comments reference "6 hygiene dimensions" and "11 mandatory":
- Line 4: `- 6 hygiene dimensions (tests, lint, type_check, coverage, guard_patterns, config_parser)`; missing `security`
- Line 97: `Hygiene (mandatory 6 + eval/score.py additions)`; should be 7
- Line 126: `mandatory 11 dimensions`; should be 12
- Line 244: `Compute 6 mandatory hygiene dimensions`; should be 7
These docstrings are authoritative references that other agents (CEO, Strategist) and the runner itself rely on. Stale counts break the contract.
ISSUE 3: docs/ not updated
File: docs/architecture.md line 104 and docs/eval.md line 7
These documentation files reference "6 mandatory dimensions" and "6 dimensions" in the hygiene tier. The docs/eval.md table of hygiene dimensions doesn't include security. Since this PR introduces a new mandatory dimension, the documentation should reflect it.
ISSUE 4: Silent failure on tool errors
File: factory/eval/hygiene.py, eval_security()
When a security scanner runs but crashes (e.g., bandit internal error with rc=2, or npm audit with a corrupt lockfile returning malformed JSON), the code silently drops the result:
- The "Command not found" check doesn't match (rc != 1, or wrong stderr)
- JSON parsing fails or returns `count = 0`
- Neither `rc == 0 and count == 0` nor `count > 0` fires
- `ran_any` stays False for this sub-project
Compare with existing dimensions like eval_lint (line 206-217), which unconditionally sets ran_any = True on any non-zero rc, treating tool errors as "1 error." The security dimension should be consistent: if the tool exists and returns a non-zero rc that isn't "not installed," that should count as a failed scan, not be silently ignored.
Suggested approach: After the "Command not found" / "No module named" continue check, treat any rc != 0 and count == 0 case as a ran-but-failed scenario:
```python
if rc != 0 and count == 0:
    ran_any = True
    total_issues += 1
    details_parts.append(f"{sp.name}: scanner error (rc={rc})")
```

ISSUE 5: npm audit can return rc != 0 for non-vulnerability reasons
File: factory/eval/hygiene.py, eval_security() — npm audit path
npm audit returns rc=1 for multiple reasons beyond "vulnerabilities found":
- `ENOLOCK`: no lockfile present
- `EAUDITNOLOCK`: audit with no lockfile
- Network errors
The PR's check if rc == 1 and "Command not found" in stderr only handles the literal FileNotFoundError case. The ENOLOCK case (returns valid JSON with {"error": {...}} but no "metadata" key) happens to fall through gracefully (count=0, no branch fires), but this is accidental. A project with package.json but no lockfile will silently have its Node security scanning skipped without any indication.
Consider checking for the npm error JSON pattern explicitly and logging it in details.
ISSUE 6: Weight rebalancing proportionality claim is inaccurate
PR description claims weights are "scaled proportionally." Let me check:
| Dimension | Before | After | Ratio (After/Before) |
|---|---|---|---|
| tests | 0.30 | 0.28 | 0.933 |
| coverage | 0.25 | 0.23 | 0.920 |
| lint | 0.15 | 0.14 | 0.933 |
| type_check | 0.10 | 0.09 | 0.900 |
| guard_patterns | 0.10 | 0.09 | 0.900 |
| config_parser | 0.10 | 0.09 | 0.900 |
If truly proportional (factor = 0.92 = 1 - 0.08), tests should be 0.276 (rounded to 0.28, OK), coverage should be 0.23 (OK), lint should be 0.138 (rounded to 0.14, OK), type_check should be 0.092 (rounded to 0.09, OK). The rounding means the proportionality is approximate but acceptable. The sum is exactly 1.0. No fix needed here, but the claim of "proportional" in the PR description should note it's approximate due to rounding.
This is not blocking, just noting it for accuracy.
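The arithmetic above can be checked mechanically. A small verification snippet, using the before/after values from the table (the 0.08 security weight comes from the PR description):

```python
# Old and new hygiene weights as listed in the review's table.
before = {"tests": 0.30, "coverage": 0.25, "lint": 0.15,
          "type_check": 0.10, "guard_patterns": 0.10, "config_parser": 0.10}
after = {"tests": 0.28, "coverage": 0.23, "lint": 0.14,
         "type_check": 0.09, "guard_patterns": 0.09, "config_parser": 0.09,
         "security": 0.08}

# The new weights sum to exactly 1.0 (rounding guards float noise).
assert round(sum(after.values()), 10) == 1.0

# Each old weight, scaled by 1 - 0.08 = 0.92 and rounded to two
# decimals, reproduces the new weight: "proportional up to rounding".
for name, old in before.items():
    assert after[name] == round(old * 0.92, 2), name
```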
NIT 1: Redundant local import json in eval_guard_patterns
File: factory/eval/hygiene.py line 373
The PR adds import json at the top level (line 18 in the diff), but line 373 still has a local import json inside eval_guard_patterns. Since the PR is already modifying this file and adding the top-level import, the local import should be removed.
NIT 2: Dimension weight comment format inconsistency
File: factory/eval/hygiene.py
Existing dimensions use comment format like:

```python
# ── Dimension 1: tests (weight 0.30) ──────────────────────────────
```

The PR's comment says:

```python
# ── Dimension 7: security (weight 0.08) ─────────────────────────
```

The trailing dash run is shorter (53 chars vs 58 chars for other section headers). All other dimension headers use the same length. Minor, but worth fixing per team policy on consistency.
NIT 3: Inconsistent detail string formatting
File: factory/eval/hygiene.py, eval_security()
For Python: f"{sp.name}: {count} issues"
For Node: f"{sp.name}(js): {count} vulnerabilities"
"issues" vs "vulnerabilities" — should pick one term for consistency, or better: "{count} security findings" for both. The existing lint dimension uses "errors" consistently across all languages.
Summary
| Category | Count |
|---|---|
| BLOCKER | 1 (bandit not-installed detection) |
| ISSUE | 5 (test gaps, stale docstrings, docs, silent failures, npm edge cases) |
| NIT | 3 (redundant import, comment format, terminology) |
The core idea is sound and the implementation follows the existing patterns reasonably well, but the "not installed" detection bug needs to be fixed (it would cause confusing behavior in mixed Python+Node projects), and the incomplete dimension count updates across test_runner.py, runner.py, and docs/ need to be addressed. All findings should be resolved before merge per team policy.
akashgit
left a comment
Code Review
Finding 1: Bandit "not installed" detection is broken — BLOCKER
In factory/eval/hygiene.py, the check for bandit not being installed looks for "Command not found" in stderr. But python -m bandit when bandit isn't installed produces "No module named bandit" — so this check never matches. The corresponding test test_bandit_not_installed mocks the wrong failure mode.
The end result happens to be correct by accident (falls through to neutral score), but breaks in mixed Python+Node projects where the Node path would still run but the Python score contribution is wrong.
Fix: Check for "No module named" in stderr instead of "Command not found".
Finding 2: Hardcoded counts in test_runner.py not updated — ISSUE
Multiple lines in test_runner.py (lines 10, 14, 26, 29, 40, 43, 64) still hardcode 11 dimensions — should be 12 now. Also missing assert "security" in names from test_always_has_mandatory_dimensions.
Finding 3: Docstrings in runner.py not updated — ISSUE
Lines 4, 97, 126, 244 in factory/eval/runner.py still reference "6 hygiene" and "11 mandatory" dimensions. These should say "7 hygiene" and "12 mandatory".
Finding 4: Documentation not updated — ISSUE
docs/architecture.md (line 104) and docs/eval.md (line 7) reference "6 mandatory dimensions" without security. Adding a new eval dimension should update all docs.
Finding 5: Silent failure on tool errors — ISSUE
Scanner crashes (rc=2, malformed JSON from npm audit) are silently dropped instead of being treated as failures. This is inconsistent with how eval_lint handles the same scenario. A tool that crashes should produce a score of 0.0 with a diagnostic message, not silently return a neutral score.
Finding 6: npm audit non-vulnerability errors swallowed — ISSUE
ENOLOCK and other npm errors are silently swallowed without any diagnostic output. At minimum, include the error in the details field so users can diagnose why security scoring returned neutral.
Finding 7: Redundant import — NIT
eval_guard_patterns has a local import json that's now redundant since the PR adds json at the top level.
Finding 8: Terminology inconsistency — NIT
Python path uses "issues" while Node path uses "vulnerabilities". Pick one term and use it consistently.
Verdict: Do not merge. The bandit detection bug (Finding 1) is a correctness blocker. The hardcoded counts, stale docstrings, and missing docs (Findings 2-4) mean the PR is incomplete — adding a new eval dimension requires updating all references to dimension counts. Silent failure handling (Findings 5-6) is inconsistent with existing patterns.
Hey Luke — really like the direction here. Security as a first-class eval dimension is something we've been wanting, and the bandit + npm audit combo is a solid starting point. That said, I think there's an opportunity to think bigger. Rather than just a single eval dimension that runs two tools, what if we designed this as a proper security audit mode — almost like a security guard agent that can:
Think of it like: the current eval dimensions measure "is this code healthy?" — a security mode would answer "is this code safe?" It could be lightweight by default (just the quick checks) but extensible for users who want deeper scans. This would need some planning around:
Would love to see you sketch out a design doc or proposal for this. The fixes from the review still apply to whatever shape this takes, so worth addressing those in the meantime. But I'd hold off on merging this as-is until we've thought through the broader security story — it'll be easier to get the API right the first time than to change it later. Great instinct bringing this up though — keep going with it!
Summary
Adds a 7th mandatory hygiene dimension: security scanning. Listed as a "Good First Issue" in the contributing guide.
- Python: runs `bandit -r . -f json -q`, counts issues from JSON output
- Node.js: runs `npm audit --json`, sums vulnerabilities across severity levels
- Score: `max(0.0, 1.0 - issue_count * 0.1)`

Changes

- `factory/eval/hygiene.py`: add `eval_security()`, rebalance `HYGIENE_WEIGHTS` (sum still 1.0), update `compute_hygiene_results()`
- `tests/eval/test_hygiene.py`: 7 new tests for the security dimension
- `tests/eval/test_runner.py`: updated dimension counts

Weight rebalancing

All weights scaled proportionally to accommodate the new dimension:

Test plan

- New tests pass (`uv run pytest tests/eval/test_hygiene.py -v`)
- Full suite passes (`uv run pytest tests/ -x -q`)
- Weights sum to 1.0 (`test_weights_sum_to_one`)
- Manually checked `compute_hygiene_results()` output
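The scoring rule above can be sketched as a one-liner (the function name is illustrative, not the PR's actual identifier):

```python
def security_score(issue_count: int) -> float:
    # Each security finding deducts 0.1 from a perfect 1.0; the score
    # floors at zero rather than going negative.
    return max(0.0, 1.0 - issue_count * 0.1)


print(security_score(0))   # 1.0
print(security_score(5))   # 0.5
print(security_score(15))  # 0.0 (floored; 10+ findings all score zero)
```

One consequence of the linear-with-floor shape: any sub-project with ten or more findings scores identically, so very noisy scanners saturate the dimension quickly.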