fix(report): strip ANSI/control bytes from report output (closes #186) #187
assinchu wants to merge 1 commit into
rng1995 left a comment
[Automated SkillSpector Review]
Approved.
Sanitizes ANSI escape/control bytes from finding free-text fields once in the report node so terminal/JSON/Markdown/SARIF output all stay clean UTF-8 — a sensible single chokepoint.
Verified the correctness risk in _sanitize_finding: it does getattr(finding, f) + dataclasses.replace(finding, **{...}) for message, explanation, remediation, finding, context, matched_text, code_snippet. I checked models.Finding and all seven are real init fields with defaults, so replace() won't raise. _ANSI_RE + _CONTROL_RE together strip ESC fully (any residual ESC is caught by the control-char class) while preserving tab/newline/CR and multibyte UTF-8. Tests cover the helper plus all four output formats.
Non-blocking: GitHub reports this branch as having a merge conflict (mergeable_state: dirty) — please rebase before merge.
Scanned skill content (and LLM output quoting it) can carry ANSI escape sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim, these make a report register as binary: GitLab/editors offer 'download' instead of rendering the Markdown, and terminals print garbled output.
Sanitize every finding's free-text fields once in the report node (the single scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers) are preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Aravinda Sharma <7734009+assinchu@users.noreply.github.com>
6e8346b to 23347bf
Summary
Fixes #186. Scanned content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim,
these make a report register as binary — GitLab/editors offer "download"
instead of rendering the Markdown, and terminals print garbled output.
Change
Sanitize every finding's free-text fields once in the report node (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay
clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers)
are preserved. Non-text fields and counts are untouched.
_clean_text / _sanitize_finding + _ANSI_RE / _CONTROL_RE in report.py
Applied to filtered_findings at the top of report(), before scoring/format
Tests
New tests/nodes/test_report_sanitizer.py: unit tests for _clean_text / _sanitize_finding, plus a parametrized check that no \x00 / \x1b leaks into any of the four output formats while readable content survives. Full suite green; ruff check / format clean.
This is distinct from #144 (binary input files); it sanitizes the output.
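A check of the kind described under Tests might look roughly like the sketch below. It is not the contents of tests/nodes/test_report_sanitizer.py: `render`, `check_format`, `FORMATS`, and the combined `_STRIP_RE` pattern are hypothetical stand-ins for the real formatters and helpers, and a plain loop replaces pytest parametrization so the example runs with the standard library alone.

```python
import json
import re

# Assumed combined pattern: CSI sequences first, then any leftover C0 control
# or DEL except tab, newline, and carriage return.
_STRIP_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

FORMATS = ("terminal", "json", "markdown", "sarif")


def render(fmt: str, message: str) -> str:
    """Hypothetical stand-in for the real formatters: sanitize, then render."""
    text = _STRIP_RE.sub("", message)
    if fmt == "json":
        return json.dumps({"message": text}, ensure_ascii=False)
    if fmt == "sarif":
        result = {"message": {"text": text}}
        return json.dumps({"runs": [{"results": [result]}]}, ensure_ascii=False)
    if fmt == "markdown":
        return f"- {text}\n"
    return f"{text}\n"  # terminal


def check_format(fmt: str) -> None:
    dirty = "\x1b[31mprompt\x00 injection\x1b[0m \U0001F534"
    output = render(fmt, dirty)
    # No NUL or ESC may leak, in raw or JSON-escaped form.
    for leak in ("\x00", "\x1b", "\\u0000", "\\u001b"):
        assert leak not in output, (fmt, leak)
    # Readable content, including the multibyte marker, must survive.
    assert "prompt injection" in output, fmt
    assert "\U0001F534" in output, fmt


for fmt in FORMATS:
    check_format(fmt)
```

Under pytest, the loop would typically become `@pytest.mark.parametrize("fmt", FORMATS)` on `check_format`, so each output format is reported as its own test case.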