fix(report): strip ANSI/control bytes from report output (closes #186) by assinchu · Pull Request #187 · NVIDIA/SkillSpector

assinchu · 2026-06-24T01:11:25Z

Summary

Fixes #186. Scanned content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim,
these bytes make tools treat the report as binary: GitLab and editors offer
"download" instead of rendering the Markdown, and terminals print garbled output.

Change

Sanitize every finding's free-text fields once in the report node (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay
clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers)
are preserved. Non-text fields and counts are untouched.

Adds _clean_text / _sanitize_finding and the _ANSI_RE / _CONTROL_RE patterns in report.py.
Applied to filtered_findings at the top of report(), before scoring and formatting.
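
As a rough illustration of the text-cleaning step described above, here is a minimal sketch. The two regular expressions are assumptions written for this example (the PR's actual _ANSI_RE and _CONTROL_RE are not shown on this page), and clean_text is a hypothetical stand-in for _clean_text.

```python
import re

# Assumed pattern: CSI sequences such as "\x1b[31m"
# (ESC, "[", parameter bytes, intermediate bytes, one final byte).
_ANSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

# Assumed pattern: C0 control characters and DEL, except
# tab (\x09), newline (\x0a), and carriage return (\x0d).
_CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text: str) -> str:
    """Drop ANSI escape sequences, then any remaining control characters."""
    without_ansi = _ANSI_RE.sub("", text)
    # A lone ESC that was not part of a full sequence is removed here.
    return _CONTROL_RE.sub("", without_ansi)


sample = "\x1b[31mHIGH\x1b[0m\x00 risk\t\u2705\n"
cleaned = clean_text(sample)
```

Running the ANSI pattern first removes whole sequences; the control-character class then catches whatever is left, so tabs, newlines, and multibyte characters pass through unchanged.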

Tests

New tests/nodes/test_report_sanitizer.py: unit tests for _clean_text /
_sanitize_finding, plus a parametrized check that no \x00/\x1b leaks into
any of the four output formats while readable content survives. Full suite green;
ruff check/format clean.

This is distinct from #144 (binary input files); it sanitizes the output.

rng1995

[Automated SkillSpector Review]

Approved.

Sanitizes ANSI escape/control bytes from finding free-text fields once in the report node so terminal/JSON/Markdown/SARIF output all stay clean UTF-8 — a sensible single chokepoint.

Verified the correctness risk in _sanitize_finding: it does getattr(finding, f) + dataclasses.replace(finding, **{...}) for message, explanation, remediation, finding, context, matched_text, code_snippet. I checked models.Finding and all seven are real init fields with defaults, so replace() won't raise. _ANSI_RE + _CONTROL_RE together strip ESC fully (any residual ESC is caught by the control-char class) while preserving tab/newline/CR and multibyte UTF-8. Tests cover the helper plus all four output formats.
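
The getattr + dataclasses.replace pattern checked above can be sketched roughly as follows. The Finding class here is a hypothetical stand-in with the seven text fields named in the review plus two illustrative non-text fields (severity, line); the real models.Finding may differ, and the single combined regex is an assumption for brevity.

```python
import dataclasses
import re
from dataclasses import dataclass

# Field names taken from the review comment above.
_TEXT_FIELDS = (
    "message", "explanation", "remediation", "finding",
    "context", "matched_text", "code_snippet",
)

# Assumed combined pattern: CSI sequences, or any control character
# other than tab, newline, and carriage return.
_UNSAFE_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


@dataclass(frozen=True)
class Finding:  # hypothetical stand-in for models.Finding
    severity: str = "info"  # illustrative non-text field
    line: int = 0           # illustrative non-text field
    message: str = ""
    explanation: str = ""
    remediation: str = ""
    finding: str = ""
    context: str = ""
    matched_text: str = ""
    code_snippet: str = ""


def sanitize_finding(item: Finding) -> Finding:
    """Return a copy with every free-text field cleaned; other fields untouched."""
    updates = {}
    for name in _TEXT_FIELDS:
        value = getattr(item, name)
        if isinstance(value, str):  # leave None or other types as they are
            updates[name] = _UNSAFE_RE.sub("", value)
    # replace() accepts these names because each is an init field.
    return dataclasses.replace(item, **updates)


raw = Finding(severity="high", line=3, message="\x1b[1mbad\x1b[0m", matched_text="rm\x00 -rf")
safe = sanitize_finding(raw)
```

Because replace() builds a new instance, the original finding is left unmodified, which also works if the dataclass is frozen.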

Non-blocking: GitHub reports this branch as having a merge conflict (mergeable_state: dirty) — please rebase before merge.

Scanned skill content (and LLM output quoting it) can carry ANSI escape sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim, these make a report register as binary: GitLab/editors offer 'download' instead of rendering the Markdown, and terminals print garbled output.

Sanitize every finding's free-text fields once in the report node (the single scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers) are preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Aravinda Sharma <7734009+assinchu@users.noreply.github.com>

rng1995 approved these changes Jun 24, 2026

assinchu force-pushed the feature/report-sanitizer branch from 6e8346b to 23347bf Compare June 24, 2026 14:03

assinchu wants to merge 1 commit into NVIDIA:main from assinchu:feature/report-sanitizer
