fix(report): strip ANSI/control bytes from report output (closes #186) #187

Open
assinchu wants to merge 1 commit into NVIDIA:main from assinchu:feature/report-sanitizer


@assinchu

Contributor

Summary

Fixes #186. Scanned content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted verbatim,
these make a report register as binary: GitLab/editors offer "download"
instead of rendering the Markdown, and terminals print garbled output.

Change

Sanitize every finding's free-text fields once in the report node (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all stay
clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity markers)
are preserved. Non-text fields and counts are untouched.

  • _clean_text / _sanitize_finding + _ANSI_RE / _CONTROL_RE in report.py
  • Applied to filtered_findings at the top of report() before scoring/format
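
A minimal sketch of the two-pass cleaning described above, for illustration only. The actual `_ANSI_RE` / `_CONTROL_RE` patterns in report.py are not shown in this description, so the patterns and the names `ANSI_RE`, `CONTROL_RE`, and `clean_text` below are assumptions: ANSI sequences are removed first, then any remaining control characters except tab, newline, and CR.

```python
import re

# Assumed patterns, not the PR's actual ones.
# CSI sequences (ESC [ params intermediates final byte) and OSC sequences
# (ESC ] ... terminated by BEL or by ESC \).
ANSI_RE = re.compile(
    r"\x1b\[[0-?]*[ -/]*[@-~]"
    r"|\x1b\][^\x07\x1b]*(?:\x07|\x1b\\)"
)
# C0 controls except tab (\x09), newline (\x0a), and CR (\x0d), plus DEL.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text: str) -> str:
    """Drop ANSI escape sequences, then any leftover control characters."""
    return CONTROL_RE.sub("", ANSI_RE.sub("", text))


raw = "\x1b[31mHIGH\x1b[0m \U0001F534 key\tvalue\nnext\x00line\x1b"
print(repr(clean_text(raw)))
```

Running the escape-sequence pass first matters: if only the control-character pass ran, the ESC byte would be removed but the visible remainder of each sequence (for example `[31m`) would stay in the text.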

Tests

New tests/nodes/test_report_sanitizer.py: unit tests for _clean_text /
_sanitize_finding, plus a parametrized check that no \x00/\x1b leaks into
any of the four output formats while readable content survives. Full suite green;
ruff check/format clean.
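
The leak check could look roughly like the sketch below. It is not the contents of tests/nodes/test_report_sanitizer.py: the renderers are hypothetical stand-ins for three of the output formats, `clean` is an assumed sanitizer, and a plain loop is used in place of pytest parametrization so the block runs with the standard library alone.

```python
import json
import re

FORBIDDEN = ("\x00", "\x1b")

# Assumed sanitizer: CSI sequences, then C0 controls other than \t, \n, \r.
_STRIP_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]|[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean(text: str) -> str:
    return _STRIP_RE.sub("", text)


# Hypothetical renderers: name -> (render, decode back to visible text).
RENDERERS = {
    "terminal": (lambda m: f"[finding] {m}", lambda out: out),
    "markdown": (lambda m: f"- **finding**: {m}\n", lambda out: out),
    # json.dumps escapes control characters as \uXXXX, so decode before checking.
    "json": (
        lambda m: json.dumps({"message": m}),
        lambda out: json.loads(out)["message"],
    ),
}


def leaked(raw_message: str, sanitize: bool = True) -> dict:
    """Return {format name: [forbidden characters found]} for one message."""
    message = clean(raw_message) if sanitize else raw_message
    report = {}
    for name, (render, decode) in RENDERERS.items():
        visible = decode(render(message))
        report[name] = [c for c in FORBIDDEN if c in visible]
    return report


raw = "\x1b[31msecret\x1b[0m\x00 ok \U0001F534"
assert all(found == [] for found in leaked(raw).values())
assert all(found == ["\x00", "\x1b"] for found in leaked(raw, sanitize=False).values())
```

The JSON stand-in decodes the payload before checking because the serializer escapes control characters, so a raw-byte search on the serialized string would pass even for unsanitized input.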

This is distinct from #144 (binary input files); it sanitizes the output.

@rng1995 left a comment

Collaborator


[Automated SkillSpector Review]

Approved.

Sanitizes ANSI escape/control bytes from finding free-text fields once in the report node so terminal/JSON/Markdown/SARIF output all stay clean UTF-8, a sensible single chokepoint.

Verified the main correctness risk in _sanitize_finding: it reads each field with getattr(finding, f) and rebuilds the finding with dataclasses.replace(finding, **{...}) for message, explanation, remediation, finding, context, matched_text, and code_snippet. I checked models.Finding, and all seven are real init fields with defaults, so replace() won't raise. _ANSI_RE and _CONTROL_RE together strip ESC fully (any residual ESC is caught by the control-character class) while preserving tab, newline, CR, and multibyte UTF-8. Tests cover the helper plus all four output formats.
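
To illustrate why the init-field check matters, here is a small sketch of how `dataclasses.replace` behaves. `DemoFinding` is a hypothetical stand-in, not the real models.Finding, and `derived_id` and `replace_error` are invented for the example: replacing an ordinary init field succeeds, while naming a field declared with `init=False`, or a field that does not exist, raises.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DemoFinding:
    # Hypothetical stand-in for models.Finding; field names are illustrative.
    message: str = ""
    matched_text: str = ""
    derived_id: str = field(default="", init=False)  # not an __init__ parameter


def replace_error(finding, **changes):
    """Return the exception type name replace() raises, or None on success."""
    try:
        replace(finding, **changes)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


original = DemoFinding(message="bad\x00byte", matched_text="ok")
# replace() builds a new instance, so it also works on frozen dataclasses.
cleaned = replace(original, message=original.message.replace("\x00", ""))

print(repr(cleaned.message))
print(replace_error(original, message="x"))        # init field: no error
print(replace_error(original, derived_id="x"))     # init=False field: error
print(replace_error(original, no_such_field="x"))  # unknown field: error
```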

Non-blocking: GitHub reports this branch as having a merge conflict (mergeable_state: dirty); please rebase before merge.

Scanned skill content (and LLM output quoting it) can carry ANSI escape
sequences and control bytes (NUL, ESC, ...) into finding text. Emitted
verbatim, these make a report register as binary: GitLab/editors offer
'download' instead of rendering the Markdown, and terminals print garbled
output.

Sanitize every finding's free-text fields once in the report node (the single
scoring/formatting point), so terminal, JSON, Markdown, and SARIF output all
stay clean UTF-8. Tabs, newlines, and multibyte UTF-8 (e.g. emoji severity
markers) are preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Aravinda Sharma <7734009+assinchu@users.noreply.github.com>
@assinchu force-pushed the feature/report-sanitizer branch from 6e8346b to 23347bf on June 24, 2026 14:03

Development

Successfully merging this pull request may close these issues.

[BUG] Reports containing scanned content with ANSI/control bytes are emitted as binary (break Markdown rendering)
