Detect whitespace padding used to hide prompt-injection instructions (P9)#24
Detect whitespace padding used to hide prompt-injection instructions (P9)#24korjavin wants to merge 14 commits into
Conversation
Plan for issue NVIDIA#20 — detect large whitespace padding used to hide prompt-injection instructions from review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rng1995
left a comment
There was a problem hiding this comment.
This is a high-quality addition. I read the detector end to end and the design is sound for an anti-evasion rule.
Strengths
- Detection is genuinely Unicode-aware:
is_padding_charcovers ASCII controls, theZs/Zl/Zpcategories, and the explicit zero-width family, rather than ASCII space/tab only — that closes the obvious bypasses._split_linestreating CR/CRLF/U+2028/U+2029/U+0085 as boundaries (and the offset table with the trailing sentinel) keeps char-offset math correct, and the boundary off-by-one cases are tested. - All scanning is linear (
finditeron a simple alternation, anchored per-line fence match, char/line while-loops), so there's no ReDoS exposure on attacker-controlled content. The only regex change to P2/the MCP check is swapping a literal class for one built from the sharedZERO_WIDTH_CHARSset — same/expanded code points, still a plain character class. - Folding the zero-width set into one shared constant is a real improvement: it prevents P2 and the
mcp_tool_poisoninghidden-text check from drifting, and it correctly extends the MCP check to U+2060/U+FEFF. - Sensible false-positive handling: fenced-code skipping for the horizontal signal, vendored/generated globs (P9-only), the ratio signal capped at LOW, dedup so one oversized span doesn't triple-report, and severity/confidence weighting per signal. Thresholds are named constants, which makes corpus tuning trivial.
- Tests are thorough: each evasion code point inline and vertical, the MCP description path, threshold boundaries, CRLF/LF offsets, block/ratio dedup, fence and skip-glob guards, and the binary bailout.
Security — important, non-blocking: the U+FFFD bailout is itself an evasion vector
detect_whitespace_padding returns [] if _REPLACEMENT_CHAR (U+FFFD) appears anywhere in the content. Since files are read with errors="replace", that's a reasonable guard against truly binary blobs — but a U+FFFD is also a perfectly valid character an author can embed in a UTF-8 SKILL.md. So an adversary can drop a single U+FFFD into the file and disable P9 for the entire file, then pad freely — defeating exactly the rule being added. It's mitigated (the injected text is still subject to P1–P4 and the LLM pass, and a stray U+FFFD in a manifest is itself anomalous), so I'm not blocking on it, but I'd recommend hardening before relying on P9 as a guardrail: make the binary heuristic proportion-based (e.g. bail only when U+FFFD density exceeds some fraction), or strip/ignore U+FFFD and still scan the remainder. A regression test for "embedded U+FFFD must not suppress an otherwise-detected pad" would lock it in.
Minor / optional
- Completeness: U+0085 (NEL) and U+180E aren't treated as padding chars. NEL is still caught vertically because it splits lines, but a horizontal/block run built from NEL or U+180E would slip past the in-line/block signals. Low priority given how niche these are.
- A region that is simultaneously a vertical gap and has 80+ char horizontal runs per line can emit both a vertical finding and per-line horizontal findings (the dedup only suppresses block/ratio against primaries). Minor noise, not incorrect.
- Micro-perf:
is_padding_charcallsunicodedata.categoryper character and the block/ratio pass re-encodes per char; fine under the upstream 1 MB cap, but an ASCII fast-path would cut the common case.
Net: correct, well-guarded, well-tested, and a real improvement to the prompt-injection surface. Approving — please consider tightening the U+FFFD bailout as a near-term follow-up.
|
@korjavin - Please resolve the conflicts, minor issues and merge the PR. |
Resolve conflicts: - README.md: pattern/category counts. Base was 64/16; this branch adds P9 (whitespace padding, +1) and main adds anti-refusal (+1 category, +4 patterns), so the merged totals are 69 patterns / 17 categories. Verified against the per-category counts in the detail table. - static_patterns_prompt_injection.py: keep P2's zero-width class built from the shared ZERO_WIDTH_CHARS constant (this branch) AND main's new bidi-control pattern [U+202A-U+202E,U+2066-U+2069]. The shared constant expands to exactly main's [U+200B U+200C U+200D U+2060 U+FEFF] class, so no code points are lost. Claude-Session: https://claude.ai/code/session_012ng7GJhfjXehRtPhQuAHsN
Resolve the minor issues raised in PR review: - Security (the U+FFFD bailout was itself an evasion vector): bail only when U+FFFD *density* exceeds a threshold (0.30) instead of when a single U+FFFD is present anywhere. A lone replacement char in an otherwise-textual file can no longer disable P9 for the whole file and let an attacker pad freely. Genuine binary blobs (mostly U+FFFD) still bail. Adds a regression test plus a high-density bail test. - Completeness: treat U+0085 (NEL) and U+180E as padding chars. Both fall outside Zs/Zl/Zp, so a horizontal/block run built from them previously slipped past the in-line/block signals. Adds unit + end-to-end coverage. - Micro-perf: ASCII fast-path in is_padding_char so the common case skips the per-char unicodedata.category() call. The remaining "vertical + horizontal can both report on one region" note is left as-is: the reviewer flagged it as minor noise, not incorrect, and the dedup change would risk regressions for no correctness gain. Claude-Session: https://claude.ai/code/session_012ng7GJhfjXehRtPhQuAHsN
|
Rebased/merged Conflicts resolved
Review feedback addressed
Left as-is: the "a region can emit both a vertical and per-line horizontal finding" note — it's minor noise and not incorrect, and reworking the dedup risks regressions for no correctness gain. Happy to follow up if you'd prefer it suppressed. All P9/static-pattern/MCP tests pass locally (149 passed). The CI run shows |
Adds rule P9 "Whitespace Padding" under Prompt Injection, for issue #20. It detects padding that pushes injected instructions out of a reviewer's view while the agent still reads them.
P6 through P8 were taken by System Prompt Leakage, so this uses P9. One id covers all three signals; confidence carries the weighting.
Signals (all reported as P9):
Whitespace is classified by Unicode category rather than ASCII space/tab: controls (
\t \n \r \v \f), categories Zs/Zl/Zp (U+00A0, U+2028, U+2029, U+3000, and so on), and the zero-width family (U+200B/C/D, U+2060, U+FEFF). That zero-width set is now one shared constant (ZERO_WIDTH_CHARS) used by P2's regex and themcp_tool_poisoningzero-width check, so the two cannot drift; the MCP check also picks up U+2060/U+FEFF.Each finding points at the line where padding starts and includes a visible snippet of what was hidden (for example
U+00A0 x82or\n x80).False-positive guards: markdown fenced code is skipped for the horizontal signal; vendored files are skipped (
*.min.js,*.min.css,*.lock,package-lock.json,yarn.lock,*.svg,*.map); binary-ish content (containing U+FFFD) bails out; the ratio signal stays at LOW. Eval-dataset prose and files over 1 MB are already skipped upstream.MCP manifest description fields are covered by the same detector (horizontal and block signals; the per-file ratio signal is skipped since fields are short).
Tests cover all three signals at their thresholds, the full Unicode evasion set, the false-positive guards, and the MCP path. Thresholds are named constants, easy to tune against a real corpus. Happy to adjust the signals or thresholds before merge.