Skip to content

Detect whitespace padding used to hide prompt-injection instructions (P9)#24

Open
korjavin wants to merge 14 commits into
NVIDIA:mainfrom
korjavin:feat/detect-whitespace-padding-injection
Open

Detect whitespace padding used to hide prompt-injection instructions (P9)#24
korjavin wants to merge 14 commits into
NVIDIA:mainfrom
korjavin:feat/detect-whitespace-padding-injection

Conversation

@korjavin

Copy link
Copy Markdown

Adds rule P9 "Whitespace Padding" under Prompt Injection, for issue #20. It detects padding that pushes injected instructions out of a reviewer's view while the agent still reads them.

P6 through P8 were taken by System Prompt Leakage, so this uses P9. One id covers all three signals; confidence carries the weighting.

Signals (all reported as P9):

  1. Vertical: 20+ consecutive blank or whitespace-only lines. MEDIUM, raised to HIGH when content follows a gap of 40+ lines. Confidence 0.8 with trailing content, 0.6 without.
  2. Horizontal: 80+ consecutive whitespace characters in a line, including leading indentation. MEDIUM, confidence 0.7.
  3. Ratio: a contiguous whitespace block over 2 KB, or whitespace over 90% of a file larger than 4 KB. LOW, confidence 0.4.

Whitespace is classified by Unicode category rather than ASCII space/tab: controls (\t \n \r \v \f), categories Zs/Zl/Zp (U+00A0, U+2028, U+2029, U+3000, and so on), and the zero-width family (U+200B/C/D, U+2060, U+FEFF). That zero-width set is now one shared constant (ZERO_WIDTH_CHARS) used by P2's regex and the mcp_tool_poisoning zero-width check, so the two cannot drift; the MCP check also picks up U+2060/U+FEFF.

Each finding points at the line where padding starts and includes a visible snippet of what was hidden (for example U+00A0 x82 or \n x80).

False-positive guards: markdown fenced code is skipped for the horizontal signal; vendored files are skipped (*.min.js, *.min.css, *.lock, package-lock.json, yarn.lock, *.svg, *.map); binary-ish content (containing U+FFFD) bails out; the ratio signal stays at LOW. Eval-dataset prose and files over 1 MB are already skipped upstream.

MCP manifest description fields are covered by the same detector (horizontal and block signals; the per-file ratio signal is skipped since fields are short).

Tests cover all three signals at their thresholds, the full Unicode evasion set, the false-positive guards, and the MCP path. Thresholds are named constants, easy to tune against a real corpus. Happy to adjust the signals or thresholds before merge.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a high-quality addition. I read the detector end to end and the design is sound for an anti-evasion rule.

Strengths

  • Detection is genuinely Unicode-aware: is_padding_char covers ASCII controls, the Zs/Zl/Zp categories, and the explicit zero-width family, rather than ASCII space/tab only — that closes the obvious bypasses. _split_lines treating CR/CRLF/U+2028/U+2029/U+0085 as boundaries (and the offset table with the trailing sentinel) keeps char-offset math correct, and the boundary off-by-one cases are tested.
  • All scanning is linear (finditer on a simple alternation, anchored per-line fence match, char/line while-loops), so there's no ReDoS exposure on attacker-controlled content. The only regex change to P2/the MCP check is swapping a literal class for one built from the shared ZERO_WIDTH_CHARS set — same/expanded code points, still a plain character class.
  • Folding the zero-width set into one shared constant is a real improvement: it prevents P2 and the mcp_tool_poisoning hidden-text check from drifting, and it correctly extends the MCP check to U+2060/U+FEFF.
  • Sensible false-positive handling: fenced-code skipping for the horizontal signal, vendored/generated globs (P9-only), the ratio signal capped at LOW, dedup so one oversized span doesn't triple-report, and severity/confidence weighting per signal. Thresholds are named constants, which makes corpus tuning trivial.
  • Tests are thorough: each evasion code point inline and vertical, the MCP description path, threshold boundaries, CRLF/LF offsets, block/ratio dedup, fence and skip-glob guards, and the binary bailout.

Security — important, non-blocking: the U+FFFD bailout is itself an evasion vector
detect_whitespace_padding returns [] if _REPLACEMENT_CHAR (U+FFFD) appears anywhere in the content. Since files are read with errors="replace", that's a reasonable guard against truly binary blobs — but a U+FFFD is also a perfectly valid character an author can embed in a UTF-8 SKILL.md. So an adversary can drop a single U+FFFD into the file and disable P9 for the entire file, then pad freely — defeating exactly the rule being added. It's mitigated (the injected text is still subject to P1–P4 and the LLM pass, and a stray U+FFFD in a manifest is itself anomalous), so I'm not blocking on it, but I'd recommend hardening before relying on P9 as a guardrail: make the binary heuristic proportion-based (e.g. bail only when U+FFFD density exceeds some fraction), or strip/ignore U+FFFD and still scan the remainder. A regression test for "embedded U+FFFD must not suppress an otherwise-detected pad" would lock it in.

Minor / optional

  • Completeness: U+0085 (NEL) and U+180E aren't treated as padding chars. NEL is still caught vertically because it splits lines, but a horizontal/block run built from NEL or U+180E would slip past the in-line/block signals. Low priority given how niche these are.
  • A region that is simultaneously a vertical gap and has 80+ char horizontal runs per line can emit both a vertical finding and per-line horizontal findings (the dedup only suppresses block/ratio against primaries). Minor noise, not incorrect.
  • Micro-perf: is_padding_char calls unicodedata.category per character and the block/ratio pass re-encodes per char; fine under the upstream 1 MB cap, but an ASCII fast-path would cut the common case.

Net: correct, well-guarded, well-tested, and a real improvement to the prompt-injection surface. Approving — please consider tightening the U+FFFD bailout as a near-term follow-up.

@rng1995

rng1995 commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

@korjavin - Please resolve the conflicts, minor issues and merge the PR.

korjavin added 2 commits June 24, 2026 21:25
Resolve conflicts:
- README.md: pattern/category counts. Base was 64/16; this branch adds
  P9 (whitespace padding, +1) and main adds anti-refusal (+1 category,
  +4 patterns), so the merged totals are 69 patterns / 17 categories.
  Verified against the per-category counts in the detail table.
- static_patterns_prompt_injection.py: keep P2's zero-width class built
  from the shared ZERO_WIDTH_CHARS constant (this branch) AND main's new
  bidi-control pattern [U+202A-U+202E,U+2066-U+2069]. The shared constant
  expands to exactly main's [U+200B U+200C U+200D U+2060 U+FEFF] class, so
  no code points are lost.

Claude-Session: https://claude.ai/code/session_012ng7GJhfjXehRtPhQuAHsN
Resolve the minor issues raised in PR review:

- Security (the U+FFFD bailout was itself an evasion vector): bail only
  when U+FFFD *density* exceeds a threshold (0.30) instead of when a
  single U+FFFD is present anywhere. A lone replacement char in an
  otherwise-textual file can no longer disable P9 for the whole file and
  let an attacker pad freely. Genuine binary blobs (mostly U+FFFD) still
  bail. Adds a regression test plus a high-density bail test.
- Completeness: treat U+0085 (NEL) and U+180E as padding chars. Both fall
  outside Zs/Zl/Zp, so a horizontal/block run built from them previously
  slipped past the in-line/block signals. Adds unit + end-to-end coverage.
- Micro-perf: ASCII fast-path in is_padding_char so the common case skips
  the per-char unicodedata.category() call.

The remaining "vertical + horizontal can both report on one region" note
is left as-is: the reviewer flagged it as minor noise, not incorrect, and
the dedup change would risk regressions for no correctness gain.

Claude-Session: https://claude.ai/code/session_012ng7GJhfjXehRtPhQuAHsN
@korjavin

Copy link
Copy Markdown
Author

Rebased/merged main and addressed the review feedback. Pushed as a merge commit + a follow-up fix commit.

Conflicts resolved

  • README.md — pattern/category counts. Base was 64/16; this branch adds P9 (+1 pattern) and main adds anti-refusal (+1 category, +4 patterns), so the merged totals are 69 patterns / 17 categories (verified against the per-category counts in the detail table).
  • static_patterns_prompt_injection.py (P2) — kept this branch's zero-width class built from the shared ZERO_WIDTH_CHARS constant and main's new bidi-control pattern [U+202A–U+202E, U+2066–U+2069]. The shared constant expands to exactly main's [U+200B U+200C U+200D U+2060 U+FEFF], so no code points are lost.

Review feedback addressed

  • U+FFFD bailout (the important one) is no longer an evasion vector. P9 now bails only when U+FFFD density exceeds a threshold (0.30) rather than when a single U+FFFD appears anywhere — a lone replacement char can no longer disable P9 for the whole file. Genuine binary blobs (mostly U+FFFD) still bail. Added the regression test you suggested ("embedded U+FFFD must not suppress an otherwise-detected pad") plus a high-density bail test.
  • Completeness: U+0085 (NEL) and U+180E are now treated as padding chars, so horizontal/block runs built from them are caught. Added unit + end-to-end coverage.
  • Micro-perf: added an ASCII fast-path in is_padding_char so the common case skips the per-char unicodedata.category() call.

Left as-is: the "a region can emit both a vertical and per-line horizontal finding" note — it's minor noise and not incorrect, and reworking the dedup risks regressions for no correctness gain. Happy to follow up if you'd prefer it suppressed.

All P9/static-pattern/MCP tests pass locally (149 passed). The CI run shows action_required — it needs a maintainer to approve the workflow run for this fork PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants