Skip to content

fix(input-handler): validate URLs against SSRF and add zip-slip protection#159

Merged
rng1995 merged 2 commits into
NVIDIA:mainfrom
mimran-khan:fix/input-handler-ssrf-validation
Jun 23, 2026
Merged

fix(input-handler): validate URLs against SSRF and add zip-slip protection#159
rng1995 merged 2 commits into
NVIDIA:mainfrom
mimran-khan:fix/input-handler-ssrf-validation

Conversation

@mimran-khan

Copy link
Copy Markdown
Contributor

Summary

The input handler accepts arbitrary URLs for git clone and file download without validating the target host. This enables Server-Side Request Forgery (SSRF) — an attacker could craft a skill referencing internal endpoints (cloud metadata at 169.254.169.254, internal services at 10.x.x.x, etc.) to exfiltrate data or probe the network.

Additionally, _extract_zip uses extractall() without path traversal checks, enabling zip-slip attacks where crafted archives write files outside the extraction directory.

Changes

SSRF Protection

  • Added _is_private_ip() to detect loopback, private, link-local, and reserved addresses
  • Added _validate_url_host() method enforcing allowlist + private IP check
  • Both _clone_git() and _download_file() validate before network access
  • Configurable module-level ALLOWED_GIT_HOSTS and ALLOWED_DOWNLOAD_HOSTS frozensets

Zip-Slip Prevention

  • _extract_zip() iterates all zip members and validates resolved paths stay within extraction directory
  • Raises ValueError with clear message on path traversal attempt

Allowlist Design

  • Git: github.com, gitlab.com, bitbucket.org
  • Download: above + raw.githubusercontent.com, huggingface.co
  • Enterprise deployments can extend these constants

Testing

23 new tests covering:

  • Private IP detection (127.0.0.1, ::1, 10.x, 172.16.x, 192.168.x, 169.254.x)
  • Public IP passes
  • Unresolvable hostnames blocked (fail-closed)
  • Internal git URLs rejected
  • Cloud metadata endpoints rejected
  • Allowed hosts succeed (with mocked subprocess/httpx)
  • Zip-slip attack blocked
  • Normal/nested zips extract correctly
  • Allowlist configuration assertions

Existing 6 input handler tests continue to pass.

Fixes #147

@wernerkasselman-au

Copy link
Copy Markdown
Contributor

Thanks for picking this up. I read the post-#159 input_handler.py and the SSRF gating does what it says on the two paths that matter: _validate_url_host is the first statement of both _clone_git and _download_file, and the path I was worried about (a bare http://169.254.169.254/latest/meta-data/... with no .git, which _is_file_url routes into _download_file) now raises on the allowlist branch before it touches the network, since 169.254.169.254 is not in ALLOWED_DOWNLOAD_HOSTS. Good.

One gap remains, and I think it is load-bearing. _download_file still builds httpx.Client(follow_redirects=True, timeout=30) (unchanged by the PR), and _validate_url_host only ever sees the initial url, so the allowlist is a first-hop check, not a per-hop one. A request to an allowlisted host (a github.com release or blob link legitimately 302s out to an object-store host, and any open-redirect on github.com / raw.githubusercontent.com / huggingface.co gives the same primitive) is followed by httpx to wherever the 3xx points with no second validation, so a redirect landing on http://169.254.169.254/... still reaches the metadata service and the body comes back to the scanner. I'd set follow_redirects=False on that client (we control the URLs we expect, so a redirect is already a smell), or keep redirects and rerun _validate_url_host on each hop through an httpx event hook. The first is simpler and I'd lean that way.

On the zip side I'd leave extractall as it is, and I would not treat it as a security gap: on CPython (I checked 3.13) zipfile.extractall strips . and .. components before writing and does not recreate symlinks (a symlink member comes out as a regular file holding the link text), so neither a ../ traversal nor a symlink entry escapes extract_dir. The added namelist() prefix check is belt-and-suspenders rather than load-bearing, which is fine to keep. One small note, zipfile.extractall has no filter= argument (that one is tarfile), so filter="data" would throw if anyone reaches for it here.

Happy to be wrong on the redirect point, push back with the hop you think revalidates if I missed one. Otherwise the allowlist and private-IP work is good and I'd like to see it land, with follow_redirects=False (or per-hop revalidation) folded in.

…ip-slip protection

Three security hardening changes to the input handler:

1. SSRF protection: Both git clone and file download now validate the
   target hostname against an allowlist of known-safe hosts (GitHub,
   GitLab, Bitbucket, HuggingFace). URLs resolving to private/loopback
   IPs are blocked regardless of hostname.

2. Zip-slip prevention: Archive extraction now validates every member
   path against the extraction directory to prevent path traversal
   attacks (e.g., ../../etc/passwd entries).

3. Configurable allowlists: ALLOWED_GIT_HOSTS and ALLOWED_DOWNLOAD_HOSTS
   are module-level constants that can be extended for enterprise
   deployments with internal forges.

Fixes NVIDIA#147
Set follow_redirects=False on the httpx download client. The URL
allowlist (_validate_url_host) only checked the initial request URL,
so a 3xx redirect from an allowed host to a private/metadata IP
bypassed SSRF protection. Since we control the expected download
URLs (GitHub raw, HuggingFace), legitimate redirects are not needed.

Add test asserting follow_redirects=False is passed to httpx.Client.

Credit to @wernerkasselman-au for identifying the redirect gap.

Signed-off-by: mimran-khan <mohammed_imran.khan@outlook.com>
@mimran-khan mimran-khan force-pushed the fix/input-handler-ssrf-validation branch from 1d2baef to f5305b9 Compare June 23, 2026 09:41
@mimran-khan

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @wernerkasselman-au — you're right about the redirect bypass, and the follow_redirects=False approach is the cleaner fix. Pushed:

  1. follow_redirects=False on the httpx download client — since we control the expected URLs (GitHub raw, GitLab, HuggingFace), a redirect is already a smell and we don't need to follow it.
  2. Test added asserting httpx.Client is constructed with follow_redirects=False.
  3. Zip-slip: agreed that CPython's zipfile.extractall already strips ../ — the namelist() prefix check is belt-and-suspenders as you noted, which is fine to keep. No filter= arg used (that's tarfile).

Branch has been rebased on latest main (5 other PRs merged upstream today).

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Approve — solid SSRF + zip-slip hardening that fails closed.

What's good

  • _validate_url_host (src/skillspector/input_handler.py, ~L69-88) enforces an exact-or-subdomain host allowlist and rejects hosts resolving to private/loopback/link-local/reserved IPs; _is_private_ip (~L34-49) fails closed on resolution failure (returns True). Applied to both _clone_git and _download_file.
  • Zip-slip guard checks each member before extractall (~L117-123).
  • Nice test coverage incl. the cloud-metadata endpoint 169.254.169.254.

Non-blocking nits

  • Zip-slip check uses str(member_path).startswith(str(extract_dir.resolve())) (L119) — a string-prefix match that can false-pass a sibling dir like <extract_dir>-evil. Prefer Path.is_relative_to. (No actual escape today since zipfile.extractall also strips .., but the check as written is weaker than it looks.)
  • _is_git_url still uses substring membership (any(allowed in host ...), ~L60-61), so github.com.evil.com is classified as a git URL — though the strict _validate_url_host blocks it downstream, so no exploit.

Coordination

This overlaps #164 (size bounds) and #163/#116 — all rewrite _download_file/_extract_zip. #159 adds SSRF but not size bounds; #164 adds size bounds but not SSRF. Whichever merges second must integrate the other's protection so the result has both SSRF allowlisting and ingest caps.

Tests

Good: private-IP detection, blocked internal/metadata/arbitrary hosts, allowed forges, zip-slip blocked + normal/nested zips fine. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] Zip Slip path traversal and SSRF via permissive Git URL validation in InputHandler

3 participants