feat add fetch_url_tool so AI chat can read direct URLs#7328
feat add fetch_url_tool so AI chat can read direct URLs#7328krushnarout wants to merge 8 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR adds
Confidence Score: 4/5Safe to merge after fixing relative redirect resolution — without it the tool silently errors on any website that issues a same-origin or protocol-relative redirect. The redirect handler assigns the raw Location header value directly to url without resolving it against the current base URL. Relative locations (/new-path, //cdn.example.com/) are very common in production sites and will all fail the scheme-prefix guard on the next loop iteration, surfacing an opaque error to the user. Everything else in the PR — SSRF guard, memory cap, isolated connection pool, sanitized logging — is correctly implemented. backend/utils/retrieval/tools/web_tools.py — specifically the redirect-following logic in _fetch_page Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant Claude
participant fetch_url_tool
participant _fetch_page
participant DNS
participant ExternalServer
User->>Claude: Message with URL
Claude->>fetch_url_tool: call(url)
fetch_url_tool->>_fetch_page: url, headers
loop up to 5 redirects
_fetch_page->>DNS: getaddrinfo(hostname)
DNS-->>_fetch_page: resolved IPs
_fetch_page->>_fetch_page: check private IP block-list
_fetch_page->>ExternalServer: "GET url (stream, follow_redirects=False)"
ExternalServer-->>_fetch_page: status + headers + body (capped 512 KB)
alt redirect (3xx)
_fetch_page->>_fetch_page: urljoin(current_url, Location)
else 200 OK
_fetch_page-->>fetch_url_tool: status, content_type, body
end
end
fetch_url_tool->>fetch_url_tool: _html_to_text(body) truncate to 8000 chars
fetch_url_tool-->>Claude: readable text
Claude-->>User: Summary / answer
Reviews (2): Last reviewed commit: "fix extract meta/OG tags and add URL fet..." | Re-trigger Greptile |
| if not url.startswith(('http://', 'https://')): | ||
| return "Error: URL must start with http:// or https://" | ||
|
|
||
| try: | ||
| client = get_webhook_client() | ||
| headers = { | ||
| 'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)', | ||
| 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', | ||
| 'Accept-Language': 'en-US,en;q=0.5', | ||
| } | ||
| response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True) |
There was a problem hiding this comment.
SSRF — no protection against private/metadata addresses
The only URL guard is a scheme prefix check (startswith('http://', 'https://')). Any authenticated user can craft a request to cloud metadata services (http://169.254.169.254/latest/meta-data/iam/security-credentials/ on AWS, http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ on GCP), localhost-bound admin APIs, or RFC-1918 internal addresses, and get the response returned verbatim to their AI chat. A hostname allow/block-list or a DNS-resolve-then-check guard (e.g., blocking link-local and private IP ranges) is required before making network calls on behalf of user-supplied URLs.
| response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True) | ||
|
|
||
| if response.status_code != 200: | ||
| logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}") | ||
| return f"Error: Could not fetch page (HTTP {response.status_code})" | ||
|
|
||
| content_type = response.headers.get('content-type', '') | ||
| if 'text/html' in content_type or 'text/plain' in content_type or not content_type: | ||
| text = _html_to_text(response.text) |
There was a problem hiding this comment.
Unbounded response body loaded into memory
response.text decodes the entire HTTP body into a Python string before the 8 000-character truncation runs. A server that returns tens or hundreds of megabytes (legitimately or as a DoS) will hold the full decoded string in memory. Consider checking response.headers.get('content-length') before decode, or stream and cap bytes with response.aread() + a size guard, so the truncation happens at the network layer rather than after full materialisation.
| logger.info(f"fetch_url_tool called - url: {url}") | ||
|
|
||
| if not url.startswith(('http://', 'https://')): | ||
| return "Error: URL must start with http:// or https://" | ||
|
|
||
| try: | ||
| client = get_webhook_client() | ||
| headers = { | ||
| 'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)', | ||
| 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', | ||
| 'Accept-Language': 'en-US,en;q=0.5', | ||
| } | ||
| response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True) | ||
|
|
||
| if response.status_code != 200: | ||
| logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}") |
There was a problem hiding this comment.
URLs are logged verbatim at three places before any sanitization. If users include API keys, OAuth tokens, or session identifiers in query strings, those credentials will appear in the log stream. Pass the URL through
sanitize() the same way error messages already do.
| logger.info(f"fetch_url_tool called - url: {url}") | |
| if not url.startswith(('http://', 'https://')): | |
| return "Error: URL must start with http:// or https://" | |
| try: | |
| client = get_webhook_client() | |
| headers = { | |
| 'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)', | |
| 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', | |
| 'Accept-Language': 'en-US,en;q=0.5', | |
| } | |
| response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True) | |
| if response.status_code != 200: | |
| logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}") | |
| logger.info(f"fetch_url_tool called - url: {sanitize(url)}") | |
| if not url.startswith(('http://', 'https://')): | |
| return "Error: URL must start with http:// or https://" | |
| try: | |
| client = get_webhook_client() | |
| headers = { | |
| 'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)', | |
| 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', | |
| 'Accept-Language': 'en-US,en;q=0.5', | |
| } | |
| response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True) | |
| if response.status_code != 200: | |
| logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {sanitize(url)}") |
| content_type = response.headers.get('content-type', '') | ||
| if 'text/html' in content_type or 'text/plain' in content_type or not content_type: |
There was a problem hiding this comment.
The
Accept header advertises application/xhtml+xml and application/xml, but the content-type check only matches text/html and text/plain. An XHTML response would hit the "Unsupported content type" error path even though the tool requested and could parse it.
| content_type = response.headers.get('content-type', '') | |
| if 'text/html' in content_type or 'text/plain' in content_type or not content_type: | |
| content_type = response.headers.get('content-type', '') | |
| if ('text/html' in content_type or 'text/plain' in content_type | |
| or 'application/xhtml+xml' in content_type or 'application/xml' in content_type | |
| or not content_type): |
| return "Error: URL must start with http:// or https://" | ||
|
|
||
| try: | ||
| client = get_webhook_client() |
There was a problem hiding this comment.
Webhook connection pool shared with user URL fetches
get_webhook_client() is the shared pool designed for outbound webhook delivery (max_connections=64). User-initiated URL fetches will compete for those same slots, and any slow or stalled external page will hold a connection that could otherwise be used for a webhook callback. Consider adding a dedicated get_web_fetch_client() with its own connection limits (e.g., 16 connections) so URL fetching is isolated from webhook reliability.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x content-type check Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@greptile-apps re-review |
| import re | ||
| import logging | ||
| from html.parser import HTMLParser | ||
| from urllib.parse import urlparse |
There was a problem hiding this comment.
Relative redirect URLs will raise
ValueError, so the tool silently fails on any website that returns a relative Location header (e.g. Location: /new-path or Location: //example.com/path). This is a common pattern for same-origin path redirects and protocol-relative redirects. The raw location value is assigned directly to url without being resolved against the current base URL, so the scheme-prefix check at the top of the next loop iteration immediately raises ValueError('Redirect target must use http:// or https://'), which the caller surfaces as an opaque error. Use urllib.parse.urljoin to resolve the location against the current URL before the next iteration.
| from urllib.parse import urlparse | |
| from urllib.parse import urlparse, urljoin |
…n same-origin redirects Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kodjima33
left a comment
There was a problem hiding this comment.
thanks — fetch_url_tool is a nice add
Summary
fetch_url_toolthat fetches and strips HTML from a given URL, returning up to 8000 chars of readable textCORE_TOOLSinagentic.pyso Claude uses it when a user shares a linkDemo:
before:
after:
Test plan
web_searchstill works for non-URL queries🤖 Generated with Claude Code