Skip to content

feat add fetch_url_tool so AI chat can read direct URLs#7328

Open
krushnarout wants to merge 8 commits into
mainfrom
feat/fetch-url-tool
Open

feat add fetch_url_tool so AI chat can read direct URLs#7328
krushnarout wants to merge 8 commits into
mainfrom
feat/fetch-url-tool

Conversation

@krushnarout
Copy link
Copy Markdown
Member

@krushnarout krushnarout commented May 16, 2026

Summary

  • AI chat previously said it couldn't browse a URL when given one directly — it only searched the web
  • Adds fetch_url_tool that fetches and strips HTML from a given URL, returning up to 8000 chars of readable text
  • Wires the tool into CORE_TOOLS in agentic.py so Claude uses it when a user shares a link

Demo:

before:

IMG_4509

after:

IMG_4510

Test plan

  • Send a message in mobile AI chat with a direct URL (e.g. a news article) and verify the content is read and summarized
  • Verify web_search still works for non-URL queries
  • Verify error path: invalid URL, non-HTML content type, non-200 status

🤖 Generated with Claude Code

krushnarout and others added 3 commits May 16, 2026 12:19
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR adds fetch_url_tool, which lets the AI chat assistant read the content of a user-supplied URL by fetching it, capping the response at 512 KB, stripping HTML, and returning up to 8 000 characters of readable text to Claude.

  • web_tools.py: Implements the tool with an SSRF guard (DNS resolution against private-IP block-list), manual redirect loop, streaming body cap, HTML→text extraction, and JSON-LD / Open Graph metadata extraction.
  • http_client.py: Adds an isolated get_web_fetch_client() pool (16 connections, 15 s timeout) so user URL fetches don't compete with webhook delivery slots.
  • agentic.py / tools/__init__.py: Wires the tool into CORE_TOOLS and adds a system-prompt block that instructs Claude to call the tool whenever the user shares a URL.

Confidence Score: 4/5

Safe to merge after fixing relative redirect resolution — without it the tool silently errors on any website that issues a same-origin or protocol-relative redirect.

The redirect handler assigns the raw Location header value directly to url without resolving it against the current base URL. Relative locations (/new-path, //cdn.example.com/) are very common in production sites and will all fail the scheme-prefix guard on the next loop iteration, surfacing an opaque error to the user. Everything else in the PR — SSRF guard, memory cap, isolated connection pool, sanitized logging — is correctly implemented.

backend/utils/retrieval/tools/web_tools.py — specifically the redirect-following logic in _fetch_page

Important Files Changed

Filename Overview
backend/utils/retrieval/tools/web_tools.py New fetch_url_tool with SSRF guard, body-size cap, and HTML extraction. Relative redirects are not resolved against the base URL, causing the tool to fail on websites that use them.
backend/utils/http_client.py Adds get_web_fetch_client() with an isolated 16-connection pool (15s timeout), correctly wired into close_all_clients().
backend/utils/retrieval/agentic.py Wires fetch_url_tool into CORE_TOOLS and adds a system-prompt instruction block that directs Claude to call the tool whenever a URL is present in the conversation.
backend/utils/retrieval/tools/init.py Exports fetch_url_tool from the tools package; straightforward bookkeeping change.

Sequence Diagram

sequenceDiagram
    participant User
    participant Claude
    participant fetch_url_tool
    participant _fetch_page
    participant DNS
    participant ExternalServer

    User->>Claude: Message with URL
    Claude->>fetch_url_tool: call(url)
    fetch_url_tool->>_fetch_page: url, headers
    loop up to 5 redirects
        _fetch_page->>DNS: getaddrinfo(hostname)
        DNS-->>_fetch_page: resolved IPs
        _fetch_page->>_fetch_page: check private IP block-list
        _fetch_page->>ExternalServer: "GET url (stream, follow_redirects=False)"
        ExternalServer-->>_fetch_page: status + headers + body (capped 512 KB)
        alt redirect (3xx)
            _fetch_page->>_fetch_page: urljoin(current_url, Location)
        else 200 OK
            _fetch_page-->>fetch_url_tool: status, content_type, body
        end
    end
    fetch_url_tool->>fetch_url_tool: _html_to_text(body) truncate to 8000 chars
    fetch_url_tool-->>Claude: readable text
    Claude-->>User: Summary / answer
Loading

Reviews (2): Last reviewed commit: "fix extract meta/OG tags and add URL fet..." | Re-trigger Greptile

Comment on lines +76 to +86
if not url.startswith(('http://', 'https://')):
return "Error: URL must start with http:// or https://"

try:
client = get_webhook_client()
headers = {
'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
}
response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 security SSRF — no protection against private/metadata addresses

The only URL guard is a scheme prefix check (startswith('http://', 'https://')). Any authenticated user can craft a request to cloud metadata services (http://169.254.169.254/latest/meta-data/iam/security-credentials/ on AWS, http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ on GCP), localhost-bound admin APIs, or RFC-1918 internal addresses, and get the response returned verbatim to their AI chat. A hostname allow/block-list or a DNS-resolve-then-check guard (e.g., blocking link-local and private IP ranges) is required before making network calls on behalf of user-supplied URLs.

Comment on lines +86 to +94
response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)

if response.status_code != 200:
logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")
return f"Error: Could not fetch page (HTTP {response.status_code})"

content_type = response.headers.get('content-type', '')
if 'text/html' in content_type or 'text/plain' in content_type or not content_type:
text = _html_to_text(response.text)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Unbounded response body loaded into memory

response.text decodes the entire HTTP body into a Python string before the 8 000-character truncation runs. A server that returns tens or hundreds of megabytes (legitimately or as a DoS) will hold the full decoded string in memory. Consider checking response.headers.get('content-length') before decode, or stream and cap bytes with response.aread() + a size guard, so the truncation happens at the network layer rather than after full materialisation.

Comment on lines +74 to +89
logger.info(f"fetch_url_tool called - url: {url}")

if not url.startswith(('http://', 'https://')):
return "Error: URL must start with http:// or https://"

try:
client = get_webhook_client()
headers = {
'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
}
response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)

if response.status_code != 200:
logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 URLs are logged verbatim at three places before any sanitization. If users include API keys, OAuth tokens, or session identifiers in query strings, those credentials will appear in the log stream. Pass the URL through sanitize() the same way error messages already do.

Suggested change
logger.info(f"fetch_url_tool called - url: {url}")
if not url.startswith(('http://', 'https://')):
return "Error: URL must start with http:// or https://"
try:
client = get_webhook_client()
headers = {
'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
}
response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
if response.status_code != 200:
logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")
logger.info(f"fetch_url_tool called - url: {sanitize(url)}")
if not url.startswith(('http://', 'https://')):
return "Error: URL must start with http:// or https://"
try:
client = get_webhook_client()
headers = {
'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
}
response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
if response.status_code != 200:
logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {sanitize(url)}")

Comment on lines +92 to +93
content_type = response.headers.get('content-type', '')
if 'text/html' in content_type or 'text/plain' in content_type or not content_type:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The Accept header advertises application/xhtml+xml and application/xml, but the content-type check only matches text/html and text/plain. An XHTML response would hit the "Unsupported content type" error path even though the tool requested and could parse it.

Suggested change
content_type = response.headers.get('content-type', '')
if 'text/html' in content_type or 'text/plain' in content_type or not content_type:
content_type = response.headers.get('content-type', '')
if ('text/html' in content_type or 'text/plain' in content_type
or 'application/xhtml+xml' in content_type or 'application/xml' in content_type
or not content_type):

return "Error: URL must start with http:// or https://"

try:
client = get_webhook_client()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Webhook connection pool shared with user URL fetches

get_webhook_client() is the shared pool designed for outbound webhook delivery (max_connections=64). User-initiated URL fetches will compete for those same slots, and any slow or stalled external page will hold a connection that could otherwise be used for a webhook callback. Consider adding a dedicated get_web_fetch_client() with its own connection limits (e.g., 16 connections) so URL fetching is isolated from webhook reliability.

krushnarout and others added 4 commits May 16, 2026 20:41
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x content-type check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@krushnarout krushnarout requested a review from mdmohsin7 May 27, 2026 17:20
@mdmohsin7
Copy link
Copy Markdown
Member

@greptile-apps re-review

import re
import logging
from html.parser import HTMLParser
from urllib.parse import urlparse
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Relative redirect URLs will raise ValueError, so the tool silently fails on any website that returns a relative Location header (e.g. Location: /new-path or Location: //example.com/path). This is a common pattern for same-origin path redirects and protocol-relative redirects. The raw location value is assigned directly to url without being resolved against the current base URL, so the scheme-prefix check at the top of the next loop iteration immediately raises ValueError('Redirect target must use http:// or https://'), which the caller surfaces as an opaque error. Use urllib.parse.urljoin to resolve the location against the current URL before the next iteration.

Suggested change
from urllib.parse import urlparse
from urllib.parse import urlparse, urljoin

…n same-origin redirects

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@kodjima33 kodjima33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks — fetch_url_tool is a nice add

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants