feat add fetch_url_tool so AI chat can read direct URLs by krushnarout · Pull Request #7328 · BasedHardware/omi

krushnarout · 2026-05-16T06:49:45Z

Summary

AI chat previously said it couldn't browse a URL when given one directly — it only searched the web
Adds fetch_url_tool that fetches and strips HTML from a given URL, returning up to 8000 chars of readable text
Wires the tool into CORE_TOOLS in agentic.py so Claude uses it when a user shares a link

Demo: before and after screenshots (images not preserved)

Test plan

Send a message in mobile AI chat with a direct URL (e.g. a news article) and verify the content is read and summarized
Verify web_search still works for non-URL queries
Verify error path: invalid URL, non-HTML content type, non-200 status

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps · 2026-05-16T06:53:32Z

Greptile Summary

This PR adds fetch_url_tool, which lets the AI chat assistant read the content of a user-supplied URL by fetching it, capping the response at 512 KB, stripping HTML, and returning up to 8 000 characters of readable text to Claude.

web_tools.py: Implements the tool with an SSRF guard (DNS resolution against private-IP block-list), manual redirect loop, streaming body cap, HTML→text extraction, and JSON-LD / Open Graph metadata extraction.
http_client.py: Adds an isolated get_web_fetch_client() pool (16 connections, 15 s timeout) so user URL fetches don't compete with webhook delivery slots.
agentic.py / tools/__init__.py: Wires the tool into CORE_TOOLS and adds a system-prompt block that instructs Claude to call the tool whenever the user shares a URL.
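
The HTML-to-text step summarized above can be sketched with the standard library's HTMLParser, which the PR also imports. This is an illustration only and not the PR's actual _html_to_text: the names TextExtractor and html_to_text and the set of skipped tags are assumptions. Only the 8000-character limit comes from the summary.

```python
from html.parser import HTMLParser

MAX_CHARS = 8000  # limit stated in the PR summary
SKIP_TAGS = {"script", "style", "noscript"}  # assumption: text inside these is not readable content


class TextExtractor(HTMLParser):
    """Collects visible text and ignores anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html: str, limit: int = MAX_CHARS) -> str:
    """Strip markup and truncate the result to `limit` characters."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:limit]
```

Truncating after extraction keeps the limit tied to readable characters rather than raw markup, which is presumably why the summary quotes a byte cap for the download and a separate character cap for the text.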

Confidence Score: 4/5

Safe to merge after fixing relative redirect resolution — without it the tool silently errors on any website that issues a same-origin or protocol-relative redirect.

The redirect handler assigns the raw Location header value directly to url without resolving it against the current base URL. Relative locations (/new-path, //cdn.example.com/) are very common in production sites and will all fail the scheme-prefix guard on the next loop iteration, surfacing an opaque error to the user. Everything else in the PR — SSRF guard, memory cap, isolated connection pool, sanitized logging — is correctly implemented.

backend/utils/retrieval/tools/web_tools.py — specifically the redirect-following logic in _fetch_page

Important Files Changed

Filename	Overview
backend/utils/retrieval/tools/web_tools.py	New fetch_url_tool with SSRF guard, body-size cap, and HTML extraction. Relative redirects are not resolved against the base URL, causing the tool to fail on websites that use them.
backend/utils/http_client.py	Adds get_web_fetch_client() with an isolated 16-connection pool (15s timeout), correctly wired into close_all_clients().
backend/utils/retrieval/agentic.py	Wires fetch_url_tool into CORE_TOOLS and adds a system-prompt instruction block that directs Claude to call the tool whenever a URL is present in the conversation.
backend/utils/retrieval/tools/__init__.py	Exports fetch_url_tool from the tools package; straightforward bookkeeping change.

Sequence Diagram

sequenceDiagram
    participant User
    participant Claude
    participant fetch_url_tool
    participant _fetch_page
    participant DNS
    participant ExternalServer

    User->>Claude: Message with URL
    Claude->>fetch_url_tool: call(url)
    fetch_url_tool->>_fetch_page: url, headers
    loop up to 5 redirects
        _fetch_page->>DNS: getaddrinfo(hostname)
        DNS-->>_fetch_page: resolved IPs
        _fetch_page->>_fetch_page: check private IP block-list
        _fetch_page->>ExternalServer: "GET url (stream, follow_redirects=False)"
        ExternalServer-->>_fetch_page: status + headers + body (capped 512 KB)
        alt redirect (3xx)
            _fetch_page->>_fetch_page: urljoin(current_url, Location)
        else 200 OK
            _fetch_page-->>fetch_url_tool: status, content_type, body
        end
    end
    fetch_url_tool->>fetch_url_tool: _html_to_text(body) truncate to 8000 chars
    fetch_url_tool-->>Claude: readable text
    Claude-->>User: Summary / answer

Reviews (2): Last reviewed commit: "fix extract meta/OG tags and add URL fet..."

greptile-apps · 2026-05-16T06:53:35Z

+    if not url.startswith(('http://', 'https://')):
+        return "Error: URL must start with http:// or https://"
+
+    try:
+        client = get_webhook_client()
+        headers = {
+            'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+            'Accept-Language': 'en-US,en;q=0.5',
+        }
+        response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)


SSRF — no protection against private/metadata addresses

The only URL guard is a scheme prefix check (startswith(('http://', 'https://'))). Any authenticated user can craft a request to cloud metadata services (http://169.254.169.254/latest/meta-data/iam/security-credentials/ on AWS, http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/ on GCP), localhost-bound admin APIs, or RFC-1918 internal addresses, and get the response returned verbatim to their AI chat. A hostname allow/block-list or a DNS-resolve-then-check guard (e.g., blocking link-local and private IP ranges) is required before making network calls on behalf of user-supplied URLs.
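
A DNS-resolve-then-check guard of the kind suggested here might look like the sketch below. It uses only the standard library; the function names is_blocked_ip and assert_public_host and the injectable resolve parameter are assumptions made for illustration, not code from the PR.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_ip(ip_text: str) -> bool:
    """True for loopback, private, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so it is judged as IPv4.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def assert_public_host(url: str, resolve=socket.getaddrinfo) -> None:
    """Resolve the URL's hostname and raise ValueError if any address is blocked."""
    host = urlparse(url).hostname
    if not host:
        raise ValueError("URL has no hostname")
    for info in resolve(host, None, type=socket.SOCK_STREAM):
        address = info[4][0].split("%")[0]  # drop any IPv6 scope id
        if is_blocked_ip(address):
            raise ValueError("URL resolves to a private or reserved address")
```

A check like this has to run again for every redirect hop. It also leaves a gap between resolution and connection: the HTTP client performs its own DNS lookup, so a hostname that changes its answer between the two lookups can still reach an internal address unless the connection is pinned to the checked IP.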

greptile-apps · 2026-05-16T06:53:36Z

+        response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
+
+        if response.status_code != 200:
+            logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")
+            return f"Error: Could not fetch page (HTTP {response.status_code})"
+
+        content_type = response.headers.get('content-type', '')
+        if 'text/html' in content_type or 'text/plain' in content_type or not content_type:
+            text = _html_to_text(response.text)


Unbounded response body loaded into memory

response.text decodes the entire HTTP body into a Python string before the 8 000-character truncation runs. A server that returns tens or hundreds of megabytes (legitimately or as a DoS) will hold the full decoded string in memory. Consider checking response.headers.get('content-length') before decoding, or stream the body in chunks and stop once a byte cap is reached (response.aread() alone still reads the whole body), so the truncation happens at the network layer rather than after full materialisation.
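
The streaming cap described here can be sketched independently of the HTTP client. The helper read_capped is a hypothetical name; it accepts any async iterator of byte chunks. With httpx, such an iterator would come from response.aiter_bytes() inside an `async with client.stream("GET", url) as response:` block. The 512 KB figure is the cap quoted in the review summary.

```python
import asyncio
from typing import AsyncIterator

MAX_BODY_BYTES = 512 * 1024  # cap quoted in the review summary


async def read_capped(chunks: AsyncIterator[bytes], limit: int = MAX_BODY_BYTES) -> bytes:
    """Accumulate chunks until `limit` bytes are held, then stop reading."""
    buf = bytearray()
    async for chunk in chunks:
        remaining = limit - len(buf)
        if len(chunk) >= remaining:
            # Take only what still fits and stop pulling from the network.
            buf.extend(chunk[:remaining])
            break
        buf.extend(chunk)
    return bytes(buf)


async def _demo_chunks():
    # Stand-in for a network stream: ten chunks of 100 bytes each.
    for _ in range(10):
        yield b"x" * 100


if __name__ == "__main__":
    body = asyncio.run(read_capped(_demo_chunks(), limit=250))
    print(len(body))
```

Because the loop breaks as soon as the cap is reached, memory use is bounded by the cap plus one chunk, regardless of what Content-Length the server advertises.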

greptile-apps · 2026-05-16T06:53:37Z

+    logger.info(f"fetch_url_tool called - url: {url}")
+
+    if not url.startswith(('http://', 'https://')):
+        return "Error: URL must start with http:// or https://"
+
+    try:
+        client = get_webhook_client()
+        headers = {
+            'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+            'Accept-Language': 'en-US,en;q=0.5',
+        }
+        response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
+
+        if response.status_code != 200:
+            logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")


URLs are logged verbatim at three places before any sanitization. If users include API keys, OAuth tokens, or session identifiers in query strings, those credentials will appear in the log stream. Pass the URL through sanitize() the same way error messages already do.

Suggested change

-    logger.info(f"fetch_url_tool called - url: {url}")
+    logger.info(f"fetch_url_tool called - url: {sanitize(url)}")
 
     if not url.startswith(('http://', 'https://')):
         return "Error: URL must start with http:// or https://"
 
     try:
         client = get_webhook_client()
         headers = {
             'User-Agent': 'Mozilla/5.0 (compatible; Omi-AI-Bot/1.0)',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
             'Accept-Language': 'en-US,en;q=0.5',
         }
         response = await client.get(url, headers=headers, timeout=15.0, follow_redirects=True)
 
         if response.status_code != 200:
-            logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {url}")
+            logger.warning(f"fetch_url_tool - HTTP {response.status_code} for {sanitize(url)}")
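
The review relies on the project's existing sanitize() helper, whose implementation is not shown on this page. As a rough illustration of what URL redaction for logs can involve, the sketch below drops credentials, query string, and fragment using the standard library. The name redact_url is hypothetical and is not a stand-in for the repository's sanitize().

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return a log-safe form of `url` without userinfo, query values, or fragment."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if ":" in host:
            host = f"[{host}]"  # restore brackets around IPv6 literals
        if parts.port is not None:
            host = f"{host}:{parts.port}"
    except ValueError:
        # Malformed netloc or port: log nothing rather than the raw input.
        return "<unparseable url>"
    query = "<redacted>" if parts.query else ""
    return urlunsplit((parts.scheme, host, parts.path, query, ""))


if __name__ == "__main__":
    print(redact_url("https://user:pw@example.com:8443/a/b?token=abc#frag"))
```

Keeping a marker in place of the query string preserves the fact that parameters were present, which can help when debugging, without recording their values.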

greptile-apps · 2026-05-16T06:53:38Z

+        content_type = response.headers.get('content-type', '')
+        if 'text/html' in content_type or 'text/plain' in content_type or not content_type:


The Accept header advertises application/xhtml+xml and application/xml, but the content-type check only matches text/html and text/plain. An XHTML response would hit the "Unsupported content type" error path even though the tool requested and could parse it.

Suggested change

         content_type = response.headers.get('content-type', '')
-        if 'text/html' in content_type or 'text/plain' in content_type or not content_type:
+        if ('text/html' in content_type or 'text/plain' in content_type
+                or 'application/xhtml+xml' in content_type or 'application/xml' in content_type
+                or not content_type):
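
An alternative to chaining substring checks is to parse the media type once and compare it against a set. The sketch below is one way to do that; is_supported_content_type and ALLOWED_MEDIA_TYPES are hypothetical names. Unlike the substring test in the diff, it matches the media type exactly and ignores parameters such as charset.

```python
ALLOWED_MEDIA_TYPES = {
    "text/html",
    "text/plain",
    "application/xhtml+xml",
    "application/xml",
}


def is_supported_content_type(header_value: str) -> bool:
    """Check a Content-Type header value against the allowed media types."""
    if not header_value:
        # Mirrors the diff: a missing header is treated as acceptable.
        return True
    # Drop parameters (e.g. "; charset=utf-8") and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_MEDIA_TYPES
```

Keeping the allowed set next to the Accept header definition makes it easier to keep the two in sync, which is the mismatch this comment points out.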

greptile-apps · 2026-05-16T06:53:39Z

+        return "Error: URL must start with http:// or https://"
+
+    try:
+        client = get_webhook_client()


Webhook connection pool shared with user URL fetches

get_webhook_client() is the shared pool designed for outbound webhook delivery (max_connections=64). User-initiated URL fetches will compete for those same slots, and any slow or stalled external page will hold a connection that could otherwise be used for a webhook callback. Consider adding a dedicated get_web_fetch_client() with its own connection limits (e.g., 16 connections) so URL fetching is isolated from webhook reliability.

mdmohsin7 · 2026-05-27T17:41:20Z

@greptile-apps re-review

greptile-apps · 2026-05-27T17:45:23Z

+import re
+import logging
+from html.parser import HTMLParser
+from urllib.parse import urlparse


Relative redirect URLs will raise ValueError, so the tool silently fails on any website that returns a relative Location header (e.g. Location: /new-path or Location: //example.com/path). This is a common pattern for same-origin path redirects and protocol-relative redirects. The raw location value is assigned directly to url without being resolved against the current base URL, so the scheme-prefix check at the top of the next loop iteration immediately raises ValueError('Redirect target must use http:// or https://'), which the caller surfaces as an opaque error. Use urllib.parse.urljoin to resolve the location against the current URL before the next iteration.

Suggested change

-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin
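
The fix described above comes down to resolving the Location value against the current URL before re-running the scheme check. A minimal sketch follows, with resolve_redirect as a hypothetical helper name. The error message is the one quoted in the review.

```python
from urllib.parse import urljoin


def resolve_redirect(current_url: str, location: str) -> str:
    """Resolve a Location header against the URL that produced it."""
    target = urljoin(current_url, location)
    # Re-apply the scheme guard to the resolved target, not the raw header.
    if not target.startswith(("http://", "https://")):
        raise ValueError("Redirect target must use http:// or https://")
    return target


if __name__ == "__main__":
    base = "https://example.com/a/b"
    print(resolve_redirect(base, "/new-path"))            # https://example.com/new-path
    print(resolve_redirect(base, "//cdn.example.com/x"))  # https://cdn.example.com/x
```

Absolute Location values pass through urljoin unchanged, so the same code path handles relative, protocol-relative, and absolute redirects.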

kodjima33

thanks — fetch_url_tool is a nice add

krushnarout and others added 3 commits May 16, 2026 12:19

feat add fetch_url_tool for reading specific web page URLs in AI chat

73a2454

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat export fetch_url_tool from tools __init__

eae612d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat wire fetch_url_tool into CORE_TOOLS so AI chat can read direct URLs

81a187d

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

greptile-apps Bot reviewed May 16, 2026

krushnarout and others added 4 commits May 16, 2026 20:41

feat add get_web_fetch_client isolated from webhook pool for URL fetches

34a8ef9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix fetch_url_tool SSRF guard, stream body cap, sanitize URL logs, fix content-type check

8773b83

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix instruct Claude to always call fetch_url_tool when user shares a URL

a56aa42

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix extract meta/OG tags and add URL fetch instruction to system prompt

ae34109

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

krushnarout requested a review from mdmohsin7 May 27, 2026 17:20

greptile-apps Bot reviewed May 27, 2026

fix resolve relative redirect URLs with urljoin to avoid ValueError on same-origin redirects

992111b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kodjima33 approved these changes May 29, 2026

krushnarout wants to merge 8 commits into main from feat/fetch-url-tool

3 participants
