
fix(crawler): skip getContentType network call when precrawled archive is provided #2674

Open
howwohmm wants to merge 1 commit into karakeep-app:main from howwohmm:ohm/foss-003

Conversation

Contributor

@howwohmm howwohmm commented Apr 8, 2026

Problem

When uploading a SingleFile HTML archive, the crawler still made an outbound HTTP GET to the original URL to determine content type — even though precrawledArchiveAssetId was already set, meaning the archive was already available locally.

This caused metadata loss for auth-gated or crawler-blocking sites: the pre-check HTTP request would fail, causing the bookmark to lose its title and other metadata, even though the uploaded archive was perfectly valid.

Root cause identified by reporter in #2579: getContentType(url, ...) is called unconditionally at line 2093 of crawlerWorker.ts, before checking whether precrawledArchiveAssetId is set.

Fix

Guard the getContentType call so it is skipped when precrawledArchiveAssetId is set:

```ts
// Before
const contentType = await getContentType(url, jobId, job.abortSignal);

// After
const contentType = precrawledArchiveAssetId
  ? null
  : await getContentType(url, jobId, job.abortSignal);
```

precrawledArchiveAssetId is already in scope (destructured at line 2084). A null content type safely falls through to the else branch that calls crawlAndParseUrl, which already has first-class handling for precrawled archives — it reads the uploaded HTML directly and skips all network requests.
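The fall-through described above can be sketched in isolation. This is a minimal, self-contained model, not the actual crawlerWorker.ts code: `dispatch`, the branch strings, and the fake `getContentType` are all illustrative stand-ins, and the real function takes extra arguments (`jobId`, abort signal) omitted here.

```typescript
// Counts how often the (fake) network probe runs, to show it is skipped.
let networkCalls = 0;

// Stand-in for the real getContentType, which issues an HTTP request.
async function getContentType(url: string): Promise<string | null> {
  networkCalls += 1;
  return "text/html";
}

// Illustrative model of the dispatch in runCrawler after the fix.
async function dispatch(
  url: string,
  precrawledArchiveAssetId?: string,
): Promise<string> {
  // The fix: skip the network probe when an uploaded archive exists.
  const contentType = precrawledArchiveAssetId
    ? null
    : await getContentType(url);

  if (contentType?.startsWith("application/pdf")) {
    return "pdf-branch";
  }
  if (contentType?.startsWith("image/")) {
    return "image-branch";
  }
  // A null contentType lands here; in the real worker, crawlAndParseUrl
  // reads the uploaded archive locally when precrawledArchiveAssetId is set.
  return "crawlAndParseUrl";
}

async function demo() {
  const branch = await dispatch("https://example.com/post", "asset-123");
  console.log(branch, "network calls:", networkCalls); // crawlAndParseUrl network calls: 0
}

demo();
```

Because optional chaining on `null` yields `undefined`, both content-type checks are falsy and the archive path never touches the network.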

Scope

1 file, 3-line diff. No existing test infrastructure for crawlerWorker.ts.

Fixes #2579

fix(crawler): skip getContentType network call when precrawled archive is provided

When a SingleFile HTML archive is uploaded via precrawledArchiveAssetId, the
crawler was still making an outbound HTTP GET to the original URL to determine
content type. This caused metadata loss for auth-gated or crawler-blocking
sites, since the pre-check would fail even though a valid archive was already
available.

Guard the getContentType call so it is skipped when precrawledArchiveAssetId
is set. A null contentType safely falls through to the else branch that calls
crawlAndParseUrl, which already has first-class handling for precrawled archives.

Fixes karakeep-app#2579

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 148dc442-ad5c-4ec4-ae6a-8350bc38e681

📥 Commits

Reviewing files that changed from the base of the PR and between bc14214 and ea50fab.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts

Walkthrough

The runCrawler function in the crawler worker now conditionally skips the getContentType network request when a precrawledArchiveAssetId is present, setting contentType to null instead of fetching it remotely.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Crawler Worker Logic<br>apps/workers/workers/crawlerWorker.ts | Adds conditional logic to bypass the getContentType network fetch when precrawledArchiveAssetId exists, setting contentType to null. Preserves abort-signal handling regardless of the condition. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: skipping the getContentType network call when a precrawled archive is provided, which is the core fix. |
| Description check | ✅ Passed | The description thoroughly explains the problem (unconditional HTTP GET causing metadata loss), the fix (guarding the getContentType call), and the scope, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #2579 by guarding the getContentType call to skip when precrawledArchiveAssetId is present, preventing unnecessary network requests for auth-gated/blocking sites. |
| Out of Scope Changes check | ✅ Passed | The 3-line diff is narrowly focused on the identified root cause, with no unrelated changes beyond guarding the getContentType call as required. |



greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes an unconditional getContentType network call that fired even when a pre-crawled SingleFile archive was already on hand, causing metadata loss for auth-gated or crawler-blocking URLs. The one-line guard precrawledArchiveAssetId ? null : await getContentType(...) is correct: a null content type bypasses both the PDF and image branches and lands in the else path that calls crawlAndParseUrl, which already reads the local asset directly when precrawledArchiveAssetId is set.

Confidence Score: 5/5

Safe to merge — minimal, well-scoped fix with no regressions.

The change is a 3-line diff that adds a guard condition. The downstream crawlAndParseUrl already has first-class handling for precrawledArchiveAssetId, and the null content type correctly bypasses the PDF/image special-case branches. No new code paths or dependencies are introduced.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/workers/workers/crawlerWorker.ts | Guards getContentType behind a precrawledArchiveAssetId check so no outbound HTTP call is made when an archive is already available; a null content type safely falls through to the else branch, where crawlAndParseUrl reads the local asset directly. |

Reviews (1): Last reviewed commit: "fix(crawler): skip getContentType networ..."



Development

Successfully merging this pull request may close these issues.

[Bug] SingleFile uploads incorrectly trigger HTTP GET on original URL, causing missing titles and crawling failures
