
fix(crawler): skip getContentType network call when precrawled archive is provided #2674

Open
howwohmm wants to merge 1 commit into karakeep-app:main from howwohmm:ohm/foss-003

Conversation

Contributor

@howwohmm howwohmm commented Apr 8, 2026

Problem

When uploading a SingleFile HTML archive, the crawler still made an outbound HTTP GET to the original URL to determine content type — even though precrawledArchiveAssetId was already set, meaning the archive was already available locally.

This caused metadata loss for auth-gated or crawler-blocking sites: the pre-check HTTP request would fail, causing the bookmark to lose its title and other metadata, even though the uploaded archive was perfectly valid.

Root cause identified by reporter in #2579: getContentType(url, ...) is called unconditionally at line 2093 of crawlerWorker.ts, before checking whether precrawledArchiveAssetId is set.

Fix

Guard the getContentType call so it is skipped when precrawledArchiveAssetId is set:

```ts
// Before
const contentType = await getContentType(url, jobId, job.abortSignal);

// After
const contentType = precrawledArchiveAssetId
  ? null
  : await getContentType(url, jobId, job.abortSignal);
```

precrawledArchiveAssetId is already in scope (destructured at line 2084). A null content type safely falls through to the else branch that calls crawlAndParseUrl, which already has first-class handling for precrawled archives — it reads the uploaded HTML directly and skips all network requests.
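The fall-through described above can be sketched in isolation. This is a minimal, self-contained model, not the actual crawlerWorker.ts code: `dispatch`, the branch strings, and the fake `getContentType` are all illustrative stand-ins, and the real function takes extra arguments (`jobId`, abort signal) omitted here.

```typescript
// Counts how often the (fake) network probe runs, to show it is skipped.
let networkCalls = 0;

// Stand-in for the real getContentType, which issues an HTTP request.
async function getContentType(url: string): Promise<string | null> {
  networkCalls += 1;
  return "text/html";
}

// Illustrative model of the dispatch in runCrawler after the fix.
async function dispatch(
  url: string,
  precrawledArchiveAssetId?: string,
): Promise<string> {
  // The fix: skip the network probe when an uploaded archive exists.
  const contentType = precrawledArchiveAssetId
    ? null
    : await getContentType(url);

  if (contentType?.startsWith("application/pdf")) {
    return "pdf-branch";
  }
  if (contentType?.startsWith("image/")) {
    return "image-branch";
  }
  // A null contentType lands here; in the real worker, crawlAndParseUrl
  // reads the uploaded archive locally when precrawledArchiveAssetId is set.
  return "crawlAndParseUrl";
}

async function demo() {
  const branch = await dispatch("https://example.com/post", "asset-123");
  console.log(branch, "network calls:", networkCalls); // crawlAndParseUrl network calls: 0
}

demo();
```

Because optional chaining on `null` yields `undefined`, both content-type checks are falsy and the archive path never touches the network.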

Scope

1 file, 3-line diff. No existing test infrastructure for crawlerWorker.ts.

Fixes #2579

fix(crawler): skip getContentType network call when precrawled archive is provided

When a SingleFile HTML archive is uploaded via precrawledArchiveAssetId, the
crawler was still making an outbound HTTP GET to the original URL to determine
content type. This caused metadata loss for auth-gated or crawler-blocking
sites, since the pre-check would fail even though a valid archive was already
available.

Guard the getContentType call so it is skipped when precrawledArchiveAssetId
is set. A null contentType safely falls through to the else branch that calls
crawlAndParseUrl, which already has first-class handling for precrawled archives.

Fixes karakeep-app#2579

coderabbitai bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 148dc442-ad5c-4ec4-ae6a-8350bc38e681

📥 Commits

Reviewing files that changed from the base of the PR and between bc14214 and ea50fab.

📒 Files selected for processing (1)
  • apps/workers/workers/crawlerWorker.ts

Walkthrough

The runCrawler function in the crawler worker now conditionally skips the getContentType network request when a precrawledArchiveAssetId is present, setting contentType to null instead of fetching it remotely.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Crawler Worker Logic<br>apps/workers/workers/crawlerWorker.ts | Adds conditional logic to bypass the getContentType network fetch when precrawledArchiveAssetId exists, setting contentType to null. Preserves abort-signal handling regardless of the condition. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: skipping the getContentType network call when a precrawled archive is provided, which is the core fix. |
| Description check | ✅ Passed | The description thoroughly explains the problem (unconditional HTTP GET causing metadata loss), the fix (guarding the getContentType call), and the scope, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #2579 by guarding the getContentType call to skip when precrawledArchiveAssetId is present, preventing unnecessary network requests for auth-gated/blocking sites. |
| Out of Scope Changes check | ✅ Passed | The 3-line diff is narrowly focused on the identified root cause, with no unrelated changes beyond guarding the getContentType call as required. |



greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes an unconditional getContentType network call that fired even when a pre-crawled SingleFile archive was already on hand, causing metadata loss for auth-gated or crawler-blocking URLs. The one-line guard precrawledArchiveAssetId ? null : await getContentType(...) is correct: a null content type bypasses both the PDF and image branches and lands in the else path that calls crawlAndParseUrl, which already reads the local asset directly when precrawledArchiveAssetId is set.

Confidence Score: 5/5

Safe to merge — minimal, well-scoped fix with no regressions.

The change is a 3-line diff that adds a guard condition. The downstream crawlAndParseUrl already has first-class handling for precrawledArchiveAssetId, and the null content type correctly bypasses the PDF/image special-case branches. No new code paths or dependencies are introduced.

No files require special attention.

Vulnerabilities

No security concerns identified.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/workers/workers/crawlerWorker.ts | Guards getContentType behind a precrawledArchiveAssetId check so no outbound HTTP call is made when an archive is already available; a null content type safely falls through to the else branch, where crawlAndParseUrl reads the local asset directly. |

Reviews (1): Last reviewed commit: "fix(crawler): skip getContentType networ..."



Development

Successfully merging this pull request may close these issues.

[Bug] SingleFile uploads incorrectly trigger HTTP GET on original URL, causing missing titles and crawling failures
