fix(crawler): skip getContentType network call when precrawled archive is provided #2674
howwohmm wants to merge 1 commit into karakeep-app:main
Conversation
fix(crawler): skip getContentType network call when precrawled archive is provided

When a SingleFile HTML archive is uploaded via `precrawledArchiveAssetId`, the crawler was still making an outbound HTTP GET to the original URL to determine content type. This caused metadata loss for auth-gated or crawler-blocking sites, since the pre-check would fail even though a valid archive was already available.

Guard the `getContentType` call so it is skipped when `precrawledArchiveAssetId` is set. A null `contentType` safely falls through to the else branch that calls `crawlAndParseUrl`, which already has first-class handling for precrawled archives.

Fixes karakeep-app#2579
Greptile Summary

This PR fixes an unconditional `getContentType` network call that ran even when a precrawled archive was provided.

Confidence Score: 5/5. Safe to merge: a minimal, well-scoped fix with no regressions. The change is a 3-line diff that adds a guard condition. The downstream `crawlAndParseUrl` already handles precrawled archives, so the resulting null content type is safe. No files require special attention.
| Filename | Overview |
|---|---|
| apps/workers/workers/crawlerWorker.ts | Guards getContentType behind a precrawledArchiveAssetId check so no outbound HTTP call is made when an archive is already available; null content type safely falls through to the else branch where crawlAndParseUrl reads the local asset directly. |
Problem
When uploading a SingleFile HTML archive, the crawler still made an outbound HTTP GET to the original URL to determine the content type, even though `precrawledArchiveAssetId` was already set, meaning the archive was already available locally.

This caused metadata loss for auth-gated or crawler-blocking sites: the pre-check HTTP request would fail, so the bookmark lost its title and other metadata even though the uploaded archive was perfectly valid.
Root cause identified by reporter in #2579:
`getContentType(url, ...)` is called unconditionally at line 2093 of `crawlerWorker.ts`, before checking whether `precrawledArchiveAssetId` is set.

Fix
Guard the `getContentType` call so it is skipped when `precrawledArchiveAssetId` is set. `precrawledArchiveAssetId` is already in scope (destructured at line 2084). A null content type safely falls through to the else branch that calls `crawlAndParseUrl`, which already has first-class handling for precrawled archives: it reads the uploaded HTML directly and skips all network requests.

Scope

1 file, 3-line diff. No existing test infrastructure for `crawlerWorker.ts`.

Fixes #2579
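The guard described above can be sketched as follows. Note this is a standalone illustration, not the actual diff: `JobData`, `determineContentType`, and the throwing network stub are invented scaffolding; only the conditional on `precrawledArchiveAssetId` reflects the shape of the change.

```typescript
// Hypothetical job shape; the real crawler job carries more fields.
type JobData = {
  url: string;
  precrawledArchiveAssetId?: string;
};

// Stand-in for the real outbound HTTP pre-check in crawlerWorker.ts.
// It throws here to make any accidental network call visible.
async function getContentType(url: string): Promise<string | null> {
  throw new Error(`unexpected network call to ${url}`);
}

// The guard: skip the network pre-check entirely when a precrawled
// archive asset is already available locally.
async function determineContentType(job: JobData): Promise<string | null> {
  if (job.precrawledArchiveAssetId) {
    // A null content type falls through to the else branch that calls
    // crawlAndParseUrl, which reads the uploaded HTML directly.
    return null;
  }
  return getContentType(job.url);
}

determineContentType({
  url: "https://example.com/article",
  precrawledArchiveAssetId: "asset-123",
}).then((ct) => console.log(ct)); // prints "null"
```

Because the guard short-circuits before any I/O, auth-gated or crawler-blocking origins can no longer fail the pre-check when a valid archive is already on hand.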