fix(embed): 2D-tile wide pages so PDFs/landscape content isn't dropped (#47)#48
Open
andylizf wants to merge 1 commit into
Open
fix(embed): 2D-tile wide pages so PDFs/landscape content isn't dropped (#47)#48andylizf wants to merge 1 commit into
andylizf wants to merge 1 commit into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
The embedder skips any chunk wider than the render width (875px), but the chunker only split tiles by height. A PDF page (e.g. US Letter at the default 200 DPI is ~1700px wide) therefore produced full-width chunks that were all skipped -> 'no embeddings produced' (#47-style). Split tiles along the width too: when a tile is wider than viewport_width, cut it into even columns each <= viewport_width and tile in 2D (row strips x columns). Per-chunk pixel budget is unchanged, so memory/token cost per chunk is bounded exactly as before — wide pages just yield more chunks instead of being dropped or downscaled. Narrow web tiles (<= viewport_width) keep their old single-column height-strip layout and identical crops, so existing indexes stay reproducible. The manifest gains x_offset and a per-column width. The embed-time >875 skip is now a defensive guard (only fires on chunks built before this change; re-chunk with --force to recover them). Adds tests/test_chunk.py covering backward-compat + 2D splitting.
3a6fa4c to
0028dd5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The embedder skips any chunk wider than the render width (875px), but the chunker only split tiles along the height. A PDF page rendered at the default 200 DPI (US Letter ~1700px wide) produced full-width 1700px chunks, all skipped ->
no embeddings produced. Building an index from any PDF (or landscape/wide page) yielded zero embeddings.Fix
Split tiles along the width as well as the height: when a tile is wider than
viewport_width, cut it into even columns (each <=viewport_width) and tile in 2D (height row-strips x width columns).Manifest gains
x_offset+ per-columnwidth. The embed>875skip is now a defensive guard (re-chunk with--forceto recover pre-fix chunks).Tests
tests/test_chunk.py: narrow-tile-unchanged (byte-identical), short-narrow copied verbatim, wide PDF-size tile -> columns, very-wide 3800px stays <=875, short-wide splits width. Verified end-to-end on a real PDF (3800px page -> 10 chunks, max width 760, 0 skipped).