fix(embed): 2D-tile wide pages so PDFs/landscape content isn't dropped (#47) #48

Open
andylizf wants to merge 1 commit into main from fix/pdf-2d-tiling

@andylizf

Contributor

Problem

The embedder skips any chunk wider than the render width (875px), but the chunker split tiles only along the height. A PDF page rendered at the default 200 DPI (US Letter is ~1700px wide) therefore produced full-width 1700px chunks, all of which were skipped -> no embeddings produced. Building an index from any PDF (or any landscape/wide page) yielded zero embeddings.
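The arithmetic behind the failure can be checked with a short sketch. The helper name `rendered_width_px` is hypothetical and is not taken from the codebase; only the 875px render width, the 200 DPI default, and the US Letter page size come from the description above.

```python
RENDER_WIDTH_PX = 875  # embedder's render width, as quoted in this PR


def rendered_width_px(page_width_in: float, dpi: int = 200) -> int:
    """Pixel width of a page rasterized at the given DPI (hypothetical helper)."""
    return round(page_width_in * dpi)


# US Letter is 8.5in wide, so at 200 DPI the raster is 1700px wide.
letter_width = rendered_width_px(8.5)

# A full-width chunk of that page exceeds the render width, so the
# pre-fix embedder would have skipped it.
would_be_skipped = letter_width > RENDER_WIDTH_PX
```

Since every height-only strip of such a page keeps the full 1700px width, every chunk trips the same check.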

Fix

Split tiles along the width as well as the height: when a tile is wider than viewport_width, cut it into even columns (each <= viewport_width) and tile in 2D (height row-strips x width columns).

  • Bounded memory — per-chunk pixel/token budget unchanged (<= viewport_width x 1024); wide pages yield more chunks, not bigger ones. No OOM, no downscale.
  • Backward compatible — narrow web tiles (<= viewport_width) keep the old single-column height-strip layout with identical crops/filenames; existing indexes stay reproducible.
  • No data loss — wide content is split, not skipped or downscaled.
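The splitting rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `column_spans` and `tile_2d` are hypothetical names, the 1024px row height is inferred from the `viewport_width x 1024` budget, and "even columns" is assumed to mean the smallest column count that keeps each column within `viewport_width`.

```python
import math


def column_spans(tile_width: int, viewport_width: int = 875) -> list[tuple[int, int]]:
    """Return (x_offset, width) for each column of a tile (hypothetical helper)."""
    if tile_width <= viewport_width:
        # Narrow tiles keep the old single-column layout.
        return [(0, tile_width)]
    n_cols = math.ceil(tile_width / viewport_width)
    base, extra = divmod(tile_width, n_cols)
    spans, x = [], 0
    for i in range(n_cols):
        # Spread any remainder pixels over the first columns.
        width = base + (1 if i < extra else 0)
        spans.append((x, width))
        x += width
    return spans


def tile_2d(tile_width: int, tile_height: int,
            viewport_width: int = 875, max_height: int = 1024) -> list[tuple[int, int, int, int]]:
    """Return (x, y, width, height) crop boxes: row strips x columns."""
    boxes = []
    for y in range(0, tile_height, max_height):
        height = min(max_height, tile_height - y)
        for x, width in column_spans(tile_width, viewport_width):
            boxes.append((x, y, width, height))
    return boxes
```

Under these assumptions a 1700px-wide tile becomes two 850px columns and a 3800px-wide tile becomes five 760px columns, which is consistent with the maximum width of 760 reported in the tests below. No crop box exceeds `viewport_width x max_height`, so wide pages produce more chunks rather than larger ones.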

The manifest gains x_offset and a per-column width. The embed-time >875px skip is now only a defensive guard: it fires only on chunks built before this change (re-chunk with --force to recover them).
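As a rough illustration of how the guard interacts with the new manifest fields, the sketch below partitions manifest entries by width. The entry layout, the file names, and the `split_by_width` function are all hypothetical; only the `x_offset` and `width` fields and the 875px limit come from this PR.

```python
RENDER_WIDTH_PX = 875  # render width quoted in this PR


def split_by_width(manifest: list[dict], render_width: int = RENDER_WIDTH_PX):
    """Separate embeddable entries from ones the guard would skip (hypothetical)."""
    embeddable, too_wide = [], []
    for entry in manifest:
        target = embeddable if entry["width"] <= render_width else too_wide
        target.append(entry)
    return embeddable, too_wide


# Hypothetical manifest: two post-fix columns plus one pre-fix full-width chunk.
manifest = [
    {"file": "page1_r0_c0.png", "x_offset": 0, "width": 850},
    {"file": "page1_r0_c1.png", "x_offset": 850, "width": 850},
    {"file": "old_page_r0.png", "x_offset": 0, "width": 1700},
]

embeddable, too_wide = split_by_width(manifest)
```

Here only the pre-fix 1700px entry lands in `too_wide`, which is the case the guard is meant to catch until the page is re-chunked.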

Tests

tests/test_chunk.py covers: narrow tile unchanged (byte-identical), short narrow tile copied verbatim, wide PDF-size tile split into columns, very wide 3800px tile with every chunk <= 875px, and short wide tile split along the width. Verified end-to-end on a real PDF (3800px page -> 10 chunks, max width 760, 0 skipped).

@vercel

vercel Bot commented Jun 19, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| web | Ready | Preview, Comment | Jun 19, 2026 7:21am |

The embedder skips any chunk wider than the render width (875px), but the
chunker only split tiles by height. A PDF page (e.g. US Letter at the default
200 DPI is ~1700px wide) therefore produced full-width chunks that were all
skipped -> 'no embeddings produced' (#47-style).

Split tiles along the width too: when a tile is wider than viewport_width, cut
it into even columns each <= viewport_width and tile in 2D (row strips x
columns). Per-chunk pixel budget is unchanged, so memory/token cost per chunk
is bounded exactly as before — wide pages just yield more chunks instead of
being dropped or downscaled. Narrow web tiles (<= viewport_width) keep their
old single-column height-strip layout and identical crops, so existing indexes
stay reproducible. The manifest gains x_offset and a per-column width.

The embed-time >875 skip is now a defensive guard (only fires on chunks built
before this change; re-chunk with --force to recover them).

Adds tests/test_chunk.py covering backward-compat + 2D splitting.