fix(index): write article_id into tile manifests, stop guessing from dir names#83

Open
andylizf wants to merge 1 commit into main from fix/article-id-manifest

Problem

The embed pipeline extracted article_id by parsing tile directory names ("3104240.png.tiles" → int("3104240")). This worked for URLs (dirs named by position index) but broke for:

  • PDFs: directory named after the filename stem (e.g. "report.png.tiles" → int("report") → ValueError → GPU embed silently skips the tile, CPU embed uses a hash fallback → ID misaligned with articles.json)
  • Local files: same problem (filename stems aren't numeric)

Root cause: article_id was never explicitly communicated from the pipeline to embed — embed had to reverse-engineer it from the filesystem.
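
The failure mode is easy to reproduce in isolation. The sketch below is a hypothetical reconstruction of directory-name parsing, not the actual embed code; `article_id_from_dir` is an illustrative name, and the "take everything before the first dot" rule is an assumption based on the examples above.

```python
def article_id_from_dir(dir_name: str) -> int:
    """Hypothetical helper: treat the part before the first dot as the ID."""
    stem = dir_name.split(".", 1)[0]
    # int() raises ValueError for any non-numeric stem such as "report".
    return int(stem)


# Numeric directory names (URL sources, position-indexed) parse fine.
print(article_id_from_dir("3104240.png.tiles"))

# Filename-stem directory names (PDFs, local files) do not.
try:
    article_id_from_dir("report.png.tiles")
except ValueError as exc:
    print("cannot derive article_id:", exc)
```

Whatever the caller does with that ValueError (skip the tile, or substitute a hash) the resulting ID has no guaranteed relationship to the entry in articles.json.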

Fix

  1. Pipeline writes article_id into tile manifests (tiles.json + chunks.json) after rendering. This is the authoritative source — embed reads it from the manifest, not the directory name.

  2. render_pdf gains a stem parameter so the pipeline names PDF tile directories by position index (like URLs), making directory names consistent. Standalone pixelshot CLI still defaults to the filename stem.

  3. Backward compatible: embed falls back to directory name parsing when article_id is absent from the manifest (existing large-scale indexes like the Wikipedia corpus where dirs are already numeric).
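
A minimal sketch of the write-then-read flow described in steps 1 and 3, under stated assumptions: each manifest is a JSON object with a top-level "article_id" key, and the helper names `write_article_id` and `resolve_article_id` are hypothetical. The real manifest layout and function names in pipelines.py and embed.py may differ.

```python
import json
from pathlib import Path
from typing import Optional

# Manifest filenames named in this PR; the JSON layout below is an assumption.
MANIFEST_NAMES = ("tiles.json", "chunks.json")


def write_article_id(tile_dir: Path, article_id: int) -> None:
    """Pipeline side: stamp article_id into every manifest that exists."""
    for name in MANIFEST_NAMES:
        path = tile_dir / name
        if not path.exists():
            continue
        manifest = json.loads(path.read_text())
        manifest["article_id"] = article_id  # assumed top-level key
        path.write_text(json.dumps(manifest))


def resolve_article_id(tile_dir: Path) -> Optional[int]:
    """Embed side: the manifest wins, the directory name is the fallback."""
    for name in MANIFEST_NAMES:
        path = tile_dir / name
        if path.exists():
            manifest = json.loads(path.read_text())
            if isinstance(manifest, dict) and "article_id" in manifest:
                return int(manifest["article_id"])
    # Legacy indexes: numeric directory names such as "3104240.png.tiles".
    try:
        return int(tile_dir.name.split(".", 1)[0])
    except ValueError:
        return None  # caller decides how to report an unresolvable ID
```

Returning None for an unresolvable ID, rather than skipping silently or hashing, is a design choice of this sketch; it leaves the decision visible to the caller.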

Follows the sidecar-metadata pattern used by ColPali, LEANN (passage_id_scheme + ids.txt), and Rulin Shao's MassiveDS.

What changed

| File | Change |
| --- | --- |
| pipelines.py | Write article_id into manifests after rendering; pass stem=str(idx) to render_pdf |
| pdf.py + render.py | Add stem parameter to render_pdf |
| embed.py | Read article_id from tiles.json/chunks.json first, fall back to dir name |
| embed_cpu.py | Same |
| tests/test_article_id.py | 5 tests: manifest wins, fallback works, non-numeric handled, override, multi-article |

…dir names

The embed pipeline extracted article_id by parsing the tile directory name
(e.g. "3104240.png.tiles" → int("3104240")). This broke for PDFs (directory
named after the filename stem, e.g. "report.png.tiles" → int("report") fails)
and for any non-numeric directory name. GPU embed skipped the tile silently;
CPU embed used a hash fallback that produced IDs misaligned with articles.json.

Root cause: article_id was never explicitly communicated from the pipeline to
the embed stage — embed had to reverse-engineer it from the filesystem.

Fix: the pipeline now writes article_id into tiles.json and chunks.json after
rendering. Embed reads it from the manifest first, falling back to directory
name parsing for backward compatibility with existing large-scale indexes
(e.g. the Wikipedia corpus where dir names are already numeric).

Also: render_pdf gains a stem parameter so the pipeline can name PDF tile
directories by position index (like URLs), making directory names consistent
across all source types.
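
To illustrate the effect of the stem parameter, here is a small sketch. `tile_dir_for` is a hypothetical helper, not the real render_pdf signature, and the "<stem>.png.tiles" naming is inferred from the examples in this message.

```python
from pathlib import Path
from typing import Optional


def tile_dir_for(pdf_path: Path, out_root: Path, stem: Optional[str] = None) -> Path:
    """Hypothetical: pick the tile directory name for a rendered PDF.

    With no stem, fall back to the PDF's filename stem (standalone CLI use).
    With an explicit stem, the caller controls the name (pipeline use).
    """
    name = stem if stem is not None else pdf_path.stem
    return out_root / f"{name}.png.tiles"


pdf = Path("docs/report.pdf")
out = Path("out")

# Standalone use: directory named after the file.
print(tile_dir_for(pdf, out).name)

# Pipeline use: directory named by position index, like URL sources.
idx = 3
print(tile_dir_for(pdf, out, stem=str(idx)).name)
```

Keeping the default as the filename stem means existing standalone usage is unaffected, while the pipeline opts in to numeric names.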

Follows the same sidecar-metadata pattern used by ColPali (JSONL manifest
mapping FAISS IDs to doc metadata) and LEANN (passage_id_scheme +
ids.txt/offset map). See also Rulin Shao's MassiveDS (shared shard IDs).
@vercel

vercel Bot commented Jun 23, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| web | Ready | Preview, Comment | Jun 23, 2026 1:01pm |
