fix(index): write article_id into tile manifests, stop guessing from dir names by andylizf · Pull Request #83 · StarTrail-org/PixelRAG

andylizf · 2026-06-23T13:00:41Z

Problem

The embed pipeline extracted article_id by parsing tile directory names ("3104240.png.tiles" → int("3104240")). This worked for URLs (dirs named by position index) but broke for:

- PDFs: the directory is named after the filename stem (e.g. "report.png.tiles" → int("report") → ValueError). The GPU path silently skips the tile; the CPU path falls back to a hash, so the resulting ID is misaligned with articles.json.
- Local files: same problem, since filename stems are not numeric.
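The failure mode is easy to reproduce in isolation. The snippet below is a minimal sketch, not the project's code: `legacy_article_id` is a hypothetical helper, and it assumes the ID is whatever precedes the first dot in the directory name.

```python
def legacy_article_id(tile_dir_name: str) -> int:
    """Hypothetical stand-in for the old behaviour: parse the ID from the dir name."""
    stem = tile_dir_name.split(".", 1)[0]  # "3104240.png.tiles" -> "3104240"
    return int(stem)  # raises ValueError when the stem is not numeric


# Position-indexed directory (URL sources): parses cleanly.
print(legacy_article_id("3104240.png.tiles"))

# Filename-stem directory (PDFs, local files): int() rejects the stem.
try:
    legacy_article_id("report.png.tiles")
except ValueError as exc:
    print(f"cannot derive article_id: {exc}")
```

Whatever a caller does with that ValueError (skip the tile, substitute a hash) happens out of sight of the stage that assigned the real ID, which is how the two embed paths ended up disagreeing with articles.json.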

Root cause: article_id was never explicitly communicated from the pipeline to embed — embed had to reverse-engineer it from the filesystem.

Fix

- Pipeline writes article_id into tile manifests (tiles.json + chunks.json) after rendering. This is the authoritative source: embed reads it from the manifest, not the directory name.
- render_pdf gains a stem parameter so the pipeline names PDF tile directories by position index (like URLs), making directory names consistent. Standalone pixelshot CLI still defaults to the filename stem.
- Backward compatible: embed falls back to directory name parsing when article_id is absent from the manifest (existing large-scale indexes like the Wikipedia corpus where dirs are already numeric).
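The write-then-read flow with its fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names are hypothetical, the manifests are assumed to be JSON objects with a top-level `article_id` key, and returning `None` for an unresolvable directory is a choice made for the sketch.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

MANIFEST_NAMES = ("tiles.json", "chunks.json")


def write_article_id(tile_dir: Path, article_id: int) -> None:
    """Pipeline side (hypothetical): stamp article_id into each existing manifest."""
    for name in MANIFEST_NAMES:
        path = tile_dir / name
        if not path.exists():
            continue
        manifest = json.loads(path.read_text())  # assumed to be a JSON object
        manifest["article_id"] = article_id
        path.write_text(json.dumps(manifest))


def resolve_article_id(tile_dir: Path) -> Optional[int]:
    """Embed side (hypothetical): manifest first, directory name as legacy fallback."""
    for name in MANIFEST_NAMES:
        path = tile_dir / name
        if path.exists():
            manifest = json.loads(path.read_text())
            if isinstance(manifest, dict) and "article_id" in manifest:
                return int(manifest["article_id"])
    try:
        # Legacy indexes: "3104240.png.tiles" -> 3104240
        return int(tile_dir.name.split(".", 1)[0])
    except ValueError:
        return None  # assumption: the caller decides how to handle this


with tempfile.TemporaryDirectory() as tmp:
    tile_dir = Path(tmp) / "report.png.tiles"
    tile_dir.mkdir()
    (tile_dir / "tiles.json").write_text(json.dumps({"tiles": []}))
    write_article_id(tile_dir, 7)
    print(resolve_article_id(tile_dir))  # the manifest value wins over the dir name
```

Checking the manifest before the directory name means a stamped ID always takes precedence, while older indexes with numeric directory names and no `article_id` key keep resolving as before.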

Follows the sidecar-metadata pattern used by ColPali, LEANN (passage_id_scheme + ids.txt), and Rulin Shao's MassiveDS.

What changed

| File | Change |
| --- | --- |
| `pipelines.py` | Write `article_id` into manifests after rendering; pass `stem=str(idx)` to `render_pdf` |
| `pdf.py` + `render.py` | Add `stem` parameter to `render_pdf` |
| `embed.py` | Read `article_id` from tiles.json/chunks.json first, fallback to dir name |
| `embed_cpu.py` | Same |
| `tests/test_article_id.py` | 5 tests: manifest wins, fallback works, non-numeric handled, override, multi-article |
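The effect of the `stem` override on directory naming can be illustrated with a small sketch. `tile_dir_for` is a hypothetical helper written for this example (it is not `render_pdf`), and the `.png.tiles` suffix is taken from the directory names quoted in the description.

```python
from pathlib import Path
from typing import Optional


def tile_dir_for(pdf_path: Path, out_root: Path, stem: Optional[str] = None) -> Path:
    """Hypothetical helper: pick the tile directory name for a rendered PDF."""
    # Default to the filename stem; callers may override it with a position index.
    name = stem if stem is not None else pdf_path.stem
    return out_root / f"{name}.png.tiles"


sources = [Path("docs/report.pdf"), Path("docs/notes.pdf")]

# Pipeline-style call: name each directory by position index, as for URLs.
for idx, src in enumerate(sources):
    print(tile_dir_for(src, Path("out"), stem=str(idx)))

# Standalone-style call: no override, so the filename stem is used.
print(tile_dir_for(sources[0], Path("out")))
```

Keeping the filename stem as the default leaves standalone use unchanged, while the pipeline opts in to numeric names that the legacy directory-name fallback can still parse.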

…dir names

The embed pipeline extracted article_id by parsing the tile directory name (e.g. "3104240.png.tiles" → int("3104240")). This broke for PDFs (directory named after the filename stem, e.g. "report.png.tiles" → int("report") fails) and for any non-numeric directory name. GPU embed skipped the tile silently; CPU embed used a hash fallback that produced IDs misaligned with articles.json.

Root cause: article_id was never explicitly communicated from the pipeline to the embed stage, so embed had to reverse-engineer it from the filesystem.

Fix: the pipeline now writes article_id into tiles.json and chunks.json after rendering. Embed reads it from the manifest first, falling back to directory name parsing for backward compatibility with existing large-scale indexes (e.g. the Wikipedia corpus where dir names are already numeric).

Also: render_pdf gains a stem parameter so the pipeline can name PDF tile directories by position index (like URLs), making directory names consistent across all source types.

Follows the same sidecar-metadata pattern used by ColPali (JSONL manifest mapping FAISS IDs to doc metadata) and LEANN (passage_id_scheme + ids.txt/offset map). See also Rulin Shao's MassiveDS (shared shard IDs).

vercel · 2026-06-23T13:00:47Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
web	Ready	Preview, Comment	Jun 23, 2026 1:01pm

vercel Bot deployed to Preview June 23, 2026 13:01 View deployment

andylizf wants to merge 1 commit into main from fix/article-id-manifest
