feat(local-paper-index): index raw markdown corpora, not just PDFs #74
The chunk-and-index script already ingests flat `.md` files, but it baked in a PDF-only contract: it synthesized a `.pdf` source name, tagged every chunk `pdf-index`, and stored only `extra.source_pdf`. That makes it awkward to index a corpus that was authored as markdown (notes, wikis, an investigation record) and never had a PDF.

Add `--source-ext` (default `pdf`, fully backward compatible) so callers can index raw markdown/text directly by skipping PDF extraction entirely:

    chunk-and-index.py my-notes /data/notes --index-path ... --source-ext md

- canonical `extra.source_file` is always written; legacy `extra.source_pdf` is preserved for `--source-ext pdf` so existing consumers/indexes are unaffected
- secondary tag becomes `<ext>-index` (`md-index`, `txt-index`, ...; `pdf-index` unchanged for the default)
- resumability now keys on source_file with a source_pdf fallback
- SKILL.md documents the raw-markdown path (skip Steps 1-2)

Verified: ruff check + format clean, validate-skills 15/15, smoke-indexed a markdown corpus, and confirmed the PDF default still emits source_pdf/pdf-index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
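As a rough illustration of the tag and resumability rules in this commit message, here is a minimal sketch. The function names and the record shape (a plain dict standing in for a chunk's `extra` metadata) are assumptions made for the example, not the script's actual internals.

```python
from typing import Iterable, Optional, Set


def secondary_tag(source_ext: str = "pdf") -> str:
    """Build the secondary tag from the extension, e.g. 'md' -> 'md-index'."""
    return f"{source_ext.lstrip('.')}-index"


def resume_key(extra: dict) -> Optional[str]:
    """Prefer the canonical source_file; fall back to the legacy source_pdf."""
    return extra.get("source_file") or extra.get("source_pdf")


def already_indexed(existing_extras: Iterable[dict]) -> Set[str]:
    """Collect resume keys from existing index entries (hypothetical shape)."""
    keys = {resume_key(extra) for extra in existing_extras}
    keys.discard(None)  # entries carrying neither field cannot be resumed on
    return keys
```

With the fallback, an index written before this change (only `source_pdf`) and one written after it (`source_file`) both yield usable resume keys.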
rodneykinney
left a comment
The url field already points to the source .md file in any case, so the extra.source_file field is redundant. It can be omitted.
For the case when the .md is downstream of a .pdf the only difference should be the addition of an extra field pointing to the upstream PDF. That can be the purpose of extra.source_pdf. If it is absent, then there is no upstream document, and url points to the original source.
Instead of the --source-ext flag, introduce a flag that points to the upstream directory of PDFs. The script can iterate over the PDFs in that directory to find the PDF corresponding to each .md file instead of building a name based on assumptions. This would be an improvement over the existing logic for the pdf-to-md case.
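From a consumer's side, the contract proposed in this review could be read roughly as follows. This is only a sketch: the record is assumed to be a dict with a `url` key and an optional `extra` mapping, and `original_source` is a hypothetical helper name.

```python
def original_source(doc: dict) -> str:
    """Return the upstream PDF if one is recorded, otherwise the indexed url.

    Assumed record shape: {"url": "...", "extra": {"source_pdf": "..."}},
    where "extra" and "source_pdf" may both be absent.
    """
    extra = doc.get("extra") or {}
    # Absence of source_pdf means there is no upstream document,
    # so url already points at the original source.
    return extra.get("source_pdf") or doc["url"]
```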
Resolves modify/delete conflict: PR #72 relocated the local-paper-index skill to plugins/asta-preview/skills/. The markdown-indexing changes are re-applied at the new path in the following commit.
Rodney's feedback on PR #74:

- Drop extra.source_file: the url already points to the source .md, so it was redundant. The indexed document *is* the markdown.
- extra.source_pdf is now present only when the .md is downstream of a PDF; its absence means there is no upstream document and url is the original source. It now holds a real pointer (relative/file:// URL) to the PDF.
- Replace the --source-ext flag with --pdf-dir, pointing at the upstream PDF directory. The script iterates that directory and matches each .md to the PDF actually on disk (by basename) instead of synthesizing '<stem>.pdf', an improvement over the old pdf-to-md logic. Unmatched .md files are indexed without source_pdf and warned about.
- Secondary tag is pdf-index for PDF-derived markdown (unchanged for the existing workflow) and md-index for raw markdown. Resumability now keys on url.

SKILL.md Step 3 and the raw-markdown section updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
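The basename matching described in this commit might look something like the sketch below. It is an illustration under assumptions, not the script's code: `match_upstream_pdfs` and `secondary_tag` are hypothetical names, the markdown directory is assumed to be flat, and PDF extensions are matched case-insensitively.

```python
import warnings
from pathlib import Path
from typing import Dict, Optional


def match_upstream_pdfs(md_dir, pdf_dir=None) -> Dict[Path, Optional[Path]]:
    """Map each .md file to the PDF on disk with the same basename, if any."""
    pdfs_by_stem: Dict[str, Path] = {}
    if pdf_dir is not None:
        # Iterate what is actually on disk instead of synthesizing '<stem>.pdf'.
        for path in sorted(Path(pdf_dir).iterdir()):
            if path.is_file() and path.suffix.lower() == ".pdf":
                pdfs_by_stem.setdefault(path.stem, path)

    matches: Dict[Path, Optional[Path]] = {}
    for md_path in sorted(Path(md_dir).glob("*.md")):
        pdf_path = pdfs_by_stem.get(md_path.stem)
        if pdf_dir is not None and pdf_path is None:
            # Unmatched markdown is still indexed, just without source_pdf.
            warnings.warn(f"no upstream PDF found for {md_path.name}")
        matches[md_path] = pdf_path
    return matches


def secondary_tag(pdf_path: Optional[Path]) -> str:
    """pdf-index for PDF-derived markdown, md-index for raw markdown."""
    return "pdf-index" if pdf_path is not None else "md-index"
```

Calling `match_upstream_pdfs(md_dir)` without a PDF directory treats every file as raw markdown, so each one would be tagged `md-index` and carry no `source_pdf`.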
Thanks @rodneykinney — reworked per your feedback (pushed in 2e6f087, on top of a merge of
Secondary tag follows the same signal:
Merge conflict is resolved and all CI jobs are now running/green (they couldn't run before; see below).
@rodneykinney could you take another look at gas2own's latest on this?
What

Generalize `local-paper-index`'s `chunk-and-index.py` so it can index a corpus that is already markdown, skipping PDF extraction entirely, while staying fully backward compatible for the existing PDF workflow.

Adds a `--source-ext` flag (default `pdf`):

    # index a tree of authored .md docs directly
    chunk-and-index.py my-notes /data/notes --index-path /data/notes/index.yaml --source-ext md

- `extra.source_file` is always written; legacy `extra.source_pdf` is still written for `--source-ext pdf`, so existing consumers and indexes are unaffected.
- The secondary tag becomes `<ext>-index` (`md-index`, `txt-index`, …); `pdf-index` is unchanged for the default.
- Resumability keys on `source_file` with a `source_pdf` fallback.

Why

This is the enabling change for allenai/gas2own#150. @rodneykinney recommended using `asta-documents/local-paper-index` for retrieval over gas2own's own investigation record (a corpus of authored `.md`), noting the skill "could easily be refactored to index raw `.md` documents." This is that refactor. The downstream gas2own retriever PR will consume it.

Verification

- `ruff check` + `ruff format --check` clean on the changed file
- `validate-skills.py`: all 15 SKILL.md valid
- Smoke-indexed a markdown corpus: `md-index` tag, `source_file` metadata, relative URLs
- PDF default still emits `source_pdf` + `pdf-index`

🤖 Generated with Claude Code