
feat(local-paper-index): index raw markdown corpora, not just PDFs #74

Merged
robe-ai2 merged 3 commits into main from feat/local-doc-index-markdown
Jun 15, 2026


@gas2own gas2own commented Jun 9, 2026

Contributor

What

Generalize local-paper-index's chunk-and-index.py so it can index a corpus that is already markdown — skipping PDF extraction entirely — while staying fully backward compatible for the existing PDF workflow.

Adds a --source-ext flag (default pdf):

# index a tree of authored .md docs directly
chunk-and-index.py my-notes /data/notes --index-path /data/notes/index.yaml --source-ext md
  • Canonical extra.source_file is always written; legacy extra.source_pdf is still written for --source-ext pdf, so existing consumers and indexes are unaffected.
  • Secondary tag becomes <ext>-index (md-index, txt-index, …); pdf-index unchanged for the default.
  • Resumability keys on source_file with a source_pdf fallback.
  • SKILL.md gains an "Indexing raw markdown (skipping PDF extraction)" section.
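The behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in chunk-and-index.py: the positional argument names (`name`, `corpus_dir`), the helper `chunk_metadata`, and the shape of the returned dict are assumptions made for the example; only `--index-path`, `--source-ext`, `extra.source_file`, `extra.source_pdf`, and the `<ext>-index` tag come from the description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI sketch; the positional names are hypothetical."""
    parser = argparse.ArgumentParser(prog="chunk-and-index.py")
    parser.add_argument("name")        # hypothetical: label for the corpus
    parser.add_argument("corpus_dir")  # hypothetical: directory to index
    parser.add_argument("--index-path", required=True)
    parser.add_argument("--source-ext", default="pdf")
    return parser


def chunk_metadata(stem: str, source_ext: str) -> dict:
    """Derive the secondary tag and extra fields for one source document."""
    ext = source_ext.lower().lstrip(".")
    source_file = f"{stem}.{ext}"
    extra = {"source_file": source_file}  # canonical field, always written
    if ext == "pdf":
        extra["source_pdf"] = source_file  # legacy field, PDF workflow only
    return {"secondary_tag": f"{ext}-index", "extra": extra}
```

With the default `--source-ext pdf` the sketch yields both `source_file` and `source_pdf` plus the `pdf-index` tag, so an existing consumer reading `source_pdf` would see no change.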

Why

This is the enabling change for allenai/gas2own#150. @rodneykinney recommended using asta-documents / local-paper-index for retrieval over gas2own's own investigation record (a corpus of authored .md), noting the skill "could easily be refactored to index raw .md documents." This is that refactor. The downstream gas2own retriever PR will consume it.

Verification

  • ruff check + ruff format --check clean on the changed file
  • validate-skills.py: all 15 SKILL.md valid
  • Smoke-indexed a 2-file markdown corpus → md-index tag, source_file metadata, relative URLs
  • Confirmed PDF default still emits source_pdf + pdf-index

🤖 Generated with Claude Code

The chunk-and-index script already ingests flat `.md` files, but it baked in
a PDF-only contract: it synthesized a `.pdf` source name, tagged every chunk
`pdf-index`, and stored only `extra.source_pdf`. That makes it awkward to index
a corpus that was authored as markdown (notes, wikis, an investigation record)
and never had a PDF.

Add `--source-ext` (default `pdf`, fully backward compatible) so callers can
index raw markdown/text directly by skipping PDF extraction entirely:

  chunk-and-index.py my-notes /data/notes --index-path ... --source-ext md

- canonical `extra.source_file` is always written; legacy `extra.source_pdf`
  is preserved for `--source-ext pdf` so existing consumers/indexes are unaffected
- secondary tag becomes `<ext>-index` (`md-index`, `txt-index`, ...; `pdf-index`
  unchanged for the default)
- resumability now keys on source_file with a source_pdf fallback
- SKILL.md documents the raw-markdown path (skip Steps 1-2)

Verified: ruff check + format clean, validate-skills 15/15, smoke-indexed a
markdown corpus, and confirmed the PDF default still emits source_pdf/pdf-index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gas2own gas2own requested a review from rodneykinney June 9, 2026 21:39

@rodneykinney rodneykinney left a comment

Member

The url field already points to the source .md file in any case, so the extra.source_file field is redundant. It can be omitted.

For the case when the .md is downstream of a .pdf the only difference should be the addition of an extra field pointing to the upstream PDF. That can be the purpose of extra.source_pdf. If it is absent, then there is no upstream document, and url points to the original source.

Instead of a --source-ext flag, introduce a flag that points to the upstream directory of PDFs. The script can iterate over PDFs in that directory to find the PDF corresponding to each .md file instead of building a name based on assumptions. This would be an improvement over the existing logic for the pdf-to-md case.
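A rough sketch of the directory-based lookup suggested here, for illustration only. The function name `match_upstream_pdfs` is hypothetical, and matching on the file stem is an assumption about what "corresponding" means; the script may use a different rule.

```python
from pathlib import Path
from typing import Dict, Optional


def match_upstream_pdfs(md_dir: Path, pdf_dir: Path) -> Dict[Path, Optional[Path]]:
    """Map each .md file to the PDF on disk with the same stem, or None."""
    # Index the PDFs that actually exist rather than guessing '<stem>.pdf'.
    # If two PDFs share a stem, the one that sorts last wins in this sketch.
    pdfs_by_stem = {pdf.stem: pdf for pdf in sorted(pdf_dir.rglob("*.pdf"))}
    return {md: pdfs_by_stem.get(md.stem) for md in sorted(md_dir.rglob("*.md"))}
```

A `None` value marks a markdown file with no upstream PDF, in which case `url` would be the original source.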

gas2own agent and others added 2 commits June 10, 2026 18:08
Resolves modify/delete conflict: PR #72 relocated the local-paper-index skill
to plugins/asta-preview/skills/. The markdown-indexing changes are re-applied
at the new path in the following commit.
Rodney's feedback on PR #74:

- Drop extra.source_file — the url already points to the source .md, so it
  was redundant. The indexed document *is* the markdown.
- extra.source_pdf is now present only when the .md is downstream of a PDF;
  its absence means there is no upstream document and url is the original
  source. It now holds a real pointer (relative/file:// URL) to the PDF.
- Replace the --source-ext flag with --pdf-dir, pointing at the upstream PDF
  directory. The script iterates that directory and matches each .md to the
  PDF actually on disk (by basename) instead of synthesizing '<stem>.pdf' —
  an improvement over the old pdf-to-md logic. Unmatched .md files are
  indexed without source_pdf and warned about.
- Secondary tag is pdf-index for PDF-derived markdown (unchanged for the
  existing workflow) and md-index for raw markdown. Resumability now keys on
  url. SKILL.md Step 3 and the raw-markdown section updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gas2own commented Jun 10, 2026

Contributor Author

Thanks @rodneykinney — reworked per your feedback (pushed in 2e6f087, on top of a merge of main which had relocated the skill to plugins/asta-preview/skills/):

  1. Dropped extra.source_file — the url already points at the source .md, so it was redundant. The indexed document is the markdown.
  2. extra.source_pdf is now optional upstream provenance — present only when the .md is downstream of a PDF, holding a real pointer (relative / file:// URL) to it. Absent ⇒ no upstream document, url is the original source.
  3. Replaced --source-ext with --pdf-dir — the script iterates that directory and matches each .md to the PDF actually on disk (by basename), instead of synthesizing <stem>.pdf. A .md with no match is indexed without source_pdf and warned about. This is the improvement to the pdf→md path you described.

Secondary tag follows the same signal: pdf-index for PDF-derived markdown (unchanged for the existing workflow), md-index for raw markdown. Resumability now keys on url. SKILL.md Step 3 + the raw-markdown section are updated to match.
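The reworked rules can be summarised in a short sketch. It is illustrative rather than the script's actual code: the helper names `build_entry` and `pending_urls`, the `secondary_tag` key, and the `expect_pdf` parameter are hypothetical, and it assumes the tag is chosen purely by whether a matching PDF was found.

```python
import logging
from typing import Iterable, List, Optional

logger = logging.getLogger("chunk-and-index")


def build_entry(md_url: str, pdf_url: Optional[str], expect_pdf: bool = False) -> dict:
    """Build one index entry; source_pdf is written only when a PDF exists."""
    entry = {
        "url": md_url,  # always points at the indexed markdown
        "secondary_tag": "pdf-index" if pdf_url else "md-index",
        "extra": {},
    }
    if pdf_url:
        entry["extra"]["source_pdf"] = pdf_url  # upstream provenance
    elif expect_pdf:
        # A PDF directory was supplied but nothing matched this file.
        logger.warning("no upstream PDF found for %s", md_url)
    return entry


def pending_urls(md_urls: Iterable[str], existing_entries: Iterable[dict]) -> List[str]:
    """Resumability keyed on url: skip documents that are already indexed."""
    done = {entry["url"] for entry in existing_entries if "url" in entry}
    return [url for url in md_urls if url not in done]
```

Keying the resume check on `url` works for both cases because every entry has a `url`, whereas `source_pdf` is now absent for raw markdown.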

Merge conflict is resolved and all CI jobs are now running/green (they couldn't run before — see below).

@robe-ai2 robe-ai2 requested a review from rodneykinney June 10, 2026 18:17
@robe-ai2


@rodneykinney could you take another look at gas2own's latest on this?

@rodneykinney rodneykinney left a comment

Member

Looks good!

@robe-ai2 robe-ai2 merged commit e17cb7d into main Jun 15, 2026
7 checks passed

3 participants