feat(local-paper-index): index raw markdown corpora, not just PDFs by gas2own · Pull Request #74 · allenai/asta-plugins

gas2own · 2026-06-09T21:39:58Z

What

Generalize local-paper-index's chunk-and-index.py so it can index a corpus that is already markdown — skipping PDF extraction entirely — while staying fully backward compatible for the existing PDF workflow.

Adds a --source-ext flag (default pdf):

# index a tree of authored .md docs directly
chunk-and-index.py my-notes /data/notes --index-path /data/notes/index.yaml --source-ext md

Canonical extra.source_file is always written; legacy extra.source_pdf is still written for --source-ext pdf, so existing consumers and indexes are unaffected.
Secondary tag becomes <ext>-index (md-index, txt-index, …); pdf-index unchanged for the default.
Resumability keys on source_file with a source_pdf fallback.
SKILL.md gains an "Indexing raw markdown (skipping PDF extraction)" section.
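
As a rough illustration of the metadata rules listed above for this first revision of the PR, the sketch below derives the secondary tag and the `extra` keys from the source extension, and reads the resumability key with the `source_pdf` fallback. The helper names (`secondary_tag`, `build_extra`, `resume_key`) are hypothetical and are not taken from chunk-and-index.py.

```python
from typing import Optional


def secondary_tag(source_ext: str) -> str:
    """Map a source extension to its secondary tag, e.g. 'md' -> 'md-index'."""
    return f"{source_ext.lower().lstrip('.')}-index"


def build_extra(source_name: str, source_ext: str) -> dict:
    """Build the per-chunk `extra` metadata (hypothetical helper)."""
    extra = {"source_file": source_name}  # canonical key, always written
    if secondary_tag(source_ext) == "pdf-index":
        # Legacy key, kept only for the PDF default so old consumers keep working.
        extra["source_pdf"] = source_name
    return extra


def resume_key(extra: dict) -> Optional[str]:
    """Prefer `source_file`; fall back to `source_pdf` for older index entries."""
    return extra.get("source_file") or extra.get("source_pdf")
```

With this shape, an entry written before the change (carrying only `source_pdf`) still yields a usable resumability key, which is the backward-compatibility property the description claims.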

Why

This is the enabling change for allenai/gas2own#150 — @rodneykinney recommended using asta-documents / local-paper-index for retrieval over gas2own's own investigation record (a corpus of authored .md), noting the skill "could easily be refactored to index raw .md documents." This is that refactor. The downstream gas2own retriever PR will consume it.

Verification

ruff check + ruff format --check clean on the changed file
validate-skills.py: all 15 SKILL.md valid
Smoke-indexed a 2-file markdown corpus → md-index tag, source_file metadata, relative URLs
Confirmed PDF default still emits source_pdf + pdf-index

🤖 Generated with Claude Code

The chunk-and-index script already ingests flat `.md` files, but it baked in a PDF-only contract: it synthesized a `.pdf` source name, tagged every chunk `pdf-index`, and stored only `extra.source_pdf`. That makes it awkward to index a corpus that was authored as markdown (notes, wikis, an investigation record) and never had a PDF.

Add `--source-ext` (default `pdf`, fully backward compatible) so callers can index raw markdown/text directly by skipping PDF extraction entirely:

    chunk-and-index.py my-notes /data/notes --index-path ... --source-ext md

- canonical `extra.source_file` is always written; legacy `extra.source_pdf` is preserved for `--source-ext pdf` so existing consumers/indexes are unaffected
- secondary tag becomes `<ext>-index` (`md-index`, `txt-index`, ...; `pdf-index` unchanged for the default)
- resumability now keys on source_file with a source_pdf fallback
- SKILL.md documents the raw-markdown path (skip Steps 1-2)

Verified: ruff check + format clean, validate-skills 15/15, smoke-indexed a markdown corpus, and confirmed the PDF default still emits source_pdf/pdf-index.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

rodneykinney

The url field already points to the source .md file in any case, so the extra.source_file field is redundant. It can be omitted.

For the case when the .md is downstream of a .pdf the only difference should be the addition of an extra field pointing to the upstream PDF. That can be the purpose of extra.source_pdf. If it is absent, then there is no upstream document, and url points to the original source.

Instead of the --source-ext flag, introduce a flag that points to the upstream directory of PDFs. The script can iterate over the PDFs in that directory to find the PDF corresponding to each .md file, instead of building a name based on assumptions. This would be an improvement over the existing logic for the pdf-to-md case.
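
A minimal sketch of the lookup suggested here, assuming that "corresponding" means the .md and the PDF share a basename (stem). The function name `match_upstream_pdfs` is hypothetical, and the choice to keep the first PDF when two share a stem is an assumption rather than behavior confirmed by the script.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def match_upstream_pdfs(
    md_paths: Iterable[Path], pdf_paths: Iterable[Path]
) -> Dict[Path, Optional[Path]]:
    """Map each markdown file to the PDF with the same stem, or None."""
    pdfs_by_stem: Dict[str, Path] = {}
    for pdf in pdf_paths:
        # Assumption: on duplicate basenames, the first PDF seen wins.
        pdfs_by_stem.setdefault(pdf.stem, pdf)
    # A markdown file with no PDF of the same stem has no upstream document.
    return {md: pdfs_by_stem.get(md.stem) for md in md_paths}
```

In practice the two inputs could come from something like `sorted(Path(md_dir).rglob("*.md"))` and `sorted(Path(pdf_dir).rglob("*.pdf"))`, so that only PDFs actually present on disk are ever referenced.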

Resolves modify/delete conflict: PR #72 relocated the local-paper-index skill to plugins/asta-preview/skills/. The markdown-indexing changes are re-applied at the new path in the following commit.

Rodney's feedback on PR #74:

- Drop extra.source_file: the url already points to the source .md, so it was redundant. The indexed document *is* the markdown.
- extra.source_pdf is now present only when the .md is downstream of a PDF; its absence means there is no upstream document and url is the original source. It now holds a real pointer (relative/file:// URL) to the PDF.
- Replace the --source-ext flag with --pdf-dir, pointing at the upstream PDF directory. The script iterates that directory and matches each .md to the PDF actually on disk (by basename) instead of synthesizing '<stem>.pdf', which is an improvement over the old pdf-to-md logic. Unmatched .md files are indexed without source_pdf and warned about.
- Secondary tag is pdf-index for PDF-derived markdown (unchanged for the existing workflow) and md-index for raw markdown.

Resumability now keys on url. SKILL.md Step 3 and the raw-markdown section updated to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gas2own · 2026-06-10T18:13:06Z

Thanks @rodneykinney — reworked per your feedback (pushed in 2e6f087, on top of a merge of main which had relocated the skill to plugins/asta-preview/skills/):

Dropped extra.source_file — the url already points at the source .md, so it was redundant. The indexed document is the markdown.
extra.source_pdf is now optional upstream provenance — present only when the .md is downstream of a PDF, holding a real pointer (relative / file:// URL) to it. Absent ⇒ no upstream document, url is the original source.
Replaced --source-ext with --pdf-dir — the script iterates that directory and matches each .md to the PDF actually on disk (by basename), instead of synthesizing <stem>.pdf. A .md with no match is indexed without source_pdf and warned about. This is the improvement to the pdf→md path you described.

Secondary tag follows the same signal: pdf-index for PDF-derived markdown (unchanged for the existing workflow), md-index for raw markdown. Resumability now keys on url. SKILL.md Step 3 + the raw-markdown section are updated to match.
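
The reworked contract can be sketched as follows. This is an illustration under assumptions, not the script's actual code: the entry layout (`url`, `tags`, `extra`) and the helper names `build_entry` and `pending_urls` are hypothetical, and the tag is derived purely from whether an upstream PDF pointer is present.

```python
from typing import Iterable, List, Optional


def build_entry(url: str, source_pdf: Optional[str] = None) -> dict:
    """Build one index entry; `source_pdf` is optional upstream provenance."""
    entry = {
        "url": url,  # always points at the indexed markdown itself
        # Assumption: the secondary tag follows the presence of source_pdf.
        "tags": ["pdf-index" if source_pdf else "md-index"],
        "extra": {},
    }
    if source_pdf:
        entry["extra"]["source_pdf"] = source_pdf  # pointer to the upstream PDF
    return entry


def pending_urls(candidate_urls: Iterable[str], existing_entries: Iterable[dict]) -> List[str]:
    """Resumability keyed on url: skip documents already in the index."""
    done = {e["url"] for e in existing_entries if "url" in e}
    return [u for u in candidate_urls if u not in done]
```

For example, `pending_urls(["a.md", "b.md"], [build_entry("a.md")])` leaves only `"b.md"` to be indexed on a resumed run.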

Merge conflict is resolved and all CI jobs are now running/green (they couldn't run before — see below).

robe-ai2 · 2026-06-15T17:22:46Z

@rodneykinney could you take another look at gas2own's latest on this?

rodneykinney

Looks good!

gas2own requested a review from rodneykinney June 9, 2026 21:39

rodneykinney requested changes Jun 10, 2026


gas2own agent and others added 2 commits June 10, 2026 18:08

Merge main into feat/local-doc-index-markdown

19f3bda


robe-ai2 requested a review from rodneykinney June 10, 2026 18:17

rodneykinney approved these changes Jun 15, 2026


robe-ai2 merged commit e17cb7d into main Jun 15, 2026
7 checks passed

robe-ai2 merged 3 commits into main from feat/local-doc-index-markdown
