diff --git a/plugins/asta-preview/skills/local-paper-index/SKILL.md b/plugins/asta-preview/skills/local-paper-index/SKILL.md index 17b052e..f74a465 100644 --- a/plugins/asta-preview/skills/local-paper-index/SKILL.md +++ b/plugins/asta-preview/skills/local-paper-index/SKILL.md @@ -111,22 +111,24 @@ The script: ### Step 3: Chunk and build index ```bash -uv run --with pyyaml python3 /path/to/assets/chunk-and-index.py "$COLLECTION" "$MARKDOWN_DIR" --index-path "$INDEX_PATH" +uv run --with pyyaml python3 /path/to/assets/chunk-and-index.py "$COLLECTION" "$MARKDOWN_DIR" --index-path "$INDEX_PATH" --pdf-dir "$PDF_DIR" ``` The `--index-path` argument is **required**. The script: -- Computes paths relative to the index file's directory, storing **relative paths** in the `url` field — making the index portable across machines +- Computes paths relative to the index file's directory, storing **relative paths** in the `url` field — making the index portable across machines. The indexed document *is* the markdown: its `url` points at the `.md` file. - Reads each markdown file, splits into ~2000-char chunks at paragraph/sentence boundaries - Writes all documents to the index YAML in a single pass - Preserves any existing documents in the index (appends, does not overwrite) -- Skips PDFs already indexed for this collection (safe to re-run) +- Skips markdown files already indexed for this collection (resumability keys on `url`; safe to re-run) +- Resolves the upstream PDF for each `.md` by iterating `--pdf-dir` and matching on basename (the per-PDF subdirectory name, or the flat file stem) — it finds the PDF actually on disk rather than assuming a filename. A `.md` with no matching PDF is indexed without a `source_pdf` and warned about. - Each document gets: - - **Shared PDF metadata:** `source_pdf`, `collection` (in `extra`) + - **Shared metadata (in `extra`):** `collection`, plus `source_pdf` (a relative/`file://` pointer to the upstream PDF) **only when** `--pdf-dir` is given and a matching PDF is found - **Per-chunk metadata:** `chunk_index`, `total_chunks`, `chunk_chars`, `chunk_offset`, `file_chars` (in `extra`) - - **Tags:** ``, `pdf-index` + - **Tags:** ``, plus `pdf-index` for PDF-derived markdown or `md-index` for raw markdown Options: - `--chunk-size 2000` — adjust chunk size (default 2000 chars) +- `--pdf-dir "$PDF_DIR"` — directory of upstream source PDFs. Omit it when indexing authored markdown (see [Indexing raw markdown](#indexing-raw-markdown-no-pdfs) below). ### Step 4: Warm the search cache @@ -171,6 +173,31 @@ asta documents --root "$DATASET_ROOT" search --extra=".source_pdf contains some- asta documents --root "$DATASET_ROOT" list --tags="my-papers" ``` +## Indexing raw markdown (no PDFs) + +If your corpus is **already markdown** (authored `.md` docs, exported notes, an +investigation record, a wiki), there is nothing to extract — skip Steps 1–2 and +point the chunker straight at the markdown directory. Just omit `--pdf-dir`: + +```bash +COLLECTION="my-notes" +MARKDOWN_DIR="/data/notes" # a tree of .md files (rglob, nested OK) +INDEX_PATH="/data/notes/index.yaml" + +uv run --with pyyaml python3 /path/to/assets/chunk-and-index.py \ + "$COLLECTION" "$MARKDOWN_DIR" --index-path "$INDEX_PATH" + +bash /path/to/assets/warm-cache.sh "$(dirname "$INDEX_PATH")" +asta documents --root "$(dirname "$INDEX_PATH")" search \ + --summary="your query" --tags="$COLLECTION" --show-scores +``` + +The markdown is the source: each document's `url` points at the `.md`, the +secondary tag is `md-index`, and `extra.source_pdf` is absent (there is no +upstream PDF). Chunking, relative-path URLs, and resumability are identical to +the PDF path — the only difference between the two is whether `extra.source_pdf` +is present. + ## Storage Estimates | Collection size | Approx. index size | Approx. markdown size | diff --git a/plugins/asta-preview/skills/local-paper-index/assets/chunk-and-index.py b/plugins/asta-preview/skills/local-paper-index/assets/chunk-and-index.py index 93bc93e..aaa6e44 100644 --- a/plugins/asta-preview/skills/local-paper-index/assets/chunk-and-index.py +++ b/plugins/asta-preview/skills/local-paper-index/assets/chunk-and-index.py @@ -1,5 +1,5 @@ #!/usr/bin/env python3 -"""Chunk extracted markdown files and write an asta-documents YAML index. +"""Chunk markdown files and write an asta-documents YAML index. Writes the index YAML directly (no per-chunk CLI calls), following the same schema as asta-documents: version 1.0, documents list with uuid/name/url/ @@ -8,8 +8,20 @@ Usage: python3 chunk-and-index.py --index-path -The --index-path is required. The script computes relative URLs for markdown -files relative to the directory containing the index file. Files outside that +The input is always a directory of markdown files, and the indexed document +*is* that markdown: its ``url`` points at the ``.md`` file. The markdown may be +authored directly (a corpus of notes, a wiki, an investigation record) or it +may be extraction output from PDFs. + +When the markdown was extracted from PDFs, pass ``--pdf-dir`` pointing at the +directory of source PDFs. The script then iterates that directory to find the +PDF that actually corresponds to each ``.md`` file and records a pointer to it +in ``extra.source_pdf``. When ``--pdf-dir`` is omitted (or no matching PDF is +found), there is no upstream document and ``extra.source_pdf`` is simply absent +— ``url`` is the original source. + +The --index-path is required. The script computes relative URLs for files +relative to the directory containing the index file. Files outside that directory get absolute file:// URLs. The markdown-dir can contain either: @@ -18,12 +30,9 @@ - Flat .md files: markdown/paper1.md, markdown/paper2.md, ... -The source PDF name is derived from the subdirectory name (if nested) or -the .md file stem (if flat). - -Each PDF is represented by multiple documents in the index. They share -PDF-level metadata (source_pdf, collection) with per-chunk identifiers -(chunk_index, total_chunks). +Each markdown file is represented by multiple documents in the index. They +share file-level metadata (collection, and source_pdf when applicable) with +per-chunk identifiers (chunk_index, total_chunks). """ import argparse @@ -80,13 +89,13 @@ def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[tuple[str, int]]: return chunks -def make_url(md_file: Path, index_dir: Path) -> str: - """Compute a URL for a markdown file, relative to the index directory. +def make_url(path: Path, index_dir: Path) -> str: + """Compute a URL for a file, relative to the index directory. If the file is under the index directory, returns a relative path (portable, git-friendly). Otherwise returns an absolute file:// URL. """ - resolved = md_file.resolve() + resolved = path.resolve() try: rel = resolved.relative_to(index_dir) return str(rel) @@ -104,18 +113,41 @@ def load_existing_index(index_path: Path) -> dict: return {"version": "1.0", "documents": []} -def find_existing_pdfs(documents: list[dict], collection: str) -> set[str]: - """Find source_pdf values already in the index for this collection.""" +def find_existing_urls(documents: list[dict], collection: str) -> set[str]: + """Find the `url`s already indexed for this collection. + + The url (the markdown file itself) is the canonical identity of an indexed + document, so resumability keys on it. + """ seen = set() for doc in documents: - extra = doc.get("extra", {}) - if extra.get("collection") == collection: - pdf = extra.get("source_pdf", "") - if pdf: - seen.add(pdf) + if doc.get("extra", {}).get("collection") == collection: + url = doc.get("url") + if url: + seen.add(url) return seen +def build_pdf_index(pdf_dir: Path) -> dict[str, Path]: + """Map each PDF's stem to its path, for matching markdown files to sources. + + Iterates the actual PDFs under `pdf_dir` (recursively) rather than + synthesizing a filename, so the match reflects what is really on disk. + """ + pdf_index: dict[str, Path] = {} + for pdf in sorted(pdf_dir.rglob("*.pdf")): + # First writer wins; warn on an ambiguous stem collision. + if pdf.stem in pdf_index: + print( + f"WARNING: multiple PDFs share the stem '{pdf.stem}'; " + f"using {pdf_index[pdf.stem]}, ignoring {pdf}", + file=sys.stderr, + ) + continue + pdf_index[pdf.stem] = pdf + return pdf_index + + def main(): parser = argparse.ArgumentParser( description="Chunk markdown files and write asta-documents YAML index" @@ -125,7 +157,7 @@ def main(): ) parser.add_argument( "markdown_dir", - help="Directory containing PDF extraction output (subdirectories with .md + images, or flat .md files)", + help="Directory of markdown files (per-PDF subdirectories with .md + images, or flat .md files)", ) parser.add_argument( "--chunk-size", @@ -133,6 +165,15 @@ def main(): default=CHUNK_SIZE, help="Chunk size in characters (default: 2000)", ) + parser.add_argument( + "--pdf-dir", + help=( + "Directory of upstream source PDFs (when the markdown was extracted " + "from PDFs). The script iterates this directory to find the PDF " + "matching each .md file and records a pointer to it in " + "extra.source_pdf. Omit it when indexing authored markdown." + ), + ) parser.add_argument( "--index-path", required=True, @@ -149,6 +190,15 @@ def main(): print(f"Error: markdown directory not found: {md_dir}", file=sys.stderr) sys.exit(1) + pdf_index: dict[str, Path] = {} + if args.pdf_dir: + pdf_dir = Path(args.pdf_dir) + if not pdf_dir.exists(): + print(f"Error: PDF directory not found: {pdf_dir}", file=sys.stderr) + sys.exit(1) + pdf_index = build_pdf_index(pdf_dir) + print(f"Found {len(pdf_index)} source PDF(s) in {pdf_dir}") + # Find .md files: supports both per-PDF subdirectories (with images) and # flat .md files directly in markdown_dir. md_files = sorted(md_dir.rglob("*.md")) @@ -169,63 +219,85 @@ def main(): # Load existing index (preserves previously indexed documents) index_data = load_existing_index(index_path) - existing_pdfs = find_existing_pdfs(index_data["documents"], collection) + existing_urls = find_existing_urls(index_data["documents"], collection) now = datetime.now(UTC).isoformat() new_docs = 0 - pdfs_processed = 0 - pdfs_skipped_empty = 0 - pdfs_skipped_existing = 0 + docs_processed = 0 + docs_skipped_empty = 0 + docs_skipped_existing = 0 for md_file in md_files: text = md_file.read_text(encoding="utf-8") - # Derive the PDF name: if the .md is in a subdirectory of markdown_dir, - # use the subdirectory name (e.g. markdown/paper1/paper1.md -> paper1.pdf). + # Derive the basename: if the .md is in a subdirectory of markdown_dir, + # use the subdirectory name (e.g. markdown/paper1/paper1.md -> paper1). # If flat in markdown_dir, use the file stem. if md_file.parent != md_dir: basename = md_file.parent.name else: basename = md_file.stem - source_pdf = f"{basename}.pdf" if not text.strip(): print(f" [skip] {basename} (empty)") - pdfs_skipped_empty += 1 + docs_skipped_empty += 1 continue - if source_pdf in existing_pdfs: + url = make_url(md_file, index_dir) + + if url in existing_urls: print(f" [skip] {basename} (already indexed)") - pdfs_skipped_existing += 1 + docs_skipped_existing += 1 continue - url = make_url(md_file, index_dir) + # Resolve the upstream PDF, if any, by matching the basename against the + # PDFs actually present in --pdf-dir. + source_pdf_url = None + if args.pdf_dir: + pdf = pdf_index.get(basename) + if pdf is not None: + source_pdf_url = make_url(pdf, index_dir) + else: + print( + f" [warn] {basename}: no matching PDF in --pdf-dir; " + "indexing without source_pdf", + file=sys.stderr, + ) + + # Documents derived from a PDF keep the legacy `pdf-index` tag so + # existing consumers that filter on it still work; raw markdown gets + # `md-index`. + secondary_tag = "pdf-index" if source_pdf_url else "md-index" + file_size = len(text) chunks = chunk_text(text, chunk_size) for i, (chunk, offset) in enumerate(chunks, 1): + extra = { + "chunk_index": i, + "total_chunks": len(chunks), + "chunk_chars": len(chunk), + "chunk_offset": offset, + "file_chars": file_size, + "collection": collection, + } + # Present only when there is a real upstream PDF for this markdown. + if source_pdf_url: + extra["source_pdf"] = source_pdf_url doc_entry = { "uuid": generate_uuid(), "name": f"{basename} [chunk {i}/{len(chunks)}]", "mime_type": "text/markdown", "url": url, "summary": chunk, - "tags": [collection, "pdf-index"], + "tags": [collection, secondary_tag], "created_at": now, "modified_at": now, - "extra": { - "source_pdf": source_pdf, - "chunk_index": i, - "total_chunks": len(chunks), - "chunk_chars": len(chunk), - "chunk_offset": offset, - "file_chars": file_size, - "collection": collection, - }, + "extra": extra, } index_data["documents"].append(doc_entry) new_docs += 1 - pdfs_processed += 1 + docs_processed += 1 print(f" [index] {basename} ({len(chunks)} chunks) -> {url}") # Write index @@ -236,12 +308,12 @@ def main(): ) print() - print(f"PDFs processed: {pdfs_processed}") - print(f"PDFs skipped (empty): {pdfs_skipped_empty}") - print(f"PDFs skipped (exists): {pdfs_skipped_existing}") - print(f"New documents added: {new_docs}") - print(f"Total documents in idx: {len(index_data['documents'])}") - print(f"Index written to: {index_path}") + print(f"Sources processed: {docs_processed}") + print(f"Sources skipped (empty): {docs_skipped_empty}") + print(f"Sources skipped (exists):{docs_skipped_existing}") + print(f"New documents added: {new_docs}") + print(f"Total documents in idx: {len(index_data['documents'])}") + print(f"Index written to: {index_path}") if __name__ == "__main__":