From eb5fdb693f5fe4da60f49c856125ddbdecbe23d7 Mon Sep 17 00:00:00 2001
From: nickwinder
Date: Wed, 27 May 2026 19:55:40 +1200
Subject: [PATCH 1/9] feat(dws): add /extraction/parse support to document-processor-api skill
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Teach the DWS skill how to call the now-GA /extraction/parse endpoint:

- scripts/parse.py — single primitive that accepts a local file plus mode
  and output_format, calls client.parse(), and writes the result.
  Modes: text (1 cr/pg), structure (1.5 cr/pg, default), understand
  (9 cr/pg), agentic (18 cr/pg). Output shapes: spatial elements or
  whole-document Markdown. Billed against extraction credits (separate
  from processor API credits). Prints usage summary after each run.

- references/parse-output-filtering.md — new reference doc showing
  downstream consumption patterns after a single /parse call: reading-order
  plain text, table-to-grid projection, key-value dict, formula LaTeX,
  picture alt descriptions. Includes Python snippets and jq one-liners for
  each pattern.

- references/script-catalog.md — adds parse.py entry under a new
  "Data Extraction" section with mode, cost, and output-shape summary.

- SKILL.md — adds a Data Extraction section covering: what /parse is
  (document-understanding primitive, not per-element-type calls), mode
  selection table keyed to user intent, default of structure+spatial for
  ambiguous requests, invocation examples, downstream-consumption
  quick-ref, and pointer to parse-output-filtering.md. Also updates skill
  description and task-scripts list.

Python client dependency: path-install of the local branch that adds
client.parse() support (file:// URL in the uv inline script header).
--- .../skills/document-processor-api/SKILL.md | 78 +++++- .../references/parse-output-filtering.md | 236 ++++++++++++++++++ .../document-processor-api/scripts/parse.py | 147 +++++++++++ 3 files changed, 457 insertions(+), 4 deletions(-) create mode 100644 plugins/nutrient-dws/skills/document-processor-api/references/parse-output-filtering.md create mode 100644 plugins/nutrient-dws/skills/document-processor-api/scripts/parse.py diff --git a/plugins/nutrient-dws/skills/document-processor-api/SKILL.md b/plugins/nutrient-dws/skills/document-processor-api/SKILL.md index 7df3bd0..66bfacc 100644 --- a/plugins/nutrient-dws/skills/document-processor-api/SKILL.md +++ b/plugins/nutrient-dws/skills/document-processor-api/SKILL.md @@ -3,9 +3,11 @@ name: document-processor-api description: >- Process documents with Nutrient DWS. Use when the user wants to generate PDFs from HTML or URLs, convert Office/images/PDFs, assemble or split packets, OCR scans, extract text/tables/key-value - pairs, redact PII, watermark, sign, fill forms, optimize PDFs, or produce compliance outputs like - PDF/A or PDF/UA. Triggers include convert to PDF, merge these PDFs, OCR this scan, extract tables, - redact PII, sign this PDF, make this PDF/A, or linearize for web delivery. + pairs, parse documents into a structural model or Markdown (for RAG indexing, form/invoice + extraction, or layout-aware understanding), redact PII, watermark, sign, fill forms, optimize + PDFs, or produce compliance outputs like PDF/A or PDF/UA. Triggers include convert to PDF, merge + these PDFs, OCR this scan, extract tables, parse this document, extract for RAG, redact PII, + sign this PDF, make this PDF/A, or linearize for web delivery. license: MIT metadata: author: nutrient-sdk @@ -37,6 +39,7 @@ Use Nutrient DWS for managed document workflows where fidelity, compliance, or m - Generate PDFs from HTML templates, uploaded assets, or remote URLs. - Convert Office, HTML, image, and PDF files between supported formats. 
- OCR scans and extract text, tables, or key-value pairs. +- Parse a document into its structural model or whole-document Markdown for RAG indexing, form/invoice extraction, or layout-aware understanding. - Redact PII, watermark, sign, fill forms, merge, split, rotate, flatten, or encrypt PDFs. - Produce delivery targets like PDF/A, PDF/UA, optimized PDFs, or linearized PDFs. - Check credits before large, batch, or AI-heavy runs. @@ -47,7 +50,73 @@ Use Nutrient DWS for managed document workflows where fidelity, compliance, or m 3. Use the modular `references/` docs and direct API payloads for capabilities that do not yet have a dedicated helper script, especially HTML/URL generation and compliance tuning. 4. Use local PDF utilities only for lightweight inspection. Use Nutrient when output fidelity or compliance matters. +## Data Extraction (`/extraction/parse`) + +Use `scripts/parse.py` for any task involving document understanding, content extraction, +RAG indexing, form data extraction, or layout analysis. + +**`/extraction/parse` is a document-understanding primitive**: one call returns the full +structural document model — typed elements with bounding boxes, confidence scores, and +reading order — or a whole-document Markdown string. You always receive all element types +in a single call. + +### Picking a mode + +Choose based on the user's intent and acceptable credit cost. All costs are +**extraction credits per page** — a separate billing bucket from the processor API +credits consumed by `/build`, `/sign`, OCR, and other Processor API endpoints. 
+ +| User intent | Mode | Output format | Cost | Notes | +|-------------|------|---------------|------|-------| +| RAG / search indexing / content migration — born-digital PDF | `text` | `markdown` | 1 cr/pg | Cheapest path; no OCR or AI needed | +| RAG / search indexing — scanned or image-based PDF | `structure` | `markdown` | 1.5 cr/pg | OCR required before Markdown assembly | +| Form / invoice extraction | `understand` | `spatial` | 9 cr/pg | AI classification for reliable key-value and table detection | +| Layout-aware document understanding | `understand` | `spatial` | 9 cr/pg | Semantic paragraph roles (Title, SectionHeader, etc.) | +| Deep visual understanding (charts, diagrams, alt text) | `agentic` | `spatial` | 18 cr/pg | VLM adds alt descriptions on every picture element | +| **Default / ambiguous intent** | **`structure`** | **`spatial`** | **1.5 cr/pg** | Good balance: OCR + spatial elements, low cost | + +When the user's intent is unclear, **default to `structure` mode with `spatial` output** +(1.5 extraction credits per page). Explain the cost/quality options and ask if a +different mode is preferable before running on large documents. + +### Invocation + +```bash +# Default: structure mode, spatial output +uv run scripts/parse.py --input doc.pdf --out out.json + +# Markdown for RAG (text mode — cheapest) +uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text + +# Form extraction (understand mode) +uv run scripts/parse.py --input doc.pdf --out out.json --mode understand + +# Agentic (VLM alt text on pictures) +uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic +``` + +The script prints extraction-credit usage after each run so you can verify the cost. 
+ +### Downstream consumption + +After a single `/parse` call, slice the response for common needs: + +- **Reading-order plain text**: walk `output.elements` sorted by `(page.pageIndex, readingOrder)`, join `paragraph` and `handwriting` `text` fields +- **Tables**: project `cells[]` on each `table` element into rows/columns using `cell.row` and `cell.column` +- **Key-value pairs**: read `pairs[]` on each `keyValueRegion` element — each pair has `.key.value` and `.value.value` +- **Formulas**: read `latex` on each `formula` element +- **Pictures**: read `classification` and `altDescription` (populated by `agentic` mode) on each `picture` element +- **Markdown output**: call with `--output-format markdown`; the script writes the Markdown string directly + +Full patterns with Python snippets and jq one-liners: `references/parse-output-filtering.md` + +### Input constraint + +`parse.py` only accepts **local file paths** — the underlying API endpoint is +multipart-only. For remote inputs, download the file first. + ## Single-operation scripts +- `parse.py` -> document understanding via `/extraction/parse` (structural model or whole-document Markdown) - `convert.py` -> convert between `pdf`, `pdfa`, `pdfua`, `docx`, `xlsx`, `pptx`, `png`, `jpeg`, `webp`, `html`, and `markdown` - `merge.py` -> merge multiple files into one PDF - `split.py` -> split one PDF into multiple PDFs by page ranges @@ -79,6 +148,7 @@ When the user asks for multiple operations in one run: - `split.py` requires a multi-page PDF and cannot extract ranges from a single-page document. - `delete-pages.py` must retain at least one page and cannot delete the entire document. - `sign.py` only accepts local file paths for the main PDF. +- `parse.py` only accepts local file paths (the `/extraction/parse` endpoint is multipart-only). ## Decision rules - Prefer a helper script when one already covers the requested operation cleanly. 
@@ -107,6 +177,7 @@ Read only what you need: - `references/generation-and-conversion.md` -> HTML/URL generation and format conversion - `references/pdf-manipulation.md` -> merge, split, page-range, rotate, and flatten workflows - `references/extraction-and-ocr.md` -> OCR, text extraction, tables, and key-value workflows +- `references/parse-output-filtering.md` -> `/extraction/parse` downstream consumption patterns (reading-order text, tables, key-values, formulas, pictures) - `references/security-signing-and-forms.md` -> redaction, watermarking, signatures, forms, and passwords - `references/compliance-and-optimization.md` -> PDF/A, PDF/UA, optimization, and linearization - `references/workflow-recipes.md` -> end-to-end sequencing patterns for common business document workflows @@ -127,4 +198,3 @@ Read only what you need: - Use process env injection at runtime (shell/export, secrets manager, or host env). - Restrict file access with `SANDBOX_PATH` to the minimum required working directory. - Before enabling MCP mode in production, verify package provenance and lock version. - diff --git a/plugins/nutrient-dws/skills/document-processor-api/references/parse-output-filtering.md b/plugins/nutrient-dws/skills/document-processor-api/references/parse-output-filtering.md new file mode 100644 index 0000000..9c7e525 --- /dev/null +++ b/plugins/nutrient-dws/skills/document-processor-api/references/parse-output-filtering.md @@ -0,0 +1,236 @@ +# Parse Output — Filtering and Downstream Patterns + +`/extraction/parse` returns a single document model in one call. You always receive all +element types at once — there is no per-type call. This document shows how to slice the +response into the shapes that downstream pipelines commonly need. + +All examples below assume you have already run `parse.py` with `--output-format spatial` +and saved the response to `out.json`. 
+
+---
+
+## Response structure
+
+```
+ParseResponse
+├── output
+│   ├── elements[]   (spatial mode)  — typed element list
+│   └── markdown     (markdown mode) — whole-document Markdown string
+├── metrics
+│   ├── pagesProcessed
+│   └── processingTimeMs
+└── usage
+    └── dataExtractionCredits
+        ├── cost — extraction credits used by this call
+        └── remainingCredits
+```
+
+### Element types (discriminated on `type`)
+
+| type | Key fields | Modes that produce it |
+|------------------|-----------------------------------------------------------------|-------------------------------|
+| `paragraph` | `text`, `role`, `words[]`, `bounds`, `readingOrder` | all |
+| `table` | `rowCount`, `columnCount`, `cells[]`, `bounds`, `readingOrder` | structure / understand / agentic |
+| `formula` | `latex`, `bounds` | understand / agentic |
+| `picture` | `classification`, `altDescription`, `bounds` | all (agentic adds VLM alt text) |
+| `keyValueRegion` | `pairs[]` (each with `key`/`value` entities + bounds) | understand / agentic |
+| `handwriting` | `text`, `words[]`, `bounds` | understand / agentic |
+
+---
+
+## Reading-order plain text
+
+Walk elements in `(page.pageIndex, readingOrder)` order, collect `text` from
+`paragraph` and `handwriting` elements, join with newlines.
+
+```python
+import json
+
+with open("out.json") as f:
+    response = json.load(f)
+
+elements = response["output"]["elements"]
+
+text_elements = [
+    e for e in elements
+    if e.get("type") in ("paragraph", "handwriting") and e.get("text")
+]
+
+text_elements.sort(
+    key=lambda e: (e.get("page", {}).get("pageIndex", 0), e.get("readingOrder", 0))
+)
+
+plain_text = "\n\n".join(e["text"] for e in text_elements)
+print(plain_text)
+```
+
+### jq equivalent
+
+```bash
+jq -r '
+  [.output.elements[]
+    | select(.type == "paragraph" or .type == "handwriting")
+    | select(.text != null)
+  ]
+  | sort_by([.page.pageIndex // 0, .readingOrder // 0])
+  | map(.text)
+  | join("\n\n")
+' out.json
+```
+
+---
+
+## Tables — rows and columns dict
+
+Each `TableElement` carries a flat `cells[]` list. Reconstruct rows/columns by grouping
+on `row` and `column` (both 0-indexed). Multi-span cells span `rowSpan` rows and
+`colSpan` columns.
+
+```python
+def table_to_grid(table: dict) -> list[list[str]]:
+    """Return a list-of-rows, each row a list of cell text values."""
+    rows = table.get("rowCount", 0)
+    cols = table.get("columnCount", 0)
+    grid = [[""] * cols for _ in range(rows)]
+    for cell in table.get("cells") or []:
+        r, c = cell.get("row", 0), cell.get("column", 0)
+        if r < rows and c < cols:
+            # Treat a missing or null text field as an empty cell.
+            grid[r][c] = cell.get("text") or ""
+    return grid
+
+
+tables = [e for e in elements if e.get("type") == "table"]
+for i, table in enumerate(tables):
+    print(f"Table {i} (page {table.get('page', {}).get('pageIndex', 0)}):")
+    for row in table_to_grid(table):
+        print(" | ".join(row))
+```
+
+### jq — extract all table cells as JSON
+
+```bash
+jq '[
+  .output.elements[]
+  | select(.type == "table")
+  | {
+      page: .page.pageIndex,
+      readingOrder,
+      rowCount,
+      columnCount,
+      rows: (
+        [ .cells[]?
| {row, col: .column, text} ] + | group_by(.row) + | map(sort_by(.col) | map(.text)) + ) + } +]' out.json +``` + +--- + +## Key-value regions — key/value dict + +`keyValueRegion` elements carry a `pairs[]` list. Each pair has a `key` entity and a +`value` entity, both with a `value` string field. + +```python +kv_regions = [e for e in elements if e.get("type") == "keyValueRegion"] +for region in kv_regions: + for pair in region.get("pairs") or []: + key_text = pair.get("key", {}).get("value", "") + val_text = pair.get("value", {}).get("value", "") + confidence = pair.get("relationshipConfidence") + print(f"{key_text!r}: {val_text!r} (confidence={confidence})") +``` + +### jq equivalent + +```bash +jq '[ + .output.elements[] + | select(.type == "keyValueRegion") + | .pairs[]? + | { key: .key.value, value: .value.value, confidence: .relationshipConfidence } +]' out.json +``` + +--- + +## Filtering by element type + +```python +from typing import Literal + +def filter_elements(elements: list[dict], type_: str) -> list[dict]: + return [e for e in elements if e.get("type") == type_] + +paragraphs = filter_elements(elements, "paragraph") +tables = filter_elements(elements, "table") +formulas = filter_elements(elements, "formula") +pictures = filter_elements(elements, "picture") +kv_regions = filter_elements(elements, "keyValueRegion") +handwriting = filter_elements(elements, "handwriting") +``` + +### jq + +```bash +# Count by type +jq '.output.elements | group_by(.type) | map({(.[0].type): length}) | add' out.json + +# All tables on page 0 +jq '[.output.elements[] | select(.type == "table" and .page.pageIndex == 0)]' out.json +``` + +--- + +## Formulas (LaTeX) + +```python +formulas = [e for e in elements if e.get("type") == "formula" and e.get("latex")] +for f in formulas: + print(f["latex"]) +``` + +--- + +## Pictures with alt descriptions (agentic mode) + +`agentic` mode uses a vision language model to generate `altDescription` on every +`picture` element. 
Other modes leave `altDescription` absent or empty.
+
+```python
+pictures = [e for e in elements if e.get("type") == "picture"]
+for pic in pictures:
+    print(f"[{pic.get('classification', 'unknown')}] {pic.get('altDescription', '')}")
+```
+
+---
+
+## Checking extraction-credit cost
+
+```python
+usage = response.get("usage", {})
+credits = usage.get("dataExtractionCredits", {})
+print(f"Cost: {credits.get('cost')} extraction credits")
+print(f"Remaining: {credits.get('remainingCredits')}")
+```
+
+Note: `dataExtractionCredits` reflects charges from the **extraction credits** bucket,
+which is separate from the **processor API credits** bucket used by `/build`, `/sign`,
+OCR, and other Processor API endpoints.
+
+---
+
+## Mode selection guide
+
+| Intent | Recommended mode | Cost | Why |
+|--------|-----------------|------|-----|
+| RAG / search indexing / content migration — born-digital PDF | `text` + `markdown` output | 1 cr/pg | No OCR needed; fastest path to a Markdown string |
+| RAG / search indexing — scanned or image PDF | `structure` + `markdown` output | 1.5 cr/pg | OCR required before Markdown assembly |
+| Form / invoice extraction | `understand` + `spatial` output | 9 cr/pg | AI classification needed for reliable key-value and table detection |
+| Layout-aware document understanding | `understand` + `spatial` output | 9 cr/pg | Semantic classification of paragraphs (Title, SectionHeader, etc.) |
+| Deep visual understanding (charts, diagrams) | `agentic` + `spatial` output | 18 cr/pg | VLM generates alt descriptions on every picture element |
+| Default / unknown intent | `structure` + `spatial` output | 1.5 cr/pg | Good balance: spatial elements with OCR, low cost |
+
+All costs are **extraction credits per page** — a separate billing bucket from
+processor API credits.
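As a rough illustration of the per-page rates in the mode selection guide, the sketch below estimates the extraction-credit cost of a run before it is made. The helper name `estimate_extraction_credits` is hypothetical, and the sketch assumes billing is strictly linear in page count (no rounding, minimum charge, or volume discount), which the guide does not state. Treat the `usage` block of a real response as the authoritative figure.

```python
# Per-page extraction-credit rates, copied from the mode selection guide.
CREDITS_PER_PAGE = {
    "text": 1.0,
    "structure": 1.5,
    "understand": 9.0,
    "agentic": 18.0,
}


def estimate_extraction_credits(pages: int, mode: str = "structure") -> float:
    """Estimate the extraction credits for one parse call.

    Hypothetical helper: assumes cost = pages * per-page rate, with no
    rounding or minimum charge. Verify against usage data from a real call.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if mode not in CREDITS_PER_PAGE:
        raise ValueError(f"unknown mode: {mode!r}")
    return pages * CREDITS_PER_PAGE[mode]


if __name__ == "__main__":
    # Compare every mode for a 6-page document before choosing one.
    for mode in CREDITS_PER_PAGE:
        print(f"{mode}: {estimate_extraction_credits(6, mode)} credits")
```

Under that linear assumption, a 6-page document in `structure` mode comes to 9 credits, which is useful as a sanity check before running `understand` or `agentic` on a large batch.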
diff --git a/plugins/nutrient-dws/skills/document-processor-api/scripts/parse.py b/plugins/nutrient-dws/skills/document-processor-api/scripts/parse.py new file mode 100644 index 0000000..1c87e8d --- /dev/null +++ b/plugins/nutrient-dws/skills/document-processor-api/scripts/parse.py @@ -0,0 +1,147 @@ +#!/usr/bin/env python3 +# /// script +# requires-python = ">=3.10" +# dependencies = ["nutrient-dws>=3.1.0"] +# /// +"""Parse a document using the Nutrient Data Extraction API (/extraction/parse). + +This script is the single primitive for document understanding via /extraction/parse. +One call returns the full structural document model — typed elements with bounding boxes, +confidence scores, and reading order — or a whole-document Markdown string. + +Billing note: /extraction/parse is billed against **extraction credits**, which are a +separate billing bucket from the processor API credits consumed by /build, /sign, OCR, +and other Processor API endpoints. + +Per-page extraction-credit costs by mode: + text: 1 extraction credit — fast Markdown from born-digital documents (no OCR/AI) + structure: 1.5 extraction credits — OCR + spatial elements with bounding boxes + understand: 9 extraction credits — AI layout analysis, table detection, semantic classification + agentic: 18 extraction credits — VLM-augmented; deepest visual understanding + +Output shapes: + spatial (default): response.output.elements — typed elements list + markdown: response.output.markdown — whole-document Markdown string + +Usage examples: + # Spatial elements (structure mode) — lowest-cost spatial extraction + uv run scripts/parse.py --input doc.pdf --out out.json + + # Markdown for RAG / search indexing (text mode — cheapest) + uv run scripts/parse.py --input doc.pdf --out out.md --output-format markdown --mode text + + # Form / invoice extraction (understand mode — typed elements with confidence) + uv run scripts/parse.py --input doc.pdf --out out.json --mode understand + + # Deep visual 
understanding (agentic mode — VLM descriptions on pictures) + uv run scripts/parse.py --input doc.pdf --out out.json --mode agentic --output-format spatial +""" + +import argparse +import asyncio +import json +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).parent)) +from lib.common import create_client, handle_error + + +async def main() -> None: + parser = argparse.ArgumentParser( + description=( + "Parse a document with the Nutrient Data Extraction API and write the result. " + "Billed against extraction credits (separate from processor API credits)." + ), + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Extraction credit costs per page: + text: 1 extraction credit (born-digital Markdown, no OCR) + structure: 1.5 extraction credits (OCR + spatial elements) [default] + understand: 9 extraction credits (AI layout + table detection) + agentic: 18 extraction credits (VLM-augmented) + +Output shapes: + spatial (default): typed element list at output.elements + markdown: whole-document Markdown at output.markdown +""", + ) + parser.add_argument( + "--input", + required=True, + help="Path to the local input document (PDF, image, or Office file).", + ) + parser.add_argument( + "--out", + required=True, + help="Output file path. Receives the full JSON response for spatial output, " + "or a .md file for markdown output.", + ) + parser.add_argument( + "--mode", + choices=["text", "structure", "understand", "agentic"], + default="structure", + help=( + "Processing mode controlling cost and quality. " + "text=1cr, structure=1.5cr (default), understand=9cr, agentic=18cr — " + "all costs are extraction credits per page." + ), + ) + parser.add_argument( + "--output-format", + dest="output_format", + choices=["spatial", "markdown"], + default="spatial", + help=( + "Shape of the output. " + "spatial: typed elements with bounds (default). " + "markdown: whole-document Markdown string." 
+ ), + ) + args = parser.parse_args() + + # Validate input is a local file (the /extraction/parse endpoint is multipart-only) + input_path = Path(args.input) + if not input_path.exists(): + print(f"Error: input file not found: {args.input}", file=sys.stderr) + sys.exit(1) + + client = create_client() + response = await client.parse( + input_path, + mode=args.mode, + output_format=args.output_format, + ) + + out_path = Path(args.out) + out_path.parent.mkdir(parents=True, exist_ok=True) + + if args.output_format == "markdown": + markdown = response.get("output", {}).get("markdown", "") + out_path.write_text(markdown, encoding="utf-8") + print(f"Wrote {args.out}") + else: + with open(out_path, "w", encoding="utf-8") as f: + json.dump(response, f, indent=2) + print(f"Wrote {args.out}") + + # Print usage summary so callers can see credit cost without opening the output file + usage = response.get("usage", {}) + credits_info = usage.get("dataExtractionCredits", {}) + cost = credits_info.get("cost") + remaining = credits_info.get("remainingCredits") + metrics = response.get("metrics", {}) + pages = metrics.get("pagesProcessed", "?") + if cost is not None: + remaining_str = f", remaining: {remaining}" if remaining is not None else "" + print( + f"Usage: {cost} extraction credits ({pages} page(s) at {args.mode} mode" + f"{remaining_str})" + ) + + +if __name__ == "__main__": + try: + asyncio.run(main()) + except Exception as e: + handle_error(e) From 3a48851bfcdc89aa5a3925dc15928adaedf58c33 Mon Sep 17 00:00:00 2001 From: nickwinder Date: Wed, 27 May 2026 22:09:02 +1200 Subject: [PATCH 2/9] feat(dws): split /extraction/parse into a dedicated document-extraction-api skill MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit DWS Extract is a separate product from DWS Processor — different API key, different credit pool, different billing. 
Splitting the parse primitive into its own skill removes the conflation and
lets agents pick the right product upfront.

- New skill: plugins/nutrient-dws/skills/document-extraction-api
  - parse.py + references/parse-output-filtering.md moved over via git mv
  - SKILL.md focused on the Data Extraction product, mode/output table,
    downstream consumption patterns, and the separate NUTRIENT_EXTRACT_API_KEY
  - Local lib/common.py with create_client() that reads
    NUTRIENT_EXTRACT_API_KEY (falls back to NUTRIENT_API_KEY for tenants on
    global keys) and constructs NutrientClient(api_key=..., extract_api_key=...)
  - Pinned to nutrient-dws>=3.1.0 in the script's PEP 723 metadata

- document-processor-api: removed the Data Extraction section, the parse.py
  entry, and the parse-output-filtering reference map row. Cross-link to the
  sibling skill in the frontmatter description and "When to use" section.

- AGENTS.md: advertise the new skill alongside the existing two.

- Fix latent bug in parse.py: was reading usage.dataExtractionCredits
  (camelCase) but the API returns data_extraction_credits (snake_case), so
  the credit-usage summary was silently skipped on every call.

Confirmed end-to-end via live smoke (6-page PDF, structure/spatial mode,
9 credits, ~46KB JSON, usage summary now prints correctly).
--- AGENTS.md | 3 +- .../skills/document-extraction-api/.gitignore | 2 + .../skills/document-extraction-api/SKILL.md | 128 ++++++++++++++++++ .../references/parse-output-filtering.md | 0 .../scripts/lib/common.py | 103 ++++++++++++++ .../scripts/parse.py | 6 +- .../skills/document-processor-api/SKILL.md | 89 ++---------- 7 files changed, 252 insertions(+), 79 deletions(-) create mode 100644 plugins/nutrient-dws/skills/document-extraction-api/.gitignore create mode 100644 plugins/nutrient-dws/skills/document-extraction-api/SKILL.md rename plugins/nutrient-dws/skills/{document-processor-api => document-extraction-api}/references/parse-output-filtering.md (100%) create mode 100644 plugins/nutrient-dws/skills/document-extraction-api/scripts/lib/common.py rename plugins/nutrient-dws/skills/{document-processor-api => document-extraction-api}/scripts/parse.py (95%) diff --git a/AGENTS.md b/AGENTS.md index 923ead4..2c47453 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -6,5 +6,6 @@ Each skill lives under `plugins//skills//SKILL.md`. Rea ## Available Skills -- **nutrient-dws / document-processor-api** — Convert, extract, transform, and secure documents via the Nutrient Document Web Services API (Python scripts via `uv`). +- **nutrient-dws / document-processor-api** — Convert, transform, redact, sign, watermark, OCR, and secure documents via the Nutrient DWS Processor API (Python scripts via `uv`). +- **nutrient-dws / document-extraction-api** — Parse documents into a structural model (typed elements with bounds) or whole-document Markdown via the Nutrient DWS Data Extraction API (`/extraction/parse`). Use for RAG ingestion, layout analysis, and form/invoice extraction. - **pdf-to-markdown / pdf-to-markdown** — Extract text from PDFs as structured, semantic Markdown. Use when converting a PDF to Markdown, extracting text from a PDF, or processing one or more PDFs into Markdown output. 
diff --git a/plugins/nutrient-dws/skills/document-extraction-api/.gitignore b/plugins/nutrient-dws/skills/document-extraction-api/.gitignore new file mode 100644 index 0000000..7a60b85 --- /dev/null +++ b/plugins/nutrient-dws/skills/document-extraction-api/.gitignore @@ -0,0 +1,2 @@ +__pycache__/ +*.pyc diff --git a/plugins/nutrient-dws/skills/document-extraction-api/SKILL.md b/plugins/nutrient-dws/skills/document-extraction-api/SKILL.md new file mode 100644 index 0000000..817d6b1 --- /dev/null +++ b/plugins/nutrient-dws/skills/document-extraction-api/SKILL.md @@ -0,0 +1,128 @@ +--- +name: document-extraction-api +description: >- + Parse documents into a structural model or whole-document Markdown via the Nutrient Data + Extraction API (`/extraction/parse`). Use when the user wants to extract layout, tables, + key-value pairs, formulas, or images with bounding boxes; build a RAG ingestion pipeline; + produce Markdown for search indexing or content migration; or run layout-aware document + understanding. Triggers include parse this document, extract layout, RAG pipeline, document + understanding, form/invoice extraction, layout analysis, or whole-document Markdown. +license: MIT +metadata: + author: nutrient-sdk + version: "1.0" + homepage: "https://www.nutrient.io/api/" + repository: "https://github.com/PSPDFKit-labs/nutrient-skills" + compatibility: "Requires Python 3.10+, uv, and internet. Works with Claude Code, Codex CLI, Gemini CLI, OpenCode, Cursor, Windsurf, GitHub Copilot, Amp, or any Agent Skills-compatible product." + short-description: "Parse documents into a structural model or Markdown via Nutrient Data Extraction" +--- + +# Nutrient Data Extraction + +Use Nutrient DWS Extract for document-understanding workflows where you need typed +elements (paragraphs, tables, formulas, pictures, key-value regions, handwriting) with +bounding boxes — or a clean Markdown representation of the whole document. 
+ +## When to use + +- Build a RAG ingestion pipeline: PDF -> Markdown -> chunks -> embeddings. +- Index content for search or migrate documents into a new CMS. +- Extract structured fields from forms and invoices (key/value pairs, tables, semantic regions). +- Reconstruct page layout for downstream rendering or comparison. +- Run layout-aware document understanding (semantic paragraph roles, table cell spans, + formulas in LaTeX, picture classification and alt descriptions). + +This skill is **only** for `/extraction/parse`. For PDF generation, conversion, OCR, +redaction, signing, watermarking, or any `/build`-based workflow, use the sibling +`document-processor-api` skill. + +## Setup + +DWS Extract is a separate product from DWS Processor and has its own API key. + +- Get a Nutrient DWS Extract API key at . +- Export it as `NUTRIENT_EXTRACT_API_KEY`: + ```bash + export NUTRIENT_EXTRACT_API_KEY="pdf_live_..." + ``` +- Scripts live in `scripts/` relative to this SKILL.md. Use the directory containing this + SKILL.md as the working directory: + ```bash + cd && uv run scripts/