PSPDFKit-labs · nickwinder · May 29, 2026 · May 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,7 +7,31 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
-_Nothing yet._
+### Added
+
+- First-class client support for the Data Extraction API (`POST /extraction/parse`).
+  - `NutrientClient` accepts an `extractApiKey` option (string or async getter)
+    that `parse()` uses in place of `apiKey`. Data Extraction is a separate
+    product with its own credit pool, so `/extraction/parse` rejects the
+    Processor key with 403. When `extractApiKey` is omitted, `parse()` falls
+    back to `apiKey`, which works only on tenants with global DWS keys.
+  - `NutrientClient.parse(input, options?)` — full request/response surface with
+    typed support for all four modes (`text`, `structure`, `understand`, `agentic`)
+    and both output formats (`spatial`, `markdown`).
+  - `NutrientClient.parseToMarkdown(input, mode?)` — convenience wrapper returning
+    the whole-document Markdown string directly.
+  - `NutrientClient.parseElements(input, mode?, includeWords?)` — convenience
+    wrapper returning the spatial elements array directly.
+  - Public types: hand-composed `ParseOutputOptions`, `ParseInstructions`,
+    `ParseOptions`, `ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`,
+    and `ExtractionCredits`. The spec primitives (`Mode`, `Element` and the six
+    subtypes, `Bounds`, `PageRef`, `Word`, `Metrics`, `Usage`, `Configuration`,
+    `ParseErrorResponse`, etc.) are accessible via the `extractComponents`
+    namespace re-export — same pattern as `components` for the Processor spec.
+  - Billing note: `/extraction/parse` debits the account's **extraction
+    credits** bucket, which is separate from the **processor API credits** used
+    by the rest of `NutrientClient`. The response surfaces this explicitly in
+    `usage.data_extraction_credits`.
 
 ## [2.0.0] - 2026-01-27
 

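The key fallback described in the CHANGELOG entry above can be pictured with a small sketch. This is not the SDK's source: `KeySource`, `KeyOptions`, `selectParseKeySource`, and `resolveParseKey` are hypothetical names, and the only behaviour taken from the entry is that `extractApiKey` may be a string or an async getter and that `apiKey` is used when it is omitted.

```typescript
// Hypothetical sketch of the documented fallback; names are illustrative only.
type KeySource = string | (() => Promise<string>);

interface KeyOptions {
  apiKey: KeySource; // Processor key
  extractApiKey?: KeySource; // Data Extraction key (optional)
}

// Report which configured key a parse-style call would use.
function selectParseKeySource(options: KeyOptions): 'extractApiKey' | 'apiKey' {
  return options.extractApiKey !== undefined ? 'extractApiKey' : 'apiKey';
}

// Resolve the selected key, awaiting the getter form when one was supplied.
async function resolveParseKey(options: KeyOptions): Promise<string> {
  const source = options.extractApiKey ?? options.apiKey;
  return typeof source === 'string' ? source : await source();
}
```

On a tenant without a global DWS key, the fallback branch would still send the Processor key, which the entry says is rejected with 403, so configuring `extractApiKey` explicitly is the safer default.
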
diff --git a/LLM_DOC.md b/LLM_DOC.md
@@ -461,6 +461,64 @@ if (kvps && kvps.length > 0) {
 }
 ```
 
+#### parse(input, options?)
+Extracts structured content from a document via the Data Extraction API (`POST /extraction/parse`).
+
+Billed against **extraction credits** (a separate bucket from processor API credits used by every other method). Mode costs per page:
+- `text` — 1 extraction credit (Markdown only)
+- `structure` — 1.5 extraction credits (spatial elements)
+- `understand` — 9 extraction credits (default)
+- `agentic` — 18 extraction credits
+
+Data Extraction is a separate product with its own API key. Pass it as `extractApiKey` on the client constructor:
+
+```typescript
+const client = new NutrientClient({
+  apiKey: process.env.NUTRIENT_API_KEY!,
+  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!,
+});
+```
+
+`parse()` falls back to `apiKey` when `extractApiKey` is omitted; the fallback only works on tenants with global DWS keys.
+
+```typescript
+// Full call: spatial elements with bounding boxes, confidence, reading order
+const result = await client.parse('invoice.pdf', {
+  mode: 'understand',
+  output: { format: 'spatial', includeWords: true },
+  language: ['eng', 'spa'],
+});
+
+if (result.output.elements !== undefined) {
+  for (const el of result.output.elements) {
+    if (el.type === 'paragraph') console.log(el.text);
+  }
+}
+
+// Extraction-credit accounting (separate from processor credits):
+console.log(result.usage?.data_extraction_credits?.cost);
+
+// URL input (server fetches the URL):
+const remote = await client.parse('https://example.com/doc.pdf', { mode: 'text' });
+```
+
+#### parseToMarkdown(input, mode?)
+Convenience wrapper that returns just the whole-document Markdown string. Defaults to `mode='text'` (cheapest, 1 extraction credit/page).
+
+```typescript
+const markdown = await client.parseToMarkdown('document.pdf');
+const richer = await client.parseToMarkdown('scan.pdf', 'understand');
+```
+
+#### parseElements(input, mode?, includeWords?)
+Convenience wrapper that returns just the array of spatial elements. Defaults to `mode='structure'`. `mode='text'` is not accepted because it produces Markdown only, not spatial elements.
+
+```typescript
+const elements = await client.parseElements('document.pdf');
+const tables = elements.filter(e => e.type === 'table');
+const withWords = await client.parseElements('scan.pdf', 'understand', true);
+```
+
 #### flatten(file, annotationIds?)
 Flattens annotations in a PDF document.
 

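The per-page mode costs listed in the LLM_DOC.md hunk above lend themselves to a quick budget check before a batch run. The sketch below is only an estimate under stated assumptions: `ParseMode`, `CREDITS_PER_PAGE`, and `estimateExtractionCredits` are hypothetical local names (not SDK exports), and the code assumes cost scales linearly with page count. The authoritative figure is the `usage.data_extraction_credits.cost` value returned with each response.

```typescript
// Hypothetical helper; the rates are copied from the documented per-page costs.
type ParseMode = 'text' | 'structure' | 'understand' | 'agentic';

const CREDITS_PER_PAGE: Record<ParseMode, number> = {
  text: 1,
  structure: 1.5,
  understand: 9,
  agentic: 18,
};

// Estimate extraction credits for a document, assuming linear per-page billing.
// The default mode mirrors the documented default for parse() ('understand').
function estimateExtractionCredits(pageCount: number, mode: ParseMode = 'understand'): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError('pageCount must be a non-negative integer');
  }
  return pageCount * CREDITS_PER_PAGE[mode];
}
```

For example, `estimateExtractionCredits(10)` gives 90 under these assumptions, while the same ten pages in `'structure'` mode give 15.
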
diff --git a/README.md b/README.md
@@ -133,6 +133,177 @@ const mergedPdf = await client.merge(['doc1.pdf', 'doc2.pdf', 'doc3.pdf']);
 
 For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md).
 
+## Data Extraction (`/extraction/parse`)
+
+`client.parse()` exposes Nutrient's Data Extraction API. It's designed for
+**content-extraction workflows** where you need to feed document content into a
+downstream pipeline rather than render or transform the document itself:
+
+- **RAG / search indexing / content migration** — pull a clean Markdown
+  representation of a document for chunking, embedding, and indexing in a
+  vector store or search engine.
+- **Form and invoice extraction** — pull structured fields (key/value pairs,
+  tables, semantic regions) out of business documents with bounding boxes and
+  confidence scores attached to every element.
+- **Layout-aware document understanding** — get a typed, page-anchored element
+  list (paragraphs with semantic roles, tables with cell spans, formulas in
+  LaTeX, pictures, handwriting) suitable for building document-comprehension
+  tooling, including agentic workflows.
+
+The endpoint accepts PDFs, Office documents (Word, Excel, PowerPoint), and
+images. Unlike `sign()`, it is not restricted to PDFs.
+
+### Choosing an output format
+
+| Format              | Best for                                                                    | Shape                                                           |
+| ------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------- |
+| `markdown`          | RAG, search indexing, content migration — anywhere structured text beats spatial data | `response.output.markdown` — a single Markdown string          |
+| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | `response.output.elements` — flat array of typed elements       |
+
+### Setup — separate Extract API key
+
+Data Extraction is a separate product from the DWS Processor with its own
+credit pool and its own API key. Pass both keys when constructing the client:
+
+```typescript
+const client = new NutrientClient({
+  apiKey: process.env.NUTRIENT_API_KEY!,          // Processor key
+  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!, // Data Extraction key
+});
+```
+
+`extractApiKey` is consulted only by `parse()`, `parseToMarkdown()`, and
+`parseElements()`. Every other method on the client (`convert`, `sign`, `ocr`,
+`merge`, …) keeps using `apiKey`. If you omit `extractApiKey`, the parse
+methods fall back to `apiKey` — that fallback only works on tenants whose
+single DWS key authorises both products.
+
+### Quick start
+
+```typescript
+import { NutrientClient } from '@nutrient-sdk/dws-client-typescript';
+
+const client = new NutrientClient({
+  apiKey: process.env.NUTRIENT_API_KEY!,
+  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!,
+});
+
+// Spatial elements (default) — paragraphs, tables, key-value regions, etc.
+const result = await client.parse('contract.pdf', { mode: 'understand' });
+if (result.output.elements !== undefined) {
+  for (const el of result.output.elements) {
+    if (el.type === 'table') console.log(`${el.rowCount}x${el.columnCount} table`);
+  }
+}
+
+// Whole-document Markdown from a born-digital PDF.
+const mdResult = await client.parse('report.pdf', { mode: 'text' });
+if (mdResult.output.markdown !== undefined) {
+  console.log(mdResult.output.markdown);
+}
+```
+
+### Modes — when to use which
+
+| Mode         | Credits / page | When to use                                                                                                               |
+| ------------ | -------------- | ------------------------------------------------------------------------------------------------------------------------- |
+| `text`       | 1              | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown.                                       |
+| `structure`  | 1.5            | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input that requires OCR.          |
+| `understand` | 9              | Full pipeline with AI augmentation on top of OCR. Most accurate for tables, multi-column layouts, formulas, and forms.   |
+| `agentic`    | 18             | Builds on `understand` and adds a vision-language model. Best for image descriptions and complex visual layouts.         |
+
+### Recipes
+
+**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store:
+
+```typescript
+const result = await client.parse('whitepaper.pdf', { mode: 'text' });
+const markdown = result.output.markdown!;
+// Then: chunk on headings, embed, push to your vector store.
+```
+
+For born-digital PDFs, `mode: 'text'` is the cheapest path (1 credit/page).
+For scanned PDFs or images, switch to `mode: 'structure'` so OCR runs, and request `output: { format: 'markdown' }` since `spatial` is the default format.
+
+Or use the convenience wrapper:
+
+```typescript
+const markdown = await client.parseToMarkdown('whitepaper.pdf');
+```
+
+**Form/invoice extraction** — PDF → spatial elements → structured object:
+
+```typescript
+const result = await client.parse('invoice.pdf', { mode: 'understand' });
+const elements = result.output.elements!;
+
+// Pull key/value pairs from form regions.
+const fields: Record<string, unknown> = {};
+for (const el of elements) {
+  if (el.type === 'keyValueRegion') {
+    for (const pair of el.pairs) {
+      if (pair.key && pair.value) {
+        fields[String(pair.key.value)] = pair.value.value;
+      }
+    }
+  }
+}
+
+// Walk tables — each cell carries row/col indices and span counts.
+for (const el of elements) {
+  if (el.type === 'table') {
+    console.log(`Table: ${el.rowCount}×${el.columnCount}`);
+    for (const cell of el.cells) {
+      console.log(`  [${cell.row}][${cell.column}] ${cell.text}`);
+    }
+  }
+}
+```
+
+For complex documents that mix dense images with text, step up to
+`mode: 'agentic'` so the VLM produces image descriptions and semantic
+classifications (18 credits/page).
+
+Or use the convenience wrapper to skip output-format discrimination entirely:
+
+```typescript
+const elements = await client.parseElements('invoice.pdf', 'understand');
+```
+
+### Billing — extraction credits vs processor credits
+
+`/extraction/parse` is billed against **extraction credits**, a separate
+billing bucket from the **processor API credits** consumed by `convert`,
+`ocr`, `sign`, `merge`, and every other endpoint on this client. The two
+buckets never debit each other.
+
+Extraction-credit accounting is returned per request:
+
+```typescript
+const result = await client.parse('document.pdf', { mode: 'structure' });
+const usage = result.usage?.data_extraction_credits;
+console.log(`Cost: ${usage?.cost} extraction credits`);
+console.log(`Remaining: ${usage?.remainingCredits} extraction credits`);
+```
+
+The hand-composed types (`ExtractionCredits`, `ParseOptions`, `ParseInstructions`,
+`ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`,
+`ParseOutputOptions`) are exported from the package root. The spec primitives —
+`Mode`, `Element` and the six element subtypes, `Bounds`, `PageRef`, `Word`,
+`TableCell`, `KeyValuePair`, `KeyValueEntity`, `Metrics`, `Usage`,
+`Configuration`, `ParseErrorResponse`, etc. — live under the `extractComponents`
+namespace:
+
+```typescript
+import type { extractComponents } from '@nutrient-sdk/dws-client-typescript';
+
+type ParagraphElement = extractComponents['schemas']['ParagraphElement'];
+type TableElement = extractComponents['schemas']['TableElement'];
+```
+
+This mirrors how the Processor types are exposed via the existing `components`
+namespace.
+
 
 ## Workflow System
 

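The RAG recipe in the README hunk above stops at "chunk on headings". One minimal way to do that step is sketched below. It is not part of the SDK: `MarkdownChunk` and `chunkMarkdownByHeading` are hypothetical names, and the sketch assumes the Markdown string uses ATX headings (`#` to `######`), which the source does not confirm. Lines inside fenced code blocks are never treated as headings.

```typescript
// Hypothetical post-processing step for the Markdown returned by a parse call.
interface MarkdownChunk {
  heading: string; // empty string for content before the first heading
  body: string;
}

function chunkMarkdownByHeading(markdown: string): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  let heading = '';
  let lines: string[] = [];
  let inFence = false;

  // Emit the chunk collected so far, skipping completely empty ones.
  const flush = (): void => {
    const body = lines.join('\n').trim();
    if (heading !== '' || body !== '') chunks.push({ heading, body });
    lines = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    // Toggle fence state on ``` or ~~~ so '#' lines in code are kept as body text.
    if (/^\s*(`{3}|~{3})/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? '').trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each returned chunk can then be embedded on its own, with `heading` kept as metadata for retrieval. Splitting oversized sections further is left out of this sketch.
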
diff --git a/docs/METHODS.md b/docs/METHODS.md
@@ -455,6 +455,112 @@ if (kvps && kvps.length > 0) {
 }
 ```
 
+##### parse(input, options?)
+Calls the Data Extraction API (`POST /extraction/parse`) to extract structured
+content from a document. Designed for **RAG ingestion**, **search indexing**,
+**content migration**, and **form/invoice extraction** workflows where the goal
+is to feed document content into a downstream pipeline rather than render or
+transform the document itself.
+
+Accepts PDFs, Office documents (Word, Excel, PowerPoint), and images as input.
+
+Billed against **extraction credits** — a separate billing bucket from the
+processor API credits consumed by every other method on this client. See the
+[README's Data Extraction section](../README.md#data-extraction-extractionparse)
+for the full positioning, the per-mode comparison table, and worked recipes.
+
+Requires a Data Extraction API key — pass it as `extractApiKey` on the client
+constructor (see [Setup — separate Extract API key](../README.md#setup--separate-extract-api-key)).
+Falls back to `apiKey` if `extractApiKey` is omitted; the fallback only works on tenants whose single DWS key authorises both products.
+
+**Parameters**:
+- `input: FileInputWithUrl` — The document to parse. Accepts local files (paths,
+  Buffers, streams), a URL string, or a `{ type: 'url', url: '...' }` object.
+  The endpoint accepts PDFs, Office documents, and images.
+- `options?: ParseOptions` — Optional configuration:
+  - `mode` — `'text'` (1 cr/page, born-digital, Markdown only),
+    `'structure'` (1.5 cr/page, OCR + spatial layout),
+    `'understand'` (9 cr/page, AI-augmented, default),
+    or `'agentic'` (18 cr/page, VLM-augmented).
+  - `output.format` — `'spatial'` (typed elements with bounds
+    and confidence) or `'markdown'` (whole-document Markdown string).
+  - `output.includeWords` — include word-level OCR data inside elements.
+  - `language` — OCR language hint (`'eng'`, `'deu'`, `['eng', 'spa']`, etc.).
+  - `apiVersion` — optional `x-nutrient-api-version` header override.
+
+**Returns**: `ParseResponse` — full response envelope with `output`, `metrics`,
+`configuration`, and `usage.data_extraction_credits` (cost + remaining balance).
+
+```typescript
+// RAG ingestion — born-digital PDF → Markdown (1 extraction credit/page).
+const result = await client.parse('whitepaper.pdf', { mode: 'text' });
+if (result.output.markdown !== undefined) {
+  console.log(result.output.markdown);
+}
+
+// Form extraction — typed spatial elements with bounds and confidence.
+const invoice = await client.parse('invoice.pdf', { mode: 'understand' });
+if (invoice.output.elements !== undefined) {
+  for (const el of invoice.output.elements) {
+    if (el.type === 'keyValueRegion') {
+      for (const pair of el.pairs) {
+        console.log(pair.key?.value, '→', pair.value?.value);
+      }
+    }
+  }
+}
+
+// OCR-backed extraction with word-level data and multilingual hint.
+const scan = await client.parse('scan.pdf', {
+  mode: 'structure',
+  output: { format: 'spatial', includeWords: true },
+  language: ['eng', 'spa'],
+});
+
+// URL input — the server fetches the document, no client-side download.
+const remote = await client.parse('https://example.com/document.pdf');
+
+// Billing — extraction credits, not processor credits.
+const usage = remote.usage?.data_extraction_credits;
+console.log(`Cost: ${usage?.cost} extraction credits`);
+console.log(`Remaining: ${usage?.remainingCredits} extraction credits`);
+```
+
+##### parseToMarkdown(input, mode?)
+Convenience wrapper that calls `parse()` with `output.format = 'markdown'` and
+returns the Markdown string directly. Defaults to `mode='text'` (1 extraction
+credit/page) — the cheapest path for born-digital PDFs. Switch to
+`mode='structure'` for scanned documents or images so OCR runs.
+
+```typescript
+// Born-digital PDF → Markdown (cheapest).
+const markdown = await client.parseToMarkdown('document.pdf');
+
+// Scanned document or image → OCR-backed Markdown.
+const scanned = await client.parseToMarkdown('scan.pdf', 'structure');
+
+// AI-augmented Markdown for complex layouts.
+const rich = await client.parseToMarkdown('report.pdf', 'understand');
+```
+
+##### parseElements(input, mode?, includeWords?)
+Convenience wrapper that calls `parse()` with `output.format = 'spatial'` and
+returns the spatial elements array directly. Defaults to `mode='structure'`
+(1.5 extraction credits/page). Passing `mode='text'` is rejected at compile
+time — `text` mode only produces Markdown, not spatial elements.
+
+```typescript
+// OCR-backed spatial elements.
+const elements = await client.parseElements('document.pdf');
+
+// AI-augmented extraction with word-level OCR data.
+const withWords = await client.parseElements('invoice.pdf', 'understand', true);
+
+// Filter by element type.
+const tables = elements.filter(e => e.type === 'table');
+const kvRegions = elements.filter(e => e.type === 'keyValueRegion');
+```
+
 ##### flatten(file, annotationIds?)
 Flattens annotations in a PDF document.