26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -7,7 +7,31 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

_Nothing yet._
### Added

- First-class client support for the Data Extraction API (`POST /extraction/parse`).
  - `NutrientClient` accepts an `extractApiKey` option (string or async getter)
    that `parse()` uses in place of `apiKey`. Data Extraction is a separate
    product with its own credit pool, so a Processor-only key returns 403 against
    `/extraction/parse`. When `extractApiKey` is omitted, `parse()` falls back
    to `apiKey`, which works only on tenants with global DWS keys.
  - `NutrientClient.parse(input, options?)`: full request/response surface with
    typed support for all four modes (`text`, `structure`, `understand`, `agentic`)
    and both output formats (`spatial`, `markdown`).
  - `NutrientClient.parseToMarkdown(input, mode?)`: convenience wrapper returning
    the whole-document Markdown string directly.
  - `NutrientClient.parseElements(input, mode?, includeWords?)`: convenience
    wrapper returning the spatial elements array directly.
  - Public types: hand-composed `ParseOutputOptions`, `ParseInstructions`,
    `ParseOptions`, `ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`,
    and `ExtractionCredits`. The spec primitives (`Mode`, `Element` and the six
    subtypes, `Bounds`, `PageRef`, `Word`, `Metrics`, `Usage`, `Configuration`,
    `ParseErrorResponse`, etc.) are accessible via the `extractComponents`
    namespace re-export, the same pattern as `components` for the Processor spec.
  - Billing note: `/extraction/parse` debits the account's **extraction
    credits** bucket, which is separate from the **processor API credits** used
    by the rest of `NutrientClient`. The response surfaces this explicitly in
    `usage.data_extraction_credits`.

## [2.0.0] - 2026-01-27

58 changes: 58 additions & 0 deletions LLM_DOC.md
@@ -461,6 +461,64 @@ if (kvps && kvps.length > 0) {
}
```

#### parse(input, options?)
Extracts structured content from a document via the Data Extraction API (`POST /extraction/parse`).

Billed against **extraction credits**, a separate bucket from the processor API credits used by every other method. Cost per page by mode:
- `text` — 1 extraction credit (Markdown only)
- `structure` — 1.5 extraction credits (spatial elements)
- `understand` — 9 extraction credits (default)
- `agentic` — 18 extraction credits
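
As a rough budgeting aid, the per-page rates listed above can be turned into a small estimator. This is a sketch and not part of the client: `estimateExtractionCredits` and `CREDITS_PER_PAGE` are hypothetical names, and the code assumes billing is strictly linear in page count with no rounding or minimum charge, which this document does not confirm.

```typescript
// Per-page extraction-credit rates copied from the list above.
type ParseMode = 'text' | 'structure' | 'understand' | 'agentic';

const CREDITS_PER_PAGE: Record<ParseMode, number> = {
  text: 1,
  structure: 1.5,
  understand: 9,
  agentic: 18,
};

// Hypothetical helper (not a NutrientClient method).
// Assumption: cost = pages x rate, with no rounding or minimum charge.
function estimateExtractionCredits(
  pageCount: number,
  mode: ParseMode = 'understand', // 'understand' is the documented default mode
): number {
  if (!Number.isInteger(pageCount) || pageCount < 0) {
    throw new RangeError('pageCount must be a non-negative integer');
  }
  return pageCount * CREDITS_PER_PAGE[mode];
}
```

Under those assumptions, a 40-page scan in `structure` mode would be estimated at 60 extraction credits. The authoritative figure for any real request is the `usage.data_extraction_credits.cost` value in the response.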

Data Extraction is a separate product with its own API key. Pass it as `extractApiKey` on the client constructor:

```typescript
const client = new NutrientClient({
  apiKey: process.env.NUTRIENT_API_KEY!,
  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!,
});
```

Falls back to `apiKey` when `extractApiKey` is omitted (only works on tenants with global DWS keys).

```typescript
// Full call: spatial elements with bounding boxes, confidence, reading order
const result = await client.parse('invoice.pdf', {
  mode: 'understand',
  output: { format: 'spatial', includeWords: true },
  language: ['eng', 'spa'],
});

if (result.output.elements !== undefined) {
  for (const el of result.output.elements) {
    if (el.type === 'paragraph') console.log(el.text);
  }
}

// Extraction-credit accounting (separate from processor credits):
console.log(result.usage?.data_extraction_credits?.cost);

// URL input (server fetches the URL):
const remote = await client.parse('https://example.com/doc.pdf', { mode: 'text' });
```

#### parseToMarkdown(input, mode?)
Convenience wrapper that returns just the whole-document Markdown string. Defaults to `mode='text'` (cheapest, 1 extraction credit/page).

```typescript
const markdown = await client.parseToMarkdown('document.pdf');
const richer = await client.parseToMarkdown('scan.pdf', 'understand');
```

#### parseElements(input, mode?, includeWords?)
Convenience wrapper that returns just the array of spatial elements. Defaults to `mode='structure'`. `mode='text'` is not accepted because text mode produces Markdown only.

```typescript
const elements = await client.parseElements('document.pdf');
const tables = elements.filter(e => e.type === 'table');
const withWords = await client.parseElements('scan.pdf', 'understand', true);
```
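
When more than one element type is needed, a single grouping pass may be tidier than repeated `filter` calls. The sketch below is illustrative only: `groupByType` and `TypedElement` are hypothetical names, and the only assumption about an element is that it carries a string `type` field, as the filter example above suggests.

```typescript
// Minimal shape assumed for an element: only the `type` discriminator is used.
interface TypedElement {
  type: string;
}

// Hypothetical helper: buckets elements by their `type`, preserving input order
// within each bucket.
function groupByType<T extends TypedElement>(elements: readonly T[]): Map<string, T[]> {
  const groups = new Map<string, T[]>();
  for (const el of elements) {
    const bucket = groups.get(el.type);
    if (bucket) {
      bucket.push(el);
    } else {
      groups.set(el.type, [el]);
    }
  }
  return groups;
}
```

With the result of `parseElements()`, `groupByType(elements).get('table') ?? []` would then give the same tables as the `filter` call above.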

#### flatten(file, annotationIds?)
Flattens annotations in a PDF document.

171 changes: 171 additions & 0 deletions README.md
@@ -133,6 +133,177 @@ const mergedPdf = await client.merge(['doc1.pdf', 'doc2.pdf', 'doc3.pdf']);

For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md).

## Data Extraction (`/extraction/parse`)

`client.parse()` exposes Nutrient's Data Extraction API. It's designed for
**content-extraction workflows** where you need to feed document content into a
downstream pipeline rather than render or transform the document itself:

- **RAG / search indexing / content migration** — pull a clean Markdown
representation of a document for chunking, embedding, and indexing in a
vector store or search engine.
- **Form and invoice extraction** — pull structured fields (key/value pairs,
tables, semantic regions) out of business documents with bounding boxes and
confidence scores attached to every element.
- **Layout-aware document understanding** — get a typed, page-anchored element
list (paragraphs with semantic roles, tables with cell spans, formulas in
LaTeX, pictures, handwriting) suitable for building document-comprehension
tooling, including agentic workflows.

The endpoint accepts PDFs, Office documents (Word, Excel, PowerPoint), and
images. Unlike `sign()`, it is not restricted to PDFs.

### Choosing an output format

| Format | Best for | Shape |
| ------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------- |
| `markdown` | RAG, search indexing, content migration — anywhere structured text beats spatial data | `response.output.markdown` — a single Markdown string |
| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | `response.output.elements` — flat array of typed elements |

### Setup — separate Extract API key

Data Extraction is a separate product from the DWS Processor with its own
credit pool and its own API key. Pass both keys when constructing the client:

```typescript
const client = new NutrientClient({
  apiKey: process.env.NUTRIENT_API_KEY!, // Processor key
  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!, // Data Extraction key
});
```

`extractApiKey` is consulted only by `parse()`, `parseToMarkdown()`, and
`parseElements()`. Every other method on the client (`convert`, `sign`, `ocr`,
`merge`, …) keeps using `apiKey`. If you omit `extractApiKey`, the parse
methods fall back to `apiKey` — that fallback only works on tenants whose
single DWS key authorises both products.
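
The key-selection rule described above can be summarised in a few lines. This is a sketch of the documented behaviour, not the client's actual implementation: `selectKey`, `KeyConfig`, and `PARSE_METHODS` are hypothetical names, and for brevity it handles only string keys, not the async-getter form.

```typescript
// The three methods that, per the paragraph above, consult extractApiKey.
const PARSE_METHODS = new Set<string>(['parse', 'parseToMarkdown', 'parseElements']);

// Simplified config: string keys only (the real option may also be an async getter).
interface KeyConfig {
  apiKey: string;
  extractApiKey?: string;
}

// Hypothetical helper: returns the key a given method would authenticate with.
function selectKey(method: string, config: KeyConfig): string {
  if (PARSE_METHODS.has(method) && config.extractApiKey !== undefined) {
    return config.extractApiKey;
  }
  // Every non-parse method, and the parse methods when no extract key is set.
  return config.apiKey;
}
```

The fallback branch is the one that only succeeds on tenants whose single key covers both products; elsewhere the request would be rejected by the server rather than by this logic.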

### Quick start

```typescript
import { NutrientClient } from '@nutrient-sdk/dws-client-typescript';

const client = new NutrientClient({
  apiKey: process.env.NUTRIENT_API_KEY!,
  extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!,
});

// Spatial elements (default): paragraphs, tables, key-value regions, etc.
const result = await client.parse('contract.pdf', { mode: 'understand' });
if (result.output.elements !== undefined) {
  for (const el of result.output.elements) {
    if (el.type === 'table') console.log(`${el.rowCount}x${el.columnCount} table`);
  }
}

// Whole-document Markdown from a born-digital PDF.
const mdResult = await client.parse('report.pdf', { mode: 'text' });
if (mdResult.output.markdown !== undefined) {
  console.log(mdResult.output.markdown);
}
```

### Modes — when to use which

| Mode | Credits / page | When to use |
| ------------ | -------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `text` | 1 | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown. |
| `structure` | 1.5 | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input that requires OCR. |
| `understand` | 9 | Full pipeline with AI augmentation on top of OCR. Most accurate for tables, multi-column layouts, formulas, and forms. |
| `agentic` | 18 | Builds on `understand` and adds a vision-language model. Best for image descriptions and complex visual layouts. |
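
One way to read the table is as a decision ladder from cheapest to most expensive. The sketch below encodes that reading; it is a heuristic under stated assumptions, not an official selection rule. `suggestMode` and `DocumentTraits` are hypothetical names, and the traits are things the caller is assumed to know about the document in advance.

```typescript
type ParseMode = 'text' | 'structure' | 'understand' | 'agentic';

// Caller-supplied facts about the document (hypothetical shape).
interface DocumentTraits {
  bornDigital: boolean;            // has an embedded text layer, so no OCR is needed
  needsSpatialElements: boolean;   // bounding boxes or typed elements are required
  complexLayout: boolean;          // tables, multi-column pages, formulas, or forms
  needsImageDescriptions: boolean; // pictures must be described in the output
}

// Hypothetical helper: picks the cheapest mode that the table suggests is sufficient.
function suggestMode(traits: DocumentTraits): ParseMode {
  if (traits.needsImageDescriptions) return 'agentic';   // 18 credits/page
  if (traits.complexLayout) return 'understand';         // 9 credits/page
  if (!traits.bornDigital || traits.needsSpatialElements) {
    return 'structure';                                  // 1.5 credits/page
  }
  return 'text';                                         // 1 credit/page
}
```

Ordering the checks from most to least demanding keeps the function from under-provisioning: a scanned form with complex tables resolves to `understand` rather than `structure`.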

### Recipes

**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store:

```typescript
const result = await client.parse('whitepaper.pdf', { mode: 'text' });
const markdown = result.output.markdown!;
// Then: chunk on headings, embed, push to your vector store.
```

For born-digital PDFs, `mode: 'text'` is the cheapest path (1 credit/page).
For scanned PDFs or images, switch to `mode: 'structure'` so OCR runs.

Or use the convenience wrapper:

```typescript
const markdown = await client.parseToMarkdown('whitepaper.pdf');
```
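
The "chunk on headings" step in the RAG recipe is left to the caller. A minimal sketch of one way to do it follows; `chunkByHeading` and `MarkdownChunk` are hypothetical names, and the splitter makes simplifying assumptions: it recognises only ATX headings (`#` through `######`) at the start of a line and skips anything inside fenced code blocks.

```typescript
interface MarkdownChunk {
  heading: string; // heading text without the leading #'s; '' for text before the first heading
  body: string;    // everything up to the next heading, trimmed
}

// Hypothetical helper: splits a Markdown string into one chunk per heading.
function chunkByHeading(markdown: string): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  let heading = '';
  let lines: string[] = [];
  let inFence = false;

  // Emits the chunk collected so far, unless it is completely empty.
  const flush = (): void => {
    const body = lines.join('\n').trim();
    if (heading !== '' || body !== '') chunks.push({ heading, body });
    lines = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    // Toggle on fence lines so that '#' comments inside code are not treated as headings.
    if (/^\s{0,3}(`{3}|~{3})/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[1] ?? '').trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each chunk can then be embedded on its own, with `heading` stored as metadata alongside the vector. Long sections would still need a secondary size-based split, which this sketch omits.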

**Form/invoice extraction** — PDF → spatial elements → structured object:

```typescript
const result = await client.parse('invoice.pdf', { mode: 'understand' });
const elements = result.output.elements!;

// Pull key/value pairs from form regions.
const fields: Record<string, unknown> = {};
for (const el of elements) {
  if (el.type === 'keyValueRegion') {
    for (const pair of el.pairs) {
      if (pair.key && pair.value) {
        fields[String(pair.key.value)] = pair.value.value;
      }
    }
  }
}

// Walk tables: each cell carries row/col indices and span counts.
for (const el of elements) {
  if (el.type === 'table') {
    console.log(`Table: ${el.rowCount}×${el.columnCount}`);
    for (const cell of el.cells) {
      console.log(` [${cell.row}][${cell.column}] ${cell.text}`);
    }
  }
}
```

For complex documents that mix dense images with text, step up to
`mode: 'agentic'` so the VLM produces image descriptions and semantic
classifications (18 credits/page).

Or use the convenience wrapper to skip output-format discrimination entirely:

```typescript
const elements = await client.parseElements('invoice.pdf', 'understand');
```
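
For downstream use (CSV export, DataFrame-style processing), the flat cell list from a table element is often easier to handle as a two-dimensional grid. The sketch below shows one way to rebuild it. `tableToGrid` and `CellLike` are hypothetical names; the code assumes `row` and `column` are zero-based indices, which this document does not state, and it ignores row and column spans.

```typescript
// Only the cell fields used in the recipe above; span fields are deliberately ignored.
interface CellLike {
  row: number;
  column: number;
  text: string;
}

// Hypothetical helper: rebuilds a rowCount x columnCount grid of cell text.
// Assumption: row/column are zero-based. Cells outside the grid are skipped.
function tableToGrid(rowCount: number, columnCount: number, cells: readonly CellLike[]): string[][] {
  const grid: string[][] = Array.from({ length: rowCount }, () =>
    new Array<string>(columnCount).fill(''),
  );
  for (const cell of cells) {
    const row = grid[cell.row];
    if (row !== undefined && cell.column >= 0 && cell.column < columnCount) {
      row[cell.column] = cell.text;
    }
  }
  return grid;
}
```

Positions with no matching cell stay as empty strings, so merged cells appear once at their anchor position and blank elsewhere.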

### Billing — extraction credits vs processor credits

`/extraction/parse` is billed against **extraction credits**, a separate
billing bucket from the **processor API credits** consumed by `convert`,
`ocr`, `sign`, `merge`, and every other endpoint on this client. The two
buckets never debit each other.

Extraction-credit accounting is returned per request:

```typescript
const result = await client.parse('document.pdf', { mode: 'structure' });
const usage = result.usage?.data_extraction_credits;
console.log(`Cost: ${usage?.cost} extraction credits`);
console.log(`Remaining: ${usage?.remainingCredits} extraction credits`);
```
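
For batch jobs, the per-request figures can be accumulated to track spend across a run. This is a sketch under assumptions: `summarizeCredits`, `CreditsLike`, and `ParseResultLike` are hypothetical names that model only the `cost` and `remainingCredits` fields read in the example above, and every level is treated as optional to match the optional chaining used there.

```typescript
// Minimal shapes covering only the fields read in the billing example above.
interface CreditsLike {
  cost?: number;
  remainingCredits?: number;
}

interface ParseResultLike {
  usage?: { data_extraction_credits?: CreditsLike };
}

// Hypothetical helper: totals the cost of a batch and reports the most recent balance seen.
function summarizeCredits(results: readonly ParseResultLike[]): {
  totalCost: number;
  lastRemaining: number | undefined;
} {
  let totalCost = 0;
  let lastRemaining: number | undefined;
  for (const result of results) {
    const credits = result.usage?.data_extraction_credits;
    totalCost += credits?.cost ?? 0; // responses without usage data count as zero
    if (credits?.remainingCredits !== undefined) {
      lastRemaining = credits.remainingCredits;
    }
  }
  return { totalCost, lastRemaining };
}
```

`lastRemaining` reflects the latest response in array order, so results should be passed in the order the requests completed.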

The hand-composed types (`ExtractionCredits`, `ParseOptions`, `ParseInstructions`,
`ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`,
`ParseOutputOptions`) are exported from the package root. The spec primitives —
`Mode`, `Element` and the six element subtypes, `Bounds`, `PageRef`, `Word`,
`TableCell`, `KeyValuePair`, `KeyValueEntity`, `Metrics`, `Usage`,
`Configuration`, `ParseErrorResponse`, etc. — live under the `extractComponents`
namespace:

```typescript
import type { extractComponents } from '@nutrient-sdk/dws-client-typescript';

type ParagraphElement = extractComponents['schemas']['ParagraphElement'];
type TableElement = extractComponents['schemas']['TableElement'];
```

This mirrors how the Processor types are exposed via the existing `components`
namespace.


## Workflow System

106 changes: 106 additions & 0 deletions docs/METHODS.md
@@ -455,6 +455,112 @@ if (kvps && kvps.length > 0) {
}
```

##### parse(input, options?)
Calls the Data Extraction API (`POST /extraction/parse`) to extract structured
content from a document. Designed for **RAG ingestion**, **search indexing**,
**content migration**, and **form/invoice extraction** workflows where the goal
is to feed document content into a downstream pipeline rather than render or
transform the document itself.

Accepts PDFs, Office documents (Word, Excel, PowerPoint), and images as input.

Billed against **extraction credits** — a separate billing bucket from the
processor API credits consumed by every other method on this client. See the
[README's Data Extraction section](../README.md#data-extraction-extractionparse)
for the full positioning, the per-mode comparison table, and worked recipes.

Requires a Data Extraction API key — pass it as `extractApiKey` on the client
constructor (see [Setup — separate Extract API key](../README.md#setup--separate-extract-api-key)).
Falls back to `apiKey` if `extractApiKey` is omitted.

**Parameters**:
- `input: FileInputWithUrl`: The document to parse. Accepts local files (paths,
  Buffers, streams), a URL string, or a `{ type: 'url', url: '...' }` object.
  The endpoint accepts PDFs, Office documents, and images.
- `options?: ParseOptions`: Optional configuration:
  - `mode`: `'text'` (1 cr/page, born-digital, Markdown only),
    `'structure'` (1.5 cr/page, OCR + spatial layout),
    `'understand'` (9 cr/page, AI-augmented, default),
    or `'agentic'` (18 cr/page, VLM-augmented).
  - `output.format`: `'spatial'` (typed elements with bounds
    and confidence, default) or `'markdown'` (whole-document Markdown string).
  - `output.includeWords`: include word-level OCR data inside elements.
  - `language`: OCR language hint (`'eng'`, `'deu'`, `['eng', 'spa']`, etc.).
  - `apiVersion`: optional `x-nutrient-api-version` header override.
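
The `input` parameter mixes local and remote forms, so calling code sometimes needs to know which one it is holding (for logging or for allow-listing hosts, say). The sketch below shows one possible classification. `remoteUrlOf`, `UrlObject`, and `InputSketch` are hypothetical names, the input union is a simplified stand-in for `FileInputWithUrl`, and treating any string that starts with `http://` or `https://` as a URL is an assumption about how the client tells URLs from paths.

```typescript
// Simplified stand-in for FileInputWithUrl: a path or URL string, raw bytes, or a URL object.
type UrlObject = { type: 'url'; url: string };
type InputSketch = string | Uint8Array | UrlObject;

// Hypothetical helper: returns the remote URL if the input refers to one, else undefined.
function remoteUrlOf(input: InputSketch): string | undefined {
  if (typeof input === 'string') {
    // Assumption: strings with an http(s) scheme are URLs; anything else is a local path.
    return /^https?:\/\//i.test(input) ? input : undefined;
  }
  if (input instanceof Uint8Array) {
    return undefined; // raw bytes (Node Buffers included) are always local
  }
  return input.type === 'url' ? input.url : undefined;
}
```

A caller could use the result to check a host against an allow-list before handing the input to `parse()`, since the server, not the client, performs the fetch.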

**Returns**: `ParseResponse` — full response envelope with `output`, `metrics`,
`configuration`, and `usage.data_extraction_credits` (cost + remaining balance).

```typescript
// RAG ingestion — born-digital PDF → Markdown (1 extraction credit/page).
const result = await client.parse('whitepaper.pdf', { mode: 'text' });
if (result.output.markdown !== undefined) {
  console.log(result.output.markdown);
}

// Form extraction — typed spatial elements with bounds and confidence.
const invoice = await client.parse('invoice.pdf', { mode: 'understand' });
if (invoice.output.elements !== undefined) {
  for (const el of invoice.output.elements) {
    if (el.type === 'keyValueRegion') {
      for (const pair of el.pairs) {
        console.log(pair.key?.value, '→', pair.value?.value);
      }
    }
  }
}

// OCR-backed extraction with word-level data and multilingual hint.
const scan = await client.parse('scan.pdf', {
  mode: 'structure',
  output: { format: 'spatial', includeWords: true },
  language: ['eng', 'spa'],
});

// URL input — the server fetches the document, no client-side download.
const remote = await client.parse('https://example.com/document.pdf');

// Billing — extraction credits, not processor credits.
const usage = remote.usage?.data_extraction_credits;
console.log(`Cost: ${usage?.cost} extraction credits`);
console.log(`Remaining: ${usage?.remainingCredits} extraction credits`);
```

##### parseToMarkdown(input, mode?)
Convenience wrapper that calls `parse()` with `output.format = 'markdown'` and
returns the Markdown string directly. Defaults to `mode='text'` (1 extraction
credit/page) — the cheapest path for born-digital PDFs. Switch to
`mode='structure'` for scanned documents or images so OCR runs.

```typescript
// Born-digital PDF → Markdown (cheapest).
const markdown = await client.parseToMarkdown('document.pdf');

// Scanned document or image → OCR-backed Markdown.
const scanned = await client.parseToMarkdown('scan.pdf', 'structure');

// AI-augmented Markdown for complex layouts.
const rich = await client.parseToMarkdown('report.pdf', 'understand');
```

##### parseElements(input, mode?, includeWords?)
Convenience wrapper that calls `parse()` with `output.format = 'spatial'` and
returns the spatial elements array directly. Defaults to `mode='structure'`
(1.5 extraction credits/page). Passing `mode='text'` is rejected at compile
time — `text` mode only produces Markdown, not spatial elements.

```typescript
// OCR-backed spatial elements.
const elements = await client.parseElements('document.pdf');

// AI-augmented extraction with word-level OCR data.
const withWords = await client.parseElements('invoice.pdf', 'understand', true);

// Filter by element type.
const tables = elements.filter(e => e.type === 'table');
const kvRegions = elements.filter(e => e.type === 'keyValueRegion');
```
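
The compile-time rejection of `mode='text'` can be modelled with TypeScript's built-in `Exclude` utility type. The sketch below illustrates the technique; it is not taken from the client's source. `SpatialMode`, `spatialModeOrDefault`, and `isSpatialMode` are hypothetical names, and the runtime guard is an addition for values that arrive as plain strings (CLI flags, JSON config), where the compiler cannot help.

```typescript
type ParseMode = 'text' | 'structure' | 'understand' | 'agentic';

// Every mode except 'text', which produces Markdown only.
type SpatialMode = Exclude<ParseMode, 'text'>;

// Hypothetical stand-in for the mode parameter of parseElements():
// returns the mode that would be requested, defaulting to 'structure'.
function spatialModeOrDefault(mode: SpatialMode = 'structure'): SpatialMode {
  return mode;
}

// Runtime guard for untyped strings; narrows the value to SpatialMode on success.
function isSpatialMode(value: string): value is SpatialMode {
  return value === 'structure' || value === 'understand' || value === 'agentic';
}

// spatialModeOrDefault('text');
// ^ would not compile: '"text"' is not assignable to parameter of type 'SpatialMode'.
```

A caller reading the mode from configuration could check `isSpatialMode(raw)` first and fall back to `parseToMarkdown()` when the configured mode is `text`.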

##### flatten(file, annotationIds?)
Flattens annotations in a PDF document.
