diff --git a/CHANGELOG.md b/CHANGELOG.md index 39e5b9c..600ed04 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,7 +7,31 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] -_Nothing yet._ +### Added + +- First-class client support for the Data Extraction API (`POST /extraction/parse`). + - `NutrientClient` accepts an `extractApiKey` option (string or async getter) + that `parse()` uses in place of `apiKey`. Data Extraction is a separate + product with its own credit pool, so the Processor key returns 403 against + `/extraction/parse`. When `extractApiKey` is omitted, `parse()` falls back + to `apiKey`, which works on tenants with global DWS keys. + - `NutrientClient.parse(input, options?)` — full request/response surface with + typed support for all four modes (`text`, `structure`, `understand`, `agentic`) + and both output formats (`spatial`, `markdown`). + - `NutrientClient.parseToMarkdown(input, mode?)` — convenience wrapper returning + the whole-document Markdown string directly. + - `NutrientClient.parseElements(input, mode?, includeWords?)` — convenience + wrapper returning the spatial elements array directly. + - Public types: hand-composed `ParseOutputOptions`, `ParseInstructions`, + `ParseOptions`, `ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`, + and `ExtractionCredits`. The spec primitives (`Mode`, `Element` and the six + subtypes, `Bounds`, `PageRef`, `Word`, `Metrics`, `Usage`, `Configuration`, + `ParseErrorResponse`, etc.) are accessible via the `extractComponents` + namespace re-export — same pattern as `components` for the Processor spec. + - Billing note: `/extraction/parse` debits the account's **extraction + credits** bucket, which is separate from the **processor API credits** used + by the rest of `NutrientClient`. The response surfaces this explicitly in + `usage.data_extraction_credits`. 
## [2.0.0] - 2026-01-27 diff --git a/LLM_DOC.md b/LLM_DOC.md index 620364d..5369e0b 100644 --- a/LLM_DOC.md +++ b/LLM_DOC.md @@ -461,6 +461,64 @@ if (kvps && kvps.length > 0) { } ``` +#### parse(input, options?) +Extracts structured content from a document via the Data Extraction API (`POST /extraction/parse`). + +Billed against **extraction credits** (a separate bucket from processor API credits used by every other method). Mode costs per page: +- `text` — 1 extraction credit (Markdown only) +- `structure` — 1.5 extraction credits (spatial elements) +- `understand` — 9 extraction credits (default) +- `agentic` — 18 extraction credits + +Data Extraction is a separate product with its own API key. Pass it as `extractApiKey` on the client constructor: + +```typescript +const client = new NutrientClient({ + apiKey: process.env.NUTRIENT_API_KEY!, + extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!, +}); +``` + +Falls back to `apiKey` when `extractApiKey` is omitted (only works on tenants with global DWS keys). + +```typescript +// Full call: spatial elements with bounding boxes, confidence, reading order +const result = await client.parse('invoice.pdf', { + mode: 'understand', + output: { format: 'spatial', includeWords: true }, + language: ['eng', 'spa'], +}); + +if (result.output.elements !== undefined) { + for (const el of result.output.elements) { + if (el.type === 'paragraph') console.log(el.text); + } +} + +// Extraction-credit accounting (separate from processor credits): +console.log(result.usage?.data_extraction_credits?.cost); + +// URL input (server fetches the URL): +const remote = await client.parse('https://example.com/doc.pdf', { mode: 'text' }); +``` + +#### parseToMarkdown(input, mode?) +Convenience wrapper that returns just the whole-document Markdown string. Defaults to `mode='text'` (cheapest, 1 extraction credit/page). 
+ +```typescript +const markdown = await client.parseToMarkdown('document.pdf'); +const richer = await client.parseToMarkdown('scan.pdf', 'understand'); +``` + +#### parseElements(input, mode?, includeWords?) +Convenience wrapper that returns just the array of spatial elements. Defaults to `mode='structure'`. Cannot use `mode='text'`. + +```typescript +const elements = await client.parseElements('document.pdf'); +const tables = elements.filter(e => e.type === 'table'); +const withWords = await client.parseElements('scan.pdf', 'understand', true); +``` + #### flatten(file, annotationIds?) Flattens annotations in a PDF document. diff --git a/README.md b/README.md index a760c6d..bcdeea0 100644 --- a/README.md +++ b/README.md @@ -133,6 +133,177 @@ const mergedPdf = await client.merge(['doc1.pdf', 'doc2.pdf', 'doc3.pdf']); For a complete list of available methods with examples, see the [Methods Documentation](docs/METHODS.md). +## Data Extraction (`/extraction/parse`) + +`client.parse()` exposes Nutrient's Data Extraction API. It's designed for +**content-extraction workflows** where you need to feed document content into a +downstream pipeline rather than render or transform the document itself: + +- **RAG / search indexing / content migration** — pull a clean Markdown + representation of a document for chunking, embedding, and indexing in a + vector store or search engine. +- **Form and invoice extraction** — pull structured fields (key/value pairs, + tables, semantic regions) out of business documents with bounding boxes and + confidence scores attached to every element. +- **Layout-aware document understanding** — get a typed, page-anchored element + list (paragraphs with semantic roles, tables with cell spans, formulas in + LaTeX, pictures, handwriting) suitable for building document-comprehension + tooling, including agentic workflows. + +The endpoint accepts PDFs, Office documents (Word, Excel, PowerPoint), and +images. 
Unlike `sign()`, it is not restricted to PDFs. + +### Choosing an output format + +| Format | Best for | Shape | +| ------------------- | --------------------------------------------------------------------------- | --------------------------------------------------------------- | +| `markdown` | RAG, search indexing, content migration — anywhere structured text beats spatial data | `response.output.markdown` — a single Markdown string | +| `spatial` (default) | Form/invoice extraction, layout reconstruction, flows that need per-element confidence | `response.output.elements` — flat array of typed elements | + +### Setup — separate Extract API key + +Data Extraction is a separate product from the DWS Processor with its own +credit pool and its own API key. Pass both keys when constructing the client: + +```typescript +const client = new NutrientClient({ + apiKey: process.env.NUTRIENT_API_KEY!, // Processor key + extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!, // Data Extraction key +}); +``` + +`extractApiKey` is consulted only by `parse()`, `parseToMarkdown()`, and +`parseElements()`. Every other method on the client (`convert`, `sign`, `ocr`, +`merge`, …) keeps using `apiKey`. If you omit `extractApiKey`, the parse +methods fall back to `apiKey` — that fallback only works on tenants whose +single DWS key authorises both products. + +### Quick start + +```typescript +import { NutrientClient } from '@nutrient-sdk/dws-client-typescript'; + +const client = new NutrientClient({ + apiKey: process.env.NUTRIENT_API_KEY!, + extractApiKey: process.env.NUTRIENT_EXTRACT_API_KEY!, +}); + +// Spatial elements (default) — paragraphs, tables, key-value regions, etc. +const result = await client.parse('contract.pdf', { mode: 'understand' }); +if (result.output.elements !== undefined) { + for (const el of result.output.elements) { + if (el.type === 'table') console.log(`${el.rowCount}x${el.columnCount} table`); + } +} + +// Whole-document Markdown from a born-digital PDF. 
+const mdResult = await client.parse('report.pdf', { mode: 'text' }); +if (mdResult.output.markdown !== undefined) { + console.log(mdResult.output.markdown); +} +``` + +### Modes — when to use which + +| Mode | Credits / page | When to use | +| ------------ | -------------- | ------------------------------------------------------------------------------------------------------------------------- | +| `text` | 1 | Born-digital documents only. No OCR, no AI. Fastest and cheapest path to Markdown. | +| `structure` | 1.5 | OCR-based segmentation with bounding boxes. Handles scanned documents, images, and any input that requires OCR. | +| `understand` | 9 | Full pipeline with AI augmentation on top of OCR. Most accurate for tables, multi-column layouts, formulas, and forms. | +| `agentic` | 18 | Builds on `understand` and adds a vision-language model. Best for image descriptions and complex visual layouts. | + +### Recipes + +**RAG ingestion** — PDF → Markdown → chunks → embeddings → vector store: + +```typescript +const result = await client.parse('whitepaper.pdf', { mode: 'text' }); +const markdown = result.output.markdown!; +// Then: chunk on headings, embed, push to your vector store. +``` + +For born-digital PDFs, `mode: 'text'` is the cheapest path (1 credit/page). +For scanned PDFs or images, switch to `mode: 'structure'` so OCR runs. + +Or use the convenience wrapper: + +```typescript +const markdown = await client.parseToMarkdown('whitepaper.pdf'); +``` + +**Form/invoice extraction** — PDF → spatial elements → structured object: + +```typescript +const result = await client.parse('invoice.pdf', { mode: 'understand' }); +const elements = result.output.elements!; + +// Pull key/value pairs from form regions. 
+const fields: Record<string, unknown> = {}; +for (const el of elements) { + if (el.type === 'keyValueRegion') { + for (const pair of el.pairs) { + if (pair.key && pair.value) { + fields[String(pair.key.value)] = pair.value.value; + } + } + } +} + +// Walk tables — each cell carries row/col indices and span counts. +for (const el of elements) { + if (el.type === 'table') { + console.log(`Table: ${el.rowCount}×${el.columnCount}`); + for (const cell of el.cells) { + console.log(` [${cell.row}][${cell.column}] ${cell.text}`); + } + } +} +``` + +For complex documents that mix dense images with text, step up to +`mode: 'agentic'` so the VLM produces image descriptions and semantic +classifications (18 credits/page). + +Or use the convenience wrapper to skip output-format discrimination entirely: + +```typescript +const elements = await client.parseElements('invoice.pdf', 'understand'); +``` + +### Billing — extraction credits vs processor credits + +`/extraction/parse` is billed against **extraction credits**, a separate +billing bucket from the **processor API credits** consumed by `convert`, +`ocr`, `sign`, `merge`, and every other endpoint on this client. The two +buckets never debit each other. + +Extraction-credit accounting is returned per request: + +```typescript +const result = await client.parse('document.pdf', { mode: 'structure' }); +const usage = result.usage?.data_extraction_credits; +console.log(`Cost: ${usage?.cost} extraction credits`); +console.log(`Remaining: ${usage?.remainingCredits} extraction credits`); +``` + +The hand-composed types (`ExtractionCredits`, `ParseOptions`, `ParseInstructions`, +`ParseResponse`, `ParseResponseSpatial`, `ParseResponseMarkdown`, +`ParseOutputOptions`) are exported from the package root. The spec primitives — +`Mode`, `Element` and the six element subtypes, `Bounds`, `PageRef`, `Word`, +`TableCell`, `KeyValuePair`, `KeyValueEntity`, `Metrics`, `Usage`, +`Configuration`, `ParseErrorResponse`, etc. 
— live under the `extractComponents` +namespace: + +```typescript +import type { extractComponents } from '@nutrient-sdk/dws-client-typescript'; + +type ParagraphElement = extractComponents['schemas']['ParagraphElement']; +type TableElement = extractComponents['schemas']['TableElement']; +``` + +This mirrors how the Processor types are exposed via the existing `components` +namespace. + ## Workflow System diff --git a/docs/METHODS.md b/docs/METHODS.md index 7fbd578..8017b6a 100644 --- a/docs/METHODS.md +++ b/docs/METHODS.md @@ -455,6 +455,112 @@ if (kvps && kvps.length > 0) { } ``` +##### parse(input, options?) +Calls the Data Extraction API (`POST /extraction/parse`) to extract structured +content from a document. Designed for **RAG ingestion**, **search indexing**, +**content migration**, and **form/invoice extraction** workflows where the goal +is to feed document content into a downstream pipeline rather than render or +transform the document itself. + +Accepts PDFs, Office documents (Word, Excel, PowerPoint), and images as input. + +Billed against **extraction credits** — a separate billing bucket from the +processor API credits consumed by every other method on this client. See the +[README's Data Extraction section](../README.md#data-extraction-extractionparse) +for the full positioning, the per-mode comparison table, and worked recipes. + +Requires a Data Extraction API key — pass it as `extractApiKey` on the client +constructor (see [Setup — separate Extract API key](../README.md#setup--separate-extract-api-key)). +Falls back to `apiKey` if `extractApiKey` is omitted. + +**Parameters**: +- `input: FileInputWithUrl` — The document to parse. Accepts local files (paths, + Buffers, streams), a URL string, or a `{ type: 'url', url: '...' }` object. + The endpoint accepts PDFs, Office documents, and images. 
+- `options?: ParseOptions` — Optional configuration: + - `mode` — `'text'` (1 cr/page, born-digital, Markdown only), + `'structure'` (1.5 cr/page, OCR + spatial layout), + `'understand'` (9 cr/page, AI-augmented, default), + or `'agentic'` (18 cr/page, VLM-augmented). + - `output.format` — `'spatial'` (typed elements with bounds + and confidence) or `'markdown'` (whole-document Markdown string). + - `output.includeWords` — include word-level OCR data inside elements. + - `language` — OCR language hint (`'eng'`, `'deu'`, `['eng', 'spa']`, etc.). + - `apiVersion` — optional `x-nutrient-api-version` header override. + +**Returns**: `ParseResponse` — full response envelope with `output`, `metrics`, +`configuration`, and `usage.data_extraction_credits` (cost + remaining balance). + +```typescript +// RAG ingestion — born-digital PDF → Markdown (1 extraction credit/page). +const result = await client.parse('whitepaper.pdf', { mode: 'text' }); +if (result.output.markdown !== undefined) { + console.log(result.output.markdown); +} + +// Form extraction — typed spatial elements with bounds and confidence. +const invoice = await client.parse('invoice.pdf', { mode: 'understand' }); +if (invoice.output.elements !== undefined) { + for (const el of invoice.output.elements) { + if (el.type === 'keyValueRegion') { + for (const pair of el.pairs) { + console.log(pair.key?.value, '→', pair.value?.value); + } + } + } +} + +// OCR-backed extraction with word-level data and multilingual hint. +const scan = await client.parse('scan.pdf', { + mode: 'structure', + output: { format: 'spatial', includeWords: true }, + language: ['eng', 'spa'], +}); + +// URL input — the server fetches the document, no client-side download. +const remote = await client.parse('https://example.com/document.pdf'); + +// Billing — extraction credits, not processor credits. 
+const usage = remote.usage?.data_extraction_credits; +console.log(`Cost: ${usage?.cost} extraction credits`); +console.log(`Remaining: ${usage?.remainingCredits} extraction credits`); +``` + +##### parseToMarkdown(input, mode?) +Convenience wrapper that calls `parse()` with `output.format = 'markdown'` and +returns the Markdown string directly. Defaults to `mode='text'` (1 extraction +credit/page) — the cheapest path for born-digital PDFs. Switch to +`mode='structure'` for scanned documents or images so OCR runs. + +```typescript +// Born-digital PDF → Markdown (cheapest). +const markdown = await client.parseToMarkdown('document.pdf'); + +// Scanned document or image → OCR-backed Markdown. +const scanned = await client.parseToMarkdown('scan.pdf', 'structure'); + +// AI-augmented Markdown for complex layouts. +const rich = await client.parseToMarkdown('report.pdf', 'understand'); +``` + +##### parseElements(input, mode?, includeWords?) +Convenience wrapper that calls `parse()` with `output.format = 'spatial'` and +returns the spatial elements array directly. Defaults to `mode='structure'` +(1.5 extraction credits/page). Passing `mode='text'` is rejected at compile +time — `text` mode only produces Markdown, not spatial elements. + +```typescript +// OCR-backed spatial elements. +const elements = await client.parseElements('document.pdf'); + +// AI-augmented extraction with word-level OCR data. +const withWords = await client.parseElements('invoice.pdf', 'understand', true); + +// Filter by element type. +const tables = elements.filter(e => e.type === 'table'); +const kvRegions = elements.filter(e => e.type === 'keyValueRegion'); +``` + ##### flatten(file, annotationIds?) Flattens annotations in a PDF document. 
diff --git a/dws-data-extraction-spec.yml b/dws-data-extraction-spec.yml new file mode 100644 index 0000000..c67aa12 --- /dev/null +++ b/dws-data-extraction-spec.yml @@ -0,0 +1,1132 @@ +openapi: 3.1.0 +info: + version: '2026-05-25' + title: Nutrient Data Extraction API + description: |- + Nutrient Data Extraction API is an HTTP API that extracts structured content from documents. + Upload a PDF or image and receive extracted text elements, tables, formulas, key-value pairs, + and other structural elements with spatial data — or get a whole-document markdown representation. + + Four processing modes are available: choose `text` for plain text, `structure` for OCR-backed structure, + `understand` for deeper document analysis, or `agentic` for complex documents that need visual reasoning + and self-correction. + + # API Versioning + + Nutrient Data Extraction API is versioned using date-based versions in the form `YYYY-MM-DD` (for example, `2026-05-25`). + + Requests can override the API key's default version by sending the `x-nutrient-api-version` header. If the header is omitted, the request uses the latest version that was available when the API key was created. + + Supported API versions: + + | Version | Status | Notes | + | ------------ | ------- | ------------------------------------------------------------------ | + | `2026-05-25` | Current | Initial Data Extraction versioned contract. No older versions yet. | + contact: + name: Nutrient Data Extraction API + url: https://www.nutrient.io/api/ + license: + name: End User License Agreement + url: https://www.nutrient.io/api/terms/ +servers: + - url: https://api.nutrient.io + description: Base URL for Nutrient Data Extraction API endpoints. +security: + - BearerToken: [] +tags: + - name: Authorization + description: |- + Nutrient Data Extraction API uses an HTTP authorization header to map each request to the user making the request. 
You're required to provide your API token in the authorization header with each request. Otherwise, the API will return an error. + + The authorization header has the following shape: + + ``` + Authorization: Bearer pdf_live + ``` + + `pdf_live` is an API key that can be retrieved by logging in to the [Data Extraction API dashboard](https://www.nutrient.io/data-extraction/api_keys/). + + Because this API allows full access to credits you purchased for the Data Extraction API, it's only meant to be used by your backend services, which we assume are fully trusted. + - name: Data Extraction + description: Extract structured content from documents. + - name: File Type Support + description: |- + DWS Data Extraction API supports importing documents in different file formats: + + * PDFs + * Image documents + * Office files (Word, Excel, PowerPoint etc.) + + The following table shows the allowed file extensions and their MIME types: + + | File Extension | MIME Type | + | -------------- | ------------------------------------------------------------------------- | + | PDF | application/pdf | + | DOC | application/msword | + | DOCX | application/vnd.openxmlformats-officedocument.wordprocessingml.document | + | DOCM | application/vnd.ms-word.document.macroEnabled.12 | + | DOTX | application/vnd.openxmlformats-officedocument.wordprocessingml.template | + | DOTM | application/vnd.ms-word.template.macroEnabled.12 | + | XLS | application/vnd.ms-excel | + | XLSX | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | + | XLSM | application/vnd.ms-excel.sheet.macroEnabled.12 | + | XLSB | application/vnd.ms-excel.sheet.binary.macroEnabled.12 | + | XLTX | application/vnd.openxmlformats-officedocument.spreadsheetml.template | + | XLTM | application/vnd.ms-excel.template.macroEnabled.12 | + | XLAM | application/vnd.ms-excel.addin.macroEnabled.12 | + | PPT, PPS | application/vnd.ms-powerpoint | + | PPTX | 
application/vnd.openxmlformats-officedocument.presentationml.presentation | + | PPTM | application/vnd.ms-powerpoint.presentation.macroEnabled.12 | + | PPSX | application/vnd.openxmlformats-officedocument.presentationml.slideshow | + | PPSM | application/vnd.ms-powerpoint.slideshow.macroEnabled.12 | + | POTX | application/vnd.openxmlformats-officedocument.presentationml.template | + | POTM | application/vnd.ms-powerpoint.template.macroEnabled.12 | + | PPAM | application/vnd.ms-powerpoint.addin.macroEnabled.12 | + | RTF | application/rtf | + | ODT | application/vnd.oasis.opendocument.text | + | TXT | text/plain | + | BMP | image/bmp | + | JPG/JPEG | image/jpeg | + | PNG | image/png | + | TIFF | image/tiff | + | HEIC | image/heic | + | GIF | image/gif | + | WEBP | image/webp | + | SVG | image/svg+xml | + | TGA | image/x-tga | + | EPS | image/postscript | + - name: OCR Language Support + description: |- + The OCR action supports a wide range of languages for text extraction. You can specify languages using either: + - **Full language name** (lowercase, e.g., `english`, `german`) - available for commonly used languages + - **ISO 639-2 language code** (e.g., `eng`, `deu`) - available for all languages + - **ISO 639-2 language code with variant** (e.g., `chi_sim_vert` or `deu_frak`) + + | Description | Language code | Full language name | + | -------------------------------- | -------------- | ------------------ | + | Afrikaans | `afr` | | + | Albanian | `sqi` | | + | Amharic | `amh` | | + | Arabic | `ara` | | + | Armenian | `hye` | | + | Assamese | `asm` | | + | Azerbaijani | `aze` | | + | Azerbaijani - Cyrillic | `aze_cyrl` | | + | Basque | `eus` | | + | Belarusian | `bel` | | + | Bengali | `ben` | | + | Bosnian | `bos` | | + | Breton | `bre` | | + | Bulgarian | `bul` | | + | Burmese | `mya` | | + | Catalan; Valencian | `cat` | | + | Cebuano | `ceb` | | + | Central Khmer | `khm` | | + | Cherokee | `chr` | | + | Chinese - Simplified | `chi_sim` | | + | Chinese - 
Simplified (Vertical) | `chi_sim_vert` | | + | Chinese - Traditional | `chi_tra` | | + | Chinese - Traditional (Vertical) | `chi_tra_vert` | | + | Corsican | `cos` | | + | Croatian | `hrv` | `croatian` | + | Czech | `ces` | `czech` | + | Danish | `dan` | `danish` | + | Danish - Fraktur | `dan_frak` | | + | Dhivehi; Maldivian | `div` | | + | Dutch; Flemish | `nld` | `dutch` | + | Dzongkha | `dzo` | | + | English | `eng` | `english` | + | English, Middle (1100-1500) | `enm` | | + | Esperanto | `epo` | | + | Estonian | `est` | | + | Faroese | `fao` | | + | Filipino | `fil` | | + | Finnish | `fin` | `finnish` | + | French | `fra` | `french` | + | French, Middle (ca. 1400-1600) | `frm` | | + | Galician | `glg` | | + | Georgian | `kat` | | + | Georgian - Old | `kat_old` | | + | German | `deu` | `german` | + | German - Fraktur | `deu_frak` | | + | German Fraktur | `frk` | | + | Greek, Ancient | `grc` | | + | Greek, Modern | `ell` | | + | Gujarati | `guj` | | + | Haitian; Haitian Creole | `hat` | | + | Hebrew | `heb` | | + | Hindi | `hin` | | + | Hungarian | `hun` | | + | Icelandic | `isl` | | + | Indonesian | `ind` | `indonesian` | + | Inuktitut | `iku` | | + | Irish | `gle` | | + | Italian | `ita` | `italian` | + | Italian - Old | `ita_old` | | + | Japanese | `jpn` | | + | Japanese (Vertical) | `jpn_vert` | | + | Javanese | `jav` | | + | Kannada | `kan` | | + | Kazakh | `kaz` | | + | Kirghiz; Kyrgyz | `kir` | | + | Korean | `kor` | | + | Korean (Vertical) | `kor_vert` | | + | Kurdish | `kur` | | + | Kurmanji (Kurdish) | `kmr` | | + | Lao | `lao` | | + | Latin | `lat` | | + | Latvian | `lav` | | + | Lithuanian | `lit` | | + | Luxembourgish | `ltz` | | + | Macedonian | `mkd` | | + | Malay | `msa` | `malay` | + | Malayalam | `mal` | | + | Maltese | `mlt` | | + | Maori | `mri` | | + | Marathi | `mar` | | + | Math/Equation detection | `equ` | | + | Mongolian | `mon` | | + | Nepali | `nep` | | + | Norwegian | `nor` | `norwegian` | + | Occitan | `oci` | | + | Oriya | `ori` | | 
+ | Panjabi; Punjabi | `pan` | | + | Persian | `fas` | | + | Polish | `pol` | `polish` | + | Portuguese | `por` | `portuguese` | + | Pushto; Pashto | `pus` | | + | Quechua | `que` | | + | Romanian; Moldavian | `ron` | | + | Russian | `rus` | | + | Sanskrit | `san` | | + | Scottish Gaelic | `gla` | | + | Serbian | `srp` | `serbian` | + | Serbian - Latin | `srp_latn` | | + | Sindhi | `snd` | | + | Sinhala; Sinhalese | `sin` | | + | Slovak | `slk` | `slovak` | + | Slovak - Fraktur | `slk_frak` | | + | Slovenian | `slv` | `slovenian` | + | Spanish; Castilian | `spa` | `spanish` | + | Spanish - Old | `spa_old` | | + | Sundanese | `sun` | | + | Swahili | `swa` | | + | Swedish | `swe` | `swedish` | + | Syriac | `syr` | | + | Tagalog | `tgl` | | + | Tajik | `tgk` | | + | Tamil | `tam` | | + | Tatar | `tat` | | + | Telugu | `tel` | | + | Thai | `tha` | | + | Tibetan | `bod` | | + | Tigrinya | `tir` | | + | Tonga | `ton` | | + | Turkish | `tur` | `turkish` | + | Uighur; Uyghur | `uig` | | + | Ukrainian | `ukr` | | + | Urdu | `urd` | | + | Uzbek | `uzb` | | + | Uzbek - Cyrillic | `uzb_cyrl` | | + | Vietnamese | `vie` | | + | Welsh | `cym` | | + | Western Frisian | `fry` | | + | Yiddish | `yid` | | + | Yoruba | `yor` | | +externalDocs: + description: Nutrient Data Extraction API guides + url: https://www.nutrient.io/guides/data-extraction/ +paths: + /extraction/parse: + post: + summary: Extract data from a document + operationId: extraction-parse + parameters: + - $ref: '#/components/parameters/NutrientApiVersion' + description: |- + Extract structured content from a document. Returns either typed document elements + with spatial data or a whole-document markdown representation. + + Four processing modes are available: + - **text** — Plain text extraction powered by Document Engine. Only supports `markdown` output. + - **structure** — OCR-backed structured extraction with spatial element output. 
+ - **understand** — Deeper document analysis with structured extraction and semantic enrichment. + - **agentic** — AI-powered analysis for complex documents that need visual reasoning and self-correction. + + You can provide the input document in three ways: + - **Multipart form upload** via `multipart/form-data` with a `file` field and optional JSON `instructions`. + - **URL-based input** via `application/json` with a `url` field pointing to a remote document. + - **Raw binary upload** via `application/pdf` (or other supported content types). + requestBody: + content: + multipart/form-data: + encoding: + file: + contentType: application/pdf, application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-powerpoint, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/rtf, image/png, image/jpeg, image/tiff, image/bmp, image/gif, image/webp + instructions: + contentType: application/json + schema: + type: object + properties: + file: + type: string + format: binary + description: The document to parse. + example: + instructions: + description: |- + JSON-serialized processing instructions. Omit to use all defaults + (`mode: "understand"` with spatial element output). + type: object + properties: + url: + type: string + format: uri + description: |- + URL of a remote document to parse. Use this instead of the `file` field + to process a document hosted at a public URL. + example: https://storage.example.com/invoice.pdf + mode: + $ref: '#/components/schemas/Mode' + output: + $ref: '#/components/schemas/OutputOptions' + options: + $ref: '#/components/schemas/ProcessingOptions' + application/json: + schema: + type: object + required: + - url + properties: + url: + type: string + format: uri + description: URL of a remote document to parse. 
+ example: https://storage.example.com/invoice.pdf + mode: + $ref: '#/components/schemas/Mode' + output: + $ref: '#/components/schemas/OutputOptions' + options: + $ref: '#/components/schemas/ProcessingOptions' + example: + url: https://storage.example.com/invoice.pdf + mode: understand + output: + format: spatial + application/pdf: + schema: + type: string + format: binary + description: Raw PDF document for direct upload. + image/png: + schema: + type: string + format: binary + description: Raw PNG image for direct upload. + image/jpeg: + schema: + type: string + format: binary + description: Raw JPEG image for direct upload. + image/tiff: + schema: + type: string + format: binary + description: Raw TIFF image for direct upload. + responses: + '200': + description: Extraction completed successfully. + content: + application/json: + schema: + $ref: '#/components/schemas/ParseResponse' + examples: + spatialElements: + summary: Spatial elements output (structure mode) + value: + status: 200 + requestId: req_e5f6g7h8 + output: + elements: + - id: a1b2c3d4-1111-4000-8000-000000000001 + type: paragraph + role: Title + text: Quarterly Report + confidence: 0.95 + readingOrder: 0 + bounds: + x: 200 + 'y': 139 + width: 1111 + height: 97 + page: + pageIndex: 0 + pageNumber: 1 + width: 1700 + height: 2200 + - id: a1b2c3d4-2222-4000-8000-000000000002 + type: paragraph + role: Text + text: Revenue grew 15% year-over-year. + confidence: 0.97 + readingOrder: 1 + bounds: + x: 200 + 'y': 278 + width: 1300 + height: 69 + page: + pageIndex: 0 + pageNumber: 1 + width: 1700 + height: 2200 + metrics: + processingTimeMs: 4200 + pagesProcessed: 1 + usage: + data_extraction_credits: + cost: 1.5 + remainingCredits: 850 + configuration: + mode: structure + outputFormat: spatial + markdownOutput: + summary: Markdown output (text mode) + value: + status: 200 + requestId: req_a1b2c3d4 + output: + markdown: |- + # Document Title + + First paragraph of text... 
+ + ## Section Two + + More content here... + metrics: + processingTimeMs: 312 + pagesProcessed: 1 + usage: + data_extraction_credits: + cost: 1 + remainingCredits: 850 + configuration: + mode: text + outputFormat: markdown + '400': + description: The request is malformed. Invalid parameters, unsupported file format, or missing required fields. + content: + application/json: + schema: + $ref: '#/components/schemas/ParseErrorResponse' + example: + status: 400 + requestId: req_err_001 + errorMessage: The request is malformed + errorDetails: + source: request + code: invalid_request + failingPaths: + - path: $.mode + details: 'invalid mode: ''vlm''. Expected: text, structure, understand, agentic' + '401': + description: You are unauthorized. Sent when no API token is specified, or when the API token you specified isn't valid. + '402': + description: Insufficient credits for this request. + content: + application/json: + schema: + $ref: '#/components/schemas/ParseErrorResponse' + example: + status: 402 + requestId: req_err_002 + errorMessage: Insufficient credits. This request requires 2 credits, 0 remaining. + '413': + description: The uploaded file exceeds the maximum allowed size for your plan. + '429': + description: Too many requests. You have exceeded the rate limit for your subscription. + '500': + description: An internal processing error occurred. Please retry or contact support with the `requestId`. + content: + application/json: + schema: + $ref: '#/components/schemas/ParseErrorResponse' + example: + status: 500 + requestId: req_err_003 + errorMessage: Processing failed. Please retry or contact support with the requestId. + errorDetails: + source: maestro + code: maestro_error + '503': + description: The processing backend is temporarily unavailable. Please retry later. 
+ tags: + - Data Extraction +components: + parameters: + NutrientApiVersion: + name: x-nutrient-api-version + in: header + required: false + description: |- + Optional API version override for this request. + + If omitted, the request uses the latest version that was available when the API key was created. + + See the [API Versioning](#description/api-versioning) section for the list of supported versions. + schema: + type: string + enum: + - '2026-05-25' + example: '2026-05-25' + securitySchemes: + BearerToken: + type: http + scheme: bearer + schemas: + Mode: + type: string + description: |- + Processing pipeline. + - `text` — Plain text extraction powered by Document Engine. Only supports `markdown` output format. + - `structure` — OCR-backed structured extraction with spatial element output. + - `understand` — Deeper document analysis with structured extraction and semantic enrichment. + - `agentic` — AI-powered analysis for complex documents that need visual reasoning and self-correction. + enum: + - text + - structure + - understand + - agentic + default: understand + example: understand + OutputOptions: + type: object + description: |- + Output configuration. When provided, `format` is required. + Default format depends on the mode: `text` defaults to `markdown`; + `structure`, `understand`, and `agentic` default to `spatial`. + required: + - format + properties: + format: + type: string + description: |- + The output format. + - `spatial` — Flat typed elements with bounding boxes, confidence scores, reading order, and page references. + Not available with `text` mode. + - `markdown` — Whole-document markdown representation. + enum: + - spatial + - markdown + example: spatial + includeWords: + type: boolean + description: |- + Include word-level OCR data nested inside paragraph and table cell elements. + Only applicable when `format` is `spatial`. 
+ default: false + example: false + ProcessingOptions: + type: object + description: Additional processing options. + properties: + language: + description: |- + OCR language hint. Only supported for `structure`, `understand`, and `agentic` modes. + Accepts lowercase language names (`"english"`, `"german"`) + or ISO 639-2 language codes (`"eng"`, `"deu"`). Multilingual OCR can be expressed + as an array (`["eng", "spa"]`) or a `+`-joined string (`"eng+spa"`). + default: eng + oneOf: + - type: string + example: english + - type: array + items: + type: string + example: + - eng + - spa + ParseResponse: + type: object + required: + - status + - requestId + - output + - metrics + - configuration + properties: + status: + type: integer + description: HTTP status code. + enum: + - 200 + example: 200 + requestId: + type: string + description: Unique request identifier for debugging and support. + example: req_e5f6g7h8 + output: + $ref: '#/components/schemas/ParseOutput' + metrics: + $ref: '#/components/schemas/Metrics' + usage: + $ref: '#/components/schemas/Usage' + configuration: + $ref: '#/components/schemas/Configuration' + ParseOutput: + type: object + description: |- + Extracted content. Contains either `elements` (for spatial format) or `markdown` + (for markdown format), never both. + properties: + elements: + type: array + description: |- + Flat list of document elements across all pages, ordered by reading order. + Present when `output.format` is `spatial`. + items: + $ref: '#/components/schemas/Element' + markdown: + type: string + description: |- + Whole-document markdown content. + Present when `output.format` is `markdown`. + example: |- + # Document Title + + First paragraph of text... + Metrics: + type: object + required: + - processingTimeMs + - pagesProcessed + properties: + processingTimeMs: + type: number + description: Total processing time in milliseconds. + example: 4200 + pagesProcessed: + type: integer + description: Number of pages processed. 
+ example: 1 + Usage: + type: object + properties: + data_extraction_credits: + type: object + required: + - cost + - remainingCredits + properties: + cost: + type: number + description: Credits consumed by this request. + example: 2 + remainingCredits: + type: number + description: Remaining credits in the account. + example: 850 + Configuration: + type: object + required: + - mode + - outputFormat + properties: + mode: + $ref: '#/components/schemas/Mode' + outputFormat: + type: string + description: The output format that was used for this request. + enum: + - spatial + - markdown + example: spatial + ParseErrorResponse: + type: object + required: + - status + - requestId + - errorMessage + properties: + status: + type: integer + description: HTTP status code. + example: 400 + requestId: + type: string + description: Unique request identifier for debugging and support. + example: req_err_001 + errorMessage: + type: string + description: Human-readable error summary. + example: The request is malformed + errorDetails: + type: object + description: Structured error details. Present on validation and processing errors. + properties: + source: + type: string + description: |- + Error origin. + - `request` — Validation errors (invalid parameters, unsupported format). + - `processing` — Backend processing failures. + - `maestro` — Maestro engine failures. + example: request + code: + type: string + description: Machine-readable error code stable enough for client branching. + example: invalid_request + failingPaths: + type: array + description: List of invalid fields. Present on validation errors. + items: + type: object + properties: + path: + type: string + description: JSON path to the invalid field. + example: $.mode + details: + type: string + description: Human-readable validation error. + example: 'invalid mode: ''vlm''. 
Expected: text, structure, understand, agentic' + Element: + oneOf: + - $ref: '#/components/schemas/ParagraphElement' + - $ref: '#/components/schemas/FormulaElement' + - $ref: '#/components/schemas/PictureElement' + - $ref: '#/components/schemas/TableElement' + - $ref: '#/components/schemas/KeyValueRegionElement' + - $ref: '#/components/schemas/HandwritingElement' + discriminator: + propertyName: type + mapping: + paragraph: '#/components/schemas/ParagraphElement' + formula: '#/components/schemas/FormulaElement' + picture: '#/components/schemas/PictureElement' + table: '#/components/schemas/TableElement' + keyValueRegion: '#/components/schemas/KeyValueRegionElement' + handwriting: '#/components/schemas/HandwritingElement' + ElementBase: + type: object + required: + - id + - type + - bounds + - confidence + - readingOrder + - page + properties: + id: + type: string + format: uuid + description: Unique element identifier. + example: a1b2c3d4-1111-4000-8000-000000000001 + bounds: + $ref: '#/components/schemas/Bounds' + confidence: + type: number + minimum: 0 + maximum: 1 + description: Detection confidence score. + example: 0.95 + readingOrder: + type: integer + minimum: 0 + description: Reading order index within the page. + example: 0 + page: + $ref: '#/components/schemas/PageRef' + ParagraphElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - text + properties: + type: + type: string + enum: + - paragraph + role: + description: Semantic role of the paragraph. Null when the role is undetermined. + oneOf: + - type: string + enum: + - Text + - Title + - SectionHeader + - Header + - Footer + - Caption + - Footnote + - ListItem + - PageNumber + - Code + - CheckboxSelected + - CheckboxUnselected + - type: 'null' + example: Text + text: + type: string + description: Extracted text content. + example: Revenue grew 15% year-over-year. + words: + description: Word-level OCR data. Present when `includeWords` is `true`. 
+ oneOf: + - type: array + items: + $ref: '#/components/schemas/Word' + - type: 'null' + FormulaElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - latex + properties: + type: + type: string + enum: + - formula + latex: + type: string + description: LaTeX representation of the formula. + example: r = r_0 e^{kt} + PictureElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - classification + - classificationConfidence + - altDescription + properties: + type: + type: string + enum: + - picture + classification: + type: string + description: Image classification category (chart, photo, diagram, etc.). + example: chart + classificationConfidence: + type: number + minimum: 0 + maximum: 1 + description: Confidence score for the classification. + example: 0.91 + altDescription: + type: string + description: AI-generated alternative text description. + example: Bar chart showing quarterly revenue growth across regions + captionIds: + description: IDs of associated caption paragraph elements. + oneOf: + - type: array + items: + type: string + format: uuid + - type: 'null' + footnoteIds: + description: IDs of associated footnote paragraph elements. + oneOf: + - type: array + items: + type: string + format: uuid + - type: 'null' + TableElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - rowCount + - columnCount + - cells + properties: + type: + type: string + enum: + - table + rowCount: + type: integer + minimum: 0 + description: Number of rows in the table. + example: 3 + columnCount: + type: integer + minimum: 0 + description: Number of columns in the table. + example: 3 + cells: + type: array + description: Cell-level data. + items: + $ref: '#/components/schemas/TableCell' + captionIds: + description: IDs of associated caption paragraph elements. 
+ oneOf: + - type: array + items: + type: string + format: uuid + - type: 'null' + footnoteIds: + description: IDs of associated footnote paragraph elements. + oneOf: + - type: array + items: + type: string + format: uuid + - type: 'null' + KeyValueRegionElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - pairs + properties: + type: + type: string + enum: + - keyValueRegion + pairs: + type: array + description: Detected key-value pairs. + items: + $ref: '#/components/schemas/KeyValuePair' + HandwritingElement: + allOf: + - $ref: '#/components/schemas/ElementBase' + - type: object + required: + - type + - text + properties: + type: + type: string + enum: + - handwriting + text: + type: string + description: Extracted handwritten text content. + example: John Doe + words: + description: Word-level OCR data. Present when `includeWords` is `true`. + oneOf: + - type: array + items: + $ref: '#/components/schemas/Word' + - type: 'null' + Bounds: + type: object + required: + - x + - 'y' + - width + - height + description: |- + Bounding box of an element on the page. `(x, y)` is the top-left corner of the box. + Origin is the top-left of the page, with x increasing right and y increasing down. + + Coordinates are always expressed in render-space pixels. + `page.width` and `page.height` describe the same pixel canvas as every element, + word, and table cell bound on that page. + properties: + x: + type: number + description: X coordinate of the top-left corner (distance from page left edge). + example: 100 + 'y': + type: number + description: Y coordinate of the top-left corner (distance from page top edge). + example: 50 + width: + type: number + description: Width of the bounding box. + example: 400 + height: + type: number + description: Height of the bounding box. + example: 35 + Word: + type: object + required: + - text + - bounds + - confidence + description: Word-level OCR result. 
+ properties: + text: + type: string + description: The word text. + example: Revenue + bounds: + $ref: '#/components/schemas/Bounds' + confidence: + type: number + minimum: 0 + maximum: 1 + description: OCR confidence score. + example: 0.95 + PageRef: + type: object + required: + - pageIndex + - pageNumber + - width + - height + description: |- + Source page reference. Provides the page index and the full page dimensions, + which define the coordinate space that all element bounds on this page are relative to. + properties: + pageIndex: + type: integer + minimum: 0 + description: 0-based page index. + example: 0 + pageNumber: + type: integer + minimum: 1 + description: 1-based page number. + example: 1 + width: + type: number + description: Page width in render-space pixels. + example: 1700 + height: + type: number + description: Page height in render-space pixels. + example: 2200 + TableCell: + type: object + required: + - id + - bounds + - confidence + - row + - column + - rowSpan + - colSpan + - text + properties: + id: + type: string + description: Unique cell identifier. + example: c-001 + bounds: + $ref: '#/components/schemas/Bounds' + confidence: + type: number + minimum: 0 + maximum: 1 + description: Detection confidence score. + example: 0.94 + row: + type: integer + minimum: 0 + description: 0-indexed row. + example: 0 + column: + type: integer + minimum: 0 + description: 0-indexed column. + example: 0 + rowSpan: + type: integer + minimum: 1 + description: Number of rows this cell spans. + default: 1 + example: 1 + colSpan: + type: integer + minimum: 1 + description: Number of columns this cell spans. + default: 1 + example: 1 + text: + type: string + description: Extracted text content. + example: Region + words: + description: Word-level OCR data. Present when `includeWords` is `true`. 
+ oneOf: + - type: array + items: + $ref: '#/components/schemas/Word' + - type: 'null' + KeyValuePair: + type: object + required: + - id + properties: + id: + type: string + description: Unique identifier for the pair. + example: kvp-001 + key: + description: The key/question entity. Null when only a value was detected. + oneOf: + - $ref: '#/components/schemas/KeyValueEntity' + - type: 'null' + value: + description: The value/answer entity. Null when only a key was detected. + oneOf: + - $ref: '#/components/schemas/KeyValueEntity' + - type: 'null' + relationshipConfidence: + description: Confidence for the key-value relationship. + oneOf: + - type: number + minimum: 0 + maximum: 1 + - type: 'null' + example: 0.93 + KeyValueEntity: + type: object + required: + - id + - bounds + - confidence + - entityType + - value + properties: + id: + type: string + description: Unique entity identifier. + example: kve-001 + bounds: + $ref: '#/components/schemas/Bounds' + confidence: + type: number + minimum: 0 + maximum: 1 + description: Detection confidence score. + example: 0.92 + entityType: + type: string + description: Entity type. + enum: + - QUESTION + - ANSWER + - '' + example: QUESTION + value: + description: Extracted value. 
+ example: Invoice Number +x-tagGroups: + - name: Endpoints + tags: + - Data Extraction + - name: Reference + tags: + - File Type Support + - OCR Language Support diff --git a/package.json b/package.json index 69cdbed..e6b9d48 100644 --- a/package.json +++ b/package.json @@ -85,7 +85,8 @@ "format:check": "prettier --check \"src/**/*.ts\"", "typecheck": "tsc --noEmit", "prepublishOnly": "npm run build && npm run test", - "generate:types": "openapi-typescript dws-api-spec.yml -o src/generated/api-types.ts" + "generate:types": "openapi-typescript dws-api-spec.yml -o src/generated/api-types.ts", + "generate:types:extract": "openapi-typescript dws-data-extraction-spec.yml -o src/generated/extract-types.ts" }, "dependencies": { "axios": "^1.13.2", diff --git a/src/__tests__/unit/client.test.ts b/src/__tests__/unit/client.test.ts index 7773062..bc2eff1 100644 --- a/src/__tests__/unit/client.test.ts +++ b/src/__tests__/unit/client.test.ts @@ -232,6 +232,47 @@ describe('NutrientClient', () => { ).toThrow('Base URL must be a string'); }); + it('should accept a string extractApiKey', () => { + const client = new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: 'extract-key', + }); + expect(client).toBeDefined(); + }); + + it('should accept an async extractApiKey getter', () => { + const client = new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: (): Promise<string> => Promise.resolve('extract-key'), + }); + expect(client).toBeDefined(); + }); + + it('should throw ValidationError for invalid extractApiKey type', () => { + expect( + () => + new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: 123 as unknown as string, + }), + ).toThrow(ValidationError); + expect( + () => + new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: 123 as unknown as string, + }), + ).toThrow('Extract API key must be a string or a function that returns a Promise'); + }); + + it('should throw ValidationError for empty-string extractApiKey', () => { + expect(
() => new NutrientClient({ apiKey: 'processor-key', extractApiKey: '' }), + ).toThrow(ValidationError); + expect( + () => new NutrientClient({ apiKey: 'processor-key', extractApiKey: '' }), + ).toThrow('Extract API key must not be an empty string'); + }); }); describe('workflow()', () => { diff --git a/src/__tests__/unit/http.test.ts b/src/__tests__/unit/http.test.ts index d60bc2f..a13619a 100644 --- a/src/__tests__/unit/http.test.ts +++ b/src/__tests__/unit/http.test.ts @@ -318,6 +318,35 @@ describe('HTTP Layer', () => { }); }); + it('should surface the camelCase errorMessage field from Extract API errors', async () => { + // DWS Extract returns `errorMessage` (camelCase) on every 4xx/5xx, not `message`. + const mockResponse = { + data: { + status: 400, + requestId: 'req_err_001', + errorMessage: "invalid mode: 'vlm'. Expected: text, structure, understand, agentic", + errorDetails: { source: 'request', code: 'invalid_request' }, + }, + status: 400, + statusText: 'Bad Request', + headers: {}, + }; + + mockedAxios.mockResolvedValueOnce(mockResponse); + + const config: RequestConfig<'POST', '/extraction/parse'> = { + endpoint: '/extraction/parse', + method: 'POST', + data: { instructions: {} }, + }; + + await expect(sendRequest(config, mockClientOptions, 'json')).rejects.toMatchObject({ + name: 'ValidationError', + message: "invalid mode: 'vlm'. 
Expected: text, structure, understand, agentic", + statusCode: 400, + }); + }); + it('should handle network errors', async () => { const networkError = { isAxiosError: true, diff --git a/src/__tests__/unit/parse.test.ts b/src/__tests__/unit/parse.test.ts new file mode 100644 index 0000000..76f5bb6 --- /dev/null +++ b/src/__tests__/unit/parse.test.ts @@ -0,0 +1,492 @@ +import { NutrientClient } from '../../client'; +import type { ParseResponseMarkdown, ParseResponseSpatial, extractComponents } from '../../types'; +import { NutrientError, ValidationError } from '../../errors'; + +type ParagraphElement = extractComponents['schemas']['ParagraphElement']; +type TableElement = extractComponents['schemas']['TableElement']; +import * as inputsModule from '../../inputs'; +import * as httpModule from '../../http'; + +jest.mock('../../inputs'); +jest.mock('../../http'); + +const mockSendRequest = httpModule.sendRequest as jest.MockedFunction< + typeof httpModule.sendRequest +>; +const mockProcessFileInput = inputsModule.processFileInput as jest.MockedFunction< + typeof inputsModule.processFileInput +>; +const mockGetRemoteUrl = inputsModule.getRemoteUrl as jest.MockedFunction< + typeof inputsModule.getRemoteUrl +>; + +const sampleSpatialResponse: ParseResponseSpatial = { + status: 200, + requestId: 'req_e5f6g7h8', + output: { + elements: [ + { + id: 'a1b2c3d4-1111-4000-8000-000000000001', + type: 'paragraph', + role: 'Title', + text: 'Quarterly Report', + confidence: 0.95, + readingOrder: 0, + bounds: { x: 200, y: 139, width: 1111, height: 97 }, + page: { pageIndex: 0, pageNumber: 1, width: 1700, height: 2200 }, + } satisfies ParagraphElement, + { + id: 'a1b2c3d4-2222-4000-8000-000000000002', + type: 'table', + rowCount: 2, + columnCount: 2, + cells: [ + { + id: 'c-001', + bounds: { x: 100, y: 200, width: 200, height: 50 }, + confidence: 0.92, + row: 0, + column: 0, + rowSpan: 1, + colSpan: 1, + text: 'Region', + }, + ], + confidence: 0.92, + readingOrder: 1, + bounds: { x: 
100, y: 200, width: 600, height: 200 }, + page: { pageIndex: 0, pageNumber: 1, width: 1700, height: 2200 }, + } satisfies TableElement, + ], + }, + metrics: { processingTimeMs: 4200, pagesProcessed: 1 }, + usage: { data_extraction_credits: { cost: 1.5, remainingCredits: 850 } }, + configuration: { mode: 'structure', outputFormat: 'spatial' }, +}; + +const sampleMarkdownResponse: ParseResponseMarkdown = { + status: 200, + requestId: 'req_a1b2c3d4', + output: { markdown: '# Document Title\n\nFirst paragraph.' }, + metrics: { processingTimeMs: 312, pagesProcessed: 1 }, + usage: { data_extraction_credits: { cost: 1, remainingCredits: 849 } }, + configuration: { mode: 'text', outputFormat: 'markdown' }, +}; + +const normalizedFile = { + data: Buffer.from('%PDF-1.4 fake'), + filename: 'doc.pdf', +}; + +function makeClient(): NutrientClient { + return new NutrientClient({ apiKey: 'test-key' }); +} + +describe('NutrientClient.parse()', () => { + beforeEach(() => { + jest.clearAllMocks(); + mockProcessFileInput.mockResolvedValue(normalizedFile); + mockGetRemoteUrl.mockReturnValue(null); + }); + + describe('request shape', () => { + it('sends a multipart POST to /extraction/parse for a local file', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parse('document.pdf', { mode: 'structure' }); + + expect(mockSendRequest).toHaveBeenCalledTimes(1); + const call = mockSendRequest.mock.calls[0]?.[0] as { + method: string; + endpoint: string; + data: { instructions: { mode?: string }; file?: unknown }; + }; + expect(call.method).toBe('POST'); + expect(call.endpoint).toBe('/extraction/parse'); + expect(call.data.file).toBe(normalizedFile); + expect(call.data.instructions).toEqual({ mode: 'structure' }); + }); + + it('sends a JSON POST to /extraction/parse for a URL input', async () => { + mockGetRemoteUrl.mockReturnValue('https://example.com/doc.pdf'); + 
mockSendRequest.mockResolvedValue({ + data: sampleMarkdownResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parse('https://example.com/doc.pdf', { mode: 'text' }); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + method: string; + endpoint: string; + data: { instructions: { url?: string; mode?: string }; file?: unknown }; + }; + expect(call.endpoint).toBe('/extraction/parse'); + expect(call.data.file).toBeUndefined(); + expect(call.data.instructions).toEqual({ + mode: 'text', + url: 'https://example.com/doc.pdf', + }); + expect(mockProcessFileInput).not.toHaveBeenCalled(); + }); + + it('forwards the apiVersion option as x-nutrient-api-version', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parse('document.pdf', { + mode: 'understand', + apiVersion: '2026-05-25', + }); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + headers?: Record<string, string>; + }; + expect(call.headers).toEqual({ 'x-nutrient-api-version': '2026-05-25' }); + }); + + it('serialises language and output options into instructions', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parse('document.pdf', { + mode: 'understand', + output: { format: 'spatial', includeWords: true }, + language: ['eng', 'spa'], + }); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { + instructions: { + mode?: string; + output?: { format: string; includeWords?: boolean }; + options?: { language?: unknown }; + }; + }; + }; + expect(call.data.instructions).toEqual({ + mode: 'understand', + output: { format: 'spatial', includeWords: true }, + options: { language: ['eng', 'spa'] }, + }); + }); + + it('omits optional fields when not provided', async () => { + mockSendRequest.mockResolvedValue({ + data:
sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parse('document.pdf'); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { instructions: object }; + headers?: Record<string, string>; + }; + expect(call.data.instructions).toEqual({}); + expect(call.headers).toBeUndefined(); + }); + }); + + describe('mode coverage', () => { + const modes = ['text', 'structure', 'understand', 'agentic'] as const; + + it.each(modes)('serialises mode=%s into instructions', async (mode) => { + mockSendRequest.mockResolvedValue({ + data: + mode === 'text' + ? sampleMarkdownResponse + : { ...sampleSpatialResponse, configuration: { mode, outputFormat: 'spatial' } }, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const result = await makeClient().parse('document.pdf', { mode }); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { instructions: { mode?: string } }; + }; + expect(call.data.instructions.mode).toBe(mode); + // The mocked response echoes the configured mode, so result.configuration.mode + // round-trips correctly for downstream branching.
+ expect(result.configuration.mode).toBe(mode); + }); + }); + + describe('output-shape coverage', () => { + it('returns spatial elements when configuration.outputFormat is spatial', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const result = (await makeClient().parse('document.pdf', { + mode: 'structure', + output: { format: 'spatial' }, + })) as ParseResponseSpatial; + + expect(result.configuration.outputFormat).toBe('spatial'); + expect(Array.isArray(result.output.elements)).toBe(true); + expect(result.output.elements[0]?.type).toBe('paragraph'); + }); + + it('returns whole-document Markdown when configuration.outputFormat is markdown', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleMarkdownResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const result = (await makeClient().parse('document.pdf', { + mode: 'text', + output: { format: 'markdown' }, + })) as ParseResponseMarkdown; + + expect(result.configuration.outputFormat).toBe('markdown'); + expect(result.output.markdown.startsWith('# ')).toBe(true); + }); + + it('surfaces extraction-credit usage (not processor credits)', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const result = await makeClient().parse('document.pdf'); + // The field name `data_extraction_credits` is the explicit billing-bucket + // marker so callers cannot confuse it with processor API credits. + expect(result.usage?.data_extraction_credits?.cost).toBe(1.5); + expect(result.usage?.data_extraction_credits?.remainingCredits).toBe(850); + }); + }); + + describe('error paths', () => { + it('propagates ValidationError from the HTTP layer (e.g. 
400 invalid mode)', async () => { + mockSendRequest.mockRejectedValue( + new ValidationError('The request is malformed', { + errorDetails: { source: 'request', code: 'invalid_request' }, + }), + ); + + await expect( + makeClient().parse('document.pdf', { mode: 'understand' }), + ).rejects.toBeInstanceOf(ValidationError); + }); + + it('propagates errors raised by the file input layer', async () => { + mockProcessFileInput.mockRejectedValue( + new ValidationError('File not found: missing.pdf', { filePath: 'missing.pdf' }), + ); + + await expect(makeClient().parse('missing.pdf')).rejects.toBeInstanceOf(ValidationError); + expect(mockSendRequest).not.toHaveBeenCalled(); + }); + + it("rejects mode='text' + output.format='spatial' before the network call", async () => { + // text mode emits markdown only — the server returns 400 for this combination. + // The client should surface a ValidationError without a network round-trip. + await expect( + makeClient().parse('document.pdf', { mode: 'text', output: { format: 'spatial' } }), + ).rejects.toBeInstanceOf(ValidationError); + expect(mockSendRequest).not.toHaveBeenCalled(); + expect(mockProcessFileInput).not.toHaveBeenCalled(); + }); + }); + + describe('Data Extraction API key routing', () => { + it('routes parse() via extractApiKey when set, leaving apiKey untouched', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const client = new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: 'extract-key', + }); + await client.parse('document.pdf'); + + const passedOptions = mockSendRequest.mock.calls[0]?.[1]; + expect(passedOptions?.apiKey).toBe('extract-key'); + // Original client options must not be mutated. 
+ expect(client['options'].apiKey).toBe('processor-key'); + expect(client['options'].extractApiKey).toBe('extract-key'); + }); + + it('falls back to apiKey when extractApiKey is not provided', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const client = new NutrientClient({ apiKey: 'processor-key' }); + await client.parse('document.pdf'); + + const passedOptions = mockSendRequest.mock.calls[0]?.[1]; + expect(passedOptions?.apiKey).toBe('processor-key'); + }); + + it('forwards an extractApiKey getter unchanged so http.ts resolves it lazily', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const extractGetter = jest.fn(() => Promise.resolve('lazy-extract-key')); + const client = new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: extractGetter, + }); + await client.parse('document.pdf'); + + const passedOptions = mockSendRequest.mock.calls[0]?.[1]; + expect(passedOptions?.apiKey).toBe(extractGetter); + // The client itself does not invoke the getter — that's http.ts's job. 
+ expect(extractGetter).not.toHaveBeenCalled(); + }); + + it('uses extractApiKey for URL inputs too', async () => { + mockGetRemoteUrl.mockReturnValue('https://example.com/doc.pdf'); + mockSendRequest.mockResolvedValue({ + data: sampleMarkdownResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const client = new NutrientClient({ + apiKey: 'processor-key', + extractApiKey: 'extract-key', + }); + await client.parse('https://example.com/doc.pdf', { mode: 'text' }); + + const passedOptions = mockSendRequest.mock.calls[0]?.[1]; + expect(passedOptions?.apiKey).toBe('extract-key'); + }); + }); +}); + +describe('NutrientClient.parseToMarkdown()', () => { + beforeEach(() => { + jest.clearAllMocks(); + mockProcessFileInput.mockResolvedValue(normalizedFile); + mockGetRemoteUrl.mockReturnValue(null); + }); + + it('returns the markdown string and defaults to mode=text', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleMarkdownResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const md = await makeClient().parseToMarkdown('document.pdf'); + expect(md).toBe('# Document Title\n\nFirst paragraph.'); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { instructions: { mode?: string; output?: { format: string } } }; + }; + expect(call.data.instructions.mode).toBe('text'); + expect(call.data.instructions.output).toEqual({ format: 'markdown' }); + }); + + it('throws NutrientError on output mismatch (defensive)', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, // server returned spatial despite our markdown ask + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await expect(makeClient().parseToMarkdown('document.pdf')).rejects.toBeInstanceOf( + NutrientError, + ); + }); +}); + +describe('NutrientClient.parseElements()', () => { + beforeEach(() => { + jest.clearAllMocks(); + mockProcessFileInput.mockResolvedValue(normalizedFile); + 
mockGetRemoteUrl.mockReturnValue(null); + }); + + it('returns the elements array and defaults to mode=structure', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + const elements = await makeClient().parseElements('document.pdf'); + expect(elements).toHaveLength(2); + expect(elements[0]?.type).toBe('paragraph'); + expect(elements[1]?.type).toBe('table'); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { + instructions: { mode?: string; output?: { format: string; includeWords?: boolean } }; + }; + }; + expect(call.data.instructions.mode).toBe('structure'); + expect(call.data.instructions.output).toEqual({ format: 'spatial', includeWords: false }); + }); + + it('forwards includeWords=true into the request', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleSpatialResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await makeClient().parseElements('document.pdf', 'understand', true); + + const call = mockSendRequest.mock.calls[0]?.[0] as { + data: { instructions: { output?: { includeWords?: boolean } } }; + }; + expect(call.data.instructions.output?.includeWords).toBe(true); + }); + + it('throws NutrientError when the server returned markdown instead of spatial', async () => { + mockSendRequest.mockResolvedValue({ + data: sampleMarkdownResponse, + status: 200, + statusText: 'OK', + headers: {}, + } as never); + + await expect(makeClient().parseElements('document.pdf')).rejects.toBeInstanceOf(NutrientError); + }); +}); diff --git a/src/client.ts b/src/client.ts index 815fcc9..c63b808 100644 --- a/src/client.ts +++ b/src/client.ts @@ -8,12 +8,16 @@ import type { OutputTypeMap, WorkflowResult, UrlInput, + ParseInstructions, + ParseOptions, + ParseResponse, } from './types'; import { ValidationError, NutrientError } from './errors'; import { workflow } from './workflow'; import type { components, operations } 
from './generated/api-types'; +import type { components as extractComponents } from './generated/extract-types'; import { BuildActions } from './build'; -import { processFileInput, isRemoteFileInput } from './inputs'; +import { processFileInput, isRemoteFileInput, getRemoteUrl } from './inputs'; import { sendRequest } from './http'; import type { NormalizedFileData } from './inputs'; import type { ApplicableAction } from './builders/workflow'; @@ -67,6 +71,13 @@ function normalizePageParams( * return token; * } * }); + * + * // Data Extraction (`parse()`) needs its own key — it's a separate product + * // with its own credit pool. Pass both: + * const client = new NutrientClient({ + * apiKey: 'your-processor-key', + * extractApiKey: 'your-extract-key', + * }); * ``` */ export class NutrientClient { @@ -111,6 +122,20 @@ export class NutrientClient { if (options.baseUrl && typeof options.baseUrl !== 'string') { throw new ValidationError('Base URL must be a string'); } + + if (options.extractApiKey !== undefined) { + if ( + typeof options.extractApiKey !== 'string' && + typeof options.extractApiKey !== 'function' + ) { + throw new ValidationError( + 'Extract API key must be a string or a function that returns a Promise', + ); + } + if (options.extractApiKey === '') { + throw new ValidationError('Extract API key must not be an empty string'); + } + } } /** @@ -1704,10 +1729,7 @@ export class NutrientClient { * fs.writeFileSync('modified-document.pdf', Buffer.from(result.buffer)); * ``` */ - async deletePages( - pdf: FileInputWithUrl, - pageIndices: number[], - ): Promise { + async deletePages(pdf: FileInputWithUrl, pageIndices: number[]): Promise { if (!pageIndices || pageIndices.length === 0) { throw new ValidationError('At least one page index is required for deletion'); } @@ -1808,4 +1830,228 @@ export class NutrientClient { .execute(); return this.processTypedWorkflowResult(result); } + + /** + * Extracts structured content from a document via the Nutrient Data 
Extraction API + * (`POST /extraction/parse`). + * + * Designed for **content-extraction workflows** where the goal is to feed document + * content into a downstream pipeline rather than render or transform the document: + * + * - **RAG / search indexing / content migration** — use `output.format: 'markdown'` + * to get a whole-document Markdown string ready for chunking, embedding, and + * indexing in a vector store or search engine. + * - **Form and invoice extraction** — use `output.format: 'spatial'` (default) to + * get a typed element list (paragraphs, tables, keyValueRegions, etc.) with + * bounding boxes and confidence scores per element. + * - **Layout-aware document understanding** — combine `mode: 'understand'` or + * `mode: 'agentic'` with spatial output for deep layout reconstruction and + * semantic classification, including agentic workflows. + * + * See the README's Data Extraction section for per-mode positioning, a + * "when to use which mode" table, and worked recipes (RAG ingestion, + * form/invoice extraction). + * + * **Billing**: billed against **extraction credits**, a separate bucket from the + * **processor API credits** used by every other method on this client. Per-page + * costs: `text` 1 cr, `structure` 1.5 cr, `understand` 9 cr, `agentic` 18 cr. + * + * **Authentication**: Data Extraction is a separate product with its own API + * key. Pass it via `new NutrientClient({ apiKey, extractApiKey })`. If + * `extractApiKey` is omitted, this method falls back to `apiKey`, which only + * succeeds when the key is a global DWS key authorised for both products. + * + * @param input - The document to parse. Accepts local files (paths, Buffers, + * streams), or a URL string / `{ type: 'url', url: '...' }` object. The endpoint + * accepts a range of document formats — PDFs, Office documents (Word, Excel, + * PowerPoint), and images. Unlike `sign()`, parsing is not restricted to PDFs. 
+ * @param options - Optional parse configuration: + * - `mode` — processing pipeline (`'text'` | `'structure'` | `'understand'` | `'agentic'`). + * - `output.format` — `'spatial'` for typed elements or `'markdown'` for Markdown. + * - `output.includeWords` — include word-level OCR data inside elements. + * - `language` — OCR language hint. Accepts a lowercase language name + * (`'english'`, `'german'`), an ISO 639-2 code (`'eng'`, `'deu'`), an + * array (`['eng', 'spa']`), or a `+`-joined multilingual string (`'eng+spa'`). + * - `apiVersion` — optional API-version header override. + * @returns Promise resolving to the full `/extraction/parse` response envelope. + * Narrow on `output.markdown` / `output.elements` for type-safe field access, + * or read `configuration.outputFormat` for the server-resolved value. + * Extraction-credit accounting is at `usage.data_extraction_credits`. + * + * @example + * ```typescript + * // RAG ingestion — born-digital PDF → Markdown, cheapest path (1 cr/page). + * const md = await client.parse('whitepaper.pdf', { mode: 'text' }); + * if (md.output.markdown !== undefined) { + * console.log(md.output.markdown); + * } + * + * // Form/invoice extraction — spatial elements with bounds and confidence. + * const spatial = await client.parse('invoice.pdf', { mode: 'understand' }); + * if (spatial.output.elements !== undefined) { + * for (const el of spatial.output.elements) { + * if (el.type === 'keyValueRegion') { + * for (const pair of el.pairs) { + * console.log(pair.key?.value, '→', pair.value?.value); + * } + * } + * } + * } + * + * // OCR-backed extraction with word-level data and multilingual hint. + * const scan = await client.parse('scan.pdf', { + * mode: 'structure', + * output: { format: 'spatial', includeWords: true }, + * language: ['eng', 'spa'], + * }); + * + * // URL input — the server fetches the document, no client-side download. 
+ * const remote = await client.parse('https://example.com/doc.pdf'); + * + * // Extraction-credit accounting (separate from processor API credits). + * console.log('Credits used:', remote.usage?.data_extraction_credits?.cost); + * console.log('Credits left:', remote.usage?.data_extraction_credits?.remainingCredits); + * + * // Convenience wrappers skip output-format discrimination entirely: + * const markdown = await client.parseToMarkdown('whitepaper.pdf'); + * const elements = await client.parseElements('invoice.pdf', 'understand'); + * ``` + */ + async parse(input: FileInputWithUrl, options?: ParseOptions): Promise<ParseResponse> { + // `text` mode emits markdown only, so the server rejects mode='text' + // combined with output.format='spatial' with a 400. Reject client-side so + // the caller gets a clear error without a network round-trip. Note: + // `parseElements()` blocks this at the type level via `Exclude`, but the + // low-level `parse()` accepts any combination, so the runtime guard is needed here. + if (options?.mode === 'text' && options?.output?.format === 'spatial') { + throw new ValidationError( + "mode='text' is not supported with output.format='spatial'. " + + "Use output.format='markdown', or switch to mode='structure' / 'understand' / 'agentic' for spatial elements.", + ); + } + + const instructions: ParseInstructions = {}; + if (options?.mode !== undefined) instructions.mode = options.mode; + if (options?.output !== undefined) instructions.output = options.output; + if (options?.language !== undefined) { + instructions.options = { language: options.language }; + } + + const headers: Record | undefined = + options?.apiVersion !== undefined + ? { 'x-nutrient-api-version': options.apiVersion } + : undefined; + + // Data Extraction is a separate product with its own API key. Route the + // request via a per-call options copy so the rest of the client (which + // talks to the Processor API) keeps using the main key. Falls back to + // apiKey when extractApiKey is unset. 
+ const parseOptions: NutrientClientOptions = + this.options.extractApiKey !== undefined + ? { ...this.options, apiKey: this.options.extractApiKey } + : this.options; + + // URL input → JSON body + const remoteUrl = getRemoteUrl(input); + if (remoteUrl !== null) { + instructions.url = remoteUrl; + const response = await sendRequest( + { + method: 'POST', + endpoint: '/extraction/parse', + data: { instructions }, + ...(headers ? { headers } : {}), + }, + parseOptions, + 'json', + ); + return response.data; + } + + // Local file input → multipart upload + const normalizedFile = await processFileInput(input as FileInput); + const response = await sendRequest( + { + method: 'POST', + endpoint: '/extraction/parse', + data: { instructions, file: normalizedFile }, + ...(headers ? { headers } : {}), + }, + parseOptions, + 'json', + ); + return response.data; + } + + /** + * Convenience wrapper around {@link NutrientClient.parse} that returns the + * whole-document Markdown directly. Billed against **extraction credits** + * (1 credit/page for `text`, 1.5 for `structure`, 9 for `understand`, 18 for + * `agentic`). + * + * @param input - The document to parse. + * @param mode - Processing mode (defaults to `'text'` for cheapest Markdown extraction). + * @returns Promise resolving to the Markdown string. 
+ * + * @example + * ```typescript + * const markdown = await client.parseToMarkdown('document.pdf'); + * console.log(markdown); + * ``` + */ + async parseToMarkdown( + input: FileInputWithUrl, + mode: ParseOptions['mode'] = 'text', + ): Promise<string> { + const result = await this.parse(input, { + mode, + output: { format: 'markdown' }, + }); + if (result.output.markdown === undefined) { + throw new NutrientError( + 'parseToMarkdown expected markdown output, server returned ' + + result.configuration.outputFormat, + 'PARSE_OUTPUT_MISMATCH', + { configuration: result.configuration as unknown as Record }, + ); + } + return result.output.markdown; + } + + /** + * Convenience wrapper around {@link NutrientClient.parse} that returns the + * spatial elements array directly. Not available with `mode: 'text'`. + * Billed against **extraction credits** (1.5/page for `structure`, 9 for + * `understand`, 18 for `agentic`). + * + * @param input - The document to parse. + * @param mode - Processing mode (defaults to `'structure'`). Must not be `'text'`. + * @param includeWords - Include word-level OCR data inside paragraphs and table cells. + * @returns Promise resolving to the array of spatial elements. 
+ * + * @example + * ```typescript + * const elements = await client.parseElements('scan.pdf', 'understand'); + * const tables = elements.filter(e => e.type === 'table'); + * ``` + */ + async parseElements( + input: FileInputWithUrl, + mode: Exclude = 'structure', + includeWords = false, + ): Promise { + const result = await this.parse(input, { + mode, + output: { format: 'spatial', includeWords }, + }); + if (result.output.elements === undefined) { + throw new NutrientError( + 'parseElements expected spatial output, server returned ' + + result.configuration.outputFormat, + 'PARSE_OUTPUT_MISMATCH', + { configuration: result.configuration as unknown as Record }, + ); + } + return result.output.elements; + } } diff --git a/src/generated/extract-types.ts b/src/generated/extract-types.ts new file mode 100644 index 0000000..6fc62fb --- /dev/null +++ b/src/generated/extract-types.ts @@ -0,0 +1,689 @@ +/** + * This file was auto-generated by openapi-typescript. + * Do not make direct changes to the file. + */ + +export interface paths { + "/extraction/parse": { + parameters: { + query?: never; + header?: never; + path?: never; + cookie?: never; + }; + get?: never; + put?: never; + /** + * Extract data from a document + * @description Extract structured content from a document. Returns either typed document elements + * with spatial data or a whole-document markdown representation. + * + * Four processing modes are available: + * - **text** — Plain text extraction powered by Document Engine. Only supports `markdown` output. + * - **structure** — OCR-backed structured extraction with spatial element output. + * - **understand** — Deeper document analysis with structured extraction and semantic enrichment. + * - **agentic** — AI-powered analysis for complex documents that need visual reasoning and self-correction. 
+ * + * You can provide the input document in three ways: + * - **Multipart form upload** via `multipart/form-data` with a `file` field and optional JSON `instructions`. + * - **URL-based input** via `application/json` with a `url` field pointing to a remote document. + * - **Raw binary upload** via `application/pdf` (or other supported content types). + */ + post: operations["extraction-parse"]; + delete?: never; + options?: never; + head?: never; + patch?: never; + trace?: never; + }; +} +export type webhooks = Record; +export interface components { + schemas: { + /** + * @description Processing pipeline. + * - `text` — Plain text extraction powered by Document Engine. Only supports `markdown` output format. + * - `structure` — OCR-backed structured extraction with spatial element output. + * - `understand` — Deeper document analysis with structured extraction and semantic enrichment. + * - `agentic` — AI-powered analysis for complex documents that need visual reasoning and self-correction. + * @default understand + * @example understand + * @enum {string} + */ + Mode: "text" | "structure" | "understand" | "agentic"; + /** + * @description Output configuration. When provided, `format` is required. + * Default format depends on the mode: `text` defaults to `markdown`; + * `structure`, `understand`, and `agentic` default to `spatial`. + */ + OutputOptions: { + /** + * @description The output format. + * - `spatial` — Flat typed elements with bounding boxes, confidence scores, reading order, and page references. + * Not available with `text` mode. + * - `markdown` — Whole-document markdown representation. + * @example spatial + * @enum {string} + */ + format: "spatial" | "markdown"; + /** + * @description Include word-level OCR data nested inside paragraph and table cell elements. + * Only applicable when `format` is `spatial`. + * @default false + * @example false + */ + includeWords: boolean; + }; + /** @description Additional processing options. 
*/ + ProcessingOptions: { + /** + * @description OCR language hint. Only supported for `structure`, `understand`, and `agentic` modes. + * Accepts lowercase language names (`"english"`, `"german"`) + * or ISO 639-2 language codes (`"eng"`, `"deu"`). Multilingual OCR can be expressed + * as an array (`["eng", "spa"]`) or a `+`-joined string (`"eng+spa"`). + * @default eng + */ + language: string | string[]; + }; + ParseResponse: { + /** + * @description HTTP status code. + * @example 200 + * @enum {integer} + */ + status: 200; + /** + * @description Unique request identifier for debugging and support. + * @example req_e5f6g7h8 + */ + requestId: string; + output: components["schemas"]["ParseOutput"]; + metrics: components["schemas"]["Metrics"]; + usage?: components["schemas"]["Usage"]; + configuration: components["schemas"]["Configuration"]; + }; + /** + * @description Extracted content. Contains either `elements` (for spatial format) or `markdown` + * (for markdown format), never both. + */ + ParseOutput: { + /** + * @description Flat list of document elements across all pages, ordered by reading order. + * Present when `output.format` is `spatial`. + */ + elements?: components["schemas"]["Element"][]; + /** + * @description Whole-document markdown content. + * Present when `output.format` is `markdown`. + * @example # Document Title + * + * First paragraph of text... + */ + markdown?: string; + }; + Metrics: { + /** + * @description Total processing time in milliseconds. + * @example 4200 + */ + processingTimeMs: number; + /** + * @description Number of pages processed. + * @example 1 + */ + pagesProcessed: number; + }; + Usage: { + data_extraction_credits?: { + /** + * @description Credits consumed by this request. + * @example 2 + */ + cost: number; + /** + * @description Remaining credits in the account. 
+ * @example 850 + */ + remainingCredits: number; + }; + }; + Configuration: { + mode: components["schemas"]["Mode"]; + /** + * @description The output format that was used for this request. + * @example spatial + * @enum {string} + */ + outputFormat: "spatial" | "markdown"; + }; + ParseErrorResponse: { + /** + * @description HTTP status code. + * @example 400 + */ + status: number; + /** + * @description Unique request identifier for debugging and support. + * @example req_err_001 + */ + requestId: string; + /** + * @description Human-readable error summary. + * @example The request is malformed + */ + errorMessage: string; + /** @description Structured error details. Present on validation and processing errors. */ + errorDetails?: { + /** + * @description Error origin. + * - `request` — Validation errors (invalid parameters, unsupported format). + * - `processing` — Backend processing failures. + * - `maestro` — Maestro engine failures. + * @example request + */ + source?: string; + /** + * @description Machine-readable error code stable enough for client branching. + * @example invalid_request + */ + code?: string; + /** @description List of invalid fields. Present on validation errors. */ + failingPaths?: { + /** + * @description JSON path to the invalid field. + * @example $.mode + */ + path?: string; + /** + * @description Human-readable validation error. + * @example invalid mode: 'vlm'. Expected: text, structure, understand, agentic + */ + details?: string; + }[]; + }; + }; + Element: components["schemas"]["ParagraphElement"] | components["schemas"]["FormulaElement"] | components["schemas"]["PictureElement"] | components["schemas"]["TableElement"] | components["schemas"]["KeyValueRegionElement"] | components["schemas"]["HandwritingElement"]; + ElementBase: { + /** + * Format: uuid + * @description Unique element identifier. 
+ * @example a1b2c3d4-1111-4000-8000-000000000001 + */ + id: string; + bounds: components["schemas"]["Bounds"]; + /** + * @description Detection confidence score. + * @example 0.95 + */ + confidence: number; + /** + * @description Reading order index within the page. + * @example 0 + */ + readingOrder: number; + page: components["schemas"]["PageRef"]; + }; + ParagraphElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "paragraph"; + /** + * @description Semantic role of the paragraph. Null when the role is undetermined. + * @example Text + */ + role?: ("Text" | "Title" | "SectionHeader" | "Header" | "Footer" | "Caption" | "Footnote" | "ListItem" | "PageNumber" | "Code" | "CheckboxSelected" | "CheckboxUnselected") | null; + /** + * @description Extracted text content. + * @example Revenue grew 15% year-over-year. + */ + text: string; + /** @description Word-level OCR data. Present when `includeWords` is `true`. */ + words?: components["schemas"]["Word"][] | null; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "paragraph"; + }; + FormulaElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "formula"; + /** + * @description LaTeX representation of the formula. + * @example r = r_0 e^{kt} + */ + latex: string; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "formula"; + }; + PictureElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "picture"; + /** + * @description Image classification category (chart, photo, diagram, etc.). + * @example chart + */ + classification: string; + /** + * @description Confidence score for the classification. + * @example 0.91 + */ + classificationConfidence: number; + /** + * @description AI-generated alternative text description. 
+ * @example Bar chart showing quarterly revenue growth across regions + */ + altDescription: string; + /** @description IDs of associated caption paragraph elements. */ + captionIds?: string[] | null; + /** @description IDs of associated footnote paragraph elements. */ + footnoteIds?: string[] | null; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "picture"; + }; + TableElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "table"; + /** + * @description Number of rows in the table. + * @example 3 + */ + rowCount: number; + /** + * @description Number of columns in the table. + * @example 3 + */ + columnCount: number; + /** @description Cell-level data. */ + cells: components["schemas"]["TableCell"][]; + /** @description IDs of associated caption paragraph elements. */ + captionIds?: string[] | null; + /** @description IDs of associated footnote paragraph elements. */ + footnoteIds?: string[] | null; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "table"; + }; + KeyValueRegionElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "keyValueRegion"; + /** @description Detected key-value pairs. */ + pairs: components["schemas"]["KeyValuePair"][]; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "keyValueRegion"; + }; + HandwritingElement: components["schemas"]["ElementBase"] & { + /** @enum {string} */ + type: "handwriting"; + /** + * @description Extracted handwritten text content. + * @example John Doe + */ + text: string; + /** @description Word-level OCR data. Present when `includeWords` is `true`. 
*/ + words?: components["schemas"]["Word"][] | null; + } & { + /** + * @description discriminator enum property added by openapi-typescript + * @enum {string} + */ + type: "handwriting"; + }; + /** + * @description Bounding box of an element on the page. `(x, y)` is the top-left corner of the box. + * Origin is the top-left of the page, with x increasing right and y increasing down. + * + * Coordinates are always expressed in render-space pixels. + * `page.width` and `page.height` describe the same pixel canvas as every element, + * word, and table cell bound on that page. + */ + Bounds: { + /** + * @description X coordinate of the top-left corner (distance from page left edge). + * @example 100 + */ + x: number; + /** + * @description Y coordinate of the top-left corner (distance from page top edge). + * @example 50 + */ + y: number; + /** + * @description Width of the bounding box. + * @example 400 + */ + width: number; + /** + * @description Height of the bounding box. + * @example 35 + */ + height: number; + }; + /** @description Word-level OCR result. */ + Word: { + /** + * @description The word text. + * @example Revenue + */ + text: string; + bounds: components["schemas"]["Bounds"]; + /** + * @description OCR confidence score. + * @example 0.95 + */ + confidence: number; + }; + /** + * @description Source page reference. Provides the page index and the full page dimensions, + * which define the coordinate space that all element bounds on this page are relative to. + */ + PageRef: { + /** + * @description 0-based page index. + * @example 0 + */ + pageIndex: number; + /** + * @description 1-based page number. + * @example 1 + */ + pageNumber: number; + /** + * @description Page width in render-space pixels. + * @example 1700 + */ + width: number; + /** + * @description Page height in render-space pixels. + * @example 2200 + */ + height: number; + }; + TableCell: { + /** + * @description Unique cell identifier. 
+ * @example c-001 + */ + id: string; + bounds: components["schemas"]["Bounds"]; + /** + * @description Detection confidence score. + * @example 0.94 + */ + confidence: number; + /** + * @description 0-indexed row. + * @example 0 + */ + row: number; + /** + * @description 0-indexed column. + * @example 0 + */ + column: number; + /** + * @description Number of rows this cell spans. + * @default 1 + * @example 1 + */ + rowSpan: number; + /** + * @description Number of columns this cell spans. + * @default 1 + * @example 1 + */ + colSpan: number; + /** + * @description Extracted text content. + * @example Region + */ + text: string; + /** @description Word-level OCR data. Present when `includeWords` is `true`. */ + words?: components["schemas"]["Word"][] | null; + }; + KeyValuePair: { + /** + * @description Unique identifier for the pair. + * @example kvp-001 + */ + id: string; + /** @description The key/question entity. Null when only a value was detected. */ + key?: components["schemas"]["KeyValueEntity"] | null; + /** @description The value/answer entity. Null when only a key was detected. */ + value?: components["schemas"]["KeyValueEntity"] | null; + /** + * @description Confidence for the key-value relationship. + * @example 0.93 + */ + relationshipConfidence?: number | null; + }; + KeyValueEntity: { + /** + * @description Unique entity identifier. + * @example kve-001 + */ + id: string; + bounds: components["schemas"]["Bounds"]; + /** + * @description Detection confidence score. + * @example 0.92 + */ + confidence: number; + /** + * @description Entity type. + * @example QUESTION + * @enum {string} + */ + entityType: "QUESTION" | "ANSWER" | ""; + /** + * @description Extracted value. + * @example Invoice Number + */ + value: unknown; + }; + }; + responses: never; + parameters: { + /** + * @description Optional API version override for this request. + * + * If omitted, the request uses the latest version that was available when the API key was created. 
+ * + * See the [API Versioning](#description/api-versioning) section for the list of supported versions. + * @example 2026-05-25 + */ + NutrientApiVersion: "2026-05-25"; + }; + requestBodies: never; + headers: never; + pathItems: never; +} +export type $defs = Record; +export interface operations { + "extraction-parse": { + parameters: { + query?: never; + header?: { + /** + * @description Optional API version override for this request. + * + * If omitted, the request uses the latest version that was available when the API key was created. + * + * See the [API Versioning](#description/api-versioning) section for the list of supported versions. + * @example 2026-05-25 + */ + "x-nutrient-api-version"?: components["parameters"]["NutrientApiVersion"]; + }; + path?: never; + cookie?: never; + }; + requestBody?: { + content: { + "multipart/form-data": { + /** + * Format: binary + * @description The document to parse. + * @example + */ + file?: string; + /** + * @description JSON-serialized processing instructions. Omit to use all defaults + * (`mode: "understand"` with spatial element output). + */ + instructions?: { + /** + * Format: uri + * @description URL of a remote document to parse. Use this instead of the `file` field + * to process a document hosted at a public URL. + * @example https://storage.example.com/invoice.pdf + */ + url?: string; + mode?: components["schemas"]["Mode"]; + output?: components["schemas"]["OutputOptions"]; + options?: components["schemas"]["ProcessingOptions"]; + }; + }; + /** + * @example { + * "url": "https://storage.example.com/invoice.pdf", + * "mode": "understand", + * "output": { + * "format": "spatial" + * } + * } + */ + "application/json": { + /** + * Format: uri + * @description URL of a remote document to parse. 
+ * @example https://storage.example.com/invoice.pdf + */ + url: string; + mode?: components["schemas"]["Mode"]; + output?: components["schemas"]["OutputOptions"]; + options?: components["schemas"]["ProcessingOptions"]; + }; + "application/pdf": string; + "image/png": string; + "image/jpeg": string; + "image/tiff": string; + }; + }; + responses: { + /** @description Extraction completed successfully. */ + 200: { + headers: { + [name: string]: unknown; + }; + content: { + "application/json": components["schemas"]["ParseResponse"]; + }; + }; + /** @description The request is malformed. Invalid parameters, unsupported file format, or missing required fields. */ + 400: { + headers: { + [name: string]: unknown; + }; + content: { + /** + * @example { + * "status": 400, + * "requestId": "req_err_001", + * "errorMessage": "The request is malformed", + * "errorDetails": { + * "source": "request", + * "code": "invalid_request", + * "failingPaths": [ + * { + * "path": "$.mode", + * "details": "invalid mode: 'vlm'. Expected: text, structure, understand, agentic" + * } + * ] + * } + * } + */ + "application/json": components["schemas"]["ParseErrorResponse"]; + }; + }; + /** @description You are unauthorized. Sent when no API token is specified, or when the API token you specified isn't valid. */ + 401: { + headers: { + [name: string]: unknown; + }; + content?: never; + }; + /** @description Insufficient credits for this request. */ + 402: { + headers: { + [name: string]: unknown; + }; + content: { + /** + * @example { + * "status": 402, + * "requestId": "req_err_002", + * "errorMessage": "Insufficient credits. This request requires 2 credits, 0 remaining." + * } + */ + "application/json": components["schemas"]["ParseErrorResponse"]; + }; + }; + /** @description The uploaded file exceeds the maximum allowed size for your plan. */ + 413: { + headers: { + [name: string]: unknown; + }; + content?: never; + }; + /** @description Too many requests. 
You have exceeded the rate limit for your subscription. */ + 429: { + headers: { + [name: string]: unknown; + }; + content?: never; + }; + /** @description An internal processing error occurred. Please retry or contact support with the `requestId`. */ + 500: { + headers: { + [name: string]: unknown; + }; + content: { + /** + * @example { + * "status": 500, + * "requestId": "req_err_003", + * "errorMessage": "Processing failed. Please retry or contact support with the requestId.", + * "errorDetails": { + * "source": "maestro", + * "code": "maestro_error" + * } + * } + */ + "application/json": components["schemas"]["ParseErrorResponse"]; + }; + }; + /** @description The processing backend is temporarily unavailable. Please retry later. */ + 503: { + headers: { + [name: string]: unknown; + }; + content?: never; + }; + }; + }; +} diff --git a/src/http.ts b/src/http.ts index 2dfb051..4704c71 100644 --- a/src/http.ts +++ b/src/http.ts @@ -158,6 +158,34 @@ function prepareRequestBody; + const { file, instructions } = typedConfig.data; + + if (file) { + // Multipart upload: file + JSON instructions + const formData = new FormData(); + appendFileToFormData(formData, 'file', file); + if (instructions && Object.keys(instructions).length > 0) { + formData.append('instructions', JSON.stringify(instructions), { + contentType: 'application/json', + }); + } + axiosConfig.data = formData; + axiosConfig.headers = { + ...axiosConfig.headers, + ...formData.getHeaders(), + }; + } else { + // URL-only request → JSON body + axiosConfig.data = instructions; + axiosConfig.headers = { + ...axiosConfig.headers, + 'Content-Type': 'application/json', + }; + } + return axiosConfig; } } @@ -256,6 +284,10 @@ function extractErrorMessage(data: unknown): string | null { if (typeof errorData['error_message'] === 'string') { return errorData['error_message']; } + // DWS Extract uses `errorMessage` (camelCase) on every 4xx/5xx response. 
+ if (typeof errorData['errorMessage'] === 'string') { + return errorData['errorMessage']; + } // Common error message fields if (typeof errorData['message'] === 'string') { diff --git a/src/index.ts b/src/index.ts index 65f091d..7531d67 100644 --- a/src/index.ts +++ b/src/index.ts @@ -36,6 +36,27 @@ export type { OutputTypeMap, TypedWorkflowResult, WorkflowDryRunResult, + + // Data Extraction (`/extraction/parse`) — hand-composed client-facing types. + // Schema primitives (Mode, Element and the six subtypes, Bounds, PageRef, + // Word, Metrics, Usage, Configuration, ParseErrorResponse, etc.) live in the + // `extractComponents` namespace below — same pattern as `components` for the + // Processor spec. + ExtractionCredits, + ParseOutputOptions, + ParseInstructions, + ParseOptions, + ParseResponse, + ParseResponseSpatial, + ParseResponseMarkdown, + + // Generated spec namespaces + components, + operations, + paths, + extractComponents, + extractOperations, + extractPaths, } from './types'; // Utility exports diff --git a/src/types/common.ts b/src/types/common.ts index 2d889ff..7edd362 100644 --- a/src/types/common.ts +++ b/src/types/common.ts @@ -32,4 +32,18 @@ export interface NutrientClientOptions { * Timeout in milliseconds */ timeout?: number; + + /** + * Optional API key (or async getter) for the Nutrient DWS **Data Extraction** + * product. Required by `parse()` because Data Extraction is a separate + * product from the Processor API and has its own credit pool — using a + * Processor key against `/extraction/parse` returns 403. + * + * If omitted, `parse()` falls back to `apiKey`. That fallback works on + * tenants where a single global DWS key authorises both products. + * + * No other client method uses this key — `convert`, `sign`, `ocr`, etc. + * always use `apiKey`. 
+ */ + extractApiKey?: string | (() => Promise<string>); } diff --git a/src/types/http.ts b/src/types/http.ts index 64185e4..a5a1c5e 100644 --- a/src/types/http.ts +++ b/src/types/http.ts @@ -1,7 +1,138 @@ import type { components, operations } from '../generated/api-types'; +import type { components as extractComponents } from '../generated/extract-types'; import type { NormalizedFileData } from '../inputs'; import type { ValueOf } from '@typescript-eslint/eslint-plugin/dist/util'; +type ExtractSchemas = extractComponents['schemas']; + +// ───────────────────────────────────────────────────────────────────────────── +// `/extraction/parse` — hand-composed request and response types +// +// The schema primitives (Mode, OutputFormat, Element and the six element +// subtypes, Bounds, PageRef, Word, TableCell, KeyValuePair, KeyValueEntity, +// Metrics, Usage, Configuration, ParseErrorResponse) live in the generated +// extract-types and are accessible to consumers via the `extractComponents` +// re-export from the package root. The types defined below are the four +// shapes the spec doesn't express on its own: +// +// - `ParseOutputOptions` / `ParseInstructions` — spec marks +// `OutputOptions.includeWords` as required but the server defaults it. +// - `ParseResponseSpatial` / `ParseResponseMarkdown` — cross-field +// discriminated narrowing so `if (output.markdown !== undefined)` works +// without per-call `?.` access. +// - `ParseOptions` — adds the client-only `apiVersion` header concern that +// isn't a body field in the spec. +// - `ExtractionCredits` — derived alias for the billing-bucket sub-shape. +// ───────────────────────────────────────────────────────────────────────────── + +/** + * Extraction-credit usage returned by the Data Extraction API + * (`POST /extraction/parse`). + * + * **Extraction credits** are a separate billing bucket from the **processor + * API credits** consumed by `/build`, `/sign`, OCR, and every other endpoint + * on `NutrientClient`. 
An extraction call never debits processor credits and + * vice-versa. The server surfaces this object at + * `ParseResponse.usage.data_extraction_credits`. + */ +export type ExtractionCredits = NonNullable<ExtractSchemas['Usage']['data_extraction_credits']>; + +/** + * Output configuration for `/extraction/parse`. + * + * Defaults: `text` mode emits `markdown`; `structure`, `understand`, and + * `agentic` emit `spatial`. `includeWords` defaults to `false` server-side and + * is only honoured when `format` is `'spatial'`. Hand-written because the spec + * marks `includeWords` as required. + */ +export interface ParseOutputOptions { + /** Output format. */ + format: ExtractSchemas['Configuration']['outputFormat']; + /** + * Include word-level OCR data nested inside paragraph and table cell + * elements. Only applicable when `format` is `'spatial'`. + */ + includeWords?: boolean; +} + +/** + * Instruction payload sent to `/extraction/parse`. All fields are optional; an + * empty object resolves to `mode: 'understand'` with spatial output server-side. + */ +export interface ParseInstructions { + /** + * URL of a remote document to parse. Used by the JSON request shape; when + * passing a local file or buffer, omit this field. + */ + url?: string; + mode?: ExtractSchemas['Mode']; + output?: ParseOutputOptions; + options?: ExtractSchemas['ProcessingOptions']; +} + +/** + * Options accepted by `NutrientClient.parse()`. Hand-written because + * `apiVersion` is a client-only header override, not a body field in the spec. + */ +export interface ParseOptions { + mode?: ExtractSchemas['Mode']; + output?: ParseOutputOptions; + /** OCR language hint. Only honoured for `structure` / `understand` / `agentic` modes. */ + language?: ExtractSchemas['ProcessingOptions']['language']; + /** + * Optional API-version override sent as the `x-nutrient-api-version` header. + * Defaults to the version pinned at API-key creation time. + */ + apiVersion?: string; +} + +/** + * Successful `/extraction/parse` response with spatial element output. 
+ * + * Hand-composed over the generated `ParseOutput` schema: the spec marks both + * `elements` and `markdown` as optional on the same object, forcing `?.` access + * at every call site. These narrowed variants pin one field present and the + * other `undefined`, allowing `if (output.markdown !== undefined)` to + * discriminate cleanly. + */ +export interface ParseResponseSpatial { + status: 200; + /** Unique request identifier for debugging and support. */ + requestId: string; + output: { + elements: ExtractSchemas['Element'][]; + markdown?: undefined; + }; + metrics: ExtractSchemas['Metrics']; + usage?: ExtractSchemas['Usage']; + configuration: ExtractSchemas['Configuration'] & { outputFormat: 'spatial' }; +} + +/** Successful `/extraction/parse` response with whole-document Markdown output. */ +export interface ParseResponseMarkdown { + status: 200; + /** Unique request identifier for debugging and support. */ + requestId: string; + output: { + markdown: string; + elements?: undefined; + }; + metrics: ExtractSchemas['Metrics']; + usage?: ExtractSchemas['Usage']; + configuration: ExtractSchemas['Configuration'] & { outputFormat: 'markdown' }; +} + +/** + * Discriminated union of every successful `/extraction/parse` response. Narrow + * on `configuration.outputFormat` (or simply branch on `output.markdown` / + * `output.elements`) to pick between the two output shapes. + */ +export type ParseResponse = ParseResponseSpatial | ParseResponseMarkdown; + +// ───────────────────────────────────────────────────────────────────────────── +// Endpoint request/response type maps +// ───────────────────────────────────────────────────────────────────────────── + export type RequestTypeMap = { GET: { '/account/info': undefined; @@ -26,6 +157,16 @@ export type RequestTypeMap = { file?: NormalizedFileData; }; '/tokens': components['schemas']['CreateAuthTokenParameters']; + /** + * `/extraction/parse` request body. 
`instructions` is always sent (callers + * may pass an empty object for server defaults). Use exactly one of: + * - `file` + `instructions` for multipart upload (local files, buffers, streams). + * - `instructions.url` only for URL-based input (sent as `application/json`). + */ + '/extraction/parse': { + instructions: ParseInstructions; + file?: NormalizedFileData; + }; }; DELETE: { '/tokens': { id: string }; @@ -42,6 +183,7 @@ export type ResponseTypeMap = { '/sign': string; '/ai/redact': string; '/tokens': components['schemas']['CreateAuthTokenResponse']; + '/extraction/parse': ParseResponse; }; DELETE: { '/tokens': undefined; diff --git a/src/types/index.ts b/src/types/index.ts index b813af9..757ea2a 100644 --- a/src/types/index.ts +++ b/src/types/index.ts @@ -4,3 +4,11 @@ export * from './workflow'; export * from './http'; // Re-export generated types for convenience export type { components, operations, paths } from '../generated/api-types'; +// Re-export Data Extraction (`/extraction/parse`) spec types under a namespace +// so consumers can access element subtypes, schemas, and operations without a +// name collision with the Processor types above. +export type { + components as extractComponents, + operations as extractOperations, + paths as extractPaths, +} from '../generated/extract-types';
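
Note on the `ParseResponse` union introduced in `src/types/http.ts`: the sketch below illustrates how a caller might branch on `output.markdown` to pick between the two output shapes. It is a minimal illustration under stated assumptions, not the package's implementation. The `Sketch*` types are simplified local stand-ins for `ParseResponseSpatial` and `ParseResponseMarkdown` (the real types also carry `status`, `requestId`, `metrics`, `usage`, and `configuration`), and `describeOutput` is a hypothetical helper that does not exist in the client.

```typescript
// Simplified stand-ins for the exported response types (assumption: only the
// `output` field matters for this narrowing example).
type SketchElement = { type: string; text?: string };

interface SketchSpatial {
  output: { elements: SketchElement[]; markdown?: undefined };
}

interface SketchMarkdown {
  output: { markdown: string; elements?: undefined };
}

type SketchParseResponse = SketchSpatial | SketchMarkdown;

// Hypothetical helper: the `markdown !== undefined` check narrows `output`
// itself, so each branch reads its own field without optional chaining.
function describeOutput(res: SketchParseResponse): string {
  const { output } = res;
  if (output.markdown !== undefined) {
    return `markdown (${output.markdown.length} chars)`;
  }
  return `spatial (${output.elements.length} elements)`;
}

const markdownRes: SketchParseResponse = { output: { markdown: '# Invoice' } };
const spatialRes: SketchParseResponse = {
  output: { elements: [{ type: 'paragraph', text: 'Total due' }] },
};

console.log(describeOutput(markdownRes));
console.log(describeOutput(spatialRes));
```

The sketch branches on `output.markdown` rather than on `configuration.outputFormat` because the former check narrows the `output` object directly, which is the field the caller goes on to read.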