feat: wire import-file endpoint and Phase 2 markdown extraction (#77, #79, #80) #113
Merged
Conversation
…ctor
Phase 3b prep commit. Adds the name-agnostic extraction pipeline
restructuring that the import-file route handler will orchestrate in
the next commit(s):
- packages/core/src/extraction-pipeline.ts: split interface. Converter
returns { mdIntermediate: string } only via ConverterOutput.
ExtractionOutput { mdIntermediate, triples, provenance } remains as
the composite type assembled by the orchestrator (route handler).
- packages/core/src/index.ts: export ConverterOutput.
- packages/cli/src/extraction/markitdown-converter.ts: return type
updated to ConverterOutput (no behavior change, same binary invocation).
- packages/cli/src/extraction/markdown-extractor.ts: NEW Phase 2
structural extractor (~331 lines) implementing deterministic node-side
extraction from Markdown. Handles YAML frontmatter, wikilinks, tags,
Dataview inline fields, heading structure. No LLM, no external deps.
- packages/cli/src/extraction/index.ts: exports the new extractor.
- packages/cli/test/extraction-markdown.test.ts: NEW 27 unit tests
covering structural extraction cases. All pass.
- packages/core/test/extraction-pipeline.test.ts: updated for split
interface. 7/7 pass.
- packages/cli/test/document-processor-e2e.test.ts: updated for split
interface.
- packages/cli/test/extraction-markitdown.test.ts: updated for split
interface.
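The split interface described above can be sketched as follows. `ConverterOutput` and `ExtractionOutput` are the names from the commit text; the triple and provenance field types and the `assembleOutput` helper are illustrative placeholders, not the actual `extraction-pipeline.ts` API.

```typescript
// Sketch of the split extraction-pipeline interface. Converters emit
// Markdown only; the orchestrator assembles the composite result.

interface ConverterOutput {
  mdIntermediate: string; // Phase 1 result: Markdown only, no triples
}

interface ExtractionOutput {
  mdIntermediate: string;
  triples: string[];     // placeholder shape for extracted RDF triples
  provenance: string[];  // placeholder shape for provenance triples
}

// The orchestrator (route handler) combines the converter's Markdown
// with the Phase 2 extractor's triples into the composite type.
function assembleOutput(
  conv: ConverterOutput,
  triples: string[],
  provenance: string[],
): ExtractionOutput {
  return { mdIntermediate: conv.mdIntermediate, triples, provenance };
}
```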
Next commit wires POST /api/assertion/:name/import-file to orchestrate
Phase 1 (converter) + Phase 2 (markdown extractor) and write triples
to the target assertion. Prep commit ships no new HTTP routes — the
existing import-file endpoint in daemon.ts is unchanged until Phase 3b
completes wiring.
Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
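Two of the structural rules the new extractor implements (wikilinks → schema:mentions, hashtags → schema:keywords) can be sketched like this. The function names and the exact slugging scheme are illustrative assumptions, not the actual markdown-extractor.ts API, and the tag rule here is naive — the real extractor also excludes headings and code fences.

```typescript
// Illustrative sketch of two Phase 2 structural rules.

// [[Target]] wikilinks -> schema:mentions, slugified to urn:dkg:md:{slug}
// (slug rule assumed: lowercase, whitespace -> hyphens).
function extractWikilinks(md: string): string[] {
  const out: string[] = [];
  for (const m of md.matchAll(/\[\[([^\]]+)\]\]/g)) {
    const slug = m[1].trim().toLowerCase().replace(/\s+/g, "-");
    out.push(`urn:dkg:md:${slug}`);
  }
  return out;
}

// #keyword hashtags -> schema:keywords. Naive: does not yet exclude
// headings or code fences as the real extractor does.
function extractTags(md: string): string[] {
  return [...md.matchAll(/(^|\s)#([A-Za-z][\w-]*)/g)].map((m) => m[2]);
}
```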
Infrastructure commit for Phase 3b document ingestion. Adds two building
blocks the import-file route handler will consume in the next commit:
- packages/cli/src/file-store.ts: content-addressed disk store for
uploaded files and markdown intermediates. sha256-keyed with a two-level
sharded directory layout (ab/cdef...). put/get/has APIs return
`sha256:<hex>` prefixed hashes which the route handler surfaces as
fileHash and mdIntermediateHash in ImportFileResponse. Idempotent:
re-putting the same bytes yields the same hash and overwrites with
identical content.
- packages/cli/src/http/multipart.ts: minimal RFC 7578
multipart/form-data parser. Handles the exact subset the import-file
endpoint needs: one file part with filename + content-type headers,
plus any number of text parts. No nested multipart, no base64
transfer-encoding, no streaming (parses a buffered Buffer). Zero new
npm dependencies. Throws MultipartParseError on malformed input so the
caller can return a clean 400.
Tests:
- packages/cli/test/file-store.test.ts: 12 unit tests covering
put/get/has/hashToPath, idempotency, binary content, empty input,
malformed-hash handling, bare-hex vs sha256:-prefixed forms.
- packages/cli/test/multipart.test.ts: 19 unit tests covering
parseBoundary (standard, quoted, case-insensitive, missing) and
parseMultipart (text fields, file fields, mixed bodies, binary content
with 0x00/0xff bytes, malformed input error paths).
All 31/31 tests pass. CLI build clean. No route handler changes yet —
the next commit wires POST /api/assertion/:name/import-file to use
these primitives.
Part of OriginTrail/dkgv10-spec#77 and #80.
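The content-addressed scheme the commit describes — sha256-keyed hashes in `sha256:<hex>` form, stored under a two-level sharded directory layout — can be sketched as below. This is a minimal illustration, not the actual file-store.ts implementation (which also handles disk I/O and malformed-hash errors).

```typescript
import { createHash } from "node:crypto";
import { join } from "node:path";

// Hash bytes into the `sha256:<hex>` form surfaced as fileHash /
// mdIntermediateHash. Idempotent by construction: same bytes, same hash.
function contentHash(bytes: Buffer): string {
  return "sha256:" + createHash("sha256").update(bytes).digest("hex");
}

// Map a hash (sha256:-prefixed or bare hex) to its sharded on-disk
// path: the first two hex chars become a directory level (ab/cdef...).
function hashToPath(root: string, hash: string): string {
  const hex = hash.startsWith("sha256:") ? hash.slice(7) : hash;
  return join(root, hex.slice(0, 2), hex.slice(2));
}
```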
…atus
Implements the import-file document ingestion endpoint and its
companion extraction-status polling endpoint on the daemon. Wires
Phase 1 (converter) → Phase 2 (markdown structural extractor) → write
triples to the assertion graph, matching the orchestration described
in 05_PROTOCOL_EXTENSIONS.md §6.5.
New endpoints:
- POST /api/assertion/:name/import-file (multipart/form-data)
Fields:
file (required) — the uploaded document bytes
contextGraphId (required) — target context graph
contentType (optional) — override the file part's Content-Type
ontologyRef (optional) — CG _ontology URI for Phase 2 guided extraction
subGraphName (optional) — target sub-graph inside the CG
Orchestration:
1. Parse multipart body, store original file in FileStore → fileHash
2. Resolve detectedContentType (explicit field > multipart Content-Type)
3. Phase 1:
- text/markdown → skip converter, use raw bytes as mdIntermediate
- registered converter → run converter.extract(), store MD result
in FileStore → mdIntermediateHash
- no registered converter → graceful degrade: return status="skipped",
no triples written, file blob retained for later manual extraction
4. Phase 2 → extractFromMarkdown({ markdown, agentDid, ontologyRef,
documentIri: assertionUri }) → triples + provenance
5. Ensure assertion graph exists (idempotent), write triples + provenance
via agent.assertion.write
6. Record in in-memory ExtractionStatusRecord map, return ImportFileResponse
Error paths return typed extraction.status = "failed" with the error message.
Sub-graph registration errors propagate from assertionCreate/Write (finding
4 of issue #81).
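A minimal happy-path upload through these steps might look like the following. The host, port, assertion name, and contextGraphId value are illustrative, not taken from the PR.

```shell
# Upload a Markdown note; text/markdown skips Phase 1 and goes straight
# to the Phase 2 structural extractor. All values below are illustrative.
curl -X POST "http://localhost:8080/api/assertion/climate-report/import-file" \
  -F "file=@notes.md;type=text/markdown" \
  -F "contextGraphId=did:dkg:context-graph:research"
```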
- GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...
Returns the current extraction job state for an assertion by looking up the
in-memory record. Synchronous extractions populate this on the import-file
response; this endpoint lets agents re-query without holding the original
response and provides the hook for async extraction workflows in V10.x.
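Polling that record later might look like this (host, port, and query values are illustrative):

```shell
# Re-query the extraction record without holding the original
# import-file response. All values below are illustrative.
curl "http://localhost:8080/api/assertion/climate-report/extraction-status?contextGraphId=did:dkg:context-graph:research"
```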
Supporting changes:
- packages/cli/src/daemon.ts:
- Import contextGraphAssertionUri, extractFromMarkdown, FileStore,
parseBoundary, parseMultipart, MultipartParseError
- New constant MAX_UPLOAD_BYTES = 50 MB for document uploads
- New interface ExtractionStatusRecord
- New readBodyBuffer() helper — Buffer variant of readBody for binary
multipart payloads
- Instantiate FileStore at {dataDir}/files and extraction-status Map at
daemon start; thread both into handleRequest via two new parameters
- Log message for missing MarkItDown updated to clarify markdown uploads
still work
- packages/cli/test/skill-endpoint.test.ts:
- Regex tolerance for CRLF line endings in the YAML frontmatter check
(/^---\r?\n/ instead of /^---\n/). Pre-existing test was Windows-hostile
because Git's core.autocrlf normalizes LF → CRLF on checkout. Linux CI
was fine; Windows was failing. Tolerant regex fixes both.
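The tolerance is easy to demonstrate — the sample documents below are illustrative:

```typescript
// CRLF-tolerant frontmatter check from the test fix above: \r?\n accepts
// both LF (Linux checkout) and CRLF (Windows checkout with core.autocrlf).
const tolerant = /^---\r?\n/;
const strict = /^---\n/;

const lfDoc = "---\ntitle: x\n---\n";
const crlfDoc = "---\r\ntitle: x\r\n---\r\n";
```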
Tests:
- All existing cli tests pass unchanged: multipart 19/19, file-store 12/12,
extraction-markdown 27/27, extraction-markitdown 8/8, document-processor-e2e
13/13 (4 expected skips), skill-endpoint 11/11, extraction-pipeline 7/7.
- Integration tests for the new route handlers land in the next commit.
CLI build clean (TypeScript).
Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
Completes Phase 3b by documenting the shipped assertion API surface in
SKILL.md and adding integration tests for the import-file orchestration.
SKILL.md updates:
- §5 Memory Model "Working Memory (WM)" section: removed the "🚧 Planned"
marker on the assertion API (create/write/query/promote/discard ship as
of PR #108; import-file and extraction-status ship in this PR). Listed
the full shipped API surface with body shapes, added the import-file and
extraction-status endpoints, and noted the sub-graph registration check
from issue #81 finding 4 so agents know to createSubGraph() before
targeting one.
- §7 File Ingestion: replaced the "🚧 Planned" section with complete
documentation of the shipped POST /api/assertion/{name}/import-file
endpoint:
  - Two-phase pipeline overview (Phase 1 converter, Phase 2 structural
  extractor) with explicit text/markdown skip-Phase-1 note
  - Request table listing all form fields (file, contextGraphId,
  contentType, ontologyRef, subGraphName)
  - End-to-end curl example
  - Response shape with all fields populated
  - Extraction status semantics (completed / skipped / failed)
  - GET /api/assertion/{name}/extraction-status usage for polling
Integration tests (packages/cli/test/import-file-integration.test.ts):
NEW 12-test suite that exercises the full Phase 1 → Phase 2 →
assertion.write orchestration without requiring a full DKGAgent (which
needs libp2p + chain). Uses real FileStore (temp dir), real
ExtractionPipelineRegistry, real extractFromMarkdown, real
parseMultipart, and a mock agent that captures assertion.create/write
calls for verification. This drives the exact call sequence the daemon
route handler does, so it covers the orchestration end-to-end.
Happy paths (5 tests):
- text/markdown upload skips Phase 1, runs Phase 2, writes triples
covering every extractor feature (rdf:type, schema:name from frontmatter
title, schema:mentions from wikilink, schema:keywords from hashtag,
Dataview status field, dkg:hasSection headings)
- text/markdown detection from the file part's Content-Type header when
no explicit contentType field is provided
- contentType text field overrides the file part Content-Type
- Registered PDF converter runs Phase 1, stores the MD intermediate via
FileStore with a separate mdIntermediateHash distinct from fileHash,
runs Phase 2 on the converter's output
- ontologyRef threaded through to the converter
- subGraphName threaded through to assertion.create and assertion.write
Graceful degrade (2 tests):
- Unregistered content type (image/png): file stored with correct magic
bytes preserved, status="skipped", pipelineUsed=null, no triples
written, no assertion.create/write called
- File part with no Content-Type header defaults to
application/octet-stream and also degrades gracefully
Extraction-status semantics (2 tests):
- startedAt and completedAt timestamps populated on success
- Multiple imports to different assertions get separate status records
keyed by assertionUri
Boundary parsing (2 tests, via parseBoundary wrapper):
- Extracts boundary from daemon-style header
- Rejects non-multipart requests
skill-endpoint.test.ts updates:
- Replaced the stale "marks planned endpoints clearly" test (which
asserted /api/assertion/create was planned — no longer true) with two
tests: one that confirms the *(planned)* marker still exists (for
context graph sub-resources and agent profile), and a new test
"documents the now-shipped assertion API surface" that verifies all 7
shipped assertion routes (create/write/query/promote/discard/
import-file/extraction-status) appear in SKILL.md.
Test results:
- multipart: 19/19 pass
- file-store: 12/12 pass
- extraction-markdown: 27/27 pass
- extraction-markitdown: 8/8 pass
- skill-endpoint: 12/12 pass (was 11; +1 new assertion-API-surface test)
- import-file-integration: 12/12 pass (NEW)
- document-processor-e2e: 13/13 pass (4 expected skips,
markitdown-unavailable)
- Total: 99/99 pass + 4 expected skips
- Full cli build clean.
Closes OriginTrail/dkgv10-spec#77 (import-file wiring),
OriginTrail/dkgv10-spec#79 gap 3 (extraction-status endpoint),
OriginTrail/dkgv10-spec#80 (ExtractionPipeline interface split — via
the ff8afe3 prep commit).
added 2 commits
April 10, 2026 19:58
# Conflicts:
#	packages/publisher/src/dkg-publisher.ts
# Conflicts:
#	packages/publisher/src/dkg-publisher.ts
#	packages/publisher/test/draft-lifecycle.test.ts
…sing
Two PR #113 review findings:
1. parseBoundary() crashed on duplicated Content-Type headers because
the parameter type didn't admit string[] and .toLowerCase() blew up at
runtime. Widen the signature to string | string[] | undefined and
reject array values as ambiguous so the route handler returns a clean
400 instead of 500-ing inside the parser.
2. The outer write-stage catch in the import-file handler only matched
has-not-been-registered / Invalid / Unsafe errors and rethrew everything
else without updating the extraction status record. That left
/extraction-status stuck reporting in_progress on unexpected
agent.write() failures even after the import had failed. Record the
failure via recordFailedExtraction(...) before rethrowing so the status
reflects reality. Mirror the same fix in the import-file orchestration
test helper, which had the same shape.
Adds two tests:
- parseBoundary returns null for array values
- import-file orchestration records failed status on unexpected
write-stage errors (e.g. "Connection refused")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
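The widened signature from finding 1 can be sketched as below. This is an illustration of the shape, not the actual multipart.ts code: the regex is an assumption, and the sketch returns null for array values (matching the "parseBoundary returns null for array values" test) where the real parser's other error paths throw MultipartParseError.

```typescript
// Node's http headers arrive as string | string[] | undefined; a
// duplicated Content-Type header is a string[], rejected as ambiguous
// so the route handler can answer with a clean 400.
function parseBoundary(contentType: string | string[] | undefined): string | null {
  if (contentType === undefined || Array.isArray(contentType)) return null;
  const match = /multipart\/form-data;.*\bboundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  return match ? (match[1] ?? match[2]) : null;
}
```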
…very
Three PR #113 round 2 review findings:
1. multipart.ts Content-Disposition parser: the `name=` parameter regex
could match the `name=` substring inside `filename=`, so a part with
only `Content-Disposition: form-data; filename="x"` would be silently
accepted as a field named `"x"` instead of being rejected as malformed.
Anchor both `name=` and `filename=` matches to a real `;` parameter
boundary (or start of string).
2. import-file route: an empty `contentType=` form field was treated as
a real override because `??` only catches null/undefined, not empty
string. A client sending `contentType=` would downgrade a valid
text/markdown / application/pdf upload to application/octet-stream and
trigger graceful-degrade. Treat blank/whitespace overrides as absent in
both the daemon route handler and the test orchestration helper.
3. /.well-known/skill.md discovery: text/markdown is hard-coded as a
supported native ingestion type by the import-file route (skip Phase 1,
run Phase 2 markdown extractor directly), but
extractionRegistry.availableContentTypes() only listed registered
Phase 1 converters. Skill clients reading the discovery surface would
think Markdown ingestion was unavailable when it was actually always
supported. Surface text/markdown alongside the registered converters in
both the skill.md endpoint and the startup log.
Adds 5 tests:
- parseMultipart rejects parts with only filename= and no name=
- parseMultipart parses filename-first ordering correctly
- parseMultipart parses name= and filename= independently
- import-file orchestration treats blank contentType= as absent
- import-file orchestration treats whitespace-only contentType= as absent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
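The blank-override fix from finding 2 can be sketched like this. The helper name and shape are illustrative, not the actual daemon.ts code — the point is only that `??` lets `""` through, while trimming first treats blank and whitespace-only overrides as absent.

```typescript
// `??` only catches null/undefined, so an empty `contentType=` form
// field used to win over the multipart part's Content-Type. Trimming
// and treating "" as absent restores the intended fallback chain.
function resolveContentType(
  override: string | undefined,
  multipartType: string | undefined,
): string {
  const trimmed = override?.trim();
  const effective = trimmed ? trimmed : undefined; // blank/whitespace -> absent
  return effective ?? multipartType ?? "application/octet-stream";
}
```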
# Conflicts:
#	packages/cli/src/daemon.ts
branarakic pushed a commit that referenced this pull request on Apr 10, 2026
Resolve daemon.ts conflicts: accept fileStore and extractionStatus parameters from PR #113, drop publisherInspector (replaced by publisherControl in this branch). Made-with: Cursor
Summary
Completes the document ingestion pipeline for V10: agents can now upload a document (PDF, DOCX, Markdown, HTML, CSV, etc.) to a Working Memory assertion, and the node runs a deterministic two-phase extraction pipeline that writes RDF triples into the assertion graph. Closes three open spec issues:
- `POST /api/assertion/:name/import-file` handler wiring
- `GET /api/assertion/:name/extraction-status` endpoint
- `ExtractionPipeline` interface split (split-interface pattern: converters return `ConverterOutput { mdIntermediate }`, the route handler is the orchestrator that assembles the composite `ExtractionOutput`)

Companion spec PR: OriginTrail/dkgv10-spec#83 — documents the same contract in `05_PROTOCOL_EXTENSIONS.md` §6.5: the split `ConverterOutput`/`ExtractionOutput` interface, the `text/markdown` skip-Phase-1 rule, and the graceful-degrade paragraph for unregistered content types. Reviewers should cross-check this code PR against the spec PR for consistency (the reframed Phase 5c cross-PR comment review).

What ships
Extraction pipeline architecture (Phase 1 + Phase 2)
Phase 1 — Converters (`ExtractionPipeline` interface). Non-Markdown source formats go through a registered converter that produces a Markdown intermediate. `MarkItDownConverter` is the built-in converter for PDF/DOCX/PPTX/XLSX/CSV/HTML/EPUB/XML when the MarkItDown binary is available. Converters return `ConverterOutput { mdIntermediate: string }` — they do NOT produce triples or provenance.

text/markdown skip-Phase-1. Uploads with `Content-Type: text/markdown` bypass Phase 1 entirely — the raw file bytes ARE the Markdown intermediate. `text/markdown` is deliberately NOT a registered converter content type (PR #108 already removed it from `MARKITDOWN_CONTENT_TYPES`). The route handler detects this and feeds the bytes straight into Phase 2.

Phase 2 — Structural extractor (`markdown-extractor.ts`). Deterministic node-side RDF extraction from Markdown per `19_MARKDOWN_CONTENT_TYPE.md`. No LLM, no external calls. Handles:
- YAML frontmatter (`id`, `type`, `title`/`name`, `description`/`summary`, `keywords`/`tags`; arbitrary keys fall into `http://schema.org/{key}`)
- `type` frontmatter key → `rdf:type` (bare identifiers namespaced to `http://schema.org/`)
- Wikilinks `[[Target]]` → `schema:mentions` (slugified to `urn:dkg:md:{slug}`)
- Hashtags `#keyword` → `schema:keywords` (excludes headings and code fences)
- Dataview inline fields `key:: value` → properties
- Heading structure → `dkg:hasSection` with per-section `schema:name` (H1 skipped as document title, H2+ become sections)
- `dkg:ExtractionProvenance` block with `dkg:extractedBy`, `dkg:extractionRule`, `dkg:extractedAt`, `dkg:derivedFrom`, and `prov:wasGeneratedBy` back-link

Route handler orchestration. `POST /api/assertion/:name/import-file` wires Phase 1 and Phase 2 together, stores the original file and MD intermediate in a content-addressed file store, writes the resulting triples + provenance to the target assertion graph via `agent.assertion.write`, and tracks the extraction job state in an in-memory map for status polling.

New endpoints
`POST /api/assertion/:name/import-file` — `multipart/form-data`
- `file` — the uploaded document bytes (required)
- `contextGraphId` — target context graph (required)
- `contentType` — optional override of the file part's `Content-Type` header
- `ontologyRef` — optional CG `_ontology` URI for guided Phase 2 extraction (threaded to both phases)
- `subGraphName` — optional target sub-graph (must be registered first via `createSubGraph`)

Response shape:
{ "assertionUri": "did:dkg:context-graph:research/assertion/0xAgentAddr/climate-report", "fileHash": "sha256:a1b2c3...", "detectedContentType": "text/markdown", "extraction": { "status": "completed", "tripleCount": 14, "pipelineUsed": "text/markdown", "mdIntermediateHash": "sha256:a1b2c3..." } }extraction.status—"completed"|"skipped"|"failed"extraction.pipelineUsed—"text/markdown"for MD uploads, the content type of the registered converter otherwise, ornullfor the skipped caseextraction.mdIntermediateHash— present only when a converter ran Phase 1 (omitted for text/markdown which doesn't produce a separate intermediate)Graceful degrade for unregistered content types: if the detected content type has no registered converter and isn't
text/markdown, the route handler stores the file blob, returnsextraction.status = "skipped"withtripleCount: 0andpipelineUsed: null, and writes NO triples. The file remains retrievable byfileHashfor manual extraction later. This is the spec-mandated behavior from05_PROTOCOL_EXTENSIONS.md§6.5.GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...Returns the current extraction record from the in-memory status tracker. Synchronous extractions (the V10.0 default) populate this on the same
import-fileresponse; this endpoint lets agents re-query later without holding the original response and provides the hook for async extraction in V10.x. Returns 404 if noimport-filehas been run for the assertion.New infrastructure
`FileStore` (`packages/cli/src/file-store.ts`, 103 lines). Content-addressed disk store under `{dataDir}/files/`, sha256-keyed with a two-level sharded directory layout (ab/cdef...). Idempotent `put()` — same bytes always yield the same hash. `get()` and `has()` accept both `sha256:`-prefixed and bare hex forms.
packages/cli/src/http/multipart.ts, 150 lines). Minimal RFC-7578multipart/form-dataparser. Zero new dependencies. Handles the subset needed: one file part with filename + content-type, plus any number of text parts.parseBoundary()extracts the boundary token fromContent-Type: multipart/form-data; boundary=.... ThrowsMultipartParseErroron malformed input so the route can return a clean 400.readBodyBuffer()helper.Buffervariant of the existingreadBody()helper for binary payloads where.toString()would corrupt content. Used by the import-file route for multipart bodies.SKILL.md updates
- §5: the shipped assertion routes (`create`/`write`/`query`/`promote`/`discard`) plus the 2 new routes (`import-file`/`extraction-status`) are now documented with full body shapes. Added a note about the sub-graph registration check error message.
- §7: full documentation of the shipped `import-file` endpoint — two-phase pipeline overview, request field table, end-to-end curl example, response shape, extraction status semantics, and `extraction-status` polling usage.

Test plan
- 12 new integration tests: happy paths, graceful degrade (`image/png`, no Content-Type header defaults to `application/octet-stream`), 2 extraction-status semantics (timestamps, separate records per assertion), 2 boundary parsing tests. Uses real `FileStore` (temp dir), real `ExtractionPipelineRegistry`, real `extractFromMarkdown`, real `parseMultipart`, with a mock agent that captures `assertion.create`/`write` call arguments for verification. Covers the full route handler orchestration end-to-end without needing a full `DKGAgent`.
- `pnpm run build:runtime` across all 12 runtime packages: clean (TypeScript)

Total: 99 new + updated tests in Phase 3b, all passing on Windows.
Reviewer guidance — Linux CI is the gating signal
Same caveat as PR #112: the full `@origintrail-official/dkg-agent` suite has 9 pre-existing failures on Windows due to `spawn npx ENOENT` (the hardhat bootstrap can't find `npx` in the subprocess PATH) and libp2p timing issues. None of these failures are caused by this PR's changes — this PR modifies `packages/cli/` (the daemon, extraction pipeline, and tests) and `packages/core/src/extraction-pipeline.ts` + index (the interface split from the prep commit). The agent suite's failing tests are all in `packages/agent/test/*` and do not touch anything I modified. Please rely on the GitHub Actions Linux runner for the merge gate, not local Windows runs.

Commit structure (for review)
4 commits on this branch, each independently buildable:
- ff8afe3 — chore: prep for import-file wiring — interface split + markdown extractor. Name-agnostic refactor: splits the `ExtractionPipeline` return into `ConverterOutput { mdIntermediate }` while keeping `ExtractionOutput` as the composite type, adds the 331-line Phase 2 markdown structural extractor (27 unit tests), updates the `MarkItDownConverter.extract()` return type. No new HTTP routes, no behavior changes.
- d5b3755 — feat(cli): file store + multipart parser for import-file wiring. Infrastructure for the import-file route: content-addressed file store and minimal multipart parser. 31 unit tests. Zero new dependencies.
- add808b — feat(cli): wire POST /api/assertion/:name/import-file + extraction-status. The actual route handlers, wired into `daemon.ts`. `MAX_UPLOAD_BYTES` = 50 MB. `readBodyBuffer()` helper. `ExtractionStatusRecord` type. Graceful degrade for unregistered content types. Pre-existing `skill-endpoint.test.ts` YAML frontmatter regex gets `\r?\n` tolerance (was Windows-hostile due to Git `core.autocrlf`).
- d9f3221 — docs(cli): SKILL.md import-file workflow + integration tests. SKILL.md §5 removes "Planned" markers on shipped assertion routes. §7 rewritten with full import-file documentation. 12 new integration tests. Updated skill-endpoint tests to verify the shipped API surface.

Squash-merge is fine if preferred — the commits are logical groupings for review, not required history.
What this PR does NOT change
- PR #112 (fix: sub-graph polish, issue #81 findings 1, 3, 4, 5, 7) on the same `v10-rc` base. There is no file overlap — PR #112 touches `packages/publisher/src/dkg-publisher.ts`, `packages/agent/src/dkg-agent.ts`, `packages/storage/src/graph-manager.ts`, and `packages/publisher/test/draft-lifecycle.test.ts`, none of which are in this PR. Both PRs can merge in any order with no conflicts.
- MarkItDown binary distribution: extraction returns `status: "skipped"` when the binary is not available, and this PR ships with `isMarkItDownAvailable()` detection in `daemon.ts` that logs a clarification message at startup. Binary distribution is a separate toolchain workstream.
- `skillUrl` in register response (#79 gap 2) is explicitly deferred. The `/api/agent/register` endpoint itself is still marked Planned in SKILL.md and does not exist yet. When the register endpoint lands, it should include `skillUrl` per the original issue #79 ("Bug: lookupByUAL returns OK on internal errors, masking failures") design.

Closes OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
🤖 Generated with Claude Code