feat: wire import-file endpoint and Phase 2 markdown extraction (#77, #79, #80) #113
Merged
Conversation
…ctor
Phase 3b prep commit. Adds the name-agnostic extraction pipeline
restructuring that the import-file route handler will orchestrate in
the next commit(s):
- packages/core/src/extraction-pipeline.ts: split interface. Converter
returns { mdIntermediate: string } only via ConverterOutput.
ExtractionOutput { mdIntermediate, triples, provenance } remains as
the composite type assembled by the orchestrator (route handler).
- packages/core/src/index.ts: export ConverterOutput.
- packages/cli/src/extraction/markitdown-converter.ts: return type
updated to ConverterOutput (no behavior change, same binary invocation).
- packages/cli/src/extraction/markdown-extractor.ts: NEW Phase 2
structural extractor (~331 lines) implementing deterministic node-side
extraction from Markdown. Handles YAML frontmatter, wikilinks, tags,
Dataview inline fields, heading structure. No LLM, no external deps.
- packages/cli/src/extraction/index.ts: exports the new extractor.
- packages/cli/test/extraction-markdown.test.ts: NEW 27 unit tests
covering structural extraction cases. All pass.
- packages/core/test/extraction-pipeline.test.ts: updated for split
interface. 7/7 pass.
- packages/cli/test/document-processor-e2e.test.ts: updated for split
interface.
- packages/cli/test/extraction-markitdown.test.ts: updated for split
interface.
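The split interface described above can be sketched as follows. `ConverterOutput` and `ExtractionOutput` are the names from the commit text; the triple and provenance field types and the `assembleOutput` helper are illustrative placeholders, not the actual `extraction-pipeline.ts` API.

```typescript
// Sketch of the split extraction-pipeline interface. Converters emit
// Markdown only; the orchestrator assembles the composite result.

interface ConverterOutput {
  mdIntermediate: string; // Phase 1 result: Markdown only, no triples
}

interface ExtractionOutput {
  mdIntermediate: string;
  triples: string[];     // placeholder shape for extracted RDF triples
  provenance: string[];  // placeholder shape for provenance triples
}

// The orchestrator (route handler) combines the converter's Markdown
// with the Phase 2 extractor's triples into the composite type.
function assembleOutput(
  conv: ConverterOutput,
  triples: string[],
  provenance: string[],
): ExtractionOutput {
  return { mdIntermediate: conv.mdIntermediate, triples, provenance };
}
```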
Next commit wires POST /api/assertion/:name/import-file to orchestrate
Phase 1 (converter) + Phase 2 (markdown extractor) and write triples
to the target assertion. Prep commit ships no new HTTP routes — the
existing import-file endpoint in daemon.ts is unchanged until Phase 3b
completes wiring.
Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
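Two of the structural rules the new extractor implements (wikilinks → schema:mentions, hashtags → schema:keywords) can be sketched like this. The function names and the exact slugging scheme are illustrative assumptions, not the actual markdown-extractor.ts API, and the tag rule here is naive — the real extractor also excludes headings and code fences.

```typescript
// Illustrative sketch of two Phase 2 structural rules.

// [[Target]] wikilinks -> schema:mentions, slugified to urn:dkg:md:{slug}
// (slug rule assumed: lowercase, whitespace -> hyphens).
function extractWikilinks(md: string): string[] {
  const out: string[] = [];
  for (const m of md.matchAll(/\[\[([^\]]+)\]\]/g)) {
    const slug = m[1].trim().toLowerCase().replace(/\s+/g, "-");
    out.push(`urn:dkg:md:${slug}`);
  }
  return out;
}

// #keyword hashtags -> schema:keywords. Naive: does not yet exclude
// headings or code fences as the real extractor does.
function extractTags(md: string): string[] {
  return [...md.matchAll(/(^|\s)#([A-Za-z][\w-]*)/g)].map((m) => m[2]);
}
```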
Infrastructure commit for Phase 3b document ingestion. Adds two building
blocks the import-file route handler will consume in the next commit:
- packages/cli/src/file-store.ts: content-addressed disk store for
uploaded files and markdown intermediates. sha256-keyed with a two-level
sharded directory layout (ab/cdef...). put/get/has APIs return
`sha256:<hex>` prefixed hashes which the route handler surfaces as
fileHash and mdIntermediateHash in ImportFileResponse. Idempotent:
re-putting the same bytes yields the same hash and overwrites with
identical content.
- packages/cli/src/http/multipart.ts: minimal RFC 7578
multipart/form-data parser. Handles the exact subset the import-file
endpoint needs: one file part with filename + content-type headers,
plus any number of text parts. No nested multipart, no base64
transfer-encoding, no streaming (parses a buffered Buffer). Zero new
npm dependencies. Throws MultipartParseError on malformed input so the
caller can return a clean 400.
Tests:
- packages/cli/test/file-store.test.ts: 12 unit tests covering
put/get/has/hashToPath, idempotency, binary content, empty input,
malformed-hash handling, bare-hex vs sha256:-prefixed forms.
- packages/cli/test/multipart.test.ts: 19 unit tests covering
parseBoundary (standard, quoted, case-insensitive, missing) and
parseMultipart (text fields, file fields, mixed bodies, binary content
with 0x00/0xff bytes, malformed input error paths).
All 31/31 tests pass. CLI build clean. No route handler changes yet —
the next commit wires POST /api/assertion/:name/import-file to use
these primitives.
Part of OriginTrail/dkgv10-spec#77 and #80.
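The content-addressed scheme the commit describes — sha256-keyed hashes in `sha256:<hex>` form, stored under a two-level sharded directory layout — can be sketched as below. This is a minimal illustration, not the actual file-store.ts implementation (which also handles disk I/O and malformed-hash errors).

```typescript
import { createHash } from "node:crypto";
import { join } from "node:path";

// Hash bytes into the `sha256:<hex>` form surfaced as fileHash /
// mdIntermediateHash. Idempotent by construction: same bytes, same hash.
function contentHash(bytes: Buffer): string {
  return "sha256:" + createHash("sha256").update(bytes).digest("hex");
}

// Map a hash (sha256:-prefixed or bare hex) to its sharded on-disk
// path: the first two hex chars become a directory level (ab/cdef...).
function hashToPath(root: string, hash: string): string {
  const hex = hash.startsWith("sha256:") ? hash.slice(7) : hash;
  return join(root, hex.slice(0, 2), hex.slice(2));
}
```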
…atus
Implements the import-file document ingestion endpoint and its
companion extraction-status polling endpoint on the daemon. Wires
Phase 1 (converter) → Phase 2 (markdown structural extractor) → write
triples to the assertion graph, matching the orchestration described
in 05_PROTOCOL_EXTENSIONS.md §6.5.
New endpoints:
- POST /api/assertion/:name/import-file (multipart/form-data)
Fields:
file (required) — the uploaded document bytes
contextGraphId (required) — target context graph
contentType (optional) — override the file part's Content-Type
ontologyRef (optional) — CG _ontology URI for Phase 2 guided extraction
subGraphName (optional) — target sub-graph inside the CG
Orchestration:
1. Parse multipart body, store original file in FileStore → fileHash
2. Resolve detectedContentType (explicit field > multipart Content-Type)
3. Phase 1:
- text/markdown → skip converter, use raw bytes as mdIntermediate
- registered converter → run converter.extract(), store MD result
in FileStore → mdIntermediateHash
- no registered converter → graceful degrade: return status="skipped",
no triples written, file blob retained for later manual extraction
4. Phase 2 → extractFromMarkdown({ markdown, agentDid, ontologyRef,
documentIri: assertionUri }) → triples + provenance
5. Ensure assertion graph exists (idempotent), write triples + provenance
via agent.assertion.write
6. Record in in-memory ExtractionStatusRecord map, return ImportFileResponse
Error paths return typed extraction.status = "failed" with the error message.
Sub-graph registration errors propagate from assertionCreate/Write (finding
4 of issue #81).
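A minimal happy-path upload through these steps might look like the following. The host, port, assertion name, and contextGraphId value are illustrative, not taken from the PR.

```shell
# Upload a Markdown note; text/markdown skips Phase 1 and goes straight
# to the Phase 2 structural extractor. All values below are illustrative.
curl -X POST "http://localhost:8080/api/assertion/climate-report/import-file" \
  -F "file=@notes.md;type=text/markdown" \
  -F "contextGraphId=did:dkg:context-graph:research"
```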
- GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...
Returns the current extraction job state for an assertion by looking up the
in-memory record. Synchronous extractions populate this on the import-file
response; this endpoint lets agents re-query without holding the original
response and provides the hook for async extraction workflows in V10.x.
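Polling that record later might look like this (host, port, and query values are illustrative):

```shell
# Re-query the extraction record without holding the original
# import-file response. All values below are illustrative.
curl "http://localhost:8080/api/assertion/climate-report/extraction-status?contextGraphId=did:dkg:context-graph:research"
```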
Supporting changes:
- packages/cli/src/daemon.ts:
- Import contextGraphAssertionUri, extractFromMarkdown, FileStore,
parseBoundary, parseMultipart, MultipartParseError
- New constant MAX_UPLOAD_BYTES = 50 MB for document uploads
- New interface ExtractionStatusRecord
- New readBodyBuffer() helper — Buffer variant of readBody for binary
multipart payloads
- Instantiate FileStore at {dataDir}/files and extraction-status Map at
daemon start; thread both into handleRequest via two new parameters
- Log message for missing MarkItDown updated to clarify markdown uploads
still work
- packages/cli/test/skill-endpoint.test.ts:
- Regex tolerance for CRLF line endings in the YAML frontmatter check
(/^---\r?\n/ instead of /^---\n/). Pre-existing test was Windows-hostile
because Git's core.autocrlf normalizes LF → CRLF on checkout. Linux CI
was fine; Windows was failing. Tolerant regex fixes both.
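The tolerance is easy to demonstrate — the sample documents below are illustrative:

```typescript
// CRLF-tolerant frontmatter check from the test fix above: \r?\n accepts
// both LF (Linux checkout) and CRLF (Windows checkout with core.autocrlf).
const tolerant = /^---\r?\n/;
const strict = /^---\n/;

const lfDoc = "---\ntitle: x\n---\n";
const crlfDoc = "---\r\ntitle: x\r\n---\r\n";
```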
Tests:
- All existing cli tests pass unchanged: multipart 19/19, file-store 12/12,
extraction-markdown 27/27, extraction-markitdown 8/8, document-processor-e2e
13/13 (4 expected skips), skill-endpoint 11/11, extraction-pipeline 7/7.
- Integration tests for the new route handlers land in the next commit.
CLI build clean (TypeScript).
Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
Completes Phase 3b by documenting the shipped assertion API surface in
SKILL.md and adding integration tests for the import-file orchestration.
SKILL.md updates:
- §5 Memory Model "Working Memory (WM)" section: removed the "🚧 Planned"
marker on the assertion API (create/write/query/promote/discard ship as
of PR #108; import-file and extraction-status ship in this PR). Listed
the full shipped API surface with body shapes, added the import-file and
extraction-status endpoints, and noted the sub-graph registration check
from issue #81 finding 4 so agents know to createSubGraph() before
targeting one.
- §7 File Ingestion: replaced the "🚧 Planned" section with complete
documentation of the shipped POST /api/assertion/{name}/import-file
endpoint:
  - Two-phase pipeline overview (Phase 1 converter, Phase 2 structural
  extractor) with explicit text/markdown skip-Phase-1 note
  - Request table listing all form fields (file, contextGraphId,
  contentType, ontologyRef, subGraphName)
  - End-to-end curl example
  - Response shape with all fields populated
  - Extraction status semantics (completed / skipped / failed)
  - GET /api/assertion/{name}/extraction-status usage for polling
Integration tests (packages/cli/test/import-file-integration.test.ts):
NEW 12-test suite that exercises the full Phase 1 → Phase 2 →
assertion.write orchestration without requiring a full DKGAgent (which
needs libp2p + chain). Uses real FileStore (temp dir), real
ExtractionPipelineRegistry, real extractFromMarkdown, real
parseMultipart, and a mock agent that captures assertion.create/write
calls for verification. This drives the exact call sequence the daemon
route handler does, so it covers the orchestration end-to-end.
Happy paths (5 tests):
- text/markdown upload skips Phase 1, runs Phase 2, writes triples
covering every extractor feature (rdf:type, schema:name from frontmatter
title, schema:mentions from wikilink, schema:keywords from hashtag,
Dataview status field, dkg:hasSection headings)
- text/markdown detection from the file part's Content-Type header when
no explicit contentType field is provided
- contentType text field overrides the file part Content-Type
- Registered PDF converter runs Phase 1, stores the MD intermediate via
FileStore with a separate mdIntermediateHash distinct from fileHash,
runs Phase 2 on the converter's output
- ontologyRef threaded through to the converter
- subGraphName threaded through to assertion.create and assertion.write
Graceful degrade (2 tests):
- Unregistered content type (image/png): file stored with correct magic
bytes preserved, status="skipped", pipelineUsed=null, no triples
written, no assertion.create/write called
- File part with no Content-Type header defaults to
application/octet-stream and also degrades gracefully
Extraction-status semantics (2 tests):
- startedAt and completedAt timestamps populated on success
- Multiple imports to different assertions get separate status records
keyed by assertionUri
Boundary parsing (2 tests, via parseBoundary wrapper):
- Extracts boundary from daemon-style header
- Rejects non-multipart requests
skill-endpoint.test.ts updates:
- Replaced the stale "marks planned endpoints clearly" test (which
asserted /api/assertion/create was planned — no longer true) with two
tests: one that confirms the *(planned)* marker still exists (for
context graph sub-resources and agent profile), and a new test
"documents the now-shipped assertion API surface" that verifies all 7
shipped assertion routes (create/write/query/promote/discard/
import-file/extraction-status) appear in SKILL.md.
Test results:
- multipart: 19/19 pass
- file-store: 12/12 pass
- extraction-markdown: 27/27 pass
- extraction-markitdown: 8/8 pass
- skill-endpoint: 12/12 pass (was 11; +1 new assertion-API-surface test)
- import-file-integration: 12/12 pass (NEW)
- document-processor-e2e: 13/13 pass (4 expected skips,
markitdown-unavailable)
- Total: 99/99 pass + 4 expected skips
- Full cli build clean.
Closes OriginTrail/dkgv10-spec#77 (import-file wiring),
OriginTrail/dkgv10-spec#79 gap 3 (extraction-status endpoint),
OriginTrail/dkgv10-spec#80 (ExtractionPipeline interface split — via
the ff8afe3 prep commit).
added 2 commits
April 10, 2026 19:58
# Conflicts:
#	packages/publisher/src/dkg-publisher.ts
# Conflicts:
#	packages/publisher/src/dkg-publisher.ts
#	packages/publisher/test/draft-lifecycle.test.ts
…sing
Two PR #113 review findings:
1. parseBoundary() crashed on duplicated Content-Type headers because
the parameter type didn't admit string[] and .toLowerCase() blew up at
runtime. Widen the signature to string | string[] | undefined and
reject array values as ambiguous so the route handler returns a clean
400 instead of 500-ing inside the parser.
2. The outer write-stage catch in the import-file handler only matched
has-not-been-registered / Invalid / Unsafe errors and rethrew everything
else without updating the extraction status record. That left
/extraction-status stuck reporting in_progress on unexpected
agent.write() failures even after the import had failed. Record the
failure via recordFailedExtraction(...) before rethrowing so the status
reflects reality. Mirror the same fix in the import-file orchestration
test helper, which had the same shape.
Adds two tests:
- parseBoundary returns null for array values
- import-file orchestration records failed status on unexpected
write-stage errors (e.g. "Connection refused")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
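The widened signature from finding 1 can be sketched as below. This is an illustration of the shape, not the actual multipart.ts code: the regex is an assumption, and the sketch returns null for array values (matching the "parseBoundary returns null for array values" test) where the real parser's other error paths throw MultipartParseError.

```typescript
// Node's http headers arrive as string | string[] | undefined; a
// duplicated Content-Type header is a string[], rejected as ambiguous
// so the route handler can answer with a clean 400.
function parseBoundary(contentType: string | string[] | undefined): string | null {
  if (contentType === undefined || Array.isArray(contentType)) return null;
  const match = /multipart\/form-data;.*\bboundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  return match ? (match[1] ?? match[2]) : null;
}
```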
…very
Three PR #113 round 2 review findings:
1. multipart.ts Content-Disposition parser: the `name=` parameter regex
could match the `name=` substring inside `filename=`, so a part with
only `Content-Disposition: form-data; filename="x"` would be silently
accepted as a field named `"x"` instead of being rejected as malformed.
Anchor both `name=` and `filename=` matches to a real `;` parameter
boundary (or start of string).
2. import-file route: an empty `contentType=` form field was treated as
a real override because `??` only catches null/undefined, not empty
string. A client sending `contentType=` would downgrade a valid
text/markdown / application/pdf upload to application/octet-stream and
trigger graceful-degrade. Treat blank/whitespace overrides as absent in
both the daemon route handler and the test orchestration helper.
3. /.well-known/skill.md discovery: text/markdown is hard-coded as a
supported native ingestion type by the import-file route (skip Phase 1,
run Phase 2 markdown extractor directly), but
extractionRegistry.availableContentTypes() only listed registered
Phase 1 converters. Skill clients reading the discovery surface would
think Markdown ingestion was unavailable when it was actually always
supported. Surface text/markdown alongside the registered converters in
both the skill.md endpoint and the startup log.
Adds 5 tests:
- parseMultipart rejects parts with only filename= and no name=
- parseMultipart parses filename-first ordering correctly
- parseMultipart parses name= and filename= independently
- import-file orchestration treats blank contentType= as absent
- import-file orchestration treats whitespace-only contentType= as absent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
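The blank-override fix from finding 2 can be sketched like this. The helper name and shape are illustrative, not the actual daemon.ts code — the point is only that `??` lets `""` through, while trimming first treats blank and whitespace-only overrides as absent.

```typescript
// `??` only catches null/undefined, so an empty `contentType=` form
// field used to win over the multipart part's Content-Type. Trimming
// and treating "" as absent restores the intended fallback chain.
function resolveContentType(
  override: string | undefined,
  multipartType: string | undefined,
): string {
  const trimmed = override?.trim();
  const effective = trimmed ? trimmed : undefined; // blank/whitespace -> absent
  return effective ?? multipartType ?? "application/octet-stream";
}
```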
# Conflicts:
#	packages/cli/src/daemon.ts
branarakic pushed a commit that referenced this pull request on Apr 10, 2026
Resolve daemon.ts conflicts: accept fileStore and extractionStatus parameters from PR #113, drop publisherInspector (replaced by publisherControl in this branch). Made-with: Cursor
Summary
Completes the document ingestion pipeline for V10: agents can now upload a document (PDF, DOCX, Markdown, HTML, CSV, etc.) to a Working Memory assertion, and the node runs a deterministic two-phase extraction pipeline that writes RDF triples into the assertion graph. Closes three open spec issues:
- `POST /api/assertion/:name/import-file` handler wiring
- `GET /api/assertion/:name/extraction-status` endpoint
- `ExtractionPipeline` interface split (split-interface pattern: converters return `ConverterOutput { mdIntermediate }`, the route handler is the orchestrator that assembles the composite `ExtractionOutput`)

Companion spec PR: OriginTrail/dkgv10-spec#83 — documents the same contract in `05_PROTOCOL_EXTENSIONS.md` §6.5: the split `ConverterOutput`/`ExtractionOutput` interface, the `text/markdown` skip-Phase-1 rule, and the graceful-degrade paragraph for unregistered content types. Reviewers should cross-check this code PR against the spec PR for consistency (the reframed Phase 5c cross-PR comment review).

What ships
Extraction pipeline architecture (Phase 1 + Phase 2)
Phase 1 — Converters (`ExtractionPipeline` interface). Non-Markdown source formats go through a registered converter that produces a Markdown intermediate. `MarkItDownConverter` is the built-in converter for PDF/DOCX/PPTX/XLSX/CSV/HTML/EPUB/XML when the MarkItDown binary is available. Converters return `ConverterOutput { mdIntermediate: string }` — they do NOT produce triples or provenance.

text/markdown skip-Phase-1. Uploads with `Content-Type: text/markdown` bypass Phase 1 entirely — the raw file bytes ARE the Markdown intermediate. `text/markdown` is deliberately NOT a registered converter content type (PR #108 already removed it from `MARKITDOWN_CONTENT_TYPES`). The route handler detects this and feeds the bytes straight into Phase 2.

Phase 2 — Structural extractor (`markdown-extractor.ts`). Deterministic node-side RDF extraction from Markdown per `19_MARKDOWN_CONTENT_TYPE.md`. No LLM, no external calls. Handles:
- YAML frontmatter (`id`, `type`, `title`/`name`, `description`/`summary`, `keywords`/`tags`; arbitrary keys fall into `http://schema.org/{key}`)
- `type` frontmatter key → `rdf:type` (bare identifiers namespaced to `http://schema.org/`)
- Wikilinks `[[Target]]` → `schema:mentions` (slugified to `urn:dkg:md:{slug}`)
- Hashtags `#keyword` → `schema:keywords` (excludes headings and code fences)
- Dataview inline fields `key:: value` → properties
- Heading structure → `dkg:hasSection` with per-section `schema:name` (H1 skipped as document title, H2+ become sections)
- `dkg:ExtractionProvenance` block with `dkg:extractedBy`, `dkg:extractionRule`, `dkg:extractedAt`, `dkg:derivedFrom`, and `prov:wasGeneratedBy` back-link

Route handler orchestration. `POST /api/assertion/:name/import-file` wires Phase 1 and Phase 2 together, stores the original file and MD intermediate in a content-addressed file store, writes the resulting triples + provenance to the target assertion graph via `agent.assertion.write`, and tracks the extraction job state in an in-memory map for status polling.

New endpoints
`POST /api/assertion/:name/import-file` — `multipart/form-data`
- `file` — the uploaded document bytes (required)
- `contextGraphId` — target context graph (required)
- `contentType` — optional override of the file part's `Content-Type` header
- `ontologyRef` — optional CG `_ontology` URI for guided Phase 2 extraction (threaded to both phases)
- `subGraphName` — optional target sub-graph (must be registered first via `createSubGraph`)

Response shape:
{ "assertionUri": "did:dkg:context-graph:research/assertion/0xAgentAddr/climate-report", "fileHash": "sha256:a1b2c3...", "detectedContentType": "text/markdown", "extraction": { "status": "completed", "tripleCount": 14, "pipelineUsed": "text/markdown", "mdIntermediateHash": "sha256:a1b2c3..." } }extraction.status—"completed"|"skipped"|"failed"extraction.pipelineUsed—"text/markdown"for MD uploads, the content type of the registered converter otherwise, ornullfor the skipped caseextraction.mdIntermediateHash— present only when a converter ran Phase 1 (omitted for text/markdown which doesn't produce a separate intermediate)Graceful degrade for unregistered content types: if the detected content type has no registered converter and isn't
text/markdown, the route handler stores the file blob, returnsextraction.status = "skipped"withtripleCount: 0andpipelineUsed: null, and writes NO triples. The file remains retrievable byfileHashfor manual extraction later. This is the spec-mandated behavior from05_PROTOCOL_EXTENSIONS.md§6.5.GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...Returns the current extraction record from the in-memory status tracker. Synchronous extractions (the V10.0 default) populate this on the same
import-fileresponse; this endpoint lets agents re-query later without holding the original response and provides the hook for async extraction in V10.x. Returns 404 if noimport-filehas been run for the assertion.New infrastructure
`FileStore` (`packages/cli/src/file-store.ts`, 103 lines). Content-addressed disk store under `{dataDir}/files/`, sha256-keyed with a two-level sharded directory layout (ab/cdef...). Idempotent `put()` — same bytes always yield the same hash. `get()` and `has()` accept both `sha256:`-prefixed and bare hex forms.
packages/cli/src/http/multipart.ts, 150 lines). Minimal RFC-7578multipart/form-dataparser. Zero new dependencies. Handles the subset needed: one file part with filename + content-type, plus any number of text parts.parseBoundary()extracts the boundary token fromContent-Type: multipart/form-data; boundary=.... ThrowsMultipartParseErroron malformed input so the route can return a clean 400.readBodyBuffer()helper.Buffervariant of the existingreadBody()helper for binary payloads where.toString()would corrupt content. Used by the import-file route for multipart bodies.SKILL.md updates
- §5: the shipped assertion routes (`create`/`write`/`query`/`promote`/`discard`) plus the 2 new routes (`import-file`/`extraction-status`) are now documented with full body shapes. Added a note about the sub-graph registration check error message.
- §7: full documentation of the shipped `import-file` endpoint — two-phase pipeline overview, request field table, end-to-end curl example, response shape, extraction status semantics, and `extraction-status` polling usage.

Test plan
- 12 new integration tests: happy paths, graceful degrade (`image/png`, no Content-Type header defaults to `application/octet-stream`), 2 extraction-status semantics (timestamps, separate records per assertion), 2 boundary parsing tests. Uses real `FileStore` (temp dir), real `ExtractionPipelineRegistry`, real `extractFromMarkdown`, real `parseMultipart`, with a mock agent that captures `assertion.create`/`write` call arguments for verification. Covers the full route handler orchestration end-to-end without needing a full `DKGAgent`.
- `pnpm run build:runtime` across all 12 runtime packages: clean (TypeScript)

Total: 99 new + updated tests in Phase 3b, all passing on Windows.
Reviewer guidance — Linux CI is the gating signal
Same caveat as PR #112: the full `@origintrail-official/dkg-agent` suite has 9 pre-existing failures on Windows due to `spawn npx ENOENT` (the hardhat bootstrap can't find `npx` in the subprocess PATH) and libp2p timing issues. None of these failures are caused by this PR's changes — this PR modifies `packages/cli/` (the daemon, extraction pipeline, and tests) and `packages/core/src/extraction-pipeline.ts` + index (the interface split from the prep commit). The agent suite's failing tests are all in `packages/agent/test/*` and do not touch anything I modified. Please rely on the GitHub Actions Linux runner for the merge gate, not local Windows runs.

Commit structure (for review)
4 commits on this branch, each independently buildable:
- ff8afe3 — chore: prep for import-file wiring — interface split + markdown extractor. Name-agnostic refactor: splits the `ExtractionPipeline` return into `ConverterOutput { mdIntermediate }` while keeping `ExtractionOutput` as the composite type, adds the 331-line Phase 2 markdown structural extractor (27 unit tests), updates the `MarkItDownConverter.extract()` return type. No new HTTP routes, no behavior changes.
- d5b3755 — feat(cli): file store + multipart parser for import-file wiring. Infrastructure for the import-file route: content-addressed file store and minimal multipart parser. 31 unit tests. Zero new dependencies.
- add808b — feat(cli): wire POST /api/assertion/:name/import-file + extraction-status. The actual route handlers, wired into `daemon.ts`. `MAX_UPLOAD_BYTES` = 50 MB. `readBodyBuffer()` helper. `ExtractionStatusRecord` type. Graceful degrade for unregistered content types. Pre-existing `skill-endpoint.test.ts` YAML frontmatter regex gets `\r?\n` tolerance (was Windows-hostile due to Git `core.autocrlf`).
- d9f3221 — docs(cli): SKILL.md import-file workflow + integration tests. SKILL.md §5 removes "Planned" markers on shipped assertion routes. §7 rewritten with full import-file documentation. 12 new integration tests. Updated skill-endpoint tests to verify the shipped API surface.

Squash-merge is fine if preferred — the commits are logical groupings for review, not required history.
What this PR does NOT change
- PR #112 (fix: sub-graph polish, issue #81 findings 1, 3, 4, 5, 7) on the same `v10-rc` base. There is no file overlap — PR #112 touches `packages/publisher/src/dkg-publisher.ts`, `packages/agent/src/dkg-agent.ts`, `packages/storage/src/graph-manager.ts`, and `packages/publisher/test/draft-lifecycle.test.ts`, none of which are in this PR. Both PRs can merge in any order with no conflicts.
- MarkItDown binary distribution: extraction returns `status: "skipped"` when the binary is not available, and this PR ships with `isMarkItDownAvailable()` detection in `daemon.ts` that logs a clarification message at startup. Binary distribution is a separate toolchain workstream.
- `skillUrl` in register response (#79 gap 2) is explicitly deferred. The `/api/agent/register` endpoint itself is still marked Planned in SKILL.md and does not exist yet. When the register endpoint lands, it should include `skillUrl` per the original issue #79 ("Bug: lookupByUAL returns OK on internal errors, masking failures") design.

Closes OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
🤖 Generated with Claude Code