feat: wire import-file endpoint and Phase 2 markdown extraction (#77, #79, #80) #113

Merged
Jurij89 merged 15 commits into v10-rc from feat/assertion-import-file-wiring
Apr 10, 2026

Conversation

Jurij89 commented Apr 10, 2026

Summary

Completes the document ingestion pipeline for V10: agents can now upload a document (PDF, DOCX, Markdown, HTML, CSV, etc.) to a Working Memory assertion, and the node runs a deterministic two-phase extraction pipeline that writes RDF triples into the assertion graph. Closes three open spec issues:

  • #77 — POST /api/assertion/:name/import-file handler wiring
  • #79 gap 3 — GET /api/assertion/:name/extraction-status endpoint
  • #80 — ExtractionPipeline interface split (split-interface pattern: converters return ConverterOutput { mdIntermediate }; the route handler is the orchestrator that assembles the composite ExtractionOutput)

Companion spec PR: OriginTrail/dkgv10-spec#83 — documents the same contract in 05_PROTOCOL_EXTENSIONS.md §6.5: the split ConverterOutput/ExtractionOutput interface, the text/markdown skip-Phase-1 rule, and the graceful-degrade paragraph for unregistered content types. Reviewers should cross-check this code PR against the spec PR for consistency (the reframed Phase 5c cross-PR comment review).

What ships

Extraction pipeline architecture (Phase 1 + Phase 2)

Phase 1 — Converters (ExtractionPipeline interface). Non-Markdown source formats go through a registered converter that produces a Markdown intermediate. MarkItDownConverter is the built-in converter for PDF/DOCX/PPTX/XLSX/CSV/HTML/EPUB/XML when the MarkItDown binary is available. Converters return ConverterOutput { mdIntermediate: string } — they do NOT produce triples or provenance.
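The split can be sketched in TypeScript. The type names ConverterOutput and ExtractionOutput come from this PR; the Triple shape is an assumption for illustration, not the real core types:

```typescript
// Sketch of the split-interface pattern described above. ConverterOutput and
// ExtractionOutput are the PR's names; the Triple shape is an assumption.
interface Triple { subject: string; predicate: string; object: string }

// Phase 1 converters return ONLY the Markdown intermediate — no triples,
// no provenance.
interface ConverterOutput { mdIntermediate: string }

// The route handler (orchestrator) assembles the composite result after
// running Phase 2 on the intermediate.
interface ExtractionOutput {
  mdIntermediate: string;
  triples: Triple[];
  provenance: Triple[];
}

const converted: ConverterOutput = { mdIntermediate: "# Report\n\nBody text." };
const assembled: ExtractionOutput = { ...converted, triples: [], provenance: [] };
```

The point of the split is that a converter can never smuggle triples past the orchestrator: only the route handler decides what gets written to the assertion graph.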

text/markdown skip-Phase-1. Uploads with Content-Type: text/markdown bypass Phase 1 entirely — the raw file bytes ARE the Markdown intermediate. text/markdown is deliberately NOT a registered converter content type (PR #108 already removed it from MARKITDOWN_CONTENT_TYPES). The route handler detects this and feeds the bytes straight into Phase 2.

Phase 2 — Structural extractor (markdown-extractor.ts). Deterministic node-side RDF extraction from Markdown per 19_MARKDOWN_CONTENT_TYPE.md. No LLM, no external calls. Handles:

  • YAML frontmatter → subject properties (special keys: id, type, title/name, description/summary, keywords/tags; arbitrary keys fall into http://schema.org/{key})
  • type frontmatter key → rdf:type (bare identifiers namespaced to http://schema.org/)
  • Wikilinks [[Target]] → schema:mentions (slugified to urn:dkg:md:{slug})
  • Hashtags #keyword → schema:keywords (excludes headings and code fences)
  • Dataview inline fields key:: value → properties
  • Heading hierarchy → dkg:hasSection with per-section schema:name (H1 skipped as document title, H2+ become sections)
  • Every extraction run emits a dkg:ExtractionProvenance block with dkg:extractedBy, dkg:extractionRule, dkg:extractedAt, dkg:derivedFrom, and prov:wasGeneratedBy back-link
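As a flavor of the deterministic rules above, here is a minimal sketch of the wikilink → schema:mentions rule with slug dedup. Function names and the exact slugify behavior are illustrative assumptions, not the actual markdown-extractor.ts API:

```typescript
// Illustrative sketch of the Phase 2 wikilink rule: [[Target]] becomes a
// schema:mentions triple pointing at urn:dkg:md:{slug}, deduplicated by slug.
interface Triple { subject: string; predicate: string; object: string }

// Assumed slugification: lowercase, non-alphanumerics collapsed to "-".
function slugify(target: string): string {
  return target.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-|-$/g, "");
}

function extractWikilinkMentions(markdown: string, documentIri: string): Triple[] {
  const seen = new Set<string>();
  const triples: Triple[] = [];
  for (const match of markdown.matchAll(/\[\[([^\]]+)\]\]/g)) {
    const slug = slugify(match[1]);
    if (seen.has(slug)) continue; // dedup, as exercised by the extractor unit tests
    seen.add(slug);
    triples.push({
      subject: documentIri,
      predicate: "http://schema.org/mentions",
      object: `urn:dkg:md:${slug}`,
    });
  }
  return triples;
}
```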

Route handler orchestration. POST /api/assertion/:name/import-file wires Phase 1 and Phase 2 together, stores the original file and MD intermediate in a content-addressed file store, writes the resulting triples + provenance to the target assertion graph via agent.assertion.write, and tracks the extraction job state in an in-memory map for status polling.

New endpoints

POST /api/assertion/:name/import-file — multipart/form-data

| Field | Required | Description |
|---|---|---|
| file | yes | The uploaded document bytes |
| contextGraphId | yes | Target context graph |
| contentType | no | Override the file part's Content-Type header |
| ontologyRef | no | CG _ontology URI for guided Phase 2 extraction (threaded to both phases) |
| subGraphName | no | Target sub-graph inside the CG (must be registered via createSubGraph) |
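The request body is plain RFC 7578 multipart. A minimal sketch of hand-building such a body for the fields above (boundary value, assertion name, and file content are illustrative):

```typescript
// Build a multipart/form-data body by hand — one file part plus text parts,
// matching the subset the endpoint's parser handles.
const boundary = "----dkg-import-demo";

function part(name: string, value: string, filename?: string, contentType?: string): string {
  const disposition = filename
    ? `Content-Disposition: form-data; name="${name}"; filename="${filename}"\r\nContent-Type: ${contentType}\r\n`
    : `Content-Disposition: form-data; name="${name}"\r\n`;
  return `--${boundary}\r\n${disposition}\r\n${value}\r\n`;
}

const body =
  part("file", "# Climate Report\n\nSee [[Methodology]].", "climate-report.md", "text/markdown") +
  part("contextGraphId", "research") +
  `--${boundary}--\r\n`; // closing delimiter

// Then POST it, e.g.:
// fetch("http://localhost:PORT/api/assertion/climate-report/import-file", {
//   method: "POST",
//   headers: { "Content-Type": `multipart/form-data; boundary=${boundary}` },
//   body,
// });
```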

Response shape:

```json
{
  "assertionUri": "did:dkg:context-graph:research/assertion/0xAgentAddr/climate-report",
  "fileHash": "sha256:a1b2c3...",
  "detectedContentType": "text/markdown",
  "extraction": {
    "status": "completed",
    "tripleCount": 14,
    "pipelineUsed": "text/markdown",
    "mdIntermediateHash": "sha256:a1b2c3..."
  }
}
```
  • extraction.status — "completed" | "skipped" | "failed"
  • extraction.pipelineUsed — "text/markdown" for MD uploads, the content type of the registered converter otherwise, or null for the skipped case
  • extraction.mdIntermediateHash — present only when a converter ran Phase 1 (omitted for text/markdown, which doesn't produce a separate intermediate)

Graceful degrade for unregistered content types: if the detected content type has no registered converter and isn't text/markdown, the route handler stores the file blob, returns extraction.status = "skipped" with tripleCount: 0 and pipelineUsed: null, and writes NO triples. The file remains retrievable by fileHash for manual extraction later. This is the spec-mandated behavior from 05_PROTOCOL_EXTENSIONS.md §6.5.
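The three-way dispatch (skip-Phase-1, registered converter, graceful degrade) can be sketched as a small pure function. The names resolvePipeline and registered are illustrative, not the daemon's actual code:

```typescript
// Sketch of the content-type dispatch described above.
type Pipeline =
  | { kind: "markdown" }                        // text/markdown: skip Phase 1
  | { kind: "converter"; contentType: string }  // registered Phase 1 converter
  | { kind: "skipped" };                        // graceful degrade: store blob only

function resolvePipeline(contentType: string, registered: Set<string>): Pipeline {
  if (contentType === "text/markdown") return { kind: "markdown" };
  if (registered.has(contentType)) return { kind: "converter", contentType };
  // No converter and not Markdown: keep the file retrievable by fileHash,
  // write NO triples, report status = "skipped" with pipelineUsed = null.
  return { kind: "skipped" };
}
```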

GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...

Returns the current extraction record from the in-memory status tracker. Synchronous extractions (the V10.0 default) populate this on the same import-file response; this endpoint lets agents re-query later without holding the original response and provides the hook for async extraction in V10.x. Returns 404 if no import-file has been run for the assertion.
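A plausible shape for the in-memory record behind this endpoint, pieced together from the response shape and statuses above. Field names beyond status/tripleCount/pipelineUsed and the timestamp fields mentioned in the test plan are assumptions:

```typescript
// Hypothetical ExtractionStatusRecord shape — illustrative, not the exact
// daemon.ts interface.
type ExtractionStatus = "in_progress" | "completed" | "skipped" | "failed";

interface ExtractionStatusRecord {
  assertionUri: string;
  status: ExtractionStatus;
  tripleCount: number;
  pipelineUsed: string | null;
  startedAt: string;    // ISO timestamp
  completedAt?: string; // populated on success
  error?: string;       // populated on failure
}

// Records are kept in a plain Map keyed by assertionUri, as described above.
const statusByAssertion = new Map<string, ExtractionStatusRecord>();
statusByAssertion.set("urn:example:doc", {
  assertionUri: "urn:example:doc",
  status: "completed",
  tripleCount: 14,
  pipelineUsed: "text/markdown",
  startedAt: new Date(0).toISOString(),
  completedAt: new Date(1000).toISOString(),
});
```

The 404-on-unknown-assertion behavior falls out naturally: a Map miss means no import-file has ever run for that assertion.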

New infrastructure

FileStore (packages/cli/src/file-store.ts, 103 lines). Content-addressed disk store under {dataDir}/files/, sha256-keyed with a two-level sharded directory layout (ab/cdef...). Idempotent put() — same bytes always yield the same hash. get() and has() accept both sha256:-prefixed and bare hex forms.
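The hashing and sharding scheme can be sketched in a few lines. hashToPath is an illustrative name, not the actual file-store.ts API:

```typescript
import { createHash } from "node:crypto";
import { join } from "node:path";

// sha256 over the raw bytes — same bytes always yield the same key,
// which is what makes put() idempotent.
function sha256Hex(bytes: Buffer): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Two-level sharded layout: {dataDir}/files/ab/cdef... Accepts both
// "sha256:"-prefixed and bare-hex forms, mirroring get()/has().
function hashToPath(dataDir: string, hash: string): string {
  const hex = hash.startsWith("sha256:") ? hash.slice(7) : hash;
  return join(dataDir, "files", hex.slice(0, 2), hex.slice(2));
}
```

Sharding on the first two hex characters keeps any single directory to at most 256 entries' worth of fan-out, a common trick for content-addressed stores.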

Multipart parser (packages/cli/src/http/multipart.ts, 150 lines). Minimal RFC-7578 multipart/form-data parser. Zero new dependencies. Handles the subset needed: one file part with filename + content-type, plus any number of text parts. parseBoundary() extracts the boundary token from Content-Type: multipart/form-data; boundary=.... Throws MultipartParseError on malformed input so the route can return a clean 400.
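A sketch of the boundary extraction, including the string[] rejection added in the later review-fix commit (duplicated Content-Type headers are treated as ambiguous). The regex is illustrative, not the actual multipart.ts implementation:

```typescript
// Extract the boundary token from a Content-Type header like
// "multipart/form-data; boundary=----abc" (quoted or bare forms).
function parseBoundary(contentType: string | string[] | undefined): string | null {
  // Node can surface duplicated headers as string[]; reject as ambiguous so
  // the route can return a clean 400 instead of crashing in the parser.
  if (typeof contentType !== "string") return null;
  const m = /^multipart\/form-data\s*;.*boundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  return m ? (m[1] ?? m[2]) : null;
}
```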

readBodyBuffer() helper. Buffer variant of the existing readBody() helper for binary payloads where .toString() would corrupt content. Used by the import-file route for multipart bodies.

SKILL.md updates

  • §5 Memory Model — Working Memory (WM): removed the "🚧 Planned" marker on the assertion API. The 5 assertion routes that shipped in PR #108 (create/write/query/promote/discard) plus the 2 new routes (import-file/extraction-status) are now documented with full body shapes. Added a note about the sub-graph registration check error message.
  • §7 File Ingestion: replaced the "🚧 Planned" section with full documentation of the shipped import-file endpoint — two-phase pipeline overview, request field table, end-to-end curl example, response shape, extraction status semantics, and extraction-status polling usage.

Test plan

  • Multipart parser unit tests: 19/19 pass
  • File store unit tests: 12/12 pass
  • Markdown extractor unit tests: 27/27 pass (frontmatter special keys, array values, wikilinks with dedup, hashtags with heading and code-fence exclusion, Dataview with code-fence exclusion, heading hierarchy, 5 subject-IRI resolution modes, provenance, full end-to-end document)
  • Core extraction-pipeline interface tests: 7/7 pass
  • MarkItDown converter tests: 8/8 pass
  • Import-file integration tests (NEW): 12/12 pass — 5 happy paths (text/markdown with full feature coverage, Content-Type detection, contentType override, registered PDF converter path with MD intermediate storage, ontologyRef threading, subGraphName threading), 2 graceful-degrade paths (unregistered image/png, no Content-Type header defaults to application/octet-stream), 2 extraction-status semantics (timestamps, separate records per assertion), 2 boundary parsing tests. Uses real FileStore (temp dir), real ExtractionPipelineRegistry, real extractFromMarkdown, real parseMultipart, with a mock agent that captures assertion.create/write call arguments for verification. Covers the full route handler orchestration end-to-end without needing a full DKGAgent.
  • Document processor e2e tests: 13/13 pass (4 expected skips on Windows — MarkItDown binary unavailable)
  • SKILL.md endpoint tests: 12/12 pass (was 11; +1 new test verifying all 7 shipped assertion routes appear in the doc)
  • Publisher test suite sanity check: 608/608 pass (no regressions from the daemon.ts changes)
  • Full pnpm run build:runtime across all 12 runtime packages: clean (TypeScript)
  • Linux CI — the definitive green signal, see caveat below

Total: 99 new + updated tests in Phase 3b, all passing on Windows.

Reviewer guidance — Linux CI is the gating signal

Same caveat as PR #112: the full @origintrail-official/dkg-agent suite has 9 pre-existing failures on Windows due to spawn npx ENOENT (the hardhat bootstrap can't find npx in the subprocess PATH) and libp2p timing issues. None of these failures are caused by this PR's changes — this PR modifies packages/cli/ (the daemon, extraction pipeline, and tests) and packages/core/src/extraction-pipeline.ts + index (the interface split from the prep commit). The agent suite's failing tests are all in packages/agent/test/* and do not touch anything I modified. Please rely on the GitHub Actions Linux runner for the merge gate, not local Windows runs.

Commit structure (for review)

4 commits on this branch, each independently buildable:

  1. ff8afe3 — chore: prep for import-file wiring — interface split + markdown extractor. Name-agnostic refactor: splits ExtractionPipeline return into ConverterOutput { mdIntermediate } while keeping ExtractionOutput as the composite type, adds the 331-line Phase 2 markdown structural extractor (27 unit tests), updates MarkItDownConverter.extract() return type. No new HTTP routes, no behavior changes.

  2. d5b3755 — feat(cli): file store + multipart parser for import-file wiring. Infrastructure for the import-file route: content-addressed file store and minimal multipart parser. 31 unit tests. Zero new dependencies.

  3. add808b — feat(cli): wire POST /api/assertion/:name/import-file + extraction-status. The actual route handlers, wired into daemon.ts. MAX_UPLOAD_BYTES = 50 MB. readBodyBuffer() helper. ExtractionStatusRecord type. Graceful degrade for unregistered content types. Pre-existing skill-endpoint.test.ts YAML frontmatter regex gets \r?\n tolerance (was Windows-hostile due to Git core.autocrlf).

  4. d9f3221 — docs(cli): SKILL.md import-file workflow + integration tests. SKILL.md §5 removes "Planned" markers on shipped assertion routes. §7 rewritten with full import-file documentation. 12 new integration tests. Updated skill-endpoint tests to verify the shipped API surface.

Squash-merge is fine if preferred — the commits are logical groupings for review, not required history.

What this PR does NOT change

  • Sub-graph polish (PR #112) is a separate PR targeting the same v10-rc base. There is no file overlap — PR #112 touches packages/publisher/src/dkg-publisher.ts, packages/agent/src/dkg-agent.ts, packages/storage/src/graph-manager.ts, and packages/publisher/test/draft-lifecycle.test.ts, none of which are in this PR. Both PRs can merge in any order with no conflicts.
  • MarkItDown binary distribution (#76) is explicitly out of scope. The route handler gracefully degrades to status: "skipped" when the binary is not available, and this PR ships with isMarkItDownAvailable() detection in daemon.ts that logs a clarification message at startup. Binary distribution is a separate toolchain workstream.
  • skillUrl in register response (#79 gap 2) is explicitly deferred. The /api/agent/register endpoint itself is still marked Planned in SKILL.md and does not exist yet. When the register endpoint lands, it should include skillUrl per the original issue #79 design.

Closes OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.

🤖 Generated with Claude Code

claude added 4 commits April 10, 2026 17:46
chore: prep for import-file wiring — interface split + markdown extractor

Phase 3b prep commit. Adds the name-agnostic extraction pipeline
restructuring that the import-file route handler will orchestrate in
the next commit(s):

- packages/core/src/extraction-pipeline.ts: split interface. Converter
  returns { mdIntermediate: string } only via ConverterOutput.
  ExtractionOutput { mdIntermediate, triples, provenance } remains as
  the composite type assembled by the orchestrator (route handler).
- packages/core/src/index.ts: export ConverterOutput.
- packages/cli/src/extraction/markitdown-converter.ts: return type
  updated to ConverterOutput (no behavior change, same binary invocation).
- packages/cli/src/extraction/markdown-extractor.ts: NEW Phase 2
  structural extractor (~331 lines) implementing deterministic node-side
  extraction from Markdown. Handles YAML frontmatter, wikilinks, tags,
  Dataview inline fields, heading structure. No LLM, no external deps.
- packages/cli/src/extraction/index.ts: exports the new extractor.
- packages/cli/test/extraction-markdown.test.ts: NEW 27 unit tests
  covering structural extraction cases. All pass.
- packages/core/test/extraction-pipeline.test.ts: updated for split
  interface. 7/7 pass.
- packages/cli/test/document-processor-e2e.test.ts: updated for split
  interface.
- packages/cli/test/extraction-markitdown.test.ts: updated for split
  interface.

Next commit wires POST /api/assertion/:name/import-file to orchestrate
Phase 1 (converter) + Phase 2 (markdown extractor) and write triples
to the target assertion. Prep commit ships no new HTTP routes — the
existing import-file endpoint in daemon.ts is unchanged until Phase 3b
completes wiring.

Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
Infrastructure commit for Phase 3b document ingestion. Adds two
building blocks the import-file route handler will consume in the
next commit:

- packages/cli/src/file-store.ts: content-addressed disk store for
  uploaded files and markdown intermediates. sha256-keyed with a
  two-level sharded directory layout (ab/cdef...). put/get/has APIs
  return `sha256:<hex>` prefixed hashes which the route handler
  surfaces as fileHash and mdIntermediateHash in ImportFileResponse.
  Idempotent: re-putting the same bytes yields the same hash and
  overwrites with identical content.

- packages/cli/src/http/multipart.ts: minimal RFC-7578 multipart/
  form-data parser. Handles the exact subset the import-file
  endpoint needs: one file part with filename + content-type
  headers, plus any number of text parts. No nested multipart, no
  base64 transfer-encoding, no streaming (parses a buffered Buffer).
  Zero new npm dependencies. Throws MultipartParseError on malformed
  input so the caller can return a clean 400.

Tests:
- packages/cli/test/file-store.test.ts: 12 unit tests covering put/
  get/has/hashToPath, idempotency, binary content, empty input,
  malformed-hash handling, bare-hex vs sha256:-prefixed forms.
- packages/cli/test/multipart.test.ts: 19 unit tests covering
  parseBoundary (standard, quoted, case-insensitive, missing), and
  parseMultipart (text fields, file fields, mixed bodies, binary
  content with 0x00/0xff bytes, malformed input error paths).

All 31/31 tests pass. CLI build clean.

No route handler changes yet — the next commit wires
POST /api/assertion/:name/import-file to use these primitives.

Part of OriginTrail/dkgv10-spec#77 and #80.
feat(cli): wire POST /api/assertion/:name/import-file + extraction-status

Implements the import-file document ingestion endpoint and its
companion extraction-status polling endpoint on the daemon. Wires
Phase 1 (converter) → Phase 2 (markdown structural extractor) → write
triples to the assertion graph, matching the orchestration described
in 05_PROTOCOL_EXTENSIONS.md §6.5.

New endpoints:

- POST /api/assertion/:name/import-file (multipart/form-data)
  Fields:
    file (required) — the uploaded document bytes
    contextGraphId (required) — target context graph
    contentType (optional) — override the file part's Content-Type
    ontologyRef (optional) — CG _ontology URI for Phase 2 guided extraction
    subGraphName (optional) — target sub-graph inside the CG
  Orchestration:
    1. Parse multipart body, store original file in FileStore → fileHash
    2. Resolve detectedContentType (explicit field > multipart Content-Type)
    3. Phase 1:
       - text/markdown → skip converter, use raw bytes as mdIntermediate
       - registered converter → run converter.extract(), store MD result
         in FileStore → mdIntermediateHash
       - no registered converter → graceful degrade: return status="skipped",
         no triples written, file blob retained for later manual extraction
    4. Phase 2 → extractFromMarkdown({ markdown, agentDid, ontologyRef,
       documentIri: assertionUri }) → triples + provenance
    5. Ensure assertion graph exists (idempotent), write triples + provenance
       via agent.assertion.write
    6. Record in in-memory ExtractionStatusRecord map, return ImportFileResponse
  Error paths return typed extraction.status = "failed" with the error message.
  Sub-graph registration errors propagate from assertionCreate/Write (finding
  4 of issue #81).

- GET /api/assertion/:name/extraction-status?contextGraphId=...&subGraphName=...
  Returns the current extraction job state for an assertion by looking up the
  in-memory record. Synchronous extractions populate this on the import-file
  response; this endpoint lets agents re-query without holding the original
  response and provides the hook for async extraction workflows in V10.x.

Supporting changes:

- packages/cli/src/daemon.ts:
  - Import contextGraphAssertionUri, extractFromMarkdown, FileStore,
    parseBoundary, parseMultipart, MultipartParseError
  - New constant MAX_UPLOAD_BYTES = 50 MB for document uploads
  - New interface ExtractionStatusRecord
  - New readBodyBuffer() helper — Buffer variant of readBody for binary
    multipart payloads
  - Instantiate FileStore at {dataDir}/files and extraction-status Map at
    daemon start; thread both into handleRequest via two new parameters
  - Log message for missing MarkItDown updated to clarify markdown uploads
    still work

- packages/cli/test/skill-endpoint.test.ts:
  - Regex tolerance for CRLF line endings in the YAML frontmatter check
    (/^---\r?\n/ instead of /^---\n/). Pre-existing test was Windows-hostile
    because Git's core.autocrlf normalizes LF → CRLF on checkout. Linux CI
    was fine; Windows was failing. Tolerant regex fixes both.

Tests:
- All existing cli tests pass unchanged: multipart 19/19, file-store 12/12,
  extraction-markdown 27/27, extraction-markitdown 8/8, document-processor-e2e
  13/13 (4 expected skips), skill-endpoint 11/11, extraction-pipeline 7/7.
- Integration tests for the new route handlers land in the next commit.

CLI build clean (TypeScript).

Part of OriginTrail/dkgv10-spec#77, #79 gap 3, and #80.
Completes Phase 3b by documenting the shipped assertion API surface
in SKILL.md and adding integration tests for the import-file
orchestration.

SKILL.md updates:

- §5 Memory Model "Working Memory (WM)" section: removed the
  "🚧 Planned" marker on the assertion API (create/write/query/promote/
  discard ship as of PR #108; import-file and extraction-status ship in
  this PR). Listed the full shipped API surface with body shapes, added
  the import-file and extraction-status endpoints, and noted the
  sub-graph registration check from issue #81 finding 4 so agents know
  to createSubGraph() before targeting one.

- §7 File Ingestion: replaced the "🚧 Planned" section with complete
  documentation of the shipped POST /api/assertion/{name}/import-file
  endpoint:
  - Two-phase pipeline overview (Phase 1 converter, Phase 2 structural
    extractor) with explicit text/markdown skip-Phase-1 note
  - Request table listing all form fields (file, contextGraphId,
    contentType, ontologyRef, subGraphName)
  - End-to-end curl example
  - Response shape with all fields populated
  - Extraction status semantics (completed / skipped / failed)
  - GET /api/assertion/{name}/extraction-status usage for polling

Integration tests (packages/cli/test/import-file-integration.test.ts):

NEW 12-test suite that exercises the full Phase 1 → Phase 2 →
assertion.write orchestration without requiring a full DKGAgent
(which needs libp2p + chain). Uses real FileStore (temp dir), real
ExtractionPipelineRegistry, real extractFromMarkdown, real parseMultipart,
and a mock agent that captures assertion.create/write calls for
verification. This drives the exact call sequence the daemon route
handler does, so it covers the orchestration end-to-end.

Happy paths (5 tests):
- text/markdown upload skips Phase 1, runs Phase 2, writes triples
  covering every extractor feature (rdf:type, schema:name from
  frontmatter title, schema:mentions from wikilink, schema:keywords
  from hashtag, Dataview status field, dkg:hasSection headings)
- text/markdown detection from filePart Content-Type header when no
  explicit contentType field is provided
- contentType text field overrides the file part Content-Type
- Registered PDF converter runs Phase 1, stores MD intermediate via
  FileStore with a separate mdIntermediateHash distinct from fileHash,
  runs Phase 2 on the converter's output
- ontologyRef threaded through to the converter
- subGraphName threaded through to assertion.create and assertion.write

Graceful degrade (2 tests):
- Unregistered content type (image/png): file stored with correct magic
  bytes preserved, status="skipped", pipelineUsed=null, no triples
  written, no assertion.create/write called
- File part with no Content-Type header defaults to
  application/octet-stream and also degrades gracefully

Extraction-status semantics (2 tests):
- startedAt and completedAt timestamps populated on success
- Multiple imports to different assertions get separate status records
  keyed by assertionUri

Boundary parsing (2 tests, via parseBoundary wrapper):
- Extracts boundary from daemon-style header
- Rejects non-multipart requests

skill-endpoint.test.ts updates:
- Replaced the stale "marks planned endpoints clearly" test
  (which asserted /api/assertion/create was planned — no longer true)
  with two tests: one that confirms the *(planned)* marker still exists
  (for context graph sub-resources and agent profile), and a new test
  "documents the now-shipped assertion API surface" that verifies all
  7 shipped assertion routes (create/write/query/promote/discard/
  import-file/extraction-status) appear in SKILL.md.

Test results:
- multipart: 19/19 pass
- file-store: 12/12 pass
- extraction-markdown: 27/27 pass
- extraction-markitdown: 8/8 pass
- skill-endpoint: 12/12 pass (was 11; +1 new assertion-API-surface test)
- import-file-integration: 12/12 pass (NEW)
- document-processor-e2e: 13/13 pass (4 expected skips, markitdown-unavailable)
- Total: 99/99 pass + 4 expected skips
- Full cli build clean.

Closes OriginTrail/dkgv10-spec#77 (import-file wiring),
OriginTrail/dkgv10-spec#79 gap 3 (extraction-status endpoint),
OriginTrail/dkgv10-spec#80 (ExtractionPipeline interface split — via
the ff8afe3 prep commit).
Jurij Skornik added 2 commits April 10, 2026 19:58
# Conflicts:
#	packages/publisher/src/dkg-publisher.ts
#	packages/publisher/test/draft-lifecycle.test.ts
…sing

Two PR #113 review findings:

1. parseBoundary() crashed on duplicated Content-Type headers because the
   parameter type didn't admit string[] and .toLowerCase() blew up at runtime.
   Widen the signature to string | string[] | undefined and reject array
   values as ambiguous so the route handler returns a clean 400 instead of
   500-ing inside the parser.

2. The outer write-stage catch in the import-file handler only matched
   has-not-been-registered / Invalid / Unsafe errors and rethrew everything
   else without updating the extraction status record. That left
   /extraction-status stuck reporting in_progress on unexpected agent.write()
   failures even after the import had failed. Record the failure via
   recordFailedExtraction(...) before rethrowing so the status reflects
   reality. Mirror the same fix in the import-file orchestration test
   helper, which had the same shape.

Adds two tests:
- parseBoundary returns null for array values
- import-file orchestration records failed status on unexpected
  write-stage errors (e.g. "Connection refused")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jurij Skornik and others added 2 commits April 11, 2026 00:40
…very

Three PR #113 round 2 review findings:

1. multipart.ts Content-Disposition parser: the `name=` parameter regex
   could match the `name=` substring inside `filename=`, so a part with
   only `Content-Disposition: form-data; filename="x"` would be silently
   accepted as a field named `"x"` instead of being rejected as malformed.
   Anchor both `name=` and `filename=` matches to a real `;` parameter
   boundary (or start of string).

2. import-file route: an empty `contentType=` form field was treated as a
   real override because `??` only catches null/undefined, not empty
   string. A client sending `contentType=` would downgrade a valid
   text/markdown / application/pdf upload to application/octet-stream and
   trigger graceful-degrade. Treat blank/whitespace overrides as absent
   in both the daemon route handler and the test orchestration helper.

3. /.well-known/skill.md discovery: text/markdown is hard-coded as a
   supported native ingestion type by the import-file route (skip
   Phase 1, run Phase 2 markdown extractor directly), but
   extractionRegistry.availableContentTypes() only listed registered
   Phase 1 converters. Skill clients reading the discovery surface
   would think Markdown ingestion was unavailable when it was actually
   always supported. Surface text/markdown alongside the registered
   converters in both the skill.md endpoint and the startup log.

Adds 5 tests:
- parseMultipart rejects parts with only filename= and no name=
- parseMultipart parses filename-first ordering correctly
- parseMultipart parses name= and filename= independently
- import-file orchestration treats blank contentType= as absent
- import-file orchestration treats whitespace-only contentType= as absent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	packages/cli/src/daemon.ts
Jurij89 merged commit a1a8a21 into v10-rc Apr 10, 2026
2 of 6 checks passed
branarakic pushed a commit that referenced this pull request Apr 10, 2026
Resolve daemon.ts conflicts: accept fileStore and extractionStatus
parameters from PR #113, drop publisherInspector (replaced by
publisherControl in this branch).

Made-with: Cursor