feat: source-file linkage triples + devnet hardening (chain onto PR #120) #121
cac9544 to a90d433
a90d433 to 9a6772b
9a6772b to aecb6b8
agentDid: `did:dkg:agent:${agent.peerId}`,
agentDid,
ontologyRef,
documentIri: assertionUri,
🔴 Bug: pinning documentIri to assertionUri means the new rootEntity override only changes rows 3/14 metadata. assertion.promote() partitions by quad subject, so a document with rootEntity: urn:parent will still promote under assertionUri while _meta says the root is urn:parent. Either rewrite the extracted subjects under the resolved root, or keep the metadata aligned with the actual promoted root.
Deferred to a follow-up PR. Filed as #122.
The issue is real but the fix is architectural — the current behavior pins documentIri: assertionUri in the import-file route, which means all structural content triples (frontmatter properties, H1→schema:name, wikilinks, hashtags, dataview fields, section headings) emit with the assertion UAL as their subject, regardless of whether frontmatter has rootEntity set. That makes _meta row 14 an annotation rather than a content-partition statement.
The three options discussed in the follow-up issue are:
- 10A: extractor rewrites content-triple subjects to resolvedRootEntity. Has semantic hazards: if two assertions both claim rootEntity: urn:note:parent, they'd both write content triples on <urn:note:parent>, re-introducing the cross-assertion contention problem the Bug 8 promote-filter fix just solved.
- 10B: remove the rootEntity override feature entirely. Drops a spec-defined feature (§19.10.1:508).
- 10C: document the current behavior as a metadata HINT, not retargeting. Zero code, just a spec clarification that row 14 is an annotation.
The Round 4 consensus is that 10C is the smallest defensible path, but it needs spec-engineer and broader architectural discussion before committing. Not blocking for PR #121 — users who hit the mismatch can work around it by leaving rootEntity unset (the reflexive default is consistent) or by accepting that the _meta annotation is informational.
Tracking in #122. Not resolving this thread — leaving it open as a deferred marker.
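The mismatch discussed in this thread can be made concrete with a small sketch. It assumes a simplified quad shape and uses a stand-in for subject-based partitioning; `findRootMismatch` and the example IRIs are hypothetical and are not the daemon's actual `autoPartition()` code. Only the predicate IRI is taken from the diff in this PR.

```typescript
// Simplified quad shape (assumption; the real store types are not shown here).
interface Quad {
  subject: string;
  predicate: string;
  object: string;
}

// Predicate IRI as written in the import-file route diff.
const DKG_ROOT_ENTITY = 'http://dkg.io/ontology/rootEntity';

// Hypothetical helper: returns every declared root entity that never appears
// as a content-quad subject, i.e. a root that subject-based partitioning
// would not promote under.
function findRootMismatch(contentQuads: Quad[], metaQuads: Quad[]): string[] {
  const promotedRoots = new Set(contentQuads.map((q) => q.subject));
  return metaQuads
    .filter((q) => q.predicate === DKG_ROOT_ENTITY && !promotedRoots.has(q.object))
    .map((q) => q.object);
}

// Example: content is emitted under the assertion URI, metadata claims urn:parent.
const assertionUri = 'urn:example:assertion-1'; // illustrative IRI
const content: Quad[] = [
  { subject: assertionUri, predicate: 'http://schema.org/name', object: '"My note"' },
];
const overridden: Quad[] = [
  { subject: assertionUri, predicate: DKG_ROOT_ENTITY, object: 'urn:parent' },
];
const reflexive: Quad[] = [
  { subject: assertionUri, predicate: DKG_ROOT_ENTITY, object: assertionUri },
];

console.log(findRootMismatch(content, overridden)); // the override is reported
console.log(findRootMismatch(content, reflexive)); // nothing is reported
```

A check of this kind would only detect the divergence; it does not choose between options 10A, 10B, and 10C.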
aecb6b8 to 4cb2df3
// Row 1 — points at the content-addressed file URN
{ subject: args.subject, predicate: DKG_SOURCE_FILE, object: args.sourceFileIri },
// Row 3 — resolved root entity (reflexive or frontmatter/explicit override)
{ subject: args.subject, predicate: DKG_ROOT_ENTITY, object: resolvedRootEntity },
🔴 Bug: the new rootEntity override only changes the emitted metadata/linkage quad; it does not change any subject IRIs. assertionPromote still derives roots from subjects via autoPartition, so promoting a document with rootEntity: ... will still publish under assertionUri/section subjects while _meta now claims a different root. Either reject this override here until the downstream promote path can honor it, or plumb the resolved root through partitioning/promotion.
Deferred — this is the same issue as the daemon.ts:2613 thread, already filed as #122. The rootEntity override / promote-partitioning mismatch is architectural and requires either subject rewriting (with cross-assertion contention hazards), removing the feature (contradicts spec §19.10.1:508), or a documentIri plumbing rework. Tracked in #122 for follow-up. Leaving this thread open as a deferred marker, same as the original thread.
4cb2df3 to 9a3f2da
const metaQuads: Array<{ subject: string; predicate: string; object: string; graph: string }> = [
// Row 14 — rootEntity comes from the extractor's resolved value so
// the data-graph row 3 and `_meta` row 14 point at the same IRI.
{ subject: assertionUri, predicate: 'http://dkg.io/ontology/rootEntity', object: resolvedRootEntity, graph: metaGraph },
🔴 Bug: dkg:rootEntity in _meta can now disagree with the actual root entity the rest of the pipeline uses. assertionPromote still derives KA roots from subjects via autoPartition, and this import path still makes assertionUri the named subject, so a frontmatter rootEntity override only changes bookkeeping here. That leaves downstream consumers seeing one root in _meta and another during promotion/update. Either make the imported document root follow resolvedRootEntity, or keep this row reflexive until the publisher honors the override end-to-end.
Deferred — this is the third location Codex has flagged the rootEntity override / promote-partitioning mismatch. See the original daemon.ts:2613 thread and the markdown-extractor.ts:529 thread, both tracked in #122. The architectural fix (documentIri plumbing end-to-end) is out of scope for this PR. Leaving this thread open as a deferred marker consistent with the other two.
9a3f2da to 51eb21d
51eb21d to ee61822
ee61822 to aea36de
7d87d09 to 8bb41e0
// that would otherwise pass the prefix check and blow up at the RDF
// layer with a cryptic error (Codex Bug 13). Non-IRI values fall
// through to slugification as before.
let resolvedRootEntity: string = args.rootEntityIri ?? args.subject;
🔴 Bug: This PR now allows rootEntity to diverge from the document subject, but the downstream promote/publish path still partitions KAs by quad subject only (autoPartition() never looks at dkg:rootEntity). A document with rootEntity: urn:dkg:md:parent-root will still be promoted/owned under its subject URI while rows 3/14 claim the root is parent-root. Either keep resolvedRootEntity reflexive until partitioning honors dkg:rootEntity, or update the promote/publish path in the same PR.
Deferred — duplicate of the autoPartition vs dkg:rootEntity partition divergence already tracked in #122 (also surfaced on this PR as threads PRRT_kwDORwbl8c56TLcv / PRRT_kwDORwbl8c56TWbZ / PRRT_kwDORwbl8c56Tbh8 — this is now the fourth location Codex has flagged the same root-cause pattern). The architectural fix requires plumbing the resolved documentIri / rootEntityIri through partitioning and promotion end-to-end, which is out of scope for this PR. Leaving this thread open as a visible deferred marker pointing at #122, consistent with the other three.
8bb41e0 to 5257a17
const metaQuads: Array<{ subject: string; predicate: string; object: string; graph: string }> = [
// Row 14 — rootEntity comes from the extractor's resolved value so
// the data-graph row 3 and `_meta` row 14 point at the same IRI.
{ subject: assertionUri, predicate: 'http://dkg.io/ontology/rootEntity', object: resolvedRootEntity, graph: metaGraph },
🔴 Bug: resolvedRootEntity can now differ from assertionUri, but the import path still extracts all content under the assertion UAL subject. assertionPromote/autoPartition root KAs from subjects, not from this _meta triple, so a frontmatter rootEntity override will make _meta advertise one root while promotion still publishes the assertion under the UAL. Either apply the override to the extracted subject as well, or keep row 14 reflexive until downstream promotion understands dkg:rootEntity overrides.
Deferred: same autoPartition vs dkg:rootEntity partition divergence tracked in #122, already surfaced on this PR as threads PRRT_kwDORwbl8c56TLcv / PRRT_kwDORwbl8c56TWbZ / PRRT_kwDORwbl8c56Tbh8 / PRRT_kwDORwbl8c56T3jv. This is the fifth location where Codex has flagged the same architectural root cause. The fix requires plumbing the resolved documentIri / rootEntityIri through autoPartition and promotion end-to-end, which is out of scope for this PR. Leaving this thread open as a visible deferred marker alongside the four prior markers.
5257a17 to 592f7b3
592f7b3 to 5c5d25e
5c5d25e to 7f6aa97
7f6aa97 to a5c168c
const metaQuads: Array<{ subject: string; predicate: string; object: string; graph: string }> = [
// Row 14 — rootEntity comes from the extractor's resolved value so
// the data-graph row 3 and `_meta` row 14 point at the same IRI.
{ subject: assertionUri, predicate: 'http://dkg.io/ontology/rootEntity', object: resolvedRootEntity, graph: metaGraph },
🔴 Bug: resolvedRootEntity is now persisted here, but assertionPromote() still partitions promoted assertions by subject via autoPartition() and ignores these dkg:rootEntity rows. An import with rootEntity: frontmatter will therefore advertise one root in WM/_meta and publish under a different root later. Either thread this override through the promote/publish path or keep row 14 reflexive until downstream promotion honors it.
Deferred — same autoPartition vs dkg:rootEntity partition divergence pattern tracked in #122. Already flagged in this PR at threads PRRT_kwDORwbl8c56TLcv (Bug 10) / PRRT_kwDORwbl8c56TWbZ (Bug 16) / PRRT_kwDORwbl8c56Tbh8 (Bug 18) / PRRT_kwDORwbl8c56T3jv (Bug 28) / PRRT_kwDORwbl8c56T7e6 (Bug 32).
This is the sixth marker for the same architectural issue; three of them (Bug 18, Bug 32, and now this one) are at daemon.ts:2791 specifically. Three flags at the same line across three review rounds reinforce that this is systemic to the documentIri pinning decision rather than a local patch opportunity. The architectural fix requires plumbing the resolved root through autoPartition and promotion end-to-end, which is out of scope for this PR.
Leaving this thread open as a visible deferred marker alongside the five prior markers, consistent with the established pattern. See #122 for the tracking issue (already updated to reflect the 6-marker count).
a5c168c to 54e5b09
….1/§10.2)
The markdown-extraction pipeline in packages/cli was emitting structural
content triples but not the source-file linkage triples required by
19_MARKDOWN_CONTENT_TYPE.md §10.1 and §10.2. Without those triples, the
assertion graph had no way to rediscover the original file blob after a
daemon restart — the in-memory extractionStatus map was the only surviving
linkage, and it vanished on restart.
This commit implements the full 20-row spec-mandated source-file linkage
contract from §10.1 (data-graph entity linkage) and §10.2 (_meta graph
metadata), split between the extractor and the import-file route handler:
**Extractor (packages/cli/src/extraction/markdown-extractor.ts)**
- Removes the non-spec dkg:derivedFrom / prov:wasGeneratedBy provenance
block entirely — daemon owns all provenance emission now.
- Emits rows 1-3 on the document subject IRI: dkg:sourceFile (to fileUri),
dkg:sourceContentType "text/markdown" (always the extractor input type,
even for PDF where the MD intermediate is what the extractor processed),
dkg:rootEntity (reflexive by default, frontmatter override supported).
- Frontmatter rootEntity key is consumed so it doesn't leak through as
schema:rootEntity via the generic frontmatter-to-predicate fallthrough.
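The rootEntity handling above can be sketched as follows, using the precedence this commit's tests describe (frontmatter > explicit input > reflexive). This is a minimal illustration, not the extractor's code: `resolveRootEntity`, `ResolveArgs`, and `remainingFrontmatter` are hypothetical names, and IRI validation of the frontmatter value is left out.

```typescript
// Hypothetical argument shape. `subject` and `rootEntityIri` mirror names in
// the PR diff; `frontmatter` is an illustrative stand-in for parsed YAML.
interface ResolveArgs {
  subject: string;
  rootEntityIri?: string;
  frontmatter: Record<string, unknown>;
}

interface Resolved {
  rootEntity: string;
  remainingFrontmatter: Record<string, unknown>;
}

function resolveRootEntity(args: ResolveArgs): Resolved {
  // Pull `rootEntity` out so a generic frontmatter-to-predicate pass never
  // sees it and cannot emit it as a schema:rootEntity triple.
  const { rootEntity: fmRoot, ...remainingFrontmatter } = args.frontmatter;
  const fromFrontmatter =
    typeof fmRoot === 'string' && fmRoot.length > 0 ? fmRoot : undefined;
  return {
    // Precedence: frontmatter, then explicit input, then the subject itself.
    rootEntity: fromFrontmatter ?? args.rootEntityIri ?? args.subject,
    remainingFrontmatter,
  };
}

const resolved = resolveRootEntity({
  subject: 'urn:example:doc',
  rootEntityIri: 'urn:example:explicit',
  frontmatter: { title: 'Note', rootEntity: 'urn:example:from-frontmatter' },
});
console.log(resolved.rootEntity, Object.keys(resolved.remainingFrontmatter));
```

Returning the remaining frontmatter alongside the resolved root keeps the "consume the key" step and the "pick the root" step in one place, which is one way to avoid the leak described above.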
**Daemon (packages/cli/src/daemon.ts import-file route handler)**
- Computes keccak256 via ethers.keccak256 alongside sha256. FileStoreEntry
now exposes both hashes; ImportFileResponse.fileHash + ExtractionStatusRecord
+ mdIntermediateHash all switched to keccak256 per spec §2.1:658 (file store
is keccak256-addressed).
- Mints fileUri as urn:dkg:file:keccak256:<hex> and passes it into the extractor.
- After Phase 2, builds rows 4-8 (file descriptor block: rdf:type dkg:File,
dkg:contentHash, dkg:fileName, dkg:contentType, dkg:size).
- Mints one fresh urn:dkg:extraction:<uuidv4> per import and emits rows 9-13
(ExtractionProvenance block with dkg:extractedFrom <fileUri>, dkg:extractedBy,
dkg:extractedAt, dkg:extractionMethod "structural").
- After assertion.write, writes rows 14-19 (always) and row 20 (conditionally,
only when Phase 1 actually ran for PDF/DOCX uploads) into the CG ROOT _meta
graph via agent.store.insert with explicit contextGraphMetaUri(contextGraphId)
— NEVER the sub-graph _meta graph, per §16.2.1:3449-3466.
- Row 15 (_meta sourceContentType) uses the ORIGINAL upload content type
(e.g. "application/pdf"), distinct from row 2 (data-graph sourceContentType)
which uses the extractor input "text/markdown". This row-2-vs-row-15 split
is explicitly tested in both directions.
- _meta insert failures flow through recordFailedExtraction so
/extraction-status doesn't get stuck at in_progress on partial writes.
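As a rough illustration of the `_meta` rows described above, the sketch below builds row 14, row 15, the source-file hash row, and the conditional row 20. It is written under assumptions: `buildMetaQuads` and `MetaInput` are hypothetical, only the rootEntity predicate IRI is confirmed by the diff (the other three are assumed to share its namespace), and RDF literal serialization is omitted.

```typescript
interface MetaQuad {
  subject: string;
  predicate: string;
  object: string;
  graph: string;
}

// Assumption: every predicate below lives in the namespace of the rootEntity
// IRI seen in the diff. Only 'http://dkg.io/ontology/rootEntity' is confirmed.
const DKG_NS = 'http://dkg.io/ontology/';

interface MetaInput {
  assertionUri: string;
  metaGraph: string; // the CG root _meta graph, never a sub-graph _meta graph
  resolvedRootEntity: string;
  uploadContentType: string; // ORIGINAL upload type, e.g. "application/pdf"
  fileHash: string; // "keccak256:<hex>"
  mdIntermediateHash?: string; // present only when Phase 1 actually ran
}

function buildMetaQuads(input: MetaInput): MetaQuad[] {
  const row = (localName: string, object: string): MetaQuad => ({
    subject: input.assertionUri,
    predicate: DKG_NS + localName,
    object,
    graph: input.metaGraph,
  });
  const quads: MetaQuad[] = [
    row('rootEntity', input.resolvedRootEntity), // row 14
    row('sourceContentType', input.uploadContentType), // row 15
    row('sourceFileHash', input.fileHash),
  ];
  // Row 20 is conditional: markdown uploads skip Phase 1 and get no such row.
  if (input.mdIntermediateHash !== undefined) {
    quads.push(row('mdIntermediateHash', input.mdIntermediateHash));
  }
  return quads;
}

const base = {
  assertionUri: 'urn:example:assertion-1',
  metaGraph: 'urn:example:cg/_meta',
  resolvedRootEntity: 'urn:example:assertion-1',
  fileHash: 'keccak256:<hex>',
};
const mdQuads = buildMetaQuads({ ...base, uploadContentType: 'text/markdown' });
const pdfQuads = buildMetaQuads({
  ...base,
  uploadContentType: 'application/pdf',
  mdIntermediateHash: 'keccak256:<hex>',
});
console.log(mdQuads.length, pdfQuads.length);
```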
**FileStore (packages/cli/src/file-store.ts)**
- Dual-hash store: sha256 remains the on-disk primary for back-compat; a
keccak256 pointer file is written under keccak256/<shard>/<hex> containing
the sha256 hex so FileStore.get() accepts either prefix and resolves to
the same blob.
- Non-breaking for any existing sha256 callers.
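The dual-hash lookup can be pictured with an in-memory sketch. It assumes references of the form `sha256:<hex>` and `keccak256:<hex>` and takes both digests from the caller instead of computing them; `DualHashStore` is a hypothetical stand-in, not the on-disk FileStore.

```typescript
class DualHashStore {
  // Primary storage, keyed by sha256 hex (mirrors the on-disk primary).
  private blobs = new Map<string, Uint8Array>();
  // Pointer table: keccak256 hex -> sha256 hex (mirrors the pointer files).
  private keccakPointers = new Map<string, string>();

  // Hashes are supplied by the caller; hashing itself is out of scope here.
  put(sha256Hex: string, keccak256Hex: string, bytes: Uint8Array): void {
    this.blobs.set(sha256Hex, bytes);
    this.keccakPointers.set(keccak256Hex, sha256Hex);
  }

  // Accepts either prefix and resolves to the same blob.
  get(ref: string): Uint8Array | undefined {
    const sep = ref.indexOf(':');
    if (sep < 0) return undefined;
    const prefix = ref.slice(0, sep);
    const hex = ref.slice(sep + 1);
    if (prefix === 'sha256') return this.blobs.get(hex);
    if (prefix === 'keccak256') {
      const sha = this.keccakPointers.get(hex);
      return sha === undefined ? undefined : this.blobs.get(sha);
    }
    return undefined;
  }
}

const store = new DualHashStore();
const payload = new Uint8Array([1, 2, 3]);
store.put('aa', 'bb', payload); // placeholder digests, not real hashes
console.log(store.get('sha256:aa') === store.get('keccak256:bb'));
```

Storing only a pointer under the keccak256 key keeps a single copy of each blob, which is why existing sha256 callers are unaffected.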
**Tests**
- extraction-markdown.test.ts grows from 36 to 41 tests: explicit assertions
that dkg:derivedFrom is NEVER emitted; rows 1-3 coverage with and without
sourceFileIri; rootEntity override precedence (frontmatter > explicit input
> reflexive); rootEntity frontmatter key no longer leaks through the
generic predicate fallthrough.
- import-file-integration.test.ts grows from 25 to 31 tests: row 4-13 data-graph
descriptor and provenance block verification with fresh UUIDv4 per import;
row 14-19 _meta graph verification; PDF content-type split (row 2 = text/markdown
AND row 15 = application/pdf on the same import); sub-graph routing with
explicit assertion that _meta quads always land in CG root meta, never
sub-graph meta; daemon-restart recovery (clear extractionStatus map, recover
hash from captured _meta quads, re-fetch blob via FileStore.get with byte
equality); FileStore dual-prefix acceptance.
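The daemon-restart recovery step in the test list above amounts to reading the hash back out of the `_meta` quads. A minimal sketch, assuming the hash is stored as `keccak256:` followed by 64 hex characters with an optional `0x`, and assuming the predicate IRI shown; `recoverSourceFileHash` is a hypothetical helper.

```typescript
interface MetaQuad {
  subject: string;
  predicate: string;
  object: string;
  graph: string;
}

// Assumed IRI: same namespace as the rootEntity predicate in the diff.
const SOURCE_FILE_HASH = 'http://dkg.io/ontology/sourceFileHash';

// Returns "keccak256:<hex>" for the assertion, or undefined if the row is
// missing or malformed. No in-memory extraction status is consulted.
function recoverSourceFileHash(
  metaQuads: MetaQuad[],
  assertionUri: string,
): string | undefined {
  const row = metaQuads.find(
    (q) => q.subject === assertionUri && q.predicate === SOURCE_FILE_HASH,
  );
  if (row === undefined) return undefined;
  // Tolerate a literal serialized with surrounding double quotes.
  const value = row.object.replace(/^"|"$/g, '');
  // keccak256 digests are 32 bytes, i.e. 64 hex characters.
  return /^keccak256:(0x)?[0-9a-f]{64}$/i.test(value) ? value : undefined;
}

const exampleHash = 'keccak256:' + 'ab'.repeat(32); // placeholder digest
const captured: MetaQuad[] = [
  {
    subject: 'urn:example:assertion-1',
    predicate: SOURCE_FILE_HASH,
    object: `"${exampleHash}"`,
    graph: 'urn:example:cg/_meta',
  },
];
console.log(recoverSourceFileHash(captured, 'urn:example:assertion-1'));
```

The recovered value could then be handed to a store lookup that accepts the keccak256 prefix.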
Refs: 19_MARKDOWN_CONTENT_TYPE.md §10.1, §10.2, §3.2, §4
05_PROTOCOL_EXTENSIONS.md §6.3, §6.5
03_PROTOCOL_CORE.md §2.1 (file store)
EXAMPLE_FULL_FLOW.md §2
Companion spec PR: OriginTrail/dkgv10-spec#86
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
54e5b09 to adb6180
review
Hardens scripts/devnet-test.sh sections 18-24 (introduced in PR #120) against a review pass that surfaced 4 P0 (blocking), 11 P1 (fix-in-PR), and 7 P2 (nice-to-have) findings. All P0 and P1 addressed; 5 of 7 P2 addressed.
**P0 (all 4)**
- §21 now SPARQLs the assertion data graph for the new source-file linkage triples (dkg:sourceFile, dkg:sourceContentType, dkg:rootEntity) AND the CG root _meta graph for dkg:sourceFileHash with keccak256 format regex and a drift-check that the _meta hash equals the wire fileHash from the import response. §21h asserts dkg:mdIntermediateHash is ABSENT for markdown uploads (row 20 is Phase-1-only). Previously §21 only checked the in-memory ImportFileResponse fileHash, so a daemon that wrote zero linkage triples would have silently passed.
- §23a (no-auth 401) now detects DEVNET_NO_AUTH=1 explicitly and emits [SKIP] cleanly; hard-fails if a real auth regression returns 200. Previously silently degraded to WARN under DEVNET_NO_AUTH=1, masking real regressions.
- §23c / §23g switched from substring-grep-on-body to a new http_post_capture helper that captures both body AND HTTP status; now requires 4xx AND error token. A 500 with body {"error":"internal"} no longer false-passes.
- §18a catchup polling removed "idle" from success markers (idle is the INITIAL pre-catchup state, not a completion). Only completed|synced|done break the poll loop.
**P1 (all 11)**
- New c() helper bounds curl with --max-time 30 --connect-timeout 5 (env-overridable via DEVNET_CURL_TIMEOUT / DEVNET_CURL_CONNECT_TIMEOUT). A hung node can no longer stall CI ~40 min per polling section.
- json_get normalizes Python booleans to lowercase true/false; all check "..." "True" callsites flipped to "true".
- New safe_bindings_count and safe_quads_count helpers emit a PARSE_ERR sentinel on schema drift instead of silently returning "0". ~20 call sites converted.
- §20d/§20f now route through c -X PUT for consistent timeout + auth.
- §22c/§22d fragile inline python ternary replaced with explicit try/except that surfaces __ERR__ / __MISSING__ sentinels distinctly from legitimate status values.
- New §21i: PNG upload graceful-degrade negative test asserts extraction.status == "skipped" AND tripleCount == 0 AND pipelineUsed null (spec §6.5 graceful-degrade path had zero coverage).
- New §24g: write-to-unregistered-sub-graph negative test with nanosecond-suffixed name, requires 4xx response.
- §23b no longer conflates PARSE_ERR with legitimate empty results.
- §22a asserts triplesWritten >= 2 before 22b enqueues; a silent zero-write can no longer hide a broken publisher queue.
- §21e promote check excludes __ERR__ in addition to __NONE__ and "0".
- §18b/§18c hard-fail (not warn) when sync claimed completion but VM/SWM data is missing (the failure message flags "catchup reported complete but data missing" as a bug). Paired with P0-4 this closes the two-layer cover that catchup bugs previously had.
**P2 (5 of 7)**
- GOSSIP_WAIT_S env var for the §24b settle window.
- Explicit [SKIP] log when §22c-f skip due to missing jobId.
- DEVNET_TMPDIR honors $TMPDIR for Windows/WSL developers.
- shareOperationId / workspaceOperationId dual-lookup documented as intentional legacy alias (not an API-rename drift).
- §24f gossip single sleep replaced with 5×1s poll-and-break loop.
Not addressed (scope): P2-5 (positive auth control: adds new coverage rather than hardening existing) and P2-7 (stable test IDs: pure refactor).
New helpers added: c(), ok/fail/warn/skip, json_get, check, safe_bindings_count, safe_quads_count, http_post_capture.
Refs: spec §10.1, §10.2, §6.5
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
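The helpers in this commit are bash, but two of the pass criteria are easy to restate as a TypeScript sketch: "4xx AND error token" for §23c/§23g, and a PARSE_ERR sentinel instead of a silent zero. Both functions are hypothetical restatements, not ports of `http_post_capture` or `safe_bindings_count`. The "error token" is assumed to be the substring `error`, and the response is assumed to follow the standard SPARQL JSON results layout (`results.bindings`).

```typescript
// Passes only when the status is 4xx AND the body carries an error token.
// Assumption: the token is the case-insensitive substring "error".
function isExpectedRejection(status: number, body: string): boolean {
  const is4xx = status >= 400 && status < 500;
  return is4xx && body.toLowerCase().includes('error');
}

const PARSE_ERR = 'PARSE_ERR' as const;

// Counts SPARQL bindings, returning a sentinel on malformed input or schema
// drift so callers cannot mistake "unparseable" for "zero results".
function safeBindingsCount(raw: string): number | typeof PARSE_ERR {
  try {
    const parsed: unknown = JSON.parse(raw);
    const bindings = (parsed as { results?: { bindings?: unknown } } | null)
      ?.results?.bindings;
    return Array.isArray(bindings) ? bindings.length : PARSE_ERR;
  } catch {
    return PARSE_ERR;
  }
}

// A 500 with an error body used to false-pass under substring matching.
console.log(isExpectedRejection(500, '{"error":"internal"}'));
console.log(isExpectedRejection(400, '{"error":"bad request"}'));
console.log(safeBindingsCount('{"results":{"bindings":[]}}'));
console.log(safeBindingsCount('{"unexpected":true}'));
```

Keeping the sentinel distinct from `0` is what lets a check like §23b tell a legitimately empty result from a parse failure.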
adb6180 to fe63671
const metaQuads: Array<{ subject: string; predicate: string; object: string; graph: string }> = [
// Row 14 — rootEntity comes from the extractor's resolved value so
// the data-graph row 3 and `_meta` row 14 point at the same IRI.
{ subject: assertionUri, predicate: 'http://dkg.io/ontology/rootEntity', object: resolvedRootEntity, graph: metaGraph },
🔴 Bug: This only records the rootEntity override in _meta; the imported quads themselves still live under assertionUri, and assertionPromote/autoPartition derive KA roots from subjects, not from dkg:rootEntity. A later promote will still publish the assertion URI as the root entity, so the new override becomes informational only. If the override is meant to affect publish/update identity, rewrite the imported document/section subjects to the resolved root entity or teach partitioning to honor dkg:rootEntity.
// based IRI" without restricting schemes; the only exclusions are
// blank nodes (RDF 1.1 §3.4 — not IRIs) and reserved protocol
// namespaces (§19.10.2:708-723). `isSafeIri` matches that contract.
if (isSafeIri(fmId)) return fmId;
🟡 Issue: isSafeIri() only checks SPARQL-safe syntax; it does not enforce the reserved urn:dkg:file: / urn:dkg:extraction: namespaces mentioned in this comment. With this change, id: urn:dkg:file:... is accepted here and only rejected later at the publisher boundary (or written if a caller bypasses that boundary). Add the same reserved-prefix guard here so subject resolution and write-time validation stay consistent.
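One possible shape for the guard suggested here, as a sketch under assumptions: `hasReservedPrefix` and `resolveFrontmatterId` are hypothetical names, the real `isSafeIri` is passed in because its body is not shown in this thread, and `looksLikeIri` is only a stand-in for it. The two reserved prefixes are the ones named in the comment above.

```typescript
// Reserved protocol namespaces named in the review comment.
const RESERVED_PREFIXES = ['urn:dkg:file:', 'urn:dkg:extraction:'] as const;

function hasReservedPrefix(iri: string): boolean {
  return RESERVED_PREFIXES.some((prefix) => iri.startsWith(prefix));
}

// Returns the frontmatter id when it is both syntactically safe and outside
// the reserved namespaces; otherwise undefined, leaving the caller to decide
// whether to reject or fall through to slugification.
function resolveFrontmatterId(
  fmId: string,
  isSafeIri: (value: string) => boolean,
): string | undefined {
  if (isSafeIri(fmId) && !hasReservedPrefix(fmId)) return fmId;
  return undefined;
}

// Stand-in for the real isSafeIri (assumption): scheme, colon, no whitespace.
const looksLikeIri = (value: string): boolean =>
  /^[a-z][a-z0-9+.-]*:\S+$/i.test(value);

console.log(resolveFrontmatterId('urn:note:parent', looksLikeIri));
console.log(resolveFrontmatterId('urn:dkg:file:keccak256:abc', looksLikeIri));
```

Applying the same prefix list at subject resolution and at the publisher boundary would keep the two checks from drifting apart.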
Summary
Chain PR onto test/devnet-e2e-sections-18-24 (the base for PR #120). Two changes bundled:
Phase B: Source-file linkage triples. Implements 19_MARKDOWN_CONTENT_TYPE.md §10.1 and §10.2, the full 20-row source-file linkage contract that was missing from PR #113 (feat: wire import-file endpoint and Phase 2 markdown extraction (#77, #79, #80)). After a daemon restart, the assertion graph can now rediscover the original file blob by SPARQLing _meta for dkg:sourceFileHash. Previously the only linkage was the in-memory extractionStatus map, which vanished on restart. Hash format is keccak256 per 03_PROTOCOL_CORE.md §2.1:658.
Phase D: Devnet test hardening. Hardens scripts/devnet-test.sh sections 18-24 (introduced in PR #120, test: add devnet e2e sections 18-24 covering V10 feature gaps) against 4 P0 + 11 P1 + 5/7 P2 findings from a review pass. §21 now actually SPARQLs for the new linkage predicates, §23a no longer silently masks auth regressions under DEVNET_NO_AUTH=1, §23c/§23g no longer false-pass on 500s, §18a no longer accepts pre-catchup idle as success, and a hung node can no longer stall CI ~40 min via missing --max-time.
Companion spec PR
Three spec cleanups surfaced during implementation; see OriginTrail/dkgv10-spec#86:
- dkg:mdIntermediateHash row to 19_MARKDOWN_CONTENT_TYPE.md §10.2 _meta layout
- dkg:rootEntity literal→IRI typo at §10.2:601
- fileUri URN shape (urn:dkg:file:keccak256:<hex>)
The spec PR is independent and can merge in either order.
Test plan
- packages/cli/test/extraction-markdown.test.ts: 41/41 passing (was 36)
- packages/cli/test/import-file-integration.test.ts: 31/31 passing (was 25)
- packages/cli/test/multipart.test.ts: 25/25 passing (unchanged)
- packages/core full suite: 415/415 passing (unchanged)
- bash -n scripts/devnet-test.sh: syntax OK
Pre-existing Windows-hostility failures (slot-helpers, migration, rollback, auto-update, blue-green, publisher-wallets, publisher-cli-smoke, install-script, indexer) are unrelated to this PR; none of the modified files have failures.
Follow-up: GET /api/file/:hash endpoint
This PR establishes the in-graph linkage (<assertionUal> dkg:sourceFileHash "keccak256:<hex>" in CG root _meta) that lets a SPARQL client rediscover the source file hash after daemon restart. Actually retrieving the file bytes over HTTP is deferred: the daemon currently has no GET /api/file/:hash route, so the round-trip is only exercisable in-process via FileStore.get() (which import-file-integration.test.ts verifies). Exposing the file store over HTTP requires a separate design for access control semantics (private CGs per 19_MARKDOWN_CONTENT_TYPE.md §4.1), content-type preservation, and spec-side language. Tracked as a follow-up.
Closes / references
Not closing any issues: this is a chain PR onto the PR #120 base, and the linkage work was surfaced post-merge on PR #113 rather than from a dedicated issue.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com