AD-324: Switch RO-Crate provenance export to a PROV-shaped model#262
Open
Adopt the data team's PROV-shaped RO-Crate metadata as Torc's canonical generation and export format. Update file, job, software, workflow, and run provenance entities to use the new relationships and type arrays. Adjust export/import remapping, refresh the RO-Crate docs, and add a rationale document covering the design choices and assumptions behind the change.
…ithub.com/NatLabRockies/torc into AD-324-ro-create-mods-for-naerm-data-team
Remove the accidentally committed tmp workspace files from the index while keeping them on disk locally. Keep /tmp in .gitignore so future scratch notes and examples stay untracked by default.
Torc RO-Crate Provenance Change Rationale
Decision
Torc now uses a PROV-shaped RO-Crate format as the canonical export and generation
model. I chose the breaking-change path because the assignment explicitly allowed it and because a
translation layer would have kept two provenance models alive at once.
That would have increased long-term cost in three ways:
differ
Using the target model directly keeps Torc's stored entities, auto-generated metadata, and exported
`ro-crate-metadata.json` aligned.
Core Modifications
1. File provenance now uses the PROV-facing shape
Generated file entities now use:
- `@type: ["File", "prov:Entity"]`
- `prov:wasGeneratedBy`
- `prov:wasAttributedTo`
- `prov:wasDerivedFrom`

Removed `torc:run_id` because it was Torc-specific bookkeeping, not a provenance relationship in
the requested model.
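
As a rough illustration, a generated file entity in this shape might look like the sketch below. The helper name `make_file_entity`, the example `@id` values, and the choice to attribute the file to the run entity are assumptions made for the example, not Torc's actual code.

```python
import json


def make_file_entity(path, job_id, agent_id, source_ids=()):
    """Build a PROV-shaped RO-Crate file entity (hypothetical helper)."""
    entity = {
        "@id": path,
        "@type": ["File", "prov:Entity"],
        # The job (activity) that produced this file.
        "prov:wasGeneratedBy": {"@id": job_id},
        # Assumption: attribution points at the workflow-run entity.
        "prov:wasAttributedTo": {"@id": agent_id},
    }
    # Only emit derivation links when the inputs are known.
    if source_ids:
        entity["prov:wasDerivedFrom"] = [{"@id": s} for s in source_ids]
    return entity


file_entity = make_file_entity(
    "output/result.csv", "#job-1", "#torc-run-1", ["input/raw.csv"]
)
print(json.dumps(file_entity, indent=2))
```

Note that no `torc:run_id` key appears; the run is reachable only through provenance relationships.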
2. Job provenance is modeled as PROV activities
Generated job entities now use:
- `@type: ["CreateAction", "prov:Activity"]`
- `prov:hadPlan`
- `isPartOf`
- `prov:used`
- `prov:wasAssociatedWith`

This makes job execution records describe both the workflow plan they follow and the inputs they
consume, instead of only pointing at outputs.
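
A minimal sketch of a job entity with these properties follows. The helper name `make_job_entity`, the `#job-1` and `#torc-software` IDs, and the assumption that `isPartOf` points at the run entity are illustrative only.

```python
import json


def make_job_entity(job_id, plan_id, run_id, input_ids, agent_id):
    """Build a PROV-shaped job entity (hypothetical helper)."""
    return {
        "@id": job_id,
        "@type": ["CreateAction", "prov:Activity"],
        # The workflow plan this job follows.
        "prov:hadPlan": {"@id": plan_id},
        # Assumption: the job is part of the workflow run.
        "isPartOf": {"@id": run_id},
        # Inputs consumed by the job.
        "prov:used": [{"@id": i} for i in input_ids],
        # The software agent that executed the job.
        "prov:wasAssociatedWith": {"@id": agent_id},
    }


job_entity = make_job_entity(
    "#job-1", "#torc-workflow", "#torc-run-1", ["input/raw.csv"], "#torc-software"
)
print(json.dumps(job_entity, indent=2))
```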
3. Workflow-level provenance entities were added
Torc now creates:
- `#torc-workflow`
- `#torc-run-{run_id}`

These entities are necessary because the requested model refers to a workflow plan and a workflow
run explicitly. Without them, `prov:hadPlan` and run attribution would point to synthetic IDs that
did not exist as entities.
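
The dangling-reference problem can be sketched with a small check over a crate's `@graph`. The function names and the `@type` values used for the workflow and run entities are assumptions for the example; the source does not state the exact types Torc assigns.

```python
def iter_refs(value):
    """Yield every {"@id": ...} reference nested inside a property value."""
    if isinstance(value, dict):
        if set(value) == {"@id"}:
            yield value["@id"]
        else:
            for nested in value.values():
                yield from iter_refs(nested)
    elif isinstance(value, list):
        for item in value:
            yield from iter_refs(item)


def dangling_refs(graph):
    """Return referenced IDs that are not defined as entities in the graph."""
    defined = {entity["@id"] for entity in graph}
    referenced = set()
    for entity in graph:
        for key, value in entity.items():
            if key != "@id":
                referenced.update(iter_refs(value))
    return referenced - defined


job = {
    "@id": "#job-1",
    "@type": ["CreateAction", "prov:Activity"],
    "prov:hadPlan": {"@id": "#torc-workflow"},
    "isPartOf": {"@id": "#torc-run-1"},
}
# Assumed types for illustration only.
workflow = {"@id": "#torc-workflow", "@type": ["SoftwareApplication", "prov:Plan"]}
run = {"@id": "#torc-run-1", "@type": ["CreateAction", "prov:Activity"]}

missing_before = dangling_refs([job])
missing_after = dangling_refs([job, workflow, run])
print(sorted(missing_before), sorted(missing_after))
```

With only the job in the graph, both references are unresolved; adding the two workflow-level entities resolves them.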
4. Software entities were aligned with the target model
Torc software records now use:
- `@type: ["SoftwareApplication", "prov:SoftwareAgent"]`

That keeps Torc's own binaries compatible with both RO-Crate consumers and the data team's PROV
interpretation.
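
To show why a type array serves both audiences, here is a small sketch of a consumer-side type check. The `has_type` helper, the `#torc-software` ID, and the `name` value are hypothetical.

```python
def has_type(entity, type_name):
    """Check an entity's @type, accepting either a string or an array."""
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return type_name in types


software_entity = {
    "@id": "#torc-software",  # hypothetical ID
    "@type": ["SoftwareApplication", "prov:SoftwareAgent"],
    "name": "torc",
}

# An RO-Crate consumer and a PROV consumer each find the type they expect.
ro_crate_view = has_type(software_entity, "SoftwareApplication")
prov_view = has_type(software_entity, "prov:SoftwareAgent")
print(ro_crate_view, prov_view)
```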
5. Export now preserves the richer stored metadata
The exporter no longer flattens stored metadata back to Torc's older shape. It now:
- … `@type` arrays
- … `@id` values when present
- … `#torc-workflow` and `#torc-run-{run_id}` if older records do not already have them
- … `localEvidenceGraph`
- … `prov` namespace in `@context`

This was important because switching the generators alone would not have been enough. The exported
crate had to look like the data team's example even when some metadata was entered manually or came
from older workflows.
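
A minimal sketch of an export step along these lines is shown below. The function name `export_crate`, the types given to the backfilled entities, and the omission of the RO-Crate metadata descriptor and root dataset are simplifying assumptions. `localEvidenceGraph` is left out because the source does not describe how it is produced.

```python
import json

RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
PROV_NS = "http://www.w3.org/ns/prov#"


def export_crate(stored_entities, run_id):
    """Assemble a crate document from stored entities (hypothetical sketch)."""
    # Copy entities as stored so @type arrays and @id values pass through unchanged.
    graph = [dict(entity) for entity in stored_entities]
    present = {entity["@id"] for entity in graph}

    workflow_id = "#torc-workflow"
    run_entity_id = f"#torc-run-{run_id}"

    # Backfill workflow-level entities that older records may lack.
    if workflow_id not in present:
        graph.append(
            {"@id": workflow_id, "@type": ["SoftwareApplication", "prov:Plan"]}
        )
    if run_entity_id not in present:
        graph.append(
            {
                "@id": run_entity_id,
                "@type": ["CreateAction", "prov:Activity"],
                "prov:hadPlan": {"@id": workflow_id},
            }
        )

    # Declare the prov prefix alongside the standard RO-Crate context.
    return {"@context": [RO_CRATE_CONTEXT, {"prov": PROV_NS}], "@graph": graph}


older_records = [{"@id": "out.csv", "@type": ["File", "prov:Entity"]}]
crate = export_crate(older_records, 7)
print(json.dumps(crate, indent=2))
```

Calling `export_crate` on records that already contain the workflow and run entities leaves them as stored rather than adding duplicates.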
6. Workflow export/import remapping still works
The import/export ID remapping logic was updated so job provenance references continue to remap when
entity IDs change. The key case here was switching from `wasGeneratedBy` to `prov:wasGeneratedBy`.
Assumptions
These choices were made explicitly:
- … `input_file_ids`
- … `#torc-run-{run_id}`
- … `run_id` is the right identifier to use for workflow-run provenance
- … again during output generation so they stay present and current
- … larger agent-model redesign
Why I Did Not Add a Mapping Layer
I did not keep the old storage model and export through a conversion layer because that would have
preserved internal semantics the data team explicitly does not want. A mapper would be useful only
if Torc still needed to support both formats as first-class outputs. The assignment did not lean
that way.
Why I Did Not Change the Database Schema
The database already stores RO-Crate metadata as JSON strings plus a few indexing fields
(`workflow_id`, `file_id`, `entity_id`, `entity_type`). That was already flexible enough for the
new model.
Changing the schema would not have improved provenance quality. It would only have increased risk
and migration cost for no practical gain.
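
The point can be sketched with an in-memory table that holds the four indexing fields plus a JSON string. The table name, column types, the `metadata` column name, and the older entity shape are assumptions for illustration; SQLite is used here only because it ships with Python.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ro_crate_entity (
           workflow_id INTEGER NOT NULL,
           file_id     INTEGER,
           entity_id   TEXT NOT NULL,
           entity_type TEXT NOT NULL,
           metadata    TEXT NOT NULL
       )"""
)

# Illustrative older shape next to the PROV-shaped replacement.
old_shape = {"@id": "a.csv", "@type": "File", "wasGeneratedBy": {"@id": "#job-1"}}
new_shape = {
    "@id": "b.csv",
    "@type": ["File", "prov:Entity"],
    "prov:wasGeneratedBy": {"@id": "#job-1"},
}

# Both shapes fit the same columns because the metadata is an opaque JSON string.
for file_id, entity in enumerate([old_shape, new_shape], start=1):
    conn.execute(
        "INSERT INTO ro_crate_entity VALUES (?, ?, ?, ?, ?)",
        (1, file_id, entity["@id"], "File", json.dumps(entity)),
    )

rows = conn.execute(
    "SELECT entity_id, metadata FROM ro_crate_entity ORDER BY file_id"
).fetchall()
restored = {entity_id: json.loads(metadata) for entity_id, metadata in rows}
print(restored["b.csv"]["@type"])
```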
Validation Status
Validated directly:
Partially blocked in this worktree:
Those failures were not caused by the RO-Crate logic itself. This workspace already has unrelated
server-feature build issues and test-harness assumptions about feature-gated binaries.
Known Follow-Ups
If this needs to be production-hardened further, the next useful follow-ups are:
- … `SoftwareApplication + prov:Plan` or move to a more domain-specific plan entity later