Improve RunProject data handoff and per-folder configuration #23

@jwildfire

This Issue was drafted by GitHub Copilot using Claude Opus 4.6 and reviewed by Jeremy Wildfire (@jwildfire)

Summary

RunProject() currently merges all workflow results flat into current_data and passes everything forward. This does not match real-world usage, where each phase needs specific data in a specific shape. The goal is for a single call to workr::RunProject() to replace sequential calls to RunWorkflows() (e.g., those in the open.gismo demo).

Related: #10 (original RunProject implementation, now closed).

Problem

The demo runWorkflows.R script runs four phases with custom data plumbing between each:

| Phase | Input data | Notes |
| --- | --- | --- |
| 1_mappings | `lRaw` (CSV files) | Straightforward: raw data in, mapped data out |
| 2_metrics | `c(mapped, list(lWorkflows = metrics_wf))` | Needs mapped data plus the workflow definitions themselves |
| 3_reporting | `c(mapped, list(lAnalyzed = analyzed, lWorkflows = metrics_wf, dSnapshotDate = Sys.Date(), Reporting_Results_Longitudinal = NULL))` | Wraps phase 2 results inside `lAnalyzed`, carries forward `mapped` (not all accumulated results), injects non-workflow data (`dSnapshotDate`, `NULL` placeholder) |
| 4_modules | `reporting` | Only phase 3 results, not the full accumulated data |

Additionally, between phases 2 and 3, a data transformation coerces GroupID to character across all analyzed results.
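
The coercion step could look roughly like the following. This is a minimal sketch, not the demo's actual code: it assumes `analyzed` is a flat named list of data frames, and `CoerceGroupID` and the toy result names are hypothetical.

```r
# Hypothetical helper: coerce GroupID to character in every data frame of a list.
# Assumes a flat named list; elements that are not data frames, or that have no
# GroupID column, are returned unchanged.
CoerceGroupID <- function(lResults) {
  lapply(lResults, function(x) {
    if (is.data.frame(x) && "GroupID" %in% names(x)) {
      x$GroupID <- as.character(x$GroupID)
    }
    x
  })
}

# Toy data standing in for phase 2 output (names are illustrative only)
analyzed <- list(
  ResultA = data.frame(GroupID = c(1, 2), Metric = c(0.1, 0.2)),
  Note = "not a data frame"
)
analyzed <- CoerceGroupID(analyzed)
class(analyzed$ResultA$GroupID)
```

If the real results are nested more deeply, the helper would need to recurse; that is not handled here.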

Gaps in current RunProject()

  1. Naive data merging: All phase results are merged flat into current_data. There's no way to wrap results (e.g., lAnalyzed = analyzed), select a subset, or exclude prior phases.
  2. No per-folder configuration: Each phase may need custom input mappings, extra static data, or post-processing hooks. There's no mechanism for this.
  3. Can't pass workflow metadata as data: Phase 2 needs the workflow list passed alongside the mapped data (lWorkflows = metrics_wf), and phase 3 needs it again. RunProject has no way to inject workflow definitions into lData.
  4. No inter-phase transformations: The GroupID coercion between phases 2 and 3 has no home in the current architecture.
  5. No control over what data carries forward: Phase 4 should only receive phase 3 results, but RunProject accumulates everything.
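
To make the first gap concrete, here is a small sketch contrasting a flat merge with the shaped input the demo builds for phase 3. The toy lists and their names are illustrative, and using `c()` to model the flat merge is an assumption about how RunProject combines results, not a description of its internals.

```r
# Toy stand-ins for phase results (names are illustrative only)
mapped   <- list(Mapped_A = 1, Mapped_B = 2)
analyzed <- list(Analysis_X = 3)

# Flat merge: every result ends up at the top level of the next phase's input
flat <- c(mapped, analyzed)
names(flat)    # "Mapped_A" "Mapped_B" "Analysis_X"

# Shaped input, as in the demo's phase 3: mapped data at the top level,
# phase 2 results wrapped under a single key
shaped <- c(mapped, list(lAnalyzed = analyzed))
names(shaped)  # "Mapped_A" "Mapped_B" "lAnalyzed"
```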

Proposed Solution

Per-folder config file (_config.yaml)

Each phase folder can optionally include a _config.yaml file that controls data flow and phase behavior. Example:

```yaml
# workflows/2_metrics/_config.yaml
input:
  from_phases: [1_mappings]      # Which prior phase results to include (default: all)
  include_workflows: true         # Inject lWorkflows for this phase into lData
  extra:                          # Additional static data to merge into lData
    some_param: "value"

output:
  wrap_as: null                   # Optionally wrap all results under a named key (e.g., "lAnalyzed")
  transform: null                 # Optional post-processing (see below)
```

```yaml
# workflows/3_reporting/_config.yaml
input:
  from_phases: [1_mappings]        # Only mapped data, not metrics
  from_results:                    # Pull specific named results from prior phases
    lAnalyzed: 2_metrics           # Wrap all phase 2 results into lAnalyzed
  include_workflows:
    from_phase: 2_metrics          # Inject workflow definitions from phase 2 as lWorkflows
  extra:
    dSnapshotDate: "Sys.Date()"    # A plain string in YAML; would need to be evaluated as R
    Reporting_Results_Longitudinal: null
```

```yaml
# workflows/4_modules/_config.yaml
input:
  from_phases: [3_reporting]       # Only phase 3 results
```
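
One way such a config might be turned into a phase's lData is sketched below. Everything here is an assumption for illustration: `ResolvePhaseInput` is a hypothetical function, the config is written as an R list standing in for a parsed _config.yaml, and the handling of each key is only one possible reading of the proposal.

```r
# Hypothetical resolver: build lData for one phase from a parsed _config.yaml.
# lConfig           - list standing in for the parsed YAML file
# lPhaseResults     - named list: phase name -> named list of that phase's results
# lWorkflowsByPhase - named list: phase name -> workflow definitions for that phase
ResolvePhaseInput <- function(lConfig, lPhaseResults, lWorkflowsByPhase, strPhase) {
  input <- lConfig$input

  # from_phases: default to all prior phases when not specified
  phases <- input$from_phases
  if (is.null(phases)) phases <- names(lPhaseResults)
  lData <- list()
  for (p in phases) lData <- c(lData, lPhaseResults[[p]])

  # from_results: wrap a whole phase's results under one key
  for (key in names(input$from_results)) {
    lData[[key]] <- lPhaseResults[[input$from_results[[key]]]]
  }

  # include_workflows: TRUE means this phase's own workflows;
  # a list with from_phase means another phase's workflows
  iw <- input$include_workflows
  if (isTRUE(iw)) {
    lData$lWorkflows <- lWorkflowsByPhase[[strPhase]]
  } else if (is.list(iw) && !is.null(iw$from_phase)) {
    lData$lWorkflows <- lWorkflowsByPhase[[iw$from_phase]]
  }

  # extra: static values; c() keeps NULL entries, whereas `[[<-` would drop them
  c(lData, input$extra)
}

# Toy inputs (names are illustrative only)
lPhaseResults <- list(
  `1_mappings` = list(Mapped_A = 1),
  `2_metrics`  = list(Analysis_X = 2)
)
lWorkflowsByPhase <- list(`2_metrics` = list(wf1 = "definition"))

# Equivalent of the 3_reporting example config, written as an R list
lConfig <- list(input = list(
  from_phases = "1_mappings",
  from_results = list(lAnalyzed = "2_metrics"),
  include_workflows = list(from_phase = "2_metrics"),
  extra = list(dSnapshotDate = "Sys.Date()", Reporting_Results_Longitudinal = NULL)
))

lData <- ResolvePhaseInput(lConfig, lPhaseResults, lWorkflowsByPhase, "3_reporting")
names(lData)
```

In this sketch dSnapshotDate stays the literal string "Sys.Date()". Whether and how R expressions in YAML get evaluated is a design question the sketch does not answer.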

Possible alternative: callback functions

Instead of (or in addition to) YAML config, support callback functions:

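```r
# fnPhaseInput and fnPhaseOutput are proposed arguments, not part of the current RunProject()
RunProject(
  strPath = "workflows",
  lData = lRaw,
  fnPhaseInput = function(phase, lPhaseResults, lData, lWorkflowsByPhase) {
    # Custom logic to build lData for each phase; return the list to pass to that phase
    lData
  },
  fnPhaseOutput = function(phase, result) {
    # Post-processing (e.g., GroupID coercion); return the (possibly modified) result
    result
  }
)
```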
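
As a rough illustration of where these callbacks would sit, the sketch below shows a simplified phase loop. It is not the RunProject implementation: `RunPhasesSketch` and `fnRunPhase` are hypothetical names, and the flat-merge default when no callback is supplied is an assumption meant to mirror the current behavior described above.

```r
# Hypothetical driver loop showing where the two callbacks would be invoked.
# fnRunPhase stands in for whatever runs a phase's workflows; it is not a workr function.
RunPhasesSketch <- function(vPhases, lData, lWorkflowsByPhase,
                            fnRunPhase, fnPhaseInput = NULL, fnPhaseOutput = NULL) {
  lPhaseResults <- list()
  for (phase in vPhases) {
    lInput <- if (is.null(fnPhaseInput)) {
      # Default: initial data plus all prior results, merged flat
      do.call(c, c(list(lData), unname(lPhaseResults)))
    } else {
      fnPhaseInput(phase, lPhaseResults, lData, lWorkflowsByPhase)
    }
    result <- fnRunPhase(phase, lInput)
    if (!is.null(fnPhaseOutput)) result <- fnPhaseOutput(phase, result)
    lPhaseResults[[phase]] <- result
  }
  lPhaseResults
}

# Stub runner: records the names of the inputs it received
fnRunPhase <- function(phase, lInput) list(inputs = names(lInput))

res <- RunPhasesSketch(
  vPhases = c("1_mappings", "2_metrics"),
  lData = list(raw = 1),
  lWorkflowsByPhase = list(),
  fnRunPhase = fnRunPhase,
  fnPhaseInput = function(phase, lPhaseResults, lData, lWorkflowsByPhase) {
    # First phase gets the raw data; later phases get only the previous phase's results
    if (length(lPhaseResults) == 0) lData else lPhaseResults[[length(lPhaseResults)]]
  }
)
res$`1_mappings`$inputs
```

Keeping the default branch as a flat merge means calls that pass no callbacks would behave as they do today, which is one way to meet the backward-compatibility criterion.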

Acceptance Criteria

  • runWorkflows.R from the open.gismo demo branch can be fully replicated with a single RunProject() call (plus initial data loading)
  • Each phase receives exactly the data it needs (not a flat merge of everything)
  • Per-folder configuration mechanism exists (YAML, callbacks, or both)
  • Workflow definitions can be passed as data to downstream phases
  • Inter-phase transformations are supported
  • Backward compatible — existing RunProject() calls without config files continue to work
  • Unit tests cover all new data flow scenarios
  • Documentation updated with examples

Implementation Notes

  • The _config.yaml approach is more declarative and aligns with the YAML-driven workflow philosophy, but may not handle arbitrary transformations (like GroupID coercion).
  • The callback approach is more flexible but less portable.
  • A hybrid approach (YAML for common patterns + optional callbacks for custom logic) may be best.
  • Consider whether RunWorkflows() itself needs changes or if this is purely a RunProject concern.
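
One possible shape for the hybrid approach is a registry of named R functions that a config refers to by name, so the YAML stays declarative while the transformation logic stays in R. The sketch below is an assumption throughout: `lTransforms`, `ApplyPhaseOutput`, and the `coerce_group_id` key are hypothetical, and `transform` taking a single function name is only one reading of the proposed `output:` section.

```r
# Hypothetical registry of named transforms that a config could reference,
# e.g. `transform: coerce_group_id` under `output:` in _config.yaml.
lTransforms <- list(
  coerce_group_id = function(lResults) {
    lapply(lResults, function(x) {
      if (is.data.frame(x) && "GroupID" %in% names(x)) {
        x$GroupID <- as.character(x$GroupID)
      }
      x
    })
  }
)

# Apply the output section of a parsed config to a phase's results
ApplyPhaseOutput <- function(lResults, lOutputConfig, lTransforms) {
  strTransform <- lOutputConfig$transform
  if (!is.null(strTransform)) {
    if (!strTransform %in% names(lTransforms)) {
      stop("Unknown transform: ", strTransform)
    }
    lResults <- lTransforms[[strTransform]](lResults)
  }
  # wrap_as: nest all results under a single named key
  strWrap <- lOutputConfig$wrap_as
  if (!is.null(strWrap)) {
    lResults <- stats::setNames(list(lResults), strWrap)
  }
  lResults
}

# Toy phase 2 output (names are illustrative only)
out <- ApplyPhaseOutput(
  list(ResultA = data.frame(GroupID = 1:2)),
  list(transform = "coerce_group_id", wrap_as = "lAnalyzed"),
  lTransforms
)
str(out)
```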

Labels: enhancement (New feature or request)