Document current Pipeline Context use cases #17
Conversation
@sbooth-nrao Thank you for your work on combining our drafts into this document. I’ve left several comments for discussion. I have a couple of overall comments:
- There is a lot of non-use case content in this document. My suggestion is that we either trim this for brevity or move it to an appendix so the use cases are readable on their own.
- Several current use cases reference current implementation details (specific task names, class names, access patterns) at a level of specificity that makes it hard to evaluate whether RADPS needs to satisfy them. I think these should be abstracted up: written in terms of what the system needs to do, not how the current pipeline does it. These implementation details could be moved to an appendix for reference.
- For the future use cases, I think we should note whether each traces to a specific RADPS requirement, an aspect of the RADPS design that implies this use case, or is more of a ‘wish list’ item that may belong in a separate doc or section.
## 6. Architectural Observations
Sections 6-8 are interesting and useful for discussing the future context design, but I don't think that this document is the best place for them. I'd prefer to keep it streamlined and focused, so that when we pass this document around for review and feedback about missing use cases, the most relevant content is clear and isolated.
- Role-based access to context fields
- Audit logging of all context mutations
### FUC-07 — Partial Re-Execution / Targeted Stage Re-Run
I think this is a great idea, and I am also wondering if it is in scope. Can this be tied to a requirement for RADPS?
- A query API (REST, gRPC, or GraphQL)
- Type definitions shared across languages
### FUC-04 — Streaming / Incremental Processing
Can this be tied to a requirement from RADPS?
- Artifact references rather than filesystem paths for cal tables and images
- Tasks that can operate on remote datasets without requiring local copies
### FUC-03 — Multi-Language / Multi-Framework Access to Context
Nice to have -- is this a requirement from RADPS?
- A merge/reconciliation step when concurrent results are accepted
- Explicit declaration of which context fields each task reads and writes
### FUC-02 — Cloud / Distributed Execution Without Shared Filesystem
I think we could probably tie non-local execution to RADPS requirements.
If the plan is to map these "GAP" use cases to RADPS requirements, should they even be in this document?
This is a very good point. After reflection, I think my questions about specifically "RADPS requirements" were too narrow.
Here is my updated thought on this: I think each future use case should identify its source — whether that's an explicit RADPS requirement, something implied by the RADPS architecture, a known pain point, or something else. Without that it's hard to evaluate whether they belong here.
…ndix file; updated some wording choices for more accurate language and removed deployment-level GAP scenario
krlberry
left a comment
I left some more comments for discussion.
…t UC-06 into two use cases and update use case numbering
…date use case titles and numbering.
…n the pipeline (the Context with backticks).
…ecision making in downstream tasks. Make other assorted wording updates including removing references to removed use cases and standardizing actor names.
Use case edit suggestions
UC-1 metadata: also need cross-MS matching and lookup, which I'd make a separate item. This is one place where reinventing the wheel could be beneficial, because the current implementations use a "single master MS" and all MSes were originally assumed to have exactly the same sources, spw IDs, etc.

Here's a PDF version of this: context_use_cases_current_pipeline.pdf
**Implementation notes** — the current pipeline satisfies these needs through two different propagation paths:

1. **Immediate state propagation** — `Results.merge_with_context(context)` updates the calibration library, image libraries, and more so later tasks can access the current processing state directly.
2. **Serialized Results** — tasks read `context.results` to find outputs from earlier stages when those outputs are needed from the recorded results rather than from merged shared state. For example:
This pattern has crept in over time. The original idea was not to be dependent on parsing results objects outside their native tasks. Also, explicitly using the "previous" result makes assumptions about the recipe sequence. But even checking for an explicit results object type still requires knowledge of the class structure of another task. That's why we tried using the extra attributes like "clean_list_pending" etc. Though they are, as you wrote, a bit ad hoc and should probably at least have had a container class.
Thanks Dirk, I will add some clarifying language regarding the intended behavior versus adapted behavior, including specific example in the code where each is used. We can also make sure to include this as an example of context creep, where an intended behavior gets lost without strict contractual definitions.
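To make the two propagation paths (and the coupling concern raised above) concrete, here is a minimal hypothetical sketch. Only `Results.merge_with_context(context)` and `context.results` come from the notes above; the task, result class, and attribute names are invented for illustration:

```python
# Hypothetical sketch of the two propagation paths. Only
# Results.merge_with_context() and context.results are taken from the
# implementation notes; every other name here is invented.

class BandpassResults:
    """Stand-in for a task result class."""

    def __init__(self, caltable):
        self.caltable = caltable

    def merge_with_context(self, context):
        # Path 1: immediate state propagation -- fold this result into the
        # shared state so later tasks can read it directly.
        context.caltables.append(self.caltable)


def downstream_task(context):
    # Path 1: read the merged shared state.
    active = list(context.caltables)

    # Path 2: parse the recorded results of an earlier stage. As noted in
    # the discussion, this couples the task to another task's result class
    # and assumes a particular recipe order. (In the real pipeline these
    # entries are ResultsProxy objects that must be read explicitly.)
    previous = context.results[-1]
    if isinstance(previous, BandpassResults):
        active.append(previous.caltable)
    return active
```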
**Implementation notes** — `WebLogGenerator.render(context)` in `pipeline/infrastructure/renderer/htmlrenderer.py`:
- Reads `context.results` — unpickled from `ResultsProxy` objects, iterated for every renderer
Is there really a case where all results are automatically unpickled? I thought one always had to call "read()".
You are correct, Dirk. There is a line at htmlrenderer.py:1897 that does a mass read of all the result proxies, which was mistaken for automatic unpickling of all the result objects. I will modify the language to be more representative of the actual behavior.
- Most handlers call `context.observing_run.get_ms(vis)` to look up metadata for scoring (antenna count, channel count, SPW properties, field intents)
- Some handlers check `context.imaging_mode` to branch on VLASS-specific scoring
- Others check things in `context.observing_run`, `context.project_structure`, or the callibrary (`context.callibrary`)
- Scores are appended to `result.qa.pool`, so the scores are stored on the results rather than directly on the context.
I don't remember if this was due to some size consideration too. We can potentially have many QA score objects in the pool if there is one per detailed data selection (field/spw/pol/ant/baseline/...).
Added language to the UC-15 implementation notes indicating current implementation behavior. Also made a note to create explicit rules to restrict this in future designs.
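A hypothetical sketch combining the renderer and QA access patterns above. The handler structure and scoring rule are invented; only the attribute names (`context.results`, `observing_run.get_ms`, `imaging_mode`, `result.qa.pool`) and the explicit per-proxy `read()` come from the thread:

```python
# Hypothetical sketch of the renderer/QA access patterns. The function
# and scoring logic are invented for illustration only.

def render_and_score(context, result):
    # Explicit mass read of the result proxies (cf. the htmlrenderer.py:1897
    # discussion above -- proxies are not unpickled automatically).
    all_results = [proxy.read() for proxy in context.results]

    # Metadata lookup for scoring: antenna count, SPWs, field intents, ...
    ms = context.observing_run.get_ms(result.vis)
    score = 1.0 if len(ms.antennas) >= 3 else 0.5  # toy scoring rule

    # Branch on VLASS-specific scoring.
    if getattr(context, 'imaging_mode', None) == 'VLASS':
        score *= 0.9

    # Scores are stored on the result, not on the context; note the pool
    # can hold one score per detailed data selection (field/spw/pol/ant/...).
    result.qa.pool.append(score)
    return all_results
```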
- GAP-03: Provenance and reproducibility — requires immutable per-attempt records, input hashing, and lineage capture.
- GAP-04: Partial re-execution / targeted rerun — requires explicit dependency tracking and invalidation semantics at the context level.
- GAP-05: External system integration — requires stable identifiers, event subscriptions/webhooks, and exportable summaries/manifests.
- GAP-06: Multi-language access — requires a language-neutral schema and API for context state and artifact queries.
Do you mean "programming language"?
For a low-barrier approach it would be good to have something like a middleware layer so that the local language API does not need to expose the actual structure of how the context is stored. And since we/I think that the dev team should be able to add new items quickly, some "standard" data types (including dictionaries or equivalent) should be readily available.
I renamed GAP-06 to explicitly mention Programming Language and included a recommendation for a stable middleware layer.
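As one way to picture the middleware idea: a thin service could expose context fields as JSON so clients in any programming language never see the storage layout. This is a speculative sketch; the endpoint shape, field names, and in-memory store are all invented, and only Python standard-library calls are used:

```python
# Hypothetical middleware sketch for GAP-06: serve context state as JSON
# over HTTP so non-Python clients can query it without knowing how the
# context is stored. Everything here is invented for illustration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CONTEXT_STORE = {  # stand-in for the real persistence layer
    "observing_run": {"measurement_sets": ["uid___A002_X1.ms"]},
    "imaging_mode": None,
}

class ContextHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /context/observing_run -> JSON for that field only;
        # callers never learn whether the backing store is pickle or SQL.
        field = self.path.rstrip("/").split("/")[-1]
        body = json.dumps(CONTEXT_STORE.get(field)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ContextHandler).serve_forever()
```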
…-12/UC-14 impl notes; add GAP-08 (cross-MS matching); update GAP-06 title and summary
tnakazato
left a comment
@sbooth-nrao @krlberry thank you very much for your work. The document is comprehensive and a very good high-level summary of the use cases. I made a few comments. I would appreciate it if you could take a look.
…dd GAP-08, refine GAP-06)
Pull request overview
Adds documentation to capture how the current ALMA/VLA pipeline Context is used today, plus a gap list intended to inform RADPS context design.
Changes:
- Added a use-case catalogue for the current pipeline context (UC-01–UC-18).
- Added a “gaps” document enumerating missing capabilities and RADPS implications (GAP-01–GAP-08).
- Added an appendix describing current implementation details and code references for selected use cases.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| docs/context_use_cases_current_pipeline.md | New primary catalogue of context use cases and required capabilities. |
| docs/context_gap_use_cases.md | New list of capability gaps and implications for RADPS context design. |
| docs/context_current_pipeline_appendix.md | New appendix with implementation notes and references back to the codebase. |
My review showed up a little oddly because I apparently had un-posted comments from 3 weeks ago. All the requested changes from my review have already been made.
|       |         |
|-------|---------|
| **Actor(s)** | Data import task, any downstream task, heuristics, renderers, QA handlers |
| **Summary** | The context must load observation metadata (datasets, spectral windows, fields, antennas, scans, time ranges), make it queryable by all subsequent processing steps, and allow downstream tasks to update it as processing progresses (e.g., registering new derived datasets, data column and type changes, reference antenna selection). It must also be able to hold derived or cached metadata products created during import when later stages rely on them for efficiency rather than recomputing them from the raw measurement set, and it must provide a unified identifier scheme when multiple datasets use different native numbering. |
Just minor wording:

"Load" could be reworded to be more specific, such as "read," "ingest," or "register" (as already used in other places in the document), to avoid confusion with "load" in the sense of "already harvested—just need to load into memory."

Regarding "allow downstream tasks to update it as processing progresses (e.g., registering new derived datasets, data column and type changes, reference antenna selection)": I usually consider observation metadata to be read-only facts. We might occasionally need to correct them if the source is wrong (or apply the correction as a caltable, e.g., an antpos table), but it is not something that downstream tasks should typically update. Instead, we might say something like:

"... allow downstream tasks to access them as processing progresses, even if the underlying dataset has been updated or transformed into a new derived dataset from raw or preliminarily calibrated data."
I agree that "load" is not the best word here. "Ingest" may get conflated with archive ingestion and "register" sounds too official in my opinion, perhaps indicating immutability. Perhaps "populate" to signify the filling of metadata into a container-like object?
I agree with the change in language from "load" to "populate".
Yes, observation metadata in some sense is read-only aside from some early corrections...so perhaps I've conflated multiple things in this use case. There is:
- proper observation metadata that isn't updated aside from corrections (which should happen early on, right?)
- the contents of the `ObservingRun` class, which includes registering and managing MSes and virtual spws (this now effectively has its own use case)
- attributes like `reference_antenna`, `spwmaps`, etc., which are set later by design
Do some of these belong in a different use case?
|       |         |
|-------|---------|
| **Actor(s)** | Workflow orchestration layer, parallel worker processes |
| **Summary** | When work is distributed across parallel workers, each worker needs read-only access to the current processing state (observation metadata, calibration state). The context must provide a mechanism for workers to obtain a consistent snapshot of that state. Workers must not be able to modify the shared processing state directly. The snapshot mechanism must support efficient distribution to workers. |
Workers must not be able to modify the shared processing state directly. The snapshot mechanism must support efficient distribution to workers.
It is true that, in the current implementation, workers do not directly modify the shared/central processing state before task completion. However, I am slightly puzzled by the use of the word “must”:
Is this simply a description of the current implementation, or was it an original design requirement of the Pipeline?
I do not see a major issue with workers directly modifying the shared processing state, as long as proper concurrency state tracking and conflict resolution mechanisms are in place. That said, this comment might be more for the GAP use-case study.
You're right, this is a case where research for future implementation bled into the description of the current pipeline. I will modify the language to more accurately reflect the current implementation.
The immutability argument is definitely one that can be contentious. The current pipeline design has aspects of immutability, but overall the Context object is an open, mutable state container with minimal to no restriction with regard to data validation. This isn't always an issue, but there are some instances where this behavior can lead to consequences. For example, Result acceptance calls `merge_with_context` directly on the live context, which can leave a partial mutation in place after a failure that never gets rolled back. There are other instances of direct mutation that expose the state to drift and make it harder to reconstruct or audit after the fact. I am not suggesting full immutability, just tighter restrictions on certain interfaces to the context (e.g. use of setters and getters over direct mutation).
Yes, the "must" was in there because this was a use case derived from the current implementation that remained too tightly coupled to the current implementation structure. I also don't see any issue with workers directly modifying the shared processing state as long as it's done appropriately and have added suggestions about removing this language from the use case.
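One way to picture the "tighter interface" idea from this thread: route result acceptance through a transactional merge instead of mutating the live context directly. This is a hypothetical sketch with invented names; only the `merge_with_context` call and the partial-mutation failure mode come from the discussion above:

```python
# Hypothetical sketch: accept a result by merging into a copy and
# committing atomically, so a failure part-way through merge_with_context
# never leaves the live state partially mutated. All names are invented.
import copy

class GuardedContext:
    def __init__(self, state):
        self._state = state  # mutable only via accept()

    def get(self, field):
        return self._state[field]

    def accept(self, result):
        trial = copy.deepcopy(self._state)
        result.merge_with_context(trial)  # may raise part-way through
        self._state = trial               # committed only on success
```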
krlberry
left a comment
Thanks for incorporating the feedback from reviewers into this document. My main recurring feedback is about separating use case language from implementation and design details: summaries should describe what actors need to achieve rather than how the system should be built. I've left various comments and suggestions.
| **Postconditions** | Any past processing step can be reproduced or audited using the recorded provenance chain. |
| **RADPS requirements** | ALMA-TR103, ALMA-TR104, ALMA-TR105 |

Addendum: provenance should also capture hardware and execution-environment details (CPU architecture, node/cluster specification, kernel and MPI versions, and workload-manager/scheduler configuration and relevant scheduler limits), since non-deterministic behaviour has been observed even when software versions and inputs are identical across different hardware or scheduler setups.
I'd suggest just updating the main body of the use case to incorporate Rui's suggestions and removing the addendum
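For reference, a hypothetical sketch of what such an environment record could capture, wherever it ends up in the document. The field names are invented; `platform` and `os` are Python standard library, and the SLURM variables are standard scheduler environment variables (other schedulers would need their own keys):

```python
# Hypothetical execution-environment record for the provenance chain.
# Only the idea of capturing CPU/kernel/scheduler details comes from the
# addendum above; the field names are invented.
import os
import platform

def capture_environment() -> dict:
    return {
        "cpu_arch": platform.machine(),        # e.g. 'x86_64'
        "node": platform.node(),               # host name
        "kernel": platform.release(),          # kernel version
        "python": platform.python_version(),
        "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
        "slurm_cpus_on_node": os.environ.get("SLURM_CPUS_ON_NODE"),
    }
```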
### UC-02 — Store and Provide Project-Level Metadata

**Addendum**: conceptually separate *project metadata* (properties of the observation program such as PI, targeted sensitivities, beam requirements) from *workflow/recipe metadata* (processing recipes, execution instructions, heuristics parameters). Project metadata describes the scientific intent and constraints, while workflow metadata captures how the data should be processed; the two interact but have different origins and lifecycles. Recording this distinction helps ensure that processing recipes remain reusable across projects and that project-level constraints are treated as inputs to heuristics rather than being conflated with the workflow definition.
This addendum contains useful distinctions and definitions that I think are more clear when worked into the existing structure. As is, it's missing some context for why this note is here. I've left suggestions below.
**Implementation notes** — project metadata is set during initialization or import, is not modified after import, and is read many times:
Suggested change:
**Implementation notes** — project metadata (properties of the observation program such as PI, targeted sensitivities, beam requirements) is set during initialization or import, is not modified after import, and is read many times:
- **Write:** `context.callibrary.add(calto, calfrom)` — register a calibration application (cal table + target selection); `context.callibrary.unregister_calibrations(matcher)` — remove by predicate
- **Read:** `context.callibrary.active.get_caltable(caltypes=...)` — list active cal tables; `context.callibrary.get_calstate(calto)` — get full application state for a target selection
- Backed by `CalApplication` → `CalTo` / `CalFrom` objects with interval trees for efficient matching.
- The callibrary also supports de-registration of trial or reverted calibrations via predicate-based removal; implementations should ensure such removals are atomic and leave an audit entry so provenance is preserved when rollbacks or experiments occur.
This is mixing current implementation with future design.
Suggested change:
- The callibrary also supports de-registration of trial or reverted calibrations via predicate-based removal.
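A hedged sketch of the callibrary access pattern quoted above. The method names (`add`, `unregister_calibrations`, `active.get_caltable`, `get_calstate`) come from the notes; the `CalTo`/`CalFrom` arguments are simplified guesses that may not match the real signatures, and `context` is assumed to be an existing pipeline context:

```python
# Sketch of the callibrary read/write pattern. Method names come from the
# quoted notes; constructor arguments are illustrative guesses only.

# Write: register a calibration application (cal table + target selection).
calto = CalTo(vis='uid___A002_X1.ms', spw='17,19')
calfrom = CalFrom(gaintable='bandpass.bcal', caltype='bandpass')
context.callibrary.add(calto, calfrom)

# Read: list active cal tables of a type; fetch full application state.
tables = context.callibrary.active.get_caltable(caltypes='bandpass')
calstate = context.callibrary.get_calstate(calto)

# Remove a trial/reverted calibration by predicate.
context.callibrary.unregister_calibrations(
    lambda cf: cf.gaintable == 'bandpass.bcal')
```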
@@ -86,3 +92,6 @@ This document records capabilities the current pipeline context design cannot ye
| **Summary** | The current design provides limited support for heterogeneous multi-MS datasets through virtual SPW translation and per-MS data-column tracking, but many workflows still rely on a single reference-MS or master-MS model and do not expose general cross-MS matching semantics. The context must instead support heterogeneous multi-MS scenarios by providing: (1) cross-MS SPW matching with distinct semantics for exact matching (required by calibration tasks) and partial/overlap matching (required for imaging tasks that can combine overlapping spectral windows); and (2) data-type and column tracking across multiple MSes without assuming a shared layout. Because the current virtual-SPW translation mechanism is tightly coupled to the single-master-MS assumption, a fresh design is preferable to extending it. |
This description also contains notes about the current implementation and future design ideas in addition to the use case description.
| **Postconditions** | Calibration and imaging tasks can look up applicable SPWs and data columns across an arbitrary collection of heterogeneous MSes using the appropriate matching semantics for their use. |
| **RADPS requirements** | |

Addendum: real-world heterogeneous datasets frequently lack fully structured or consistent metadata linking related SPWs across different MOUS/EBs. The system should therefore support a flexible, high-level metadata management layer (for example, permissive flat labels and augmentations rather than rigid relational mapping tables) and explicit hooks for heuristic or user-supplied mapping rules (frequency/channel overlap heuristics, manual SPW mappings, etc.). When heuristics or user interventions are required, the context should record the override, the rationales or decision parameters used, and any associated model/version identifiers (including ML model versions) so decisions remain auditable and reproducible. Storing overrides in a structured form (for example, annotated decision trees or named mapping artifacts) will make it possible to query which metadata gaps commonly trigger manual fixes and to retrain or improve heuristics accordingly.
These are very useful design notes we could take into account for the future. Also, the ML model version info is more of an update to what provenance we need to save and track in the future.
Suggested change:
| **Summary** | Calibration tasks, imaging tasks, and heuristics must be able to match and coordinate data across heterogeneous collections of MSes that may not share native SPW numbering, column layout, or other assumptions. Calibration tasks require exact SPW matching; imaging tasks require partial/overlap matching to combine overlapping spectral windows. Where automated matching is ambiguous or fails, heuristics or users must be able to supply explicit mapping rules. |
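A hypothetical sketch of the two matching semantics named in this suggested summary (exact matching for calibration, frequency-overlap matching for imaging). The `Spw` record, field names, and overlap threshold are invented for illustration:

```python
# Hypothetical sketch of exact vs. overlap SPW matching. The Spw record
# and the min_frac threshold are invented; only the two matching
# semantics come from the suggested summary above.
from dataclasses import dataclass

@dataclass
class Spw:
    spw_id: int
    f_lo_ghz: float   # lower edge of the window
    f_hi_ghz: float   # upper edge of the window
    nchan: int

def matches_exact(a: Spw, b: Spw) -> bool:
    # Calibration: identical frequency coverage and channelisation.
    return (a.f_lo_ghz, a.f_hi_ghz, a.nchan) == (b.f_lo_ghz, b.f_hi_ghz, b.nchan)

def matches_overlap(a: Spw, b: Spw, min_frac: float = 0.1) -> bool:
    # Imaging: windows whose frequency ranges overlap by at least
    # min_frac of the narrower window can be combined.
    overlap = min(a.f_hi_ghz, b.f_hi_ghz) - max(a.f_lo_ghz, b.f_lo_ghz)
    narrower = min(a.f_hi_ghz - a.f_lo_ghz, b.f_hi_ghz - b.f_lo_ghz)
    return overlap > 0 and overlap / narrower >= min_frac
```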
…uirements mapping for GAP scenarios 06-08
Adds a document that catalogues how the current ALMA/VLA pipeline uses its `Context` object.
- context_use_cases_legacy_pipeline.md — 17 use cases (UC-01 – UC-17) describing what the current pipeline context does. The initial draft was merged from drafts by Berry and Booth.
- docs/context_current_pipeline_appendix.md — an appendix which describes the implementation of the current context use cases.