diff --git a/rfcs/text/0000-contributed-pipelines.md b/rfcs/text/0000-contributed-pipelines.md
new file mode 100644
index 00000000..7bd8b71e
--- /dev/null
+++ b/rfcs/text/0000-contributed-pipelines.md
@@ -0,0 +1,161 @@
+### DCP PR:
+
+***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*
+
+`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/)`
+
+# Contributed Pipelines
+## Summary
+We would like to provide a greater variety of analyzed data to users and engage with the scientific community by accepting community-contributed pipelines into the HCA DCP. This RFC proposes technical and scientific requirements for these pipelines as well as draft guidelines for the contribution process.
+
+## Author(s)
+ [Kylee Degatano](mailto:kdegatano@broadinstitute.org)
+
+ [Ambrose Carr](mailto:acarr@chanzuckerberg.com)
+
+ In partnership with Kathleen Tibbetts, Tim Tickle, Clare Bernard, Marcus Kinsella, and the DCP Data Pipelines team.
+
+## Shepherd
+***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*
+
+*Recommended format for Shepherds:*
+
+ `[Name](mailto:username@example.com)`
+
+## Motivation
+As the DCP accepts data from assays, it takes on responsibility for eventually processing that data and returning analysis products to our contributors. The core mission of the HCA DCP is to process and make available the diverse data types comprising the reference atlas.
+
+This is a complex task best accomplished by learning from the scientific community. By rapidly inducting existing pipelines into the DCP, and then improving them based on user feedback and demand, we can quickly build capacity to process diverse data types. By focusing on adding breadth across data types, we provide a platform to advance computational and assay diversity in support of the HCA, improving its quality and accelerating its construction.
+
+This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to fulfill standard requirements.
+
+## User Stories
+- As a user of the DCP (both user archetypes, but especially the researcher with a pipette), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high-quality pipelines.
+- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
+- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.
+
+## Definitions
+**Assay:** A biological and technological process that transforms a biological specimen into interpretable raw data, ideally of a known and standardized format. An assay generates output that must adhere to a specific raw data format.
+
+**Pipeline:** A collection of one or more functional tasks that operate on input data, either transforming it or deriving features often used to interpret it. In a high-throughput setting, these tasks are often automated and run in batch.
+
+## Detailed Design
+### Criteria for Consideration
+#### Technical standards for consideration
+The DCP commits to processing included assay types with high-quality pipelines that can operate at scale. Thus, there are some basic technical standards that a pipeline must meet to be eligible for the DCP. A pipeline must:
+
+- Be open source (with MIT, ISC, Simplified BSD, or Apache 2.0 license), and hosted on GitHub or a similar code versioning and sharing platform.
+- Be under active development (e.g. recent commits/releases, responsiveness to bug reports, known pipeline maintainer(s)), with docs, bug fixes, and testing.
+- Not have any restrictions on the use of pipeline output(s).
+- Have multiple test datasets available for use by the Data Processing Service to evaluate the validity of the results and serve as a benchmark for future improvements.
+- Provide acceptable ranges and a method for validation of any required run-specific parameters (example: starfish).
+- Provide pipeline outputs using standard formats (if such standards exist).
+- Utilize public, open source tools.
+
+Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.
+
+These requirements, and in particular the maintenance requirements, should be clearly communicated to contributors, and the DCP should make an effort to verify that the contributors understand them.
+
+#### Scientific standards for consideration
+Although the characteristics of contributed pipelines are not as well understood as those of standard HCA pipelines, we should take steps to avoid installing a pipeline into production that could produce misleading or bad scientific data. To accomplish this, contributed pipelines should demonstrate that they have been vetted by members of the scientific community, in addition to meeting the technical requirements for consideration. Contributed pipelines must be shown to produce meaningful scientific results, a requirement that is met by any of the following:
+- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
+- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
+- Is known to an AWG member who is willing to vouch for the pipeline.
+- Is used by 3 or more experts in the field, all of whom confirm they have used it to successfully analyze data and can point to that analysis.
+- Is actively being used in a scientific consortium and has the endorsement of its Analysis Working Group or equivalent scientific leadership.
+
+### Contribution process
+#### Draft of user-facing contribution guidelines
+To contribute a pipeline, you can create a workspace in Terra, a cloud platform for batch and interactive analysis. The workspace should provide as much information as possible to enable the pipelines team to connect the pipeline to the DCP. To contribute, you will need to follow the steps outlined below. If you have questions about the contribution process, please contact pipelines-team@data.humancellatlas.org.
+1. Write a WDL 1.0 workflow that encapsulates the pipeline. The tasks of this pipeline will need to be containerized in order to run in the cloud and in Terra. The containers must be public to be accepted into the HCA.
+2. Upload the WDL to the public Terra tools repository, with configurations, data, and descriptions for each mode the pipeline can be run in, and import the tool into a workspace.
+    1. Another option is to put the tool into Dockstore and then link it to Terra.
+3. Upload small testing data, necessary references, and any benchmarking datasets to the workspace. Ensure these data are eligible for public, open use.
+4. Write a Markdown-formatted workspace description that summarizes:
+    1. the data being analyzed,
+    2. the way the input data is generated,
+    3. the computational stages of the pipeline,
+    4. the output data,
+    5. and how the pipeline meets the scientific and technical contribution standards.
+5. Run the pipeline in the workspace, in each mode, with at least the test data, so that the outputs can be verified. If there are data in the HCA DCP that work with your pipeline, demonstrate analyzing this data with your pipeline in the workspace.
+6. Using another WDL tool, write a checker test that verifies the outputs of the pipeline meet the technical and scientific expectations for each run mode of the pipeline.
+7. Share the workspace, granting write access to pipelines-team@data.humancellatlas.org, for internal review by the DCP pipelines team. The workspace will then be announced to the DCP and HCA community for review.
+    1. During review, we may request instructions and code to read the output file into a common scientific computing language as a sparse or dense array.
+8. Respond to the community and DCP concerns, and update the submission as requested.
+9. When the pipeline has met the criteria for consideration and has been approved by the DCP and community, the DCP pipelines team will tag the workspace "HCA-contributed-pipeline". The DCP will then determine when it can be prioritized to be pulled into production, and will communicate the timeline and expectations with the contributor and community (see the Prioritizing contributions section below).
+10. Running pipelines in the cloud and in Terra requires a Google Cloud billing project. If you are interested in contributing and have difficulty obtaining a Google Cloud billing project, please reach out via the above email.
+
+#### Why Terra?
+Contributing pipelines via Terra has a few benefits:
+1. Ensures the pipeline can be run in the DCP pipeline infrastructure.
+2. Provides support for contributors developing and testing the pipelines.
+3. Makes the pipelines immediately available to the community, so users can run them on HCA data even when a useful pipeline cannot yet be accepted into the DCP, for example because it has not met an acceptance criterion or has not been prioritized for production for a reason listed in the Prioritizing contributions section.
+4. Enables the contributor to test the pipeline on HCA data and demonstrate functionality and performance.
+
+The guidelines above are drafted as user-facing text. When we publish them officially, we will include links to tutorials and more background on the contribution requirements outlined in this document.
+
+### Prioritizing contributions
+It will take some effort on behalf of DCP's engineers and computational biologists to assist external developers in adapting their pipelines for use in the DCP. As a result, the order of incorporation for contributed pipelines will consider factors like:
+
+- The amount of data in the DCP for that assay.
+- The rate at which new data for that assay is being added, based on the HCA Data Roadmap.
+- Relative difficulty of adapting the pipeline to run in the DCP.
+- Value to the community in exposing this data to users (can it be made available in a different service?)
+- Risk associated with inaction (will we lose the ability to include data in the reference atlas?)
+- The amount of external developer bandwidth to support pipeline incorporation.
+- HCA Community feedback (for example, polls of the HCA community that ask them to rank the 5 assays they think will be most important in the coming year).
+- Whether there is an existing pipeline in the HCA that serves the same data as the contributed pipeline.
+- Support for this data type in HCA metadata.
+- Inclusion of the pipeline in the HCA Scientific Roadmap or communication from HCA scientific stakeholders that the assay is a priority.
+
+When a pipeline is prioritized for incorporation into the DCP, the contributor will be contacted and reminded of the requirements outlined here prior to running in production. The pipeline will be documented on the HCA Data Portal, where the author will be cited for their contribution. Additionally, analysis data produced by the pipeline will be labeled in the HCA analysis metadata with, at minimum, the pipeline author and contact info, and a permanent link to the pipeline reference workspace.
+
+## Operating pipelines
+### Responding to Failures
+Operating any DCP pipeline on relevant data may result in occasional pipeline failures. The pipelines team will use the following procedure to resolve these failures:
+1. The pipelines team will take action to debug the workflow, determining if the workflow has failed due to the infrastructure, the input data, or a bug in the pipeline. The pipeline will be restarted in production as soon as possible.
+    1. An appropriate timebox for debugging issues in the pipeline or inputs will be established.
+2. Should the team be unable to debug the failures, the team will reach out to the contributor via email to assist in debugging.
+3. The contributor will be expected to work with the team to debug and resolve the failure.
+4. If the pipeline is failing suddenly and/or regularly (>2% of workflows/quarter failing) due to qualities of the pipeline itself, the pipeline will be paused in production until the issue is resolved, or it will be considered for decommissioning.
+
+### Decommissioning Pipelines
+Supporting pipelines that no longer provide value to the DCP represents an unnecessary cost for the Data Processing Service. The following events can trigger an evaluation of whether a pipeline should be decommissioned:
+
+- No new data produced in 12 months.
+- Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred.
+- A requirement for inclusion ceases to be met (see the “Criteria for Consideration” section).
+- Operational failure rate in a quarter surpasses 2%.
+- Contributor is not responsive to requests for help debugging operational failures.
+- A standard pipeline is instantiated and supports the same use cases.
+- A contributor asks for the pipeline to be decommissioned.
+
+The DCP reserves the right to decline to update a pipeline when a new version is released (e.g. if the update is not a priority, or the update does not support a current DCP user need).
+
+## Prerequisite work to enable pipeline contribution
+The DCP needs the following capabilities before we can begin to accept contributed pipelines. These systems could be very lightweight to start (e.g. we could manually email users failure logs).
+
+- The DCP must distinguish between contributed pipeline data and releasable data.
+- The DCP must confirm that users understand that the DCP has not validated contributed pipelines or the resulting analysis data.
+- Create a system to provide pipeline failure logs to pipeline contributors to make them aware of failures and enable them to debug the problems.
+
+## Ongoing DCP work needed to support pipeline contribution
+The following deliverables may be needed to support integrating contributed pipelines into the DCP.
+- Translate the description of the data that should be analyzed into an appropriate query.
+- Connect pipeline to HCA infrastructure to run in production and confirm that the pipeline runs as expected.
+- Confirm that a completed pipeline execution contains a record of pipeline provenance.
+- Communicate with contributors in a timely manner about operational troubleshooting.
+- Create documentation describing pipelines.
+- Review contribution workspaces.
+- Ensure metadata in the HCA describes this data sufficiently.
+- Ensure the outputs of these data can be served to users by the matrix service, data portal, DCP CLI, and other DCP services as appropriate.
+- Ensure ingest can validate the input data.
+- Ensure the data store can support the data.
+- Decommission pipelines quarterly as needed.
+
+## Productionizing pipelines
+This document describes the criteria that a pipeline must meet to be considered eligible for inclusion in the DCP. Contributed Pipelines may be further developed into "Standard Pipelines", which are engineered for efficiency by DCP engineers and vetted for scientific excellence with the Analysis Working Group. The pipelines team and AWG are responsible for Standard Pipelines. The contributor will be cited for the pipeline and consulted for feedback on the benchmarking. The contributor will no longer be responsible for debugging failures.
+
+### Unresolved Questions
+- Are any other DCP components concerned about this proposal, or do they have ideas on how to protect themselves from undue operational burden?
+- As we implement this process, we expect to iterate on it based on user and DCP feedback.
\ No newline at end of file