-
Notifications
You must be signed in to change notification settings - Fork 17
RFC: Bundle Types #86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
82b1302
WIP
kislyuk 4734480
Wording fixup
kislyuk c53a3ce
Add Query Service use case
kislyuk b4d5f5e
Add shepherd
kislyuk 43e0d86
Expand reference bundle user story to clarify
kislyuk 4d0a588
Rename analysis-support to analysis-reference
kislyuk 2c4a65e
Clarify bundle type registry ownership
kislyuk 6f3c135
Address comments in review
kislyuk f6d2cad
WIP
kislyuk File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,124 @@ | ||
| ### DCP PR: | ||
|
|
||
| ***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:* | ||
|
|
||
| `[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)` | ||
|
|
||
| # RFC: Bundle Types | ||
|
|
||
| ## Summary | ||
|
|
||
| Formalizes *de facto* data bundle types, definitions, target users/consumers, and specific use cases to improve clarity | ||
| and confidence among DCP developers when working with the DCP data model. | ||
|
|
||
| ## Author(s) | ||
|
|
||
| * [Mallory Freeberg](mailto:mfreeberg@ebi.ac.uk) | ||
| * [Brian Hannafious](mailto:bhannafi@ucsc.edu) | ||
| * [Hannes Schmidt](mailto:hannes@ucsc.edu) | ||
| * [Andrey Kislyuk](mailto:akislyuk@chanzuckerberg.com) | ||
|
|
||
| ## Shepherd | ||
|
|
||
| * [Andrey Kislyuk](mailto:akislyuk@chanzuckerberg.com) | ||
|
|
||
| ## Motivation | ||
|
|
||
| There are currently no documented definitions for what a "bundle" is in the HCA DCP, including no definitions of what | ||
| they are, what they contain, or who they support. Lack of clarity causes confusion for DCP developers and data consumers | ||
| about how the data and metadata in the DCP are structured and organized. | ||
|
|
||
| ### User Stories | ||
|
|
||
| 1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and | ||
| analysis tools. | ||
|
|
||
| 1. As a tool developer, I would like to identify and get alignment results and count matrices in order to produce | ||
| expression matrices. | ||
|
|
||
| 1. As a computational biologist, I would like to get expression matrices for the Immune Cell Atlas project so that I can | ||
| do my research. | ||
|
|
||
| 1. As an external user of the DSS who is developing analysis tools, I would like to browse, query, and get raw and | ||
| processed data based on some rough criteria so that I can understand what the data are and develop my tool. | ||
|
|
||
| 1. As an Azul or Query Service developer I want to be able to distinguish between different types of bundles in order to | ||
| know which bundle to index, link, etc. or to provide bundle types to users as a search criterion. | ||
|
|
||
| 1. As a computational biologist, I would like to use the same references that the HCA DCP uses in their analysis. | ||
|
|
||
| ## Detailed Design | ||
|
|
||
| ### What is a bundle? | ||
|
|
||
| A bundle is a unit of data organization in the DCP, roughly equivalent to a folder or directory. Bundles in DCP can be | ||
| categorized into data bundles and utility bundles. | ||
|
|
||
| * Data bundles contain the results of a single data transaction. They can be viewed as the digital equivalent of an | ||
| "instrument run" - a quantity of data gathered from a single process instance. Data bundles are annotated with | ||
| metadata using a procedure designed by the metadata team. | ||
|
|
||
| * Utility bundles exist to organize (group) data bundles and represent data of non-DCP origin (such as reference data) | ||
| that is used in the course of analysis. | ||
|
|
||
| ### How are bundle types defined? | ||
|
|
||
| Operationally, bundle types can be defined like file types or media types: they are used as a data type identifier for | ||
| applications to pattern match on whether data should be considered as suitable for input to the application. | ||
|
|
||
| ### When is a new bundle type needed? | ||
|
|
||
| * There are 3 classes of data bundles currently provided or concretely envisioned in DCP: primary data bundles, | ||
| secondary analysis bundles, and tertiary (meta-analysis or aggregate analysis) bundles. A new class of bundle could be | ||
| needed if a new analysis modality or data aggregation method were to be introduced. | ||
|
|
||
| * As denoted in the bundle type notation sketch below, [RFC 7231](https://tools.ietf.org/html/rfc7231) parameters can be | ||
| used to designate the process that generated a bundle. A new class of process (for example, a new secondary analysis | ||
| pipeline) could cause a new `bundle-type` value with the same bundle type and subtype as previous secondary analysis | ||
| data bundles, but with a new parameter value for `hca-pipeline`. | ||
|
|
||
| * Similarly, a new modality or format of primary experimental output data would cause a new primary data bundle | ||
| `bundle-type` value. | ||
|
|
||
| ### Conveying bundle types | ||
|
|
||
| A new field, `bundle-type`, will be introduced to the DSS bundle manifest (the response to `GET bundle` in DSS). The | ||
| field value is a string formatted in a similar manner | ||
| to [DCP Media Types](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw) and consistent | ||
| with the syntax defined in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1). DSS must require the bundle | ||
| type to be specified when creating a new bundle, but should not enforce the vocabulary. DSS must allow subscriptions | ||
| and search results to be predicated on bundle type using the same string predicates available with other metadata and | ||
| manifest string fields. | ||
|
|
||
| ### Registry of Bundle Types | ||
|
|
||
| Introduce a registry of bundle types to be maintained by the Data Store team with specification input from the Metadata | ||
| team. The initial contents of the registry are: | ||
|
|
||
| | Bundle Type | Bundle Class | Example `bundle-type` value | | ||
| |--------------------------|---------------------|-------------------------------------------------------------| | ||
| | Project Metadata Bundle | Utility bundle | `hca/project` | | ||
| | Primary Sequence Bundle | Primary data bundle | `hca/primary-data; hca-data-type:SmartSeq2` | | ||
| | Primary Imaging Bundle | Primary data bundle | `hca/primary-data; hca-data-type:sptx` | | ||
| | Secondary Analysis Bundle| Analysis data bundle| `hca/analysis-output; hca-pipeline:snap-atac` | | ||
| | Reference Resource Bundle| Utility bundle | `hca/analysis-reference; hca-resource-type:reference-genome`| | ||
| | Expression Matrix Bundle | Analysis data bundle| `hca/analysis-output; hca-resource-type:expression-matrix` | | ||
|
|
||
| The registry would be maintained either as part of this git repository (HumanCellAtlas/dcp-community) or as part of a | ||
| repository chosen by the Data Store team, subject to a pull request-based contribution model. | ||
|
|
||
| ### Unresolved Questions | ||
|
|
||
| - What defines a "correct" bundle type? | ||
|
|
||
| - Is the level of detail for the operational requirements of a bundle registry sufficient? | ||
|
|
||
| ### Prior Art | ||
|
|
||
| [Submission to bundles SOP](https://docs.google.com/document/d/1x8mYLU8ubpZtTrzkwJrft1heqReX-pJLjJfb2X0fd2w) (Apr 2018) | ||
| [Bundle metadata schema update ticket](https://github.com/HumanCellAtlas/metadata-schema/issues/986) | ||
|
|
||
| ### Alternatives | ||
|
|
||
| - Status Quo: no well-defined bundle types. Existing DCP components continue to infer bundle types via ad hoc methods | ||
| and included metadata. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a syntax note, numbering is not incrementing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyTickle that is intentional. Markdown supports automatic numbered list incrementing, so that items can be re-arranged more easily. The rendered version increments the list properly.