RFC: Data processing robustness #125
### DCP PR:

*Leave this blank until the RFC is approved; then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# RFC: DCP data processing robustness principles

## Summary

This document describes requirements for individual DCP components related to robustness against incorrect, corrupt, or unexpected data. It also lays out expectations for data integrity and error handling in DCP software. It does not specify implementation details or technology.

## Author(s)

[Mark Diekhans](mailto:markd@ucsc.edu),
[Tony Burdett](mailto:tburdett@ebi.ac.uk)

## Shepherd

[Tony Burdett](mailto:tburdett@ebi.ac.uk)

## Motivation

The HCA DCP is driven by input generated from external sources. As most developers of biological data processing systems can attest, such input is highly variable and error-prone. Collecting, storing, and processing this kind of content poses challenges for software: systems must be highly robust against all possible types of problematic data, and when unexpected content is encountered, it must be handled reliably and predictably.

As well as capturing data (e.g. sequencing FASTQ files, sequence alignment BAM files, images), the HCA DCP contains experimental "metadata". This metadata precisely describes biological materials, experimental processes, and protocols, and as such is considered by most scientists in biomedical research to be a source of [data about an experiment](https://en.wikipedia.org/wiki/Metadata#In_biomedical_research). This differs from how many software engineers think of metadata, for example [on the internet](https://en.wikipedia.org/wiki/Metadata#On_the_Internet) or in [digital media](https://en.wikipedia.org/wiki/Metadata#In_broadcast_industry). Critically, DCP experimental metadata is part of a searchable, informational record and has value in its own right, distinct from merely annotating data files. It is therefore important that the DCP can adequately handle and tolerate wide variation in its metadata, rather than treating metadata as part of a static API description.

The single-cell community is part of a new field, and optimal experimental designs are currently poorly understood. This leads to high variability in the experiments being conducted, and therefore in the descriptions of the experimental techniques and technologies used. This, in turn, creates highly variable and highly volatile metadata, both in the experimental design (typically expressed as a graph capturing experimental steps) and in the descriptive elements required (captured in the metadata schema itself). Over time, experimental metadata in the DCP will therefore span a wide range of experiment designs and schema versions. This variability will be many orders of magnitude higher than the variability observed in the types of data captured by the DCP.

Most consumers of DCP data will likely be interested in only some of the experimental metadata. In some cases this will be confined to a subset of the available metadata (e.g. "which dissociation protocol was used in this experiment?"), and in other cases to certain experimental designs ("how was the organoid created in this experiment?"). In both scenarios we will observe differences: over time, the set of fields needed to adequately define dissociation protocols will likely change, and not all experimental designs make use of organoids.

Given the manually created nature of experimental metadata, its high variability, and the needs of consumers, errors and mismatches are likely to occur. Consumers' assumptions will prove invalid, highly unusual experimental designs never considered when the software components were written will be submitted, and metadata that is not backward compatible, or is simply incorrect, will end up in the DCP.
> **Contributor:** The purpose of metadata & wranglers is to both limit inaccuracies in metadata and keep consistency in how experiments are captured, even as the models of those experiments vary. Rather than call out the inevitable errors in our manual processes within one component, I think a bigger point to make here is that the science we are wrangling is fresh and still developing in unpredictable ways. Thus, the DCP needs to be prepared to manage a changing model, because with each new submission we have a high chance of needing to update our model in order to capture everything accurately (but still in a standard way; this might be in disagreement with the statement about 'not backward compatible').

Data integrity for biological data sets goes beyond simply ensuring the integrity of individual files, however. All files in a coherent data set (e.g. all FASTQ files from a single sequencing experiment) must be valid to the same standard, and the experimental metadata that references these files must correctly describe, for example, which files were generated from which technique. The partitioned nature of many single-cell experiments, along with the continuous processing design of the DCP, makes it hard even to define a *complete data set*, let alone ensure its integrity.

To create a robust and resilient system, DCP components must be engineered with a liberal attitude to data and experimental metadata, being highly tolerant of errors in data and metadata, integrity problems, and unexpected content. This means DCP components will sometimes encounter data that they cannot handle; this is an acceptable compromise as long as such cases are highly visible. This RFC proposes several strategies for engineering data processing systems for robustness and data integrity.
### User Stories

- As a DCP developer, I don't want to make urgent fixes to software due to differences or errors in data, so that development efforts can be planned and managed in a controlled manner rather than a reactive one.
- As a DCP submitter, I want my data to be ingested promptly, without waiting for DCP developers to make modifications to handle my experiment, so that my lab can move on to other tasks.
- As a DCP data consumer, I don't want to receive scientifically incorrect, inaccurate, or incomplete data caused by developers attempting to work around problematic data, so that I can spend my time on actual research.
## Detailed Design

This RFC requires two main strategies for making changes across the DCP that will ensure greater reliability and predictability of data processing. These strategies will reduce coupling between components, giving development teams the opportunity to react in a controlled, planned manner whenever data is encountered that fails to meet the assumptions or expectations of DCP software components.

The proposed strategies are:
1. *Log and Continue*
2. *Repair and Recover*
### Log and Continue

- DCP components define their expectations of any data they receive (see [Querying DSS by Metadata Schema Version(s)](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0011-query-by-metadata-schema-versions.md)).
- DCP components ignore data they retrieve if it is mismatched against their expectations (e.g. bundles contain unrecognized file formats such as PDFs).
- DCP components skip processing of data and fail gracefully when they encounter an error, rather than crashing or producing incorrect results.
- DCP components log all errors in a manner that allows them to be detected and addressed. See [Monitoring for production systems](https://github.com/HumanCellAtlas/dcp-community/blob/master/rfcs/text/0002-monitoring-for-production-systems.md).
- DCP components report correct error statuses (e.g. via HTTP response codes) to clients as appropriate.
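The behaviors above can be sketched as a processing loop. This is a minimal, hypothetical illustration: `process_bundles`, `EXPECTED_FORMATS`, and the bundle dictionary shape are invented for this example and are not part of any DCP API.

```python
import logging

logger = logging.getLogger("dcp.component")

# Hypothetical: file formats this component declares it can handle.
EXPECTED_FORMATS = {".fastq.gz", ".bam", ".json"}

def process_bundles(bundles, process_one):
    """Log and Continue: process each bundle with process_one, skipping
    (rather than crashing on) unexpected or failing data."""
    results = []
    for bundle in bundles:
        unexpected = [f for f in bundle.get("files", [])
                      if not any(f.endswith(ext) for ext in EXPECTED_FORMATS)]
        if unexpected:
            # Record the mismatch so it can be detected and addressed,
            # then skip rather than produce incorrect output.
            logger.error("bundle %s skipped: unrecognized files %s",
                         bundle["uuid"], unexpected)
            continue
        try:
            results.append(process_one(bundle))
        except Exception:
            # Fail gracefully: log with traceback, skip this bundle.
            logger.exception("bundle %s failed; skipping", bundle["uuid"])
    return results
```

A bundle with an unrecognized file (say, a PDF) or one whose processing raises an exception is logged and skipped, while the remaining bundles still produce results.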
### Repair and Recover

- DCP components include operator admin functions to manually trigger automated steps that may have failed.
- In case of errors that cause some data to be skipped, DCP components ensure that all data expected to be handled together (e.g. all data from a single project) completes processing before any of that data is made available.
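These two requirements can be sketched as a small completion tracker. The class, its methods, and the item identifiers are hypothetical, shown only to illustrate gating release on group completion and an operator-triggered retry; they do not describe an existing DCP interface.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectTracker:
    """Repair and Recover sketch: gate release of a project's data until
    every expected item has completed, and let an operator retry failures."""
    expected: set
    completed: set = field(default_factory=set)
    failed: set = field(default_factory=set)

    def mark_completed(self, item):
        self.failed.discard(item)
        self.completed.add(item)

    def mark_failed(self, item):
        self.failed.add(item)

    def releasable(self):
        # Nothing is made available until the whole group
        # (e.g. all data from a single project) has completed.
        return self.completed == self.expected

    def retry_failed(self, step):
        """Operator admin function: re-run an automated step on each
        failed item; successes move back into the completed set."""
        for item in sorted(self.failed):
            if step(item):
                self.mark_completed(item)
```

With one item failed, `releasable()` stays false for the entire project; after an operator retry succeeds, the whole group becomes available at once.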
### Unresolved Questions

- This RFC is expected to raise questions that require a more detailed specification of the requirements.
- How each component determines and reports compliance with these principles needs to be defined.
### Prior work:

- [The Harmful Consequences of the Robustness Principle](https://tools.ietf.org/html/draft-iab-protocol-maintenance-03) - Despite the name, this IETF draft very much supports a premise of this RFC, which is not to attempt to process unexpected data, but rather to skip it and move on in a principled manner.
> **Collaborator:** Your "skip and move on" statement is more reflective of Postel's principle, which this internet draft is responding against: "Time and experience shows that negative consequences to interoperability accumulate over time if implementations apply the robustness principle."
>
> **Author:** This is an interesting and nuanced discussion. Postel's principle says to ignore and move on, while this specification is about recording the error, moving on, and fixing it later. So superficially it might look like Postel's; however, it is about both not stopping processing and not losing data or creating incorrect results. This treats the DCP more like a credit card system than Twitter: don't let one bad transaction go undetected, but don't let it bring down the whole system either. Right now, changes upstream are hard to make because of the fear of breaking downstream, which makes it very difficult to bring in new data.
>
> **Contributor:** This is a very important conversation, and I am glad to see it happening. I would imagine no one wants any component to go down due to data or metadata that is not well understood, so I am echoing that resilience of components against the unknown is critical. In my experience over years of working with many scientists, researchers will use any output given to them, even from a failed run with extensive logging indicating the data is wrong; the mere existence of output is validation enough that the analysis was successful. It is important that, while components do not fail, the system does not progress metadata or data that is not understood, and deletes or does not store any derived product from (meta)data that encounters an error. Clear signals should exist in components throughout the life-cycle of the data to alert operations so the scenario can be resolved (and that resolution hopefully automated to handle the next occurrence) without breaking the operation of the component. With such an open system as ours, we must also ensure users cannot access outputs associated with (meta)data in question, and that we have sufficient validation to check our assumptions about (meta)data.
>
> **Author:** Exactly @TimothyTickle. Outputting incorrect results on error is far worse than crashing. This RFC forbids such behavior.
>
> **Contributor:** It's difficult to handle something predictably when that something is unexpected. But I think this is semantics, and likely we want all things that fall into the bin of "not the norm" to be handled reliably and predictably.