RFC: Data processing robustness #125
Conversation
Expands the motivation section and adds details of experimental metadata variability
Super messy WIP in this commit! This will need a lot of cleaning up before it is ready to go.
- How each component determines and reports compliance to these principles needs to be defined.
### Prior work:
- [The Harmful Consequences of the Robustness Principle](https://tools.ietf.org/html/draft-iab-protocol-maintenance-03) - Despite the name, this IETF draft very much supports a premise of this RFC, which is to not attempt to process unexpected data, but rather to skip it and move on in a principled manner.
Your skip-and-move-on statement is more reflective of Postel's principle, which this internet draft is arguing against: "Time and experience shows that negative consequences to interoperability accumulate over time if implementations apply the robustness principle."
The guidance includes:
A well-specified protocol includes rules for consistent handling of aberrant conditions. This increases the chances that implementations have interoperable handling of unusual conditions.
Intolerance of any deviation from specification, where implementations generate fatal errors in response to observing undefined or unusual behaviour, can be harnessed to reduce occurrences of aberrant implementations. Choosing to generate fatal errors for unspecified conditions instead of attempting error recovery can ensure that faults receive attention.
This improves feedback for new implementations in particular. When a new implementation encounters an intolerant implementation, it receives strong feedback that allows problems to be discovered quickly.
This is an interesting and nuanced discussion.
Postel's principle says to ignore and move on, while this specification is about recording the error, moving on, and fixing later. Superficially it might look like Postel's principle; however, it is about both not stopping processing and not losing data or creating incorrect results.
This treats the DCP more like a credit card system than Twitter: don't let one bad transaction go undetected, but don't let it bring down the whole system either.
Right now, changes upstream are hard to make because of the fear of breaking downstream. This makes it very difficult to bring in new data.
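To make the record-error-and-continue idea concrete, here is a minimal sketch in Python. All names (`process_batch`, `BatchResult`, the `square` handler) are hypothetical illustrations, not part of any DCP component: one bad item never halts the batch, but every failure is captured for later triage rather than silently dropped.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class BatchResult:
    """Illustrative container: successful outputs plus a record of skipped items."""
    processed: list = field(default_factory=list)
    errors: list = field(default_factory=list)  # (item, reason) pairs for later triage

def process_batch(items: list, handler: Callable[[Any], Any]) -> BatchResult:
    """Record the error, move on, fix later: skip bad items without stopping."""
    result = BatchResult()
    for item in items:
        try:
            result.processed.append(handler(item))
        except (ValueError, KeyError) as exc:  # only anticipated data errors
            result.errors.append((item, repr(exc)))  # record, then move on
    return result

def square(x):
    """Hypothetical per-item handler that rejects non-numeric input."""
    if not isinstance(x, int):
        raise ValueError(f"not numeric: {x!r}")
    return x * x

batch = process_batch([1, 2, "oops", 4], square)
# batch.processed == [1, 4, 16]; "oops" is recorded in batch.errors, not lost
```

The key design choice, per the credit-card analogy above, is that the error list is a first-class output of the run, so operations can resolve (and ideally automate away) each recorded failure.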
This is a very important conversation; I am glad to see it happening. I imagine no one wants any component to go down due to data or metadata that is not well understood, so I am echoing that resilience of component operation against the unknown is critical.

In my years of working with many scientists, my experience was that researchers will use any output given to them, even from a failed run with extensive logging indicating the data is wrong; the mere existence of output is validation enough that the analysis was successful. So it is important that, while components do not fail, the system also does not propagate metadata or data that is not understood, and deletes (or never stores) any product derived from (meta)data that encountered an error.

Clear signals should exist in components throughout the life-cycle of the data to alert operations to the problematic (meta)data so the scenario can be resolved (and that resolution hopefully automated to handle the next occurrence) without breaking the operation of the component. With such an open system as ours, it is important we ensure users cannot access outputs associated with (meta)data in question, and that we have sufficient validation to check our assumptions on (meta)data.
Exactly @TimothyTickle. Outputting incorrect results on error is far worse than crashing. This RFC forbids such behavior.
Points brought up at the 2019-11-08 tech arch meeting
I'm still confused about the purpose of this RFC. There are a lot of words with positive connotations but little concrete guidance for any of the scenarios we actually encounter when dealing with new data or metadata. Components should be "liberal", "tolerant", and "graceful" all while "ensuring integrity." Are there no tradeoffs to consider there? To the extent that this does suggest any requirements, it's not clear how any components are supposed to fulfill them. For example, "DCP components skip processing...rather than crashing or producing incorrect results." How does a component know when it's produced incorrect results? For example, if the meaning of a metadata field changes, how would data processing or the matrix service catch that? Saying that "needs to be defined" is really punting on the core question. Neither of those components wants to produce incorrect results, even in the absence of an RFC admonishing them not to do so.
We changed the review deadline due to not having had the time to incorporate comments and questions.
thanks @mckinsel ...
An explicit goal of this RFC is to define expectations for the external behavior of components without making requirements on the internal implementation. This is based on loosely coupled architecture concepts. Several people have requested that examples be added, and hopefully they will make things clearer.
Of course, there are always trade-offs. Given our current rate of data processing, one could argue that the DCP pausing on bad input is tolerable, so robustness work is not a high priority. There is also the danger that false positives in deciding input is bad could require excessive intervention.
Thanks, this needs to be rephrased. The "incorrect results" statement has to do with processing bad input that was not detected. While sanity checking results is useful (e.g. alignment QC), it is not part of this RFC. We will make this clear.
This is actually a pretty interesting case with "meaning" changing. The format could change (e.g. scalar to array); however, if the fundamental meaning changes and the field name hasn't changed, the metadata group has made a huge mistake. We will add this to the requested examples.
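A format change like the scalar-to-array case mentioned above is at least detectable by a consumer. Here is a hedged sketch; the field name `organ` and the function `read_organ` are illustrative assumptions, not actual DCP metadata schema: the reader tolerates both known shapes and raises on anything else, so the error can be recorded and the item skipped upstream.

```python
# Hypothetical sketch: a consumer guards against a field changing shape
# (e.g. "organ" going from a scalar string to an array of strings).
# "organ" is an illustrative field name, not a real DCP schema field.
def read_organ(metadata: dict) -> list:
    value = metadata.get("organ")
    if isinstance(value, str):
        return [value]                      # old scalar form: normalize to a list
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return value                        # new array form, accepted as-is
    # Unknown shape: refuse to guess; let the caller record the error and skip.
    raise ValueError(f"unexpected shape for 'organ': {value!r}")

assert read_organ({"organ": "brain"}) == ["brain"]
assert read_organ({"organ": ["brain", "heart"]}) == ["brain", "heart"]
```

Note that a shape check like this cannot catch a pure meaning change behind an unchanged name and type, which is exactly why that case is called a "huge mistake" above.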
The punting is deliberate: this is to be defined as part of the RFC review process based on the input of the TL. Approved RFCs should also be revised as they are implemented. If these are just static snapshots of a few people's thinking, then the whole RFC process is not very useful.
jahilton left a comment:
I would really like to see something in here about a requirement for components to attempt early detection of errors. Currently, it seems we don't find an error until we actually have our hands on the new data (and usually, data consumers also have their hands on at least some portion of that data). Such early detection could include thorough automated tests and/or an environment that fully mirrors our prod env, so we can 'try' to push data through the system and see how it responds.
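One lightweight form of the early detection suggested above is a pre-ingest check that surfaces problems before any consumer can touch the data. This is a minimal sketch under assumed names; `preflight_errors` and the field names in `REQUIRED_FIELDS` are hypothetical, not part of any DCP schema:

```python
# Hypothetical pre-ingest validation: run *before* a submission enters the
# system, so errors are found while no downstream consumer has the data yet.
REQUIRED_FIELDS = {"project_id": str, "dissociation_protocol": str}  # illustrative

def preflight_errors(submission: dict) -> list:
    """Return a list of problems; an empty list means the submission may proceed."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in submission:
            errors.append(f"missing field: {name}")
        elif not isinstance(submission[name], expected_type):
            errors.append(f"wrong type for {name}: {type(submission[name]).__name__}")
    return errors

ok = {"project_id": "p1", "dissociation_protocol": "enzymatic"}
bad = {"project_id": 7}
# preflight_errors(ok) == []
# preflight_errors(bad) reports the wrong type and the missing field
```

A check like this is deliberately shallow; the mirrored-prod-environment approach mentioned above would catch the deeper class of errors that only appear when data actually flows through the components.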
## Motivation
The HCA DCP is driven by input generated from external sources. As most developers on biological data processing systems can attest, this produces data that is highly variable and error-prone. Collecting, storing and processing this type of content can pose challenges for software, as systems must be highly robust against all possible types of problematic data. When unexpected content is encountered, it needs to be handled reliably and predictably.
It's difficult to handle something predictably when that something is unexpected. But I think this is semantics, and likely we want all things that fall into the bin of 'not the norm' to be handled reliably and predictably.
Likely, most consumers of DCP data will only be interested in some of the experimental metadata. In some cases, this will be confined to a subset of the available metadata (e.g. "which dissociation protocol was used in this experiment?") and in other cases, it will be confined to certain experimental designs ("how was the organoid created in this experiment?"). In both scenarios, we will observe differences. Over time, the set of fields needed to adequately define dissociation protocols will likely change, and not all experimental designs make use of organoids.
Given the manually created nature of experimental metadata, its high variability, and the needs of consumers, it is likely that errors and mismatches will occur. Data consumers' assumptions will be invalidated; highly unusual experimental designs that were never considered when the software components were written will be submitted; and metadata that is not backward compatible, or is simply incorrect, will end up in the DCP.
The purpose of metadata & wranglers is to both limit inaccuracies in metadata and keep consistency in how experiments are captured, even as the models of those experiments vary. Rather than call out the inevitable errors in our manual processes within one component, I think a bigger point to make here is that the science we are wrangling is fresh and still developing in unpredictable ways. Thus, the DCP needs to be prepared to manage a changing model because with each new submission, we have a high chance of needing to update our model in order to capture everything accurately (but still in a standard way - this might be in disagreement with the statement about 'not backward compatible').
Dec 19, 2019: Community Review Last Call
This RFC provides guidelines for the robustness of DCP components in handling data processing problems.
This is an architecture requirements RFC and doesn’t specify any code changes, although each component is expected to determine what is required to meet these requirements.
Shepherd Summary
Nov 25, 2019:
The deadline has been extended to give the authors time to address late arriving reviewer comments. Note that the current status is approved, with neutral (although helpful) comments from several reviewers.
Reviewers: please update your review status to 'request changes' if you feel the issues your comments are targeting are so significant as to prevent acceptance of this RFC.