
RFC: Data processing robustness#125

Open
diekhans wants to merge 18 commits into HumanCellAtlas:master from
diekhans:data-processing-robustness

Conversation

@diekhans
Contributor

@diekhans diekhans commented Oct 31, 2019

Dec 19, 2019: Community Review Last Call

This RFC provides guidelines for the robustness of DCP components in handling data processing problems.
This is an architecture requirements RFC and doesn’t specify any code changes, although each component is expected to determine what is required to meet these requirements.

Shepherd Summary

Nov 25, 2019:
The deadline has been extended to give the authors time to address late arriving reviewer comments. Note that the current status is approved, with neutral (although helpful) comments from several reviewers.

Reviewers: please update your review status to 'request changes' if you feel the issues your comments are targeting are so significant as to prevent acceptance of this RFC.

diekhans and others added 9 commits October 31, 2019 12:20
Expands the motivation section and adds details of experimental metadata variability
Super messy WIP in this commit! This will need a lot of cleaning up
before it is ready to go.
- How each component determines and reports compliance to these principles needs to be defined.

### Prior work:
- [The Harmful Consequences of the Robustness Principle](https://tools.ietf.org/html/draft-iab-protocol-maintenance-03) - Despite the name, this IETF draft very much supports a premise of this RFC, which is to not attempt to process unexpected data, but rather to skip it and move on in a principled manner.
Collaborator

@brianraymor brianraymor Oct 31, 2019


Your skip and move on statement is more reflective of Postel's principle, which this internet draft is pushing back against: "Time and experience shows that negative consequences to interoperability accumulate over time if implementations apply the robustness principle."

The guidance includes:

Virtuous Intolerance

A well-specified protocol includes rules for consistent handling of aberrant conditions. This increases the chances that implementations have interoperable handling of unusual conditions.

Intolerance of any deviation from specification, where implementations generate fatal errors in response to observing undefined or unusual behaviour, can be harnessed to reduce occurrences of aberrant implementations. Choosing to generate fatal errors for unspecified conditions instead of attempting error recovery can ensure that faults receive attention.

This improves feedback for new implementations in particular. When a new implementation encounters an intolerant implementation, it receives strong feedback that allows problems to be discovered quickly.

Contributor Author


This is an interesting and nuanced discussion.

Postel's principle says to ignore and move on, while this specification is about record the error, move on, and fix later. So superficially it might look like Postel's; however, it is about both not stopping processing and not losing data or creating incorrect results.

This treats the DCP more like a credit card system than twitter. Don't let one bad transaction go undetected but don't let it bring down the whole system either.

Right now, changes upstream are hard to make because of the fear of breaking downstream. So it makes it very difficult to bring in new data.
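
The record-the-error-and-continue pattern discussed above could be sketched as follows. This is a minimal illustration, not part of the RFC; `process_bundle` and the `quarantine` store are hypothetical names, and a real component would persist the quarantine record and alert operations:

```python
def process_submission(bundles, process_bundle, quarantine):
    """Process each bundle independently: a failure is recorded and the
    bundle is skipped, rather than crashing the whole run or emitting
    partial/incorrect results for the failed bundle."""
    results = {}
    for bundle_id, bundle in bundles.items():
        try:
            results[bundle_id] = process_bundle(bundle)
        except Exception as exc:
            # Record the failure for later triage; produce NO output for
            # this bundle so an incorrect result cannot leak downstream.
            quarantine[bundle_id] = repr(exc)
    return results

# One bad transaction is isolated; the rest of the submission succeeds.
quarantine = {}
ok = process_submission(
    {"b1": 2, "b2": "bad", "b3": 5},
    process_bundle=lambda b: b + 10,  # raises TypeError on non-numeric input
    quarantine=quarantine,
)
```

The key property is the one discussed in the thread: the failed item produces no output at all (credit-card-style accounting), while the component itself keeps running.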

Contributor

@TimothyTickle TimothyTickle Nov 1, 2019


This is a very important conversation, and I am glad to see it happening. I would imagine no one wants any component to go down due to data or metadata that is not well understood, so I am echoing that resilience of component operation in the face of the unknown is critical.

In my experience over years of working with many scientists, researchers will use any output given to them, even from a failed run with extensive logging indicating the data is wrong; the mere existence of output is validation enough that the analysis was successful. It is therefore important that, while components do not fail, the system also does not progress metadata or data that is not understood, and deletes or does not store any derived product from (meta)data that encounters an error.

Clear signals should exist in components throughout the life-cycle of the data to alert operations to the problematic (meta)data, so the scenario can be resolved (and that resolution hopefully automated to handle the next occurrence) without breaking the operation of the component. With such an open system as ours, it is important we ensure users cannot access outputs associated with (meta)data in question, and that we have sufficient validation to check our assumptions about (meta)data.

Contributor Author


Exactly @TimothyTickle. Outputting incorrect results on error is far worse than crashing. This RFC forbids such behavior.

@diekhans
Contributor Author

diekhans commented Nov 8, 2019

Points brought up at the 2019-11-08 tech arch meeting

  • Add confessional approach to RFC
  • HTTP response codes point needs to have examples
    • this point may be overly prescriptive for this RFC.
  • do a confessional (round table) first to identify where we are at
  • log and continue should not be confused with pushing through the failed data
    • data groups will allow extending a partially failed submission
  • how does this relate to the metadata integration testing process?
  • examples would be useful

@mckinsel

I'm still confused about the purpose of this RFC. There are a lot of words with positive connotations but little concrete guidance for any of the scenarios we actually encounter when dealing with new data or metadata. Components should be "liberal", "tolerant", and "graceful" all while "ensuring integrity." Are there no tradeoffs to consider there?

To the extent that this does suggest any requirements, it's not clear how any components are supposed to fulfill them. For example, "DCP components skip processing...rather than crashing or producing incorrect results." How does a component know when it's produced incorrect results? For example, if the meaning of a metadata field changes, how would data processing or the matrix service catch that? Saying that "needs to be defined" is really punting on the core question. Neither of those components wants to produce incorrect results, even in the absence of an RFC admonishing them not to do so.

@diekhans
Contributor Author

We changed the review deadline due to not having had the time to incorporate comments and questions.

@diekhans
Contributor Author

thanks @mckinsel ...

I'm still confused about the purpose of this RFC. There are a lot of words with positive connotations but little concrete guidance for any of the scenarios we actually encounter when dealing with new data or metadata.

An explicit goal of this RFC is to define expectations for the external behavior of components without making requirements on the internal implementation. This is based on loosely coupled architecture concepts.

Several people have requested that examples be added, and hopefully they will make things clearer.

Components should be "liberal", "tolerant", and "graceful" all while "ensuring integrity." Are there no tradeoffs to consider there?

Of course, there are always trade-offs. Given our current rate of data processing, one could argue that having the DCP pause on bad input is tolerable, and thus that robustness work is not a high priority. There is also the danger that false positives in deciding input is bad could require excessive intervention.

To the extent that this does suggest any requirements, it's not clear how any components are supposed to fulfill them. For example, "DCP components skip processing...rather than crashing or producing incorrect results." How does a component know when it's produced incorrect results?

Thanks, this needs to be rephrased. The "incorrect results" statement has to do with processing bad input that was not detected. While sanity checking results is useful (e.g. alignment QC), this is not part of this RFC. We will make this clear.

For example, the meaning of a metadata field changes, how would data processing or the matrix service catch that?

This is actually a pretty interesting case with "meaning" changing. The format could change (e.g. scalar to array), however, if the fundamental meaning changes and the field name hasn't changed, the metadata group has made a huge mistake.

We will add this to the requested examples.
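
As one of the requested examples: a consumer can at least detect the format style of change (e.g. a scalar becoming an array) by validating field types before use. This is a hypothetical sketch; the field names and `EXPECTED_TYPES` table are illustrative, not part of any DCP schema:

```python
# Types the consumer was written against (illustrative field names).
EXPECTED_TYPES = {
    "organ": str,                  # expected scalar
    "dissociation_protocol": str,
}

def check_field_types(metadata, expected=EXPECTED_TYPES):
    """Return the fields whose runtime type no longer matches what the
    consumer expects, e.g. a scalar that became an array in a newer schema."""
    mismatches = []
    for field, expected_type in expected.items():
        if field in metadata and not isinstance(metadata[field], expected_type):
            mismatches.append(field)
    return mismatches

# A newer submission where "organ" became a list of values:
bad = check_field_types({"organ": ["brain", "liver"],
                         "dissociation_protocol": "10x_v2"})
```

Note that this only catches structural changes; as observed above, a change in *meaning* behind an unchanged name and type cannot be detected mechanically and must be prevented by the metadata process itself.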

Saying that "needs to be defined" is really punting on the core question.

It is punting on purpose: this is to be defined as part of the RFC review process, based on the input of the TL. Approved RFCs should also be revised as they are implemented. If these are just static snapshots of a few people's thinking, then the whole RFC process is not very useful.

Contributor

@jahilton jahilton left a comment


I would really like to see something in here about a requirement for components to attempt early detection of errors. Currently, it seems we don't find an error until we actually have our hands on the new data (and usually, data consumers also have their hands on at least some portion of that data). This can include thorough automated tests and/or an environment that fully mirrors our prod env so we can 'try' to push data through the system and see how it responds.


## Motivation

The HCA DCP is driven by input generated from external sources. As most developers on biological data processing systems can attest, this produces data that is highly variable and error-prone. Collecting, storing and processing this type of content can pose challenges for software, as systems must be highly robust against all possible types of problematic data. When unexpected content is encountered, it needs to be handled reliably and predictably.
Contributor


It's difficult to handle something predictably when that something is unexpected. But I think this is semantics and likely we want all things that don't fall into the bin of 'not the norm' to be handled reliably and predictably.


Likely, most consumers of DCP data will only be interested in some of the experimental metadata. In some cases, this will be confined to a subset of the available metadata (e.g. "which dissociation protocol was used in this experiment?") and in other cases, it will be confined to certain experimental designs ("how was the organoid created in this experiment?"). In both scenarios, we will observe differences. Over time, the set of fields needed to adequately define dissociation protocols will likely change, and not all experimental designs make use of organoids.

Given the manually created nature of experimental metadata, its high variability, and the needs of consumers, it is likely that errors and mismatches will occur. Assumptions of data consumers will be invalid, highly unusual experimental designs that had never been considered when writing software components will be submitted, and metadata that is not backward compatible or is simply incorrect will all end up in the DCP.
Contributor


The purpose of metadata & wranglers is to both limit inaccuracies in metadata and keep consistency in how experiments are captured, even as the models of those experiments vary. Rather than call out the inevitable errors in our manual processes within one component, I think a bigger point to make here is that the science we are wrangling is fresh and still developing in unpredictable ways. Thus, the DCP needs to be prepared to manage a changing model because with each new submission, we have a high chance of needing to update our model in order to capture everything accurately (but still in a standard way - this might be in disagreement with the statement about 'not backward compatible').
