
RFC: Data processing robustness#125

Open
diekhans wants to merge 18 commits into HumanCellAtlas:master from
diekhans:data-processing-robustness

Conversation

@diekhans
Contributor

@diekhans diekhans commented Oct 31, 2019

Dec 19, 2019: Community Review Last Call

This RFC provides guidelines for the robustness of DCP components in handling data processing problems.
This is an architecture requirements RFC and doesn’t specify any code changes, although each component is expected to determine what is required to meet these requirements.

Shepherd Summary

Nov 25, 2019:
The deadline has been extended to give the authors time to address late arriving reviewer comments. Note that the current status is approved, with neutral (although helpful) comments from several reviewers.

Reviewers: please update your review status to 'request changes' if you feel the issues your comments are targeting are so significant as to prevent acceptance of this RFC.

diekhans and others added 9 commits October 31, 2019 12:20
Expands the motivation section and adds details of experimental metadata variability
Super messy WIP in this commit! This will need a lot of cleaning up
before it is ready to go.
- How each component determines and reports compliance to these principles needs to be defined.

### Prior work:
- [The Harmful Consequences of the Robustness Principle](https://tools.ietf.org/html/draft-iab-protocol-maintenance-03) - Despite the name, this IETF draft very much supports a premise of this RFC, which is to not attempt to process unexpected data, but rather to skip it and move on in a principled manner.
Collaborator

@brianraymor brianraymor Oct 31, 2019


Your skip and move on statement is more reflective of Postel's principle, which this internet draft is pushing back against: "Time and experience shows that negative consequences to interoperability accumulate over time if implementations apply the robustness principle."

The guidance includes:

Virtuous Intolerance

A well-specified protocol includes rules for consistent handling of aberrant conditions. This increases the chances that implementations have interoperable handling of unusual conditions.

Intolerance of any deviation from specification, where implementations generate fatal errors in response to observing undefined or unusual behaviour, can be harnessed to reduce occurrences of aberrant implementations. Choosing to generate fatal errors for unspecified conditions instead of attempting error recovery can ensure that faults receive attention.

This improves feedback for new implementations in particular. When a new implementation encounters an intolerant implementation, it receives strong feedback that allows problems to be discovered quickly.

Contributor Author


This is an interesting and nuanced discussion.

Postel's principle says to ignore and move on, while this specification is about record the error, move on, and fix later. So superficially it might look like Postel's; however, it is about both not stopping processing and not losing data or creating incorrect results.

This treats the DCP more like a credit card system than twitter. Don't let one bad transaction go undetected but don't let it bring down the whole system either.

Right now, changes upstream are hard to make because of the fear of breaking downstream. So it makes it very difficult to bring in new data.
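
The record-the-error-and-continue pattern discussed above could be sketched as follows. This is a minimal illustration, not part of the RFC; `process_bundle` and the `quarantine` store are hypothetical names, and a real component would persist the quarantine record and alert operations:

```python
def process_submission(bundles, process_bundle, quarantine):
    """Process each bundle independently: a failure is recorded and the
    bundle is skipped, rather than crashing the whole run or emitting
    partial/incorrect results for the failed bundle."""
    results = {}
    for bundle_id, bundle in bundles.items():
        try:
            results[bundle_id] = process_bundle(bundle)
        except Exception as exc:
            # Record the failure for later triage; produce NO output for
            # this bundle so an incorrect result cannot leak downstream.
            quarantine[bundle_id] = repr(exc)
    return results

# One bad transaction is isolated; the rest of the submission succeeds.
quarantine = {}
ok = process_submission(
    {"b1": 2, "b2": "bad", "b3": 5},
    process_bundle=lambda b: b + 10,  # raises TypeError on non-numeric input
    quarantine=quarantine,
)
```

The key property is the one discussed in the thread: the failed item produces no output at all (credit-card-style accounting), while the component itself keeps running.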

Contributor

@TimothyTickle TimothyTickle Nov 1, 2019


This is a very important conversation, and I am glad to see it happening. I would imagine no one wants any component to go down due to data or metadata that is not well understood, so I am echoing that resilience of component operation in the face of the unknown is critical.

In my experience over years of working with many scientists, researchers will use any output given to them, even from a failed run with extensive logging indicating the data is wrong; the mere existence of output is validation enough that the analysis was successful. It is therefore important that, while components do not fail, the system also does not progress metadata or data that is not understood, and deletes or does not store any derived product from (meta)data that encounters an error.

Clear signals should exist in components throughout the life-cycle of the data to alert operations to the problematic (meta)data, so the scenario can be resolved (and that resolution hopefully automated to handle the next occurrence) without breaking the operation of the component. With such an open system as ours, it is important we ensure users cannot access outputs associated with (meta)data in question, and that we have sufficient validation to check our assumptions about (meta)data.

Contributor Author


Exactly @TimothyTickle. Outputting incorrect results on error is far worse than crashing. This RFC forbids such behavior.

@diekhans
Contributor Author

diekhans commented Nov 8, 2019

Points brought up at the 2019-11-08 tech arch meeting

  • Add confessional approach to RFC
  • HTTP response codes point needs to have examples
    • this point may be overly prescriptive for this RFC.
  • do a confessional (round table) first to identify where we are at
  • log and continue should not be confused with pushing through the failed data
    • data groups will allow extending a partially failed submission
  • how does this relate to the metadata integration testing process?
  • examples would be useful

@mckinsel

I'm still confused about the purpose of this RFC. There are a lot of words with positive connotations but little concrete guidance for any of the scenarios we actually encounter when dealing with new data or metadata. Components should be "liberal", "tolerant", and "graceful" all while "ensuring integrity." Are there no tradeoffs to consider there?

To the extent that this does suggest any requirements, it's not clear how any components are supposed to fulfill them. For example, "DCP components skip processing...rather than crashing or producing incorrect results." How does a component know when it's produced incorrect results? For example, if the meaning of a metadata field changes, how would data processing or the matrix service catch that? Saying that "needs to be defined" is really punting on the core question. Neither of those components wants to produce incorrect results, even in the absence of an RFC admonishing them not to do so.

@diekhans
Contributor Author

We changed the review deadline due to not having had the time to incorporate comments and questions.

@diekhans
Contributor Author

thanks @mckinsel ...

I'm still confused about the purpose of this RFC. There are a lot of words with positive connotations but little concrete guidance for any of the scenarios we actually encounter when dealing with new data or metadata.

An explicit goal of this RFC is to define expectations for the external behavior of components without making requirements on the internal implementation. This is based on loosely coupled architecture concepts.

Several people have requested that examples be added, and hopefully they will make things clearer.

Components should be "liberal", "tolerant", and "graceful" all while "ensuring integrity." Are there no tradeoffs to consider there?

Of course, there are always trade-offs. Given our current rate of data processing, one could argue that having the DCP pause on bad input is tolerable, and thus that robustness work is not a high priority. There is also the danger that false positives in deciding input is bad could require excessive intervention.

To the extent that this does suggest any requirements, it's not clear how any components are supposed to fulfill them. For example, "DCP components skip processing...rather than crashing or producing incorrect results." How does a component know when it's produced incorrect results?

Thanks, this needs to be rephrased. The "incorrect results" statement has to do with processing bad input that was not detected. While sanity checking results is useful (e.g. alignment QC), this is not part of this RFC. We will make this clear.

For example, the meaning of a metadata field changes, how would data processing or the matrix service catch that?

This is actually a pretty interesting case with "meaning" changing. The format could change (e.g. scalar to array), however, if the fundamental meaning changes and the field name hasn't changed, the metadata group has made a huge mistake.

We will add this to the requested examples.
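
As one of the requested examples: a consumer can at least detect the format style of change (e.g. a scalar becoming an array) by validating field types before use. This is a hypothetical sketch; the field names and `EXPECTED_TYPES` table are illustrative, not part of any DCP schema:

```python
# Types the consumer was written against (illustrative field names).
EXPECTED_TYPES = {
    "organ": str,                  # expected scalar
    "dissociation_protocol": str,
}

def check_field_types(metadata, expected=EXPECTED_TYPES):
    """Return the fields whose runtime type no longer matches what the
    consumer expects, e.g. a scalar that became an array in a newer schema."""
    mismatches = []
    for field, expected_type in expected.items():
        if field in metadata and not isinstance(metadata[field], expected_type):
            mismatches.append(field)
    return mismatches

# A newer submission where "organ" became a list of values:
bad = check_field_types({"organ": ["brain", "liver"],
                         "dissociation_protocol": "10x_v2"})
```

Note that this only catches structural changes; as observed above, a change in *meaning* behind an unchanged name and type cannot be detected mechanically and must be prevented by the metadata process itself.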

Saying that "needs to be defined" is really punting on the core question.

It is punting on purpose: this is to be defined as part of the RFC review process, based on the input of the TL. Approved RFCs should also be revised as they are implemented. If these are just static snapshots of a few people's thinking, then the whole RFC process is not very useful.

Contributor

@jahilton jahilton left a comment


I would really like to see something in here about a requirement for components to attempt early detection of errors. Currently, it seems we don't find an error until we actually have our hands on the new data (and usually, data consumers also have their hands on at least some portion of that data). This can include thorough automated tests and/or an environment that fully mirrors our prod env so we can 'try' to push data through the system and see how it responds.


## Motivation

The HCA DCP is driven by input generated from external sources. As most developers on biological data processing systems can attest, this produces data that is highly variable and error-prone. Collecting, storing and processing this type of content can pose challenges for software, as systems must be highly robust against all possible types of problematic data. When unexpected content is encountered, it needs to be handled reliably and predictably.
Contributor


It's difficult to handle something predictably when that something is unexpected. But I think this is semantics and likely we want all things that don't fall into the bin of 'not the norm' to be handled reliably and predictably.


Likely, most consumers of DCP data will only be interested in some of the experimental metadata. In some cases, this will be confined to a subset of the available metadata (e.g. "which dissociation protocol was used in this experiment?") and in other cases, it will be confined to certain experimental designs ("how was the organoid created in this experiment?"). In both scenarios, we will observe differences. Over time, the set of fields needed to adequately define dissociation protocols will likely change, and not all experimental designs make use of organoids.

Given the manually created nature of experimental metadata, its high variability, and the needs of consumers, it is likely that errors and mismatches will occur. Assumptions of data consumers will be invalid, highly unusual experimental designs that had never been considered when writing software components will be submitted, and metadata that is not backward compatible or is simply incorrect will all end up in the DCP.
Contributor


The purpose of metadata & wranglers is to both limit inaccuracies in metadata and keep consistency in how experiments are captured, even as the models of those experiments vary. Rather than call out the inevitable errors in our manual processes within one component, I think a bigger point to make here is that the science we are wrangling is fresh and still developing in unpredictable ways. Thus, the DCP needs to be prepared to manage a changing model because with each new submission, we have a high chance of needing to update our model in order to capture everything accurately (but still in a standard way - this might be in disagreement with the statement about 'not backward compatible').
