ProvLog
There will be a separate paper detailing the purpose and implementation of ProvLog, but here we'll sketch out the highlights. ($jeff)
Most data-centric projects have a robust and thorough data backup system in place. With it, any damaged or deleted file can be quickly and easily recovered without any appreciable loss to the project. The problem, however, is not how to recover those files, but how to detect that they've been damaged or deleted in the first place.
Most data researchers assume that the greatest risk of damage or deletion comes when the files are being moved from one machine or directory to another, and great care is usually taken to ensure that the file lists and checksums match before and after the move.
Commonly cited causes of file damage:
- lost/corrupted during transfer
- spontaneous degradation of storage media
- vandalism
But there is one last source of error that is even more common than all of those, yet is neither commonly discussed nor protected against: accidental alteration by the data managers themselves. Particularly in a research environment, where stakeholders often examine data directly, the tools employed are general-purpose in nature and capable of altering file content. And such alterations happen frequently. In most cases, when the operator accidentally deletes a line while perusing the file, the change is noticed immediately and backed out. But in some cases, those changes are not detected and corrected, and may lie dormant in the active version of the file throughout the entire processing phase.
In a world where significant economic, political, and healthcare decisions are increasingly being driven by data, it becomes imperative that the chain of provenance for such datasets be open to scrutiny and verification. It isn't sufficient to know that a particular file has not been altered since it was written; we must also know that it still matches the file originally created on some other machine, after being leapfrogged across a dozen intervening systems. Such provenance tracking is not widely done, and tools for supporting it are almost non-existent. For this reason, we've created a prototype of such a system, which we call ProvLog, a contraction of Provenance Logger.
ProvLog works by establishing a simple MD5 checksum fingerprint on a datafile at the time of creation, and attaching that checksum to the target file as a sidecar file. As the target file moves through the data pipeline, the sidecar moves with it, and can be used to authenticate the target file's content at any time, confirming that it not only matches now, but matches all the way back across those many intervening hops to the very first version of the file.
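The core mechanism can be sketched in a few lines of Python. Note that the sidecar naming convention (a `.provlog` extension appended to the target's name) and the JSON list-of-entries format used here are illustrative assumptions, not the actual ProvLog file layout:

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def create_sidecar(target: Path) -> Path:
    """Record the target's initial fingerprint in a sidecar file beside it.

    The sidecar travels with the target, so later hops can always be
    checked against the very first version of the file.
    """
    sidecar = target.with_name(target.name + ".provlog")  # assumed naming scheme
    sidecar.write_text(json.dumps([{"md5": md5_of(target)}]))
    return sidecar


def verify(target: Path) -> bool:
    """Compare the file's current checksum to the latest sidecar entry."""
    sidecar = target.with_name(target.name + ".provlog")
    entries = json.loads(sidecar.read_text())
    return entries[-1]["md5"] == md5_of(target)
```

Verification only needs the file and its sidecar, so it can be run on any machine along the pipeline without contacting the originating system.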
Sometimes, however, target files are altered for legitimate purposes, such as correcting typos, normalizing or clarifying ambiguous data, etc. For these cases, the ProvLog sidecars support logging, allowing the user to add a new checksum to the sidecar along with an explanation of what was changed and why. By maintaining the entire history of these logs, checksums, and file URIs, the sidecar file encapsulates a complete history of how individual data files have been manipulated across their lifespan.
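A logged, legitimate change might be recorded as follows. Again, the entry fields (`md5`, `uri`, `reason`, `timestamp`) and the JSON sidecar format are assumptions for the sake of illustration:

```python
import hashlib
import json
import time
from pathlib import Path


def _md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def log_change(target: Path, reason: str) -> None:
    """Append a new checksum and explanation to the target's sidecar log.

    The full history of checksums, URIs, and reasons is retained, so the
    sidecar encapsulates how the file was manipulated across its lifespan.
    """
    sidecar = target.with_name(target.name + ".provlog")  # assumed naming scheme
    history = json.loads(sidecar.read_text()) if sidecar.exists() else []
    history.append({
        "md5": _md5(target),
        "uri": target.resolve().as_uri(),
        "reason": reason,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    })
    sidecar.write_text(json.dumps(history, indent=2))
```

Because entries are only ever appended, an auditor can replay the sidecar from the first entry to the last and account for every intentional change.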
Importantly, the ProvLog system also includes a scanning mode that can be run on a recurring schedule to scan all data directories and validate every sidecar file found. In this way, accidental changes to a file are found quickly, and the operator is given the chance to reverse the change, restore from backup, or update the sidecar with an explanation of the change.
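The scanning mode amounts to walking the data directories and re-checking each target against its sidecar's most recent entry. A minimal sketch, again assuming a `.provlog` sidecar suffix and a JSON list-of-entries format:

```python
import hashlib
import json
from pathlib import Path


def _md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def scan(root: Path) -> list:
    """Validate every sidecar under root.

    Returns the list of target files that have been deleted or whose
    content no longer matches the latest logged checksum, so the operator
    can reverse the change, restore from backup, or log an explanation.
    """
    mismatches = []
    for sidecar in root.rglob("*.provlog"):          # assumed naming scheme
        target = sidecar.with_name(sidecar.name[:-len(".provlog")])
        if not target.exists():
            mismatches.append(target)                # target was deleted
            continue
        entries = json.loads(sidecar.read_text())
        if entries[-1]["md5"] != _md5(target):
            mismatches.append(target)                # content has changed
    return mismatches
```

Run from cron or a scheduler, a nightly `scan()` shrinks the window in which a silent alteration can lie dormant from an entire processing phase down to a day.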