Main ingester wrapper refactoring, introducing metadata file read/writing and Dataset Tracker integration.#30
Open
Conversation
…orting als-dataset-metadata.json files.
…ng/scicat_beamline into 2026/01/file_record_updating
… folder, exit. If we're given exactly one folder, hand it to the iterator for the given spec. If we're given no folders but some number of files, pass them directly to the ingestor (and assume it will handle them).
…ataset-metadata.json file, and expecting metadata back from the ingesters.
…till some internal changes to make (e.g. temp_dir).
…mmon path, and a file list. Looking for and creating a manifest file. Instantiating the Dataset Tracker client and using it to construct and/or update records. Still a few things to fix: File creation, and manifest building.
…to "time:2000-01-02T12:23:45" and then splitting on ":" and expecting exactly two entries would result in dropping a perfectly usable time value.
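A minimal illustration of that failure mode: ISO timestamps contain colons themselves, so a bare split cannot assume exactly two fields. Limiting the split to one cut keeps the time value intact:

```python
value = "time:2000-01-02T12:23:45"

# The bug: a naive split on ":" shatters the timestamp.
parts = value.split(":")
assert len(parts) == 4  # not the 2 entries the old code expected

# The fix: split at most once, so the time value survives.
key, timestamp = value.split(":", 1)
assert key == "time"
assert timestamp == "2000-01-02T12:23:45"
```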
… problem: It apparently looks for a "derived" subfolder and creates multiple datasets based on the contents, linking them to the first. This thoroughly breaks our current workflow. Not sure whether we want to even allow the creation of derived data at the beamline, but I'm assuming we do. So this means refactoring the main ingest function yet again, to make space for multiple separate rounds of ingestion and Dataset Tracker registering, and it ALSO breaks our reliance on "als-dataset-metadata.json" always having a known name, because a pile of files in a "derived" folder may result in multiple metadata files written to the same place. We could give them all unique postfixes based on the Dataset Tracker slug, which would "solve" the problem, but it would look ugly.
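The slug-postfix idea floated above could look something like this; it is a sketch of the proposal, not code from the branch, and the helper name is made up:

```python
from pathlib import Path
from typing import Optional

METADATA_STEM = "als-dataset-metadata"

def metadata_path(folder: Path, tracker_slug: Optional[str] = None) -> Path:
    """Hypothetical naming helper: disambiguate multiple metadata files
    in one folder by appending the Dataset Tracker slug."""
    if tracker_slug:
        return folder / f"{METADATA_STEM}-{tracker_slug}.json"
    # No tracker slug (e.g. tracker disabled): keep the plain known name.
    return folder / f"{METADATA_STEM}.json"
```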
…g a name that reflects the presence of a Dataset Tracker ID.
…e getting any useful metadata out of this .txt file ingestion process at all. I hope I can have a conversation with Damon about this at some point.
… afterwards (if we write one). We also look for a metadata file in the incoming list, but do not look in the filesystem for it. Unless: We have been given no files, in which case we build a manifest from the given path. We'll look for a metadata file in there. This makes the ingesters require a file manifest, instead of doing their own directory crawling. Our intention is to make it clear that if we're given a folder (and no file list), then everything in that folder is considered part of the dataset, whether or not it's used for SciCat ingestion. Still need to account for adding (if needed) the metadata file to its own manifest.
…es" to report stuff, just putting it in the logfile.
…ing "issues" to report stuff, just putting it in the logfile. Still need to do the 'spot' version. Needs testing.
…to derived classes.
… parse. Other time zone fixes.
…Prefect-based test that doesn't fake the client calls.
…class like I wanted to months ago. Now 733 and 832 have class-based ingesters. The rest can be converted as we go.
This is a lot to look through. I apologize. :D
This refactors the main ingestion function:
- Looks for an `als-dataset-metadata.json` file, and imports it if possible. (If this is a re-ingestion, the file may be of the form `als-dataset-metadata-_____.json`, where `_____` is the ID of the Dataset Tracker record for this dataset.)
- Writes an `als-dataset-metadata-_____.json` file out. (If the Dataset Tracker is not being used, there is no suffix. This feature can also be turned off.)
- When Dataset Tracker records already exist for the dataset described by `als-dataset-metadata.json`, it intelligently updates the records to match any changes in the manifest. (This is for two purposes: 1. Datasets that a user has revised and wants to re-ingest; 2. A near-future scenario where we register things in the Dataset Tracker as soon as they get to Beegfs, and we attempt a SciCat ingest in a separate step.)

This branch is meant to stabilize ingesters on creating one dataset in SciCat per ingestion; if we want to create many, we should invoke it multiple times and pass in a different file list each time.
Note: The beamline-specific ingester modules are all classes now, and derive from a base class. So far 7.3.3 and 8.3.2 have been rewritten this way. We'll do others as needed.
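The class-based layout might look roughly like this; the base-class name and hook signatures here are hypothetical, not the repository's actual API:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Sequence

class BeamlineIngester(ABC):
    """Sketch of a shared base class for beamline-specific ingesters."""

    spec: str  # e.g. "7.3.3" or "8.3.2"

    @abstractmethod
    def extract_metadata(self, manifest: Sequence[Path]) -> dict:
        """Per-beamline scientific-metadata extraction."""

    def ingest(self, manifest: Sequence[Path]) -> dict:
        # Shared flow: extract metadata and hand it back to the wrapper,
        # which owns metadata-file writing and Dataset Tracker calls.
        return self.extract_metadata(manifest)

class Ingester733(BeamlineIngester):
    spec = "7.3.3"

    def extract_metadata(self, manifest):
        # Illustrative placeholder for the beamline-specific logic.
        return {"beamline": self.spec, "file_count": len(manifest)}
```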
Primary points of interest:
The rewritten process in https://github.com/als-computing/scicat_beamline/pull/30/changes#diff-8b15f38de565a134fe2d43f70b02e5fcbfb438790fa7219e4e201aac32890212 .
The (highly suspicious) metadata code in the 7.3.3 ingester. It was highly suspicious before, and it still is.