Main ingester wrapper refactoring, introducing metadata file read/writing and Dataset Tracker integration.#30

Open
GBirkel wants to merge 58 commits into main from 2026/01/file_record_updating
Conversation


@GBirkel GBirkel commented Jan 9, 2026

This is a lot to look through. I apologize. :D

This refactors the main ingestion function:

  • It now requires one input folder and a list of files assumed to be somewhere within that folder.
  • It now accepts optional credentials for connecting to the Dataset Tracker.
  • It looks in the provided file list for an als-dataset-metadata.json file and imports it if possible. (If this is a re-ingestion, the file may be of the form als-dataset-metadata-_____.json, where _____ is the ID of the Dataset Tracker record for this dataset.)
  • If the file exists, it does some sanity checking. (Does this dataset already appear to have been imported? Does the manifest resemble the incoming file list?)
  • Once ingestion is done, it writes an updated als-dataset-metadata-_____.json file out. (If the Dataset Tracker is not being used, there is no suffix. This feature can also be turned off.)
  • When a SciCat ID is available, it connects to the Dataset Tracker and creates a series of records identifying the dataset and its location (an "instance record"). If it's doing a re-import and a Dataset Tracker record ID was already present in a pre-existing als-dataset-metadata.json, it intelligently updates those records to match any changes in the manifest. (This serves two purposes: 1. datasets that a user has revised and wants to re-ingest, and 2. a near-future scenario where we register things in the Dataset Tracker as soon as they reach Beegfs, and attempt the SciCat ingest in a separate step.)
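
The metadata-file lookup described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from this PR; the pattern, function name, and return shape are assumptions based on the filename convention described (als-dataset-metadata.json, or als-dataset-metadata-_____.json when a Dataset Tracker record ID exists):

```python
import re

# Assumed filename convention from the PR description: the tracker ID suffix
# is optional and absent on first ingestion.
METADATA_PATTERN = re.compile(r"als-dataset-metadata(?:-(?P<tracker_id>[^.]+))?\.json$")

def find_metadata_file(file_list):
    """Return (path, tracker_id) for the first metadata file in the list,
    or (None, None) if the list contains no metadata file."""
    for path in file_list:
        match = METADATA_PATTERN.search(path)
        if match:
            return path, match.group("tracker_id")
    return None, None
```

For example, a re-ingested dataset whose list contains "raw/als-dataset-metadata-abc123.json" would yield the tracker ID "abc123", while a plain "als-dataset-metadata.json" yields no ID.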

This branch standardizes the ingesters on creating one dataset in SciCat per ingestion; if we want to create many, we should invoke the function multiple times, passing in a different file list each time.
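
The one-dataset-per-call convention implies a driver like the following. This is a hypothetical sketch; the function names and signature are illustrative, not the PR's actual API:

```python
# Illustrative only: run the single-dataset ingest function once per file
# list, so each call produces exactly one SciCat dataset.
def ingest_all(ingest_one, input_folder, file_lists, credentials=None):
    """Call ingest_one(input_folder, file_list, credentials=...) per dataset."""
    return [
        ingest_one(input_folder, file_list, credentials=credentials)
        for file_list in file_lists
    ]
```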

Note: The beamline-specific ingester modules are all classes now and derive from a common base class. So far, 7.3.3 and 8.3.2 have been rewritten this way; we'll convert the others as needed.
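
The class-based layout might look something like this. The class and method names here are invented for illustration and are not the actual classes in scicat_beamline:

```python
from abc import ABC, abstractmethod

class BeamlineIngester(ABC):
    """Hypothetical base class: shared pipeline steps live here, and each
    beamline overrides only its metadata extraction."""

    @abstractmethod
    def extract_scientific_metadata(self, file_list):
        """Beamline-specific metadata extraction from the file manifest."""

    def ingest(self, file_list):
        # Shared step: gather beamline metadata, then assemble the dataset.
        metadata = self.extract_scientific_metadata(file_list)
        return {"scientificMetadata": metadata, "files": list(file_list)}

class Ingester733(BeamlineIngester):
    """Illustrative 7.3.3 subclass."""

    def extract_scientific_metadata(self, file_list):
        return {"beamline": "7.3.3", "file_count": len(file_list)}
```

The point of the pattern is that the wrapper can treat every beamline uniformly through ingest(), while the per-beamline quirks stay in one overridable method.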

Primary points of interest:

The rewritten process in https://github.com/als-computing/scicat_beamline/pull/30/changes#diff-8b15f38de565a134fe2d43f70b02e5fcbfb438790fa7219e4e201aac32890212.

The (highly suspicious) metadata code in the 7.3.3 ingester. It was highly suspicious before, and it still is.

GBirkel and others added 27 commits January 5, 2026 13:08
…ng/scicat_beamline into 2026/01/file_record_updating
… folder, exit. If we're given exactly one folder, hand it to the iterator for the given spec. If we're given no folders but some number of files, pass them directly to the ingestor (and assume it will handle them).
…ataset-metadata.json file, and expecting metadata back from the ingesters.
…till some internal changes to make (e.g. temp_dir).
…mmon path, and a file list. Looking for and creating a manifest file. Instantiating the Dataset Tracker client and using it to construct and/or update records. Still a few things to fix: File creation, and manifest building.
…to "time:2000-01-02T12:23:45" and then splitting on ":" and expecting exactly two entries would result in dropping a perfectly usable time value.
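
The time-splitting bug described in the commit above can be illustrated in a few lines. The input string is taken from the commit message; the fix shown (limiting the split count) is one straightforward way to address it, not necessarily the exact change in the PR:

```python
line = "time:2000-01-02T12:23:45"

# Buggy approach: split on every ":" and expect exactly two fields.
# The ISO timestamp itself contains colons, so this yields four parts,
# and code that requires len(parts) == 2 drops a perfectly usable value.
parts = line.split(":")
assert len(parts) == 4

# One possible fix: split only on the first ":" so the rest of the
# value survives intact.
key, value = line.split(":", 1)
assert key == "time"
assert value == "2000-01-02T12:23:45"
```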
… problem: It apparently looks for a "derived" subfolder and creates multiple datasets based on the contents, linking them to the first. This thoroughly breaks our current workflow. Not sure whether we want to even allow the creation of derived data at the beamline, but I'm assuming we do. So this means refactoring the main ingest function yet again, to make a space for multiple separate rounds of ingestion and Dataset Tracker registering, and it ALSO breaks our reliance on "als-dataset-metadata.json" always having a known name, because a pile of files in a "derived" folder may result in multiple metadata files written to the same place. We could give them all unique postfixes based on the Dataset Tracker slug, which would "solve" the problem, but it would look ugly.
…g a name that reflects the presence of a Dataset Tracker ID.
…e getting any useful metadata out of this .txt file ingestion process at all. I hope I can have a conversation with Damon about this at some point.
@GBirkel GBirkel changed the title WIP: File movement and location tracking Main ingester wrapper refactoring, introducing metadata file read/writing and Dataset Tracker integration. Jan 29, 2026
@GBirkel GBirkel requested a review from davramov January 29, 2026 02:46
@GBirkel GBirkel self-assigned this Jan 29, 2026
… afterwards (if we write one). We also look for a metadata file in the incoming list, but do not look in the filesystem for it. Unless: We have been given no files, in which case we build a manifest from the given path. We'll look for a metadata file in there. This makes the ingesters require a file manifest, instead of doing their own directory crawling. Our intention is to make it clear that if we're given a folder (and no file list), then everything in that folder is considered part of the dataset, whether or not it's used for SciCat ingestion. Still need to account for adding (if needed) the metadata file to its own manifest.
…es" to report stuff, just putting it in the logfile.
…ing "issues" to report stuff, just putting it in the logfile. Still need to do the 'spot' version. Needs testing.
…Prefect-based test that doesn't fake the client calls.
…class like I wanted to months ago. Now 733 and 832 have class-based ingesters. The rest can be converted as we go.