Main ingester wrapper refactoring, introducing metadata file read/writing and Dataset Tracker integration.#30

Open
GBirkel wants to merge 58 commits into main from 2026/01/file_record_updating
Conversation


@GBirkel GBirkel commented Jan 9, 2026

This is a lot to look through. I apologize. :D

This refactors the main ingestion function:

  • It now requires one input folder and a list of files assumed to be somewhere within that folder.
  • It now accepts optional credentials for connecting to the Dataset Tracker.
  • It looks in the provided file list for an als-dataset-metadata.json file and imports it if possible. (If this is a re-ingestion, the file may be of the form als-dataset-metadata-_____.json, where _____ is the ID of the Dataset Tracker record for this dataset.)
  • If the file exists, it does some sanity checking. (Does this dataset already appear to have been imported? Does the manifest resemble the incoming file list?)
  • Once ingestion is done, it writes an updated als-dataset-metadata-_____.json file out. (If the Dataset Tracker is not being used, there is no suffix. This feature can also be turned off.)
  • When a SciCat ID is available, it connects to the Dataset Tracker and creates a series of records identifying the dataset and its location (an "instance record"). If it's doing a re-import and a Dataset Tracker record ID was already present in a pre-existing als-dataset-metadata.json, it intelligently updates those records to match any changes in the manifest. (This serves two purposes: 1. datasets that a user has revised and wants to re-ingest, and 2. a near-future scenario where we register things in the Dataset Tracker as soon as they reach Beegfs, and attempt the SciCat ingest in a separate step.)
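
The metadata-file lookup described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from this PR; the pattern, function name, and return shape are assumptions based on the filename convention described (als-dataset-metadata.json, or als-dataset-metadata-_____.json when a Dataset Tracker record ID exists):

```python
import re

# Assumed filename convention from the PR description: the tracker ID suffix
# is optional and absent on first ingestion.
METADATA_PATTERN = re.compile(r"als-dataset-metadata(?:-(?P<tracker_id>[^.]+))?\.json$")

def find_metadata_file(file_list):
    """Return (path, tracker_id) for the first metadata file in the list,
    or (None, None) if the list contains no metadata file."""
    for path in file_list:
        match = METADATA_PATTERN.search(path)
        if match:
            return path, match.group("tracker_id")
    return None, None
```

For example, a re-ingested dataset whose list contains "raw/als-dataset-metadata-abc123.json" would yield the tracker ID "abc123", while a plain "als-dataset-metadata.json" yields no ID.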

This branch standardizes the ingesters on creating one dataset in SciCat per ingestion; if we want to create many, we should invoke the function multiple times, passing in a different file list each time.
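
The one-dataset-per-call convention implies a driver like the following. This is a hypothetical sketch; the function names and signature are illustrative, not the PR's actual API:

```python
# Illustrative only: run the single-dataset ingest function once per file
# list, so each call produces exactly one SciCat dataset.
def ingest_all(ingest_one, input_folder, file_lists, credentials=None):
    """Call ingest_one(input_folder, file_list, credentials=...) per dataset."""
    return [
        ingest_one(input_folder, file_list, credentials=credentials)
        for file_list in file_lists
    ]
```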

Note: The beamline-specific ingester modules are all classes now and derive from a common base class. So far, 7.3.3 and 8.3.2 have been rewritten this way; we'll convert the others as needed.
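
The class-based layout might look something like this. The class and method names here are invented for illustration and are not the actual classes in scicat_beamline:

```python
from abc import ABC, abstractmethod

class BeamlineIngester(ABC):
    """Hypothetical base class: shared pipeline steps live here, and each
    beamline overrides only its metadata extraction."""

    @abstractmethod
    def extract_scientific_metadata(self, file_list):
        """Beamline-specific metadata extraction from the file manifest."""

    def ingest(self, file_list):
        # Shared step: gather beamline metadata, then assemble the dataset.
        metadata = self.extract_scientific_metadata(file_list)
        return {"scientificMetadata": metadata, "files": list(file_list)}

class Ingester733(BeamlineIngester):
    """Illustrative 7.3.3 subclass."""

    def extract_scientific_metadata(self, file_list):
        return {"beamline": "7.3.3", "file_count": len(file_list)}
```

The point of the pattern is that the wrapper can treat every beamline uniformly through ingest(), while the per-beamline quirks stay in one overridable method.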

Primary points of interest:

The rewritten process in https://github.com/als-computing/scicat_beamline/pull/30/changes#diff-8b15f38de565a134fe2d43f70b02e5fcbfb438790fa7219e4e201aac32890212.

The (highly suspicious) metadata code in the 7.3.3 ingester. It was highly suspicious before, and it still is.

GBirkel and others added 27 commits January 5, 2026 13:08
…ng/scicat_beamline into 2026/01/file_record_updating
… folder, exit. If we're given exactly one folder, hand it to the iterator for the given spec. If we're given no folders but some number of files, pass them directly to the ingestor (and assume it will handle them).
…ataset-metadata.json file, and expecting metadata back from the ingesters.
…till some internal changes to make (e.g. temp_dir).
…mmon path, and a file list. Looking for and creating a manifest file. Instantiating the Dataset Tracker client and using it to construct and/or update records. Still a few things to fix: File creation, and manifest building.
…to "time:2000-01-02T12:23:45" and then splitting on ":" and expecting exactly two entries would result in dropping a perfectly usable time value.
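
The time-splitting bug described in the commit above can be illustrated in a few lines. The input string is taken from the commit message; the fix shown (limiting the split count) is one straightforward way to address it, not necessarily the exact change in the PR:

```python
line = "time:2000-01-02T12:23:45"

# Buggy approach: split on every ":" and expect exactly two fields.
# The ISO timestamp itself contains colons, so this yields four parts,
# and code that requires len(parts) == 2 drops a perfectly usable value.
parts = line.split(":")
assert len(parts) == 4

# One possible fix: split only on the first ":" so the rest of the
# value survives intact.
key, value = line.split(":", 1)
assert key == "time"
assert value == "2000-01-02T12:23:45"
```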
… problem: It apparently looks for a "derived" subfolder and creates multiple datasets based on the contents, linking them to the first. This thoroughly breaks our current workflow. Not sure whether we want to even allow the creation of derived data at the beamline, but I'm assuming we do. So this means refactoring the main ingest function yet again, to make a space for multiple separate rounds of ingestion and Dataset Tracker registering, and it ALSO breaks our reliance on "als-dataset-metadata.json" always having a known name, because a pile of files in a "derived" folder may result in multiple metadata files written to the same place. We could give them all unique postfixes based on the Dataset Tracker slug, which would "solve" the problem, but it would look ugly.
…g a name that reflects the presence of a Dataset Tracker ID.
…e getting any useful metadata out of this .txt file ingestion process at all. I hope I can have a conversation with Damon about this at some point.
@GBirkel GBirkel changed the title WIP: File movement and location tracking Main ingester wrapper refactoring, introducing metadata file read/writing and Dataset Tracker integration. Jan 29, 2026
@GBirkel GBirkel requested a review from davramov January 29, 2026 02:46
@GBirkel GBirkel self-assigned this Jan 29, 2026
… afterwards (if we write one). We also look for a metadata file in the incoming list, but do not look in the filesystem for it. Unless: We have been given no files, in which case we build a manifest from the given path. We'll look for a metadata file in there. This makes the ingesters require a file manifest, instead of doing their own directory crawling. Our intention is to make it clear that if we're given a folder (and no file list), then everything in that folder is considered part of the dataset, whether or not it's used for SciCat ingestion. Still need to account for adding (if needed) the metadata file to its own manifest.
…es" to report stuff, just putting it in the logfile.
…ing "issues" to report stuff, just putting it in the logfile. Still need to do the 'spot' version. Needs testing.
…Prefect-based test that doesn't fake the client calls.
…class like I wanted to months ago. Now 733 and 832 have class-based ingesters. The rest can be converted as we go.