
Parallel updates #418

@aldenks

Description


Make dataset updates parallel across batches of data variables, similar to how we parallelize backfills.

We already have get_jobs machinery that could produce multiple region jobs for an update.

The challenge is that certain steps need to happen before and after the main region job processing, namely:

  • For a standard zarr v3 dataset: only write the updated template metadata after all region jobs finish processing
  • For an icechunk zarr:
  • Before any region jobs: create & checkout a new temp branch, commit the resize, start a fresh icechunk session, fork it, and serialize (pickle) it to object storage that each region job can access.
    • Each (parallel) region job loads the pickled session, does its normal writing of chunk data, and then writes the updated session back to a key in object storage that includes its worker index.
  • Finally, something waits for all region jobs to finish, reads and merges all their sessions, and makes an icechunk commit.
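The icechunk steps above boil down to fork → parallel writes → merge. A minimal sketch of that flow, with a plain dict standing in for object storage and a toy Session class in place of icechunk's (the shape of `merge` here is an assumption for illustration, not icechunk's actual API):

```python
import pickle

# Toy stand-ins: `store` plays the role of object storage, and Session
# mimics the rough shape of an icechunk session (hypothetical, simplified).
store: dict[str, bytes] = {}

class Session:
    def __init__(self) -> None:
        self.writes: list[str] = []

    def merge(self, other: "Session") -> None:
        # In icechunk, merging reconciles changesets; here we just
        # concatenate the recorded writes.
        self.writes.extend(other.writes)

# --- Before any region jobs: fork the session and pickle it to storage ---
coordinator = Session()
store["session/initial"] = pickle.dumps(coordinator)

# --- Each (parallel) region job: load the pickled session, write its
# --- chunk data, and store the session back under its worker index ---
def region_job(worker_index: int) -> None:
    session = pickle.loads(store["session/initial"])
    session.writes.append(f"chunks-for-region-{worker_index}")
    store[f"session/worker-{worker_index}"] = pickle.dumps(session)

for i in range(3):  # in production these would run as parallel jobs
    region_job(i)

# --- Finally: read and merge all worker sessions, then commit ---
for i in range(3):
    coordinator.merge(pickle.loads(store[f"session/worker-{i}"]))
print(sorted(coordinator.writes))
```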

In short, we need a way to do some work before the region jobs' main work, and other work after every parallel process has completed its region job.

I’d like to avoid adding new infrastructure dependencies beyond Kubernetes / Docker.
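One way to get the "after" step without new infrastructure, consistent with that constraint, is a finalizer process that polls object storage for per-worker done markers. A sketch, where `list_keys` is a hypothetical wrapper around the store's list operation:

```python
import time

def wait_for_all_workers(list_keys, n_workers: int, prefix: str,
                         poll_seconds: float = 30.0,
                         timeout: float = 3600.0) -> list[str]:
    """Block until every worker has written a done-marker under `prefix`.

    `list_keys(prefix)` is any callable returning the keys currently
    present in object storage under that prefix (e.g. a thin wrapper
    around an S3 list-objects call); it is a hypothetical helper here.
    """
    deadline = time.monotonic() + timeout
    done: list[str] = []
    while time.monotonic() < deadline:
        done = [k for k in list_keys(prefix) if k.startswith(prefix)]
        if len(done) >= n_workers:
            return done  # all workers finished; finalizer can merge/commit
        time.sleep(poll_seconds)
    raise TimeoutError(f"only {len(done)}/{n_workers} workers finished")
```

The finalizer itself can just be one more Kubernetes job that runs this wait before doing the merge-and-commit (or metadata write) step.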

Status: Done