
Parallel updates #418

@aldenks

Description


Make dataset updates parallel across batches of data variables, similar to how we parallelize backfills.

We already have get_jobs machinery that could produce multiple region jobs for an update.

The challenge is that certain steps need to happen before and after the main region job processing, namely:

  • For a standard zarr v3 dataset: only write the updated template metadata after all region jobs finish processing
  • For an icechunk zarr:
  • Before any region jobs: create & checkout a new temp branch, commit the resize, start a fresh icechunk session, fork it, and serialize (pickle) it to object storage that each region job can access.
    • Each (parallel) region job loads the pickled session, does its normal writing of chunk data, and then writes the updated session back to a key in object storage that includes its worker index.
  • Finally, something waits for all region jobs to finish, reads and merges all their sessions, and makes an icechunk commit.
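The icechunk steps above boil down to fork → parallel writes → merge. A minimal sketch of that flow, with a plain dict standing in for object storage and a toy Session class in place of icechunk's (the shape of `merge` here is an assumption for illustration, not icechunk's actual API):

```python
import pickle

# Toy stand-ins: `store` plays the role of object storage, and Session
# mimics the rough shape of an icechunk session (hypothetical, simplified).
store: dict[str, bytes] = {}

class Session:
    def __init__(self) -> None:
        self.writes: list[str] = []

    def merge(self, other: "Session") -> None:
        # In icechunk, merging reconciles changesets; here we just
        # concatenate the recorded writes.
        self.writes.extend(other.writes)

# --- Before any region jobs: fork the session and pickle it to storage ---
coordinator = Session()
store["session/initial"] = pickle.dumps(coordinator)

# --- Each (parallel) region job: load the pickled session, write its
# --- chunk data, and store the session back under its worker index ---
def region_job(worker_index: int) -> None:
    session = pickle.loads(store["session/initial"])
    session.writes.append(f"chunks-for-region-{worker_index}")
    store[f"session/worker-{worker_index}"] = pickle.dumps(session)

for i in range(3):  # in production these would run as parallel jobs
    region_job(i)

# --- Finally: read and merge all worker sessions, then commit ---
for i in range(3):
    coordinator.merge(pickle.loads(store[f"session/worker-{i}"]))
print(sorted(coordinator.writes))
```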

In short, we need a way to do some work before the region jobs' main work, and other work after every parallel process has completed its region job.

I’d like to avoid adding new infrastructure dependencies beyond Kubernetes / Docker.
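One way to get the "after" step without new infrastructure, consistent with that constraint, is a finalizer process that polls object storage for per-worker done markers. A sketch, where `list_keys` is a hypothetical wrapper around the store's list operation:

```python
import time

def wait_for_all_workers(list_keys, n_workers: int, prefix: str,
                         poll_seconds: float = 30.0,
                         timeout: float = 3600.0) -> list[str]:
    """Block until every worker has written a done-marker under `prefix`.

    `list_keys(prefix)` is any callable returning the keys currently
    present in object storage under that prefix (e.g. a thin wrapper
    around an S3 list-objects call); it is a hypothetical helper here.
    """
    deadline = time.monotonic() + timeout
    done: list[str] = []
    while time.monotonic() < deadline:
        done = [k for k in list_keys(prefix) if k.startswith(prefix)]
        if len(done) >= n_workers:
            return done  # all workers finished; finalizer can merge/commit
        time.sleep(poll_seconds)
    raise TimeoutError(f"only {len(done)}/{n_workers} workers finished")
```

The finalizer itself can just be one more Kubernetes job that runs this wait before doing the merge-and-commit (or metadata write) step.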

Status: Done