HCSDataModule: BadZipFile CRC error reading OZX with num_workers > 0 (fork-shared file descriptor race) #417

@alxndrkalinin

Summary

HCSDataModule cannot read .ozx (RFC-9 packed OME-Zarr) stores when num_workers > 0. The first batch fails with:

zipfile.BadZipFile: Caught BadZipFile in DataLoader worker process 0.
Original Traceback (most recent call last):
  ...
  File "iohub/core/implementations/zarr_python.py", line 112, in read_oindex
  File "zarr/storage/_zip.py", line 158, in _get
  File "zipfile/__init__.py", line 1033, in _update_crc
zipfile.BadZipFile: Bad CRC-32 for file 'B/1/000000/0/c/0/0/0/0/0'

The OZX file itself is fine — zipfile.testzip() walks all entries with no errors. The same predict run succeeds with num_workers=0. The CRC failure is a runtime read race, not disk corruption.
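
For reference, the integrity check behind that claim (the path here is illustrative): zipfile.testzip() re-reads every member and returns the name of the first entry with a bad CRC, or None if all pass.

import zipfile

with zipfile.ZipFile("plate.ozx") as zf:  # path illustrative
    bad = zf.testzip()  # name of the first corrupt member, or None
    assert bad is None, f"corrupt entry: {bad}"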

Reproducer

PLATE=2024_11_07
uv run dynacell predict \
  -c applications/dynacell/configs/benchmarks/virtual_staining/er/fnet3d_paper/ipsc_confocal/predict__a549_mantis_${PLATE}.yml \
  --data.init_args.num_workers 4

(Same leaf with num_workers=0 — the leaf default — completes successfully.)

Root cause

SlidingWindowDataset.__init__ calls _get_windows() (packages/viscy-data/src/viscy_data/sliding_window.py:118-140), which reads img_arr.frames / img_arr.slices for every position. That metadata access triggers OzxStore._sync_open() in the parent process and stores a live zipfile.ZipFile as self._zf.

When DataLoader then forks workers (Linux default), every worker inherits the same OS-level file descriptor and seek state. Concurrent read(...) calls from different workers stomp on the shared seek pointer; the bytes that land in the read buffer don't match the central directory's CRC for the requested entry, surfacing as BadZipFile: Bad CRC-32 for file '<chunk-path>'.

This is a generic "zipfile is not fork-safe" pattern (same class as h5py + fork). Confirmed it's not file corruption: zipfile.testzip() returns OK on the same .ozx (391 entries walked, no bad CRC). The error reproduces only when DataLoader workers > 0.
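
For illustration, a standalone sketch of the same failure mode outside viscy/iohub (a hypothetical script, not project code; the race is timing-dependent, so a given run may not fail): a ZipFile opened before fork shares its OS-level fd and seek offset with every child, and each child's seek+read pair is not atomic across processes.

import multiprocessing as mp
import zipfile

def make_archive(path: str, n: int = 64) -> None:
    with zipfile.ZipFile(path, "w") as zf:
        for i in range(n):
            zf.writestr(f"chunk/{i}", bytes([i % 256]) * 65536)

def reader(zf: zipfile.ZipFile, n: int) -> None:
    for i in range(n):
        # another worker's seek can land between this worker's seek and
        # read on the shared fd, so the CRC check sees the wrong bytes
        zf.read(f"chunk/{i}")

if __name__ == "__main__":
    make_archive("/tmp/race.zip")
    zf = zipfile.ZipFile("/tmp/race.zip")  # opened in the parent, like _get_windows()
    ctx = mp.get_context("fork")           # the Linux DataLoader default
    workers = [ctx.Process(target=reader, args=(zf, 64)) for _ in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()  # typically at least one worker dies with BadZipFile: Bad CRC-32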

Affected workflows

The race is in the worker __getitem__ path. SlidingWindowDataset._read_img_window (packages/viscy-data/src/viscy_data/sliding_window.py:198-210) branches:

preloaded = _preloaded if _preloaded is not None else self._preloaded
if preloaded is not None and arr_idx >= 0:
    data = preloaded[arr_idx][t:t+1, :, z:z+self.z_window_size]...   # /dev/shm mmap
    return ...
data = img.oindex[...]   # zarr/OZX read — this is what races under fork

So whether the bug fires depends on whether mmap_preload is staging FOVs into /dev/shm:

Stage                                       | Reads OZX in worker?                                | Affected?
--------------------------------------------|-----------------------------------------------------|----------
predict, test                               | Always — mmap_preload is fit/validate-only          | Yes
fit + mmap_preload=True                     | No — workers read from /dev/shm MemoryMappedTensor  | No
fit + mmap_preload=False                    | Yes — direct OZX read in worker                     | Yes
validate (standalone) + mmap_preload=False  | Yes                                                 | Yes
validate riding on fit + mmap_preload=True  | No — validate reads from the same preloaded views   | No

The OZX file descriptor is still opened in the parent during _get_windows() metadata access and inherited by every worker on fork — but with mmap_preload=True the workers never read from it, so the inherited fd just sits unused in each worker. No concurrent reads → no CRC race.

BatchedConcatDataModule / ConcatDataModule (joint training) behaves the same way for each child dataset: safe iff every constituent enables mmap_preload.

Net effect on dynacell:

  • Predict: always affected. Currently runs with the leaf default num_workers: 0, which throttles GPU utilization (~22% on A40, very spiky).
  • Production training with mmap_preload=True: not affected by this bug. The mmap-preload optimization sidesteps the OZX worker reads.
  • Smoke / dev / debug runs with mmap_preload=False: affected the same way as predict. Same goes for any standalone validate invocation that doesn't share a fit's preload buffer.

iohub's OZX support is wired into open_ome_zarr via is_ozx_path/OzxStore (czbiohub-sf/iohub PR #408, currently pinned in this repo's root pyproject.toml under [tool.uv.sources]).

Proposed fixes (in order of effort)

A. multiprocessing_context='spawn' for the DataLoader (recommended starting point)

zarr.storage.ZipStore already implements pickle support — _zip.py:109-119 strips _zf and _lock before pickle, and __setstate__ reopens fresh per process. So spawn-mode workers receive the dataset, drop the parent's open ZipFile during pickle, and re-open their own fd inside the worker. Each worker has an independent file descriptor → no shared seek state → no CRC race.
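
In simplified form, the pattern looks like this (a sketch, not zarr's actual class; the attribute names follow the description above):

import threading

class ReopeningZipStore:
    """Sketch of the reopen-on-unpickle pattern zarr's ZipStore implements."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_zf", None)    # never pickle the live ZipFile
        state.pop("_lock", None)  # locks are unpicklable anyway
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._lock = threading.Lock()
        self._zf = None  # reopened lazily on first access in the new process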

Change is ~5 lines in HCSDataModule._loader_kwargs:

kw = {
    "batch_size": self.batch_size,
    "num_workers": self.num_workers,
    "pin_memory": self.pin_memory,
    "persistent_workers": self.persistent_workers,
}
if self.num_workers > 0:
    # spawn workers pickle the dataset, so each worker reopens its own
    # ZipFile instead of inheriting the parent's fd
    kw["multiprocessing_context"] = "spawn"

Same change needed in combined.py (ConcatDataModule / BatchedConcatDataModule / CachedConcatDataModule).

Trade-offs:

  • One-time ~1-3 s startup per worker (Python interpreter + viscy/torch/lightning import). With persistent_workers=True this is paid once per run rather than at every epoch boundary.
  • Not the same as Lightning's ddp_spawn strategy warning — that's about GPU rank processes; we keep strategy='ddp'. DataLoader's multiprocessing_context is independent.
  • Filesystem caches (/dev/shm mmap_preload, /tmp scratch) survive: page cache is shared across processes; re-mmap is O(1) page-table setup, no data copy.
  • Gating concern: mmap_preload passes preloaded_fovs (slices into a tensordict.MemoryMappedTensor buffer) into the dataset. If a slice pickles as a regular tensor (by value) instead of a path reference, each worker would receive a full data copy → defeats the optimization. Need a pickle-size measurement on a real mmap_preload=True datamodule before claiming spawn is free for training.

B. Stay on fork, reset the ZipFile in a worker_init_fn

import torch.utils.data

def _ozx_worker_init(worker_id: int) -> None:
    # Runs in each worker after fork, before the first __getitem__.
    info = torch.utils.data.get_worker_info()
    for fov in info.dataset.positions:
        store = fov.zgroup.store
        # Private attributes per current iohub/zarr internals:
        store._is_open = False
        store._zf = None  # next read triggers _sync_open in this worker

Pass worker_init_fn=_ozx_worker_init to the DataLoader. Keeps fork (zero startup cost, preserves all in-memory state including mmap_preload views via copy-on-write). Fragile against iohub/zarr internal layout changes.

C. Lazy open in __getitem__ — refactor the dataset to never hold live Position objects

Store (plate_path, position_name) tuples; open (and cache per worker) on first __getitem__. Most invasive, eliminates the fork/spawn distinction entirely, cleanest long-term.
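
A minimal sketch of that shape, assuming iohub's open_ome_zarr entry point and "row/col/fov" plate indexing; the class and attribute names are illustrative, not viscy code:

from iohub import open_ome_zarr

class LazyPositionDataset:
    def __init__(self, plate_path: str, position_names: list[str]):
        self.plate_path = plate_path          # plain strings, pickle trivially
        self.position_names = position_names
        self._plate = None                    # per-process handle, never pickled

    def _position(self, name: str):
        if self._plate is None:  # first access in this process opens a fresh fd
            self._plate = open_ome_zarr(self.plate_path, mode="r")
        return self._plate[name]  # e.g. "B/1/000000"

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_plate"] = None  # drop the live handle; each worker reopens its own
        return state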

Verification gate

Before landing (A): pickle-size measurement on an mmap_preload=True datamodule.

import pickle
import time

dm = HCSDataModule(data_path=..., mmap_preload=True, scratch_dir="/dev/shm/...", ...)
dm.prepare_data()
dm.setup("fit")
t0 = time.perf_counter()
blob = pickle.dumps(dm.train_dataset)
t1 = time.perf_counter()
print(f"pickle size = {len(blob)/1e6:.2f} MB, time = {t1-t0:.2f}s")
  • KB-MB → spawn is safe; tensors pickled as path references.
  • GB-scale → spawn defeats mmap_preload; fall back to (B).

Suggested plan

  1. Run the pickle measurement on one mmap_preload-enabled fit datamodule. Dataset graphs that hold MemoryMappedTensor views need to pickle as path references (KB-MB), not by-value (GB).
  2. If pickle is clean: ship (A) with persistent_workers=True for the predict and mmap_preload=False fit/validate code paths; add a regression test that builds a small in-memory .ozx fixture and exercises num_workers>0 (one possible fixture shape is sketched after this list).
  3. If pickle bloats: ship (B) — fork is preserved, mmap_preload keeps its current cost profile, and predict / non-preloaded fit gain a per-worker fd reset.
  4. Either way, file a follow-up issue for (C) — it's the right architecture independent of fork/spawn and would also let us drop the mmap_preload-as-OZX-workaround framing.
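
One possible shape for the step-2 fixture, assuming OZX is a ZIP packing of the OME-Zarr store tree (per the summary above). The iohub HCS-writing calls follow its documented API as I understand it; the packing step, array shape, and names are assumptions, not the final test:

import zipfile
from pathlib import Path

import numpy as np
from iohub import open_ome_zarr

def make_ozx_fixture(tmp_path: Path) -> Path:
    # Write a one-position HCS plate, then pack the store tree into a ZIP.
    store = tmp_path / "plate.zarr"
    with open_ome_zarr(store, layout="hcs", mode="w", channel_names=["GFP"]) as plate:
        pos = plate.create_position("B", "1", "000000")
        pos.create_image("0", np.zeros((2, 1, 4, 32, 32), dtype=np.uint16))  # TCZYX
    ozx = tmp_path / "plate.ozx"
    with zipfile.ZipFile(ozx, "w", zipfile.ZIP_STORED) as zf:
        for f in sorted(store.rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(store))  # store-relative entry names
    return ozx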

Priority: predict is the immediate blocker (no preload escape hatch). Production training with mmap_preload=True is not on fire today, but the bug still exists for any path that doesn't preload (smoke leaves, dev iteration, standalone validate), so the fix should cover both.

Out of scope

  • iohub itself (the underlying OzxStore already supports lazy open + pickle correctly via inheritance from zarr's ZipStore).
  • Lightning DDP strategy='ddp_spawn' — different process tree.
  • Disk-corruption diagnostics — testzip() already cleared the file.
