HCSDataModule: BadZipFile CRC error reading OZX with num_workers > 0 (fork-shared file descriptor race)
Summary
HCSDataModule cannot read .ozx (RFC-9 packed OME-Zarr) stores when num_workers > 0. The first batch fails with:
zipfile.BadZipFile: Caught BadZipFile in DataLoader worker process 0.
Original Traceback (most recent call last):
...
File "iohub/core/implementations/zarr_python.py", line 112, in read_oindex
File "zarr/storage/_zip.py", line 158, in _get
File "zipfile/__init__.py", line 1033, in _update_crc
zipfile.BadZipFile: Bad CRC-32 for file 'B/1/000000/0/c/0/0/0/0/0'
The OZX file itself is fine — zipfile.testzip() walks all entries with no errors. The same predict run succeeds with num_workers=0. The CRC failure is a runtime read race, not disk corruption.
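The integrity check referenced above is just the standard-library walk; a minimal sketch with a placeholder path:
import zipfile

# testzip() re-reads every member and returns the name of the first entry
# whose CRC fails, or None if the whole archive is intact.
with zipfile.ZipFile("/path/to/plate.ozx") as zf:
    assert zf.testzip() is None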
Reproducer
PLATE=2024_11_07
uv run dynacell predict \
-c applications/dynacell/configs/benchmarks/virtual_staining/er/fnet3d_paper/ipsc_confocal/predict__a549_mantis_${PLATE}.yml \
--data.init_args.num_workers 4
(Same leaf with num_workers=0 — the leaf default — completes successfully.)
Root cause
SlidingWindowDataset.__init__ calls _get_windows() (packages/viscy-data/src/viscy_data/sliding_window.py:118-140), which reads img_arr.frames / img_arr.slices for every position. That metadata access triggers OzxStore._sync_open() in the parent process and stores a live zipfile.ZipFile as self._zf.
When DataLoader then forks workers (Linux default), every worker inherits the same OS-level file descriptor and seek state. Concurrent read(...) calls from different workers stomp on the shared seek pointer; the bytes that land in the read buffer don't match the central directory's CRC for the requested entry, surfacing as BadZipFile: Bad CRC-32 for file '<chunk-path>'.
This is a generic "zipfile is not fork-safe" pattern (same class as h5py + fork). Confirmed it's not file corruption: zipfile.testzip() returns OK on the same .ozx (391 entries walked, no bad CRC). The error reproduces only when DataLoader workers > 0.
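The same failure mode reproduces outside viscy with nothing but the standard library. A minimal sketch (Linux-only; synthetic archive and names, not the viscy code path): open one zipfile.ZipFile in the parent, fork several readers, and let them read concurrently; CRC errors appear intermittently because all processes share one file descriptor and seek position.
import multiprocessing as mp
import os
import zipfile

ARCHIVE = "/tmp/fork_race_demo.zip"  # synthetic fixture, not an .ozx

def build_archive() -> None:
    with zipfile.ZipFile(ARCHIVE, "w") as zf_out:
        for i in range(64):
            zf_out.writestr(f"chunk/{i}", os.urandom(64 * 1024))

zf = None  # opened in the parent below, inherited by forked children

def reader(worker_id: int) -> None:
    # Each forked child holds a copy of `zf` wrapping the *same* OS fd and
    # seek offset as the parent; concurrent reads interleave seeks and reads,
    # intermittently raising zipfile.BadZipFile: Bad CRC-32 for file '...'.
    for name in zf.namelist() * 20:
        zf.read(name)

if __name__ == "__main__":
    build_archive()
    zf = zipfile.ZipFile(ARCHIVE)  # parent-side open, analogous to OzxStore._sync_open()
    ctx = mp.get_context("fork")   # the Linux DataLoader default
    procs = [ctx.Process(target=reader, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()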
Affected workflows
The race is in the worker __getitem__ path. SlidingWindowDataset._read_img_window (packages/viscy-data/src/viscy_data/sliding_window.py:198-210) branches:
preloaded = _preloaded if _preloaded is not None else self._preloaded
if preloaded is not None and arr_idx >= 0:
    data = preloaded[arr_idx][t:t+1, :, z:z+self.z_window_size]...  # /dev/shm mmap
    return ...
data = img.oindex[...]  # zarr/OZX read — this is what races under fork
So whether the bug fires depends on whether mmap_preload is staging FOVs into /dev/shm:
| Stage | Reads OZX in worker? | Affected? |
| --- | --- | --- |
| predict, test | Always — mmap_preload is fit/validate-only | Yes |
| fit + mmap_preload=True | No — workers read from /dev/shm MemoryMappedTensor | No |
| fit + mmap_preload=False | Yes — direct OZX read in worker | Yes |
| validate (standalone) + mmap_preload=False | Yes | Yes |
| validate riding on fit + mmap_preload=True | No — same buffer, validate reads from preloaded views | No |
The OZX file descriptor is still opened in the parent during _get_windows() metadata access and inherited by every worker on fork — but with mmap_preload=True the workers never read from it, so the inherited fd just sits unused in each worker. No concurrent reads → no CRC race.
BatchedConcatDataModule / ConcatDataModule (joint training) is the same per child: safe iff every constituent enables mmap_preload.
Net effect on dynacell:
- Predict: always affected. Currently runs with the leaf default num_workers: 0, which throttles GPU utilization (~22% on A40, very spiky).
- Production training with mmap_preload=True: not affected by this bug. The mmap-preload optimization sidesteps the OZX worker reads.
- Smoke / dev / debug runs with mmap_preload=False: affected the same way as predict. Same goes for any standalone validate invocation that doesn't share a fit's preload buffer.
iohub's OZX support is wired into open_ome_zarr via is_ozx_path/OzxStore (czbiohub-sf/iohub PR #408, currently pinned in this repo's root pyproject.toml under [tool.uv.sources]).
Proposed fixes (in order of effort)
A. multiprocessing_context='spawn' for the DataLoader (recommended starting point)
zarr.storage.ZipStore already implements pickle support — _zip.py:109-119 strips _zf and _lock before pickle, and __setstate__ reopens fresh per process. So spawn-mode workers receive the dataset, drop the parent's open ZipFile during pickle, and re-open their own fd inside the worker. Each worker has an independent file descriptor → no shared seek state → no CRC race.
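As a sanity check of that property, a minimal sketch (assuming the zarr-python v3 pickle hooks described above; path is a placeholder):
import pickle
from zarr.storage import ZipStore

store = ZipStore("/path/to/plate.ozx", mode="r")
# __getstate__ drops the live ZipFile and lock; __setstate__ in the receiving
# process re-opens the archive, so the copy ends up with its own descriptor.
clone = pickle.loads(pickle.dumps(store))
assert clone is not store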
Change is ~5 lines in HCSDataModule._loader_kwargs:
kw = {
    "batch_size": self.batch_size,
    "num_workers": self.num_workers,
    "pin_memory": self.pin_memory,
    "persistent_workers": self.persistent_workers,
}
if self.num_workers > 0:
    # spawn: each worker unpickles the dataset and re-opens its own ZipStore fd
    kw["multiprocessing_context"] = "spawn"
Same change needed in combined.py (ConcatDataModule / BatchedConcatDataModule / CachedConcatDataModule).
Trade-offs:
- One-time ~1-3 s startup per worker (Python interpreter + viscy/torch/lightning import). With persistent_workers=True this is paid once per epoch boundary.
- Not the same as Lightning's ddp_spawn strategy warning — that's about GPU rank processes; we keep strategy='ddp'. DataLoader's multiprocessing_context is independent.
- Filesystem caches (/dev/shm mmap_preload, /tmp scratch) survive: page cache is shared across processes; re-mmap is O(1) page-table setup, no data copy.
- Gating concern: mmap_preload passes preloaded_fovs (slices into a tensordict.MemoryMappedTensor buffer) into the dataset. If a slice pickles as a regular tensor (by value) instead of a path reference, each worker would receive a full data copy → defeats the optimization. Need a pickle-size measurement on a real mmap_preload=True datamodule before claiming spawn is free for training.
B. Stay on fork, reset the ZipFile in a worker_init_fn
import torch

def _ozx_worker_init(worker_id: int) -> None:
    info = torch.utils.data.get_worker_info()
    for fov in info.dataset.positions:
        store = fov.zgroup.store
        store._is_open = False
        store._zf = None  # next read triggers _sync_open in this worker
Pass worker_init_fn=_ozx_worker_init to the DataLoader. Keeps fork (zero startup cost, preserves all in-memory state including mmap_preload views via copy-on-write). Fragile against iohub/zarr internal layout changes.
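Wiring it in mirrors (A), e.g. in the same kw dict built by _loader_kwargs (sketch):
if self.num_workers > 0:
    kw["worker_init_fn"] = _ozx_worker_init  # per-worker fd reset instead of spawn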
C. Lazy open in __getitem__ — refactor the dataset to never hold live Position objects
Store (plate_path, position_name) tuples; open (and cache per worker) on first __getitem__. Most invasive, eliminates the fork/spawn distinction entirely, cleanest long-term.
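A rough shape of that refactor (illustrative only: the class name, the key bookkeeping, and the path-style plate[position_name] lookup are assumptions to check against the real iohub/viscy APIs):
from iohub import open_ome_zarr

class _LazyPositions:
    """Per-process cache of open plates; nothing is opened until the first
    lookup inside a given worker, so no live ZipFile crosses fork/spawn."""

    def __init__(self, keys: list[tuple[str, str]]) -> None:
        self._keys = keys              # (plate_path, position_name) tuples
        self._plates: dict[str, object] = {}

    def position(self, idx: int):
        plate_path, position_name = self._keys[idx]
        if plate_path not in self._plates:
            self._plates[plate_path] = open_ome_zarr(plate_path, mode="r")
        # Assumes path-style access ("B/1/000000") resolves to a Position;
        # swap for plate[row][col][fov] if the wrapper does not forward paths.
        return self._plates[plate_path][position_name]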
Verification gate
Before landing (A): pickle-size measurement on an mmap_preload=True datamodule.
import pickle, time
dm = HCSDataModule(data_path=..., mmap_preload=True, scratch_dir="/dev/shm/...", ...)
dm.prepare_data(); dm.setup("fit")
t0 = time.perf_counter(); blob = pickle.dumps(dm.train_dataset); t1 = time.perf_counter()
print(f"pickle size = {len(blob)/1e6:.2f} MB, time = {t1-t0:.2f}s")
- KB-MB → spawn is safe; tensors pickled as path references.
- GB-scale → spawn defeats mmap_preload; fall back to (B).
Suggested plan
- Run the pickle measurement on one mmap_preload-enabled fit datamodule. Dataset graphs that hold MemoryMappedTensor views need to pickle as path references (KB-MB), not by-value (GB).
- If pickle is clean: ship (A) with persistent_workers=True for the predict and mmap_preload=False fit/validate code paths; add a regression test that builds a small in-memory .ozx fixture and exercises num_workers>0.
- If pickle bloats: ship (B) — fork is preserved, mmap_preload keeps its current cost profile, and predict / non-preloaded fit gain a per-worker fd reset.
- Either way, file a follow-up issue for (C) — it's the right architecture independent of fork/spawn and would also let us drop the mmap_preload-as-OZX-workaround framing.
Priority: predict is the immediate blocker (no preload escape hatch). Production training with mmap_preload=True is not on fire today, but the bug still exists for any path that doesn't preload (smoke leaves, dev iteration, standalone validate), so the fix should cover both.
Out of scope
- iohub itself (the underlying OzxStore already supports lazy open + pickle correctly via inheritance from zarr's ZipStore).
- Lightning DDP strategy='ddp_spawn' — different process tree.
- Disk-corruption diagnostics — testzip() already cleared the file.