HCSDataModule: BadZipFile CRC error reading OZX with num_workers > 0 (fork-shared file descriptor race)
Summary
HCSDataModule cannot read .ozx (RFC-9 packed OME-Zarr) stores when num_workers > 0. The first batch fails with:
zipfile.BadZipFile: Caught BadZipFile in DataLoader worker process 0.
Original Traceback (most recent call last):
...
File "iohub/core/implementations/zarr_python.py", line 112, in read_oindex
File "zarr/storage/_zip.py", line 158, in _get
File "zipfile/__init__.py", line 1033, in _update_crc
zipfile.BadZipFile: Bad CRC-32 for file 'B/1/000000/0/c/0/0/0/0/0'
The OZX file itself is fine — zipfile.testzip() walks all entries with no errors. The same predict run succeeds with num_workers=0. The CRC failure is a runtime read race, not disk corruption.
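The integrity check referenced above is just the standard-library walk; a minimal sketch with a placeholder path:
import zipfile

# testzip() re-reads every member and returns the name of the first entry
# whose CRC fails, or None if the whole archive is intact.
with zipfile.ZipFile("/path/to/plate.ozx") as zf:
    assert zf.testzip() is None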
Reproducer
PLATE=2024_11_07
uv run dynacell predict \
-c applications/dynacell/configs/benchmarks/virtual_staining/er/fnet3d_paper/ipsc_confocal/predict__a549_mantis_${PLATE}.yml \
--data.init_args.num_workers 4
(Same leaf with num_workers=0 — the leaf default — completes successfully.)
Root cause
SlidingWindowDataset.__init__ calls _get_windows() (packages/viscy-data/src/viscy_data/sliding_window.py:118-140), which reads img_arr.frames / img_arr.slices for every position. That metadata access triggers OzxStore._sync_open() in the parent process and stores a live zipfile.ZipFile as self._zf.
When DataLoader then forks workers (Linux default), every worker inherits the same OS-level file descriptor and seek state. Concurrent read(...) calls from different workers stomp on the shared seek pointer; the bytes that land in the read buffer don't match the central directory's CRC for the requested entry, surfacing as BadZipFile: Bad CRC-32 for file '<chunk-path>'.
This is a generic "zipfile is not fork-safe" pattern (same class as h5py + fork). Confirmed it's not file corruption: zipfile.testzip() returns OK on the same .ozx (391 entries walked, no bad CRC). The error reproduces only when DataLoader workers > 0.
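The same failure mode reproduces outside viscy with nothing but the standard library. A minimal sketch (Linux-only; synthetic archive and names, not the viscy code path): open one zipfile.ZipFile in the parent, fork several readers, and let them read concurrently; CRC errors appear intermittently because all processes share one file descriptor and seek position.
import multiprocessing as mp
import os
import zipfile

ARCHIVE = "/tmp/fork_race_demo.zip"  # synthetic fixture, not an .ozx

def build_archive() -> None:
    with zipfile.ZipFile(ARCHIVE, "w") as zf_out:
        for i in range(64):
            zf_out.writestr(f"chunk/{i}", os.urandom(64 * 1024))

zf = None  # opened in the parent below, inherited by forked children

def reader(worker_id: int) -> None:
    # Each forked child holds a copy of `zf` wrapping the *same* OS fd and
    # seek offset as the parent; concurrent reads interleave seeks and reads,
    # intermittently raising zipfile.BadZipFile: Bad CRC-32 for file '...'.
    for name in zf.namelist() * 20:
        zf.read(name)

if __name__ == "__main__":
    build_archive()
    zf = zipfile.ZipFile(ARCHIVE)  # parent-side open, analogous to OzxStore._sync_open()
    ctx = mp.get_context("fork")   # the Linux DataLoader default
    procs = [ctx.Process(target=reader, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()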
Affected workflows
The race is in the worker __getitem__ path. SlidingWindowDataset._read_img_window (packages/viscy-data/src/viscy_data/sliding_window.py:198-210) branches:
preloaded = _preloaded if _preloaded is not None else self._preloaded
if preloaded is not None and arr_idx >= 0:
    data = preloaded[arr_idx][t:t+1, :, z:z+self.z_window_size]...  # /dev/shm mmap
    return ...
data = img.oindex[...]  # zarr/OZX read — this is what races under fork
So whether the bug fires depends on whether mmap_preload is staging FOVs into /dev/shm:
| Stage | Reads OZX in worker? | Affected? |
| --- | --- | --- |
| predict, test | Always — mmap_preload is fit/validate-only | Yes |
| fit + mmap_preload=True | No — workers read from /dev/shm MemoryMappedTensor | No |
| fit + mmap_preload=False | Yes — direct OZX read in worker | Yes |
| validate (standalone) + mmap_preload=False | Yes | Yes |
| validate riding on fit + mmap_preload=True | No — same buffer, validate reads from preloaded views | No |
The OZX file descriptor is still opened in the parent during _get_windows() metadata access and inherited by every worker on fork — but with mmap_preload=True the workers never read from it, so the inherited fd just sits unused in each worker. No concurrent reads → no CRC race.
BatchedConcatDataModule / ConcatDataModule (joint training) is the same per child: safe iff every constituent enables mmap_preload.
Net effect on dynacell:
- Predict: always affected. Currently runs with the leaf default num_workers: 0, which throttles GPU utilization (~22% on A40, very spiky).
- Production training with mmap_preload=True: not affected by this bug. The mmap-preload optimization sidesteps the OZX worker reads.
- Smoke / dev / debug runs with mmap_preload=False: affected the same way as predict. Same goes for any standalone validate invocation that doesn't share a fit's preload buffer.
iohub's OZX support is wired into open_ome_zarr via is_ozx_path/OzxStore (czbiohub-sf/iohub PR #408, currently pinned in this repo's root pyproject.toml under [tool.uv.sources]).
Proposed fixes (in order of effort)
A. multiprocessing_context='spawn' for the DataLoader (recommended starting point)
zarr.storage.ZipStore already implements pickle support — _zip.py:109-119 strips _zf and _lock before pickle, and __setstate__ reopens fresh per process. So spawn-mode workers receive the dataset, drop the parent's open ZipFile during pickle, and re-open their own fd inside the worker. Each worker has an independent file descriptor → no shared seek state → no CRC race.
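As a sanity check of that property, a minimal sketch (assuming the zarr-python v3 pickle hooks described above; path is a placeholder):
import pickle
from zarr.storage import ZipStore

store = ZipStore("/path/to/plate.ozx", mode="r")
# __getstate__ drops the live ZipFile and lock; __setstate__ in the receiving
# process re-opens the archive, so the copy ends up with its own descriptor.
clone = pickle.loads(pickle.dumps(store))
assert clone is not store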
Change is ~5 lines in HCSDataModule._loader_kwargs:
kw = {
    "batch_size": self.batch_size,
    "num_workers": self.num_workers,
    "pin_memory": self.pin_memory,
    "persistent_workers": self.persistent_workers,
}
if self.num_workers > 0:
    # spawn: each worker unpickles the dataset and re-opens its own ZipStore fd
    kw["multiprocessing_context"] = "spawn"
Same change needed in combined.py (ConcatDataModule / BatchedConcatDataModule / CachedConcatDataModule).
Trade-offs:
- One-time ~1-3 s startup per worker (Python interpreter + viscy/torch/lightning import). With persistent_workers=True this is paid once per epoch boundary.
- Not the same as Lightning's ddp_spawn strategy warning — that's about GPU rank processes; we keep strategy='ddp'. DataLoader's multiprocessing_context is independent.
- Filesystem caches (/dev/shm mmap_preload, /tmp scratch) survive: page cache is shared across processes; re-mmap is O(1) page-table setup, no data copy.
- Gating concern: mmap_preload passes preloaded_fovs (slices into a tensordict.MemoryMappedTensor buffer) into the dataset. If a slice pickles as a regular tensor (by value) instead of a path reference, each worker would receive a full data copy → defeats the optimization. Need a pickle-size measurement on a real mmap_preload=True datamodule before claiming spawn is free for training.
B. Stay on fork, reset the ZipFile in a worker_init_fn
import torch

def _ozx_worker_init(worker_id: int) -> None:
    info = torch.utils.data.get_worker_info()
    for fov in info.dataset.positions:
        store = fov.zgroup.store
        store._is_open = False
        store._zf = None  # next read triggers _sync_open in this worker
Pass worker_init_fn=_ozx_worker_init to the DataLoader. Keeps fork (zero startup cost, preserves all in-memory state including mmap_preload views via copy-on-write). Fragile against iohub/zarr internal layout changes.
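Wiring it in mirrors (A), e.g. in the same kw dict built by _loader_kwargs (sketch):
if self.num_workers > 0:
    kw["worker_init_fn"] = _ozx_worker_init  # per-worker fd reset instead of spawn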
C. Lazy open in __getitem__ — refactor the dataset to never hold live Position objects
Store (plate_path, position_name) tuples; open (and cache per worker) on first __getitem__. Most invasive, eliminates the fork/spawn distinction entirely, cleanest long-term.
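A rough shape of that refactor (illustrative only: the class name, the key bookkeeping, and the path-style plate[position_name] lookup are assumptions to check against the real iohub/viscy APIs):
from iohub import open_ome_zarr

class _LazyPositions:
    """Per-process cache of open plates; nothing is opened until the first
    lookup inside a given worker, so no live ZipFile crosses fork/spawn."""

    def __init__(self, keys: list[tuple[str, str]]) -> None:
        self._keys = keys              # (plate_path, position_name) tuples
        self._plates: dict[str, object] = {}

    def position(self, idx: int):
        plate_path, position_name = self._keys[idx]
        if plate_path not in self._plates:
            self._plates[plate_path] = open_ome_zarr(plate_path, mode="r")
        # Assumes path-style access ("B/1/000000") resolves to a Position;
        # swap for plate[row][col][fov] if the wrapper does not forward paths.
        return self._plates[plate_path][position_name]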
Verification gate
Before landing (A): pickle-size measurement on an mmap_preload=True datamodule.
import pickle, time
dm = HCSDataModule(data_path=..., mmap_preload=True, scratch_dir="/dev/shm/...", ...)
dm.prepare_data(); dm.setup("fit")
t0 = time.perf_counter(); blob = pickle.dumps(dm.train_dataset); t1 = time.perf_counter()
print(f"pickle size = {len(blob)/1e6:.2f} MB, time = {t1-t0:.2f}s")
- KB-MB → spawn is safe; tensors pickled as path references.
- GB-scale → spawn defeats mmap_preload; fall back to (B).
Suggested plan
- Run the pickle measurement on one mmap_preload-enabled fit datamodule. Dataset graphs that hold MemoryMappedTensor views need to pickle as path references (KB-MB), not by-value (GB).
- If pickle is clean: ship (A) with persistent_workers=True for the predict and mmap_preload=False fit/validate code paths; add a regression test that builds a small in-memory .ozx fixture and exercises num_workers>0.
- If pickle bloats: ship (B) — fork is preserved, mmap_preload keeps its current cost profile, and predict / non-preloaded fit gain a per-worker fd reset.
- Either way, file a follow-up issue for (C) — it's the right architecture independent of fork/spawn and would also let us drop the mmap_preload-as-OZX-workaround framing.
Priority: predict is the immediate blocker (no preload escape hatch). Production training with mmap_preload=True is not on fire today, but the bug still exists for any path that doesn't preload (smoke leaves, dev iteration, standalone validate), so the fix should cover both.
Out of scope
- iohub itself (the underlying OzxStore already supports lazy open + pickle correctly via inheritance from zarr's ZipStore).
- Lightning DDP strategy='ddp_spawn' — different process tree.
- Disk-corruption diagnostics — testzip() already cleared the file.