From @MichaelSNelson:
PyTorch's DataLoader supports worker processes via num_workers > 0 for prefetching/augmentation. Inside an Appose worker, any value above 0 hangs: the training loop (for batch in DataLoader:) never returns a batch. The Python worker (Appose's child) stays alive and responsive on the Appose protocol; it's the DataLoader's own child processes (PyTorch's workers, grandchildren of the JVM) that deadlock before the first batch comes back. This has nothing to do with Appose's own main-thread queue. There is no exception and no log, just silence after training setup completes. Dropping back to num_workers=0 (data loading in the main process) works fine but leaves the GPU ~30–70% idle between batches during training.
We have a clean repro on current releases: our extension added a user preference exposing num_workers (defaults to 0). When set to 2, the bootstrap log confirms the patch reaches Python — DataLoader bootstrap: forcing num_workers=2, persistent_workers=True — and then training sets up through "Set 65 encoder modules to eval mode" and hangs before batch 1. Zero progress after 3+ minutes.
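Since the hang produces no exception and no log, one way to get some evidence is a watchdog that dumps every thread's stack if the process is still running after a timeout. The following is a minimal sketch using only the standard library's faulthandler module; the helper names (arm_hang_dump, disarm_hang_dump) and the log path are hypothetical, and it only reports on the process it is installed in (the Appose Python worker), not on the DataLoader's child processes. It writes to a file so it does not depend on how the worker's stderr is handled.

```python
import faulthandler


def arm_hang_dump(log_path, timeout_s=120.0):
    """Dump all thread stacks to log_path every timeout_s seconds until disarmed."""
    # The file object must stay open for as long as the watchdog is armed,
    # because faulthandler writes to its file descriptor directly.
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


def disarm_hang_dump(log_file):
    """Cancel the watchdog and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Arming this just before the training loop and disarming it after the first batch arrives would show where the main thread is blocked if the batch never comes.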
My mental model of why: PyTorch spawns the workers via torch.multiprocessing (spawn on Windows, fork-or-spawn elsewhere). The children inherit or re-open stdin/stdout, which Appose also uses for its JSON protocol. I can imagine a few failure modes:
- Stdio inheritance deadlock — children holding a ref to the parent's stdin pipe, or spawn's re-import triggering input() in the worker init.
- Some PYTHONEXECUTABLE / PATH issue (I noticed the recent commits in appose-python/main stripping conda activation vars and PYTHONEXECUTABLE, which sounded related).
- Something Windows-specific in pixi-managed Python that confuses multiprocessing's default spawn logic.
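To narrow down which of these hypotheses applies, it may help to record what multiprocessing sees inside the Appose worker and compare it with the same report from a plain terminal session. This is a rough diagnostic sketch using only the standard library; the function names are hypothetical, and the set of fields is my guess at what is relevant, not a list confirmed by Appose or PyTorch.

```python
import multiprocessing
import multiprocessing.spawn
import os
import sys


def describe_stream(stream):
    """Summarize a standard stream without assuming it has a real file descriptor."""
    if stream is None:
        return "None"
    try:
        fd = stream.fileno()
        return f"fd={fd} isatty={os.isatty(fd)}"
    except (AttributeError, OSError, ValueError):
        return f"{type(stream).__name__} (no usable file descriptor)"


def multiprocessing_report():
    """Collect the settings that influence how child processes get started."""
    return {
        "platform": sys.platform,
        "sys.executable": sys.executable,
        # Interpreter that the spawn start method launches for children.
        "spawn executable": multiprocessing.spawn.get_executable(),
        # The first supported start method is the platform default.
        "default start method": multiprocessing.get_all_start_methods()[0],
        # None means nothing has fixed the start method yet.
        "configured start method": multiprocessing.get_start_method(allow_none=True),
        "stdin": describe_stream(sys.stdin),
        "stdout": describe_stream(sys.stdout),
        "PYTHONEXECUTABLE": os.environ.get("PYTHONEXECUTABLE"),
    }


def print_report():
    # Print to stderr, because stdout carries Appose's JSON protocol.
    for key, value in multiprocessing_report().items():
        print(f"{key}: {value}", file=sys.stderr)
```

A difference between the two environments in the spawn executable or the stdin/stdout descriptions would point toward one of the hypotheses above.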
What I'd love to know:
- Has anyone successfully run torch.multiprocessing-based workers inside an Appose worker? Is this expected to work, or structurally a dead end given Appose's stdio-based IPC?
- If it's not expected to work, is there a recommended pattern for CPU-bound per-batch preprocessing inside Appose — e.g., threaded prefetch, or a second Appose worker acting as a data producer over shared memory?
- Would you be open to me filing a reduced GitHub repro if it's something that could be addressed upstream? I can strip this down to a minimal Appose-worker-calls-DataLoader-with-num_workers=2 if it's useful.
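For reference, the threaded prefetch option I have in mind looks roughly like the sketch below. It uses a background thread and a bounded queue instead of worker processes, so no child processes are spawned. The name threaded_prefetch and the depth parameter are hypothetical, and this is not an Appose or PyTorch API. Because of the GIL, I would only expect it to help where the per-batch work releases the GIL (file I/O, many NumPy and image-decoding operations).

```python
import queue
import threading


def threaded_prefetch(batches, depth=4):
    """Yield items from `batches`, producing up to `depth` of them ahead in a thread."""
    buf = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def put(entry):
        # Retry with a timeout so the producer notices when the consumer has gone away.
        while not stop.is_set():
            try:
                buf.put(entry, timeout=0.1)
                return True
            except queue.Full:
                continue
        return False

    def produce():
        try:
            for item in batches:
                if not put(("item", item)):
                    return
            put(("done", None))
        except Exception as exc:
            # Hand the exception to the consumer instead of losing it in the thread.
            put(("error", exc))

    threading.Thread(target=produce, daemon=True, name="prefetch").start()
    try:
        while True:
            kind, payload = buf.get()
            if kind == "done":
                return
            if kind == "error":
                raise payload
            yield payload
    finally:
        # Runs on exhaustion, on error, and when the generator is closed early.
        stop.set()
```

Usage would be for batch in threaded_prefetch(loader): with a DataLoader left at num_workers=0.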
Happy to dig in wherever's helpful. In the meantime we're pursuing an in-memory dataset cache to sidestep the I/O bottleneck without multiprocessing.
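The cache idea is roughly the following sketch: a map-style dataset wrapper (anything with __len__ and __getitem__ works as a map-style dataset for DataLoader) that loads each sample once and then serves it from memory. CachedDataset and load_sample are hypothetical names, not our actual code, and this assumes the whole dataset fits in RAM. Random augmentation would have to run after the cache, otherwise the first augmented result gets frozen.

```python
class CachedDataset:
    """Map-style dataset wrapper that keeps every loaded sample in memory."""

    def __init__(self, length, load_sample):
        self._length = length
        self._load_sample = load_sample  # callable: index -> sample
        self._cache = {}

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not 0 <= index < self._length:
            raise IndexError(index)
        if index not in self._cache:
            # First access pays the I/O cost; later epochs hit memory only.
            self._cache[index] = self._load_sample(index)
        return self._cache[index]
```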
I put a minimal repro together so you can poke at it: https://github.com/MichaelSNelson/appose-dataloader-repro
From @MichaelSNelson: