Troubleshoot PyTorch with num_workers > 0 #31

@ctrueden

From @MichaelSNelson:

PyTorch's DataLoader supports worker processes via num_workers > 0 for prefetching/augmentation. Inside an Appose worker, any value > 0 hangs: the training loop's for batch in DataLoader: never returns a batch. The Python worker (Appose's child) stays alive and responsive on the Appose protocol; it's the DataLoader's own child processes (spawned by PyTorch, so grandchildren of the JVM) that deadlock before the first batch comes back. This has nothing to do with Appose's own main-thread queue. No exception, no log, just silence after training setup completes. Dropping back to num_workers=0 (data loading in the main process) works fine but leaves the GPU ~30–70% idle between batches during training.

We have a clean repro on current releases: our extension added a user preference exposing num_workers (defaults to 0). When set to 2, the bootstrap log confirms the patch reaches Python — DataLoader bootstrap: forcing num_workers=2, persistent_workers=True — and then training sets up through "Set 65 encoder modules to eval mode" and hangs before batch 1. Zero progress after 3+ minutes.
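Since the hang produces no exception and no log, one way to see where the Python worker's threads are stuck is the standard-library faulthandler module. This is only a sketch, not part of the repro: the helper names arm_hang_watchdog and disarm_hang_watchdog and the log path are hypothetical, and it only covers the process it runs in. The DataLoader children would need the same call (for example from a worker_init_fn, assuming they get that far) to report their own stacks.

```python
import faulthandler


def arm_hang_watchdog(log_path, timeout_s=60):
    """Dump every thread's traceback to log_path each timeout_s seconds.

    Hypothetical helper: writes to a file rather than stdout so that the
    dump stays out of the worker's stdio.
    """
    log_file = open(log_path, "w")
    # repeat=True keeps dumping until cancelled, so a long hang leaves
    # several snapshots that can be compared.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file  # keep this open for as long as the watchdog is armed


def disarm_hang_watchdog(log_file):
    """Cancel the pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Intended use around the loop that hangs (loader is the existing DataLoader):
#
#     log_file = arm_hang_watchdog("dataloader_hang.log", timeout_s=60)
#     try:
#         for batch in loader:
#             ...
#     finally:
#         disarm_hang_watchdog(log_file)
```

If the log shows the main thread waiting inside the DataLoader's queue get, that would at least confirm the parent is blocked waiting on its children rather than on anything Appose-side.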

My mental model of why: PyTorch spawns the workers via torch.multiprocessing (spawn on Windows, fork-or-spawn elsewhere). The children inherit or re-open stdin/stdout, which Appose also uses for its JSON protocol. I can imagine a few failure modes:

  1. Stdio inheritance deadlock — children holding a ref to the parent's stdin pipe, or spawn's re-import triggering input() in the worker init.
  2. Some PYTHONEXECUTABLE / PATH issue (I noticed the recent commits in appose-python/main stripping conda activation vars and PYTHONEXECUTABLE, which sounded related).
  3. Something Windows-specific in pixi-managed Python that confuses multiprocessing's default spawn logic.
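To help tell these hypotheses apart, it may be useful to record what multiprocessing sees from inside the Appose worker just before the DataLoader is created. The sketch below is an assumption about what is worth checking, not a confirmed diagnosis; describe_multiprocessing_env is a hypothetical helper name. It returns a dict instead of printing, since stdout carries Appose's JSON protocol.

```python
import multiprocessing
import multiprocessing.spawn
import os
import sys


def describe_multiprocessing_env():
    """Collect the settings that decide how child processes get launched."""
    try:
        stdin_fd = sys.stdin.fileno() if sys.stdin is not None else None
    except (AttributeError, OSError, ValueError):
        stdin_fd = None  # stdin is closed or replaced by a non-file object
    return {
        # allow_none=True avoids fixing the start method as a side effect;
        # None means no start method has been selected yet.
        "start_method": multiprocessing.get_start_method(allow_none=True),
        "sys_executable": sys.executable,
        # The interpreter that spawn-based children would be launched with.
        "spawn_executable": multiprocessing.spawn.get_executable(),
        "stdin_fd": stdin_fd,
        "PYTHONEXECUTABLE": os.environ.get("PYTHONEXECUTABLE"),
        "CONDA_PREFIX": os.environ.get("CONDA_PREFIX"),
    }
```

A mismatch between sys_executable and spawn_executable, or an unexpected PYTHONEXECUTABLE value, would point toward the second hypothesis; a live stdin_fd is what the first one would be concerned with.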

What I'd love to know:

  1. Has anyone successfully run torch.multiprocessing-based workers inside an Appose worker? Is this expected to work, or structurally a dead end given Appose's stdio-based IPC?

  2. If it's not expected to work, is there a recommended pattern for CPU-bound per-batch preprocessing inside Appose — e.g., threaded prefetch, or a second Appose worker acting as a data producer over shared memory?

  3. Would you be open to me filing a reduced GitHub repro if it's something that could be addressed upstream? I can strip this down to a minimal Appose-worker-calls-DataLoader-with-num_workers=2 if it's useful.
    Happy to dig in wherever's helpful. In the meantime we're pursuing an in-memory dataset cache to sidestep the I/O bottleneck without multiprocessing.
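For reference, the threaded-prefetch option from question 2 could look roughly like the following. This is a minimal sketch using only the standard library, not something Appose or PyTorch provides: threaded_prefetch is a hypothetical name, and it assumes the per-batch work releases the GIL (file I/O, NumPy, image decoding), otherwise a thread will not overlap much with the training step.

```python
import queue
import threading


def threaded_prefetch(batches, depth=2):
    """Yield items from `batches`, producing them in a background thread.

    Up to `depth` items are prepared ahead of the consumer.
    """
    q = queue.Queue(maxsize=depth)
    stop = threading.Event()

    def put(entry):
        # Retry with a timeout so the producer notices when the consumer
        # has stopped iterating and can exit instead of blocking forever.
        while not stop.is_set():
            try:
                q.put(entry, timeout=0.1)
                return True
            except queue.Full:
                continue
        return False

    def producer():
        try:
            for item in batches:
                if not put(("item", item)):
                    return
            put(("done", None))
        except Exception as exc:
            # Hand the error to the consumer so it is re-raised there.
            put(("error", exc))

    threading.Thread(target=producer, daemon=True).start()
    try:
        while True:
            kind, value = q.get()
            if kind == "done":
                return
            if kind == "error":
                raise value
            yield value
    finally:
        stop.set()  # runs on normal exit, on error, and on early break
```

With a DataLoader left at num_workers=0, the training loop would become for batch in threaded_prefetch(loader): and no extra processes are created, so the stdio question does not arise.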

I put a minimal repro together so you can poke at it: https://github.com/MichaelSNelson/appose-dataloader-repro

Labels: bug