Skip to content

Deadlock (probably from the memory management subsystem) #111

@ld-cd

Description

@ld-cd

When run on wheatley with 32 threads we get a deadlock with 450G memory used by various processes during the pixcal step:

Ctrl+C gives garbled lines that look like this:

KeyboardInterrupt
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/lib/python3.12/multiprocessing/queues.py", line 386, in get
    with self._rlock:
  File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/usr/lib/python3.12/multiprocessing/queues.py", line 386, in get
    with self._rlock:
  File "/usr/lib/python3.12/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

Last lines of log before and after I kill it:

2024-08-24 18:00:56 mkidpipeline.steps.pixcal   :INFO     (apply:432[pid=3835464]) Mask applied in 0.046s
2024-08-24 18:01:11 mkidpipeline.steps.pixcal   :INFO     (threshold:177[pid=3835459]) Masked 882 hot pixels, 767 cold pixels, 0 dead pixels
2024-08-24 18:01:11 mkidpipeline.utils.memory   :DEBUG    (release_ram:81[pid=3835459]) ./out/1602047869.h5 (tid=127117875261568) released 10.2 GB, total reserved 74.3 GB
2024-08-24 18:01:11 mkidpipeline.steps.pixcal   :INFO     (apply:417[pid=3835459]) Applying pixel mask to dither0_3/36 (MKIDObservation): 25.00159001350403s @ 1602047869.057722
2024-08-24 18:01:11 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835459]) Adding new header key: pixcal.method
2024-08-24 18:01:11 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835459]) Adding new header key: pixcal.step
2024-08-24 18:01:11 mkidpipeline.steps.pixcal   :INFO     (apply:432[pid=3835459]) Mask applied in 0.060s
2024-08-24 18:01:11 mkidpipeline.steps.pixcal   :INFO     (threshold:177[pid=3835450]) Masked 780 hot pixels, 810 cold pixels, 0 dead pixels
2024-08-24 18:01:11 mkidpipeline.utils.memory   :DEBUG    (release_ram:81[pid=3835450]) ./out/1602047895.h5 (tid=127117875261568) released 10.0 GB, total reserved 64.3 GB
2024-08-24 18:01:11 mkidpipeline.steps.pixcal   :INFO     (apply:417[pid=3835450]) Applying pixel mask to dither0_4/36 (MKIDObservation): 25.000959634780884s @ 1602047895.0989745
2024-08-24 18:01:12 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835450]) Adding new header key: pixcal.method
2024-08-24 18:01:12 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835450]) Adding new header key: pixcal.step
2024-08-24 18:01:12 mkidpipeline.steps.pixcal   :INFO     (apply:432[pid=3835450]) Mask applied in 0.059s
2024-08-24 18:01:22 mkidpipeline.steps.pixcal   :INFO     (threshold:177[pid=3835465]) Masked 989 hot pixels, 814 cold pixels, 0 dead pixels
2024-08-24 18:01:22 mkidpipeline.utils.memory   :DEBUG    (release_ram:81[pid=3835465]) ./out/1602047843.h5 (tid=127117875261568) released 10.5 GB, total reserved 53.8 GB
2024-08-24 18:01:22 mkidpipeline.steps.pixcal   :INFO     (apply:417[pid=3835465]) Applying pixel mask to dither0_2/36 (MKIDObservation): 25.00138235092163s @ 1602047843.0487542
2024-08-24 18:01:22 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835465]) Adding new header key: pixcal.method
2024-08-24 18:01:22 mkidpipeline.photontable    :INFO     (update_header:1071[pid=3835465]) Adding new header key: pixcal.step
2024-08-24 18:01:22 mkidpipeline.steps.pixcal   :INFO     (apply:432[pid=3835465]) Mask applied in 0.058s


^CProcess ForkPoolWorker-171:
Process ForkPoolWorker-166:
Process ForkPoolWorker-198:
Process ForkPoolWorker-178:
Process ForkPoolWorker-186:
Process ForkPoolWorker-197:
Process ForkPoolWorker-161:
Process ForkPoolWorker-194:
Process ForkPoolWorker-192:
Process ForkPoolWorker-175:
Process ForkPoolWorker-193:
Process ForkPoolWorker-185:
Process ForkPoolWorker-196:
Process ForkPoolWorker-174:
Process ForkPoolWorker-164:
Process ForkPoolWorker-173:
Process ForkPoolWorker-176:
Process ForkPoolWorker-195:
Process ForkPoolWorker-189:
Process ForkPoolWorker-190:
Process ForkPoolWorker-170:
Process ForkPoolWorker-172:
Process ForkPoolWorker-177:

Fortunately I was profiling when this happened so we should have some callgraph data
deadlock.log
oom.log

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions