Summary
With --spec-type draft-mtp enabled, if an HTTP client disconnects or times out mid-generation, the affected slot is never released. The server enters an indefinite loop emitting:
srv next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
srv next: ref: https://github.com/ggml-org/llama.cpp/pull/22907
srv stop: cancel task, id_task = N
(N increments: each new request is cancelled immediately.)
The GPU stays pinned at 100% on the stuck slot while throughput is zero. New requests queue behind the dead slot indefinitely. Recovery requires killing the process (it does not respond to a clean shutdown of the stuck slot).
Observed on a ROCm/HIP build (gfx1151), MXFP4 weights, -np 2 --spec-type draft-mtp --spec-draft-n-max 2, around the b8762 + MTP base (commit 52fb93a). A slot stayed wedged ~14h until manually restarted.
Root cause analysis
The HTTP-side cancellation path looks correct:
- server_response_reader::next() (tools/server/server-queue.cpp) returns nullptr when should_stop() is true (client gone), logging the should_stop/PR-22907 lines.
- server_response_reader::stop() then posts SERVER_TASK_TYPE_CANCEL tasks to the front of the queue.
- The CANCEL handler in process_single_task() (tools/server/server-context.cpp) calls slot.release() for the matching slot.
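As a rough illustration of the path described above, here is a minimal toy model of a queue where cancels are posted to the front and a handler releases the matching slot. Every type and function name in it (Task, Slot, TaskQueue, post, drain) is hypothetical and chosen for this sketch; it is not the llama.cpp implementation.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-ins for illustration; not the real llama.cpp types.
enum class TaskType { Completion, Cancel };

struct Task {
    int      id;         // id of this task
    TaskType type;
    int      id_target;  // for Cancel: id of the task to cancel, otherwise -1
};

struct Slot {
    int  id_task = -1;   // task currently running in this slot, -1 when idle
    bool busy() const { return id_task != -1; }
    void release()    { id_task = -1; }
};

struct TaskQueue {
    std::deque<Task> tasks;

    // Cancels can be posted to the front so they are seen before ordinary work.
    void post(const Task & t, bool front = false) {
        assert(t.type != TaskType::Cancel || t.id_target >= 0);
        if (front) {
            tasks.push_front(t);
        } else {
            tasks.push_back(t);
        }
    }

    // Drain everything queued right now. Nothing here runs unless the
    // caller actually gets back to this function.
    void drain(std::vector<Slot> & slots) {
        while (!tasks.empty()) {
            Task t = tasks.front();
            tasks.pop_front();
            if (t.type == TaskType::Cancel) {
                for (Slot & s : slots) {
                    if (s.id_task == t.id_target) {
                        s.release();
                    }
                }
            }
            // Completion tasks would be assigned to a free slot here (omitted).
        }
    }
};
```

The point of the model is that posting the cancel is not enough on its own: the slot is only released once drain() is called again.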
The problem is ordering relative to the speculative decode loop: the main server loop only drains the task queue (where CANCEL lands) between decode steps. When a slot is mid-generation inside the MTP speculative path (common_speculative draft generation in update_slots()), control does not appear to return to the task-queue drain, so the posted CANCEL is not processed and slot.release() is never reached. The client keeps retrying or timing out, which posts more CANCEL tasks; that is the visible infinite "cancel task" spin, while the slot remains held.
In other words: the cancel is enqueued but not consumed because the MTP draft loop does not yield to / check for pending cancellation between draft steps. Non-speculative generation does not exhibit this (it returns to the queue drain frequently enough that the cancel lands).
Suggested fix direction
Have the speculative/MTP generation path check for a pending cancellation (or a should_stop/slot-release signal) between draft steps, so a CANCEL posted while the slot is drafting is honored promptly and the slot is released. Equivalently, ensure update_slots() cannot stay in the speculative loop across an unbounded number of draft iterations without draining high-priority CANCEL tasks.
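A minimal sketch of that idea, under stated assumptions: the slot carries a flag that is set when a CANCEL is posted, and the draft loop checks it before each step. SlotModel, cancel_requested and run_draft_loop are hypothetical names for this illustration, not existing llama.cpp symbols, and the loop is bounded here only so the buggy variant terminates.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of one slot; none of these names come from llama.cpp.
struct SlotModel {
    std::atomic<bool> cancel_requested{false}; // set when a CANCEL is posted
    bool released  = false;
    int  n_drafted = 0;

    void release() { released = true; }
};

// Runs up to n_steps draft steps. on_step stands in for one draft
// evaluation and lets a caller simulate a client disconnect.
// With check_cancel == false the loop mirrors the reported behaviour:
// it never looks at the flag, so the cancel is ignored while drafting.
// Returns true if the loop ran to completion, false if it stopped early.
template <typename F>
bool run_draft_loop(SlotModel & slot, int n_steps, bool check_cancel, F on_step) {
    assert(n_steps >= 0);
    for (int i = 0; i < n_steps; ++i) {
        if (check_cancel && slot.cancel_requested.load()) {
            slot.release(); // honor the cancel between draft steps
            return false;
        }
        on_step(i);
        ++slot.n_drafted;
    }
    return true;
}
```

With the check enabled, a cancel raised during step i is honored before step i + 1 starts, so the cost of a late cancel is bounded by one draft step rather than by the length of the generation.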
Reproduction sketch
- Build with MTP draft support; run llama-server ... --spec-type draft-mtp --spec-draft-n-max 2.
- Start a long generation request.
- Disconnect the client mid-generation (close the socket before completion).
- Observe: the slot is not released; should_stop/cancel task warnings repeat indefinitely; the slot stays busy and the GPU stays pinned. A subsequent request queues behind it forever.
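The disconnect step can be scripted with a small POSIX-socket client along the following lines. This is a sketch under assumptions, not the exact client used for the report: it assumes the server's /completion endpoint with a streaming JSON body, and the helper names build_request and send_and_abandon are made up for this example. The address and port are whatever llama-server was started with.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <string>

// Builds a streaming completion request. The prompt is inserted as-is,
// so it must not contain characters that need JSON escaping.
std::string build_request(const std::string & host, const std::string & prompt, int n_predict) {
    assert(n_predict > 0);
    const std::string body =
        "{\"prompt\":\"" + prompt + "\",\"n_predict\":" + std::to_string(n_predict) +
        ",\"stream\":true}";
    return "POST /completion HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Content-Type: application/json\r\n"
           "Content-Length: " + std::to_string(body.size()) + "\r\n"
           "Connection: close\r\n\r\n" + body;
}

// Connects, sends the request, reads until at least min_bytes have arrived,
// then closes the socket without waiting for the end of the stream.
// Returns the number of bytes read, or -1 on a setup/send error.
long send_and_abandon(const char * ip, int port, const std::string & request, long min_bytes) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(static_cast<uint16_t>(port));
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }

    size_t sent = 0;
    while (sent < request.size()) {
        ssize_t n = send(fd, request.data() + sent, request.size() - sent, 0);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        sent += static_cast<size_t>(n);
    }

    long total = 0;
    char buf[4096];
    while (total < min_bytes) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) {
            break; // server closed the connection or an error occurred
        }
        total += n;
    }

    close(fd); // the mid-generation disconnect
    return total;
}
```

For example, calling send_and_abandon("127.0.0.1", 8080, build_request("127.0.0.1:8080", "Write a very long story", 4096), 512) from main() would start a long generation and drop the connection after the first few hundred bytes of the stream, assuming the server listens on that address.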
Happy to provide full journal excerpts or test against a patch.