server: client disconnect mid-generation does not free slot when speculative MTP decode is active → infinite cancel-task spin #20

@hschmied

Summary

With --spec-type draft-mtp enabled, if an HTTP client disconnects or times out mid-generation, the affected slot is never released. The server enters an indefinite loop emitting:

srv  next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
srv  next: ref: https://github.com/ggml-org/llama.cpp/pull/22907
srv  stop: cancel task, id_task = N

Here N increments with each line: every new request is cancelled immediately.

The GPU stays pinned at 100% on the stuck slot while throughput is zero. New requests queue behind the dead slot indefinitely. Recovery requires killing the process, since the stuck slot cannot be shut down cleanly.

Observed on a ROCm/HIP build (gfx1151), MXFP4 weights, -np 2 --spec-type draft-mtp --spec-draft-n-max 2, around the b8762 + MTP base (commit 52fb93a). A slot stayed wedged ~14h until manually restarted.

Root cause analysis

The HTTP-side cancellation path looks correct:

  • server_response_reader::next() (tools/server/server-queue.cpp) returns nullptr when should_stop() is true (client gone), logging the should_stop/PR-22907 lines.
  • server_response_reader::stop() then posts SERVER_TASK_TYPE_CANCEL tasks to the front of the queue.
  • The CANCEL handler in process_single_task() (tools/server/server-context.cpp) calls slot.release() for the matching slot.
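
The three steps above can be pictured with a small single-threaded model. This is only a sketch: toy_task, toy_queue, toy_slot and toy_process_single_task are hypothetical stand-ins, not the real llama.cpp types. It only illustrates the ordering described, where a cancel is posted to the front of the queue and the handler releases the matching slot.

```cpp
#include <deque>
#include <vector>

// Hypothetical stand-ins for illustration; not the real llama.cpp types.
enum class toy_task_type { completion, cancel };

struct toy_task {
    toy_task_type type;
    int           id;        // id of this task
    int           id_target; // for cancel tasks: id of the task to cancel, else -1
};

struct toy_slot {
    int  id_task       = -1;
    bool is_processing = false;

    void release() {
        is_processing = false;
        id_task       = -1;
    }
};

struct toy_queue {
    std::deque<toy_task> tasks;

    // front = true models the high-priority path used for cancel tasks
    void post(const toy_task & t, bool front = false) {
        if (front) {
            tasks.push_front(t);
        } else {
            tasks.push_back(t);
        }
    }
};

// Models only the cancel branch of a process_single_task()-style handler:
// pop one task and, if it is a cancel, release the slot running the target.
// Returns true if a slot was released.
bool toy_process_single_task(toy_queue & q, std::vector<toy_slot> & slots) {
    if (q.tasks.empty()) {
        return false;
    }
    const toy_task t = q.tasks.front();
    q.tasks.pop_front();
    if (t.type != toy_task_type::cancel) {
        return false; // other task types are out of scope for this sketch
    }
    for (auto & s : slots) {
        if (s.is_processing && s.id_task == t.id_target) {
            s.release();
            return true;
        }
    }
    return false;
}
```

In this model the slot is released only when something actually calls toy_process_single_task(); posting the cancel alone changes nothing.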

The problem is the ordering relative to the speculative decode loop: the main server loop drains the task queue (where CANCEL lands) only between decode steps. When a slot is mid-generation inside the MTP speculative path (common_speculative draft generation in update_slots()), control does not return to the task-queue drain promptly, so the posted CANCEL is never processed and slot.release() is never reached. Meanwhile the client keeps retrying or timing out, each attempt posts another CANCEL task, and the result is the visible infinite "cancel task" spin while the slot remains held.

In other words, the cancel is enqueued but never consumed, because the MTP draft loop neither yields to the queue drain nor checks for pending cancellation between draft steps. Non-speculative generation does not exhibit this: it returns to the queue drain often enough for the cancel to land.
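
The suspected stall can be illustrated with a toy simulation. It is a sketch under the assumptions of the analysis above, not a trace of the real server: stall_result, simulate_generation and the drain_every parameter are hypothetical names. A drain_every of 0 stands for a generation loop that never returns to the queue drain, and any other value stands for a loop that drains every that many steps.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical model of the reported behaviour; names are illustrative only.
struct stall_result {
    bool        slot_still_busy; // true if the slot was never released
    std::size_t pending_cancels; // cancels posted but not consumed
    std::size_t steps_run;       // generation steps executed
};

// Simulate up to n_steps generation steps for one busy slot.
// From step disconnect_at onward, one cancel is posted per step (modelling
// a client that disconnected and keeps retrying or timing out).
// drain_every == 0 means the loop never drains the queue mid-generation.
stall_result simulate_generation(std::size_t n_steps,
                                 std::size_t disconnect_at,
                                 std::size_t drain_every) {
    std::deque<int> cancel_queue;
    bool            slot_busy = true;
    std::size_t     step      = 0;

    for (; step < n_steps && slot_busy; ++step) {
        if (step >= disconnect_at) {
            cancel_queue.push_front(static_cast<int>(step)); // cancels go to the front
        }

        // ... one draft/decode step would run here ...

        const bool drain_now = drain_every != 0 && (step + 1) % drain_every == 0;
        if (drain_now && !cancel_queue.empty()) {
            cancel_queue.clear(); // cancel consumed
            slot_busy = false;    // slot released
        }
    }
    return {slot_busy, cancel_queue.size(), step};
}
```

With simulate_generation(100, 10, 0) the slot is still busy after all 100 steps and 90 cancels are pending. With simulate_generation(100, 10, 1) the slot is released on the first drain after the disconnect.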

Suggested fix direction

Have the speculative/MTP generation path check for a pending cancellation (or a should_stop/slot-release signal) between draft steps, so a CANCEL posted while the slot is drafting is honored promptly and the slot is released. Equivalently, ensure update_slots() cannot stay in the speculative loop across an unbounded number of draft iterations without draining high-priority CANCEL tasks.
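
A minimal sketch of that direction, assuming a per-slot cancellation flag that the HTTP side can set. draft_slot, cancel_requested and draft_with_cancel_check are hypothetical names for illustration. The real fix might instead drain CANCEL tasks from the queue, and where the check belongs in update_slots() is not established here.

```cpp
#include <atomic>
#include <functional>

// Hypothetical slot state for illustration; not llama.cpp's actual slot type.
struct draft_slot {
    std::atomic<bool> cancel_requested{false}; // set by the HTTP side on disconnect
    bool              is_processing = true;

    void release() { is_processing = false; }
};

// Run up to n_max draft steps, checking for cancellation before each one.
// draft_step is a caller-supplied stand-in for a single draft iteration.
// Returns the number of draft steps actually executed.
int draft_with_cancel_check(draft_slot & slot, int n_max,
                            const std::function<void(int)> & draft_step) {
    int n_done = 0;
    for (int i = 0; i < n_max; ++i) {
        if (slot.cancel_requested.load(std::memory_order_acquire)) {
            slot.release(); // honor the cancel promptly instead of drafting on
            break;
        }
        draft_step(i);
        ++n_done;
    }
    return n_done;
}
```

The check costs one atomic load per draft step, so it should be cheap next to a decode. If the cancel arrives during step i, the loop stops before step i + 1 and the slot is released.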

Reproduction sketch

  1. Build with MTP draft support; run llama-server ... --spec-type draft-mtp --spec-draft-n-max 2.
  2. Start a long generation request.
  3. Disconnect the client mid-generation (close the socket before completion).
  4. Observe: the slot is not released; should_stop/cancel task warnings repeat indefinitely; the slot stays busy and GPU pinned. A subsequent request queues behind it forever.
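
Steps 2 and 3 could be scripted with a small POSIX-socket client along these lines. It is a sketch, not the client used for the original report: build_completion_request and send_then_drop are hypothetical helpers. It assumes the server exposes POST /completion accepting "prompt", "n_predict" and "stream", and that it listens on an IPv4 address and port you supply.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>
#include <string>

// Build a raw HTTP/1.1 POST for a long streaming completion.
// Assumption: POST /completion with "prompt", "n_predict" and "stream" fields.
std::string build_completion_request(const std::string & host, int n_predict) {
    const std::string body =
        "{\"prompt\":\"Write a very long story.\",\"n_predict\":" +
        std::to_string(n_predict) + ",\"stream\":true}";
    return "POST /completion HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Content-Type: application/json\r\n"
           "Content-Length: " + std::to_string(body.size()) + "\r\n"
           "Connection: close\r\n\r\n" + body;
}

// Connect, send the request, read until at least min_bytes have arrived
// (so generation has started), then close the socket mid-stream.
// Returns the number of bytes read, or -1 on any setup/send failure.
long send_then_drop(const char * ip, std::uint16_t port,
                    const std::string & request, long min_bytes) {
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    std::size_t sent = 0;
    while (sent < request.size()) {
        const ssize_t n = send(fd, request.data() + sent, request.size() - sent, 0);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        sent += static_cast<std::size_t>(n);
    }
    char buf[4096];
    long total = 0;
    while (total < min_bytes) {
        const ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) {
            break; // server closed or error; stop reading
        }
        total += static_cast<long>(n);
    }
    close(fd); // step 3: disconnect before the completion finishes
    return total;
}
```

For example, send_then_drop("127.0.0.1", 8080, build_completion_request("127.0.0.1:8080", 4096), 256) would drop the connection after the first few streamed bytes. The address and port here are placeholders for wherever llama-server is listening.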

Happy to provide full journal excerpts or test against a patch.
