server: client disconnect mid-generation does not free slot when speculative MTP decode is active → infinite cancel-task spin

## Summary

With `--spec-type draft-mtp` enabled, if an HTTP client disconnects or times out **mid-generation**, the affected slot is never released. The server enters an indefinite loop emitting:

```
srv  next: stopping wait for next result due to should_stop condition (adjust the --timeout argument if needed)
srv  next: ref: https://github.com/ggml-org/llama.cpp/pull/22907
srv  stop: cancel task, id_task = N   (N increments; each new request is immediately cancelled)
```

The GPU stays pinned at 100% on the stuck slot while throughput is zero. New requests queue behind the dead slot indefinitely. Recovery requires killing the process: the stuck slot cannot be shut down cleanly.

Observed on a ROCm/HIP build (gfx1151) with MXFP4 weights and `-np 2 --spec-type draft-mtp --spec-draft-n-max 2`, around the b8762 + MTP base (commit 52fb93a2b). One slot stayed wedged for ~14h until the server was manually restarted.

## Root cause analysis

The HTTP-side cancellation path looks correct:

- `server_response_reader::next()` (tools/server/server-queue.cpp) returns `nullptr` when `should_stop()` is true (client gone), logging the `should_stop`/PR-22907 lines.
- `server_response_reader::stop()` then posts `SERVER_TASK_TYPE_CANCEL` tasks to the front of the queue.
- The CANCEL handler in `process_single_task()` (tools/server/server-context.cpp) calls `slot.release()` for the matching slot.

The problem appears to be **ordering relative to the speculative decode loop**: the main server loop only drains the task queue (where CANCEL lands) between decode steps. When a slot is mid-generation inside the MTP speculative path (`common_speculative` draft generation in `update_slots()`), control does not return to the task-queue drain promptly, so the posted CANCEL is never processed and `slot.release()` is never reached. The client keeps retrying/timing out → more CANCEL tasks are posted → the visible infinite "cancel task" spin, while the slot remains held.

In other words: the cancel is **enqueued but not consumed** because the MTP draft loop does not yield to / check for pending cancellation between draft steps. Non-speculative generation does not exhibit this (it returns to the queue drain frequently enough that the cancel lands).
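To make the "enqueued but not consumed" failure mode concrete, here is a toy model of it. This is only a sketch of the suspected ordering problem: `ToyServer`, `ToySlot`, `Task`, `drain()` and `generate()` are hypothetical names invented for illustration and are not llama.cpp types or functions.

```cpp
#include <deque>

// Toy model only: none of these types exist in llama.cpp.
enum class TaskType { Completion, Cancel };

struct Task {
    TaskType type;
    int      id_target; // task id the Cancel refers to
};

struct ToySlot {
    int  id_task = -1;                       // -1 means the slot is idle
    bool busy() const { return id_task != -1; }
    void release()    { id_task = -1; }
};

struct ToyServer {
    std::deque<Task> queue;
    ToySlot          slot;

    // Draining the queue is the only place a Cancel can release the slot.
    void drain() {
        while (!queue.empty()) {
            const Task t = queue.front();
            queue.pop_front();
            if (t.type == TaskType::Cancel && slot.id_task == t.id_target) {
                slot.release();
            }
        }
    }

    // Runs up to n_steps generation steps on the busy slot. At step
    // `cancel_at` the client is assumed to disconnect, which posts a Cancel
    // to the front of the queue. Returns the number of steps executed.
    int generate(int n_steps, int cancel_at, bool drain_between_steps) {
        int done = 0;
        for (int step = 0; step < n_steps && slot.busy(); ++step) {
            if (step == cancel_at) {
                queue.push_front({TaskType::Cancel, slot.id_task});
            }
            ++done; // stand-in for one draft/decode step
            if (drain_between_steps) {
                drain();
            }
        }
        return done;
    }
};
```

With `drain_between_steps = false` the loop runs all of its steps with the Cancel still sitting in the queue and the slot still busy, which is the behaviour described above. With `drain_between_steps = true` the slot is released right after the step in which the Cancel was posted.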

## Suggested fix direction

Have the speculative/MTP generation path check for a pending cancellation (or a `should_stop`/slot-release signal) **between draft steps**, so a CANCEL posted while the slot is drafting is honored promptly and the slot is released. Equivalently, ensure `update_slots()` cannot stay in the speculative loop across an unbounded number of draft iterations without draining high-priority CANCEL tasks.
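One possible shape for such a check, as a minimal sketch rather than a patch: the draft loop polls a cancellation flag before each draft step and bails out early. `draft_with_cancel_check`, `DraftResult` and the `cancel_requested` flag are hypothetical and do not correspond to anything in the current server code; how the flag would actually be set (for example from the CANCEL handler or from `should_stop()`) is left open.

```cpp
#include <atomic>
#include <functional>

// Hypothetical result type for illustration only.
struct DraftResult {
    int  n_drafted; // number of draft steps that actually ran
    bool cancelled; // true if the loop stopped because of a cancel request
};

// Runs up to n_max draft steps, checking `cancel_requested` before each one.
// `draft_step` stands in for whatever produces a single draft token.
DraftResult draft_with_cancel_check(int n_max,
                                    const std::atomic<bool> & cancel_requested,
                                    const std::function<void(int)> & draft_step) {
    DraftResult res{0, false};
    for (int i = 0; i < n_max; ++i) {
        if (cancel_requested.load(std::memory_order_acquire)) {
            res.cancelled = true; // caller is expected to release the slot
            break;
        }
        draft_step(i);
        ++res.n_drafted;
    }
    return res;
}
```

The caller would release the slot when `cancelled` is true instead of continuing to verify and accept draft tokens. An atomic flag is used here only because it is cheap to poll; draining the high-priority part of the task queue at the same point would serve the same purpose.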

## Reproduction sketch

1. Build with the MTP draft support; run `llama-server ... --spec-type draft-mtp --spec-draft-n-max 2`.
2. Start a long generation request.
3. Disconnect the client mid-generation (close the socket before completion).
4. Observe: the slot is not released; `should_stop`/`cancel task` warnings repeat indefinitely; the slot stays busy and GPU pinned. A subsequent request queues behind it forever.
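Steps 2 and 3 can be scripted with a small POSIX socket client that starts a streaming request and closes the socket after the first response bytes arrive. This is a sketch under assumptions: it targets the server's `POST /completion` endpoint with `prompt`, `n_predict` and `stream` fields, inserts the prompt into the JSON body without escaping, and expects an IPv4 address. `build_request` and `disconnect_mid_stream` are helper names made up for this example.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <string>

// Builds a streaming POST /completion request. The prompt is inserted
// verbatim, so it must not contain characters that need JSON escaping.
std::string build_request(const std::string & host, const std::string & prompt, int n_predict) {
    const std::string body = "{\"prompt\":\"" + prompt + "\",\"n_predict\":" +
                             std::to_string(n_predict) + ",\"stream\":true}";
    return "POST /completion HTTP/1.1\r\n"
           "Host: " + host + "\r\n"
           "Content-Type: application/json\r\n"
           "Content-Length: " + std::to_string(body.size()) + "\r\n"
           "\r\n" + body;
}

// Connects, sends the request, reads until at least min_bytes of the response
// have arrived (or the server closes), then closes the socket abruptly.
// Returns the number of bytes read, or -1 if connecting or sending failed.
long disconnect_mid_stream(const char * ipv4, uint16_t port,
                           const std::string & request, long min_bytes) {
    const int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1 ||
        connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    size_t sent = 0;
    while (sent < request.size()) {
        const ssize_t n = send(fd, request.data() + sent, request.size() - sent, 0);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        sent += static_cast<size_t>(n);
    }
    char buf[4096];
    long total = 0;
    while (total < min_bytes) {
        const ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n <= 0) {
            break;
        }
        total += n;
    }
    close(fd); // the mid-generation disconnect from step 3
    return total;
}
```

For example, `disconnect_mid_stream("127.0.0.1", 8080, build_request("127.0.0.1", "Write a long story", 4096), 256)` would drop the connection after roughly the first 256 response bytes, assuming the server listens on port 8080. The server log can then be checked for the repeating `cancel task` lines from step 4.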

Happy to provide full journal excerpts or test against a patch.
