Server does not abort prefill when client disconnects #333

@TipKnuckle

Problem

When a client disconnects mid-generation (timeout, user cancel, agent restart), the
server detects the broken stream but continues the full prefill to completion before
acting on it. On long-context sessions this means several minutes of wasted GPU work
before the server is available again.

What happens

The progress callback (server_progress_cb) sets stream_failed = true when a
keepalive write fails, but this flag is only checked after ds4_session_sync()
returns — there's no path to signal the prefill loop to stop early. The loop in
metal_graph_prefill_chunked_range only bails on a Metal GPU error, not on client
state.

It gets worse when the disconnected client reconnects and retries with a slightly
different prompt (e.g. one extra token from a tool result): the token mismatch
triggers another full prefill from zero immediately after the first one finishes.
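
As a rough illustration of the token mismatch, here is a minimal sketch of a shared-prefix count between the tokens already in the live cache and the tokens of the retried prompt. It assumes the `common=` value in the server log is this kind of prefix length; `sketch_common_prefix` is a hypothetical helper, not a function from ds4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of leading token ids that two sequences share.
 * `live` stands for the tokens already prefilled, `prompt` for the retry. */
size_t sketch_common_prefix(const int *live, size_t n_live,
                            const int *prompt, size_t n_prompt) {
    assert(live != NULL && prompt != NULL);
    size_t limit = n_live < n_prompt ? n_live : n_prompt;
    size_t i = 0;
    /* Stop at the first differing token or at the end of the shorter sequence. */
    while (i < limit && live[i] == prompt[i]) {
        i++;
    }
    return i;
}
```

If the result is smaller than the number of live tokens, the retry is not a pure extension of what was already prefilled, which matches the `reason=token-mismatch` case described above.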

Example log of issue

The client disconnected partway through prefill. The server still finished the full
67580-token prefill (~257 seconds), then started over on the retry:

0602 19:15:59 ds4-server: chat ctx=0..67580:67580 TOOLS prompt done 256.984s
0602 19:15:59 ds4-server: chat ctx=0..67580:67580 TOOLS stream closed during prefill
0602 19:15:59 ds4-server: live kv cache miss live=67580 prompt=67609 common=67527 reason=token-mismatch
0602 19:15:59 ds4-server: chat ctx=0..67609:67609 TOOLS prompt start
0602 19:15:59 ds4-server: chat ctx=0..67609:67609 TOOLS prefill chunk 0/67609 (0.0%) ...

Cause

ds4_session_progress_fn is typedef void (*)(void*, const char*, int, int) — the
callback returns void, so there's no way to signal abort back to the prefill loop.
stream_failed lives in the server layer and is invisible to ds4.c.

Possible fix

Two approaches:

Option A — cancel flag pointer (smaller diff)
Add a volatile bool *cancel_flag to the session (via ds4_session_set_cancel_flag
or similar). The server sets it when stream_failed is detected in the progress
callback. The chunked prefill loop checks it between 4096-token chunks and returns
early if set.
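
A minimal sketch of Option A, under stated assumptions: `sketch_session`, `sketch_session_set_cancel_flag`, `sketch_prefill_chunked` and `sketch_server` are hypothetical stand-ins, not the real ds4 types or functions, and the parameter names on the callback are guesses (only the types come from the existing typedef). The GPU work is elided; the loop only shows where the flag would be checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same shape as the existing void-returning progress callback. */
typedef void (*sketch_progress_fn)(void *ud, const char *phase, int done, int total);

/* Hypothetical session: only the field this sketch needs. */
typedef struct {
    volatile bool *cancel_flag; /* owned by the server; NULL means "never cancel" */
} sketch_session;

/* Hypothetical setter, named after the ds4_session_set_cancel_flag idea. */
void sketch_session_set_cancel_flag(sketch_session *s, volatile bool *flag) {
    s->cancel_flag = flag;
}

/* Hypothetical server state: the callback flips stream_failed after a
 * simulated keepalive write failure on the fail_after-th call. */
typedef struct {
    bool stream_failed;
    int fail_after;
    int calls;
} sketch_server;

void sketch_server_progress(void *ud, const char *phase, int done, int total) {
    sketch_server *srv = ud;
    (void)phase; (void)done; (void)total;
    if (++srv->calls >= srv->fail_after) {
        srv->stream_failed = true; /* stands in for a failed keepalive write */
    }
}

/* Returns the number of tokens prefilled; fewer than n_tokens means cancelled. */
int sketch_prefill_chunked(sketch_session *s, int n_tokens, int chunk,
                           sketch_progress_fn progress, void *ud) {
    assert(chunk > 0);
    int done = 0;
    while (done < n_tokens) {
        /* The new check: bail out between chunks if the server set the flag. */
        if (s->cancel_flag != NULL && *s->cancel_flag) {
            break;
        }
        int end = (n_tokens - done > chunk) ? done + chunk : n_tokens;
        /* ... the real per-chunk GPU work would run here ... */
        done = end;
        if (progress != NULL) {
            progress(ud, "prefill", done, n_tokens);
        }
    }
    return done;
}
```

In this sketch the server would pass `&srv.stream_failed` to the setter, so a failed keepalive write costs at most one more chunk of work. If the flag were ever written from a different thread than the prefill loop, a C11 `atomic_bool` would be a safer choice than `volatile bool`.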

Option B — progress callback returns bool
Change ds4_session_progress_fn to return bool (false = cancel). The server
callback returns false when stream_failed. Every call site in the prefill loop
checks the return value. Cleaner long-term but touches more call sites including
ds4_distributed.c.
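
A minimal sketch of Option B, again with hypothetical names (`sketch_progress_bool_fn`, `sketch_server_b`, `sketch_prefill_chunked_b`) and assumed parameter names; only the argument types and the "false = cancel" convention come from the proposal above. The real change would apply the same return-value check at every call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Proposed shape: same arguments as today's callback, but false means cancel. */
typedef bool (*sketch_progress_bool_fn)(void *ud, const char *phase, int done, int total);

/* Hypothetical server state, with a simulated keepalive failure. */
typedef struct {
    bool stream_failed;
    int fail_after;
    int calls;
} sketch_server_b;

bool sketch_server_progress_b(void *ud, const char *phase, int done, int total) {
    sketch_server_b *srv = ud;
    (void)phase; (void)done; (void)total;
    if (++srv->calls >= srv->fail_after) {
        srv->stream_failed = true; /* stands in for a failed keepalive write */
    }
    return !srv->stream_failed;    /* false tells the caller to stop */
}

/* Returns the number of tokens prefilled; fewer than n_tokens means cancelled. */
int sketch_prefill_chunked_b(int n_tokens, int chunk,
                             sketch_progress_bool_fn progress, void *ud) {
    assert(chunk > 0);
    int done = 0;
    while (done < n_tokens) {
        int end = (n_tokens - done > chunk) ? done + chunk : n_tokens;
        /* ... the real per-chunk GPU work would run here ... */
        done = end;
        /* The call site now has to honour the callback's return value. */
        if (progress != NULL && !progress(ud, "prefill", done, n_tokens)) {
            break;
        }
    }
    return done;
}
```

Compared with the flag pointer, the cancel signal here travels through the callback itself, so the session carries no extra state, at the cost of updating every caller and every existing callback implementation.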

Option A has a smaller diff and doesn't change the public callback signature.

--

Observed on a Mac Studio M4 Max (128GB), q2-imatrix, with CLI flags: --ctx 192000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 16384
