
Commit 9b8fc6e

unamedkr and claude authored
feat(cli): quantcpp client (SSE streaming) + serve discoverability (#47)
The HTTP server already supported OpenAI-compatible SSE streaming (controlled by `"stream": true` in the request body), but it wasn't discoverable from the CLI. This PR makes it explicit and easy to use.

New: `quantcpp client PROMPT [--url ...] [--no-stream]`
- Sends a chat completion to a running `quantcpp serve` endpoint
- Default mode is streaming (SSE): tokens print as they arrive
- `--no-stream` falls back to a single JSON response
- Stdlib only (urllib), no extra dependencies

Improved: `quantcpp serve` startup output
- Now prints all three endpoints (chat/completions, models, health)
- Shows curl examples for both streaming and non-streaming modes
- Shows an OpenAI Python SDK snippet for drop-in usage

Verified end-to-end: the server streams token by token, the client decodes SSE chunks correctly, and `--no-stream` returns a single JSON response.

README (EN/KO) and the guide CTA updated to mention `quantcpp client` and the streaming/non-streaming choice.

Version: 0.12.0 → 0.12.1

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dac884f commit 9b8fc6e

6 files changed

Lines changed: 117 additions & 14 deletions
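The streaming mode described in the commit message follows the OpenAI chat-completions SSE shape: each event is a `data: {...}` line carrying a token fragment in `choices[0].delta.content`, and the stream ends with a `data: [DONE]` sentinel. A minimal standalone sketch of that decoding loop (the sample chunks below are illustrative, not captured server output):

```python
import json

def decode_sse(lines):
    """Collect delta tokens from OpenAI-style SSE `data:` lines."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[5:].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            out.append(delta)
    return "".join(out)

# Illustrative chunks in the shape the commit message describes.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(decode_sse(sample))  # -> Hello
```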

File tree

README.ko.md

Lines changed: 6 additions & 5 deletions
@@ -28,13 +28,14 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b            # HuggingFace에서 다운로드
-quantcpp run llama3.2:1b             # 대화형 채팅
-quantcpp serve llama3.2:1b -p 8080   # OpenAI 호환 HTTP 서버
-quantcpp list                        # 캐시된 모델 목록
+quantcpp pull llama3.2:1b            # HuggingFace에서 다운로드
+quantcpp run llama3.2:1b             # 대화형 채팅
+quantcpp serve llama3.2:1b -p 8080   # OpenAI 호환 HTTP 서버 (SSE 스트리밍)
+quantcpp client "안녕"               # 스트리밍 클라이언트 → :8080 서버
+quantcpp list                        # 캐시된 모델 목록
 ```
 
-짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다.
+짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다 — 클라이언트가 `"stream": true`를 보내면 SSE 토큰 단위 스트리밍, 생략하면 단일 JSON 응답. 내장 `quantcpp client`는 두 모드 모두 지원 (기본: 스트리밍, `--no-stream`: 단일 응답).
 
 **한 줄 질문:**
 ```bash

README.md

Lines changed: 6 additions & 5 deletions
@@ -41,13 +41,14 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b            # download from HuggingFace
-quantcpp run llama3.2:1b             # interactive chat
-quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server
-quantcpp list                        # show cached models
+quantcpp pull llama3.2:1b            # download from HuggingFace
+quantcpp run llama3.2:1b             # interactive chat
+quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp client "Hi"                 # streaming client → server on :8080
+quantcpp list                        # show cached models
 ```
 
-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
 
 **One-shot question:**
 ```bash
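As the updated README wording notes, the only request-level difference between the two modes is the `stream` flag. A minimal sketch of the two payload shapes (field names as in the diff above; values illustrative):

```python
import json

base = {"messages": [{"role": "user", "content": "Hi"}]}

streaming = {**base, "stream": True}   # SSE: tokens arrive as `data:` events
non_streaming = dict(base)             # omit `stream`: one JSON response

print(json.dumps(streaming))
```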

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "quantcpp"
-version = "0.12.0"
+version = "0.12.1"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@
     from importlib.metadata import version as _pkg_version
     __version__ = _pkg_version("quantcpp")
 except Exception:
-    __version__ = "0.12.0"  # fallback for editable / source-tree imports
+    __version__ = "0.12.1"  # fallback for editable / source-tree imports
 
 import os
 import sys
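The pattern this hunk touches (prefer installed package metadata, fall back to a hard-coded string for editable or source-tree imports) can be exercised standalone. A sketch, narrowing the diff's bare `except Exception` to `PackageNotFoundError` and using a distribution name assumed not to be installed:

```python
from importlib.metadata import PackageNotFoundError, version

def get_version(dist_name, fallback="0.12.1"):
    """Installed metadata wins; fallback covers editable / source-tree imports."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return fallback

# A distribution name that should not exist, forcing the fallback branch.
print(get_version("quantcpp-definitely-not-installed"))  # -> 0.12.1
```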

bindings/python/quantcpp/cli.py

Lines changed: 102 additions & 1 deletion
@@ -195,10 +195,92 @@ def cmd_serve(args):
         return 2
 
     cmd = [binary, model_path, "-p", str(args.port), "-j", str(args.threads)]
-    print(f"quant serve {os.path.basename(model_path)} on :{args.port}", file=sys.stderr)
+    print(f"quantcpp serve {os.path.basename(model_path)} on :{args.port}", file=sys.stderr)
+    print("", file=sys.stderr)
+    print("OpenAI-compatible endpoints:", file=sys.stderr)
+    print(f"  POST http://localhost:{args.port}/v1/chat/completions", file=sys.stderr)
+    print(f"  GET  http://localhost:{args.port}/v1/models", file=sys.stderr)
+    print(f"  GET  http://localhost:{args.port}/health", file=sys.stderr)
+    print("", file=sys.stderr)
+    print("Streaming (SSE — token-by-token):", file=sys.stderr)
+    print(f"  curl -N http://localhost:{args.port}/v1/chat/completions \\", file=sys.stderr)
+    print("    -H 'Content-Type: application/json' \\", file=sys.stderr)
+    print('    -d \'{"messages":[{"role":"user","content":"Hi"}],"stream":true}\'',
+          file=sys.stderr)
+    print("", file=sys.stderr)
+    print("Non-streaming (single JSON response):", file=sys.stderr)
+    print(f"  curl http://localhost:{args.port}/v1/chat/completions \\", file=sys.stderr)
+    print("    -H 'Content-Type: application/json' \\", file=sys.stderr)
+    print('    -d \'{"messages":[{"role":"user","content":"Hi"}]}\'',
+          file=sys.stderr)
+    print("", file=sys.stderr)
+    print("OpenAI Python SDK works as-is:", file=sys.stderr)
+    print(f"  client = OpenAI(base_url='http://localhost:{args.port}/v1', api_key='none')",
+          file=sys.stderr)
+    print("  client.chat.completions.create(model='quantcpp', messages=[...], stream=True)",
+          file=sys.stderr)
+    print("", file=sys.stderr)
     os.execvp(cmd[0], cmd)
 
 
+def cmd_client(args):
+    """Send a chat request to a running quantcpp serve endpoint.
+
+    Default mode is streaming (SSE) — tokens print as they arrive.
+    Use --no-stream for a single JSON response.
+    """
+    import json as _json
+    import urllib.request
+
+    url = args.url.rstrip("/") + "/v1/chat/completions"
+    payload = {
+        "model": args.model_name,
+        "messages": [{"role": "user", "content": args.prompt}],
+        "max_tokens": args.max_tokens,
+        "temperature": args.temperature,
+        "stream": not args.no_stream,
+    }
+    body = _json.dumps(payload).encode()
+    req = urllib.request.Request(
+        url, data=body,
+        headers={
+            "Content-Type": "application/json",
+            "User-Agent": "quantcpp-client",
+        },
+    )
+
+    try:
+        with urllib.request.urlopen(req) as resp:
+            if args.no_stream:
+                data = _json.loads(resp.read())
+                print(data["choices"][0]["message"]["content"])
+                return 0
+
+            # SSE stream — parse `data: {...}\n\n` chunks
+            for line in resp:
+                line = line.decode("utf-8", errors="replace").rstrip()
+                if not line.startswith("data:"):
+                    continue
+                payload_str = line[5:].strip()
+                if payload_str == "[DONE]":
+                    break
+                try:
+                    chunk = _json.loads(payload_str)
+                    delta = chunk["choices"][0]["delta"].get("content", "")
+                    if delta:
+                        print(delta, end="", flush=True)
+                except Exception:
+                    pass
+            print()
+            return 0
+    except urllib.error.URLError as e:
+        print(f"connection failed: {e}", file=sys.stderr)
+        print(f"  Is the server running on {args.url}?", file=sys.stderr)
+        print(f"  Start it with: quantcpp serve llama3.2:1b -p {args.url.rsplit(':', 1)[-1].rstrip('/')}",
+              file=sys.stderr)
+        return 1
+
+
 def cmd_chat_default(args):
     """Backwards-compatible default: auto-download Llama-3.2-1B and chat."""
     args.model = args.model or "Llama-3.2-1B"
@@ -222,13 +304,17 @@ def main():
     list                  List cached and available models
     run MODEL [PROMPT]    Chat with a model (auto-pulls if needed)
     serve MODEL           Start OpenAI-compatible HTTP server
+    client PROMPT         Send a request to a running serve (default: SSE streaming)
 
 examples:
     quantcpp pull llama3.2:1b
     quantcpp list
     quantcpp run llama3.2:1b
     quantcpp run llama3.2:1b "What is gravity?"
     quantcpp serve llama3.2:1b --port 8080
+    quantcpp client "What is gravity?"              # streams from :8080
+    quantcpp client "Hi" --url http://localhost:8081
+    quantcpp client "Hi" --no-stream                # single JSON response
 
 backwards-compat (no subcommand):
     quantcpp                       # default chat with Llama-3.2-1B
@@ -261,6 +347,19 @@ def main():
     p_serve.add_argument("-p", "--port", type=int, default=8080)
     p_serve.add_argument("-j", "--threads", type=int, default=4)
 
+    # client
+    p_client = sub.add_parser("client",
+        help="Send a chat request to a running quantcpp serve endpoint")
+    p_client.add_argument("prompt", help="Question to send")
+    p_client.add_argument("--url", default="http://localhost:8080",
+        help="Server URL (default: http://localhost:8080)")
+    p_client.add_argument("--model-name", "-m", default="quantcpp",
+        help="Model name in the request body (server ignores)")
+    p_client.add_argument("-n", "--max-tokens", type=int, default=256)
+    p_client.add_argument("-t", "--temperature", type=float, default=0.7)
+    p_client.add_argument("--no-stream", action="store_true",
+        help="Disable SSE streaming (single JSON response)")
+
     # Backwards-compat: top-level args for direct chat
     parser.add_argument("prompt", nargs="*", default=None,
                         help="(default mode) question to ask")
@@ -280,6 +379,8 @@ def main():
         return cmd_run(args)
     if args.command == "serve":
         return cmd_serve(args)
+    if args.command == "client":
+        return cmd_client(args)
 
     # No subcommand → backwards-compat default chat
     return cmd_chat_default(args)
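The new `cmd_client` assembles its endpoint and payload from the parser defaults registered in this diff, and that assembly can be checked without a running server. A sketch mirroring it (`build_request` is a hypothetical helper; the default values are copied from the `add_parser` calls above):

```python
import json

def build_request(prompt, url="http://localhost:8080", model_name="quantcpp",
                  max_tokens=256, temperature=0.7, no_stream=False):
    """Mirror cmd_client's endpoint and payload construction."""
    endpoint = url.rstrip("/") + "/v1/chat/completions"
    payload = {
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": not no_stream,  # streaming is the default mode
    }
    return endpoint, json.dumps(payload).encode()

endpoint, body = build_request("Hi")
print(endpoint)  # -> http://localhost:8080/v1/chat/completions
```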

site/index.html

Lines changed: 1 addition & 1 deletion
@@ -736,7 +736,7 @@ <h2 style="margin-bottom:1rem" data-i18n="cta.title">Try It Yourself</h2>
 quantcpp pull llama3.2:1b
 quantcpp run llama3.2:1b
 quantcpp serve llama3.2:1b -p 8080
-quantcpp list</code></pre>
+quantcpp client "Hi"   # SSE streaming</code></pre>
         </div>
         <div>
           <div style="font-size:.75rem;color:var(--text2);margin-bottom:.3rem;font-weight:600" data-i18n="cta.label.python">Python API</div>
