The HTTP server already supported OpenAI-compatible SSE streaming
(controlled by `"stream": true` in the request body) but it wasn't
discoverable from the CLI. This PR makes it explicit and easy to use.
New: `quantcpp client PROMPT [--url ...] [--no-stream]`
- Sends a chat completion to a running quantcpp serve endpoint
- Default mode is streaming (SSE) — tokens print as they arrive
- `--no-stream` falls back to a single JSON response
- Stdlib only (urllib) — no extra dependencies
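The stdlib streaming path can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `stream_chat` and `parse_sse_data` are hypothetical names, and the chunk shape assumes the server emits OpenAI-style `chat.completion.chunk` objects as `data:` lines.

```python
import json
import urllib.request

def parse_sse_data(line: str):
    """Extract the JSON payload from one SSE line; return None for
    comments, blank keep-alives, and the [DONE] sentinel."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)

def stream_chat(url: str, prompt: str):
    """Yield content tokens from an OpenAI-compatible SSE stream.
    (Hypothetical helper; the real CLI may differ in details.)"""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # ask the server for SSE instead of one JSON body
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # SSE is line-oriented; iterate the response
            chunk = parse_sse_data(raw.decode("utf-8"))
            if chunk is None:
                continue
            delta = chunk["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]
```

Printing each yielded token with `print(tok, end="", flush=True)` gives the arrive-as-they-stream behavior the default mode describes.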
Improved: `quantcpp serve` startup output
- Now prints all three endpoints (chat/completions, models, health)
- Shows curl examples for both streaming and non-streaming modes
- Shows OpenAI Python SDK snippet for drop-in usage
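Assuming the default `localhost:8080` and the standard OpenAI-style paths for the three endpoints named above (the exact paths for models and health are an assumption here), the printed curl examples would look roughly like:

```shell
# Non-streaming: omit "stream" and get one JSON body back
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hi"}]}'

# Streaming: "stream": true switches the server to SSE; -N disables
# curl's output buffering so tokens appear as they arrive
curl -sN http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hi"}],"stream":true}'

# Model discovery and liveness (assumed paths)
curl -s http://localhost:8080/v1/models
curl -s http://localhost:8080/health
```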
Verified end-to-end: the server streams token by token, the client decodes
SSE chunks correctly, and `--no-stream` returns a single JSON response.
README (EN/KO) and guide CTA updated to mention `quantcpp client`
and the streaming/non-streaming choice.
Version: 0.12.0 → 0.12.1.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.ko.md (+6 −5)
````diff
@@ -28,13 +28,14 @@
 ```bash
 pip install quantcpp

-quantcpp pull llama3.2:1b            # download from HuggingFace
-quantcpp run llama3.2:1b             # interactive chat
-quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server
-quantcpp list                        # list cached models
+quantcpp pull llama3.2:1b            # download from HuggingFace
+quantcpp run llama3.2:1b             # interactive chat
+quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp client "안녕"                # streaming client → server on :8080
+quantcpp list                        # list cached models
 ```

-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`. `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080.
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`. `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080; clients that send `"stream": true` get SSE token-by-token streaming, and omitting it returns a single JSON response. The built-in `quantcpp client` supports both modes (default: streaming; `--no-stream`: single response).
````
(Korean comments and prose translated; the code literal `"안녕"` is kept as-is.)
````diff
+quantcpp client "Hi"                 # streaming client → server on :8080
+quantcpp list                        # show cached models
 ```

-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080.
+Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
````