Commit 66635a0

feat(example): Updated server example (batch processing, /v1/responses api, response parsing) (abetlen#2174)
* Add batch processing server
* Improve response parser streaming performance
* Add /v1/models endpoint
* Support custom tools in responses API
* Improve responses api compatibility for codex
* Improve batch server prompt and context config
* Improve batch server scheduling and prompt handling
* fix apply_patch tool
* Improve batch server schemas and metrics
* Refactor sequence cache helpers
* Fix server type diagnostics
* feat: add llama.cpp extension bindings
* feat: add MTP support to batch server
* feat: improve draft-mtp handling in batch server
* feat: cap MTP draft context outputs
* fix: preserve held streaming tokens
* feat: add load-time LoRA support to batch server
* feat: add multimodal support to batch server
* refactor: rename batch item kinds
* refactor: type sampled mtp updates
* refactor: structure sampled mtp batch processing
* refactor: clarify batch item construction
* refactor: type batch item kind
* refactor: clarify sampled pending index
* refactor: clarify output index naming
* refactor: rename logits index resolver
* refactor: colocate sampled mtp state
* refactor: inline sampled mtp helpers
* refactor: use row-expanded multimodal prompt identity
* test: remove multimodal prompt plan tests
* refactor: narrow mtmd processor dependencies
* refactor: group prompt segment media fields
* refactor: centralize sequence state copy
* refactor: keep disk cache storage only
* refactor: split batch item payloads
* refactor: centralize pending request failure cleanup
* refactor: centralize sequence claiming
* refactor: key sequence disk cache compatibility
* refactor: decouple completion request preparation
* refactor: name prepared completion parts
* refactor: return prepared completion parts
* refactor: localize media cache key building
* refactor: remove unused request id override
* refactor: simplify prompt segment row capacity
* refactor: inline prompt row clamp
* refactor: inline disconnect cancellation response
* refactor: simplify recurrent draft capacity
* refactor: define builtin grammar rule as dataclass
* refactor: type chat template conversions
* docs: mark llama_cpp_ext experimental
* feat: restrict multimodal media sources
* docs: add server example README and config
* docs: document server example configuration
* docs: update server README
* docs: document server wheel setup and clients
* docs: add server model configs
* docs: add server chat templates and response schemas
* docs: keep batch processing server example
* docs: add server example changelog entry
* docs: mention multi-token prediction in changelog
1 parent ed83366 commit 66635a0

9 files changed

Lines changed: 17517 additions & 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 - feat: update llama.cpp to ggml-org/llama.cpp@5a69c9743
+- feat(example): Updated server example (batch processing, multi-token prediction, `/v1/responses` api, response parsing) by @abetlen in #2174
 
 ## [0.3.26]
 