[pull] main from abetlen:main #115
Merged
…es` api, response parsing) (#2174)

* Add batch processing server
* Improve response parser streaming performance
* Add /v1/models endpoint
* Support custom tools in responses API
* Improve responses api compatibility for codex
* Improve batch server prompt and context config
* Improve batch server scheduling and prompt handling
* fix apply_patch tool
* Improve batch server schemas and metrics
* Refactor sequence cache helpers
* Fix server type diagnostics
* feat: add llama.cpp extension bindings
* feat: add MTP support to batch server
* feat: improve draft-mtp handling in batch server
* feat: cap MTP draft context outputs
* fix: preserve held streaming tokens
* feat: add load-time LoRA support to batch server
* feat: add multimodal support to batch server
* refactor: rename batch item kinds
* refactor: type sampled mtp updates
* refactor: structure sampled mtp batch processing
* refactor: clarify batch item construction
* refactor: type batch item kind
* refactor: clarify sampled pending index
* refactor: clarify output index naming
* refactor: rename logits index resolver
* refactor: colocate sampled mtp state
* refactor: inline sampled mtp helpers
* refactor: use row-expanded multimodal prompt identity
* test: remove multimodal prompt plan tests
* refactor: narrow mtmd processor dependencies
* refactor: group prompt segment media fields
* refactor: centralize sequence state copy
* refactor: keep disk cache storage only
* refactor: split batch item payloads
* refactor: centralize pending request failure cleanup
* refactor: centralize sequence claiming
* refactor: key sequence disk cache compatibility
* refactor: decouple completion request preparation
* refactor: name prepared completion parts
* refactor: return prepared completion parts
* refactor: localize media cache key building
* refactor: remove unused request id override
* refactor: simplify prompt segment row capacity
* refactor: inline prompt row clamp
* refactor: inline disconnect cancellation response
* refactor: simplify recurrent draft capacity
* refactor: define builtin grammar rule as dataclass
* refactor: type chat template conversions
* docs: mark llama_cpp_ext experimental
* feat: restrict multimodal media sources
* docs: add server example README and config
* docs: document server example configuration
* docs: update server README
* docs: document server wheel setup and clients
* docs: add server model configs
* docs: add server chat templates and response schemas
* docs: keep batch processing server example
* docs: add server example changelog entry
* docs: mention multi-token prediction in changelog
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )