
Rebase on latest llama.cpp #116

Open
zoq wants to merge 1015 commits into tetherto:temp-7248 from zoq:temp-7248-rebase

Conversation


@zoq zoq commented Mar 29, 2026

Align call sites with API changes.

allozaur and others added 30 commits February 12, 2026 12:21
* webui: UI primitives and polish (non-MCP)

* chore: update webui build output
This commit adds support for using the pr2wt.sh (pull request to
workspace) script with forks of upstream llama.cpp.
…gml-org#19556)

* feat: Enable adding System Prompt per-chat

* fix: Save draft message in Chat Form when adding System Prompt from new chat view

* fix: Proper system message deletion logic

* chore: Formatting

* chore: update webui build output
* Updated documentation

Model is no longer a parameter

* llama : fix trailing whitespace in comment

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* opencl: add q4_1 mv

* opencl: clean up

* opencl: add flattened q4_1 mv

* opencl: clean up

* opencl: add basic q4_1 mm

* opencl: fix whitespace

* opencl: add general q4_0 mm
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Using the same conversion function ensures consistent matching between
the regex pattern and the text.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix conv state update for llama-server parallel serving

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Do not mutate cgraph for fused ADDs

1. We should try to minimize in-place changes to the incoming
   ggml_cgraph where possible (those should happen in graph_optimize)
2. Modifying in-place leads to an additional, unnecessary graph capture
   step as we store the properties before modifying the graph in-place
   in the cuda-backend

* Assert ggml_tensor is trivially copyable

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* chore: update webui build output

* chore: update webui build output

* fix: Scroll issues in DropdownMenuSearchable

* webui: fix redirect to root ignoring base path

* fix: Word wrapping

* fix: remove obsolete modality UI tests causing CI failures

- Remove VisionModality/AudioModality test stories
- Remove mockServerProps usage and imports
- Simplify Default test (remove dropdown interaction checks)
- Simplify FileAttachments test (remove mocks)

* feat: Improve formatting performance time

---------

Co-authored-by: Pascal <admin@serveurperso.com>
* CUDA: loop over ne2*ne3 in case it overflows

* use fastdiv
* vulkan: fix ggml_acc, which only worked in 3D but not 4D

* removed clamp in test_acc_block

* use the correct stride and its test case

* cuda : fix "supports op" condition

* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except keep the boundary check

* version without boundary check

* revert back to boundary check version

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…gml-org#19460)

* model: support GLM MoE DSA arch

* working version

* pyright

* keep indexer tensors

* add indexer gguf params

* loaded now

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* update

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* minor fix and cleanup

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : remove legacy .json to .etag migration code

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : simplify common_download_file_single_online

This commit also forces a redownload if the file exists
but has no .etag file.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…gml-org#19583)

* ggml-hexagon: fa improvements

ggml-hexagon: optimize flash attention calculations with improved variable handling

ggml-hexagon: streamline flash attention operations by removing redundant checks for FP32

ggml-hexagon: optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements

ggml-hexagon: optimize flash attention by changing slope vector type to F16

* hexfa: fixed test-backend-ops failures due to leftover element handling

* hexagon: refactor and optimize fa to use local context struct

* ggml-hexagon: optimize flash-attention using hvx_vec_expf

Use HVX for online softmax.

---------

Co-authored-by: chraac <chraac@gmail.com>
This commit allows Qualcomm native vulkan driver to be used on Windows
instead of Mesa Dozen.
zoq and others added 5 commits March 29, 2026 16:33
* Add TQ2_0 and TQ1_0 support to the Metal backend. (tetherto#85)

* Add TQ2_0 and TQ1_0 support to the Metal backend.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add tq2_0/q8_0 fallback aliases for loongarch/riscv.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Resolve macro function for tq2_0/q8_0/q8_1 and split into two separate functions.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add missing backslash to fix the macOS CI workflow.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* The Metal compiler doesn't allow constant address space on local variables.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix visionOS builds with LLAMA_HTTPLIB=OFF.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix WASM WebGPU builds with -DLLAMA_BUILD_TOOLS=OFF.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add inference support for BitNet models using Vulkan (tetherto#98)

* ggml-vulkan: Add TQ2_0 dequantize and mul_mat vec

* ggml-vulkan: Enable coopmat support for Android

* ggml-vulkan: Add mul_mm path for TQ2_0

* SET_ROWS and GET_ROWS have no TQ2_0 support yet.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Vulkan: Fix TQ2_0 mul_mm pipeline

* Add support for microsoft/bitnet-b1.58-2B-4T (HF to GGUF).

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Vulkan: TQ2_0 x Q8_1 MUL_MAT perf improvements

* Vulkan: Add TQ1_0 infra

* Vulkan: Add MUL_MAT_MAT and MUL_MAT_VEC support for TQ1

* Make sure we report the supported ops + datatypes.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: vineet <vineet.suryan@collabora.com>
Co-authored-by: Marcus Edel <marcus.edel@collabora.com>
Co-authored-by: Italo Nicola <italo.nicola@collabora.com>

* Ignore GGML_OP_SET_ROWS parameters during gradient calculation, since there is no effect on the output gradients.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add lora finetuning from adapter

* Add: create new lora adapter for target modules to finetune if no lora is provided

* Fix identical loss over epochs; fix garbage lora initialization

Signed-off-by: vineet <vineet.suryan@collabora.com>

* Remove lora training from finetune.cpp

Signed-off-by: vineet <vineet.suryan@collabora.com>

* Add adapter saving & other lora target modules

Signed-off-by: vineet <vineet.suryan@collabora.com>

* Add finetune-lora for lora finetuning in examples

Signed-off-by: vineet <vineet.suryan@collabora.com>

* Update README with finetune-lora

Signed-off-by: vineet <vineet.suryan@collabora.com>

* Add dequantization to out_prod cuda kernel

Signed-off-by: vineet <vineet.suryan@collabora.com>

* CPU: add support for fp16_fp32 OUT_PROD op

* Remove unused variable val_split.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Explicitly define the optimizer, to fix missing initializer for member issue.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* finetune-lora: Add checkpoint saving & resuming from saved checkpoint

This commit adds checkpointing for fine-tuning:
- Add checkpoint saving every N steps with --checkpoint-save-steps
- Save complete training state: model weights, optimizer state, metadata
- Implement two-phase optimizer state loading to avoid memory issues
- Add --resume-from and --auto-resume functionality
- Store optimizer momentum/variance tensors in GGUF format
- Add checkpoint validation for rank, alpha, and target modules
- Update README.md with checkpointing documentation

Optimizer state loading is two-phase: the iteration count is loaded during
initialization, while tensor data (grad_m, grad_v) is loaded after
ggml_opt_alloc creates the proper tensor structures.

* Add simple test to choose the right datatype based on the supported OUT_PROD datatype implementation.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add OUT_PROD, RMS_NORM_BACK, SILU_BACK metal shader.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* lora: Fix LoRA K/V gradient flow with gradient-connected kv cache retrieval

Add get_k_lora() and get_v_lora() methods that use concatenation
instead of ggml_view_4d to maintain gradient connectivity during
training. This ensures LoRA K/V parameters receive proper gradients
while preserving causal attention behavior.

* lora: Add Instruction Finetuning support

- Add masked loss computation on assistant responses only
- Implement Vulkan masked cross-entropy loss shader & count_equal shader
- Support default ChatML template & custom Jinja chat templates

* Add SOFT_MAX_BACK metal kernel.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Extend swift example app with finetuning support.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix Q4 OUT_PROD iq upper handling.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add learning rate scheduler: constant (default), linear, and cosine.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add warmup-ratio parameter to match HF training.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
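
The schedules described in the two commits above (constant, linear, cosine, plus a warmup ratio) could look roughly like this; `lr_schedule`, `lr_at`, and the warmup semantics are assumed names for illustration, not the actual llama.cpp API:

```cpp
#include <cmath>

// Illustrative learning-rate schedule with linear warmup, in the spirit
// of HF-style training loops. Not the real implementation.
enum class lr_schedule { constant, linear, cosine };

double lr_at(lr_schedule sched, double base_lr,
             int step, int total_steps, double warmup_ratio) {
    const int warmup_steps = (int)(warmup_ratio * total_steps);
    if (warmup_steps > 0 && step < warmup_steps) {
        // linear warmup from ~0 up to base_lr
        return base_lr * (step + 1) / (double) warmup_steps;
    }
    // progress through the post-warmup portion, in [0, 1]
    const double t = total_steps > warmup_steps
        ? (step - warmup_steps) / (double)(total_steps - warmup_steps)
        : 0.0;
    switch (sched) {
        case lr_schedule::linear: return base_lr * (1.0 - t);
        case lr_schedule::cosine: return base_lr * 0.5 * (1.0 + std::cos(t * std::acos(-1.0)));
        default:                  return base_lr;  // constant
    }
}
```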

* lora: Fix lr assertion on step 0

* lora: Fix training start from step 2

* Added

* Updating code to enable mid-epoch cancellation

* cpp lint applied

* Fix geglu_back implementation

- Fix CPU implementation: now correctly computes gelu_backward(gate, grad) instead of
splitting computation across two halves
- Update Vulkan shader to match corrected implementation with proper gelu_backward
- Add a test for geglu_back op

The previous implementation incorrectly assumed geglu_back operated on concatenated
tensors and split them. The correct implementation computes the GELU backward pass
element-wise on the gate values.
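
The corrected element-wise behavior can be illustrated with a scalar sketch; `gelu` and `gelu_backward` here are illustrative helpers using the exact erf form, not ggml symbols (the actual kernels may use the tanh approximation of GELU):

```cpp
#include <cmath>

// Scalar sketch of the fix described above: the backward pass is applied
// element-wise to the gate values, using d/dx gelu(x) = Phi(x) + x * phi(x),
// scaled by the incoming gradient.
double gelu(double x) {
    return 0.5 * x * (1.0 + std::erf(x / std::sqrt(2.0)));
}

double gelu_backward(double gate, double grad) {
    const double cdf = 0.5 * (1.0 + std::erf(gate / std::sqrt(2.0)));                   // Phi(gate)
    const double pdf = std::exp(-0.5 * gate * gate) / std::sqrt(2.0 * std::acos(-1.0)); // phi(gate)
    return grad * (cdf + gate * pdf);
}
```

A quick finite-difference check confirms the derivative matches the forward function.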

* Gemma Chat Template Support for LoRA Finetuning

- Add auto-detection for Gemma format (<start_of_turn>model\n...<end_of_turn>)
- Falls back to ChatML format for other models
- Uses the model's default chat template, i.e. no need for a Jinja chat template

This enables instruction finetuning on any model.

* Fixed ibatch Mismatch in llama_opt_epoch Resume

* CPP lint ran

* lora: Update readme; add architecture overview

* Add guide about how to support a new model.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Simplify main README to focus on LoRA finetuning. (tetherto#71)

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Vulkan: add support for fp32 OUT_PROD op

* Vulkan: add support for f16_f32 OUT_PROD op

* Vulkan: Add Q4_0/Q8_0 OUT_PROD Vulkan support

* vulkan: Add initial cross entropy loss backward shader

Signed-off-by: vineet <vineet.suryan@collabora.com>

* vulkan: Fix cross-entropy-loss-back dispatch size and wg denominator

Signed-off-by: vineet <vineet.suryan@collabora.com>

* vulkan: Change uint32 cast to int32 for outprod; allows android compilation

Signed-off-by: vineet <vineet.suryan@collabora.com>

* vulkan: Set specialization constants to { 0 } for out_prod

This fixes the vkDeviceLostError on Mali

* vulkan: Set out_prod pipeline disable_robustness to true

* Fix out_prod; vulkan ci issues

* Add GEGLU backward (Vulkan) to enable Gemma training.

* Vulkan: Clean up OUT_PROD shader and pipelines

Shouldn't change any behavior since currently nb00 is always 1.
Robustness is usually disabled for Q8/Q4 shaders since having it enabled
impacts performance more significantly for those types than F16/F32.

* Vulkan: Improve Q8 OUT_PROD performance

Increase OUT_PROD Q8 performance through improving memory locality.

* metal: port OUT_PROD, SILU_BACK, SOFT_MAX_BACK, RMS_NORM_BACK ops to split architecture

* Backport shader.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Initialize sin_sign in rope kargs to fix broken positional encoding.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix Windows build by using path::string() for wchar_t conversion

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix format specifiers for int64_t portability.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add missing resume_from_batch arg to llama_opt_epoch call.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix TQ2_0 dequantization.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix slow ReBAR reads on discrete GPUs and relax contiguity checks for backward pass.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Use VMA random-access host alloc, skip n_ctx padding and host-buft override during training.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Fix loss calculation and TQ2_0 dequantization.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* ggml-vulkan: workaround for Adreno MUL_MAT Q6_K

* ggml-vulkan: workaround for Adreno MUL_MAT TQ1

* vulkan: revert graph_optimize skip for prompt processing

* vulkan: ensure host coherent memory on UMA devices

Signed-off-by: vineet <vineet.suryan@collabora.com>

* ggml-vulkan: fix GGML_VULKAN_CHECK_RESULTS

* ggml-vulkan: skip CROSS_ENTROPY_LOSS_MASKED for check_results

* ggml-vulkan: skip COUNT_EQUAL_MASKED for check_results

* ggml-vulkan: improve OUT_PROD Q4 performance

* Fix LLAMA_LORA_TARGET_ALL bitmask

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* Preserve C API compatibility for llama_opt_epoch

Add llama_opt_epoch_resume function for the resume-from-batch
use case and update callers accordingly.

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* lora: enhance LoRA init safety and simplify caller

- Add overflow and error checks for snprintf when generating LoRA tensor names
- Encapsulate tensor pointer validation within llama_lora_init_tensor_weights()
  and return bool to simplify the caller

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: add llama_opt_default_params and use it in examples

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: add reproducible seed, improve safety and style in LoRA training

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: refactor LoRA tensor init to use exceptions

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* ggml-opt: refactor batch memory copying to use lambda

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* fix: typo in ggml.c & README

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: document masking constraints and fix metadata extension

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* tests: add ops tests for cross_entropy_loss_masked

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* Add bounds check for --chat-template argument parsing & remove stray backslash

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: add TODO for refactoring CLI argument parsing

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: add a comment about dropout not being used yet

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* training: add static_assert to catch llama_layer padding issues

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* graph: restore ggml_view_4d for non-contiguous Q tensor support

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* ggml-vulkan: Add buffer sync to cross_entropy_loss_masked_back op

* ggml-vulkan: add support for tiling as a workaround for memory issues

* The Metal ADD shader already uses strides for indexing, so non-contiguous tensors work correctly.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Wrap tensor and make it contiguous.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Add comment on graph_max_nodes bump for LoRA finetuning.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* fix: resume_from_batch=0 incorrectly treated as no-resume in opt_epoch

llama_opt_epoch_resume accepts a resume_from_batch parameter where -1
means "no resume, start from the beginning." However, opt_epoch used
`resume_from_batch > 0` to distinguish resume from non-resume, which
means resume_from_batch=0 (a valid value meaning "batch 0 was the last
completed, start from batch 1") was silently treated as no-resume,
causing the entire epoch to replay from the start.

This affects any caller that pauses training after the first batch of
an epoch (globalStep = 1, or any globalStep that is one more than a
multiple of stepsPerEpoch), since the computed resume batch offset
modulo stepsPerEpoch is 0.
Fix: change `> 0` to `>= 0` in both the idata start position and the
idata_in_loop calculation, so that -1 remains the only sentinel for
"no resume."

Made-with: Cursor
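
The sentinel logic after this fix reduces to a one-line sketch; `idata_start` is an illustrative helper, not the actual opt_epoch code:

```cpp
// -1 is the only "no resume" sentinel, so the comparison must be >= 0,
// not > 0: resume_from_batch = 0 is a valid value meaning batch 0 was
// the last completed batch, so training resumes from batch 1.
int idata_start(int resume_from_batch) {
    return resume_from_batch >= 0 ? resume_from_batch + 1 : 0;
}
```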

* Fix memory leak in optimizer state loading.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Disable command-buffer concurrency by default on iOS.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* Override and default to n_cb=2 on iOS.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* fix: restore context state for inference after training cleanup

Save and restore n_ctx_train in opt_init/opt_cleanup to prevent
training from permanently modifying the model's context length.
Reset the scheduler and clear the previous graph result in opt_cleanup
so the context can be reused for inference after finetuning.

Made-with: Cursor

* Add @autoreleasepool to encode_async block to prevent ObjC object accumulation on GCD worker threads.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* fix: keep output tensor on CPU for iOS to avoid Metal buffer limits

On iOS, cap GPU-offloaded layers at n_layer (excluding the output layer)
to prevent exceeding Metal memory constraints on mobile devices.

Made-with: Cursor

* Remove unused variable 'tensor_name'.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

* training: fix LLAMA_LORA_TARGET_ALL for ISO C compliance

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* ci: disable native CPU optimizations for x64-cpu-low-perf builds

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* ci: increase timeout for ubuntu-24-cmake-vulkan tests

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* Add resume-from-checkpoint support to Metal LoRA fine-tuning

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* Fix missing parameters for llama_swift_finetune_options

Signed-off-by: Italo Nicola <italo.nicola@collabora.com>

* tests: disable TQ2_0 tests in test-backend-ops due to llvmpipe bug

Temporarily disable TQ2_0 quantization tests to work around a bug in
llvmpipe. Tests pass successfully on all real Vulkan hardware
(Nvidia, ARM GPUs) but fail on llvmpipe with high error values.

Signed-off-by: makaveli10 <vineet.suryan@collabora.com>

* Enable the tests again.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>

---------

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Signed-off-by: vineet <vineet.suryan@collabora.com>
Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
Signed-off-by: Italo Nicola <italo.nicola@collabora.com>
Co-authored-by: gianni <gianfranco.cordella@tether.io>
Co-authored-by: vineet <vineet.suryan@collabora.com>
Co-authored-by: Italo Nicola <italo.nicola@collabora.com>
Co-authored-by: Nidhin <nidhinpd811@gmail.com>
Co-authored-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Co-authored-by: gianni-cor <gianfrancocordella@gmail.com>
…ild, and Vulkan T4 segfault tolerance (driver bug workaround) (tetherto#113)

* fix: download LFS blobs via curl in tokenizer test

Git clone does not fetch LFS blobs on CI runners without git-lfs
configured. Fall back to curling each .gguf file from the HuggingFace
resolve endpoint when the local copy is not a valid GGUF.

* ci: fix ccache key to include CPU feature hash

The ubuntu-cpu-cmake job builds with GGML_NATIVE=ON (-march=native)
but the ccache key was CPU-agnostic. GitHub runners with different CPUs
(e.g. Intel w/ AVX-512 vs AMD w/o) shared the same cache, so ccache
served objects compiled for the wrong architecture, causing SIGILL at
runtime. Hash GCC's -march=native preprocessor defines into the key so
each CPU architecture gets its own cache.

* ci: tolerate Vulkan T4 driver segfault-on-exit in tests

NVIDIA Tesla T4 with driver 570.x has a known bug where any test that
initializes the Vulkan backend can non-deterministically segfault during
exit/cleanup inside libnvidia-gpucomp.so. The crash is in the driver's
atexit handlers racing with its own [vkps] Update thread -- all test
cases pass but the process may crash on teardown.

This is the same issue that caused upstream llama.cpp to disable their
T4 Vulkan CI node entirely (ggml-org#10528, ggml-org#10989).

Workaround: detect T4 via nvidia-smi, and if the only test failures are
SegFault exceptions, treat the run as success.

* ci: address PR tetherto#111 review feedback

- Use sed + numbered-line matching instead of grep -A 100 with
  explicit failure-type patterns for detecting non-SegFault CTest
  failures; catches all failure types (SIGILL, OTHER_FAULT, etc.)
  instead of only Exception|Failed|Timeout.
- Fail early with a clear error message when curl fails to download
  a .gguf LFS file, instead of silently leaving the pointer file in
  place.

Signed-off-by: Marcus Edel <marcus.edel@collabora.com>