* webui: UI primitives and polish (non-MCP)
* chore: update webui build output
This commit adds support for using the pr2wt.sh (pull request to workspace) script with forks of upstream llama.cpp.
…gml-org#19556)
* feat: Enable adding System Prompt per-chat
* fix: Save draft message in Chat Form when adding System Prompt from new chat view
* fix: Proper system message deletion logic
* chore: Formatting
* chore: update webui build output
* Updated documentation: Model is no longer a parameter
* llama : fix trailing whitespace in comment
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* opencl: add q4_1 mv
* opencl: clean up
* opencl: add flattened q4_1 mv
* opencl: clean up
* opencl: add basic q4_1 mm
* opencl: fix whitespace
* opencl: add general q4_0 mm
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Using the same conversion function ensures consistent matching between the regex pattern and the text.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
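The point above is that the pattern and the text must pass through the same conversion before matching. A minimal sketch of the idea; the `normalize` helper is hypothetical and stands in for whatever conversion the real code applies, not llama.cpp's actual implementation:

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <string>

// Hypothetical helper: the SAME normalization is applied to both the regex
// pattern source and the input text, so characters that should compare equal
// are mapped identically before std::regex ever sees them.
static std::string normalize(const std::string & s) {
    std::string out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        out += (char) std::tolower(c);   // placeholder for the real conversion
    }
    return out;
}

static bool match(const std::string & pattern, const std::string & text) {
    // Converting both sides with one function keeps them consistent; mixing
    // two different conversions is exactly the bug class the commit avoids.
    return std::regex_search(normalize(text), std::regex(normalize(pattern)));
}
```

If the pattern were normalized with one function and the text with another, a pattern like `Hello` could fail to match text containing `hello` depending on which side got lowercased.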
* fix conv state update for llama-server parallel serving
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* Do not mutate cgraph for fused ADDs
  1. We should try to minimize in-place changes to the incoming ggml_cgraph where possible (those should happen in graph_optimize)
  2. Modifying in-place leads to an additional, unnecessary graph capture step, as we store the properties before modifying the graph in-place in the CUDA backend
* Assert ggml_tensor is trivially copyable
* Update ggml/src/ggml-cuda/ggml-cuda.cu
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
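The trivially-copyable assertion matters because graph properties are snapshotted with raw byte copies before any in-place edits. A sketch of why the assert guards that pattern; `tensor_like` is a made-up stand-in, not the real `ggml_tensor` layout:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for ggml_tensor; the real struct lives in ggml.h.
struct tensor_like {
    int64_t ne[4];   // number of elements per dimension
    size_t  nb[4];   // stride in bytes per dimension
    void  * data;
};

// If the struct is trivially copyable, a memcpy snapshot is a valid copy,
// so stored graph properties stay correct even if the live graph is later
// modified. A non-trivially-copyable member would silently break this.
static_assert(std::is_trivially_copyable<tensor_like>::value,
              "memcpy-based snapshots require a trivially copyable tensor");

tensor_like snapshot(const tensor_like & t) {
    tensor_like copy;
    std::memcpy(&copy, &t, sizeof(t));
    return copy;
}
```

The `static_assert` costs nothing at runtime and turns a subtle future refactor (say, adding a `std::string` member) into a compile error at the point of use.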
* chore: update webui build output
* chore: update webui build output
* fix: Scroll issues in DropdownMenuSearchable
* webui: fix redirect to root ignoring base path
* fix: Word wrapping
* fix: remove obsolete modality UI tests causing CI failures
  - Remove VisionModality/AudioModality test stories
  - Remove mockServerProps usage and imports
  - Simplify Default test (remove dropdown interaction checks)
  - Simplify FileAttachments test (remove mocks)
* feat: Improve formatting performance
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* CUDA: loop over ne2*ne3 in case it overflows
* use fastdiv
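When a kernel loops over the fused `ne2*ne3` extent, recovering `(i2, i3)` from the flat index needs an integer division per element, which is why a precomputed "fastdiv" pays off. A sketch of the general multiply-shift technique (Granlund–Montgomery style); this is the idea, not the exact helper used in the CUDA backend, and a real GPU version packs the magic differently to avoid 128-bit arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Division by a fixed divisor d, replaced by one multiply and one shift.
// The magic constant is computed once on the host; per-element work in the
// kernel then avoids the (slow) hardware integer divide.
struct fastdiv_t {
    uint64_t mult;
    uint32_t shift;
};

fastdiv_t make_fastdiv(uint32_t d) {
    uint32_t l = 0;
    while ((1ull << l) < d) l++;   // l = ceil(log2 d)
    fastdiv_t f;
    f.shift = 32 + l;
    // ceil(2^(32+l) / d); correct for all 32-bit numerators.
    f.mult  = (uint64_t)((((unsigned __int128)1 << f.shift) + d - 1) / d);
    return f;
}

uint32_t fast_div(uint32_t n, fastdiv_t f) {
    return (uint32_t)(((unsigned __int128)n * f.mult) >> f.shift);
}
```

A kernel would then split a flat index `i` over `ne2*ne3` as `i3 = fast_div(i, f_ne2); i2 = i - i3 * ne2;` with `f_ne2 = make_fastdiv(ne2)`, and the fused extent itself must be computed in 64 bits since `ne2*ne3` can exceed `INT32_MAX`.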
* fix vulkan ggml_acc: only works in 3d but not 4d
* removed clamp in test_acc_block
* use the correct stride and its test case
* cuda : fix "supports op" condition
* change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion, except keeping the boundary check
* version without boundary check
* revert back to boundary check version
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…gml-org#19460)
* model: support GLM MoE DSA arch
* working version
* pyright
* keep indexer tensors
* add indexer gguf params
* loaded now
* Apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* update
* Update src/llama-model.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* minor fix and cleanup
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : remove legacy .json to .etag migration code
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : simplify common_download_file_single_online
  This commit also forces a redownload if the file exists but has no .etag file.
  Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…gml-org#19583)
* ggml-hexagon: fa improvements
  - optimize flash attention calculations with improved variable handling
  - streamline flash attention operations by removing redundant checks for FP32
  - optimize hvx_dot_f16_f16_aa_rx2 by simplifying variable handling for unused elements
  - optimize flash attention by changing the slope vector type to F16
* hexfa: fixed test-backend-ops failures due to leftover element handling
* hexagon: refactor and optimize fa to use a local context struct
* ggml-hexagon: optimize flash-attention using hvx_vec_expf
  Use HVX for the online softmax.
---------
Co-authored-by: chraac <chraac@gmail.com>
This commit allows the Qualcomm native Vulkan driver to be used on Windows instead of Mesa Dozen.
* Add TQ2_0 and TQ1_0 support to the Metal backend. (tetherto#85)
  * Add TQ2_0 and TQ1_0 support to the Metal backend. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Add tq2_0/q8_0 fallback aliases for loongarch/riscv. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Resolve macro function for tq2_0/q8_0/q8_1 and split into two separate functions. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Add missing backslash to fix the macOS CI workflow. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * The Metal compiler doesn't allow constant address space on local variables. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Fix visionOS builds with LLAMA_HTTPLIB=OFF. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Fix WASM WebGPU builds with DLLAMA_BUILD_TOOLS=OFF. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  ---------
  Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add inference support for BitNet models using Vulkan (tetherto#98)
  * ggml-vulkan: Add TQ2_0 dequantize and mul_mat vec
  * ggml-vulkan: Enable coopmat support for Android
  * ggml-vulkan: Add mul_mm path for TQ2_0
  * SET_ROWS and GET_ROWS have no TQ2_0 support yet. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Vulkan: Fix TQ2_0 mul_mm pipeline
  * Add support for microsoft/bitnet-b1.58-2B-4T (HF to GGUF). Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  * Vulkan: TQ2_0 x Q8_1 MUL_MAT perf improvements
  * Vulkan: Add TQ1_0 infra
  * Vulkan: Add MUL_MAT_MAT and MUL_MAT_VEC support for TQ1
  * Make sure we report the supported ops + datatypes. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  ---------
  Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
  Co-authored-by: vineet <vineet.suryan@collabora.com>
  Co-authored-by: Marcus Edel <marcus.edel@collabora.com>
  Co-authored-by: Italo Nicola <italo.nicola@collabora.com>
* Ignore GGML_OP_SET_ROWS parameters during gradient calculation, since there is no effect on the output gradients. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add lora finetuning from adapter
* Add: create new lora adapter for target modules to finetune if no lora is provided
* Fix identical loss over epochs; fix garbage lora initialization. Signed-off-by: vineet <vineet.suryan@collabora.com>
* Remove lora training from finetune.cpp. Signed-off-by: vineet <vineet.suryan@collabora.com>
* Add adapter saving & other lora target modules. Signed-off-by: vineet <vineet.suryan@collabora.com>
* Add finetune-lora for lora finetuning in examples. Signed-off-by: vineet <vineet.suryan@collabora.com>
* Update README with finetune-lora. Signed-off-by: vineet <vineet.suryan@collabora.com>
* Add dequantization to out_prod cuda kernel. Signed-off-by: vineet <vineet.suryan@collabora.com>
* CPU: add support for fp16_fp32 OUT_PROD op
* Remove unused variable val_split. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Explicitly define the optimizer, to fix the missing-initializer-for-member issue. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* finetune-lora: Add checkpoint saving & resuming from saved checkpoint
  This commit adds checkpointing for fine-tuning:
  - Add checkpoint saving every N steps with --checkpoint-save-steps
  - Save complete training state: model weights, optimizer state, metadata
  - Implement two-phase optimizer state loading to avoid memory issues
  - Add --resume-from and --auto-resume functionality
  - Store optimizer momentum/variance tensors in GGUF format
  - Add checkpoint validation for rank, alpha, and target modules
  - Update README.md with checkpointing documentation
  The optimizer state loading is two-phase: the iteration count is loaded during initialization, while tensor data (grad_m, grad_v) is loaded after ggml_opt_alloc creates the proper tensor structures.
* Add simple test to choose the right datatype based on the supported OUT_PROD datatype implementation. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add OUT_PROD, RMS_NORM_BACK, SILU_BACK metal shaders. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* lora: Fix LoRA K/V gradient flow with gradient-connected kv cache retrieval
  Add get_k_lora() and get_v_lora() methods that use concatenation instead of ggml_view_4d to maintain gradient connectivity during training. This ensures LoRA K/V parameters receive proper gradients while preserving causal attention behavior.
* lora: Add Instruction Finetuning support
  - Add masked loss computation on assistant responses only
  - Implement Vulkan masked cross-entropy loss shader & count_equal shader
  - Support default ChatML template & custom jinja chat templates
* Add SOFT_MAX_BACK metal kernel. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Extend swift example app with finetuning support. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix Q4 OUT_PROD iq upper handling. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add learning rate scheduler: constant (default), linear, and cosine. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add warmup-ratio parameter to match HF training. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* lora: Fix lr assertion on step 0
* lora: Fix training start from step 2
* Added
* Updating code to enable mid-epoch cancellation
* cpp lint applied
* Fix geglu_back implementation
  - Fix CPU implementation: now correctly computes gelu_backward(gate, grad) instead of splitting computation across two halves
  - Update Vulkan shader to match corrected implementation with proper gelu_backward
  - Add a test for the geglu_back op
  The previous implementation incorrectly assumed geglu_back operated on concatenated tensors and split them. The correct implementation computes the GELU backward pass element-wise on the gate values.
* Gemma Chat Template Support for LoRA Finetuning
  - Add auto-detection for Gemma format (<start_of_turn>model\n...<end_of_turn>)
  - Falls back to ChatML format for other models
  - Uses the model's default chat template, i.e. no need for a jinja chat template
  This enables instruction finetuning on any model.
* Fixed ibatch mismatch in llama_opt_epoch resume
* CPP lint ran
* lora: Update readme; add architecture overview
* Add guide about how to support a new model. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Simplify main README to focus on LoRA finetuning. (tetherto#71) Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Vulkan: add support for fp32 OUT_PROD op
* Vulkan: add support for f16_f32 OUT_PROD op
* Vulkan: Add Q4_0/Q8_0 OUT_PROD Vulkan support
* vulkan: Add initial cross entropy loss backward shader. Signed-off-by: vineet <vineet.suryan@collabora.com>
* vulkan: Fix cross-entropy-loss-back dispatch size and wg denominator. Signed-off-by: vineet <vineet.suryan@collabora.com>
* vulkan: Change uint32 cast to int32 for outprod; allows android compilation. Signed-off-by: vineet <vineet.suryan@collabora.com>
* vulkan: Set specialization constants to { 0 } for out_prod
  This fixes the vkDeviceLostError on Mali.
* vulkan: Set out_prod pipeline disable_robustness to true
* Fix out_prod; vulkan ci issues
* Add GEGLU backward (Vulkan) to enable Gemma training.
* Vulkan: Clean up OUT_PROD shader and pipelines
  Shouldn't change any behavior since currently nb00 is always 1. Robustness is usually disabled for Q8/Q4 shaders since having it enabled impacts performance more significantly for those types than for F16/F32.
* Vulkan: Improve Q8 OUT_PROD performance
  Increase OUT_PROD Q8 performance through improved memory locality.
* metal: port OUT_PROD, SILU_BACK, SOFT_MAX_BACK, RMS_NORM_BACK ops to split architecture
* Backport shader. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Initialize sin_sign in rope kargs to fix broken positional encoding. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix Windows build by using path::string() for wchar_t conversion. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix format specifiers for int64_t portability. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add missing resume_from_batch arg to llama_opt_epoch call. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix TQ2_0 dequantization. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix slow ReBAR reads on discrete GPUs and relax contiguity checks for backward pass. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Use VMA random-access host alloc; skip n_ctx padding and host-buft override during training. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Fix loss calculation and TQ2_0 dequantization. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* ggml-vulkan: workaround for Adreno MUL_MAT Q6_K
* ggml-vulkan: workaround for Adreno MUL_MAT TQ1
* vulkan: revert graph_optimize skip for prompt processing
* vulkan: ensure host coherent memory on UMA devices. Signed-off-by: vineet <vineet.suryan@collabora.com>
* ggml-vulkan: fix GGML_VULKAN_CHECK_RESULTS
* ggml-vulkan: skip CROSS_ENTROPY_LOSS_MASKED for check_results
* ggml-vulkan: skip COUNT_EQUAL_MASKED for check_results
* ggml-vulkan: improve OUT_PROD Q4 performance
* Fix LLAMA_LORA_TARGET_ALL bitmask. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* Preserve C API compatibility for llama_opt_epoch
  Add a llama_opt_epoch_resume function for the resume-from-batch use case and update callers accordingly. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* lora: enhance LoRA init safety and simplify caller
  - Add overflow and error checks for snprintf when generating LoRA tensor names
  - Encapsulate tensor pointer validation within llama_lora_init_tensor_weights() and return bool to simplify the caller
  Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: add llama_opt_default_params and use it in examples. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: add reproducible seed, improve safety and style in LoRA training. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: refactor LoRA tensor init to use exceptions. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* ggml-opt: refactor batch memory copying to use lambda. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* fix: typo in ggml.c & README. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: document masking constraints and fix metadata extension. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* tests: add ops tests for cross_entropy_loss_masked. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* Add bounds check for --chat-template argument parsing & remove stray backslash. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: add TODO for refactoring CLI argument parsing. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: add a comment about dropout not being used yet. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* training: add static_assert to catch llama_layer padding issues. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* graph: restore ggml_view_4d for non-contiguous Q tensor support. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* ggml-vulkan: Add buffer sync to cross_entropy_loss_masked_back op
* ggml-vulkan: add support for tiling as a workaround for memory issues
* The Metal ADD shader already uses strides for indexing, so non-contiguous tensors work correctly. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Wrap tensor and make it contiguous. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Add comment on graph_max_nodes bump for LoRA finetuning. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* fix: resume_from_batch=0 incorrectly treated as no-resume in opt_epoch
  llama_opt_epoch_resume accepts a resume_from_batch parameter where -1 means "no resume, start from the beginning." However, opt_epoch used `resume_from_batch > 0` to distinguish resume from non-resume, so resume_from_batch=0 (a valid value meaning "batch 0 was the last completed, start from batch 1") was silently treated as no-resume, causing the entire epoch to replay from the start. This affects any caller that pauses training after the first batch of an epoch (globalStep=1, or any globalStep that is a multiple of stepsPerEpoch + 1), since the computed resume batch offset modulo stepsPerEpoch is 0. Fix: change `> 0` to `>= 0` in both the idata start position and the idata_in_loop calculation, so that -1 remains the only sentinel for "no resume." Made-with: Cursor
* Fix memory leak in optimizer state loading. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Disable command-buffer concurrency by default on iOS. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* Override and default to n_cb=2 on iOS. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* fix: restore context state for inference after training cleanup
  Save and restore n_ctx_train in opt_init/opt_cleanup to prevent training from permanently modifying the model's context length. Reset the scheduler and clear the previous graph result in opt_cleanup so the context can be reused for inference after finetuning. Made-with: Cursor
* Add @autoreleasepool to encode_async block to prevent ObjC object accumulation on GCD worker threads. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* fix: keep output tensor on CPU for iOS to avoid Metal buffer limits
  On iOS, cap GPU-offloaded layers at n_layer (excluding the output layer) to prevent exceeding Metal memory constraints on mobile devices. Made-with: Cursor
* Remove unused variable 'tensor_name'. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
* training: fix LLAMA_LORA_TARGET_ALL for ISO C compliance. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* ci: disable native CPU optimizations for x64-cpu-low-perf builds. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* ci: increase timeout for ubuntu-24-cmake-vulkan tests. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* Add resume-from-checkpoint support to Metal LoRA fine-tuning. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* Fix missing parameters for llama_swift_finetune_options. Signed-off-by: Italo Nicola <italo.nicola@collabora.com>
* tests: disable TQ2_0 tests in test-backend-ops due to llvmpipe bug
  Temporarily disable TQ2_0 quantization tests to work around a bug in llvmpipe. Tests pass successfully on all real Vulkan hardware (Nvidia, ARM GPUs) but fail on llvmpipe with high error values. Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
* Enable the tests again. Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
---------
Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
Signed-off-by: vineet <vineet.suryan@collabora.com>
Signed-off-by: makaveli10 <vineet.suryan@collabora.com>
Signed-off-by: Italo Nicola <italo.nicola@collabora.com>
Co-authored-by: gianni <gianfranco.cordella@tether.io>
Co-authored-by: vineet <vineet.suryan@collabora.com>
Co-authored-by: Italo Nicola <italo.nicola@collabora.com>
Co-authored-by: Nidhin <nidhinpd811@gmail.com>
Co-authored-by: Alexandros Frantzis <alexandros.frantzis@collabora.com>
Co-authored-by: gianni-cor <gianfrancocordella@gmail.com>
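The resume_from_batch fix above is a classic off-by-one-sentinel bug: `0` is a valid resume point, so only `-1` may act as "no resume". A minimal sketch of the before/after check; the helper names are illustrative, and in the real opt_epoch the condition guards the idata start position and the idata_in_loop calculation:

```cpp
#include <cassert>

// resume_from_batch semantics: -1 = no resume; N >= 0 = batch N was the
// last completed batch, so training continues from batch N + 1.

int first_batch_buggy(int resume_from_batch) {
    // `> 0` silently drops the resume_from_batch == 0 case and replays
    // the whole epoch from batch 0.
    return resume_from_batch > 0 ? resume_from_batch + 1 : 0;
}

int first_batch_fixed(int resume_from_batch) {
    // `>= 0` keeps -1 as the only "no resume" sentinel.
    return resume_from_batch >= 0 ? resume_from_batch + 1 : 0;
}
```

With the fix, pausing after the very first batch of an epoch (resume_from_batch = 0) correctly continues from batch 1 instead of replaying the epoch.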
…ild, and Vulkan T4 segfault tolerance (driver bug workaround) (tetherto#113)
* fix: download LFS blobs via curl in tokenizer test
  Git clone does not fetch LFS blobs on CI runners without git-lfs configured. Fall back to curling each .gguf file from the HuggingFace resolve endpoint when the local copy is not a valid GGUF.
* ci: fix ccache key to include CPU feature hash
  The ubuntu-cpu-cmake job builds with GGML_NATIVE=ON (-march=native), but the ccache key was CPU-agnostic. GitHub runners with different CPUs (e.g. Intel with AVX-512 vs AMD without) shared the same cache, so ccache served objects compiled for the wrong architecture, causing SIGILL at runtime. Hash GCC's -march=native preprocessor defines into the key so each CPU architecture gets its own cache.
* ci: tolerate Vulkan T4 driver segfault-on-exit in tests
  NVIDIA Tesla T4 with driver 570.x has a known bug where any test that initializes the Vulkan backend can non-deterministically segfault during exit/cleanup inside libnvidia-gpucomp.so. The crash is in the driver's atexit handlers racing with its own [vkps] Update thread: all test cases pass, but the process may crash on teardown. This is the same issue that caused upstream llama.cpp to disable their T4 Vulkan CI node entirely (ggml-org#10528, ggml-org#10989). Workaround: detect the T4 via nvidia-smi, and if the only test failures are SegFault exceptions, treat the run as a success.
* ci: address PR tetherto#111 review feedback
  - Use sed + numbered-line matching with explicit failure-type patterns, instead of grep -A 100, for detecting non-SegFault CTest failures; this catches all failure types (SIGILL, OTHER_FAULT, etc.) instead of only Exception|Failed|Timeout.
  - Fail early with a clear error message when curl fails to download a .gguf LFS file, instead of silently leaving the pointer file in place.
Signed-off-by: Marcus Edel <marcus.edel@collabora.com>
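Deciding "is the local copy a valid GGUF" (the trigger for the curl fallback above) only needs the file's magic: a git-LFS pointer is a small ASCII text file starting with `version https://git-lfs...`, while a real GGUF model begins with the 4-byte magic `GGUF`. A sketch of that check, assuming only the documented magic bytes; the function name is illustrative:

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <string>

// Returns true only when the file starts with the GGUF magic bytes.
// A git-LFS pointer file (plain text) or a truncated/missing file fails
// the check, signalling that the blob should be fetched via curl instead.
bool is_valid_gguf(const std::string & path) {
    std::ifstream f(path, std::ios::binary);
    char magic[4] = {0};
    f.read(magic, 4);
    return f.gcount() == 4 && std::memcmp(magic, "GGUF", 4) == 0;
}
```

Checking four bytes is cheap and avoids parsing anything else from the header, which is all the CI fallback needs.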
Align call sites with API changes.