310 commits
b734044
ggml-hexagon: add PAD op HVX kernel (#23078)
pdhinaka May 18, 2026
9a532ae
hexagon: add support for TRI op (#22822)
pdhinaka May 18, 2026
c3e9ade
rpc : keep last_graph_uid in the device context (#23273)
rgerganov May 19, 2026
439f1b1
sycl: add GGML_SYCL_USE_ASYNC_MEM_OP env toggle (#22153)
aicss-genai May 19, 2026
f1c1c5c
convert : filter lora tensor names (#23077)
CISC May 19, 2026
aabee04
[SCYL] add chapter for performance reference in SYCL.md (#23315)
arthw May 19, 2026
c85a242
ggml-webgpu : extend GDN for K>1 (#23299)
reeselevine May 19, 2026
d2e179a
llama-eval : add per-task summary stats (#23151)
ggerganov May 19, 2026
cd963fe
save-load-state : refactor tests and improve readability (#23196)
ggerganov May 19, 2026
3c81c8d
server : print graphs reused in slot timings (#23279)
ggerganov May 19, 2026
ccee426
server-context: guarantee there is at least 1 token to decode (#23280)
ServeurpersoCom May 19, 2026
00c461c
ci : install server kleidiai runner dependencies (#23259)
CISC May 19, 2026
4b262ab
ci : install libssl-dev (#23325)
CISC May 19, 2026
6db1304
ui: Bump packages + address build warnings (#23300)
allozaur May 19, 2026
d14ce3d
llama : MTP clean-up (#23269)
ggerganov May 19, 2026
baf3cc6
model : clarify MTP layer comment in qwen35.cpp [no ci] (#23338)
danbev May 19, 2026
ac76808
hexagon: enable support for NORM op (#23319)
aparmp-quic May 19, 2026
b7393a4
convert : update mtp related help (#23334)
CISC May 19, 2026
7256fce
common: fix --fit verbosity with --verbosity 4 (#23282)
JohannesGaessler May 19, 2026
57cb35c
common: fix --help for --verbosity (#23278)
JohannesGaessler May 19, 2026
a807867
github: mention --log-file in issue templates (#23277)
JohannesGaessler May 19, 2026
67ace02
refactor: Chat Screen UI rendering (#23333)
allozaur May 19, 2026
17d22a3
hexagon: add MROPE and IMROPE support in HTP rope op (#23317)
aparmp-quic May 19, 2026
b28a2f3
opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (#23303)
shaofeiqi May 19, 2026
b39a7bf
ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (#23349)
ravel7524 May 20, 2026
871b0b7
snapdragon: update toolchain to v0.6 (#23369)
max-krasnyansky May 20, 2026
57ebaf4
metal : optimize pad + cpy (#23354)
ggerganov May 20, 2026
585080d
fix: Div wrapper no pointer events on hidden (#23390)
allozaur May 20, 2026
5028447
ui: Refactor `isMobile` as reactive value in `viewport` store (#23330)
allozaur May 20, 2026
7e50ef7
docker : copy conversion files (#23370)
CISC May 20, 2026
e2b129e
mtmd: fit_params now take into account mmproj (#21489)
ngxson May 20, 2026
e6b4acf
refactor: Move text attachments up before the message content in chat…
allozaur May 20, 2026
29f1482
app : introduce the llama unified executable (#23296)
angt May 20, 2026
e947228
Programmatic Dependent Launch (PDL) for more performance on newer NVI…
aendk May 20, 2026
c9872a2
hexagon: HMX quantized matmul rework (#23368)
max-krasnyansky May 20, 2026
6ce9671
feat: Add WAV MIME type variants and improve audio format detection (…
allozaur May 20, 2026
acd604f
vulkan: optimize operations in the IM2COL shader (#22685)
daniandtheweb May 20, 2026
a8681a0
mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding …
sfallah May 20, 2026
510b5c2
common/speculative : fix nullptr crash in get_devices_str (#23386)
ggerganov May 20, 2026
3a6db74
opencl: refactor backend initilization (#23318)
lhez May 20, 2026
ad27757
Move to backend sampling for MTP draft path (#23287)
gaugarg-nv May 20, 2026
3a479c9
ui: Add max image size option (#22849)
stduhpf May 20, 2026
6a257d4
mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision prec…
wendadawen May 20, 2026
ce02093
app : show version (#23426)
angt May 21, 2026
0be8468
hexagon: ssm-conv fix for large prompts (#23307)
tboinovski1 May 21, 2026
eeeaf61
llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa fo…
ssfdre38 May 21, 2026
2754ce1
ggml : Check the right iface method before using the fallback 2d get …
TheBlueMatt May 21, 2026
5e932a1
ui: Improve Git Hooks for UI development (#23403)
allozaur May 21, 2026
2fc8d18
doc: fix spec mtp typo (#23435)
ruixiang63 May 21, 2026
7ea23dd
vocab : add Carbon-3B (HybridDNATokenizer) support (#23410)
kashif May 21, 2026
12e5d99
mtp: use inp_out_ids for skipping logit computation (#23433)
am17an May 21, 2026
1d7ab2b
app : add batched-bench, fit-params, quantize & perplexity (#23459)
angt May 21, 2026
c902171
server: re-inject subcommand when router spawns children under unifie…
ServeurpersoCom May 21, 2026
52fb93a
server : free draft/MTP resources on sleep to fix VRAM leak (#23461)
am17an May 21, 2026
a1a69f7
metal : optimize concat kernel and fix set kernel threads (#23411)
ggerganov May 21, 2026
b65bb4b
server: expose prompt token counts in /slots endpoint (#23454)
ScrewTSW May 21, 2026
40d5358
tests : move save-load-state from examples to tests (#23336)
ggerganov May 21, 2026
5306f4b
fix(flash-attn): replace f32 with kv_type and q_type (#23372)
Constannnnnt May 21, 2026
47c0eda
vulkan: fuse snake activation (mul, sin, sqr, mul, add) (#22855)
ServeurpersoCom May 21, 2026
ee7c305
Update WebGPU support and add link to blog/demo (#23483)
reeselevine May 21, 2026
bb28c1f
cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by …
ggerganov May 21, 2026
4f0e43d
CUDA: fix PDL CC check for JIT compilation (#23471)
JohannesGaessler May 21, 2026
bbce619
cmake : add install() for impl libraries + fix apple builds (#23511)
ggerganov May 22, 2026
afcda09
vocab : fix HybridDNA tokenizer (#23466)
kashif May 22, 2026
9c92e96
cmake : build router app only during standalone builds (#23521)
fairydreaming May 22, 2026
99d4026
ggml-zendnn : add Q8_0 quantization support (#23414)
z-sachin May 22, 2026
95feeab
docs: Update documentation with Granite 4.0/4.1 (#23404)
jesus-talavera-ibm May 22, 2026
8cc67ef
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (#21…
PMZFX May 22, 2026
56f16f2
SYCL : gated_delta_net K>1 (#23174)
karavayev May 22, 2026
bcfd198
sycl : Level Zero detection in ggml_sycl_init (#23097)
sanmai May 22, 2026
cc9e331
SYCL: improve MoE prefill throughput (#23142)
sanmai May 22, 2026
ef570f6
perplexity : fix integer overflow (#23496)
fairydreaming May 22, 2026
1acee6b
server: only parse empty msg if continuing an assistant msg (#23506)
aldehir May 22, 2026
0f3cb3f
opencl: generalize Adreno MoE kernels on M (#23449)
shawngu-quic May 23, 2026
95405ac
vulkan: fix windows find_package of SPIRV-Headers (#23215)
jeffbolznv May 23, 2026
a497476
ggml : Check the right iface method before using the fallback 2d get …
dskwe May 23, 2026
b0df4c0
model : add NVFP4 MTP scale tensors (#23563)
michaelw9999 May 23, 2026
c0c7e14
requirements : bump torch to 2.11.0 (#23503)
adityasingh2400 May 23, 2026
b22ff4b
cmake/ui : refactor the build (#23352)
aldehir May 23, 2026
cec51c7
snapdragon: update windows toolchain to use hsdk v6.6.0.0 (#23552)
aparmp-quic May 24, 2026
1c0f6db
hexagon: apply repl optimization in flash attn softmax as #22993 (#23…
njsyw1997 May 24, 2026
f306111
opencl: batch profiling to improve speed and prevent memory leaks (#2…
shaofeiqi May 24, 2026
fff63b5
TP: fix entirely zero-sized slices per device (#23525)
JohannesGaessler May 24, 2026
83eebe9
server: add margin for draft model for `fit` (#23485)
am17an May 24, 2026
63248fc
cmake : fix ui build (#23592)
aldehir May 24, 2026
5d246a7
convert : minor fixes for numpy 2.x (#23571)
CISC May 24, 2026
549b9d8
ci : update build-self-hosted.yml (#23616)
ggerganov May 24, 2026
28123a3
ci : move most slim jobs to self-hosted runners (#23619)
ggerganov May 25, 2026
6d57c26
perplexity : fix even more integer overflows (#23623)
fairydreaming May 25, 2026
e2ef8fe
server: fix checkpoints creation (#22929)
jacekpoplawski May 25, 2026
9627d0f
vendor : update cpp-httplib to 0.45.1 (#23639)
cabelo May 25, 2026
b964876
ui: media attachments before text (#23467)
sfallah May 25, 2026
826539c
ggml : Parallelize quant LUT init (#23595)
jeffbolznv May 25, 2026
d55fb97
ci : install host compiler on android-ndk build (#23630)
aldehir May 25, 2026
314e729
llama : document that only one on-device state can be saved per seque…
TimNN May 25, 2026
062d311
ci : fix pre-tokenizer-hashes check (#23651)
CISC May 25, 2026
5fdf07e
ci : update spacemit toolchain url and enhance curl command (#23642)
alex-spacemit May 25, 2026
6c4cbdc
server: MTP layer kv-cache should respect draft type ctk (#23646)
am17an May 25, 2026
66efd13
ggml: `gguf_init_from_callback` and `gguf_init_from_buffer` (#22341)
giladgd May 25, 2026
ae251b5
TP: fix ggml context size calculation (#22616)
JohannesGaessler May 25, 2026
fa97041
ggml-alloc: fix out-of-bounds read in ggml_dyn_tallocr_remove_block (…
Dev-X25874 May 21, 2026
b251f74
ggml.h: correct ggml_silu_back arg docstring (a=dy, b=x) (ggml/1500)
OriPekelman May 21, 2026
ce5890b
ggml : bump version to 0.12.1 (ggml/1508)
ggerganov May 25, 2026
22307b3
sync : ggml
ggerganov May 25, 2026
45158f4
ggml : bump version to 0.13.0 (ggml/1510)
ggerganov May 25, 2026
d161ea7
sync : ggml
ggerganov May 25, 2026
a4d2d4a
convert : add compressed-tensors NVFP4 support (#21095)
michaelw9999 May 25, 2026
5a4126a
ui: fix stop/continue during an agentic loop (#23356)
ServeurpersoCom May 25, 2026
c1f1e28
CUDA: add fast walsh-hadamard transform (#23615)
am17an May 25, 2026
328874d
model: tag ffn_latent as MUL_MAT to fix buft probe (#23664)
ServeurpersoCom May 25, 2026
302e2c2
ci : reduce PR jobs by matching backend paths (#23675)
ggerganov May 25, 2026
4bead4e
snapdragon: bump toolchain docker to v0.7 to fix ui build issues (#23…
max-krasnyansky May 25, 2026
35c9b1f
metal : add apple device id (#23566)
forforever73 May 25, 2026
192d8ae
CUDA: missing PDL sync for FWHT, better fallback (#23690)
JohannesGaessler May 26, 2026
54121f7
[WebGPU] Check batch_compute_passes before sending passes when not do…
nikhilJain17 May 26, 2026
1506d39
ggml-webgpu: Add MMVQ path for Q4/Q8/Q2_K/Q4_K and clean up legacy MU…
yomaytk May 26, 2026
c9d9829
model : add support for talkie-1930-13b (#22596)
niklassheth May 26, 2026
7623de1
tests: test-backend-ops -j <N> to run tests in parallel (#23637)
jeffbolznv May 26, 2026
581d020
SYCL: implement ggml_sycl_pool_vmm (#22862)
sanmai May 26, 2026
6fe90de
models : Attach Mistral3 NVFP4 weight scales (#23629)
michaelw9999 May 26, 2026
dbe9c0c
convert : support Gemma4ForCausalLM architecture (#23682)
aoleg May 26, 2026
3dc7684
ci : reduce (disable SYCL and CANN builds/releases) (#23705)
ggerganov May 26, 2026
ef41a69
ci : move sanitizer jobs to self-hosted runners (#23713)
ggerganov May 26, 2026
678d43d
ci : move more CPU jobs to self-hosted runners (#23715)
ggerganov May 26, 2026
ef66bfa
hexagon: add support for CONCAT op (#23648)
max-krasnyansky May 26, 2026
3a3ed15
ci : remove vulkan SDK dep from webgpu job (#23718)
ggerganov May 26, 2026
7799d31
vulkan: optimize conv2d and implement coopmat1 support (#22620)
jeffbolznv May 26, 2026
5190c2e
ci : move macos jobs to the apple workflow + fix names (#23721)
ggerganov May 26, 2026
35a74c8
ci : add `[no release]` keyword + fix sanitizer builds (#23728)
ggerganov May 26, 2026
08bc21b
ci : move [no release] check to dedicated check_release job (#23734)
ggerganov May 26, 2026
0d18aaa
ci : do not allocate ccache for 3rd-party hosted runners (#23730)
ggerganov May 26, 2026
b4c0549
ggml-zendnn : fixed naming of matmul function (#20964)
truecoder34 May 26, 2026
7085492
server : fix the log message when using SSL (#23393)
rgerganov May 27, 2026
9777256
convert: add MiniCPM5 tokenizer support (#23384)
zhangtao2-1 May 27, 2026
1d971bb
docs : fix duplicated "the" in granitevision and model-conversion doc…
quyentonndbs May 27, 2026
0d227ec
ci : add ccache to server builds + fix undefined sanitizer build (#23…
ggerganov May 27, 2026
4d8cc0c
vulkan: avoid preferring transfer queue on AMD UMA devices (#22455)
winstonma May 27, 2026
b3a739c
ci : remove wasm test (#23733)
CISC May 27, 2026
9f0e4b1
ci : fix windows ccaches (#23777)
ggerganov May 27, 2026
6b4e4bd
common : fix env names to all have LLAMA_ARG_ prefix (#23778)
ggerganov May 27, 2026
2d0656f
ci : bump cuda release to 13.3 (#23749)
CISC May 27, 2026
fda8528
CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (#23742)
ORippler May 27, 2026
87b0a60
pyproject : add conversion folder and update dependencies (#23746)
CISC May 27, 2026
617255d
vendor : update cpp-httplib to 0.46.0 (#23650)
cabelo May 27, 2026
ba4dd0b
ci : move ARM jobs to self-hosted + disable kleidiai mac release (#23…
ggerganov May 27, 2026
837bb6b
vulkan: add REPEAT op support for f16 to f16. (#23298)
l8bloom May 27, 2026
b36eefc
vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul …
jeffbolznv May 27, 2026
c6e4088
vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 (#22887)
TheBlueMatt May 27, 2026
c40006a
ggml-webgpu: Fix how to dispatch WG to some ops (#23750)
yomaytk May 27, 2026
aa50b2c
hexagon: add support for Q4_1 in MUL_MAT and MUL_MAT_ID (#23647)
max-krasnyansky May 27, 2026
f12cc6d
ggml-webgpu: remove legacy constants (#23672)
reeselevine May 27, 2026
8ad8aef
opencl: OP_GATED_DELTA_NET (#23312)
ymcki May 28, 2026
939a7dd
Hexagon: OP_GATED_DELTA_NET K>1 support (#23531)
ymcki May 28, 2026
491c4d7
ci : refactor (#23789)
ggerganov May 28, 2026
e31cdaa
ggml: fixed Arm SVE usage bug in vec.h, vec.cpp (#22841)
martin-klacer-arm May 28, 2026
c522908
convert : add FP8 to Q8 conversion (#23250)
ynankani May 28, 2026
48e7eae
perplexity : fix format specifier in LOG_ERR (#23788)
angt May 28, 2026
09e7b76
cuda : fix KQ mask offset integer overflow in fattn MMA kernel (#23610)
fairydreaming May 28, 2026
e8d2567
docker : add ZenDNN Dockerfile (#23716)
z-sachin May 28, 2026
d205df6
server, ui : Add support for HTTP ETags in llama-server (#23701)
mtavenrath May 28, 2026
91eb8f4
vulkan: Fix memory logger unsafe iterator access (#23667)
winstonma May 28, 2026
7c48fb8
vulkan: fix wrong index variable in inner loop (#23665)
winstonma May 28, 2026
bb771cb
chat : add Granite 4.1 chat template (#23518)
jesus-talavera-ibm May 28, 2026
48e7078
vulkan: fast path for walsh-hadamard transform (#23687)
jeffbolznv May 28, 2026
a919001
hexagon: minor refresh for HMX FA and MM (#23796)
max-krasnyansky May 28, 2026
0b24686
server: minor tweaks to use more cpp features (#23785)
mfuntowicz May 28, 2026
bc81d47
CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (#2…
jadenmach2 May 28, 2026
d7be461
mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for …
yaohengxu May 28, 2026
30af6e2
ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007)
fl0rianr May 28, 2026
d374e71
test-llama-archs: fix table format [no release] (#23810)
JohannesGaessler May 28, 2026
7fb1e70
arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-fi…
kucharskim May 28, 2026
dd15579
ci : change Vulkan builds to Release to reduce ccache (#23820)
ggerganov May 28, 2026
d6be315
mtmd: fix gemma 4 audio rms norm eps (#23815)
ngxson May 28, 2026
0b56d28
mtmd: n_head_kv defaults to n_head (#23782)
sfallah May 28, 2026
479a9a1
app : improve help output (#23805)
angt May 28, 2026
445b7ce
ci : releases use Github-hosted builds for the UI (#23823)
ggerganov May 28, 2026
2f6c815
ui: fix audio and video modality detection (#23756)
ValdikSS May 28, 2026
3ef2369
ci : run ui publish on ubuntu-slim (#23818)
CISC May 28, 2026
408ae2b
opencl: move backend info printing into its own function (#23702)
lhez May 28, 2026
c8914ad
mtmd: fix gemma 4 projector pre_norm (#23822)
ngxson May 28, 2026
751ebd1
mtmd-debug: add color and rainbow mode (#23829)
ngxson May 28, 2026
19e92c3
hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (#23…
max-krasnyansky May 28, 2026
33c718d
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear …
TheBlueMatt May 29, 2026
241cbd4
cuda : disables launch_fattn PDL enrollment due to compiler bug (#23825)
aendk May 29, 2026
98e480a
app : move licences to llama-app (#23824)
angt May 29, 2026
eef59a7
llama: add llm_graph_input_mtp (#23643)
am17an May 29, 2026
b000431
ngram-mod : Add missing include (#23857)
oazizi000 May 29, 2026
ea02bc3
ggml : bump version to 0.13.1 (ggml/1523)
ggerganov May 29, 2026
fe12e42
sync : ggml
ggerganov May 29, 2026
031ddb2
llama: use f16 mask for FA to save VRAM (#23764)
am17an May 29, 2026
1f0aa2a
model : support for DeepseekV32ForCausalLM with generic DeepSeek Spar…
fairydreaming May 29, 2026
cb47092
server: bump timeout to 3600s (#23842)
ngxson May 29, 2026
6ed481e
CUDA: Check PTX version on host side to guard PDL dispatch (#23530)
ORippler May 29, 2026
da3f990
mtmd: Add DeepSeekOCR 2 Support (#20975)
sfallah May 29, 2026
06d26df
download: add option to skip_download (#23059)
ngxson May 29, 2026
dc71236
ci : update macos release to use macos-26 runner (#23878)
ggerganov May 29, 2026
b5f5228
server: remove obsolete scripts (#23870)
ngxson May 29, 2026
764f1e6
graph : ensure DS32 kq_mask_lid is F32 (#23864)
CISC May 29, 2026
2084434
vocab : support tokenizer for LFM2.5-8B-A1B (#23826)
tdakhran May 29, 2026
22d66b5
ui: handle audio/vnd.wave as audio WAV file (#23754)
ValdikSS May 29, 2026
5a46b46
app: add llama update self updater (#23865)
ServeurpersoCom May 29, 2026
689a9a4
server-bench : add speed-bench for speculative decoding benchmarking …
ruixiang63 May 29, 2026
b22da25
ggml-webgpu: add q4_0/q8_0 SET_ROWS (#23760)
reeselevine May 29, 2026
151f3a9
ggml-webgpu: Check earlier for WebGPU required features (#23879)
reeselevine May 29, 2026
0821c5f
server: in SSE mode, send HTTP headers when slot starts (#23884)
ngxson May 29, 2026
1738129
llama : do not skip iGPU when only RPC devices are present (#23868)
rgerganov May 30, 2026
d4204b0
ci : clear cache instead of "no timestamp" keys + fix macos (#23895)
ggerganov May 30, 2026
3375285
ci : fix s390x release job (#23898)
ggerganov May 30, 2026
6e093b8
vulkan: add Flash Attention support for BFloat16 KV cache (#23420)
0cc4m May 30, 2026
d48a56e
ggml : add some lsx support (#23798)
MQ-mengqing May 30, 2026
4c4e91b
ci : update ios-xcode release job to macos-26 (#23906)
ggerganov May 30, 2026
e674b12
test: (test-llama-archs) log the config name first (#23885)
ngxson May 30, 2026
2d9b7c8
metal : restore im2col implementation for large kernels (#23901)
ggerganov May 30, 2026
8b0e0db
TP: fix granularity for Qwen 3.5/3.6 + 3 GPUs (#23843)
JohannesGaessler May 30, 2026
d38d50e
ui: exclude generated build dirs from prettier and eslint so lint err…
ServeurpersoCom May 30, 2026
d6588da
opencl: support bf16 by converting to f16 (#23839)
lhez May 30, 2026
aa46bda
Support `-fa auto` in llama-bench (#23714)
gaugarg-nv May 30, 2026
d749821
webui: add custom CSS injection via config (#23904)
ServeurpersoCom May 30, 2026
22cadc1
llama: only use one iGPU device by default (#23897)
0cc4m May 31, 2026
e6123e2
docs : update ZenDNN docs for Q8 support (#23791)
truecoder34 May 31, 2026
3292da0
ui: fix ETag truncation with MSVC compiler (#23917)
EZForever May 31, 2026
d4c8e2c
vocab : add tokenizer support for jina-embeddings-v2-base-zh (#18756)
o7si May 31, 2026
399739d
ci : limit trigger paths for the CPU workflow (#23938)
ggerganov May 31, 2026
6f165c1
server : handle If-None-Match weak ETags (#23916)
EZForever May 31, 2026
af6528e
ci: remove redundant or duplicate jobs (#23927)
netrunnereve Jun 1, 2026
44e211c
sycl : Optimize Q3_K mul_mat by reorder (#23725)
arthw Jun 1, 2026
4162522
[SYCL] Add more types in GET_ROWS OP (#23710)
arthw Jun 1, 2026
a511424
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (#23812)
arthw Jun 1, 2026
e22b0de
ci : add missing Linux label to cpu-x64-high-perf runner (#23958)
ggerganov Jun 1, 2026
5254a79
common : support manually triggering the reasoning budget end sequenc…
aldehir Jun 1, 2026
f8c0a19
vulkan: Removed unused functions (#23175)
winstonma Jun 1, 2026
1962000
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (#23…
TheBlueMatt Jun 1, 2026
48b88c3
model: Add EXAONE 4.5 implementations (#21733)
nuxlear Jun 1, 2026
02a5701
security : disable private disclosures (#23963)
ggerganov Jun 1, 2026
8e6fff8
TP: quantized KV cache support (#23792)
JohannesGaessler Jun 1, 2026
5aba536
vocab: add normalizer.lowercase support to WPM (#23899)
o7si Jun 1, 2026
bef69f1
vulkan: reduce host memory lock contention (#23376)
winstonma Jun 1, 2026
55ac090
vulkan: don't hold the device mutex while compiling pipelines (#23641)
jeffbolznv Jun 1, 2026
95b8b8e
metal: template GLU kernels to support f16/f32 (#23882)
shrivasshankar Jun 1, 2026
de6f727
llama: limit max outputs of `llama_context` (#23861)
am17an Jun 1, 2026
335abed
vendor : update cpp-httplib to 0.46.1 (#23980)
angt Jun 1, 2026
27d9ed8
opencl: add basic support for q5_0 and q5_1 (#23548)
shaofeiqi Jun 1, 2026
5aa3a64
nix : add nix-nodejs facilities to build Web UI (#23846)
choener Jun 1, 2026
5dcb711
speculative : fix n_outputs_max and remove draft-simple auto-enable (…
ggerganov Jun 1, 2026
b8275a8
revert to using global_invocation_id for cpy shader (#23955)
yomaytk Jun 1, 2026
210a657
opencl: fix compiler warnings for non-adreno path (#23922)
lhez Jun 2, 2026
1fd5f48
clean up unused variables warnings (#23975)
anavp-nvidia Jun 2, 2026
354ebac
server: real-time reasoning interruption via control endpoint (#23971)
ServeurpersoCom Jun 2, 2026
b7c91ed
Merge upstream llama.cpp updates into spacemit-mtmd
co-seven Jun 2, 2026
3c8321c
fix server converter warning errors
co-seven Jun 3, 2026
17 changes: 17 additions & 0 deletions .devops/cann.Dockerfile
@@ -5,6 +5,9 @@
# Define the CANN base image for easier version updates later
ARG CHIP_TYPE=910b
ARG CANN_BASE_IMAGE=quay.io/ascend/cann:8.5.0-${CHIP_TYPE}-openeuler24.03-py3.11
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

# ==============================================================================
# BUILD STAGE
@@ -55,6 +58,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full && \
cp build/bin/* /app/full/ && \
cp *.py /app/full/ && \
cp -r conversion /app/full/ && \
cp -r gguf-py /app/full/ && \
cp -r requirements /app/full/ && \
cp requirements.txt /app/full/
@@ -67,6 +71,19 @@ RUN mkdir -p /app/full && \
# ==============================================================================
FROM ${CANN_BASE_IMAGE} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

# -- Install runtime dependencies --
RUN yum install -y libgomp curl && \
yum clean all && \
17 changes: 17 additions & 0 deletions .devops/cpu.Dockerfile
@@ -1,4 +1,7 @@
ARG UBUNTU_VERSION=24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ubuntu:$UBUNTU_VERSION AS build

@@ -27,6 +30,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -35,6 +39,19 @@ RUN mkdir -p /app/full \
## Base image
FROM ubuntu:$UBUNTU_VERSION AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
&& apt-get install -y libgomp1 curl \
&& apt autoremove -y \
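The `ARG`/`LABEL` additions in these Dockerfiles only carry useful metadata when the build supplies the values; otherwise every label falls back to `N/A`. Docker scopes an `ARG` declared before the first `FROM` to the `FROM` lines themselves, which is why the diff declares the three arguments again inside the `base` stage. A minimal dry-run sketch of how a caller might pass them is shown below. The image tag and the way the version and revision are derived are assumptions for illustration, not taken from this PR's CI.

```shell
#!/bin/sh
set -eu

# Values for the three build arguments introduced by the diff.
# How a real release pipeline derives them is an assumption here.
BUILD_DATE=$(date -u +%Y-%m-%dT%H:%M:%SZ)   # RFC 3339 timestamp in UTC
APP_VERSION=${APP_VERSION:-N/A}             # same fallback as the Dockerfile default
APP_REVISION=${APP_REVISION:-$(git rev-parse HEAD 2>/dev/null || echo unknown)}
IMAGE_TAG=${IMAGE_TAG:-llama.cpp:local-cpu} # hypothetical tag

# Print the docker invocation instead of running it (dry run).
build_cmd() {
    printf 'docker build -f .devops/cpu.Dockerfile --target base'
    printf ' --build-arg BUILD_DATE=%s' "$BUILD_DATE"
    printf ' --build-arg APP_VERSION=%s' "$APP_VERSION"
    printf ' --build-arg APP_REVISION=%s' "$APP_REVISION"
    printf ' -t %s .\n' "$IMAGE_TAG"
}

build_cmd
```

After running the printed command, the labels can be read back with `docker inspect --format '{{ index .Config.Labels "org.opencontainers.image.revision" }}' llama.cpp:local-cpu`.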
18 changes: 18 additions & 0 deletions .devops/cuda.Dockerfile
@@ -6,6 +6,10 @@ ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VER

ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ${BASE_CUDA_DEV_CONTAINER} AS build

# CUDA architecture to build for (defaults to all supported archs)
@@ -32,6 +36,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -40,6 +45,19 @@ RUN mkdir -p /app/full \
## Base image
FROM ${BASE_CUDA_RUN_CONTAINER} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
&& apt-get install -y libgomp1 curl \
&& apt autoremove -y \
17 changes: 17 additions & 0 deletions .devops/intel.Dockerfile
@@ -1,4 +1,7 @@
ARG ONEAPI_VERSION=2025.3.3-0-devel-ubuntu24.04
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

## Build Image

@@ -33,13 +36,27 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh

FROM intel/deep-learning-essentials:$ONEAPI_VERSION AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

ARG IGC_VERSION=v2.20.5
ARG IGC_VERSION_FULL=2_2.20.5+19972
ARG COMPUTE_RUNTIME_VERSION=25.40.35563.10
17 changes: 17 additions & 0 deletions .devops/llama-cli-cann.Dockerfile
@@ -1,4 +1,7 @@
ARG ASCEND_VERSION=8.5.0-910b-openeuler22.03-py3.10
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ascendai/cann:$ASCEND_VERSION AS build

@@ -28,6 +31,20 @@ RUN echo "Building with static libs" && \

# TODO: use image with NNRT
FROM ascendai/cann:$ASCEND_VERSION AS runtime

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

COPY --from=build /app/build/bin/llama-cli /app/build/bin/llama-completion /

ENV LC_ALL=C.utf8
18 changes: 18 additions & 0 deletions .devops/musa.Dockerfile
@@ -6,6 +6,10 @@ ARG BASE_MUSA_DEV_CONTAINER=mthreads/musa:${MUSA_VERSION}-devel-ubuntu${UBUNTU_V

ARG BASE_MUSA_RUN_CONTAINER=mthreads/musa:${MUSA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}-amd64

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

FROM ${BASE_MUSA_DEV_CONTAINER} AS build

# MUSA architecture to build for (defaults to all supported archs)
@@ -37,6 +41,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -45,6 +50,19 @@ RUN mkdir -p /app/full \
## Base image
FROM ${BASE_MUSA_RUN_CONTAINER} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
&& apt-get install -y libgomp1 curl \
&& apt autoremove -y \
29 changes: 28 additions & 1 deletion .devops/nix/package.nix
@@ -3,6 +3,7 @@
glibc,
config,
stdenv,
stdenvNoCC,
runCommand,
cmake,
ninja,
@@ -19,6 +20,8 @@
openssl,
shaderc,
spirv-headers,
nodejs,
importNpmLock,
useBlas ?
builtins.all (x: !x) [
useCuda
@@ -130,7 +133,31 @@ effectiveStdenv.mkDerivation (finalAttrs: {
src = lib.cleanSource ../../.;
};

postPatch = ''
# Builds the webui locally, taking care not to require updating any sha256 hash.
webui = stdenvNoCC.mkDerivation {
pname = "webui";
version = llamaVersion;
src = lib.cleanSource ../../tools/ui;

nativeBuildInputs = [
nodejs
importNpmLock.linkNodeModulesHook
];

# no sha256 required when using buildNodeModules
npmDeps = importNpmLock.buildNodeModules {
npmRoot = ../../tools/ui;
inherit nodejs;
};

installPhase = ''
LLAMA_UI_OUT_DIR=$out npm run build --offline
'';
};

postPatch = lib.optionalString useWebUi ''
cp -r ${finalAttrs.webui} tools/ui/dist
chmod -R u+w tools/ui/dist
'';

# With PR#6015 https://github.com/ggml-org/llama.cpp/pull/6015,
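The new `postPatch` hook copies the prebuilt web UI into `tools/ui/dist` and then runs `chmod -R u+w` on the copy. A plausible reading is that the Nix store output is read-only and `cp -r` carries those permission bits over, so later build steps could not write into the directory without the `chmod`. The sketch below mimics the two steps in a temporary directory. The `store/webui` layout and file names are hypothetical stand-ins, not the real derivation output.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
# Restore write bits before deleting, otherwise rm cannot clean the read-only tree.
trap 'chmod -R u+w "$work"; rm -rf "$work"' EXIT

# Stand-in for the read-only store path of the webui derivation (hypothetical layout).
mkdir -p "$work/store/webui"
echo '<!doctype html>' > "$work/store/webui/index.html"
chmod -R a-w "$work/store/webui"

# The same two steps as the postPatch hook: copy, then make the copy writable.
mkdir -p "$work/src/tools/ui"
cp -r "$work/store/webui" "$work/src/tools/ui/dist"
chmod -R u+w "$work/src/tools/ui/dist"

# The copied tree now accepts new files from later build steps.
echo 'generated' > "$work/src/tools/ui/dist/extra.txt"
ls "$work/src/tools/ui/dist"
```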
17 changes: 17 additions & 0 deletions .devops/openvino.Dockerfile
@@ -18,6 +18,10 @@ ARG LIBZE1_VERSION=1.27.0-1~24.04~ppa2
ARG http_proxy=
ARG https_proxy=

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

## Build Image
FROM ubuntu:${UBUNTU_VERSION} AS build

@@ -77,6 +81,7 @@ RUN mkdir -p /app/lib && \
RUN mkdir -p /app/full \
&& cp build/ReleaseOV/bin/* /app/full/ \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -88,6 +93,18 @@ FROM ubuntu:${UBUNTU_VERSION} AS base
# Pass proxy args to runtime stage
ARG http_proxy
ARG https_proxy
ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
&& apt-get install -y libgomp1 libtbb12 curl wget ocl-icd-libopencl1 \
18 changes: 18 additions & 0 deletions .devops/rocm.Dockerfile
@@ -7,6 +7,10 @@ ARG AMDGPU_VERSION=7.2.1
# Target the ROCm build image
ARG BASE_ROCM_DEV_CONTAINER=rocm/dev-ubuntu-${UBUNTU_VERSION}:${ROCM_VERSION}-complete

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A

### Build image
FROM ${BASE_ROCM_DEV_CONTAINER} AS build

@@ -49,6 +53,7 @@ RUN mkdir -p /app/lib \
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r conversion /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
@@ -57,6 +62,19 @@ RUN mkdir -p /app/full \
## Base image
FROM ${BASE_ROCM_DEV_CONTAINER} AS base

ARG BUILD_DATE=N/A
ARG APP_VERSION=N/A
ARG APP_REVISION=N/A
ARG IMAGE_URL=https://github.com/ggml-org/llama.cpp
ARG IMAGE_SOURCE=https://github.com/ggml-org/llama.cpp
LABEL org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.version=$APP_VERSION \
org.opencontainers.image.revision=$APP_REVISION \
org.opencontainers.image.title="llama.cpp" \
org.opencontainers.image.description="LLM inference in C/C++" \
org.opencontainers.image.url=$IMAGE_URL \
org.opencontainers.image.source=$IMAGE_SOURCE

RUN apt-get update \
&& apt-get install -y libgomp1 curl \
&& apt autoremove -y \