Sync fork to upstream master (+604 commits) — Gemma4 verified #29
Open
lalalune wants to merge 623 commits into
* vulkan: add fwht support for Intel with shmem reduction
* don't use N as workgroup size
* disable subgroup shuffle on MoltenVK AMD
* disable fwht shader on Intel Windows due to driver bug
* opencl: allow multiple workgroups for large rows
* opencl: improve small cpy
* opencl: packed concat for small input
* opencl: tweak flat q6_K gemv, increase N_DST and remap threads
…put_tokens API (#23913)
* mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing
* fast path skip preproc for placeholder
* fix build
* correct the api
* add server endpoint + tests
* add object name
* update docs
* add proxy handling
* fix build
* fix audio input path
* use is_placeholder in process_mtmd_prompt()
* nits
* nits (2)
* docs: clarify chat/completions/input_tokens is not official
* fix merge problem
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* feat: add video support for Qwen3.5
* various clean up
* revise the design
* fix llava-uhd case
* nits
* nits 2
---------
Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>
…#24234) * common/chat : fix LFM2 reasoning round-trip and stray <think> leak * Gate by reasoning format and whether the template supports <think>
A fitted target context can end up smaller than the draft default, the oversized assistant views then overflow the shared K/V tensors and trip the ggml_view_4d size assert during graph reserve.
Mistral explicitly sets `moe` and `llama_4_scaling` to `null` in params.json, breaking `key in dict` checks during conversion. Replace with `dict.get(key) is not None` where this matters. Fixes `convert-hf-to-gguf.py --mistral-format Mistral-Medium-3.5-128B`
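The difference between the two checks can be seen in a minimal sketch. The `params` dict below is a hypothetical stand-in for a parsed params.json, and `has_value` is an illustrative helper name, not a function from the converter:

```python
import json

# Hypothetical params.json fragment: the keys exist but are explicitly null.
params = json.loads('{"dim": 4096, "moe": null, "llama_4_scaling": null}')


def has_value(d: dict, key: str) -> bool:
    """Return True only when the key is present AND its value is not null."""
    return d.get(key) is not None


# Membership test: True, because the key exists even though its value is null.
in_check = "moe" in params

# Value test: False, because a null value is treated the same as a missing key.
value_check = has_value(params, "moe")
```

With the membership test, the converter would take the MoE branch for a dense model; with the value test, it would not.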
* common : relax sampler name matching

Currently, in some cases, the alternative names for samplers (like `top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are not always recognized by the `common_sampler_types_from_names` function in `common/sampling.cpp`.

This PR changes the signature of this function to remove the `bool allow_alt_names` flag, and removes all occurrences of the flag from call sites. Therefore, the function will now always match all known names.

I also changed the logic of the function to unconditionally check the provided sampler names against both the canonical and alternative names, and to be case-insensitive.

This fixes an issue I was seeing wherein samplers specified in the `llama-server` UI were not recognized as valid when the alternative names were used.

* add more alt names
* cont. fix
* cast to unsigned char for correctness
* common : unify sampler name mapping
* annotate canonical vs. alt sampler name mappings per @CISC
* Update common/sampling.cpp
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* common : auto-generate sampler name aliases per @ngxson
* use merged map for matching
* use `.merge` instead of iterating
* nit: simplify comment
* nit: use insert everywhere, not index assignment
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
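The matching behaviour described in this commit can be sketched in Python. This is an illustration only: the real code is C++ in `common/sampling.cpp`, and the canonical list and alias rules below (hyphenated and separator-free spellings) are assumptions, not the actual table.

```python
# Hypothetical canonical sampler names; the real list lives in common/sampling.cpp.
CANONICAL = ["top_k", "top_p", "min_p", "typ_p", "temperature"]


def build_name_map(canonical):
    """Map every accepted spelling (lower-cased) to its canonical name."""
    names = {}
    for name in canonical:
        names[name] = name
        # Assumed alias rules: hyphenated and separator-free forms.
        names.setdefault(name.replace("_", "-"), name)
        names.setdefault(name.replace("_", ""), name)
    return names


NAME_MAP = build_name_map(CANONICAL)


def sampler_types_from_names(requested):
    """Resolve names case-insensitively; unknown names are skipped."""
    resolved = []
    for raw in requested:
        canonical = NAME_MAP.get(raw.strip().lower())
        if canonical is not None:
            resolved.append(canonical)
    return resolved


result = sampler_types_from_names(["Top-K", "min_p", "bogus", "TOPP"])
```

Building one merged lookup table up front means there is a single code path for canonical and alternative names, so no flag is needed to choose between them.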
* update compute runtime from 25 to 26 in docker
* add comment with old driver for multiple GPUs
* cuda: reset device in get_memory function if no backend is active
* also count device and host buffers
* exclude hip and musa from counting and device reset
* use device mutex instead of atomic
* undo backend_free function move
This allows vec4 loads of the B elements. Also increase BK to 64 when this is enabled. Neither of these alone is consistently faster, but together these give a nice speedup. In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are multiples of 4.
* wip
* ok: lazy bitmap API
* remember to free lazy text
* wip
* add mtmd_helper_video
* support video input on server (base64 input)
* add MTMD_VIDEO config
* add timestamp
* update CLI
* cli: allow auto-completion for video
* add --video arg
* fix build
* update docs
* rename as suggested
…anges
libelizainference.so now links against the merged tree (66 FFI symbols, 9 M3
backend-selector symbols):
- clip.cpp clip_encode_float_image: clip_image_f32 encapsulated nx_/ny_/buf
upstream → use set_size() + cpy_buf() public setters.
- eliza-inference-ffi.cpp: mtmd_helper_bitmap_init_from_buf gained a 4th
'placeholder' bool and now returns a {bitmap,video_ctx} wrapper → pass false
+ take .bitmap.
(The earlier -fPIC link failure was a standalone-cmake config gap, not a code
issue — the production build hook sets CMAKE_POSITION_INDEPENDENT_CODE.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 22, 2026
…the merged tree The merged tree's CUDA backend now COMPILES clean (248/248 targets, 0 errors) by aligning ggml-cuda to upstream's refactored type system (block_q8_1_layout template, ggml_cuda_kernel_launch params, fattn type-exhaustive instances). Drops the legacy QJL/PolarQuant/TBQ3_TCQ CUDA KV kernels + their custom-type template instances (q1_0_g128, fattn-vec tbq3_0/tbq4_0): these are head_dim=128 and Gemma-irrelevant (Gemma uses stock q8_0 KV — M6). Preserving them on CUDA would need a separate reconciliation of our custom GGML_TYPE_* integration with upstream's type refactor; out of scope for the Gemma cutover. Verified: builds with CMAKE_CUDA_ARCHITECTURES=90-virtual. Runtime on the local RTX 5080 (sm_120) still needs CUDA 12.8+/13 — the CUDA 12.0 runtime cannot enumerate Blackwell (confirmed: ggml_cuda_init 'no CUDA-capable device'); not a code issue. Metal/Vulkan kernels unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… over merged tree Gemma 4 E2B VERIFIED running on Vulkan (Intel Arrow Lake iGPU, pp64 74.99 t/s, -ngl 99) against the merged tree. Aligned ggml-vulkan.cpp to upstream's SPIR-V handling; dropped custom QJL/Polar Vulkan dispatch (Gemma uses stock KV — M6). All 464 GLSL shaders compile via glslc. Vulkan == Android Mali/Adreno backend, de-risking the Pixel path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ization

Replace the stub StyleTTS2 forward (deterministic ~475Hz buzz, ASR="") with the vendored CrispStrobe/CrispASR MIT ggml Kokoro/StyleTTS2 implementation (src/kokoro-crispasr.cpp + core/* headers + gguf_loader), behind a thin eliza_kokoro:: adapter so the fused FFI (tools/omnivoice eliza_inference_kokoro_*) and the kokoro-tts CLI keep their ABI. Retires the old manual kokoro/predictor/istft/phonemes sources.

Make the FFI/CLI text arg accept RAW ENGLISH and phonemize internally:
- Kokoro-82M was trained on hexgrad/misaki's English phoneme inventory (ʤ ʧ A O W I Y, no ː length marks), not raw espeak/CMU IPA. The built-in G2P (core/g2p_en.h, LTS + auto-loaded CMUdict) and espeak both emit raw IPA (dʒ tʃ oʊ eɪ aʊ aɪ ɔɪ, ɑː ɔː uː iː), which kokoro_phonemes_to_ids silently mis-tokenizes (dʒ → d+ʒ, wrong embeddings) → wrong words.
- Add tools/kokoro/include/kokoro-misaki.h: a single greedy longest-first IPA→misaki pass (affricates → ʤ/ʧ, diphthongs → A/I/O/W/Y, r-coloured ɜː/ɝ/ɚ → ɜɹ/əɹ, drop ː). US default; en-gb maps oʊ/əʊ → Q. Mapping cross-checked codepoint-for-codepoint against reference Python KPipeline(lang_code='a') output. Wired into phonemize_cached for English after builtin/espeak/popen.
- g2p_en::text_to_ipa now preserves sentence/clause punctuation (.,!?;:), which Kokoro's vocab includes and uses for prosody (was dropped).
- kokoro-adapter routes by content: an already-IPA/misaki string (carries IPA stress/length marks or U+0250–U+02AF letters) goes verbatim to kokoro_synthesize_phonemes; raw text goes to kokoro_synthesize (internal G2P + misaki). Pre-phonemized callers keep working.

ggml_col2im_1d: fix the output-length formula to (T_in-1)*s + K - p0 (single-sided crop, matching the decomposed ConvTranspose1d contract in core/conv.h that the iSTFTNet generator needs) in both ggml.c and the CPU kernel.

Verified end-to-end on Linux through the fused libelizainference.so: raw text "Hello world." and "The quick brown fox jumps over the lazy dog." synthesize and ASR-transcribe word-perfect; 14/15-sentence battery 98.9% word-match (the one miss is the to/too homophone, not a phoneme error); internal phonemes match KPipeline (e.g. "həlˈO wˈɜɹld.", "ðə kwˈɪk bɹˈWn fˈɑks ʤˈʌmps ˈOvəɹ ðə lˈAzi dˈɔɡ.").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
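The greedy longest-first IPA→misaki pass can be sketched as follows. This is a Python illustration, not the contents of kokoro-misaki.h. The table is a small US-English subset, and the pairing of each diphthong with its letter is an assumption that is consistent with the phoneme examples quoted in the commit ("həlˈO", "bɹˈWn", "lˈAzi").

```python
# Illustrative subset of an IPA -> misaki table (US English). Assumed pairings.
IPA_TO_MISAKI = {
    "dʒ": "ʤ", "tʃ": "ʧ",                                      # affricates
    "eɪ": "A", "aɪ": "I", "oʊ": "O", "aʊ": "W", "ɔɪ": "Y",  # diphthongs
    "ɜː": "ɜɹ", "ɝ": "ɜɹ", "ɚ": "əɹ",                          # r-coloured vowels
    "ː": "",                                                    # drop length marks
}

# Try longer keys first so that "dʒ" wins over a bare "d".
KEYS = sorted(IPA_TO_MISAKI, key=len, reverse=True)


def ipa_to_misaki(ipa: str) -> str:
    """Single left-to-right pass; at each position take the longest matching key."""
    out = []
    i = 0
    while i < len(ipa):
        for key in KEYS:
            if ipa.startswith(key, i):
                out.append(IPA_TO_MISAKI[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character through unchanged.
            out.append(ipa[i])
            i += 1
    return "".join(out)
```

Matching the longest key first is what prevents the mis-tokenization described above, where dʒ would otherwise be consumed as d followed by ʒ.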
… garbles audio) kokoro_context_default_params() sets use_gpu=true, so on a fused engine lib that carries a GPU backend (Vulkan on Android/Mali, CUDA on desktop) ggml_backend_init_best() pins Kokoro's predictor + text-encoder + PLBERT graph onto the GPU. The GPU kernels for Kokoro's op set (conv_transpose_1d / col2im_1d / istft + the StyleTTS2 prosody predictor) do not match the CPU reference numerically, producing garbled low-frequency audio instead of speech. Pin the whole Kokoro forward to CPU in the eliza_kokoro adapter (the single FFI integration point) unless KOKORO_GEN_GPU / KOKORO_GEN_FORCE_METAL is set to opt a verified backend back in. Kokoro-82M is tiny (~80M params, <100 ms CPU TTFB), so CPU is both correct and fast; the LLM keeps its own GPU backend. Verified on a Pixel 9a (Mali-G715, arm64 Vulkan-variant libelizainference): the device kokoro synth, captured over the bionic UDS and run back through Qwen3-ASR, returns "Hello, world." (2/2 words) — before the fix the same path returned garbled "Hmm"/"Yeah" (peak amplitude 0.14 vs 0.25 after). Linux CPU ASR roundtrip unchanged (2/2). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
qwen3a.cpp does block-diagonal attention by batching each conv chunk as a separate ggml_view_4d element (build_vit called with a null mask), so the graph registers no attn_mask input. The prior code uploaded a [n_pos,n_pos] mask with divergent 200/25 chunk geometry, aborting clip_image_batch_encode at 'Failed to get tensor attn_mask' on every Qwen3-ASR call. Removing the upload makes on-device ASR work (verified: eliza-1-asr transcribes real speech + kokoro TTS output correctly through the rebuilt libelizainference). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d loader
The CrispASR fused-lib Kokoro loader (require(c, "bert.embd.tok.weight") …)
required un-prefixed canonical tensor names, but the published bundle GGUF
(elizaos/eliza-1 .../kokoro-82m-v1_0-Q4_K_M.gguf) and the in-tree converter
emit mainline-llama.cpp names ('kokoro.bert.layer.attn_q.weight',
'kokoro.bert.token_embd.weight', 'kokoro.predictor.de.lstm0.…',
'kokoro.gen.resblocks.0.convs1.0.weight', …). The loader rejected every
shipped GGUF ('required tensor missing: bert.embd.tok.weight' + ~15 more) so
Kokoro never loaded on desktop.
Add canonicalize_tensor_name(): a fixed set of ordered structural rewrites that
maps any mainline 'kokoro.'-prefixed name onto the loader's canonical name, then
normalize the loaded tensor map once at kokoro_init_from_file. Already-canonical
CrispASR/StyleTTS2 names pass through unchanged, so the on-device GGUF keeps
working. Collisions are a hard error; unrecognized 'kokoro.' names fall through
prefix-stripped and the sanity check still fails loudly if a required weight is
genuinely absent. Drop the unused 'bert.pooler.weight' sanity requirement (the
synthesis path reads last_hidden_state, never the pooled output, and mainline
GGUFs ship no pooler).
Verified: fused libelizainference.so rebuilds clean (LLAMA_BUILD_KOKORO=ON) and
all BERT/predictor/text_enc require()/sanity names now resolve from the real
shipped GGUF's kokoro.* tensors.
Refs elizaOS/eliza#9588
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
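The normalization described in this commit can be sketched in Python. The real `canonicalize_tensor_name()` is C++ and its rule table is not shown in the commit; the rewrites below are an illustrative subset inferred from the tensor names quoted in this log, and `normalize_tensor_map` is a hypothetical helper name.

```python
# Illustrative ordered rewrites; the actual rule table is an assumption here.
REWRITES = [
    ("bert.token_embd.", "bert.embd.tok."),
    ("predictor.", "pred."),
    (".norm1.fc.", ".adain1."),
    (".norm2.fc.", ".adain2."),
]


def canonicalize_tensor_name(name: str) -> str:
    """Map a mainline 'kokoro.'-prefixed name onto the loader's canonical name."""
    if not name.startswith("kokoro."):
        return name  # already-canonical names pass through unchanged
    name = name[len("kokoro."):]
    # Apply the rewrites in a fixed order; unrecognized names fall through
    # with only the prefix stripped.
    for old, new in REWRITES:
        name = name.replace(old, new)
    return name


def normalize_tensor_map(tensors: dict) -> dict:
    """Rename every tensor once at load time; a collision is a hard error."""
    out = {}
    for name, tensor in tensors.items():
        canon = canonicalize_tensor_name(name)
        if canon in out:
            raise ValueError(f"tensor name collision: {canon!r}")
        out[canon] = tensor
    return out
```

Normalizing the map once at load time keeps every later lookup on canonical names only, so the required-tensor check does not need to know about the mainline naming at all.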
… bert.* GGUF (#9588) The CrispASR forward (a11adb5) vendored the runtime but not its converters, so nothing in-repo produced the bert.*/pred.*/dec.gen.* GGUF the loader requires — Kokoro silently fell back to OmniVoice/stub. Vendor CrispStrobe/CrispASR's (MIT) convert-kokoro-to-gguf.py + voice converter, with a torch-portability fix (tensor-or-tensor -> explicit is None). Verified: hexgrad/Kokoro-82M -> 459-tensor bert.* GGUF + af_bella voice -> kokoro-real-smoke loads + synthesizes 88800 samples @ 24kHz; ASR round-trip of the output is an exact match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…egfault) (#9588) canonicalize_tensor_name() only applied the .norm1.fc.->.adain1. / .norm2.fc.-> .adain2. rewrites in the 'decoder.' branch, not the 'predictor.' branch. But the predictor F0/N AdainResBlk1d stacks carry the same AdaIN1d norms, named norm1.fc/norm2.fc upstream (e.g. predictor.F0.0.norm1.fc.weight). Without the rewrite they stayed pred.F0.<i>.norm1.fc.weight while run_stack() requires pred.F0.<i>.adain1.weight, so they slipped past the soft sanity_check_weights and surfaced as a NULL-tensor segfault in kokoro_adain_resblk -> ggml_mul_mat during synthesis of the real shipped 457-tensor mainline GGUF. Verified on Linux x64: rebuilt libelizainference, ran kokoro-real-smoke against the real shipped kokoro-82m-v1_0-Q4_K_M.gguf bundle — the 'pred.F0.*.adain* required tensor missing' errors and the kokoro_adain_resblk segfault are gone; synthesis now progresses past the predictor to the decoder (where a separate F16-im2col-vs-Q4-conv-quant issue remains, tracked in #9588). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path (#9588) After the predictor-AdaIN mapping fix, synthesis of the real shipped Q4_K_M GGUF hit GGML_ASSERT(src0->type == GGML_TYPE_F16) in ggml_compute_forward_im2col_f16. Cause: ggml_conv_1d routes through an F16-destination im2col whose CPU kernel requires an F16 weight, but decoder.F0_conv.weight and decoder.N_conv.weight ship F32 in the mainline GGUF (the only 2 F32 conv kernels; the other 84 are F16). Cast a non-F16 conv weight to F16 in small_conv1d (the F0/N conv helper) so it matches the convention every other conv already uses. Verified the assert is gone and the full pipeline now runs end-to-end producing 3.17s of 24kHz PCM. NOTE: audio CORRECTNESS is not yet verified — synthesized with hand-assembled test artifacts (a raw->GGUF-converted voice + a vocab-augmented GGUF), the output is constant HF noise (spectral centroid 7.3kHz, flat envelope), not intelligible speech. This likely reflects the synthetic test inputs rather than the fix (both fixes are unambiguous crash-eliminations), but it must be re-verified against the real repackaged bundle (correct kokoro-voice GGUF + the 178-entry vocab embedded) before this is considered to produce correct speech. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…588) Extends the F0/N-only F16 conv cast to all ggml_conv_1d sites (kokoro_conv1d_kd, kokoro_conv1d_ks, and the inline text-enc/AdaIN/proj/asr_res sites), so an all-F32 GGUF (e.g. a fresh convert_kokoro_pth_to_gguf.py output) loads without GGML_ASSERT(src0->type==F16) in ggml_compute_forward_im2col_f16. The guard skips already-F16 kernels, so it is zero-cost for the shipped F16/quantized models. NOTE: this is robustness only — it does NOT fix the mainline-format noise. Verified: a fresh .pth->in-tree-converter GGUF (correct 3-D F0/N dims) now synthesizes 79200 samples without asserting, but the audio is still noise (spectral centroid 7.3kHz, flat envelope — identical to the shipped artifact), so the mainline-kokoro.* format is mishandled by this loader for a reason deeper than conv dims/dtype. The CrispASR-format path remains the working route. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odel), remove Qwen3-ASR Run the gemma4a USM-audio mmproj over the SHARED ctx->llm_model (mirroring the vision describe_image path) instead of loading a separate Qwen3-ASR LM. Replace the hardcoded Qwen <|im_start|>/<asr_text> chat scaffold + stop-tokens with the Gemma chat template (<start_of_turn>... <__media__> ... <end_of_turn>); mtmd auto-inserts the gemma4a <|audio|> markers. Legacy separate-LM path kept as a fallback. Public ABI unchanged. Verified end-to-end through libelizainference.so (FFI, not mtmd-cli): freeman.wav -> 'If you go into different cultures, they have different concepts of creation...' (correct English, 18 words, PASS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mat noise) (#9588) ROOT CAUSE of the desktop-Kokoro noise, found + verified end-to-end: the fused loader's forward pass requires the weight matrices in F16. convert_kokoro_pth_to_gguf.py emitted everything as F32, so the model loaded + synthesized but produced inaudible noise (not degraded speech). The shipped HF bundle is quantized → same noise. _add_tensor now matches the working CrispASR export's dtype scheme exactly: ndim>=2 weight matrices -> F16, 1-D biases/norms -> F32 (verified: CrispASR ships 112 2-D + 141 3-D tensors as F16 and all 206 1-D as F32, zero exceptions). Verified: re-ran this converter on the canonical hexgrad/Kokoro-82M kokoro-v1_0.pth -> output is F16:252/F32:205 -> synthesizes SPEECH (spectral centroid 3383Hz, envelope-cv 1.314, identical to CrispASR's working output), vs all-F32 noise (centroid 7269Hz, flat). No model surgery or loader workaround needed — the converter dtype scheme is the whole fix. To ship: regenerate the elizaos/eliza-1 bundle Kokoro GGUF with this converter (it will be F16-weighted) instead of the current quantized/F32 artifact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
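The dtype rule stated in this commit is small enough to sketch directly. This is a Python illustration, not the converter's `_add_tensor`; the function name `gguf_dtype_for` and the tensor shapes are hypothetical.

```python
def gguf_dtype_for(shape: tuple) -> str:
    """ndim >= 2 weight matrices -> F16; 1-D biases/norms -> F32."""
    return "F16" if len(shape) >= 2 else "F32"


# Hypothetical tensor shapes, for illustration only.
shapes = {
    "bert.embd.tok.weight": (178, 128),    # 2-D matrix
    "dec.gen.conv.weight": (512, 256, 7),  # 3-D conv kernel
    "dec.gen.conv.bias": (512,),           # 1-D bias
}

plan = {name: gguf_dtype_for(shape) for name, shape in shapes.items()}
```

The counts quoted in the log are consistent with this rule: 112 two-dimensional plus 141 three-dimensional tensors gives 253 F16 tensors, and with the 206 one-dimensional F32 tensors that totals the 459 tensors of the CrispASR-format GGUF mentioned earlier.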
…I + C-API embedding proof harness The M4 LiteRT-LM backend targeted the C++ API (litert::lm::Engine, libc++ ABI — fragile to link against the prebuilt lib). Re-target the ELIZA_ENABLE_LITERT real path to the stable C API (litert_lm_*, c/engine.h): engine_settings_create + set_enable_speculative_decoding(true) for MTP + engine_create + create_session on an NPU->GPU->CPU ladder; prefill -> detokenize+run_prefill+run_decode_async; next -> pop one chunk from a thread-safe queue fed by LiteRtLmStreamCallback (push-stream mapped to the FFI pull contract). No libc++ C++-ABI symbols. Stub branch (gate off) unchanged. litert-capi-smoke.cpp: standalone linkage proof — loads a Gemma-4 .litertlm via the C API and generates. Verified on Linux x86_64 against the prebuilt liblitert-lm.so: prints 'The capital of France is Paris.' Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
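The push-to-pull mapping can be sketched with a thread-safe queue. This is a Python illustration under stated assumptions: the real code is C++ against the LiteRT-LM C API, the signature of LiteRtLmStreamCallback is not shown in the commit, and `PullStream`, `on_chunk`, and `fake_decode` are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


class PullStream:
    """Adapts a push-style stream callback to a pull-style next() call."""

    def __init__(self):
        self._chunks = queue.Queue()

    def on_chunk(self, text, is_final):
        """Callback invoked from the producer (decode) thread."""
        if text:
            self._chunks.put(text)
        if is_final:
            self._chunks.put(_DONE)

    def next(self, timeout=None):
        """Pop one chunk; returns None once the stream has finished."""
        item = self._chunks.get(timeout=timeout)
        if item is _DONE:
            self._chunks.put(_DONE)  # keep returning None on later calls
            return None
        return item


def fake_decode(stream):
    """Stand-in for an asynchronous decode that pushes chunks as they arrive."""
    pieces = ["The capital", " of France", " is Paris."]
    for i, piece in enumerate(pieces):
        stream.on_chunk(piece, is_final=(i == len(pieces) - 1))


stream = PullStream()
worker = threading.Thread(target=fake_decode, args=(stream,))
worker.start()

collected = []
while (chunk := stream.next(timeout=5)) is not None:
    collected.append(chunk)
worker.join()
text = "".join(collected)
```

The queue decouples the two sides: the producer never blocks on the consumer, and the consumer sees exactly one chunk per call, which is the shape a pull contract expects.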
…rness Mirrors the LiteRT LLM backend for speech-to-text. litert_asr_transcribe takes 16kHz mono fp32 PCM -> in-memory PCM16 WAV -> base64 -> LiteRT-LM conversation audio path (engine audio_backend=cpu + send_message with an audio blob + 'Transcribe' prompt) -> transcription text. Gated ELIZA_ENABLE_LITERT (stub when off). Error contract honored (heap *out_error, no logging). Verified on Linux x86_64 against the prebuilt liblitert-lm.so: transcribes jfk.wav -> 'And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.' This is the Qwen3-ASR replacement via the Gemma USM audio encoder. Wiring into eliza_inference_asr_* (resolve .litertlm from bundle dir, warm-engine cache) is the next step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
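The first three stages of that pipeline (fp32 PCM to an in-memory PCM16 WAV to base64) can be sketched with the Python standard library. This is an illustration, not the C++ in litert_asr_transcribe; the function name, the clamping, and the 32767 scale factor are assumptions.

```python
import base64
import io
import struct
import wave


def pcm_f32_to_wav_base64(samples, sample_rate=16000):
    """Encode mono float PCM in [-1, 1] as a base64 PCM16 WAV string."""
    # Assumed conversion: clamp to [-1, 1], then scale to signed 16-bit.
    pcm16 = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(pcm16)}h", *pcm16))

    return base64.b64encode(buf.getvalue()).decode("ascii")


encoded = pcm_f32_to_wav_base64([0.0, 1.0, -1.0, 2.0])
```

Keeping the WAV in memory avoids a temporary file, and base64 makes the audio safe to embed in a text message payload.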
…ribe (warm engine) The LiteRT ASR path was a standalone proof; this makes it the live ASR when a .litertlm bundle is present. litert-asr now exposes a warm-engine handle (engine_open loads the model+USM encoder once; engine_transcribe opens only a fresh conversation per call). eliza_inference_asr_transcribe (gate ELIZA_ENABLE_LITERT) probes <bundle>/text/*.litertlm, lazily opens+caches the engine on the context, and delegates; otherwise falls through to the fused Qwen3-ASR path unchanged. Verified: gate-on build links libelizainference.so with the ASR symbols folded in + liblitert-lm.so DT_NEEDED; gate-off build is byte-for-byte unchanged (zero litert_asr symbols, no liblitert-lm dep). Device-verified earlier on Pixel 9a. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Full merge of ggml-org/master into the eliza fork (off the M3 multi-backend FFI tip).
Scope: 1599 files, +117,562 / −51,782; 604 upstream commits; `upstream/master` is now a git ancestor (full merge, not cherry-pick). Supersedes the standalone KV-checkpoint PR #27 (that fix is included).

Resolution: 105 conflicts — take-theirs for CI/docs/sycl/hexagon/server/mtmd infra (incl. upstream's Gemma4 audio preprocessor `gemma4ua` + int64 mel widening); preserved our custom ggml types (TBQ3_0/QJL1_256/Q4_POLAR) + ops (ATTN_SCORE_QJL/FUSED_ATTN_QJL_TBQ), the gemma4 loaders, and the M3/M4/M5 FFI backend seam (tools/omnivoice/src/{llm-backend,eliza-inference-ffi,backends/*}). De-risk: the Qwen-hybrid kernels (gated_delta_net/qwen35/fattn) took upstream cleanly because the Gemma cutover abandons them (stock q8_0 KV).

Compile reconciliation (upstream hparams refactor): `nextn_predict_layers`→`n_layer_nextn`, `is_recurrent`→`is_recr`, `n_layer` field→`n_layer()`; restored a dropped `if (has_draft_simple)` in speculative.cpp; de-duped the EAGLE3 rope-type case (upstream now ships real EAGLE3).

Verified: CPU build green (build 10632); Gemma 4 E2B Q8_0 runs — pp64 56.6→67.46 t/s (+19% from upstream), tg32 5.95. GPU backends (CUDA/Metal/Vulkan) resolved-to-compile, runtime DEVICE-VERIFY-pending (no CUDA-13/Mac/Pixel in CI session).
Part of elizaOS/eliza#9033.
🤖 Generated with Claude Code