Sync fork to upstream master (+604 commits) — Gemma4 verified#29

Open
lalalune wants to merge 623 commits into main from eliza/upstream-merge-m1

@lalalune

Member

Full merge of ggml-org/master into the eliza fork (off the M3 multi-backend FFI tip).

Scope: 1599 files, +117,562 / −51,782; 604 upstream commits; upstream/master is now a git ancestor (full merge, not cherry-pick). Supersedes the standalone KV-checkpoint PR #27 (that fix is included).

Resolution: 105 conflicts — take-theirs for CI/docs/sycl/hexagon/server/mtmd infra (incl. upstream's Gemma4 audio preprocessor gemma4ua + int64 mel widening); preserved our custom ggml types (TBQ3_0/QJL1_256/Q4_POLAR) + ops (ATTN_SCORE_QJL/FUSED_ATTN_QJL_TBQ), the gemma4 loaders, and the M3/M4/M5 FFI backend seam (tools/omnivoice/src/{llm-backend,eliza-inference-ffi,backends/*}). De-risk: the Qwen-hybrid kernels (gated_delta_net/qwen35/fattn) took upstream cleanly because the Gemma cutover abandons them (stock q8_0 KV).

Compile reconciliation (upstream hparams refactor): nextn_predict_layers→n_layer_nextn, is_recurrent→is_recr, n_layer field→n_layer(); restored a dropped if (has_draft_simple) in speculative.cpp; de-duped the EAGLE3 rope-type case (upstream now ships real EAGLE3).

Verified: CPU build green (build 10632); Gemma 4 E2B Q8_0 runs — pp64 56.6→67.46 t/s (+19% from upstream), tg32 5.95. GPU backends (CUDA/Metal/Vulkan) resolved-to-compile, runtime DEVICE-VERIFY-pending (no CUDA-13/Mac/Pixel in CI session).

Part of elizaOS/eliza#9033.

🤖 Generated with Claude Code

ngxson and others added 30 commits June 5, 2026 18:12
* vulkan: add fwht support for Intel with shmem reduction

* don't use N as workgroup size

* disable subgroup shuffle on MoltenVK AMD

* disable fwht shader on Intel Windows due to driver bug
* opencl: allow multiple workgroups for large rows

* opencl: improve small cpy

* opencl: packed concat for small input

* opencl: tweak flat q6_K gemv, increase N_DST and remap threads
…put_tokens API (#23913)

* mtmd: add "placeholder bitmap" for counting tokens w/o preprocessing

* fast path skip preproc for placeholder

* fix build

* correct the api

* add server endpoint + tests

* add object name

* update docs

* add proxy handling

* fix build

* fix audio input path

* use is_placeholder in process_mtmd_prompt()

* nits

* nits (2)

* docs: clarify chat/completions/input_tokens is not official

* fix merge problem
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* feat: add video support for Qwen3.5

* various clean up

* revise the design

* fix llava-uhd case

* nits

* nits 2

---------

Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>
…#24234)

* common/chat : fix LFM2 reasoning round-trip and stray <think> leak
* Gate by reasoning format and whether the template supports <think>
A fitted target context can end up smaller than the draft default; the
oversized assistant views then overflow the shared K/V tensors and trip
the ggml_view_4d size assert during graph reserve.

Mistral explicitly sets `moe` and `llama_4_scaling` to `null` in
params.json, breaking `key in dict` checks during conversion. Replace
with `dict.get(key) is not None` where this matters.

Fixes `convert-hf-to-gguf.py --mistral-format Mistral-Medium-3.5-128B`
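
The difference between the two checks can be shown in isolation. This is a minimal sketch, not the converter's code: the `params` content and both helper names are hypothetical, and only the `key in dict` versus `dict.get(key) is not None` contrast comes from the commit message.

```python
import json

# Hypothetical excerpt of a params.json in which optional features are
# explicitly set to null instead of being omitted.
params = json.loads('{"dim": 4096, "moe": null, "llama_4_scaling": null}')


def has_key(cfg: dict, key: str) -> bool:
    # Membership test: true even when the stored value is null/None.
    return key in cfg


def has_value(cfg: dict, key: str) -> bool:
    # Value test: true only when the key is present AND its value is not None.
    return cfg.get(key) is not None


print(has_key(params, "moe"))    # True
print(has_value(params, "moe"))  # False
```

With an explicit `null`, the membership test reports the feature as present, while the value test treats it the same as a missing key.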
* common : relax sampler name matching

Currently, in some cases, the alternative names for samplers (like
`top-k` and `min-p` instead of the canonical `top_k` and `min_p`) are
not always recognized by the `common_sampler_types_from_names` function
in `common/sampling.cpp`.

This PR changes the signature of this function to remove the `bool
allow_alt_names` flag, and removes all occurrences of the flag from call
sites. Therefore, the function will now always match all known names.

I also changed the logic of the function to unconditionally check the
provided sampler names against both the canonical and alternative names,
and to be case-insensitive.

This fixes an issue I was seeing wherein samplers specified in the
`llama-server` UI were not recognized as valid when the alternative
names were used.
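
A rough Python sketch of the matching behaviour described above (the actual change is C++ in `common/sampling.cpp`). The canonical list, the extra aliases and every function name here are illustrative assumptions, not the real tables; skipping unknown names is also an assumption of this sketch.

```python
# Illustrative canonical names; the real list lives in common/sampling.cpp.
CANONICAL = ["top_k", "top_p", "min_p", "temperature"]

# Illustrative hand-written aliases that cannot be derived mechanically.
EXTRA_ALIASES = {"temp": "temperature", "nucleus": "top_p"}


def build_name_map() -> dict:
    """Merge canonical names, generated dash aliases and extra aliases."""
    names = {}
    for canon in CANONICAL:
        names.setdefault(canon, canon)
        # Auto-generate the dash spelling, e.g. "top_k" -> "top-k".
        names.setdefault(canon.replace("_", "-"), canon)
    for alias, canon in EXTRA_ALIASES.items():
        names.setdefault(alias, canon)
    return names


NAME_MAP = build_name_map()


def sampler_types_from_names(requested):
    """Resolve names case-insensitively against the merged map.

    Unknown names are skipped in this sketch.
    """
    resolved = []
    for name in requested:
        canon = NAME_MAP.get(name.lower())
        if canon is not None:
            resolved.append(canon)
    return resolved


print(sampler_types_from_names(["Top-K", "MIN_P", "temp", "bogus"]))
# ['top_k', 'min_p', 'temperature']
```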

* add more alt names

* cont. fix

* cast to unsigned char for correctness

* common : unify sampler name mapping

* annotate canonical vs. alt sampler name mappings per @CISC

* Update common/sampling.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common : auto-generate sampler name aliases per @ngxson

* use merged map for matching

* use `.merge` instead of iterating

* nit: simplify comment

* nit: use insert everywhere, not index assignment

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* update compute runtime from 25 to 26 in docker

* add comment with old driver for multiple GPUs
* cuda: reset device in get_memory function if no backend is active

* also count device and host buffers

* exclude hip and musa from counting and device reset

* use device mutex instead of atomic

* undo backend_free function move
This allows vec4 loads of the B elements. Also increase BK to 64 when this is
enabled. Neither of these alone is consistently faster, but together these give
a nice speedup.

In ggml-vulkan.cpp, we need to make sure the B matrix alignment and stride are
multiples of 4.
* wip

* ok: lazy bitmap API

* remember to free lazy text

* wip

* add mtmd_helper_video

* support video input on server (base64 input)

* add MTMD_VIDEO config

* add timestamp

* update CLI

* cli: allow auto-completion for video

* add --video arg

* fix build

* update docs

* rename as suggested
…anges

libelizainference.so now links against the merged tree (66 FFI symbols, 9 M3
backend-selector symbols):
- clip.cpp clip_encode_float_image: clip_image_f32 encapsulated nx_/ny_/buf
  upstream → use set_size() + cpy_buf() public setters.
- eliza-inference-ffi.cpp: mtmd_helper_bitmap_init_from_buf gained a 4th
  'placeholder' bool and now returns a {bitmap,video_ctx} wrapper → pass false
  + take .bitmap.

(The earlier -fPIC link failure was a standalone-cmake config gap, not a code
issue — the production build hook sets CMAKE_POSITION_INDEPENDENT_CODE.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
lalalune and others added 15 commits June 22, 2026 14:21
…the merged tree

The merged tree's CUDA backend now COMPILES clean (248/248 targets, 0 errors) by
aligning ggml-cuda to upstream's refactored type system (block_q8_1_layout
template, ggml_cuda_kernel_launch params, fattn type-exhaustive instances).

Drops the legacy QJL/PolarQuant/TBQ3_TCQ CUDA KV kernels + their custom-type
template instances (q1_0_g128, fattn-vec tbq3_0/tbq4_0): these are head_dim=128
and Gemma-irrelevant (Gemma uses stock q8_0 KV — M6). Preserving them on CUDA
would need a separate reconciliation of our custom GGML_TYPE_* integration with
upstream's type refactor; out of scope for the Gemma cutover.

Verified: builds with CMAKE_CUDA_ARCHITECTURES=90-virtual. Runtime on the local
RTX 5080 (sm_120) still needs CUDA 12.8+/13 — the CUDA 12.0 runtime cannot
enumerate Blackwell (confirmed: ggml_cuda_init 'no CUDA-capable device'); not a
code issue. Metal/Vulkan kernels unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… over merged tree

Gemma 4 E2B VERIFIED running on Vulkan (Intel Arrow Lake iGPU, pp64 74.99 t/s,
-ngl 99) against the merged tree. Aligned ggml-vulkan.cpp to upstream's SPIR-V
handling; dropped custom QJL/Polar Vulkan dispatch (Gemma uses stock KV — M6).
All 464 GLSL shaders compile via glslc. Vulkan == Android Mali/Adreno backend,
de-risking the Pixel path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ization

Replace the stub StyleTTS2 forward (deterministic ~475Hz buzz, ASR="") with
the vendored CrispStrobe/CrispASR MIT ggml Kokoro/StyleTTS2 implementation
(src/kokoro-crispasr.cpp + core/* headers + gguf_loader), behind a thin
eliza_kokoro:: adapter so the fused FFI (tools/omnivoice eliza_inference_kokoro_*)
and the kokoro-tts CLI keep their ABI. Retires the old manual
kokoro/predictor/istft/phonemes sources.

Make the FFI/CLI text arg accept RAW ENGLISH and phonemize internally:

- Kokoro-82M was trained on hexgrad/misaki's English phoneme inventory
  (ʤ ʧ A O W I Y, no ː length marks), not raw espeak/CMU IPA. The built-in
  G2P (core/g2p_en.h, LTS + auto-loaded CMUdict) and espeak both emit raw
  IPA (dʒ tʃ oʊ eɪ aʊ aɪ ɔɪ, ɑː ɔː uː iː), which kokoro_phonemes_to_ids
  silently mis-tokenizes (dʒ → d+ʒ, wrong embeddings) → wrong words.
- Add tools/kokoro/include/kokoro-misaki.h: a single greedy longest-first
  IPA→misaki pass (affricates → ʤ/ʧ, diphthongs → A/I/O/W/Y, r-coloured
  ɜː/ɝ/ɚ → ɜɹ/əɹ, drop ː). US default; en-gb maps oʊ/əʊ → Q. Mapping
  cross-checked codepoint-for-codepoint against reference Python
  KPipeline(lang_code='a') output. Wired into phonemize_cached for English
  after builtin/espeak/popen.
- g2p_en::text_to_ipa now preserves sentence/clause punctuation (.,!?;:),
  which Kokoro's vocab includes and uses for prosody (was dropped).
- kokoro-adapter routes by content: an already-IPA/misaki string (carries
  IPA stress/length marks or U+0250–U+02AF letters) goes verbatim to
  kokoro_synthesize_phonemes; raw text goes to kokoro_synthesize (internal
  G2P + misaki). Pre-phonemized callers keep working.
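
The greedy longest-first IPA→misaki pass can be sketched as follows. This is a minimal Python illustration, not the contents of `kokoro-misaki.h`: the table is a partial reconstruction from the description above (US English only), and the names `IPA_TO_MISAKI` and `ipa_to_misaki` are hypothetical.

```python
# Partial, illustrative mapping (US English); the real table is larger.
IPA_TO_MISAKI = {
    "dʒ": "ʤ", "tʃ": "ʧ",                 # affricates -> single symbols
    "eɪ": "A", "aɪ": "I", "oʊ": "O",      # diphthongs -> capital letters
    "aʊ": "W", "ɔɪ": "Y",
    "ɜː": "ɜɹ", "ɝ": "ɜɹ", "ɚ": "əɹ",     # r-coloured vowels
    "ː": "",                              # drop remaining length marks
}

# Longest keys first, so "dʒ" wins over a lone "d" and "ɜː" over a lone "ː".
_KEYS = sorted(IPA_TO_MISAKI, key=len, reverse=True)


def ipa_to_misaki(ipa: str) -> str:
    out = []
    i = 0
    while i < len(ipa):
        for key in _KEYS:
            if ipa.startswith(key, i):
                out.append(IPA_TO_MISAKI[key])
                i += len(key)
                break
        else:
            # No mapping starts here: copy the character through unchanged.
            out.append(ipa[i])
            i += 1
    return "".join(out)


print(ipa_to_misaki("həlˈoʊ wˈɜːld"))  # həlˈO wˈɜɹld
```

Trying the longest key first is what keeps `dʒ` from being tokenized as `d` + `ʒ`, the failure mode described for the raw espeak/CMU output.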

ggml_col2im_1d: fix the output-length formula to (T_in-1)*s + K - p0 (single-
sided crop, matching the decomposed ConvTranspose1d contract in core/conv.h
that the iSTFTNet generator needs) in both ggml.c and the CPU kernel.
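
A small worked example of the corrected length formula, assuming dilation 1 and no output padding. The function names are hypothetical; the symmetric variant is included only for comparison and is the usual ConvTranspose1d length rule, which subtracts the padding on both sides.

```python
def col2im_1d_out_len(t_in: int, stride: int, kernel: int, p0: int) -> int:
    # Single-sided crop, as stated in the commit: (T_in - 1) * s + K - p0
    return (t_in - 1) * stride + kernel - p0


def symmetric_out_len(t_in: int, stride: int, kernel: int, pad: int) -> int:
    # Comparison only: symmetric padding removes `pad` samples from each side.
    return (t_in - 1) * stride + kernel - 2 * pad


print(col2im_1d_out_len(4, 2, 4, 1))  # 9
print(symmetric_out_len(4, 2, 4, 1))  # 8
```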

Verified end-to-end on Linux through the fused libelizainference.so: raw text
"Hello world." and "The quick brown fox jumps over the lazy dog." synthesize
and ASR-transcribe word-perfect; 14/15-sentence battery 98.9% word-match
(the one miss is the to/too homophone, not a phoneme error); internal phonemes
match KPipeline (e.g. "həlˈO wˈɜɹld.", "ðə kwˈɪk bɹˈWn fˈɑks ʤˈʌmps ˈOvəɹ ðə
lˈAzi dˈɔɡ.").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… garbles audio)

kokoro_context_default_params() sets use_gpu=true, so on a fused engine lib that
carries a GPU backend (Vulkan on Android/Mali, CUDA on desktop)
ggml_backend_init_best() pins Kokoro's predictor + text-encoder + PLBERT graph
onto the GPU. The GPU kernels for Kokoro's op set (conv_transpose_1d /
col2im_1d / istft + the StyleTTS2 prosody predictor) do not match the CPU
reference numerically, producing garbled low-frequency audio instead of speech.

Pin the whole Kokoro forward to CPU in the eliza_kokoro adapter (the single FFI
integration point) unless KOKORO_GEN_GPU / KOKORO_GEN_FORCE_METAL is set to opt
a verified backend back in. Kokoro-82M is tiny (~80M params, <100 ms CPU TTFB),
so CPU is both correct and fast; the LLM keeps its own GPU backend.

Verified on a Pixel 9a (Mali-G715, arm64 Vulkan-variant libelizainference): the
device kokoro synth, captured over the bionic UDS and run back through Qwen3-ASR,
returns "Hello, world." (2/2 words) — before the fix the same path returned
garbled "Hmm"/"Yeah" (peak amplitude 0.14 vs 0.25 after). Linux CPU ASR
roundtrip unchanged (2/2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
qwen3a.cpp does block-diagonal attention by batching each conv chunk as a
separate ggml_view_4d element (build_vit called with a null mask), so the
graph registers no attn_mask input. The prior code uploaded a [n_pos,n_pos]
mask with divergent 200/25 chunk geometry, aborting clip_image_batch_encode
at 'Failed to get tensor attn_mask' on every Qwen3-ASR call. Removing the
upload makes on-device ASR work (verified: eliza-1-asr transcribes real
speech + kokoro TTS output correctly through the rebuilt libelizainference).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d loader

The CrispASR fused-lib Kokoro loader (require(c, "bert.embd.tok.weight") …)
required un-prefixed canonical tensor names, but the published bundle GGUF
(elizaos/eliza-1 .../kokoro-82m-v1_0-Q4_K_M.gguf) and the in-tree converter
emit mainline-llama.cpp names ('kokoro.bert.layer.attn_q.weight',
'kokoro.bert.token_embd.weight', 'kokoro.predictor.de.lstm0.…',
'kokoro.gen.resblocks.0.convs1.0.weight', …). The loader rejected every
shipped GGUF ('required tensor missing: bert.embd.tok.weight' + ~15 more) so
Kokoro never loaded on desktop.

Add canonicalize_tensor_name(): a fixed set of ordered structural rewrites that
maps any mainline 'kokoro.'-prefixed name onto the loader's canonical name, then
normalize the loaded tensor map once at kokoro_init_from_file. Already-canonical
CrispASR/StyleTTS2 names pass through unchanged, so the on-device GGUF keeps
working. Collisions are a hard error; unrecognized 'kokoro.' names fall through
prefix-stripped and the sanity check still fails loudly if a required weight is
genuinely absent. Drop the unused 'bert.pooler.weight' sanity requirement (the
synthesis path reads last_hidden_state, never the pooled output, and mainline
GGUFs ship no pooler).
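
The normalization step can be sketched in Python as below. The rewrite table is a small illustrative subset inferred from the names quoted in these commit messages, not the real rule set, and `normalize_tensor_map` is a hypothetical name for the one-time pass at load.

```python
# Ordered structural rewrites; illustrative subset only.
REWRITES = [
    ("predictor.", "pred."),
    (".norm1.fc.", ".adain1."),
    (".norm2.fc.", ".adain2."),
]


def canonicalize_tensor_name(name: str) -> str:
    # Already-canonical names pass through unchanged.
    if not name.startswith("kokoro."):
        return name
    name = name[len("kokoro."):]
    # Apply each rewrite in order; unrecognized names fall through
    # with only the prefix stripped.
    for old, new in REWRITES:
        name = name.replace(old, new)
    return name


def normalize_tensor_map(tensors: dict) -> dict:
    out = {}
    for name, tensor in tensors.items():
        canon = canonicalize_tensor_name(name)
        if canon in out:
            # Two source names mapping to one canonical name is a hard error.
            raise ValueError(f"tensor name collision: {canon!r}")
        out[canon] = tensor
    return out


print(canonicalize_tensor_name("kokoro.predictor.F0.0.norm1.fc.weight"))
# pred.F0.0.adain1.weight
```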

Verified: fused libelizainference.so rebuilds clean (LLAMA_BUILD_KOKORO=ON) and
all BERT/predictor/text_enc require()/sanity names now resolve from the real
shipped GGUF's kokoro.* tensors.

Refs elizaOS/eliza#9588

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bert.* GGUF (#9588)

The CrispASR forward (a11adb5) vendored the runtime but not its
converters, so nothing in-repo produced the bert.*/pred.*/dec.gen.* GGUF
the loader requires — Kokoro silently fell back to OmniVoice/stub. Vendor
CrispStrobe/CrispASR's (MIT) convert-kokoro-to-gguf.py + voice converter,
with a torch-portability fix (tensor-or-tensor -> explicit is None).

Verified: hexgrad/Kokoro-82M -> 459-tensor bert.* GGUF + af_bella voice ->
kokoro-real-smoke loads + synthesizes 88800 samples @ 24kHz; ASR round-trip
of the output is an exact match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…egfault) (#9588)

canonicalize_tensor_name() only applied the .norm1.fc.->.adain1. / .norm2.fc.->
.adain2. rewrites in the 'decoder.' branch, not the 'predictor.' branch. But the
predictor F0/N AdainResBlk1d stacks carry the same AdaIN1d norms, named
norm1.fc/norm2.fc upstream (e.g. predictor.F0.0.norm1.fc.weight). Without the
rewrite they stayed pred.F0.<i>.norm1.fc.weight while run_stack() requires
pred.F0.<i>.adain1.weight, so they slipped past the soft sanity_check_weights and
surfaced as a NULL-tensor segfault in kokoro_adain_resblk -> ggml_mul_mat during
synthesis of the real shipped 457-tensor mainline GGUF.

Verified on Linux x64: rebuilt libelizainference, ran kokoro-real-smoke against
the real shipped kokoro-82m-v1_0-Q4_K_M.gguf bundle — the 'pred.F0.*.adain*
required tensor missing' errors and the kokoro_adain_resblk segfault are gone;
synthesis now progresses past the predictor to the decoder (where a separate
F16-im2col-vs-Q4-conv-quant issue remains, tracked in #9588).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path (#9588)

After the predictor-AdaIN mapping fix, synthesis of the real shipped Q4_K_M GGUF
hit GGML_ASSERT(src0->type == GGML_TYPE_F16) in ggml_compute_forward_im2col_f16.
Cause: ggml_conv_1d routes through an F16-destination im2col whose CPU kernel
requires an F16 weight, but decoder.F0_conv.weight and decoder.N_conv.weight ship
F32 in the mainline GGUF (the only 2 F32 conv kernels; the other 84 are F16).
Cast a non-F16 conv weight to F16 in small_conv1d (the F0/N conv helper) so it
matches the convention every other conv already uses.

Verified the assert is gone and the full pipeline now runs end-to-end producing
3.17s of 24kHz PCM. NOTE: audio CORRECTNESS is not yet verified — synthesized with
hand-assembled test artifacts (a raw->GGUF-converted voice + a vocab-augmented
GGUF), the output is constant HF noise (spectral centroid 7.3kHz, flat envelope),
not intelligible speech. This likely reflects the synthetic test inputs rather
than the fix (both fixes are unambiguous crash-eliminations), but it must be
re-verified against the real repackaged bundle (correct kokoro-voice GGUF + the
178-entry vocab embedded) before this is considered to produce correct speech.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…588)

Extends the F0/N-only F16 conv cast to all ggml_conv_1d sites (kokoro_conv1d_kd,
kokoro_conv1d_ks, and the inline text-enc/AdaIN/proj/asr_res sites), so an
all-F32 GGUF (e.g. a fresh convert_kokoro_pth_to_gguf.py output) loads without
GGML_ASSERT(src0->type==F16) in ggml_compute_forward_im2col_f16. The guard skips
already-F16 kernels, so it is zero-cost for the shipped F16/quantized models.

NOTE: this is robustness only — it does NOT fix the mainline-format noise.
Verified: a fresh .pth->in-tree-converter GGUF (correct 3-D F0/N dims) now
synthesizes 79200 samples without asserting, but the audio is still noise
(spectral centroid 7.3kHz, flat envelope — identical to the shipped artifact),
so the mainline-kokoro.* format is mishandled by this loader for a reason deeper
than conv dims/dtype. The CrispASR-format path remains the working route.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odel), remove Qwen3-ASR

Run the gemma4a USM-audio mmproj over the SHARED ctx->llm_model (mirroring
the vision describe_image path) instead of loading a separate Qwen3-ASR
LM. Replace the hardcoded Qwen <|im_start|>/<asr_text> chat scaffold +
stop-tokens with the Gemma chat template (<start_of_turn>... <__media__>
... <end_of_turn>); mtmd auto-inserts the gemma4a <|audio|> markers.
Legacy separate-LM path kept as a fallback. Public ABI unchanged.

Verified end-to-end through libelizainference.so (FFI, not mtmd-cli):
freeman.wav -> 'If you go into different cultures, they have different
concepts of creation...' (correct English, 18 words, PASS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mat noise) (#9588)

ROOT CAUSE of the desktop-Kokoro noise, found + verified end-to-end: the fused
loader's forward pass requires the weight matrices in F16. convert_kokoro_pth_to_gguf.py
emitted everything as F32, so the model loaded + synthesized but produced unintelligible
noise (not degraded speech). The shipped HF bundle is quantized → same noise.

_add_tensor now matches the working CrispASR export's dtype scheme exactly:
ndim>=2 weight matrices -> F16, 1-D biases/norms -> F32 (verified: CrispASR ships
112 2-D + 141 3-D tensors as F16 and all 206 1-D as F32, zero exceptions).
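
The dtype rule reduces to a check on tensor rank. A minimal sketch, assuming shapes are available as tuples; `pick_gguf_dtype` and the example tensor names are hypothetical and are not the converter's `_add_tensor`.

```python
from collections import Counter


def pick_gguf_dtype(shape: tuple) -> str:
    # Weight matrices (2-D and up) go to F16; 1-D biases/norms stay F32.
    return "F16" if len(shape) >= 2 else "F32"


# Hypothetical tensors, for illustration only.
example = {
    "linear.weight": (768, 768),    # 2-D matrix
    "conv.weight": (512, 512, 3),   # 3-D conv kernel
    "linear.bias": (768,),          # 1-D bias
    "norm.gamma": (512,),           # 1-D norm scale
}

tally = Counter(pick_gguf_dtype(shape) for shape in example.values())
print(dict(tally))  # {'F16': 2, 'F32': 2}
```

A tally like this is a quick way to compare a converter's output against the F16/F32 counts quoted for a known-good export.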

Verified: re-ran this converter on the canonical hexgrad/Kokoro-82M kokoro-v1_0.pth
-> output is F16:252/F32:205 -> synthesizes SPEECH (spectral centroid 3383Hz,
envelope-cv 1.314, identical to CrispASR's working output), vs all-F32 noise
(centroid 7269Hz, flat). No model surgery or loader workaround needed — the
converter dtype scheme is the whole fix.

To ship: regenerate the elizaos/eliza-1 bundle Kokoro GGUF with this converter
(it will be F16-weighted) instead of the current quantized/F32 artifact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…I + C-API embedding proof harness

The M4 LiteRT-LM backend targeted the C++ API (litert::lm::Engine, libc++ ABI
— fragile to link against the prebuilt lib). Re-target the ELIZA_ENABLE_LITERT
real path to the stable C API (litert_lm_*, c/engine.h): engine_settings_create
+ set_enable_speculative_decoding(true) for MTP + engine_create + create_session
on an NPU->GPU->CPU ladder; prefill -> detokenize+run_prefill+run_decode_async;
next -> pop one chunk from a thread-safe queue fed by LiteRtLmStreamCallback
(push-stream mapped to the FFI pull contract). No libc++ C++-ABI symbols. Stub
branch (gate off) unchanged.
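
The push-to-pull adaptation can be illustrated with a thread-safe queue. This Python sketch only mirrors the shape of the idea: `PullStream`, `fake_decode_async` and the sentinel are hypothetical stand-ins and do not call the LiteRT-LM C API.

```python
import queue
import threading

_DONE = object()  # sentinel marking end of stream


class PullStream:
    """Adapts a push-style stream callback to a pull-style next() call."""

    def __init__(self):
        self._chunks = queue.Queue()

    def on_chunk(self, text: str) -> None:
        # Called from the producer thread for every decoded chunk.
        self._chunks.put(text)

    def on_done(self) -> None:
        self._chunks.put(_DONE)

    def next(self, timeout=None):
        # Pop exactly one chunk; None signals that the stream has ended.
        item = self._chunks.get(timeout=timeout)
        if item is _DONE:
            self._chunks.put(_DONE)  # stay terminated for later calls
            return None
        return item


def fake_decode_async(stream: PullStream, chunks) -> threading.Thread:
    # Stand-in for an asynchronous decode that pushes chunks to a callback.
    def run():
        for chunk in chunks:
            stream.on_chunk(chunk)
        stream.on_done()

    worker = threading.Thread(target=run)
    worker.start()
    return worker
```

A caller loops on `next()` until it returns `None`, so the producer thread never has to know how the consumer paces itself.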

litert-capi-smoke.cpp: standalone linkage proof — loads a Gemma-4 .litertlm via
the C API and generates. Verified on Linux x86_64 against the prebuilt
liblitert-lm.so: prints 'The capital of France is Paris.'

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rness

Mirrors the LiteRT LLM backend for speech-to-text. litert_asr_transcribe takes
16kHz mono fp32 PCM -> in-memory PCM16 WAV -> base64 -> LiteRT-LM conversation
audio path (engine audio_backend=cpu + send_message with an audio blob +
'Transcribe' prompt) -> transcription text. Gated ELIZA_ENABLE_LITERT (stub when
off). Error contract honored (heap *out_error, no logging).
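
The PCM packaging step (fp32 samples → in-memory PCM16 WAV → base64) can be sketched with the Python standard library. The function name and the 32767 scaling factor are assumptions of this sketch; the C++ implementation may scale and clip differently.

```python
import base64
import io
import struct
import wave


def pcm_f32_to_wav_base64(samples, sample_rate: int = 16000) -> str:
    # Clip to [-1, 1], then scale to signed 16-bit little-endian.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    pcm16 = b"".join(struct.pack("<h", int(round(s * 32767))) for s in clipped)

    # Write a mono 16-bit WAV container into memory.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)

    # Base64-encode the whole file so it can travel as a text blob.
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

The resulting string is what a conversation-style audio API would receive as the audio blob, alongside the text prompt.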

Verified on Linux x86_64 against the prebuilt liblitert-lm.so: transcribes
jfk.wav -> 'And so my fellow Americans ask not what your country can do for you,
ask what you can do for your country.' This is the Qwen3-ASR replacement via the
Gemma USM audio encoder. Wiring into eliza_inference_asr_* (resolve .litertlm
from bundle dir, warm-engine cache) is the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ribe (warm engine)

The LiteRT ASR path was a standalone proof; this makes it the live ASR when a
.litertlm bundle is present. litert-asr now exposes a warm-engine handle
(engine_open loads the model+USM encoder once; engine_transcribe opens only a
fresh conversation per call). eliza_inference_asr_transcribe (gate ELIZA_ENABLE_LITERT)
probes <bundle>/text/*.litertlm, lazily opens+caches the engine on the context,
and delegates; otherwise falls through to the fused Qwen3-ASR path unchanged.
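
The probe, lazy-open and cache flow can be sketched as follows. All names are hypothetical, and the engine is modelled as a plain callable; only the `<bundle>/text/*.litertlm` probe and the fall-through behaviour come from the commit message.

```python
from pathlib import Path


class AsrContext:
    """Caches one warm engine per context; falls back when no bundle exists."""

    def __init__(self, bundle_dir, open_engine, fallback_transcribe):
        self.bundle_dir = Path(bundle_dir)
        self._open_engine = open_engine      # expensive: loads the model once
        self._fallback = fallback_transcribe  # legacy path
        self._engine = None

    def transcribe(self, pcm):
        if self._engine is None:
            # Probe <bundle>/text/*.litertlm on first use.
            models = sorted((self.bundle_dir / "text").glob("*.litertlm"))
            if not models:
                return self._fallback(pcm)
            self._engine = self._open_engine(models[0])
        # Warm path: reuse the cached engine for every later call.
        return self._engine(pcm)
```

Opening lazily keeps contexts that never transcribe from paying the model-load cost, and caching on the context means repeated calls only create per-call state.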

Verified: gate-on build links libelizainference.so with the ASR symbols folded
in + liblitert-lm.so DT_NEEDED; gate-off build is byte-for-byte unchanged (zero
litert_asr symbols, no liblitert-lm dep). Device-verified earlier on Pixel 9a.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>