Local inference via llama.cpp is slow.
For example, with Qwen3.5 2b on my 8 GB Snapdragon 8 Gen 3 device,
asking a simple question like "what's the capital of ....",
Operit takes at least 1 minute, while Google's AI Edge Gallery app with the same model answers in under 10 seconds.
My guess is that the llama.cpp build you ship runs on the CPU only, while Google's app can use the GPU or NPU.
I suggest updating llama.cpp (or enabling a GPU backend) and making it clear which hardware the model will run on before setup, because CPU-only inference on these low-powered devices is essentially not worth the trouble.
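For reference, upstream llama.cpp already ships GPU backends that work on mobile SoCs (Vulkan, and an OpenCL backend targeting Adreno GPUs like the one in the Snapdragon 8 Gen 3). A rough sketch of what enabling this could look like, assuming a standard upstream build (the model filename here is just a placeholder):

```shell
# Build llama.cpp with the Vulkan backend enabled
# (the OpenCL backend, -DGGML_OPENCL=ON, is the Adreno-focused alternative)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# At run time, offload model layers to the GPU with -ngl;
# a large value offloads as many layers as fit
./build/bin/llama-cli -m qwen-2b.gguf -ngl 99 -p "What's the capital of France?"
```

If the app exposed something like the `-ngl` (GPU layer offload) setting and reported which backend was actually selected, users could see up front whether they're getting CPU-only inference.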