After recently optimizing my vLLM startup, I moved on to optimizing vLLM's performance during inference. The thing is, it had abysmal performance, somewhere around 2-4 t/s, or even 0.5 t/s with 180k context. That was really terrible, and I started to think that I won't be able to get better performance out of the Strix Halo inside my Framework Desktop.
But… I remembered people praising it and its great usability with agents and Qwen models. How is that possible if it has such terrible performance?
I tested the new Qwen 3.6 MoE in GGUF format with MTP (until recently it was not available in llama.cpp). On Vulkan I was able to reach as much as 60 t/s! Wow!
How is that possible? I understand that llama.cpp is optimized for single-user usage, but still… 60 t/s vs 4? That is a gigantic difference.
And then I tried it with an agent, and it failed often, almost all the time, on more complex tool usage such as regex replace.
I did another web search and some people were praising the dense model as more correct. I decided to give it a go, and it works!
It is much slower, reaching only 20-30 t/s in llama.cpp, but it is better to have a slower model that very often outputs correct text than a faster one that gets it wrong, like the MoE.
Just to have a comparison, I tested both vLLM and llama.cpp with Qwen 3.6 27B. First, vLLM:
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 3473.48
Total input tokens: 2915
Total generated tokens: 25600
Request throughput (req/s): 0.03
Output token throughput (tok/s): 7.37
Peak output token throughput (tok/s): 4.00
Peak concurrent requests: 2.00
Total token throughput (tok/s): 8.21
---------------Time to First Token----------------
Mean TTFT (ms): 703.98
Median TTFT (ms): 718.68
P99 TTFT (ms): 737.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 133.42
Median TPOT (ms): 124.01
P99 TPOT (ms): 158.99
---------------Inter-token Latency----------------
Mean ITL (ms): 367.61
Median ITL (ms): 355.41
P99 ITL (ms): 1000.07
==================================================
Less than 10 t/s. And this is for small prompts of about 20-30 tokens. With over 100k context it would most probably be around 4 t/s.
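As a sanity check, the headline rates in the report can be recomputed from the raw counts. This is only a small sketch in Python; the function name `summarize_run` and the dictionary keys are my own labels, not anything from the benchmark tool, and I am assuming the report simply divides token counts by the wall-clock benchmark duration.

```python
def summarize_run(n_requests, input_tokens, output_tokens, duration_s):
    """Derive headline rates from the raw counts of one benchmark report.

    Assumption: rates are plain totals divided by wall-clock duration.
    """
    return {
        "req_per_s": n_requests / duration_s,
        "output_tok_per_s": output_tokens / duration_s,
        "total_tok_per_s": (input_tokens + output_tokens) / duration_s,
        "avg_prompt_tokens": input_tokens / n_requests,
        "avg_output_tokens": output_tokens / n_requests,
    }


# Raw counts copied from the first report above.
vllm = summarize_run(
    n_requests=100,
    input_tokens=2915,
    output_tokens=25600,
    duration_s=3473.48,
)

for name, value in vllm.items():
    print(f"{name}: {value:.2f}")
```

The recomputed output and total throughput match the reported 7.37 and 8.21 tok/s, and the average prompt comes out at roughly 29 tokens with 256 generated tokens per request, so this really is a short-prompt test.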
But then I tested llama.cpp.
============ Serving Benchmark Result ============
Successful requests: 100
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 1029.14
Total input tokens: 2915
Total generated tokens: 25600
Request throughput (req/s): 0.10
Output token throughput (tok/s): 24.88
Peak output token throughput (tok/s): 52.00
Peak concurrent requests: 2.00
Total token throughput (tok/s): 27.71
---------------Time to First Token----------------
Mean TTFT (ms): 315.49
Median TTFT (ms): 194.85
P99 TTFT (ms): 1046.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 39.10
Median TPOT (ms): 29.51
P99 TPOT (ms): 120.42
---------------Inter-token Latency----------------
Mean ITL (ms): 38.82
Median ITL (ms): 0.02
P99 ITL (ms): 362.10
==================================================
About 25 t/s of output, almost 28 t/s in total. This is a huge difference. And llama.cpp can keep a similar token throughput even with 200k context. Yes, only for one user, but on the other hand I can run two models like that and have the agent use one and my AI assistant use the other.
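To put a number on the gap, here is a rough sketch that compares the two reports. The values are copied from the benchmark output above; the `runs` dictionary and its key names are just my own naming for illustration.

```python
# Values copied from the two benchmark reports above.
runs = {
    "vllm": {
        "duration_s": 3473.48,
        "output_tok_per_s": 7.37,
        "mean_ttft_ms": 703.98,
        "mean_tpot_ms": 133.42,
    },
    "llama.cpp": {
        "duration_s": 1029.14,
        "output_tok_per_s": 24.88,
        "mean_ttft_ms": 315.49,
        "mean_tpot_ms": 39.10,
    },
}

slow, fast = runs["vllm"], runs["llama.cpp"]

# How many times faster llama.cpp is on each metric.
throughput_speedup = fast["output_tok_per_s"] / slow["output_tok_per_s"]
duration_speedup = slow["duration_s"] / fast["duration_s"]
ttft_speedup = slow["mean_ttft_ms"] / fast["mean_ttft_ms"]

# Decode rate implied by mean time per output token (1000 ms / TPOT).
decode_rate = {name: 1000.0 / run["mean_tpot_ms"] for name, run in runs.items()}

print(f"output throughput speedup: {throughput_speedup:.2f}x")
print(f"benchmark duration speedup: {duration_speedup:.2f}x")
print(f"mean TTFT speedup: {ttft_speedup:.2f}x")
for name, rate in decode_rate.items():
    print(f"{name} decode rate from mean TPOT: {rate:.1f} tok/s")
```

So on the same workload llama.cpp is roughly 3.4x faster on output throughput and a bit over 2x faster on time to first token.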
Still, this is a very peculiar situation: why is there such a difference? I will test it more, but even if I messed up some of the kernel parameters for GTT, which I do not think I did, with the Docker image I am using it should OOM rather than get this slow. Also, with the Docker image all the necessary libraries should already be bundled.
I will keep testing it, but for now I am again mostly using llama.cpp for inference.