After recently optimizing my vLLM startup, I started optimizing vLLM's performance during inference. The thing is, it had abysmal performance, around 2-4 t/s, or even 0.5 t/s with 180k context. That was really terrible, and I started to think that I won't be able to get better performance out of the Strix Halo inside my Framework Desktop.
But… I remembered people praising it and its great usability with agents and Qwen models. How is that possible if it has such terrible performance?
I tested the new version of Qwen 3.6 MoE in GGUF format with MTP (it was not available in llama.cpp until recently). On Vulkan I was able to achieve as much as 60 t/s! Wow!
How is that possible? I understand that llama.cpp is optimized for single-user usage, but still… 60 t/s vs 4? That is a gigantic difference.
And then I tried it with an agent, and it was often wrong, almost all the time with more complex tool usage such as regex replace.
I did another web search, and some people were praising the dense model as more correct. I decided to give it a go, and it works!
It is much slower, achieving only 20-30 t/s in llama.cpp, but it is better to have a slower model that very often outputs correct text than a faster one that gets it wrong, like the MoE.
Just to have a comparison, I tested both vLLM and llama.cpp with Qwen 3.6 27B. First vLLM:
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  3473.48
Total input tokens:                      2915
Total generated tokens:                  25600
Request throughput (req/s):              0.03
Output token throughput (tok/s):         7.37
Peak output token throughput (tok/s):    4.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          8.21
---------------Time to First Token----------------
Mean TTFT (ms):                          703.98
Median TTFT (ms):                        718.68
P99 TTFT (ms):                           737.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.42
Median TPOT (ms):                        124.01
P99 TPOT (ms):                           158.99
---------------Inter-token Latency----------------
Mean ITL (ms):                           367.61
Median ITL (ms):                         355.41
P99 ITL (ms):                            1000.07
==================================================
Less than 10 t/s. And this is for small prompts of about 20-30 tokens. For over 100k context it would most probably be 4 t/s.
But then I tested llama.cpp.
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  1029.14
Total input tokens:                      2915
Total generated tokens:                  25600
Request throughput (req/s):              0.10
Output token throughput (tok/s):         24.88
Peak output token throughput (tok/s):    52.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          27.71
---------------Time to First Token----------------
Mean TTFT (ms):                          315.49
Median TTFT (ms):                        194.85
P99 TTFT (ms):                           1046.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.10
Median TPOT (ms):                        29.51
P99 TPOT (ms):                           120.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.82
Median ITL (ms):                         0.02
P99 ITL (ms):                            362.10
==================================================
Around 25 t/s of output, almost 28 t/s in total. This is a huge difference. And llama.cpp can keep a similar token throughput even with 200k context. Yes, only for one user, but on the other hand I can run two models like that and have the agent use one and my AI assistant use the other.
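To double-check the headline numbers, the throughput lines can be recomputed from the duration and token counts that both reports print. This is only a small sketch: the `RUNS` dictionary and the function names are mine, the figures are copied from the two reports, and I am assuming output throughput is simply generated tokens divided by benchmark duration.

```python
# Figures copied from the two benchmark reports (vLLM first, then llama.cpp).
RUNS = {
    "vLLM": {"duration_s": 3473.48, "input_tokens": 2915, "output_tokens": 25600},
    "llama.cpp": {"duration_s": 1029.14, "input_tokens": 2915, "output_tokens": 25600},
}


def throughput(run):
    """Return (output tok/s, total tok/s) for one benchmark run."""
    out = run["output_tokens"] / run["duration_s"]
    total = (run["input_tokens"] + run["output_tokens"]) / run["duration_s"]
    return out, total


def speedup(fast, slow):
    """Ratio of output throughput between two runs."""
    return throughput(fast)[0] / throughput(slow)[0]


if __name__ == "__main__":
    for name, run in RUNS.items():
        out, total = throughput(run)
        print(f"{name}: {out:.2f} output tok/s, {total:.2f} total tok/s")
    ratio = speedup(RUNS["llama.cpp"], RUNS["vLLM"])
    print(f"llama.cpp vs vLLM: {ratio:.2f}x")
```

The recomputed values match the reported throughput lines, and the ratio comes out at a bit more than 3x in favour of llama.cpp for this single-user workload.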
Still, this is a very peculiar situation: why is there such a difference? I will test it more, but even if I messed up some of the kernel parameters for GTT, which I do not think I did, with the Docker image I am using it should OOM rather than get so slow. Also, with a Docker image all the necessary libraries should be bundled.
I will keep testing, but right now I am again mostly using llama.cpp for inference.
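One way to rule out the GTT settings is to read what the kernel actually exposes. The sketch below assumes the amdgpu driver is in use and publishes its `mem_info_gtt_total` and `mem_info_gtt_used` counters (in bytes) under `/sys/class/drm/card*/device/`. The `read_gtt` helper is a name I made up for illustration, and inside a container the sysfs view may differ from the host.

```python
from pathlib import Path


def read_gtt(drm_root="/sys/class/drm"):
    """Return {card_name: (used_bytes, total_bytes)} for cards exposing GTT counters.

    Assumes the amdgpu sysfs layout; cards without these files are skipped.
    """
    result = {}
    for total_file in sorted(Path(drm_root).glob("card*/device/mem_info_gtt_total")):
        device_dir = total_file.parent
        used_file = device_dir / "mem_info_gtt_used"
        total = int(total_file.read_text().strip())
        # Treat a missing "used" counter as 0 rather than failing.
        used = int(used_file.read_text().strip()) if used_file.exists() else 0
        result[device_dir.parent.name] = (used, total)
    return result


if __name__ == "__main__":
    gib = 1024 ** 3
    for card, (used, total) in read_gtt().items():
        print(f"{card}: GTT {used / gib:.1f} GiB used of {total / gib:.1f} GiB")
```

If the total reported here is far below what the model needs, the GTT parameters are the first suspect. If it looks right, the slowdown is probably somewhere else.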

