llama.cpp vs vLLM running Qwen 3.6 27B

After recently optimizing my vLLM startup, I started optimizing vLLM's performance during inference. The thing is, it had abysmal performance, somewhere around 2-4 t/s, or even 0.5 t/s with 180k context. That was really terrible, and I started to think that I would not be able to get better performance out of the Strix Halo inside my Framework Desktop.

But… I remembered people praising it and its great usability with agents and Qwen models. How is that possible if it has such terrible performance?

I tested the new version of Qwen 3.6 MoE in GGUF format with MTP (which was not available in llama.cpp until recently). On Vulkan I was able to reach as much as 60 t/s! Wow!

How is that possible? I understand that llama.cpp is optimized for single-user usage, but still… 60 t/s vs 4? That is a gigantic difference.

And then I tried it with an agent, and it failed almost all the time on more complex tool usage, like regex replace for example.

I did another web search, and some people were praising the dense model as more correct. I decided to give it a go, and it works!

It is much slower, reaching only 20-30 t/s in llama.cpp, but it is better to have a slower model that outputs correct text most of the time than a faster one that gets it wrong, like the MoE.

For comparison, I benchmarked both vLLM and llama.cpp with Qwen 3.6 27B. First, vLLM:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  3473.48
Total input tokens:                      2915
Total generated tokens:                  25600
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         7.37      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          8.21      
---------------Time to First Token----------------
Mean TTFT (ms):                          703.98    
Median TTFT (ms):                        718.68    
P99 TTFT (ms):                           737.81    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.42    
Median TPOT (ms):                        124.01    
P99 TPOT (ms):                           158.99    
---------------Inter-token Latency----------------
Mean ITL (ms):                           367.61    
Median ITL (ms):                         355.41    
P99 ITL (ms):                            1000.07   
==================================================

That is less than 10 t/s, and this is with small prompts of about 20-30 tokens each. With over 100k of context it would most probably drop to around 4 t/s.
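As a quick sanity check, the headline numbers of the vLLM run can be recomputed from the raw totals in the table above. This is only a small sketch of the arithmetic; the variable names are my own and the values are copied from the benchmark output.

```python
# Values copied from the vLLM benchmark table above.
requests = 100
input_tokens = 2915
generated_tokens = 25600
duration_s = 3473.48

# Average prompt and response size per request.
avg_prompt_tokens = input_tokens / requests
avg_output_tokens = generated_tokens / requests

# Throughput is simply tokens divided by wall-clock time.
output_tps = generated_tokens / duration_s
total_tps = (input_tokens + generated_tokens) / duration_s

print(f"avg prompt tokens:  {avg_prompt_tokens:.2f}")
print(f"avg output tokens:  {avg_output_tokens:.0f}")
print(f"output throughput:  {output_tps:.2f} tok/s")
print(f"total throughput:   {total_tps:.2f} tok/s")
```

The average prompt comes out at roughly 29 tokens, which is where the "20-30 tokens" figure comes from, and the two throughput values match what the benchmark reports.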

But then I tested llama.cpp.

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  1029.14   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         24.88     
Peak output token throughput (tok/s):    52.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          27.71     
---------------Time to First Token----------------
Mean TTFT (ms):                          315.49    
Median TTFT (ms):                        194.85    
P99 TTFT (ms):                           1046.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.10     
Median TPOT (ms):                        29.51     
P99 TPOT (ms):                           120.42    
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.82     
Median ITL (ms):                         0.02      
P99 ITL (ms):                            362.10    
==================================================

About 25 t/s of output, or almost 28 t/s in total. This is a huge difference. And llama.cpp can keep a similar token throughput even with 200k context. Yes, only for one user, but on the other hand I can run two models like that and have the agent use one and my AI assistant use the other.
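To put a number on the gap, here is a minimal sketch that compares the two runs using only figures from the tables above. The helper name `output_throughput` is made up for this example; the decode rate derived from mean TPOT is just `1000 / TPOT_ms` and ignores time to first token.

```python
def output_throughput(generated_tokens: int, duration_s: float) -> float:
    """Generated tokens per second over the whole benchmark run."""
    return generated_tokens / duration_s


# Same workload in both runs: 100 requests, 25600 generated tokens.
vllm_tps = output_throughput(25600, 3473.48)
llama_tps = output_throughput(25600, 1029.14)

# How many times faster llama.cpp finished the same work.
speedup = llama_tps / vllm_tps

# Decode rate implied by mean TPOT (milliseconds per output token).
vllm_decode_tps = 1000 / 133.42
llama_decode_tps = 1000 / 39.10

print(f"vLLM:      {vllm_tps:.2f} tok/s (decode ~{vllm_decode_tps:.1f} tok/s)")
print(f"llama.cpp: {llama_tps:.2f} tok/s (decode ~{llama_decode_tps:.1f} tok/s)")
print(f"speedup:   {speedup:.2f}x")
```

Both ways of looking at it give roughly the same picture: llama.cpp is a bit more than three times faster than vLLM on this setup.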

Still, this is a very peculiar situation: why is there such a difference? I will test it more, but even if I messed up some of the kernel parameters for GTT, which I do not think I did, with the Docker image I am using it should rather OOM than get this slow. Also, with the Docker image, all the necessary libraries should already be bundled.

I will keep testing, but right now I am again mostly using llama.cpp for inference.
