Swapping vLLM for llama.cpp

Yesterday I posted about vLLM being nice but optimized for a different usage than the one I am really interested in. I commented on a Llama-swap GitHub issue asking for help with my problem of vLLM being shut down before it actually started. While writing that comment I was thinking about llama.cpp and its inability to run Qwen 3.5. It was at the beginning of March that I started (again) experimenting with LLMs. I tried to run Qwen models with the release version of llama.cpp and it failed with 'model architecture of Qwen3 is not found' (or something like that; I do not really remember). Because of that I switched to vLLM: even though it could not run Qwen 3.5, it could run Qwen 3 just fine from the Docker image. SGLang did not seem like a good choice for my needs.

Because of that I was still using vLLM when I switched to running models on the Framework Desktop motherboard. It was totally fine for about two weeks, until I started experimenting with Llama-swap yesterday and it started to pain me how slow vLLM is to start.

While writing the comment on the Llama-swap issue I thought: "it has actually been a few weeks, maybe it works now…". I decided to try it out again.

I downloaded a new release of llama.cpp for ROCm and tried to run the main model I am currently using for inference: Qwen/Qwen3.5-35B-A3B. It worked.

And it was quick to start! Really quick!

And it was working almost the same!

And it was using less memory!

Somehow chatting with the model felt a bit worse. Maybe it is how llama.cpp works, or the quantisation was worse, or the settings were a bit different, but… it was starting in several seconds! Outstanding.
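For the record, launching it looks roughly like this. Treat this as a sketch: the model path is a placeholder, and the flags are the ones I believe llama-server supports (`-ngl` for GPU layer offload, `-c` for context size).

```shell
# Minimal llama-server launch sketch; model path is a placeholder.
# -ngl 99 offloads all layers to the GPU, -c sets the context size.
./llama-server \
  -m ./models/qwen3.5-35b-a3b-q4_k_m.gguf \
  -ngl 99 \
  -c 32768 \
  --host 127.0.0.1 --port 8080
```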

I looked for a benchmarking tool for OpenAI-compatible API models and found "LLM API Throughput Benchmark". It does not seem to be the best thing out there, but I just wanted a rough idea of the performance of vLLM and llama.cpp running the same models. I ran the 2B version of Qwen in both vLLM and llama.cpp:
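For a quick manual sanity check, one can also hit the same OpenAI-compatible endpoint directly. This is a sketch, not what the benchmark tool does internally; the URL and model name are assumptions that depend on your setup.

```shell
# Rough single-request timing against an OpenAI-compatible server.
# URL and model name are assumptions; adjust to your setup.
time curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/Qwen3.5-2B",
        "messages": [{"role": "user", "content": "Count to twenty."}],
        "max_tokens": 512
      }'
```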

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

And the same in llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

Ok. You can see the problem immediately: llama.cpp can't run *any* queries in parallel at all. But some parameter juggling fixed that, to some extent. I added the parameter `-np 4` and it made things much better.
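As far as I understand, `-np` sets the number of parallel slots llama-server handles, and the `-c` context size is split across those slots, so it may need to grow accordingly. A sketch (model path is a placeholder):

```shell
# 4 parallel slots; with -np the -c context is divided between slots,
# so each of the 4 slots here gets 32768/4 = 8192 tokens of context.
./llama-server \
  -m ./models/qwen3.5-2b-q4_k_m.gguf \
  -ngl 99 -c 32768 -np 4 \
  --port 8080
```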

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:34:54 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.46 |     224.32 |        0.17 |        0.17 | 100.00% |     3.19 |
|    2 |   98.97 |     237.95 |        0.32 |        0.32 | 100.00% |     4.41 |
|    4 |  126.60 |     249.43 |        0.61 |        0.61 | 100.00% |     6.89 |
|    8 |  113.77 |      36.54 |        0.65 |        8.32 | 100.00% |    15.33 |
|   16 |  112.97 |      25.55 |        0.64 |       23.80 | 100.00% |    30.88 |
|   32 |  112.79 |      22.17 |        0.64 |       54.86 | 100.00% |    61.85 |
|   64 |  112.36 |      20.76 |        0.66 |      117.14 | 100.00% |   124.17 |
|  128 |  109.85 |      19.69 |        0.67 |      247.08 | 100.00% |   254.03 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

I ran the same tests with Qwen 0.6B, a smaller model, and it was even more surprising.

Llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:51:52 UTC+0
################################################################################
Model: tiny                      | Latency: 0.00 ms
Input: 36                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |  125.41 |     360.00 |        0.10 |        0.10 | 100.00% |     4.08 |
|    2 |  166.04 |     360.00 |        0.20 |        0.20 | 100.00% |     6.17 |
|    4 |  189.06 |     288.00 |        0.50 |        0.50 | 100.00% |    10.83 |
|    8 |  180.52 |      24.55 |        0.48 |       11.73 | 100.00% |    22.69 |
|   16 |  169.31 |      15.71 |        0.47 |       22.91 |  62.50% |    30.24 |
|   32 |  169.91 |      15.76 |        0.44 |       22.84 |  31.25% |    30.13 |
|   64 |  168.29 |      15.59 |        0.49 |       23.09 |  15.62% |    30.42 |
|  128 |  168.84 |      15.62 |        0.48 |       23.05 |   7.81% |    30.32 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:55:38 UTC+0
################################################################################
Model: tiny-vllm                 | Latency: 0.20 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   60.45 |     635.45 |        0.06 |        0.06 | 100.00% |     8.47 |
|    2 |  101.16 |     330.72 |        0.23 |        0.23 | 100.00% |    10.12 |
|    4 |  195.48 |    1904.76 |        0.08 |        0.08 | 100.00% |    10.48 |
|    8 |  364.67 |    1790.34 |        0.13 |        0.17 | 100.00% |    11.23 |
|   16 |  418.69 |    3460.84 |        0.07 |        0.11 |  62.50% |    12.23 |
|   32 |  420.42 |    3807.62 |        0.10 |        0.10 |  31.25% |    12.18 |
|   64 |  418.69 |    3460.84 |        0.07 |        0.11 |  15.62% |    12.23 |
|  128 |  433.56 |    3460.84 |        0.07 |        0.11 |   7.81% |    11.81 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

It seems that for a lower number of concurrent users llama.cpp was actually doing a lot better. Of course, at 4 requests it is almost the same, and above that it performs a lot worse, but that is not the use case I am worried about at all.
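To put a number on that single-user gap, using the Gen TPS figures from the 0.6B tables above:

```shell
# Single-request Gen TPS from the 0.6B runs above:
# llama.cpp 125.41 t/s vs vLLM 60.45 t/s.
ratio=$(awk 'BEGIN { printf "%.1f", 125.41 / 60.45 }')
echo "llama.cpp is ${ratio}x faster for a single user"  # → 2.1x
```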

Then I checked memory consumption. It is important to note that with vLLM, if you do not forbid it, vLLM will assign all the memory to itself. You could run one model with 2B parameters and it would consume the entire VRAM, all 120GB of it. This makes no sense for my scenario. I changed that when I was working on voice recognition for my assistant, as it requires two models. I fixed it by assigning the lowest possible value to `--gpu-memory-utilization` that would still work. For Qwen 35B it was about 33%, which is about 40GB. But for llama.cpp it is much less.
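For reference, capping vLLM looks roughly like this; `--gpu-memory-utilization` is a real vLLM flag (fraction of GPU memory to pre-allocate, 0.9 by default), while the exact invocation is a sketch of my setup:

```shell
# Cap vLLM's VRAM pre-allocation at ~33% instead of the 0.9 default.
vllm serve Qwen/Qwen3.5-35B-A3B \
  --gpu-memory-utilization 0.33
```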

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              189/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     148812  llama-server           1.9 GB   40.4 MB     1.9 GB  N/A     |
+------------------------------------------------------------------------------+

2GB! How?! Why?! I need to do more testing, as this may be only some bare minimum and during actual usage it may be much more. But if it is true, combined with the much better startup time, it has convinced me that vLLM is probably not worth it.

This is how memory usage looks while running Qwen 35B via vLLM:

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              151/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     149044  python3.12             5.9 MB   62.5 KB    16.0 EB  N/A     |
|    0     149216  python3.12            37.6 GB    3.2 MB    38.5 GB  N/A     |
+------------------------------------------------------------------------------+

So llama.cpp starts much quicker and uses less memory? There is no discussion here. That is it. I am switching to llama.cpp.

Of course I need to do more testing and more benchmarks, but this seems to be an obvious decision: llama.cpp is the better choice for me, for now.
