Bonding Network interfaces in Debian Forky

Lately I have been experimenting with Nanobot and running self-hosted language models on the Desktop Framework.

Lately I was trying to run the Qwen 3.5 397B model on this small PC, and it was a bit hard because I just do not have enough disk space on it; I bought only a small NVMe SSD. With current prices, I planned to use disk space on another server I have at my home as storage.

In the end I was able to load the model, but via a USB hard disk, which is slow but still faster than moving 100GB over my network. Because of that I am planning to modernize my network to allow faster transfers of large quantities of data.

The first step was to test network interface bonding. Right now my storage server, which I am also using to run some services, uses only one of its NICs as the primary one for everything. It has two others that are also connected, but they have different IPs, so they are not used for HTTP, for example. Utilizing all three as one would allow a theoretical transfer speed of 22400Mb/s. This is not enormous bandwidth, but since the hardware is already there and I do not have to buy expensive PCIe network cards with optical sockets, it should be an easy win.
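The back-of-the-envelope math, assuming the two Ethernet ports are 10 GbE and the Wi-Fi links at 2.4 Gb/s (my assumption about the link speeds, not something I measured):

```python
# Theoretical bond bandwidth: two 10 GbE NICs plus Wi-Fi at 2.4 Gb/s
# (assumed link speeds).
links_mbit = [10_000, 10_000, 2_400]
total_mbit = sum(links_mbit)          # combined megabits per second
total_mbyte_per_s = total_mbit / 8    # megabits -> megabytes per second
print(total_mbit, total_mbyte_per_s)  # 22400 2800.0
```

So at best this is about 2.8GB/s, before any bonding overhead.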

My server has an IPMI management interface, so I checked if it still works after 6 months or so. Of course I could not log in. It was showing a JS message that the session was invalidated right after login. I do not know what causes this, but I think I probably set some limitation on who can use it, by IP or MAC address.

I still decided to go ahead, but maybe bond only two out of the three at first, just to check if it works, and then bond them all.

At first I was using this tutorial from the Debian docs. Since Debian uses systemd, I skipped the section about the ifenslave package.

First I created a file named /etc/systemd/network/bond.netdev (the [NetDev] section lives in a .netdev file, not a .network one) and put the following content in it:

[NetDev]
Name=bond1
Description=LAG/Bond to a switch
Kind=bond

[Bond]
Mode=802.3ad

Then I created another file, /etc/systemd/network/bond.network, that assigned interfaces to the new bond.

[Match]
Name=enp36s0f0
Name=wlp38s0

[Network]
Bond=bond1

I thought that by adding only one Ethernet card and the Wi-Fi to the bond, I would at least be able to connect to the server via the second Ethernet card if things went sideways.

The third file had the configuration for obtaining an IP:

[Match]
Name=bond1

[Network]
DHCP=yes

This seemed very unintuitive. Why do I have to create two .network files for the same bonded network? Why can’t it all be in one file? As far as I can tell, each file’s [Match] section selects different interfaces – one matches the physical NICs to enslave, the other matches the bond itself – so they have to be separate. Anyway, I named the second one bond-ip.network and enabled systemd-networkd:

sudo systemctl enable systemd-networkd

Everything went great and I rebooted.

And… it did not work. The list of network interfaces returned by ip addr show was showing the bond as DOWN:

bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state DOWN group default

I tried a few other things, like removing or adding network interfaces and using wildcards in the bond.network file:

[Match]
Name=enp*

I tried adding a static IP instead of DHCP in the bond-ip.network file:

[Match]
Name=bond1

[Network]
Address=10.0.0.10/24
Gateway=10.0.0.1
DNS=10.0.0.1

I rebooted a couple of times, but it did not work. I tested a few commands like:

sudo ip link set bond0 up
sudo dmesg | grep bond
sudo journalctl | grep bond
sudo systemctl status systemd-networkd

But they showed the interface going up and then immediately going back down, without any errors in the logs. No details why.

At this point I did some digging around the internet about bonding on systemd, Linux, Debian, and similar topics. Nothing was really helpful. I remembered the Arch Linux wiki being very informative on other problems I had encountered previously. I checked, and those docs have an entire section dedicated to bonding network interfaces under systemd.

I read through it, and it seemed better than the Debian docs: more detailed and more up to date. I followed the instructions, with the small change of bonding everything at once.

I created the file /etc/systemd/network/30-bond0.netdev, but changed the mode to 802.3ad:

[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
PrimaryReselectPolicy=always
MIIMonitorSec=1s

Then I created the file /etc/systemd/network/30-eth0-bond0.network:

[Match]
Name=enp36s0f0

[Network]
Bond=bond0
PrimarySlave=true

And a similar one for enp36s0f1, named /etc/systemd/network/30-eth1-bond0.network, plus another named 30-wifi-bond0.network for the wlp38s0 Wi-Fi interface.

Another file was needed for the bonded network itself, /etc/systemd/network/30-bond0.network:

[Match]
Name=bond0

[Link]
RequiredForOnline=routable

[Network]
BindCarrier=enp36s0f0 enp36s0f1 wlp38s0
DHCP=yes

Then I restarted systemd-networkd:

sudo systemctl restart systemd-networkd

And it worked! Kind of. I lost the connection to the server, so something must have gone up. Or down. I could not connect to it via SSH anymore. HTTP stopped working. I checked the DHCP leases on my router for a new IP I had not seen before. A new lease would mean the bonded interface was assigned a new IP based on the new MAC address it was using.

There were none.

Of course I should have thought twice before actually doing this. Leave one interface out of the bond at first… but yeah… you have to jump all the way in at once!

Anyway, at that point I had a headless server that could not connect to anything and was not responding. It has no monitor attached and does not even have a GPU, beside a really basic one with a single VGA socket that is just an IPMI pass-through.

  • To add a monitor I would have to dismount one of my office monitors attached to the wall – the only one with a VGA socket
  • I could attach an old GPU and use the portable monitor I bought for my Raspberry Pi
  • I could try to attach some external USB network card
  • I could try to fix the IPMI

I tried to connect a portable USB Wi-Fi card, but it did not work. It might be broken, or maybe it needs some extra driver. I also tried connecting to the IPMI from another machine and from another browser, since people on Reddit reported that it might help. It did not. Clearing the cookies did not fix the issue either.

The only thing that worked was connecting to the IPMI via SSH. That SSH server is pretty slow, and those felt like the longest 10 seconds of my life.

After login, it welcomed me with a critical error.

			>> SMASHLITE Scorpio Console <<
cat: write error: No space left on device
[10745 : 10745 CRITICAL][oemsystemlog.c:332]Get Process Name by PID failed

But it worked. I did some digging around the menus, and they seemed useless. Just a bunch of information about IPMI, power, and fan settings.

->cd system/
COMMAND COMPLETED : cd system/
 ufip=/system

->show
COMMAND COMPLETED : show
 ufip=/system
  Targets:

      power1/
      cooling1/
      cooling2/
      cooling3/
      cooling4/
      cooling5/
      cooling6/
      cooling7/
      cooling8/
      cooling9/
      cooling10/
      chassis1/
      logs/
      snmp1/
      snmp2/
      snmp3/
      snmp4/
      snmp5/
      snmp6/
      snmp7/
      snmp8/
      snmp9/
      snmp10/
      snmp11/
      snmp12/
      snmp13/
      snmp14/
      snmp15/
      summary/

  Properties:
      Location=(null)
      Manufacturer=(null)
      ProductName=Pro WS WRX80E-SAGE SE WIFI
      ProductPartNumber=(null)
      SN=(null)
      Firmware=1.52.0
      Health=OK
      EnTemp(C)=0
      OperatorPassword=xxxxxxxx
      AdminPassword=xxxxxxxx
      IPMode=static
      IP=10.0.0.7
      NetMask=255.255.255.0
      GateWay=10.0.0.1
      NodePowerGap(s)=2
      Time=2026-04-01 10:32:40
      SyslogEnable=Disable
      SyslogServerIP=0.0.0.0
      SyslogUDPPort=0

  Verbs:
      cd
      exit
      help
      reset
      set
      show
      version

Just a bunch of system information properties. The only command that looked even remotely helpful was reset.
I executed it, and after a minute or so, my router’s DHCP settings reported a new IP being used by multiple machines. That is exactly what I was expecting to see if the bonding worked!

I pinged it, and after the ping was successful I tried to connect via SSH using my usual credentials. It worked. I checked the network interfaces and it was OK.

4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000                                                                                                                                                                     
    link/ether b6:46:8a:50:11:fb brd ff:ff:ff:ff:ff:ff                                                                                                                                                                                                                         
    inet 10.0.0.109/24 metric 1024 brd 10.0.0.255 scope global dynamic bond0                                                                                                                                                                                                   
       valid_lft 85801sec preferred_lft 85801sec                                                                                                                                                                                                                               
    inet6 fe80::b446:8aff:fe50:11fb/64 scope link proto kernel_ll                                                                                                                                                                                                              
       valid_lft forever preferred_lft forever           

The IP was incorrect, but I changed it in the DHCP settings to the usual one I use for that machine. After a DHCP restart the IPs were reassigned, and the router showed the correct ones. Strangely, even though SSH, HTTP, and pings were using the correct IPs, the server side was still reporting the old ones assigned to one Ethernet interface and the Wi-Fi.

Running Qwen 3.5 397B on Desktop Framework

Today I was finally able to download and run the largest Qwen 3.5 model. I downloaded the smallest quant, even though a slightly bigger one would probably fit too, but I also want to run a few smaller models at the same time, so 107GB for UD-IQ1_M is the most I could spare.

I downloaded it and was able to run it with llama.cpp using the following command:

./llama-server \
   -hf unsloth/Qwen3.5-397B-A17B-GGUF:UD-IQ1_M \
   --ctx-size 32000 \
   --no-warmup \
   --no-mmap \
   --flash-attn on \
   --n-gpu-layers 1 \
   --host 0.0.0.0 \
   --port $1 \
   --direct-io \
   --jinja --chat-template-file /data/apps/vllm/qwen3_nonthinking.jinja

My intention was to test it with nanobot, to see if it would be a capable agent for my AI assistant.

For example, I asked what today’s weather is like:

ER +,att experimentally,ER
def sp
ER�ER
sp,,

ER.-sER

the
\”
Hausti1 in
ER,
, sp sp “ER\”).
spERomit

I checked a few other prompts, and it was like that for all of them.

Just to check whether this was a problem with the model, llama.cpp, or the combination of parameters passed to it, I called the API directly:

curl http://localhost:$1/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{\
        \"messages\": \
        [ \
          { \
            \"role\": \"user\", \
            \"content\": \
            [ \
              { \
                \"type\": \"text\", \
                \"text\": \"Hello!\" \
              } \
            ] \
          } \
        ], \
        \"temperature\":0.7, \
        \"top_p\":0.80, \
        \"min_p\":0.0, \
        \"presence_penalty\":1.5, \
        \"repetition_penalty\":1.0, \
        \"top_k\": 20, \
        \"enable_thinking\": false \
      }"
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"Hello! How can I help you today?"}}],"created":1774885465,"model":"unsloth/Qwen3.5-397B-A17B-GGUF:UD-IQ1_M","system_fingerprint":"b8522-9c600bcd4","object":"chat.completion","usage":{"completion_tokens":10,"prompt_tokens":14,"total_tokens":24,"prompt_tokens_details":{"cached_tokens":0}},"id":"chatcmpl-h8de9riukfOoLWE6ctt1KltwxXRUl06N","timings":{"cache_n":0,"prompt_n":14,"prompt_ms":1258.476,"prompt_per_token_ms":89.105428571428572,"prompt_per_second":11.30620888922939,"predicted_n":10,"predicted_ms":340.439,"predicted_per_token_ms":89.0439,"predicted_per_second":109.373837897538174}}

So I guess it just has problems with longer texts at this level of quantization.

I guess I won’t be switching to bigger models for now.

My first PR to Nanobot

I have been blogging for the last few weeks about my new adventure with nanobot. I am running the model for that little assistant of mine on the Desktop Framework motherboard I bought a couple of days ago.

A few days ago I was working on my own fork of nanobot. I was not happy with how a few things in this project are not entirely correctly implemented. For example, command support in the Matrix channel. I created a PR for that one here. Unfortunately it has not been merged yet. I am not sure why. Maybe it is missing a few unit tests.

Yesterday another PR of mine was merged though! I created a change request that added a kind of streaming support to the nanobot Matrix channel. It is not true streaming; it is a series of edits to the original message, which is sent once the first set of tokens reaches nanobot from the LLM.

After the first batch, the message is created, and as further tokens are generated nanobot sends another Matrix event that changes the original one. This is not really an edit either, as Matrix does not edit messages; it is just another event that updates the previous event’s data. When I was testing this with a small model, Qwen 3.5 0.6B, which often generates repeating or incorrect messages but does it very fast, on one occasion it generated a lot of tokens, which meant a lot of edits. I think it was about 90 lines, each edited a few times, so the entire message was probably edited about 1000 times. This actually caused my mobile client to hang and stop getting updates altogether. It was just too much for it to handle.
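For the curious, this is roughly what such a replacement event looks like. A minimal sketch in Python based on the Matrix spec’s m.replace relation; the function name and the event ID are made up, and the actual nanobot code may differ:

```python
# Sketch of a Matrix message "edit": a new m.room.message event that
# replaces the original one via the m.replace relation.
def make_edit_event(original_event_id: str, new_text: str) -> dict:
    return {
        "msgtype": "m.text",
        # Fallback body for clients that do not understand edits:
        "body": f"* {new_text}",
        "m.new_content": {"msgtype": "m.text", "body": new_text},
        "m.relates_to": {"rel_type": "m.replace", "event_id": original_event_id},
    }

event = make_edit_event("$original-event-id", "updated answer so far...")
print(event["m.relates_to"]["rel_type"])  # m.replace
```

Every new batch of tokens means sending one more event like this, which is exactly why a long response turns into hundreds of replacements.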

I had to remove this message entirely 🙂

So it is not ideal, but I still think it is beneficial: it is better to get a quicker response and read through the updates than to wait for the entire LLM response to appear in Element.

I think it is a great success to be part of a new and interesting project. And it is great to be able to benefit from it by running my self-hosted AI assistant.

Swapping vLLM to Llama.cpp

Yesterday I posted about vLLM being nice but optimized for a different usage than the one I am really interested in. I commented on a Llama Swap GH issue asking for help with my problem of vLLM being shut down before it actually started. While writing that comment I was thinking about llama.cpp and its inability to run Qwen 3.5. It was at the beginning of March when I started (again) experimenting with LLMs. I tried to run Qwen models with a release version of llama.cpp, and it was failing with ‘model architecture of Qwen3 is not found’ (or something like that; I do not really remember). Because of that I switched to vLLM: even if it could not run Qwen 3.5, it could run Qwen 3 just fine from the Docker image. SGLang did not seem like a good choice for my needs.

Because of that, I was still using vLLM when I switched to running models on the Desktop Framework motherboard. It was totally fine for about two weeks, until I started experimenting with Llama Swap yesterday and it started to pain me that vLLM is so slow to start.

I was writing the comment on the Llama Swap issue and thought: “it has actually been a few weeks, maybe it works now…”. I decided to try it out again.

I downloaded a new release of llama.cpp for ROCm and tried to run the main model I am currently using for inference: Qwen/Qwen3.5-35B-A3B. It worked.

And it was quick to start! Really quick!

And it was working almost the same!

And it was using less memory!

Somehow chatting with the model felt a bit worse. Maybe it is how llama.cpp works, or the quantization was worse, or the settings were a bit different, but… It was starting in seconds! Outstanding.

I looked for a benchmarking tool for OpenAI API models and found “LLM API Throughput Benchmark“. It does not seem to be the best thing out there, but I just wanted a rough idea of the performance of vLLM and llama.cpp running the same model. I ran the 2B version of Qwen in vLLM and llama.cpp:

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |

================================================================================

And the same in llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:27:39 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.49 |     224.32 |        0.17 |        0.17 | 100.00% |     3.18 |
|    2 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.02 |
|    4 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     1.54 |
|    8 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     2.45 |
|   16 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |     5.62 |
|   32 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    12.43 |
|   64 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    24.44 |
|  128 |    0.00 |       0.00 |        +Inf |        0.00 |   0.00% |    48.58 |

================================================================================

Ok. You can see the problem immediately: llama.cpp can’t run *any* queries in parallel at all. But some parameter juggling fixed that, to some extent. I added the parameter -np 4 and it made things much better.
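One thing to keep in mind with -np (this is my understanding of llama.cpp’s slot allocation, not something I benchmarked separately): --ctx-size is shared between the parallel slots, so concurrency is traded for per-request context.

```python
# Sketch: llama.cpp divides --ctx-size across the -np parallel slots
# (my understanding of its slot allocation).
ctx_size = 32000    # --ctx-size from my llama-server command
n_parallel = 4      # -np 4
ctx_per_slot = ctx_size // n_parallel
print(ctx_per_slot)  # 8000 tokens available to each concurrent request
```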

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-25 21:34:54 UTC+0
################################################################################
Model: unsloth/Qwen3.5-2B        | Latency: 0.60 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   68.46 |     224.32 |        0.17 |        0.17 | 100.00% |     3.19 |
|    2 |   98.97 |     237.95 |        0.32 |        0.32 | 100.00% |     4.41 |
|    4 |  126.60 |     249.43 |        0.61 |        0.61 | 100.00% |     6.89 |
|    8 |  113.77 |      36.54 |        0.65 |        8.32 | 100.00% |    15.33 |
|   16 |  112.97 |      25.55 |        0.64 |       23.80 | 100.00% |    30.88 |
|   32 |  112.79 |      22.17 |        0.64 |       54.86 | 100.00% |    61.85 |
|   64 |  112.36 |      20.76 |        0.66 |      117.14 | 100.00% |   124.17 |
|  128 |  109.85 |      19.69 |        0.67 |      247.08 | 100.00% |   254.03 |

================================================================================

I ran the same tests with Qwen 0.6B, a smaller model, and it was even more surprising.

Llama.cpp:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:51:52 UTC+0
################################################################################
Model: tiny                      | Latency: 0.00 ms
Input: 36                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |  125.41 |     360.00 |        0.10 |        0.10 | 100.00% |     4.08 |
|    2 |  166.04 |     360.00 |        0.20 |        0.20 | 100.00% |     6.17 |
|    4 |  189.06 |     288.00 |        0.50 |        0.50 | 100.00% |    10.83 |
|    8 |  180.52 |      24.55 |        0.48 |       11.73 | 100.00% |    22.69 |
|   16 |  169.31 |      15.71 |        0.47 |       22.91 |  62.50% |    30.24 |
|   32 |  169.91 |      15.76 |        0.44 |       22.84 |  31.25% |    30.13 |
|   64 |  168.29 |      15.59 |        0.49 |       23.09 |  15.62% |    30.42 |
|  128 |  168.84 |      15.62 |        0.48 |       23.05 |   7.81% |    30.32 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

vLLM:

################################################################################
                          LLM API Throughput Benchmark
                   https://github.com/Yoosu-L/llmapibenchmark
                        Time: 2026-03-26 18:55:38 UTC+0
################################################################################
Model: tiny-vllm                 | Latency: 0.20 ms
Input: 38                        | Output:  512 tokens

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   60.45 |     635.45 |        0.06 |        0.06 | 100.00% |     8.47 |
|    2 |  101.16 |     330.72 |        0.23 |        0.23 | 100.00% |    10.12 |
|    4 |  195.48 |    1904.76 |        0.08 |        0.08 | 100.00% |    10.48 |
|    8 |  364.67 |    1790.34 |        0.13 |        0.17 | 100.00% |    11.23 |
|   16 |  418.69 |    3460.84 |        0.07 |        0.11 |  62.50% |    12.23 |
|   32 |  420.42 |    3807.62 |        0.10 |        0.10 |  31.25% |    12.18 |
|   64 |  418.69 |    3460.84 |        0.07 |        0.11 |  15.62% |    12.23 |
|  128 |  433.56 |    3460.84 |        0.07 |        0.11 |   7.81% |    11.81 |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|

================================================================================

It seems like llama.cpp was actually doing a lot better for lower numbers of concurrent users. Of course, at 4 requests it is almost the same, and above that it performs a lot worse, but that is not the use case I am worried about at all.

Then I checked memory consumption. It is important to note that, if you do not forbid it, vLLM will assign all the GPU memory to itself. You could run one model with 2B parameters and it would consume the entire VRAM, all 120GB of it. This makes no sense for my scenario. I changed that when I was working on voice recognition for my assistant, as it requires two models. I fixed it by assigning the lowest possible value of --gpu-memory-utilization that would still work. For Qwen 35B it was about 33%, which is about 40GB. But for llama.cpp it is much less.
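The arithmetic behind that setting, as a small sketch (numbers from my setup; I did not measure whatever overhead vLLM adds on top):

```python
# vLLM pre-allocates a fixed fraction of GPU memory, controlled by
# --gpu-memory-utilization. With 120 GB of unified RAM assigned to the GPU:
gpu_mem_gb = 120
utilization = 0.33            # lowest value that still worked for Qwen 35B
reserved_gb = gpu_mem_gb * utilization
print(round(reserved_gb, 1))  # 39.6, i.e. the ~40 GB I was seeing
```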

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              189/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     148812  llama-server           1.9 GB   40.4 MB     1.9 GB  N/A     |
+------------------------------------------------------------------------------+

2GB! How?! Why?! I need to do more testing, as this may be only some bare minimum and during actual usage it may be much more. But if it is true, combined with the fact that the startup time is much better, it has convinced me that vLLM is probably not worth it.

This is how memory usage looks while running Qwen 35B via vLLM:

+------------------------------------------------------------------------------+
| AMD-SMI 26.2.1+fc0010cf6a    amdgpu version: Linuxver ROCm version: 7.2.0    |
| VBIOS version: 023.011.000.039.000001                                        |
| Platform: Linux Baremetal                                                    |
|-------------------------------------+----------------------------------------|
| BDF                        GPU-Name | Mem-Uti   Temp   UEC       Power-Usage |
| GPU  HIP-ID  OAM-ID  Partition-Mode | GFX-Uti    Fan               Mem-Usage |
|=====================================+========================================|
| 0000:c1:00.0    AMD Radeon Graphics | N/A        N/A   0                 N/A |
|   0       0     N/A             N/A | N/A        N/A              151/512 MB |
+-------------------------------------+----------------------------------------+
+------------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU        PID  Process Name          GTT_MEM  VRAM_MEM  MEM_USAGE     CU % |
|==============================================================================|
|    0     149044  python3.12             5.9 MB   62.5 KB    16.0 EB  N/A     |
|    0     149216  python3.12            37.6 GB    3.2 MB    38.5 GB  N/A     |
+------------------------------------------------------------------------------+

So llama.cpp starts much quicker and uses less memory? There is no discussion here. That is it. I am switching to llama.cpp.

Of course I need to do more testing and more benchmarks, but this seems to be an obvious decision: llama.cpp is the better choice for me, for now.

vLLM is fast but kind of slow

I have been using vLLM to run my AI assistant models for a few weeks now. It is a quite nice, production-ready framework that has a few bugs. ROCm support, and by extension Strix Halo support, is quite good, even if the AMD GPU Docker image does not work on Strix Halo without a custom version of aiter.

Inference speeds are quite nice. You can get 20-30 t/s out of the new AMD APU, which is totally usable. Here is a sample test I did on this device via LLM API Throughput Benchmark:

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   28.48 |      52.79 |        0.72 |        0.72 | 100.00% |    17.98 |
|    2 |   37.82 |      36.37 |        2.09 |        2.09 | 100.00% |    27.08 |
|    4 |   44.40 |      15.06 |       10.08 |       10.09 | 100.00% |    46.13 |

There is one big issue though. vLLM startup time is very slow. Initially I was planning to use one model for everything text related, maybe for images too, which is totally fine for an ImageText-to-Text model such as Qwen 3.5. Another one for Speech-to-Text. Another one for Text-to-Speech. Another one for image generation. Maybe a few others for other uses. But the Desktop Framework, with its 128GB of unified RAM, even with 120GB of it assigned to the GPU, is still too small to run all of those models at once.

The solution is to run only a few of them at once, or maybe just the one I am currently using. This is a totally sensible thing to do, and it can even be used to run the same model in several different modes. For example, the Hugging Face page for Qwen 3.5 specifies four different sets of sampling parameters for different purposes:

We recommend using the following set of sampling parameters for generation

    Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
    Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.
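Those presets could be kept as plain request-body overrides. A sketch (the mode names are mine; the parameter values are copied from the model card above):

```python
# Qwen 3.5 sampling presets from the model card, keyed by a made-up mode name.
PRESETS = {
    "thinking-general":   dict(temperature=1.0, top_p=0.95, top_k=20,
                               min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
    "thinking-coding":    dict(temperature=0.6, top_p=0.95, top_k=20,
                               min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0),
    "instruct-general":   dict(temperature=0.7, top_p=0.8, top_k=20,
                               min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
    "instruct-reasoning": dict(temperature=1.0, top_p=0.95, top_k=20,
                               min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0),
}

def payload(mode: str, messages: list) -> dict:
    # Merge a preset into an OpenAI-style chat completion request body.
    return {"messages": messages, **PRESETS[mode]}

print(payload("instruct-general", [{"role": "user", "content": "Hello!"}])["temperature"])  # 0.7
```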

The ideal thing would be to run one model and specify the ‘mode’ via some API switch. And it happens that some people are doing exactly that by using Llama Swap.

Llama Swap seems like a very interesting project. Lately I am using my new AMD Ryzen AI Max+ 395 motherboard as a 24/7 headless server for models. And I was wondering what this software is actually doing all that time. How does it impact power usage? Most probably it is constantly doing something, sipping power even while I am asleep.

The logical thing to do would be to run models only when I need them. Maybe swap the assistant model for a coding model (i.e. Qwen Coder) when I am working… But…

vLLM is so slow at startup!

It can take as much as 3 minutes to start Qwen/Qwen3.5-35B-A3B.

During that time you can make yourself some tea.

Swapping models like that would be a bit annoying. Still, I wanted to test it with the Llama Swap project. Otherwise, using the same vLLM model from multiple unconnected sources may cause it to run itself into a corner, with what appears to be some kind of deadlock: generating hundreds of tokens for some requests and nothing for others, and being totally unresponsive while doing it. That was very strange, and the only solution I was able to find was to kill the process entirely.

Running multiple instances would probably help a lot even if initial startup time would be annoying.

I am experimenting right now with vLLM started via Llama Swap, and it can’t really do that. Llama Swap kills the process after two minutes or so. Right now I can’t force it to run anything bigger than 2B parameters using vLLM; it is just too slow. There should be a configuration switch to change the wait time to something longer, but I can’t find it right now. There is an issue on the Llama Swap GitHub where somebody else was experiencing something very similar. I hope I will be able to solve it and swap running models on the fly, even if it will be very slow.
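For the record, a guess at what that switch might be: if I am reading the Llama Swap README correctly, there is a healthCheckTimeout setting that controls how long it waits for a backend to become healthy before killing it. I have not verified this on my setup yet, so treat the whole sketch below, model entry included, as an assumption:

```yaml
# Unverified sketch of a llama-swap config. healthCheckTimeout is my guess at
# the switch that controls how long a slow starter may take; the model entry
# is illustrative only.
healthCheckTimeout: 300   # seconds to wait before giving up on the backend

models:
  "qwen35b":
    cmd: vllm serve Qwen/Qwen3.5-35B-A3B --port ${PORT}
```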