vLLM is fast but kind of slow

I have been using vLLM to run my AI assistant models for a few weeks now. It is a nice, production-ready framework with only a few bugs. ROCm support, and by extension Strix Halo support, is quite good, even if the AMD GPU Docker image does not work on Strix Halo without a custom version of aiter.

Inference speeds are quite nice. You can get 20-30 t/s out of the new AMD APU, which is totally usable. Here is a sample test I ran on this device via LLM API Throughput Benchmark:

| Conc | Gen TPS | Prompt TPS | Min TTFT(s) | Max TTFT(s) | Success | Total(s) |
|:----:|:-------:|:----------:|:-----------:|:-----------:|:-------:|:--------:|
|    1 |   28.48 |      52.79 |        0.72 |        0.72 | 100.00% |    17.98 |
|    2 |   37.82 |      36.37 |        2.09 |        2.09 | 100.00% |    27.08 |
|    4 |   44.40 |      15.06 |       10.08 |       10.09 | 100.00% |    46.13 |
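For reference, the Gen TPS and TTFT columns boil down to simple arithmetic over per-token arrival timestamps. A minimal sketch of that calculation (the timestamps in the demo are made up for illustration, not taken from the benchmark above):

```python
# Compute time-to-first-token and generation throughput from token
# arrival timestamps - the same quantities as the TTFT and Gen TPS columns.

def ttft_and_tps(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT in seconds, generation tokens/second)."""
    ttft = token_times[0] - request_start
    gen_window = token_times[-1] - token_times[0]
    # Tokens after the first one are generated during `gen_window`.
    tps = (len(token_times) - 1) / gen_window if gen_window > 0 else 0.0
    return ttft, tps

# Demo with made-up timestamps: first token at 0.7 s, then 10 tokens/s.
times = [0.7 + 0.1 * i for i in range(51)]
ttft, tps = ttft_and_tps(0.0, times)
print(round(ttft, 2), round(tps, 1))  # 0.7 10.0
```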

There is one big issue though: vLLM startup time is very slow. Initially I was planning to use one model for everything text related, maybe for images too, which is fine for an ImageText-to-Text model such as Qwen 3.5. Another one for Speech-to-Text, another for Text-to-Speech, another for image generation, and maybe a few others for other uses. But the Framework Desktop, with its 128 GB of unified RAM, is still too small to run all of those models at once, even after I assigned 120 GB of that RAM to the GPU.

The solution is to run only a few of them at once, or maybe just the one I am currently using. This is a sensible thing to do, and it can even be used to run the same model in several different modes. For example, the Hugging Face page for Qwen 3.5 specifies different sets of sampling parameters for different purposes:

We recommend using the following set of sampling parameters for generation

    Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
    Instruct (or non-thinking) mode for general tasks: temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
    Instruct (or non-thinking) mode for reasoning tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0

Please note that the support for sampling parameters varies according to inference frameworks.

The ideal thing would be to run one model and select the ‘mode’ via some API switch. And it happens that some people are doing exactly that using Llama Swap.
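The per-mode sampling parameters quoted above can already be selected per request through vLLM's OpenAI-compatible API, with no model swapping involved. A minimal sketch, where the endpoint URL and model id are placeholder assumptions:

```python
# Per-request sampling presets matching the Qwen recommendations quoted above.
# The endpoint URL and model id below are placeholders, not verified values.
import json
import urllib.request

PRESETS = {
    "thinking-general": dict(temperature=1.0, top_p=0.95, top_k=20,
                             min_p=0.0, presence_penalty=1.5),
    "thinking-coding":  dict(temperature=0.6, top_p=0.95, top_k=20,
                             min_p=0.0, presence_penalty=0.0),
    "instruct-general": dict(temperature=0.7, top_p=0.8, top_k=20,
                             min_p=0.0, presence_penalty=1.5),
}

def build_request(prompt: str, mode: str) -> dict:
    """Build a chat-completions payload with the sampling preset for `mode`."""
    payload = {
        "model": "Qwen/Qwen3.5-35B-A3B",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(PRESETS[mode])
    return payload

def send(payload: dict, url: str = "http://localhost:8000/v1/chat/completions"):
    """POST the payload to a vLLM OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(build_request("Refactor this function", "thinking-coding"))
```

vLLM accepts extra sampling fields like `top_k` and `min_p` beyond the stock OpenAI schema, so the presets can travel in the request body itself.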

Llama Swap seems like a very interesting project. I have been using my new AMD Ryzen AI Max+ 395 motherboard as a 24/7 headless server for models lately, and I was wondering what this software is actually doing all that time. How does this impact power usage? Most probably it is constantly doing something, sipping power even when I am asleep.

The logical thing to do would be to run models only when I need them, and maybe swap the assistant model for a coding model (e.g. Qwen Coder) when I am working… But…

vLLM is so slow at startup!

It can take as much as 3 minutes to start Qwen/Qwen3.5-35B-A3B.

During that time you can make yourself some tea.

Swapping models like that would be a bit annoying, but I still wanted to test it with the Llama Swap project. Otherwise, using the same vLLM instance from multiple unconnected sources can make it run itself into a corner: it apparently hits some kind of deadlock, generating hundreds of tokens for some requests and nothing for others, and becomes totally unresponsive while doing so. That was very strange, and the only fix I found was to kill the process outright.

Running multiple instances would probably help a lot, even if the initial startup time would be annoying.

I am experimenting right now with vLLM started via Llama Swap, and it can’t really handle it. Llama Swap kills the process after two minutes or so. Right now I can’t get it to run anything bigger than 2B parameters via vLLM – startup is just too slow. There should be a configuration switch to make the wait time longer, but I can’t find it right now. There is an issue on the Llama Swap GitHub where somebody else was experiencing something very similar. I hope I will be able to solve it and swap running models on the fly, even if it will be very slow.
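If I get it working, the config would presumably look something like the sketch below. The key names follow my reading of the llama-swap README (in particular a `healthCheckTimeout` setting that I believe controls how long llama-swap waits for a starting server before killing it), so treat them as assumptions to verify against the current docs:

```yaml
# Sketch of a llama-swap config for slow-starting vLLM backends.
# Key names are my best recollection of the llama-swap README; verify them.
healthCheckTimeout: 600   # seconds to wait for the server to become ready

models:
  "qwen-assistant":
    # ${PORT} is filled in by llama-swap at launch time.
    cmd: >
      vllm serve Qwen/Qwen3.5-35B-A3B --port ${PORT}
    proxy: "http://127.0.0.1:${PORT}"
    ttl: 300                # unload after 5 minutes of inactivity
```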
