Switching my AI assistant OS to Ubuntu

I made a mistake trying to run a vLLM server on Debian. In theory everything should have been fine: Debian is stable, it has Docker, Docker has images for vLLM, so I should have been able to run all the models in Docker on Debian. It also should not really matter which distro it is, as long as it has a recent kernel and Docker works. That is the theory. In practice I could not get some models to run, and Qwen 3, even though it was capable enough, felt limiting at times. For example, I was unable to get it to work in agentic loops. I am not sure what the problem was, probably some configuration issue that I am still unable to understand and fix. Still, that was just a simple problem that required me to give my assistant a nudge with another prompt. It was not as bad as another issue, though.

For example, when I was running Qwen 0.6B, I taught it to use some of the smart devices I made myself, like my smart gate controller. It was working nicely until it somehow started to get confused and asked me repeatedly to give it an API key. But I already had. Everything it needed was already in the skill file. I tried to explain that to the model, but without luck. It felt like talking to a parrot: it kept repeating the same sentence over and over again.

“The gate has been opened! 🚪

If you have the API key secret, I can do that. If you want, I can help with other tasks. 😊”

I just could not explain to it that it was wrong. The emojis everywhere were annoying too, but I will leave that for another time.

I tested a few other things. For example, asking it to send me an email with a reminder to do a thing – let’s call it A. It either sent me a Matrix message with the reminder to do A, or reminded me to send an email about A. Neither was correct. After that I decided to check out Qwen 3.0 27B. Even though it was still incorrect sometimes, I was able to steer it onto the correct path with a few more prompts.

And it was fine until it started misbehaving again in the same way: it kept asking me to give it an API key for my own devices.

I have no idea why. I did not make any changes or adjustments. At some point it just forgot what it should do. I tweaked the skill files, but without luck. I think the model needs to be made aware of such changes; I did not know that before. I manually edited the history and memory to remove any mention of API keys. Again no change, even after a restart. I decided that the easiest solution would be to run Qwen 3.5.

I played with my vLLM Docker images a bit, trying to debug why I could not run this model. There was no stack trace. No meaningful error of any kind. Just some logs saying that the main process had crashed.

I fed the logs to Gemini and asked what might be the cause. Searching the internet for it turned up nothing. Gemini first wrote that it was “classic OOM exception behavior” caused by my system not having enough VRAM. The thing is, I was already running a Framework PC with 120GB of GPU memory. Running Qwen 3.5 at 30B size should be totally safe and should leave some space too.

I explained that I had 120GB free, and then Gemini confirmed that this should not be the problem – it started to say that it was probably an issue with AMD’s CUDA equivalent, ROCm. It asked me to add the --enforce-eager flag to see if that fixed it. I think these models’ knowledge is rather outdated from the start of their existence, since they are trained on data gathered months or even years earlier – it takes time to scrape, label, clean, organize and censor that data and then train the model – so given the rapid evolution of ROCm and LLMs in general, this information was probably old and outdated… But still worth a try! Adding a flag and launching a model takes 2 minutes.

But it did not fix it.

Gemini was so convinced about this solution that it offered an alternative in the same message, in case the flag did not work: enable vLLM’s debug log flag.

Again, I did. I relaunched vLLM and saw new errors related to the Hugging Face API. I asked about those, but the answer was that this is normal; sometimes models have some files missing from Hugging Face storage.
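For reference, vLLM’s verbosity is controlled through the VLLM_LOGGING_LEVEL environment variable, so the relaunch looked roughly like this (a minimal sketch; the model name below is the one from my final script and the host/port are just examples):

```shell
# Sketch: relaunch vLLM with debug-level logging enabled.
# VLLM_LOGGING_LEVEL is vLLM's log-level environment variable;
# model name, host and port are illustrative.
VLLM_LOGGING_LEVEL=DEBUG \
  vllm serve cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --host 0.0.0.0 \
  --port 8000
```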

I tried switching to other flavors of Qwen 3.5, other sizes, quants, etc. It did not work. I fed the entire output with all the debug logs into the Gemini chat. It said that I needed to open an issue in the vLLM GitHub repo.

Well, that was useless.

At this point it was getting late and I needed to take care of the kids, so a break sounded good.

The next day I downloaded the ISO for Ubuntu Server and started from scratch. I had used Ubuntu before, both as a server and as my daily PC, but I did not like it, mostly because of everything being forced through snap, and because of the upgrade process. I did the Debian 11-to-12 upgrade on 4 machines and had no problems. I later upgraded all of them from 12 to 13 and again: no problems. When I upgraded an Ubuntu server to a new version a few years back, it stopped booting. It was not terribly broken; it just lost the boot partition with the Linux image and booted only into the GRUB emergency terminal. A few adjustments and I was able to fix it in a few minutes. The thing is, it was a headless PC that I used for a few self-hosted applications that my wife and I relied on daily, and having no access to them in the morning usually makes your life a bit worse. You depend on something just being there, and Ubuntu broke it for me. After that, Debian was the way to go for me, because of its stability. It is much harder to break.

Or maybe it was me, silly person, doing silly things to my Linux server, that broke it. That is also a possibility, though I remember that all I did was run: sudo do-release-upgrade prior to the reboot 🙂

The installation was pretty quick, so in a few minutes I was able to log in via SSH. After that I did the usual process of updating everything, installing the usual packages (tmux, mosh, docker), configuring the environment, SSH keys, etc. Then I installed the AMD GPU drivers and ROCm. The good thing is it just worked on Ubuntu and there were no problems – I just followed this tutorial. Even amd-ttm worked and I was able to set the VRAM limit to 120GB. I guess the tool was fine all along; it is just designed to work on Ubuntu.
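The driver part boiled down to something like the following (a rough sketch under the assumption that AMD’s package repository is already configured; the exact repo setup and supported ROCm release come from AMD’s install guide, so treat these commands as illustrative, not canonical):

```shell
# Sketch of a typical ROCm install on Ubuntu using AMD's installer.
# Assumes AMD's apt repository for your ROCm release is configured;
# see AMD's install guide for the repo setup.
sudo apt update
sudo apt install -y amdgpu-install     # AMD's installer wrapper package
sudo amdgpu-install --usecase=rocm     # pulls in the ROCm stack
# Let the current user talk to the GPU without root:
sudo usermod -aG render,video "$USER"
```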

After everything was configured and the libraries installed, I copied a few scripts I had saved from the Debian installation that I used for running models via Docker. I executed the one intended for Qwen 3.5 and vLLM… And it failed. In exactly the same way.

That was a bit of a letdown.

I tried SGLang and it did not work. I do not remember why, though I do remember that the image was enormous, like 25 GB. It failed in exactly the same way: no meaningful error, it just stopped. Then I tried one of the AMD images that does not start vLLM directly as the entrypoint; instead you can run bash and experiment with the environment. After installing a new PyTorch, I was able to see some logs that had some meaning.
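Dropping into such an image looks roughly like this (a sketch; the image tag is a placeholder, and the device/group flags are the usual ones for exposing an AMD GPU to a container):

```shell
# Sketch: open a shell inside a ROCm-enabled image instead of
# launching the server, so you can poke at the environment.
# "rocm/vllm:latest" is a placeholder image tag.
docker run -it --rm \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  --entrypoint /bin/bash \
  rocm/vllm:latest
```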

It was trying to allocate 250GB of VRAM! Why? Is there even a GPU with that much memory? The biggest I had seen were 200GB enterprise cards. Was something, in a way, making sure the engine would fail? Anyway, it behaved like that both for the full Qwen 3.5 model and for its quantized 4-bit version from cyankiwi. That was very weird, and I began to suspect it was just a bug in vLLM or one of the libraries.
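To sanity-check what the GPU actually reports, rocm-smi can print the memory it sees, which you can then compare against what vLLM tries to allocate (sketch; the output format varies a bit between ROCm releases):

```shell
# Sketch: ask ROCm how much VRAM the device actually reports,
# to compare against vLLM's attempted allocation.
rocm-smi --showmeminfo vram
```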

Anyway, that was something I could use to search the web. I found this issue on GitHub. So it seems that to actually run Qwen 3.5 on Strix Halo, you have to run it with some experimental flag. I tried it and it failed again.

But at least now it was complaining about some problem with ROCm’s AITER library.

What now? Since I had ROCm installed, I could try running vLLM directly. I could also try changing the Docker image to install and configure whatever was missing. The thing is, I was not sure what was missing. And even if I knew what was missing, I was not sure exactly which version was necessary. And I do not like building Docker images before testing the software directly on the system – you may waste time installing everything in Docker only for it to fail anyway.

So I decided to install vLLM directly on the Ubuntu OS and try to run the model from there. I followed this tutorial for building vLLM from source and it failed midway. I had installed Triton version 3.6 and vLLM needed 3.4. I hate Python packages.
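Fixing that kind of version pin is mundane (a sketch; the version numbers are the ones from my setup, so check what your vLLM build actually expects):

```shell
# Sketch: swap the too-new Triton for the version
# this vLLM build was written against.
pip uninstall -y triton
pip install "triton==3.4.0"
pip show triton   # confirm the pinned version is the one installed
```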

Anyway, I corrected the Triton version, and I had already installed AITER. Maybe that was enough? I tried to run Qwen 3.5 and it worked!

Finally!

I asked my AI assistant a few questions to check whether it worked. And it was fine, though it felt a bit strange – but that happens when you change models on an AI assistant that already has memory and some knowledge of your past interactions.

Right now my script to run Qwen 3.5 runs on:

  • vllm 0.17.1+rocm700
  • amd-aiter 0.1.10.post2
  • torch 2.9.1+git8907517
  • triton 3.4.0
  • rocm 7.2.0.70200-43~24.04

and looks like this:

TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  VLLM_ROCM_USE_AITER=1 \
  vllm serve \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.33
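The --gpu-memory-utilization value deserves a note: vLLM treats it as a fraction of the GPU memory it can see, so 0.33 on a 120GB budget reserves roughly a third of it. A back-of-the-envelope check (integer GB, rounded down; the right fraction for you depends on what else shares the GPU):

```shell
# Back-of-the-envelope: how much memory --gpu-memory-utilization 0.33
# hands to vLLM out of a 120 GB budget (integer GB, rounded down).
total_gb=120
util_percent=33
budget_gb=$(( total_gb * util_percent / 100 ))
echo "vLLM pre-allocates about ${budget_gb} GB"
```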

In retrospect, switching the OS of the device that runs my models was a good idea. AMD builds its software with Ubuntu and Fedora in mind. I could probably get a ROCm installation working on Debian too with some effort, but I do not think it is worth my time, at least not now. With ROCm installed, I should be able to install vLLM in a virtual environment as well. Or install all the libraries inside Docker and run that image on Debian – that should be possible too.
