Running Qwen 3.5 on Strix Halo using vLLM and Docker

I have had the Framework Desktop for two months now. It is pretty capable for inference with small contexts, but with longer contexts it degrades quickly: it can, for example, chew on a 100k-token context for 10 minutes. Because of those limitations I decided to try vLLM again; maybe it will be quicker. Two months is a long time in LLM frameworks: the last time I used vLLM it was at version 0.17, and now the stable release is 0.19.

I did not even attempt to update and run vLLM via Python wheels. It is too much trouble and it is not worth it. If Docker works, you just have to download one image, once. It is pretty large, around 10GB, but downloading it is much faster than the time spent debugging installation errors of Python packages.

The easiest solution would be to move the same command I am using on my Linux host into a Docker container.

For example, this is how I currently run the 35B Qwen 3.5:

TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.33 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
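Once `vllm serve` is up, a quick way to sanity-check it is the OpenAI-compatible HTTP API it exposes. This is only a sketch: it assumes the default port 8000, and the `|| echo` fallbacks are there so the snippet does not abort while the model is still loading.

```shell
PORT="${PORT:-8000}"
BASE_URL="http://localhost:${PORT}/v1"

# Minimal chat request; the model name matches the one served above.
PAYLOAD='{"model":"cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit","messages":[{"role":"user","content":"Say hi"}],"max_tokens":16}'

# Validate the payload locally before sending it anywhere.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# List the served models, then ask for a completion.
curl -s "${BASE_URL}/models" || echo "server not reachable yet"
curl -s -H "Content-Type: application/json" -d "$PAYLOAD" \
  "${BASE_URL}/chat/completions" || echo "server not reachable yet"
```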

I had thought about this before, but to my surprise, until today I had not thought about copying the special environment variables into the container too.

docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -p "${1:-8000}":8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:nightly \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float \
  --max-model-len 128k \
  --gpu-memory-utilization 0.46 \
  --speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'
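The `"${1:-8000}"` in the port mapping is bash's default-value parameter expansion; the command is clearly meant to live in a script, where the first argument picks the host port and 8000 is the fallback. The idiom in isolation:

```shell
# "${1:-8000}" expands to the first positional argument, or to 8000 when
# no argument was given, so one script can serve on different host ports.
port_for() {
  echo "${1:-8000}"
}

port_for        # prints 8000
port_for 8001   # prints 8001
```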

And it finally worked!

I was able to run Qwen in Docker. Now I need to test the inference speed of vLLM and llama.cpp with a coding agent to compare which one is better for such a load.

Using Mistral Vibe with a local model

Vibe was the first coding agent I tried to use with a local model. It works, and it is fun to be able to just tell the computer to do what you want in plain English, or even in your own language if your model supports it.

Installation and Configuration

There are several installation methods available for Vibe besides the usual security nightmare of piping curl-downloaded scripts directly to bash 🙂
I advise using one of the other installation methods, for example via uv:

uv tool install mistral-vibe

After that comes the tricky part: starting Vibe. I had some problems with that because it requires you to configure the Mistral API and an API key. I do not have one, and I did not intend to use it with Vibe.

You do not have a choice here. You have to press Enter.

After that, the docs state that it will create a configuration file. You need to edit this file, located at ~/.vibe/config.toml:

  • add a providers entry
[[providers]]
name = "llamacpp"
api_base = "http://local:8080/v1"
api_key_env_var = ""
api_style = "openai"
backend = "generic"
reasoning_field_name = "reasoning_content"
project_id = ""
region = ""

This adds a new model provider to the Vibe coding agent.

  • add a model entry
[[models]]
name = "medium-moe"
provider = "llamacpp"
alias = "medium"
temperature = 0.2
input_price = 0.0
output_price = 0.0
thinking = "off"
auto_compact_threshold = 200000

This adds a new model to Vibe. Of course, you can add more than one model and switch between them whenever you want.

Unfortunately, you cannot skip the next screen, or at least I do not know how. I tried creating a dummy file at ~/.vibe/.env, which should be the store for Mistral API keys, but it does not work. Restarting and changing other configuration entries do not work either. Maybe Vibe saves this progress in some /tmp directory. Luckily, you can type any bullshit value in there to get past it! It does not currently validate it. 🙂

A bit annoying, but it makes perfect business sense. The majority of people will probably register just to get past this screen, and some of them will probably end up paying Mistral money after that. It is not so perfect for a user like me, though.

When it finally loads its main menu, just type:

/config

Then you can see the model selection:

Press Enter and select the model that you configured as local. In my case it is called medium.

You can select it and test it.

Summary

It works but lacks integration with the IDE. To be clear, it is possible to run it from the plugin, but the IDE (Rider in my case) does not let it use any of the cool tools the IDE has. For example, when you ask the coding agent about issues in the file you currently have open, it does not ask Rider (or the ReSharper plugin in VS) what the issues are. The information is already there:

Instead, the coding agent will go on a GREAT JOURNEY OF SELF-DISCOVERY AND CODING EXPERIENCE. This is more apparent when you are using a local model because it is much, much slower. API providers mask these inefficiencies by running everything very fast, so you either do not notice or do not care enough.

Also, you have to be prepared for a few days of tinkering, testing and optimizing before this tool is actually usable. Otherwise it will either be a glorified chat, or it will be so slow that writing a simple unit test takes three hours.

Anyway, it is still a fun experience. When I was starting out, I would never have imagined that I would one day be able to instruct my computer to do things for me in my native language. This is truly amazing!

Using the Junie agent CLI with a local model

JetBrains has its own coding agent. Since I am already using Rider, and all coding agents are pretty slow and inefficient at what they do, I wanted to test whether Junie has better IDE integration, which would make things much faster.

For example, suppose you have the following unused namespace in Rider.

Removing it, or cleaning all unused namespaces in the entire project or even the entire solution, is just a few clicks away. Or keystrokes.

Since Junie can be integrated with Rider and can be used from a pipeline, it would be cool to have a headless IDE instance start up, hook into the CLI coding agent, and perform automatic code cleanup, automatic package updates, or even automatic test writing for better code coverage.

I wanted to test whether this would work with a local model.

Configuration

According to the docs, you need to install the Early Access version of Junie for this configuration to be available at all. You can install it via:

curl -fsSL https://junie.jetbrains.com/install-eap.sh | bash

But honestly, I advise you never to do installations this way, unless you have absolute trust in the source address and your network connection settings.

cd /tmp
wget https://junie.jetbrains.com/install-eap.sh
# inspect the file for validity before running it
bash install-eap.sh
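The "inspect the file" step can be partly mechanized. A small helper sketch (the `precheck` name is mine): `bash -n` parses the script without executing anything, which at least catches truncated downloads, and a grep surfaces the lines that can actually change the system. It is not a safety guarantee, just a cheap first filter before you read the rest.

```shell
# Parse a downloaded installer without running it, then list risky commands.
precheck() {
  local script="$1"
  bash -n "$script" || { echo "syntax error in $script, refusing to run"; return 1; }
  echo "--- lines worth reviewing in $script ---"
  grep -nE '\b(curl|wget|sudo|rm|eval|chmod|mv)\b' "$script" || echo "(none found)"
}

# Usage: precheck install-eap.sh && bash install-eap.sh
```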

After installation you can run it and play with it a bit if you have a JetBrains account. If not, create the local model config first.

You can do that globally or per project. To do it globally, use the following path:

$JUNIE_HOME/models/*.json

For example, I created a file called:

~/.junie/models/llama-swap-global.json

with the following content:

{
  "baseUrl": "http://local:8080/v1/chat/completions",
  "id": "medium:instruct_reasoning",
  "apiType": "OpenAICompletion",
  "extraHeaders": {
    "X-Custom-Source": "Junie"
  },
  "fasterModel": {
    "id": "small"
  }
}

Just to see whether this would work, I also created a similar file in my personal project, FluentCodeGenerators.

fluentcodegenerator/.junie/models/llama-swap1.json

The content of the file was the same.

When I start Junie and use the /model command, I can see the following menu:

```
  Junie   v.1468.8 eap   ~/.nanobot
  ─────────────────────────────────────────────────────────────────────────────
  Select model
  Current model: llama-swap-global

    Name                        Input    Output             Effort       Provider
    Default (Gemini 3 Flash)    $0.50    $3.00 per Mtok     ‹ High ›     JetBrains AI
    Claude Opus 4.6             $5.00    $25.00 per Mtok    ‹ Low ›      JetBrains AI
    Claude Opus 4.7             $5.00    $25.00 per Mtok    ‹ Low ›      JetBrains AI
    Claude Sonnet 4.6           $3.00    $15.00 per Mtok    ‹ Low ›      JetBrains AI
    Gemini 3.1 Flash Lite       $0.25    $1.50 per Mtok     ‹ High ›     JetBrains AI
    Gemini 3.1 Pro Preview      $2.00    $12.00 per Mtok    ‹ High ›     JetBrains AI
    GPT-5                       $1.25    $10.00 per Mtok    ‹ Low ›      JetBrains AI
    GPT-5.2                     $1.75    $14.00 per Mtok    ‹ Low ›      JetBrains AI
    GPT-5.3-codex               $1.75    $14.00 per Mtok    ‹ Low ›      JetBrains AI
    GPT-5.4                     $2.50    $15.00 per Mtok    ‹ Low ›      JetBrains AI
    Grok 4.1 Fast Reasoning     $0.20    $0.50 per Mtok     ‹ Medium ›   JetBrains AI
  → llama-swap-global           $0.00    $0.00 per Mtok     ‹ Low ›
```

But when I am connected to the IDE with a project open:

I have one extra entry.

That is it.

Possible problems

I had a strange problem where Junie could not see those extra files in the in-application config. I am not even sure what was causing it or how I fixed it, but it seems a couple of restarts of the Junie agent, plus renaming the files (llama-swap.json to llama-swap1.json), helped.
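A silently malformed JSON file would produce exactly these symptoms, so ruling that out first is cheap. A throwaway helper (the function name is mine):

```shell
# Report whether each given Junie model file parses as JSON.
check_model_files() {
  local f
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
    elif python3 -m json.tool "$f" > /dev/null 2>&1; then
      echo "ok:  $f"
    else
      echo "BAD: $f"
    fi
  done
}

# Usage: check_model_files ~/.junie/models/*.json .junie/models/*.json
```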

You have to bear in mind that performance may vary greatly depending on your configuration, inference framework and its settings, the kind of model, and of course your hardware. Prepare for at least a few days of tuning to reach a workable solution.

AI agent critique of itself

I like to self-host my own services. I have my own email server. I have my own cloud storage, sync and backup. I want to have this secure and safe while keeping my data private.

But I need to learn and be up to date with current technology.

I decided to use Junie with my own self-hosted models that I run on my own hardware. The docs say it can run on top of your own custom provider and model. But it does not work with the Rider plugin; there you are limited to their own, limited list of models.

The docs also state that you can set your own provider via JSON.

Sadly, you cannot extend the built-in set of models that way.

{
  "baseUrl": "http://local-llm:8080/v1",
  "id": "medium:instruct_reasoning",
  "apiType": "OpenAICompletion",
  "extraHeaders": {
    "X-Custom-Source": "Junie"
  },
  "fasterModel": {
    "id": "medium:instruct_reasoning"
  }
}

It does not work via project-local settings, nor with the global ones in the HOME directory.

- User-scope: $JUNIE_HOME/models/*.json.
- Project-scope: .junie/models/*.json.

I decided to try the CLI, which according to the docs can be run with custom settings like that. Of course, the official installation method is:

curl -fsSL https://junie.jetbrains.com/install.sh | bash

I usually open such a script and read through it. I am not very familiar with bash (mostly by choice, because IMHO it is terrible), but this way, if my computer gets infected, at least I can blame someone else, because I tried to do what I could to avoid it. Short of not installing anything that does not have an apt package.

Just out of curiosity, I asked Junie to read through this file. It gave me a short summary with the statement:

The script is a standard, robust installer pattern for command-line tools. If you have any specific questions about its implementation or need help adapting it, please let me know.

I do not think it is robust. Nor do I think it should be standard. But hey! We are living in the future of an AI that does not care about such silliness as security and breaches.

I asked directly:

Is it safe to run such script on my PC?

And I got an answer:

Summary: While the script is almost certainly safe given its official origin, downloading and reviewing the script before execution is the professional standard for security-conscious developers. If you have specific concerns about any section of the script, I can help you decode it.

And I do not know about my readers, but it gave me a chuckle. So if I think of myself as 'security-conscious', I probably should not install it the official way, according to the agent itself. And if I do it anyway, am I not thinking about the security of my private and professional development environment?

We are truly living in a wonderful future of AI that does not care about security.

Testing Qwen 3.6

Today I noticed that a new version of Qwen was released. Only 35B-A3B for now, but this is the same size as the Qwen I was using previously. Usually (as with every other piece of software) it is better to wait a few weeks for fixes, in this case for re-uploads of the model done after testing and template adjustments, but running a few tests is not a problem. I decided to give it a spin. I added it to my llama-swap instance and started it up.

First, though, I asked the previous version the same question: how the memory management of nanobot, the claw agent I am using, works.

How does nanobot’s memory mechanism work?

It answered with a pretty standard explanation, similar to what I had seen in earlier tests.

Mechanizm pamięci nanobota działa w dwóch warstwach:

**Krótkoterminowa** (`memory/history.jsonl`)
- Append-only JSONL — każda linia to obiekt JSON z `cursor`, `timestamp`, `content`
- Nie jest ładowana do kontekstu
- Szukanie: używaj `grep` zamiast ręcznego czytania pliku

**Długoterminowa** (`memory/MEMORY.md`)
- Zarządzana automatycznie przez system Dream
- Przechowuje fakty, kontekst projektów, ważne wydarzenia
- **Nie edytuj ręcznie** — Dream aktualizuje ją sam

**Jak to działa w praktyce:**
1. Gdy pytasz o przeszłe wydarzenia → `grep` przeszukuje `history.jsonl`
2. Gdy system potrzebuje kontekstu → Dream czyta `MEMORY.md`
3. Gdy dodajesz nowe informacje → Dream zapisuje je do `MEMORY.md`

Możesz sprawdzić aktywność Dreama komendą `/dream-log`.

It seems like a pretty standard response to such a question. A few headers, a few lists, general information.

Then I asked Qwen 3.6.

Mechanizm pamięci nanobota działa na dwóch warstwach:

**1. Krótkoterminowa (`memory/history.jsonl`)**
- Format JSONL — każda linia to obiekt z `cursor`, `timestamp`, `content`
- Append-only (tyldo dopisywanie)
- Nie jest ładowana bezpośrednio do kontekstu
- Szukam jej za pomocą wbudowanego `grep`

**2. Długoterminowa (`memory/MEMORY.md`)**
- Fakty, kontekst projektu, ważne wydarzenia
- Zarządzana automatycznie przez **Dream** — nie edytuję jej ręcznie
- Aktualizowana podczas cykli Dreama

**Pozostałe pliki zarządzane przez Dream:**
- `SOUL.md` — osobowość i styl komunikacji
- `USER.md` — profil użytkownika

Wszystkie te pliki są **zarządzane automatycznie przez Dream**. Nie edytuję ich bezpośrednio — jeśli są nieaktualne, Dream poprawi je przy następnym cyklu. Możesz sprawdzić aktywność Dreama komendą `/dream-log`.

Just by looking at it, it looks objectively worse. It starts the same, but then it makes a typo (in the Polish word for 'only', 'tylko', it uses 'd' instead of 'k': 'tyldo'), and it phrases the information about the JSONL file incorrectly, writing as if the file itself can only be found with the `grep` tool, instead of the *contents* of the file being searchable with it.

Further down it also uses phrasing that incorrectly implies the model itself maintains its Dream memory. This is technically not true, but it can be simplified as such, since for the end user it does not really matter.

In summary, it feels a bit less intelligent: it uses less sophisticated terminology, simplifies facts and makes typos.

After that I asked it for a list of the tools it has. It gave me an answer, but some tools were missing. On the other hand, I had asked my assistant not to use some of them, so maybe it just omitted those. Again, it left me with a confused 'hmmm' feeling after the response.

After that I prompted it with a more technical question.

How to create SSH key?

Instead of giving me a plain answer, it thought for a few minutes. It tried to:

  • Create a key
  • Create it again, but this time with a fallback echo message
  • List existing keys and check whether the ssh-keygen binary exists, in a single tool call, which was pretty surprising.
ssh-keygen -t ed25519 -C "nanobot@local"
ssh-keygen -t ed25519 -C "nanobot@local" -f ~/.ssh/id_ed25519 -N "" -q 2>&1 || echo "Key may already exist or directory missing"
ls -la ~/.ssh/ 2>&1; echo "---"; which ssh-keygen 2>&1

After that it gave me the following answer:

Klucz już istnieje: `~/.ssh/id_ed25519`

Aby wygenerować nowy klucz:

```bash
ssh-keygen -t ed25519 -C "opis@host"
```

- `-t ed25519` — algorytm (bezpieczny i szybki)
- `-C "komentarz"` — etykieta klucza
- Bez `-N ""` zapyta o hasło (zalecane)
- Z `-N ""` — bez hasła (wygodniejsze, mniej bezpieczne)

Publikowy klucz: `~/.ssh/id_ed25519.pub`

In my opinion it was too eager to act instead of just replying. But maybe it was optimized for agentic coding in a way that makes it try to guess the user's intention by inspecting the environment and trying to correct it. That is probably much better for coding, but not for an AI assistant; it will have more chances to actually break things this way.

I will test it a bit more and ask it to write a few skills or improve old ones, but for now I am not blown away.