Running Qwen 3.5 on Strix Halo using vLLM and Docker

I have had the Framework Desktop for 2 months now. It is pretty capable for inference with small contexts, but with longer contexts it degrades quickly: processing a 100k-token context can, for example, take 10 minutes. Because of those limitations I decided to try vLLM again – maybe it will be quicker. Two months is a long time in LLM frameworks; last time I was using vLLM it was at version 0.17, and now the stable release is 0.19.

I did not even attempt to update and run vLLM via Python wheels. It is too much trouble and it is not worth it. If Docker works, you only have to download one image, once. It is pretty large, around 10GB, but downloading it is much faster than the time spent debugging installation errors of Python packages.
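If you want to pre-fetch that image (the same nightly ROCm tag used in the Docker command further down), a plain docker pull is enough:

docker pull vllm/vllm-openai-rocm:nightly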

The easiest solution would be to just move the same command I am using on my Linux host into a Docker container.

For example, this is how I am currently running the 35B Qwen 3.5:

TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.33 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
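Once the server is up, the OpenAI-compatible endpoint can be smoke-tested with curl (a minimal sketch, assuming the default port 8000 and an arbitrary test prompt):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'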

I had thought about that before, but to my surprise, until today it did not occur to me that the special environment variables have to be copied into the container too.

docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -p "${1:-8000}":8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:nightly \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.46 \
  --speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'

And it did finally work!
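A quick way to confirm the container is actually serving (again assuming port 8000 on the host) is to list the models:

curl -s http://localhost:8000/v1/models

It should report cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit as the served model.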

I was able to run Qwen in Docker. Next I need to test the inference speed of vLLM and llama.cpp with a coding agent to compare which one is better for such a load.
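For a first rough comparison, a minimal sketch like the one below should work against both servers, since llama.cpp's llama-server also exposes an OpenAI-compatible API; the prompt and token budget here are arbitrary placeholders, not a real coding-agent workload:

#!/bin/sh
# Rough tokens/second estimate: request a fixed completion budget,
# read the server-reported usage back with jq, divide by wall-clock time.
start=$(date +%s)
tokens=$(curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit",
       "prompt": "Write a long story about a robot.",
       "max_tokens": 512}' \
  | jq '.usage.completion_tokens')
end=$(date +%s)
echo "$tokens tokens in $((end - start)) seconds"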
