Running Qwen 3.6 on vLLM with 6 draft tokens

Out of curiosity, I tested the Qwen3.6-27B model with MTP speculative decoding using 6 draft tokens. The whole script looked like this:

docker run --rm --name="${2:-qwen3.6-medium-vllm}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /cache/huggingface:/root/.cache/huggingface \
    -v /cache/vllm:/root/.cache/vllm \
    --env "HF_TOKEN=$3" \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
    -p "${1:-8000}":8000 \
    --ipc=host \
    --entrypoint '/bin/sh' \
    vllm/vllm-openai-rocm:nightly \
    -c "pip install fastokens; vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
    --gpu-memory-utilization 0.48 \
    --dtype half \
    --optimization-level 1 \
    --enable-prefix-caching \
    --performance-mode interactivity \
    --kv-cache-dtype fp8_e4m3 \
    --language-model-only \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{ \"method\": \"mtp\", \"num_speculative_tokens\": 2}' \
    --max-num-seqs 1 \
    --tokenizer-mode fastokens \
    --max-num-batched-tokens 262144 \
    #--max-model-len 262144"
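Note that the script as shown passes "num_speculative_tokens": 2, while the run described here used 6 draft tokens. As a minimal sketch, the JSON for --speculative-config could be generated from a single number so the value only has to be changed in one place. The helper name spec_config is hypothetical; the keys "method" and "num_speculative_tokens" are the ones already used in the script above.

```shell
#!/bin/sh
# spec_config N: print the MTP speculative-decoding JSON for N draft tokens.
# (Hypothetical helper; only the two keys from the script above are emitted.)
spec_config() {
    printf '{"method": "mtp", "num_speculative_tokens": %d}\n' "$1"
}

# Example: the setting for 6 draft tokens.
spec_config 6
```

The output could then be passed as --speculative-config "$(spec_config 6)" when vllm serve is called directly. Inside the sh -c "..." string used above, the inner quotes would still need escaping as in the original script.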

These are the results from the vLLM bench tool:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  3009.77   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         8.51      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          9.47      
---------------Time to First Token----------------
Mean TTFT (ms):                          808.65    
Median TTFT (ms):                        819.95    
P99 TTFT (ms):                           842.02    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.83    
Median TPOT (ms):                        110.50    
P99 TPOT (ms):                           145.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           400.91    
Median ITL (ms):                         407.92    
P99 ITL (ms):                            411.89    
==================================================
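As a quick sanity check, the throughput figures can be recomputed from the raw counters in the report. The sketch below only does the arithmetic with awk; the helper name ratio is made up for illustration. The last line divides mean ITL by mean TPOT, which is only a rough estimate of tokens emitted per decode step, and it assumes ITL is measured between streamed chunks while TPOT is averaged per token.

```shell
#!/bin/sh
# Raw counters copied from the benchmark report above.
duration_s=3009.77
input_tokens=2915
output_tokens=25600
requests=100

# ratio NUM DEN: print NUM / DEN rounded to two decimals.
ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

ratio "$requests" "$duration_s"                          # request throughput (req/s)
ratio "$output_tokens" "$duration_s"                     # output token throughput (tok/s)
ratio "$((input_tokens + output_tokens))" "$duration_s"  # total token throughput (tok/s)
ratio 400.91 114.83                                      # mean ITL / mean TPOT (rough tokens per step)
```

The first three values match the reported 0.03 req/s, 8.51 tok/s and 9.47 tok/s, so the report is internally consistent.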

Not much better, even though the GGUF MTP model with the same settings is 6 times faster.
