Just out of curiosity, I tested the Qwen3.6-27B model with MTP speculative decoding and 6 draft tokens. The whole script looked like this:
```sh
docker run --rm --name="${2:-qwen3.6-medium-vllm}" \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v /cache/huggingface:/root/.cache/huggingface \
  -v /cache/vllm:/root/.cache/vllm \
  --env "HF_TOKEN=$3" \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
  -p "${1:-8000}":8000 \
  --ipc=host \
  --entrypoint '/bin/sh' \
  vllm/vllm-openai-rocm:nightly \
  -c "pip install fastokens; vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
    --gpu-memory-utilization 0.48 \
    --dtype half \
    --optimization-level 1 \
    --enable-prefix-caching \
    --performance-mode interactivity \
    --kv-cache-dtype fp8_e4m3 \
    --language-model-only \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{ \"method\": \"mtp\", \"num_speculative_tokens\": 2}' \
    --max-num-seqs 1 \
    --tokenizer-mode fastokens \
    --max-num-batched-tokens 262144 \
    #--max-model-len 262144"
```
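To try other draft-token counts without hand-escaping quotes inside the outer `sh -c` string, the JSON for `--speculative-config` can be generated with `printf` first. This is a minimal sketch, not the script used for the run: `NUM_SPEC_TOKENS`, `SPEC_CONFIG` and `INNER_CMD` are hypothetical names, and most of the `vllm serve` flags are left out for brevity.

```sh
#!/bin/sh
# Hypothetical helper: build the --speculative-config JSON from a number
# instead of writing \"...\" escapes by hand inside the outer "sh -c" string.
NUM_SPEC_TOKENS="${NUM_SPEC_TOKENS:-2}"

# printf fills in the draft-token count; the JSON contains only double quotes,
# so wrapping it in single quotes on the inner command line is safe.
SPEC_CONFIG=$(printf '{"method": "mtp", "num_speculative_tokens": %d}' "$NUM_SPEC_TOKENS")

# Inner command for `docker run ... -c "$INNER_CMD"` (other serve flags omitted).
INNER_CMD="pip install fastokens; vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound --speculative-config '$SPEC_CONFIG' --max-num-seqs 1"

echo "$INNER_CMD"
```

Running it with `NUM_SPEC_TOKENS=6` set in the environment prints the same command with `"num_speculative_tokens": 6`, so the draft count lives in one place.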
These are the results from the vLLM bench tool:
```
============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  3009.77
Total input tokens:                      2915
Total generated tokens:                  25600
Request throughput (req/s):              0.03
Output token throughput (tok/s):         8.51
Peak output token throughput (tok/s):    4.00
Peak concurrent requests:                2.00
Total token throughput (tok/s):          9.47
---------------Time to First Token----------------
Mean TTFT (ms):                          808.65
Median TTFT (ms):                        819.95
P99 TTFT (ms):                           842.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.83
Median TPOT (ms):                        110.50
P99 TPOT (ms):                           145.64
---------------Inter-token Latency----------------
Mean ITL (ms):                           400.91
Median ITL (ms):                         407.92
P99 ITL (ms):                            411.89
==================================================
```
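The summary numbers can be cross-checked with a bit of arithmetic. The sketch below recomputes throughput and per-request latency from the raw totals with `awk`. The `tokens_per_chunk` line is an assumption-laden estimate: it supposes the benchmark client records one ITL sample per streamed chunk, so with speculative decoding ITL / TPOT only roughly approximates how many tokens arrive per chunk.

```sh
#!/bin/sh
# Recompute the headline numbers from the raw totals in the benchmark output.
# tokens_per_chunk = ITL / TPOT is only a rough estimate and assumes one ITL
# sample per streamed chunk; it is not a metric reported by vLLM itself.
report=$(awk 'BEGIN {
  duration_s = 3009.77; requests = 100
  input_tokens = 2915; output_tokens = 25600
  ttft_ms = 808.65; tpot_ms = 114.83; itl_ms = 400.91

  per_request = output_tokens / requests
  printf "output_tok_per_s=%.2f\n", output_tokens / duration_s
  printf "total_tok_per_s=%.2f\n", (input_tokens + output_tokens) / duration_s
  printf "tokens_per_request=%d\n", per_request
  # TTFT plus (N - 1) * TPOT should roughly match the wall-clock time per request.
  printf "est_latency_s=%.2f\n", (ttft_ms + (per_request - 1) * tpot_ms) / 1000
  printf "measured_latency_s=%.2f\n", duration_s / requests
  printf "tokens_per_chunk=%.2f\n", itl_ms / tpot_ms
}')
printf '%s\n' "$report"
```

The estimated per-request latency (about 30.09 s) lines up with the measured 30.10 s, and the reported 8.51 and 9.47 tok/s follow directly from the totals. With `num_speculative_tokens` set to 2, a single decoding step can emit at most 3 tokens (two accepted drafts plus one from the target model), so a ratio near 3.5 would suggest that chunks do not map one-to-one onto decoding steps, or that the run used more draft tokens than the script shows.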
Not much better, even though the GGUF MTP model with the same settings is 6 times faster.
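For intuition about how much the draft-token count alone can help, the standard speculative decoding analysis (Leviathan et al., 2023) estimates the tokens emitted per target-model step as E = (1 - a^(k+1)) / (1 - a) for k draft tokens, assuming each draft token is accepted independently with probability a. Real MTP acceptance is not independent per position, `expected_tokens` is a hypothetical helper name, and the acceptance rate 0.7 below is a made-up example value, not something measured in these runs.

```sh
#!/bin/sh
# Textbook estimate of tokens emitted per target-model forward pass with k draft
# tokens, assuming independent per-token acceptance with probability a:
#   E = (1 - a^(k+1)) / (1 - a), and E = k + 1 when every draft is accepted.
expected_tokens() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    if (a == 1) { printf "%.2f\n", k + 1 } else { printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a) }
  }'
}

# 0.7 is an illustrative acceptance rate only.
for k in 2 6; do
  echo "k=$k expected_tokens=$(expected_tokens 0.7 "$k")"
done
```

Under this simple model, going from 2 to 6 draft tokens at a = 0.7 raises the expected tokens per step only from about 2.19 to about 3.06, while every extra draft token adds work to each step, so a larger draft count does not translate into proportionally higher throughput.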

