vLLM has a very slow time to first token (TTFT). On an AMD Ryzen AI Max+ 395 it can take as long as 10 minutes, depending on the exact settings and model size.
To improve my own experience with these tools, I wanted to compare llama.cpp and vLLM on startup time and performance. But first I needed to optimize vLLM's startup, since it is even slower to start than it was in v0.17.
Performance and Startup Time optimizations
Going through the vLLM documentation, tutorials, and other articles, I came up with several possible ways of optimizing startup and token generation for my specific usage.
For example, these settings may have an impact on startup time:
- pre-downloading the model (of course!)
- vLLM cache sharing
- context size
- data type size
- the enforce-eager switch
- the prefix caching switch
- optimization level
- safetensors loading strategy
On the other hand, these settings may or may not have an impact on token generation speed:
- performance mode
- optimization level
- the prefix caching switch
Some settings overlap, but that makes sense: work done during startup makes startup slower, but it does not have to be done at runtime, where it would slow down actual usage.
Testing methodology
I created the following test script for running my model:
docker run --rm --name="${1:-vllm-startup-test}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e VLLM_ROCM_USE_AITER=1 \
-p "${2:-8000}":8000 \
--ipc=host \
vllm/vllm-openai-rocm:nightly \
cyankiwi/Qwen3.5-2B-AWQ-4bit \
--gpu-memory-utilization 0.115
Depending on which switch/argument I am testing, new lines are added. For example, if I want to test the impact of the speculative decoding configuration, I add the following line to the above script:
--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'
Sometimes 11.5% of GPU memory was too little to start the vLLM engine with a given setting (e.g. speculative decoding needs a bit more), so it was increased slightly, to 13% or 15%.
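The memory fractions translate directly into gigabytes of the machine's unified memory. A quick sketch, assuming roughly 120 GB is visible to the GPU (an assumption inferred from the VRAM numbers reported later, where 11.5% corresponds to 13.8 GB):

```python
# Convert --gpu-memory-utilization fractions into gigabytes.
# TOTAL_GPU_MEMORY_GB is an assumption inferred from the measurements below
# (11.5% -> 13.8 GB); adjust for the memory actually visible to your GPU.
TOTAL_GPU_MEMORY_GB = 120.0

def utilization_to_gb(fraction: float) -> float:
    """Return how many GB vLLM will claim for a given utilization fraction."""
    return round(fraction * TOTAL_GPU_MEMORY_GB, 2)

print(utilization_to_gb(0.115))  # baseline runs -> 13.8
print(utilization_to_gb(0.167))  # speculative decoding runs -> 20.04
```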
To test how long it takes vLLM to start up, I use the following Python script:
#!/usr/bin/env python3
"""
Script to start vLLM via bash script and measure startup time until /health endpoint responds.
"""
import os
import subprocess
import sys
import time
from typing import Optional

import requests

CONTAINER_NAME = 'vllm-startup-test'


def get_spinner_frames():
    """Return spinner animation frames."""
    return ['⣾', '⣽', '⣻', '⢿', '⡿', '⣟', '⣯', '⣷']


def start_vllm_server(bash_script_path: str, **kwargs) -> subprocess.Popen:
    """Start vLLM server using external bash script."""
    if not os.path.exists(bash_script_path):
        raise FileNotFoundError(f"Bash script not found: {bash_script_path}")
    cmd = ['/bin/bash', bash_script_path, CONTAINER_NAME]
    for key, value in kwargs.items():
        cmd.extend([f"--{key}", str(value)])
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )
    return process


def wait_for_health_endpoint(
    base_url: str,
    process: subprocess.Popen,
    endpoint: str = "/health",
    polling_interval: float = 1.0
) -> Optional[float]:
    """Wait for the vLLM /health endpoint to respond, showing a spinner animation."""
    full_url = f"{base_url}{endpoint}"
    start_time = time.time()
    spinner_frames = get_spinner_frames()
    frame_index = 0
    current_line_length = 0
    timeout = 60 * 20
    try:
        while time.time() - start_time < timeout:
            # Check if the process exited before the health check succeeded
            process_returncode = process.poll()
            if process_returncode is not None:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                if process_returncode != 0:
                    print(f"\r✗ vLLM server process exited with error code: {process_returncode}", flush=True)
                    raise RuntimeError(f"vLLM server process failed with exit code {process_returncode}")
                print("\r✓ vLLM server process exited normally", flush=True)
                raise RuntimeError("vLLM server process exited before health check")
            # Print spinner animation (overwrite same line)
            spinner_frame = spinner_frames[frame_index]
            status_msg = f"\r⏳ Waiting for {full_url}... "
            print(f"{spinner_frame}{status_msg}", end='', flush=True)
            # Remember line length so previous status text can be cleared
            current_line_length = len(f"{spinner_frame}{status_msg}")
            try:
                response = requests.get(full_url, timeout=5)
                if response.status_code == 200:
                    # Clear the spinner line
                    print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                    return time.time() - start_time
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                print(f"\r✗ Health check failed with status code: {response.status_code}", flush=True)
                print(f"  Response: {response.text[:200]}", flush=True)
            except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
                pass  # Server not up yet, spinner continues
            except requests.exceptions.RequestException as e:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                print(f"\r⚠ Health check error: {e}", flush=True)
            frame_index = (frame_index + 1) % len(spinner_frames)
            time.sleep(polling_interval)
        # Timeout
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
        elapsed = time.time() - start_time
        print(f"\r✗ Timeout waiting for health endpoint after {elapsed:.2f} seconds", flush=True)
        return None
    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...", flush=True)
        raise
    finally:
        # Clear spinner line on exit
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)


def measure_vllm_startup(bash_script_path: str, **kwargs) -> float:
    """Measure vLLM startup time from script start to health endpoint response."""
    base_url = "http://localhost:8000"
    start_time = time.time()
    process = start_vllm_server(bash_script_path, **kwargs)
    try:
        startup_time = wait_for_health_endpoint(base_url, process)
        total_time = time.time() - start_time
        if startup_time is not None:
            print(f"\rTotal elapsed time: {total_time:.2f} seconds")
        else:
            print(f"\n{'=' * 60}")
            print("vLLM startup timed out!")
            print(f"  Total elapsed time: {total_time:.2f} seconds")
            print(f"{'=' * 60}\n")
            raise TimeoutError("vLLM startup timeout")
        return startup_time
    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...")
        process.terminate()
        raise
    finally:
        if process.poll() is None:
            process.terminate()
            process.wait()
        # Stop the docker container
        subprocess.run(
            ['docker', 'stop', CONTAINER_NAME],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True
        )


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Measure vLLM startup time via health endpoint"
    )
    parser.add_argument(
        "bash_script",
        help="Path to the bash script that starts vLLM"
    )
    parser.add_argument(
        "--args",
        nargs="*",
        help="Additional arguments to pass to vLLM server"
    )
    args = parser.parse_args()
    kwargs = {}
    if args.args:
        for arg in args.args:
            if "=" in arg:
                key, value = arg.split("=", 1)
                kwargs[key] = value
            else:
                kwargs[arg] = True
    try:
        measure_vllm_startup(bash_script_path=args.bash_script, **kwargs)
        sys.exit(0)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
It is pretty self-explanatory, but here is a quick summary:
- it runs the previous bash script
- it notes the time before the bash script was executed
- it waits for the bash script to finish with an error,
- or for vLLM to start responding on the /health endpoint
- when it responds, it calculates the startup time
The startup time of the bash script without any specific settings or caches will be our baseline for comparison.
vLLM of version 0.19.2rc1.dev205+g07351e088 was used for testing.
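The averages and gain percentages reported below can be reproduced from the raw measurements. A minimal sketch (the helper names are my own, not part of the test script):

```python
from statistics import mean

def average_startup(times: list[float]) -> float:
    """Average startup time over a series of runs, in seconds."""
    return round(mean(times), 2)

def gain_percent(baseline: float, optimized: float) -> float:
    """Relative startup-time gain versus a baseline, in percent."""
    return round((baseline - optimized) / baseline * 100, 1)

baseline_runs = [542.76, 522.71, 517.69, 517.69, 514.69]  # default settings
print(average_startup(baseline_runs))    # -> 523.11
print(gain_percent(523.11, 374.76))      # dtype=half vs. default -> 28.4
```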
Results
Below is a description of each tested setting and its impact on vLLM startup time.
Default settings
No specific settings besides the bare minimum for vLLM to run Qwen3.5-2B-AWQ-4bit.
docker run --rm --name="${1:-vllm-startup-test}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e VLLM_ROCM_USE_AITER=1 \
-p "${2:-8000}":8000 \
--ipc=host \
vllm/vllm-openai-rocm:nightly \
cyankiwi/Qwen3.5-2B-AWQ-4bit \
--gpu-memory-utilization 0.115
| No. | Time [s] |
| --- | --- |
| 1 | 542.76 |
| 2 | 522.71 |
| 3 | 517.69 |
| 4 | 517.69 |
| 5 | 514.69 |
xychart-beta
title "Default settings"
x-axis [1,2,3,4,5]
y-axis "Startup time [s]" 510 --> 550
bar [542.76,522.71,517.69,517.69,514.69]
line [542.76,522.71,517.69,517.69,514.69]

On average: 523.11 s.
Default settings with --dtype
The following setting was added to the previous script:
--dtype half
This should cut memory usage in half, and since vLLM preallocates memory during startup, copying the entire model with conversion, it should make startup faster.
| No. | Time [s] |
| --- | --- |
| 1 | 376.52 |
| 2 | 375.54 |
| 3 | 377.58 |
| 4 | 379.56 |
| 5 | 364.59 |
xychart-beta
title "Data type: half"
x-axis [1,2,3,4,5]
y-axis "Startup time [s]" 350 --> 390
bar [376.52,375.54,377.58,379.56,364.59]
line [376.52,375.54,377.58,379.56,364.59]

On average: 374.76 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Data type half startup gain"
x-axis [default, dtype=half]
y-axis "Startup time [s]" 0 --> 600
bar [523, 375]

Gain is about 28%.
Pre-downloaded model
Downloading the model every time is not a good solution, even if you have an HF_TOKEN (or a similar way of authorizing against a different model source). For bigger models it takes a lot of time, much more than loading from disk. The altered script mounts the Hugging Face cache directory into the vLLM container:
-v ~/.cache/huggingface:/root/.cache/huggingface
All further tests were done with the pre-downloaded model, and they will be compared against the average startup time of vLLM with this script.
| No. | Time [s] |
| --- | --- |
| 1 | 488.67 |
| 2 | 487.66 |
| 3 | 488.68 |
| 4 | 484.67 |
| 5 | 485.68 |
| 6 | 488.66 |
| 7 | 488.69 |
| 8 | 485.66 |
| 9 | 486.68 |
| 10 | 488.66 |
| 11 | 490.69 |
| 12 | 487.70 |
| 13 | 490.69 |
| 14 | 489.67 |
| 15 | 490.66 |
| 16 | 490.67 |
| 17 | 486.67 |
| 18 | 485.65 |
| 19 | 487.68 |
| 20 | 491.73 |
xychart-beta
title "Pre-downloaded model"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 480 --> 495
bar [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]
line [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]

On average: 488.28 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Pre-downloaded model startup gain"
x-axis [default, pre-download]
y-axis "Startup time [s]" 0 --> 600
bar [523, 488]

Gain is about 7%.
Data type half
The altered script adds --dtype half to the pre-downloaded model script. From now on, every script treats the pre-downloaded model script as the base for comparison; I do not want to call the Hugging Face API every time I run a test.
| No. | Time [s] |
| --- | --- |
| 1 | 333.48 |
| 2 | 329.50 |
| 3 | 335.51 |
| 4 | 328.48 |
| 5 | 331.49 |
| 6 | 330.50 |
| 7 | 332.52 |
| 8 | 335.48 |
| 9 | 330.51 |
| 10 | 341.51 |
| 11 | 336.51 |
| 12 | 334.51 |
| 13 | 332.49 |
| 14 | 330.51 |
| 15 | 329.49 |
| 16 | 333.51 |
| 17 | 332.52 |
| 18 | 342.52 |
| 19 | 332.50 |
| 20 | 338.50 |
xychart-beta
title "Data type 'half'"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 320 --> 350
bar [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]
line [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]

On average: 333.60 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, dtype=half]
y-axis "Startup time [s]" 0 --> 500
bar [488, 334]

Gain is about 32%.
vLLM cache
vLLM caches a few things to disk during startup. If you run it in a Docker container, that cache gets lost, because the container is destroyed and recreated every time. To remedy this, the cache directory needs to be mounted across reruns:
-v /root/.cache/vllm:/root/.cache/vllm
| No. | Time [s] |
| --- | --- |
| 1 | 368.49 |
| 2 | 364.49 |
| 3 | 376.51 |
| 4 | 356.48 |
| 5 | 355.47 |
| 6 | 362.46 |
| 7 | 362.47 |
| 8 | 366.48 |
| 9 | 353.46 |
| 10 | 366.46 |
| 11 | 365.45 |
| 12 | 365.45 |
| 13 | 364.45 |
| 14 | 364.45 |
| 15 | 363.46 |
| 16 | 361.45 |
| 17 | 363.46 |
| 18 | 358.44 |
| 19 | 364.44 |
| 20 | 360.44 |
xychart-beta
title "vLLM cache"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 350 --> 380
bar [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]
line [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]

On average: 392.92 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, vllm cache]
y-axis "Startup time [s]" 0 --> 500
bar [488, 393]

Gain is about 19%.
Small context
vLLM assigns GPU memory for the context and KV cache ahead of time, during startup. Requesting a smaller context for your model should, in theory, speed things up, at least a bit.
This script adds the max-model-len setting to the pre-downloaded model script:
--max-model-len 2k
| No. | Time [s] |
| --- | --- |
| 1 | 488.69 |
| 2 | 482.68 |
| 3 | 489.68 |
| 4 | 488.68 |
| 5 | 478.68 |
| 6 | 486.67 |
| 7 | 478.68 |
| 8 | 487.69 |
| 9 | 486.72 |
| 10 | 490.67 |
| 11 | 477.68 |
| 12 | 489.69 |
| 13 | 487.69 |
| 14 | 487.68 |
| 15 | 488.70 |
| 16 | 489.69 |
| 17 | 488.75 |
| 18 | 481.69 |
| 19 | 491.68 |
| 20 | 480.66 |
xychart-beta
title "Small context startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 475 --> 495
bar [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]
line [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]

On average: 486.14 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, small context]
y-axis "Startup time [s]" 0 --> 500
bar [488, 486]

Gain is less than 0.5%.
With model tools
If you plan to use your model as part of an agent or to run an assistant, tools are necessary. Out of curiosity I wanted to test whether these settings have any noticeable impact on vLLM startup time.
The following settings were added to the pre-downloaded model script:
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
| No. | Time [s] |
| --- | --- |
| 1 | 486.65 |
| 2 | 491.69 |
| 3 | 491.67 |
| 4 | 486.67 |
| 5 | 485.65 |
| 6 | 489.68 |
| 7 | 486.66 |
| 8 | 487.67 |
| 9 | 495.73 |
| 10 | 495.69 |
| 11 | 478.67 |
| 12 | 481.67 |
| 13 | 490.71 |
| 14 | 494.73 |
| 15 | 490.73 |
| 16 | 482.72 |
| 17 | 487.70 |
| 18 | 497.67 |
| 19 | 487.64 |
| 20 | 494.67 |
xychart-beta
title "Tools startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 475 --> 510
bar [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]
line [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]

On average: 489.23 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, tools]
y-axis "Startup time [s]" 0 --> 500
bar [488, 489]

Enabling tools has a negative impact of less than 0.2%, basically a rounding error.
Quantized kv cache
One of the problems when running inference is KV cache size. One way to remedy this is to quantize the cache data so it takes less space. This should not impact startup time directly, but if the KV cache is smaller, the engine has less space to allocate and may start sooner.
This script adds the following setting on top of the pre-downloaded model script:
--kv-cache-dtype fp8_e4m3
| No. | Time [s] |
| --- | --- |
| 1 | 496.71 |
| 2 | 490.72 |
| 3 | 492.73 |
| 4 | 493.70 |
| 5 | 482.72 |
| 6 | 476.71 |
| 7 | 488.72 |
| 8 | 492.74 |
| 9 | 487.73 |
| 10 | 491.75 |
| 11 | 490.70 |
| 12 | 491.71 |
| 13 | 481.71 |
| 14 | 491.73 |
| 15 | 497.69 |
| 16 | 495.70 |
| 17 | 487.68 |
| 18 | 490.68 |
| 19 | 492.70 |
| 20 | 490.70 |
xychart-beta
title "Quantized KV cache startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 475 --> 500
bar [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]
line [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]

On average: 490.26 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, kv cache fp8]
y-axis "Startup time [s]" 0 --> 500
bar [488, 490]

Gain is -0.4%.
Pre-download disable optimization
vLLM has three optimization levels that can be used to optimize either startup or inference speed. To optimize startup we can set it to 0, which should get the engine ready to respond to queries the quickest. This script adds the following line to the pre-downloaded model script:
--optimization-level 0
| No. | Time [s] |
| --- | --- |
| 1 | 434.60 |
| 2 | 433.59 |
| 3 | 433.59 |
| 4 | 434.58 |
| 5 | 433.58 |
| 6 | 436.59 |
| 7 | 433.59 |
| 8 | 434.58 |
| 9 | 435.59 |
| 10 | 434.60 |
| 11 | 431.58 |
| 12 | 433.57 |
| 13 | 423.58 |
| 14 | 417.57 |
| 15 | 423.59 |
| 16 | 423.58 |
| 17 | 424.60 |
| 18 | 419.58 |
| 19 | 432.60 |
| 20 | 431.58 |
xychart-beta
title "Optimization 0 startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 415 --> 440
bar [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]
line [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]

On average: 430.33 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, optimization 0]
y-axis "Startup time [s]" 0 --> 500
bar [488, 430]

Gain is around 12%.
Enable prefix cache
This script adds the following line to the pre-downloaded model script:
--enable-prefix-caching
| No. | Time [s] |
| --- | --- |
| 1 | 479.66 |
| 2 | 474.67 |
| 3 | 478.67 |
| 4 | 487.68 |
| 5 | 487.68 |
| 6 | 477.67 |
| 7 | 487.72 |
| 8 | 488.69 |
| 9 | 488.67 |
| 10 | 487.69 |
| 11 | 485.68 |
| 12 | 488.69 |
| 13 | 479.69 |
| 14 | 487.68 |
| 15 | 485.66 |
| 16 | 487.66 |
| 17 | 477.68 |
| 18 | 490.69 |
| 19 | 476.67 |
| 20 | 487.67 |
xychart-beta
title "Prefix cache startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 470 --> 495
bar [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]
line [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]

On average: 484.33 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, prefix cache]
y-axis "Startup time [s]" 0 --> 500
bar [488, 484]

Gain is around 0.8%.
Enforce eager
This setting decides whether CUDA graphs are disabled. If they are, the graphs won't be captured during startup, which should speed it up. This script adds the following line to the pre-downloaded model script:
--enforce-eager
| No. | Time [s] |
| --- | --- |
| 1 | 426.60 |
| 2 | 443.60 |
| 3 | 435.59 |
| 4 | 472.63 |
| 5 | 430.57 |
| 6 | 432.60 |
| 7 | 430.57 |
| 8 | 423.59 |
| 9 | 432.60 |
| 10 | 434.58 |
| 11 | 431.59 |
| 12 | 430.59 |
| 13 | 435.58 |
| 14 | 419.57 |
| 15 | 426.62 |
| 16 | 434.58 |
| 17 | 441.60 |
| 18 | 437.59 |
| 19 | 431.58 |
| 20 | 435.59 |
xychart-beta
title "Enforce eager startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 420 --> 475
bar [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]
line [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]

On average: 434.39 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, enforce eager]
y-axis "Startup time [s]" 0 --> 500
bar [488, 434]

Gain is around 11%.
Lazy safetensors loading
This setting enforces lazy loading of safetensors. It may help if your model is stored on fast disk storage rather than, for example, network storage with high latency. This script adds the following line to the pre-downloaded model script:
--safetensors-load-strategy=lazy
| No. | Time [s] |
| --- | --- |
| 1 | 489.68 |
| 2 | 493.70 |
| 3 | 496.70 |
| 4 | 488.69 |
| 5 | 489.69 |
| 6 | 488.68 |
| 7 | 482.69 |
| 8 | 489.70 |
| 9 | 484.69 |
| 10 | 486.69 |
| 11 | 489.69 |
| 12 | 487.68 |
| 13 | 482.69 |
| 14 | 489.69 |
| 15 | 487.67 |
| 16 | 485.68 |
| 17 | 487.67 |
| 18 | 486.69 |
| 19 | 487.70 |
| 20 | 489.67 |
xychart-beta
title "Lazy safetensors load startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 480 --> 500
bar [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]
line [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]

On average: 488.29 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, lazy loading]
y-axis "Startup time [s]" 0 --> 500
bar [488, 488]

There is no gain.
Interactivity
vLLM can be optimized either for handling more parallel requests or for single-user interactivity with a single request. This script tests that setting's impact on startup time, adding the following line to the pre-downloaded model script:
--performance-mode interactivity
| No. | Time [s] |
| --- | --- |
| 1 | 481.70 |
| 2 | 479.70 |
| 3 | 480.68 |
| 4 | 491.68 |
| 5 | 491.69 |
| 6 | 482.69 |
| 7 | 489.69 |
| 8 | 490.73 |
| 9 | 484.71 |
| 10 | 483.69 |
| 11 | 491.70 |
| 12 | 492.70 |
| 13 | 492.69 |
| 14 | 481.69 |
| 15 | 481.68 |
| 16 | 488.69 |
| 17 | 493.71 |
| 18 | 481.67 |
| 19 | 494.70 |
| 20 | 495.70 |
xychart-beta
title "Interactivity startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 475 --> 500
bar [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]
line [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]

On average: 487.59 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, interactivity]
y-axis "Startup time [s]" 0 --> 500
bar [488, 488]

There is basically no gain, which is good: it means we can optimize for single-user usage without compromising startup time.
Speculative decoding MTP
Qwen models have a built-in MTP (multi-token prediction) speculative decoding method. This should speed up inference, but the impact on startup time should be negative. This script adds the following configuration to the existing pre-downloaded model script:
--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'
| No. | Time [s] |
| --- | --- |
| 1 | 522.76 |
| 2 | 518.74 |
| 3 | 510.73 |
| 4 | 518.75 |
| 5 | 521.75 |
| 6 | 517.73 |
| 7 | 531.74 |
| 8 | 526.74 |
| 9 | 511.72 |
| 10 | 526.76 |
| 11 | 519.70 |
| 12 | 521.75 |
| 13 | 521.76 |
| 14 | 524.74 |
| 15 | 530.75 |
| 16 | 516.73 |
| 17 | 521.74 |
| 18 | 515.75 |
| 19 | 536.75 |
| 20 | 522.74 |
xychart-beta
title "Speculative decoding startup time"
x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
y-axis "Startup time [s]" 510 --> 540
bar [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]
line [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]

On average: 521.99 s.
---
config:
xyChart:
showDataLabel: true
showDataLabelOutsideBar: true
---
xychart-beta
title "Startup gain"
x-axis [pre-download, speculative decoding]
y-axis "Startup time [s]" 0 --> 525
bar [488, 522]

Startup is about 7% slower.
Summary
Here is a comparison table of all the settings that were tested and their impact on startup time and GPU memory requirements.
| Settings | Time [s] | Gain [%] | VRAM [%] | VRAM [GB] | T/s |
| --- | --- | --- | --- | --- | --- |
| – | 523.11 | – | 11.5 | 13.8 | 294.32 |
| pre-download | 488.28 | 7 | 11.5 | 13.8 | 296.86 |
| dtype half | 333.60 | 32 | 11.5 | 13.8 | 321.32 |
| vllm cache | 392.92 | 19 | 11.5 | 13.8 | 290.77 |
| small context | 486.14 | 0.5 | 11.5 | 13.8 | 295.68 |
| tools | 489.23 | -0.2 | 11.5 | 13.8 | 291.07 |
| kv-cache-dtype | 490.26 | -0.4 | 11.5 | 13.8 | 268.98 |
| no optimization | 430.33 | 12 | 11.5 | 13.8 | 335.89 |
| prefix cache | 484.33 | 0.8 | 11.5 | 13.8 | 293.79 |
| enforce eager | 434.39 | 11 | 11.5 | 13.8 | 335.12 |
| lazy safetensors loading | 488.29 | 0 | 11.5 | 13.8 | 288.38 |
| interactivity | 487.59 | 0 | 11.5 | 14.04 | 291.89 |
| speculative decoding | 521.99 | -7 | 16.7 | 20.04 | 266.91 |
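The Gain column can be cross-checked against the Time column; every row after pre-download uses the pre-download run (488.28 s) as its baseline. A quick sanity check:

```python
# Recompute the Gain [%] column from the Time [s] column.
# Baseline is the pre-downloaded-model run (488.28 s).
BASELINE = 488.28

times = {
    "dtype half": 333.60,
    "no optimization": 430.33,
    "enforce eager": 434.39,
    "speculative decoding": 521.99,
}

for name, t in times.items():
    gain = (BASELINE - t) / BASELINE * 100
    print(f"{name}: {gain:.0f}%")  # matches the table: 32%, 12%, 11%, -7%
```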
From the table you can see that the lazy safetensors loading strategy is not worth using.
Speculative decoding hardly seems worth the slower startup and lower T/s. It does cut the Time To First Token (TTFT) a bit, from:
---------------Time to First Token----------------
Mean TTFT (ms):    418.84
Median TTFT (ms):  112.23
P99 TTFT (ms):     2939.18
to:
---------------Time to First Token----------------
Mean TTFT (ms):    395.08
Median TTFT (ms):  186.63
P99 TTFT (ms):     2305.16
But it still does not seem to be worth all that trouble.
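Putting the two benchmark snippets above into a number makes the trade-off concrete, a quick sketch:

```python
# Mean TTFT with and without MTP speculative decoding, in ms
# (taken from the benchmark output above).
ttft_baseline_ms = 418.84
ttft_speculative_ms = 395.08

improvement = (ttft_baseline_ms - ttft_speculative_ms) / ttft_baseline_ms * 100
print(f"Mean TTFT improvement: {improvement:.1f}%")  # -> 5.7%
```

A roughly 5.7% mean-TTFT improvement, while the median TTFT actually got worse (112.23 ms to 186.63 ms), supports the conclusion that it is not worth the trouble here.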
Tools slow things down slightly, but they are necessary for agentic usage.
Quantizing the KV cache does not seem to be worth it either.
What is worth doing is setting --dtype half, as it cuts startup time considerably and speeds up inference.
Caching vLLM files is also a quick and easy win.
The biggest surprise to me was --optimization-level 0: it seems to speed up inference and cut startup time. By default the optimization level is 2, so it is a bit counterintuitive that disabling it speeds things up at runtime instead of slowing throughput down.
Enabling prefix caching and disabling CUDA graphs also seem like no-brainers, since they do not slow down inference, use the same amount of memory, and have a positive impact on startup time.
In essence, if you want to optimize vLLM startup time, you should use the following parameters:
# Mounting the vLLM cache avoids recomputing the same files every time (~19% speedup).
# Mounting the Hugging Face cache avoids re-downloading the same model every time.
# --dtype half cuts startup time by ~32%.
# --optimization-level 0 gives a ~12% startup time gain.
# --enforce-eager (disabling CUDA graphs) cuts startup by ~11%.
# --enable-prefix-caching can be considered; 0.8% is not much, but there are no drawbacks.
docker run --rm --name="${1:-vllm-startup-test}" \
--group-add=video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device /dev/kfd \
--device /dev/dri \
-v /root/.cache/vllm:/root/.cache/vllm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e VLLM_ROCM_USE_AITER=1 \
-p "${2:-8000}":8000 \
--ipc=host \
vllm/vllm-openai-rocm:nightly \
cyankiwi/Qwen3.5-2B-AWQ-4bit \
--gpu-memory-utilization 0.115 \
--dtype half \
--optimization-level 0 \
--enforce-eager \
--enable-prefix-caching
The rest of the parameters either have a large negative impact at runtime or no impact at all.
Final words
Keep in mind that these results may be specific to the particular model I tested; a different model may behave differently. Results may also vary between versions of the vLLM engine. I tested version 0.19.2rc1.dev205+g07351e088.

