Optimizing vLLM startup time

vLLM has a very slow time to first token (TTFT). On an AMD Ryzen AI Max+ 395 it can take as long as 10 minutes, depending on the exact settings and model size.

To improve my own experience with these tools, I need a comparison of llama.cpp and vLLM startup time and performance. But to do that I first needed to optimize vLLM's startup, since it is even slower to start than it was with v0.17.

Performance and startup time optimizations

Going through the vLLM documentation, tutorials, and other articles, I came up with possible ways of optimizing startup and token generation for my specific usage.

For example, these settings may have an impact on startup time:

  • pre-downloading the model (of course!)
  • vLLM cache sharing
  • context size
  • data type size
  • the enforce-eager switch
  • the enable-prefix-caching switch
  • optimization level
  • safetensors loading strategy

On the other hand, these settings may or may not have an impact on token generation speed:

  • performance mode
  • optimization level
  • the enable-prefix-caching switch

Some of these overlap, but that makes sense: work done during startup makes startup slower, but it then does not have to be repeated at runtime, where it would slow down actual usage.

Testing methodology

I created the following test script for running my model:

docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115

Depending on which switch/argument I am testing, new lines are added. For example, to test the impact of the speculative decoding configuration, I add the following line to the above script:

--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'

Sometimes 11.5% of GPU memory allocation was too little to start the vLLM engine with specific settings (e.g. speculative decoding needs a bit more), so it was increased slightly to 13% or 15%.
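In those cases the only change was the memory fraction, for example (the exact value varied per configuration):

--gpu-memory-utilization 0.15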

To test how long it takes for vLLM to start up, I am using the following script written in Python:

#!/usr/bin/env python3
"""
Script to start vLLM via bash script and measure startup time until /health endpoint responds.
"""

import subprocess
import requests
import time
import sys
import os
from typing import Optional

CONTAINER_NAME = 'vllm-startup-test'


def get_spinner_frames():
    """Return spinner animation frames."""
    return ['⣾', '⣽', '⣻', '⢿', '⡿', '⣟', '⣯', '⣷']


def start_vllm_server(bash_script_path: str, **kwargs) -> subprocess.Popen:
    """
    Start vLLM server using external bash script.
    """
    if not os.path.exists(bash_script_path):
        raise FileNotFoundError(f"Bash script not found: {bash_script_path}")

    cmd = ['/bin/bash', bash_script_path, CONTAINER_NAME]

    for key, value in kwargs.items():
        cmd.extend([f"--{key}", str(value)])

    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )

    return process


def wait_for_health_endpoint(
        base_url: str,
        process: subprocess.Popen,
        endpoint: str = "/health",
        polling_interval: float = 1.0
) -> Optional[float]:
    """
    Wait for the vLLM /health endpoint to respond with spinner animation.
    """
    full_url = f"{base_url}{endpoint}"
    start_time = time.time()

    spinner_frames = get_spinner_frames()
    frame_index = 0
    current_line_length = 0
    timeout = 60 * 20

    try:
        while time.time() - start_time < timeout:
            # Check if process has exited with an error
            process_returncode = process.poll()
            if process_returncode is not None:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                if process_returncode != 0:
                    print(f"\r✗ vLLM server process exited with error code: {process_returncode}", flush=True)
                    process.wait()
                    raise RuntimeError(f"vLLM server process failed with exit code {process_returncode}")
                else:
                    print(f"\r✓ vLLM server process exited normally", flush=True)
                    raise RuntimeError("vLLM server process exited before health check")

            # Print spinner animation (overwrite same line)
            spinner_frame = spinner_frames[frame_index]
            status_msg = f"\r⏳ Waiting for {full_url}... "
            # Print the status text first, then the spinner frame, so the frame stays visible
            print(f"{status_msg}{spinner_frame}", end='', flush=True)

            # Track the line length so it can be cleared later
            current_line_length = len(f"{status_msg}{spinner_frame}")

            try:
                response = requests.get(full_url, timeout=5)
                if response.status_code == 200:
                    # Clear the spinner line
                    print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                    elapsed = time.time() - start_time
                    return elapsed
                else:
                    print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                    print(f"\r✗ Health check failed with status code: {response.status_code}", flush=True)
                    print(f"  Response: {response.text[:200]}", flush=True)
            except requests.exceptions.ConnectionError:
                pass  # Spinner continues
            except requests.exceptions.Timeout:
                pass  # Spinner continues
            except requests.exceptions.RequestException as e:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                print(f"\r⚠ Health check error: {e}", flush=True)

            frame_index = (frame_index + 1) % len(spinner_frames)
            time.sleep(polling_interval)

        # Timeout
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
        elapsed = time.time() - start_time
        print(f"\r✗ Timeout waiting for health endpoint after {elapsed:.2f} seconds", flush=True)
        return None

    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...", flush=True)
        raise
    finally:
        # Clear spinner line on exit
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)


def measure_vllm_startup(
        bash_script_path: str,
        **kwargs
) -> float:
    """
    Measure vLLM startup time from script start to health endpoint response.
    """
    base_url = f"http://localhost:{8000}"

    start_time = time.time()

    process = start_vllm_server(bash_script_path, **kwargs)

    try:
        startup_time = wait_for_health_endpoint(base_url, process)

        total_time = time.time() - start_time

        if startup_time is not None:
            print(f"\rTotal elapsed time: {total_time:.2f} seconds")
        else:
            print(f"\n{'='*60}")
            print(f"vLLM startup timed out!")
            print(f"  Total elapsed time: {total_time:.2f} seconds")
            print(f"{'='*60}\n")
            raise TimeoutError("vLLM startup timeout")

        return startup_time

    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...")
        process.terminate()
        raise
    finally:
        if process.poll() is None:
            process.terminate()
            process.wait()

            # stop docker container
            process = subprocess.Popen(
                ['docker', 'stop', CONTAINER_NAME],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                bufsize=1
            )

            process.wait()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Measure vLLM startup time via health endpoint"
    )
    parser.add_argument(
        "bash_script",
        help="Path to the bash script that starts vLLM"
    )
    parser.add_argument(
        "--args",
        nargs="*",
        help="Additional arguments to pass to vLLM server"
    )

    args = parser.parse_args()

    kwargs = {}
    if args.args:
        for arg in args.args:
            if "=" in arg:
                key, value = arg.split("=", 1)
                kwargs[key] = value
            else:
                kwargs[arg] = True

    try:
        startup_time = measure_vllm_startup(
            bash_script_path=args.bash_script,
            **kwargs
        )
        sys.exit(0)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

It is pretty self-explanatory, but here is a quick summary:

  • it runs the previous bash script
  • it notes the time just before the bash script is executed
  • it waits for the bash script to exit with an error,
  • or for vLLM to start responding on the /health endpoint
  • when vLLM responds, it calculates the startup time

The startup time of the bash script without any specific settings or caches will be our baseline for comparing the other scripts.
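A typical measurement run then looks like this (file names are illustrative; each configuration is run several times and the results averaged):

# assumes the Python script is saved as measure_startup.py and the Docker wrapper as vllm-default.sh
for i in $(seq 1 5); do
    python3 measure_startup.py ./vllm-default.sh
done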

vLLM version 0.19.2rc1.dev205+g07351e088 was used for testing.

Results

Below is a description of each tested setting and its impact on vLLM startup time.

Default settings

No specific settings besides the bare minimum for vLLM to run Qwen3.5-2B-AWQ-4bit.

docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115
| No. | Time [s] |
| --- | -------- |
| 1 | 542.76 |
| 2 | 522.71 |
| 3 | 517.69 |
| 4 | 517.69 |
| 5 | 514.69 |
xychart-beta
    title "Default settings"
    x-axis [1,2,3,4,5]
    y-axis "Startup time [s]" 510 --> 550
    bar [542.76,522.71,517.69,517.69,514.69]
    line [542.76,522.71,517.69,517.69,514.69]

On average it is 523.11s.

Default settings with --dtype

The following setting is added to the previous script:

--dtype half

This should cut memory usage in half, and since vLLM preallocates memory during startup and copies the entire model while converting it, it should make startup faster.

| No. | Time [s] |
| --- | -------- |
| 1 | 376.52 |
| 2 | 375.54 |
| 3 | 377.58 |
| 4 | 379.56 |
| 5 | 364.59 |
xychart-beta
    title "Data type: half"
    x-axis [1,2,3,4,5]
    y-axis "Startup time [s]" 350 --> 390
    bar [376.52,375.54,377.58,379.56,364.59]
    line [376.52,375.54,377.58,379.56,364.59]

On average it is 374.76s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Data type half startup gain"
    x-axis config [default,dtype=half]
    y-axis "Startup time [s]" 0 --> 600
    bar [523,375]

Gain is about 28%.

Pre-downloaded model

Downloading the model every time is not a good solution, even if you have an HF_TOKEN (or a similar way of authorizing against a different model source). It takes a lot of time for bigger models, far more than loading them from disk. The altered script mounts the Hugging Face cache volume into the vLLM container.
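If the model is not in the local cache yet, it can be fetched once up front, for example with the Hugging Face CLI (a sketch; any other method that fills ~/.cache/huggingface works just as well):

# one-time download into ~/.cache/huggingface, reused by the volume mount below
pip install -U huggingface_hub
huggingface-cli download cyankiwi/Qwen3.5-2B-AWQ-4bit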

All further tests will be done with the pre-downloaded model, and they will be compared to the average startup time of vLLM with the settings from this script.

This script mounts the Hugging Face cache directory into the container:

-v ~/.cache/huggingface:/root/.cache/huggingface
| No. | Time [s] |
| --- | -------- |
| 1 | 488.67 |
| 2 | 487.66 |
| 3 | 488.68 |
| 4 | 484.67 |
| 5 | 485.68 |
| 6 | 488.66 |
| 7 | 488.69 |
| 8 | 485.66 |
| 9 | 486.68 |
| 10 | 488.66 |
| 11 | 490.69 |
| 12 | 487.70 |
| 13 | 490.69 |
| 14 | 489.67 |
| 15 | 490.66 |
| 16 | 490.67 |
| 17 | 486.67 |
| 18 | 485.65 |
| 19 | 487.68 |
| 20 | 491.73 |
xychart-beta
    title "Pre-downloaded model"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 480 --> 495
    bar [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]
    line [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]

On average it is 488.28s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Pre-downloaded model startup gain"
    x-axis config [default,pre-download]
    y-axis "Startup time [s]" 0 --> 600
    bar [523,488]

Gain is about 7%.

Data type half

The altered script adds --dtype half to the script with the pre-downloaded model. From now on, every script will treat the pre-downloaded model script as the base for comparisons; I do not want to call the Hugging Face API every time I run a test.

| No. | Time [s] |
| --- | -------- |
| 1 | 333.48 |
| 2 | 329.50 |
| 3 | 335.51 |
| 4 | 328.48 |
| 5 | 331.49 |
| 6 | 330.50 |
| 7 | 332.52 |
| 8 | 335.48 |
| 9 | 330.51 |
| 10 | 341.51 |
| 11 | 336.51 |
| 12 | 334.51 |
| 13 | 332.49 |
| 14 | 330.51 |
| 15 | 329.49 |
| 16 | 333.51 |
| 17 | 332.52 |
| 18 | 342.52 |
| 19 | 332.50 |
| 20 | 338.50 |
xychart-beta
    title "Data type 'half'"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 320 --> 350
    bar [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]
    line [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]

On average it is 333.60s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,dtype=half]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,334]

Gain is about 32%.

vLLM cache

vLLM caches a few things to disk during startup, so if you run it in a Docker container, the cache is lost because the container is destroyed and recreated every time. To remedy this, the cache directory needs to be mounted between reruns.

-v /root/.cache/vllm:/root/.cache/vllm
| No. | Time [s] |
| --- | -------- |
| 1 | 368.49 |
| 2 | 364.49 |
| 3 | 376.51 |
| 4 | 356.48 |
| 5 | 355.47 |
| 6 | 362.46 |
| 7 | 362.47 |
| 8 | 366.48 |
| 9 | 353.46 |
| 10 | 366.46 |
| 11 | 365.45 |
| 12 | 365.45 |
| 13 | 364.45 |
| 14 | 364.45 |
| 15 | 363.46 |
| 16 | 361.45 |
| 17 | 363.46 |
| 18 | 358.44 |
| 19 | 364.44 |
| 20 | 360.44 |
xychart-beta
    title "vLLM cache"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 350 --> 380
    bar [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]
    line [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]

The average startup time with the vLLM cache is 392.92s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,vllm-cache]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,393]

Gain is about 19%.

Small context

vLLM assigns GPU memory ahead of time for the context and KV cache during startup. Requesting a smaller context for your model should, in theory, speed things up at least a bit.

This script adds the max-model-len setting to the pre-downloaded model script:

--max-model-len 2k
| No. | Time [s] |
| --- | -------- |
| 1 | 488.69 |
| 2 | 482.68 |
| 3 | 489.68 |
| 4 | 488.68 |
| 5 | 478.68 |
| 6 | 486.67 |
| 7 | 478.68 |
| 8 | 487.69 |
| 9 | 486.72 |
| 10 | 490.67 |
| 11 | 477.68 |
| 12 | 489.69 |
| 13 | 487.69 |
| 14 | 487.68 |
| 15 | 488.70 |
| 16 | 489.69 |
| 17 | 488.75 |
| 18 | 481.69 |
| 19 | 491.68 |
| 20 | 480.66 |
xychart-beta
    title "Small context startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 495
    bar [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]
    line [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]

The average startup time with a small context is 486.14s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,small-context]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,486]

Gain is less than 0.5%.

With model tools

If you plan to use your models as part of an agent or to run an assistant, tools are necessary. Just out of curiosity, I wanted to test whether these settings have any noticeable impact on vLLM startup time.

The following settings were added to the pre-downloaded model script:

--reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder
| No. | Time [s] |
| --- | -------- |
| 1 | 486.65 |
| 2 | 491.69 |
| 3 | 491.67 |
| 4 | 486.67 |
| 5 | 485.65 |
| 6 | 489.68 |
| 7 | 486.66 |
| 8 | 487.67 |
| 9 | 495.73 |
| 10 | 495.69 |
| 11 | 478.67 |
| 12 | 481.67 |
| 13 | 490.71 |
| 14 | 494.73 |
| 15 | 490.73 |
| 16 | 482.72 |
| 17 | 487.70 |
| 18 | 497.67 |
| 19 | 487.64 |
| 20 | 494.67 |
xychart-beta
    title "Tools startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 510
    bar [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]
    line [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]

The average startup time with tools is 489.23s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,tools]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,489]

Using tools has a negative impact on startup time of less than 0.2%. Basically a rounding error.

Quantized kv cache

One of the problems when running inference is KV cache size. One way to remedy this is to quantize the cache so it takes less space. It should not have a direct impact on startup time, but if the KV cache is smaller, the engine has less space to allocate and may start sooner.

This script adds the following setting on top of the pre-downloaded model script:

--kv-cache-dtype fp8_e4m3
| No. | Time [s] |
| --- | -------- |
| 1 | 496.71 |
| 2 | 490.72 |
| 3 | 492.73 |
| 4 | 493.70 |
| 5 | 482.72 |
| 6 | 476.71 |
| 7 | 488.72 |
| 8 | 492.74 |
| 9 | 487.73 |
| 10 | 491.75 |
| 11 | 490.70 |
| 12 | 491.71 |
| 13 | 481.71 |
| 14 | 491.73 |
| 15 | 497.69 |
| 16 | 495.70 |
| 17 | 487.68 |
| 18 | 490.68 |
| 19 | 492.70 |
| 20 | 490.70 |
xychart-beta
    title "Quantized KV cache startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 500
    bar [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]
    line [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]

The average startup time with a quantized KV cache is 490.26s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,kv-cache-fp8]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,490]

Gain is -0.4%.

Disabled optimization

vLLM has 3 optimization levels that can be used to optimize either startup or inference speed. To optimize startup, we can set it to 0, which should be the quickest way to get the engine ready to respond to queries. This script adds the following line to the pre-downloaded model script:

--optimization-level 0
| No. | Time [s] |
| --- | -------- |
| 1 | 434.60 |
| 2 | 433.59 |
| 3 | 433.59 |
| 4 | 434.58 |
| 5 | 433.58 |
| 6 | 436.59 |
| 7 | 433.59 |
| 8 | 434.58 |
| 9 | 435.59 |
| 10 | 434.60 |
| 11 | 431.58 |
| 12 | 433.57 |
| 13 | 423.58 |
| 14 | 417.57 |
| 15 | 423.59 |
| 16 | 423.58 |
| 17 | 424.60 |
| 18 | 419.58 |
| 19 | 432.60 |
| 20 | 431.58 |
xychart-beta
    title "Optimization 0 startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 415 --> 440
    bar [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]
    line [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]

The average startup time with optimization disabled is 430.33s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,optimization-0]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,430]

Gain is around 12%.

Enable prefix cache

Prefix caching reuses computed KV cache blocks for shared prompt prefixes; it mostly matters at runtime, but I wanted to check whether enabling it affects startup. This script adds the following line to the pre-downloaded model script:

--enable-prefix-caching
| No. | Time [s] |
| --- | -------- |
| 1 | 479.66 |
| 2 | 474.67 |
| 3 | 478.67 |
| 4 | 487.68 |
| 5 | 487.68 |
| 6 | 477.67 |
| 7 | 487.72 |
| 8 | 488.69 |
| 9 | 488.67 |
| 10 | 487.69 |
| 11 | 485.68 |
| 12 | 488.69 |
| 13 | 479.69 |
| 14 | 487.68 |
| 15 | 485.66 |
| 16 | 487.66 |
| 17 | 477.68 |
| 18 | 490.69 |
| 19 | 476.67 |
| 20 | 487.67 |
xychart-beta
    title "Prefix cache startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 470 --> 495
    bar [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]
    line [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]

The average startup time with prefix caching is 484.33s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,prefix-caching]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,484]

Gain is around 0.8%.

Enforce eager

This setting decides whether CUDA graphs should be disabled. If they are disabled, the graphs won't be captured during startup, which should speed it up. This script adds the following line to the pre-downloaded model script:

 --enforce-eager
| No. | Time [s] |
| --- | -------- |
| 1 | 426.60 |
| 2 | 443.60 |
| 3 | 435.59 |
| 4 | 472.63 |
| 5 | 430.57 |
| 6 | 432.60 |
| 7 | 430.57 |
| 8 | 423.59 |
| 9 | 432.60 |
| 10 | 434.58 |
| 11 | 431.59 |
| 12 | 430.59 |
| 13 | 435.58 |
| 14 | 419.57 |
| 15 | 426.62 |
| 16 | 434.58 |
| 17 | 441.60 |
| 18 | 437.59 |
| 19 | 431.58 |
| 20 | 435.59 |
xychart-beta
    title "Enforce eager startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 420 --> 475
    bar [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]
    line [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]

The average startup time with enforce eager is 434.39s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,enforce-eager]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,434]

Gain is around 11%.

Lazy safetensors loading

This setting enforces lazy loading of safetensors. It may be helpful if you have the model stored on fast disk storage instead of, for example, network storage with high latency. This script adds the following line to the pre-downloaded model script:

--safetensors-load-strategy=lazy
| No. | Time [s] |
| --- | -------- |
| 1 | 489.68 |
| 2 | 493.70 |
| 3 | 496.70 |
| 4 | 488.69 |
| 5 | 489.69 |
| 6 | 488.68 |
| 7 | 482.69 |
| 8 | 489.70 |
| 9 | 484.69 |
| 10 | 486.69 |
| 11 | 489.69 |
| 12 | 487.68 |
| 13 | 482.69 |
| 14 | 489.69 |
| 15 | 487.67 |
| 16 | 485.68 |
| 17 | 487.67 |
| 18 | 486.69 |
| 19 | 487.70 |
| 20 | 489.67 |
xychart-beta
    title "Lazy safetensors load startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 480 --> 500
    bar [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]
    line [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]

The average startup time with lazy safetensors loading is 488.29s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,lazy-safetensors]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,488]

There is no gain.

Interactivity

vLLM can be optimized either for handling many parallel requests or for single-user interactivity with a single request. This script tests the impact of that setting on startup time by adding the following line to the pre-downloaded model script:

--performance-mode interactivity
| No. | Time [s] |
| --- | -------- |
| 1 | 481.70 |
| 2 | 479.70 |
| 3 | 480.68 |
| 4 | 491.68 |
| 5 | 491.69 |
| 6 | 482.69 |
| 7 | 489.69 |
| 8 | 490.73 |
| 9 | 484.71 |
| 10 | 483.69 |
| 11 | 491.70 |
| 12 | 492.70 |
| 13 | 492.69 |
| 14 | 481.69 |
| 15 | 481.68 |
| 16 | 488.69 |
| 17 | 493.71 |
| 18 | 481.67 |
| 19 | 494.70 |
| 20 | 495.70 |
xychart-beta
    title "Interactivity startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 500
    bar [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]
    line [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]

The average startup time in interactivity mode is 487.59s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,interactivity]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,488]

There is basically no gain or loss, which is good: it means we can optimize for single-user usage without compromising startup time.

Speculative decoding MTP

Qwen models have a built-in MTP (multi-token prediction) speculative decoding method. This should speed up inference, but the impact on startup time is expected to be negative. This script adds the following configuration to the existing pre-downloaded model script:

--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'
| No. | Time [s] |
| --- | -------- |
| 1 | 522.76 |
| 2 | 518.74 |
| 3 | 510.73 |
| 4 | 518.75 |
| 5 | 521.75 |
| 6 | 517.73 |
| 7 | 531.74 |
| 8 | 526.74 |
| 9 | 511.72 |
| 10 | 526.76 |
| 11 | 519.70 |
| 12 | 521.75 |
| 13 | 521.76 |
| 14 | 524.74 |
| 15 | 530.75 |
| 16 | 516.73 |
| 17 | 521.74 |
| 18 | 515.75 |
| 19 | 536.75 |
| 20 | 522.74 |
xychart-beta
    title "Speculative decoding startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 510 --> 540
    bar [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]
    line [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]

The average startup time with speculative decoding is 521.99s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download,speculative-decoding]
    y-axis "Startup time [s]" 0 --> 525
    bar [488,522]

Gain is negative, about -7%.

Summary

Here is a comparison table of all the settings that were tested and their impact on startup time, GPU memory requirements, and throughput.

| Settings | Time [s] | Gain [%] | VRAM [%] | VRAM [GB] | T/s |
| --- | --- | --- | --- | --- | --- |
| default | 523.11 | n/a | 11.5 | 13.8 | 294.32 |
| pre-download | 488.28 | 7 | 11.5 | 13.8 | 296.86 |
| dtype half | 333.60 | 32 | 11.5 | 13.8 | 321.32 |
| vllm cache | 392.92 | 19 | 11.5 | 13.8 | 290.77 |
| small context | 486.14 | 0.5 | 11.5 | 13.8 | 295.68 |
| tools | 489.23 | -0.2 | 11.5 | 13.8 | 291.07 |
| kv-cache-dtype | 490.26 | -0.4 | 11.5 | 13.8 | 268.98 |
| no optimization | 430.33 | 12 | 11.5 | 13.8 | 335.89 |
| prefix cache | 484.33 | 0.8 | 11.5 | 13.8 | 293.79 |
| enforce eager | 434.39 | 11 | 11.5 | 13.8 | 335.12 |
| lazy safetensor loading | 488.29 | 0 | 11.5 | 13.8 | 288.38 |
| interactivity | 487.59 | 0 | 11.5 | 14.04 | 291.89 |
| speculative decoding | 521.99 | -7 | 16.7 | 20.04 | 266.91 |

From the table you can see that the lazy safetensors loading strategy is not worth using.

Speculative decoding hardly seems worth the slower startup and lower T/s. It does cut Time To First Token (TTFT) a bit, from:

---------------Time to First Token----------------
Mean TTFT (ms):                          418.84
Median TTFT (ms):                        112.23
P99 TTFT (ms):                           2939.18
-----Time per Output Token (excl. 1st token)------

to:

---------------Time to First Token----------------
Mean TTFT (ms):                          395.08
Median TTFT (ms):                        186.63
P99 TTFT (ms):                           2305.16
-----Time per Output Token (excl. 1st token)------

But it still does not seem to be worth all that trouble.
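For context, TTFT and tokens-per-second numbers like the ones above can be collected with vLLM's built-in serving benchmark against an already running server; a rough sketch (the dataset and prompt count are assumptions, not necessarily what was used for the figures here):

vllm bench serve \
    --base-url http://localhost:8000 \
    --model cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --dataset-name random \
    --num-prompts 100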

Tools slow things down but are necessary for agentic usage.

Quantization of the KV cache does not seem to be worth it either.

What is worth doing is setting --dtype half, as it cuts startup time considerably and speeds up inference.

Caching vLLM files also seems to be a quick and easy win.

The biggest surprise to me was --optimization-level 0: it seems to speed up inference and cut startup time. By default --optimization-level has a value of 2, so it is a bit counterintuitive that disabling optimization speeds things up at runtime instead of slowing throughput down.

Setting up prefix caching and disabling CUDA graphs also seem to be no-brainers, since they do not slow down inference, use the same amount of memory, and have a positive impact on startup time.

In essence, if you want to optimize for vLLM startup time, you should use the following parameters:

# -v /root/.cache/vllm: mount the vLLM cache so compiled artifacts are not recalculated every time (~19% faster startup)
# -v ~/.cache/huggingface: mount the model cache to avoid downloading the same model every time
# --dtype half: cuts startup time by about 32%
# --optimization-level 0: disables optimization for a ~12% startup gain
# --enforce-eager: disables CUDA graphs, cutting startup by ~11%
# --enable-prefix-caching: optional; the ~0.8% gain is small, but there were no drawbacks
docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /root/.cache/vllm:/root/.cache/vllm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115 \
    --dtype half \
    --optimization-level 0 \
    --enforce-eager \
    --enable-prefix-caching

The rest of the parameters either have significant negative impacts at runtime or no impact at all.

Final words

Keep in mind that these results might be specific to the particular model I tested. A different model may behave differently, and the results may also vary between versions of the vLLM engine. I tested on version 0.19.2rc1.dev205+g07351e088.
