Optimizing vLLM startup time

vLLM has a very slow time to first token (TTFT). On AMD Ryzen AI Max+ 395 it can take up to 10 minutes, depending on the exact settings and model size.

To improve my own experience of using these tools, I need a comparison of llama.cpp and vLLM startup time and performance. But before doing that I needed to optimize vLLM startup, since it is even slower to start than it was with v0.17.

Performance and Startup Time optimizations

Going through the vLLM documentation, tutorials and other articles, I was able to come up with possible ways of optimizing startup and token generation for my specific usage.

For example, these settings may have an impact on startup time:

  • pre-download the model (of course!)
  • vLLM cache sharing
  • size of context
  • data type size
  • enforce eager switch
  • enable prefix caching switch
  • optimization level
  • safetensors loading strategy

On the other hand, these settings may or may not have an impact on token generation speed:

  • performance mode
  • optimization level
  • enable prefix caching switch

Some of these overlap, which makes sense: if you do extra work during startup it makes startup slower, but you do not have to do that work at runtime, where it would slow down actual usage. A rough mapping of these settings to the actual flags is sketched below.
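For orientation, here is a rough mapping of the settings above to the vLLM and docker flags that show up later in this post (only a sketch; flag names can differ between vLLM versions):

--dtype half                                        # data type size
--max-model-len 2k                                  # size of context
--enforce-eager                                     # enforce eager switch
--enable-prefix-caching                             # enable prefix caching switch
--optimization-level 0                              # optimization level
--safetensors-load-strategy=lazy                    # safetensors loading strategy
--performance-mode interactivity                    # performance mode
-v ~/.cache/huggingface:/root/.cache/huggingface    # pre-downloaded model (docker mount)
-v /root/.cache/vllm:/root/.cache/vllm              # vLLM cache sharing (docker mount)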

Testing methodology

I created the following test script for running my model:

docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115

Depending on which switch/argument I am testing, new lines are added. For example, if I want to test the impact of the speculative decoding configuration, I add the following line to the above script:

--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'

Sometimes 11.5% of GPU memory allocation was too little to start the vLLM engine with some specific settings (e.g. speculative decoding needs a bit more), so it was increased slightly, to 13% or 15%.
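For those runs only the memory line in the script changes, for example:

--gpu-memory-utilization 0.15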

To test how long it takes for vLLM to start up, I am using the following Python script:

#!/usr/bin/env python3
"""
Script to start vLLM via bash script and measure startup time until /health endpoint responds.
"""

import subprocess
import requests
import time
import sys
import os
from typing import Optional

CONTAINER_NAME = 'vllm-startup-test'


def get_spinner_frames():
    """Return spinner animation frames."""
    return ['⣾', '⣽', '⣻', '⢿', '⡿', '⣟', '⣯', '⣷']


def start_vllm_server(bash_script_path: str, **kwargs) -> subprocess.Popen:
    """
    Start vLLM server using external bash script.
    """
    if not os.path.exists(bash_script_path):
        raise FileNotFoundError(f"Bash script not found: {bash_script_path}")

    cmd = ['/bin/bash', bash_script_path, CONTAINER_NAME]

    for key, value in kwargs.items():
        cmd.extend([f"--{key}", str(value)])

    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1
    )

    return process


def wait_for_health_endpoint(
        base_url: str,
        process: subprocess.Popen,
        endpoint: str = "/health",
        polling_interval: float = 1.0
) -> Optional[float]:
    """
    Wait for the vLLM /health endpoint to respond with spinner animation.
    """
    full_url = f"{base_url}{endpoint}"
    start_time = time.time()

    spinner_frames = get_spinner_frames()
    frame_index = 0
    current_line_length = 0
    timeout = 60 * 20

    try:
        while time.time() - start_time < timeout:
            # Check if process has exited with an error
            process_returncode = process.poll()
            if process_returncode is not None:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                if process_returncode != 0:
                    print(f"\r✗ vLLM server process exited with error code: {process_returncode}", flush=True)
                    raise RuntimeError(f"vLLM server process failed with exit code {process_returncode}")
                else:
                    print(f"\r✓ vLLM server process exited normally", flush=True)
                    raise RuntimeError("vLLM server process exited before health check")

            # Print spinner animation (overwrite same line)
            spinner_frame = spinner_frames[frame_index]
            status_msg = f"\r⏳ Waiting for {full_url}... "
            print(f"{spinner_frame}{status_msg}", end='', flush=True)

            # Clear any previous status text on this line
            current_line_length = len(f"{spinner_frame}{status_msg}")

            try:
                response = requests.get(full_url, timeout=5)
                if response.status_code == 200:
                    # Clear the spinner line
                    print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                    elapsed = time.time() - start_time
                    return elapsed
                else:
                    print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                    print(f"\r✗ Health check failed with status code: {response.status_code}", flush=True)
                    print(f"  Response: {response.text[:200]}", flush=True)
            except requests.exceptions.ConnectionError:
                pass  # Spinner continues
            except requests.exceptions.Timeout:
                pass  # Spinner continues
            except requests.exceptions.RequestException as e:
                print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
                print(f"\r⚠ Health check error: {e}", flush=True)

            frame_index = (frame_index + 1) % len(spinner_frames)
            time.sleep(polling_interval)

        # Timeout
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)
        elapsed = time.time() - start_time
        print(f"\r✗ Timeout waiting for health endpoint after {elapsed:.2f} seconds", flush=True)
        return None

    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...", flush=True)
        raise
    finally:
        # Clear spinner line on exit
        print(f"\r{' ' * (current_line_length + 50)}", end='', flush=True)


def measure_vllm_startup(
        bash_script_path: str,
        **kwargs
) -> float:
    """
    Measure vLLM startup time from script start to health endpoint response.
    """
    base_url = f"http://localhost:{8000}"

    start_time = time.time()

    process = start_vllm_server(bash_script_path, **kwargs)

    try:
        startup_time = wait_for_health_endpoint(base_url, process)

        total_time = time.time() - start_time

        if startup_time is not None:
            print(f"\rTotal elapsed time: {total_time:.2f} seconds")
        else:
            print(f"\n{'='*60}")
            print(f"vLLM startup timed out!")
            print(f"  Total elapsed time: {total_time:.2f} seconds")
            print(f"{'='*60}\n")
            raise TimeoutError("vLLM startup timeout")

        return startup_time

    except KeyboardInterrupt:
        print("\nReceived interrupt signal, terminating server...")
        process.terminate()
        raise
    finally:
        if process.poll() is None:
            process.terminate()
            process.wait()

            # stop docker container
            process = subprocess.Popen(
                ['docker', 'stop', CONTAINER_NAME],
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                bufsize=1
            )

            process.wait()


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(
        description="Measure vLLM startup time via health endpoint"
    )
    parser.add_argument(
        "bash_script",
        help="Path to the bash script that starts vLLM"
    )
    parser.add_argument(
        "--args",
        nargs="*",
        help="Additional arguments to pass to vLLM server"
    )

    args = parser.parse_args()

    kwargs = {}
    if args.args:
        for arg in args.args:
            if "=" in arg:
                key, value = arg.split("=", 1)
                kwargs[key] = value
            else:
                kwargs[arg] = True

    try:
        startup_time = measure_vllm_startup(
            bash_script_path=args.bash_script,
            **kwargs
        )
        sys.exit(0)
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

It is pretty self-explanatory, but here is a quick summary (an example invocation follows the list):

  • it runs the previous bash script
  • it makes a note of the time before the bash script was executed
  • it waits for the bash script to finish with an error
  • or it waits for vLLM to start responding on the /health endpoint
  • when it responds, it calculates the startup time
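A minimal example invocation, assuming the docker launcher above is saved as run_vllm.sh and the Python script as measure_startup.py (both file names are mine, just for illustration):

python3 measure_startup.py ./run_vllm.sh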

The startup time of the bash script without any specific settings or caches will be our baseline for comparing the other scripts.

vLLM version 0.19.2rc1.dev205+g07351e088 was used for testing.

Results

Below is a description of each tested setting and its impact on vLLM startup time.

Default settings

No specific settings besides the bare minimum for vLLM to run Qwen3.5-2B-AWQ-4bit.

docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115

No. | Time [s]
1 | 542.76
2 | 522.71
3 | 517.69
4 | 517.69
5 | 514.69

xychart-beta
    title "Default settings"
    x-axis [1,2,3,4,5]
    y-axis "Startup time [s]" 510 --> 550
    bar [542.76,522.71,517.69,517.69,514.69]
    line [542.76,522.71,517.69,517.69,514.69]

On average it is 523.11s.

Default settings with --dtype

Added setting to the previous script:

--dtype half

This should cut memory usage in half, and since vLLM preallocates memory during startup, copying the entire model with conversion, it should make startup faster.

No. | Time [s]
1 | 376.52
2 | 375.54
3 | 377.58
4 | 379.56
5 | 364.59

xychart-beta
    title "Data type: half"
    x-axis [1,2,3,4,5]
    y-axis "Startup time [s]" 350 --> 390
    bar [376.52,375.54,377.58,379.56,364.59]
    line [376.52,375.54,377.58,379.56,364.59]

On average it is 374.76s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Data type half startup gain"
    x-axis config [default,dtype=half]
    y-axis "Startup time [s]" 0 --> 600
    bar [523,375]

Gain is about 28%.

Pre-downloaded model

Downloading the model every time is not a good solution, even if you have HF_TOKEN (or a similar way of authorizing against a different model source). For bigger models it takes a lot of time, much more than loading the model from disk. The altered script mounts the Hugging Face cache volume into the vLLM container.
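If the model is not in the local cache yet, it can be fetched once ahead of time, for example with huggingface-cli (just a sketch; any method that populates ~/.cache/huggingface works):

huggingface-cli download cyankiwi/Qwen3.5-2B-AWQ-4bit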

All further tests will be done with the pre-downloaded model, and they will be compared against the average startup time of vLLM with the settings from this script.

This script mounts the Hugging Face cache directory into the container:

-v ~/.cache/huggingface:/root/.cache/huggingface

No. | Time [s]
1 | 488.67
2 | 487.66
3 | 488.68
4 | 484.67
5 | 485.68
6 | 488.66
7 | 488.69
8 | 485.66
9 | 486.68
10 | 488.66
11 | 490.69
12 | 487.70
13 | 490.69
14 | 489.67
15 | 490.66
16 | 490.67
17 | 486.67
18 | 485.65
19 | 487.68
20 | 491.73

xychart-beta
    title "Pre-downloaded model"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 480 --> 495
    bar [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]
    line [488.67,487.66,488.68,484.67,485.68,488.66,488.69,485.66,486.68,488.66,490.69,487.70,490.69,489.67,490.66,490.67,486.67,485.65,487.68,491.73]

On average it is 488.28s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Pre-downloaded model startup gain"
    x-axis config [default, pre-download]
    y-axis "Startup time [s]" 0 --> 600
    bar [523,488]

Gain is about 7%.

Data type half

The altered script adds --dtype half to the script with the pre-downloaded model. From now on, every script will treat the pre-downloaded model script as the base for comparisons; I do not want to call the Hugging Face API every time I run a test.

No. | Time [s]
1 | 333.48
2 | 329.50
3 | 335.51
4 | 328.48
5 | 331.49
6 | 330.50
7 | 332.52
8 | 335.48
9 | 330.51
10 | 341.51
11 | 336.51
12 | 334.51
13 | 332.49
14 | 330.51
15 | 329.49
16 | 333.51
17 | 332.52
18 | 342.52
19 | 332.50
20 | 338.50

xychart-beta
    title "Data type 'half'"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 320 --> 350
    bar [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]
    line [333.48,329.50,335.51,328.48,331.49,330.50,332.52,335.48,330.51,341.51,336.51,334.51,332.49,330.51,329.49,333.51,332.52,342.52,332.50,338.50]

On average it is 333.60s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, dtype=half]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,334]

Gain is about 32%.

vLLM cache

vLLM caches a few things to disk during startup, so if you are running it via a docker container they get lost, because the container is destroyed and recreated every time. To remedy this, the cache directory needs to be mounted between reruns.

-v /root/.cache/vllm:/root/.cache/vllm

No. | Time [s]
1 | 368.49
2 | 364.49
3 | 376.51
4 | 356.48
5 | 355.47
6 | 362.46
7 | 362.47
8 | 366.48
9 | 353.46
10 | 366.46
11 | 365.45
12 | 365.45
13 | 364.45
14 | 364.45
15 | 363.46
16 | 361.45
17 | 363.46
18 | 358.44
19 | 364.44
20 | 360.44

xychart-beta
    title "vLLM cache"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 350 --> 380
    bar [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]
    line [368.49,364.49,376.51,356.48,355.47,362.46,362.47,366.48,353.46,366.46,365.45,365.45,364.45,364.45,363.46,361.45,363.46,358.44,364.44,360.44]

Average startup time with the vLLM cache is 392.92s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, vllm-cache]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,393]

Gain is about 19%.

Small context

vLLM assigns GPU memory for the context and KV space ahead of time, during startup. Requesting a smaller context for your model should, in theory, speed things up, at least a bit.

This script adds the max-model-len setting to the pre-downloaded model script:

--max-model-len 2k

No. | Time [s]
1 | 488.69
2 | 482.68
3 | 489.68
4 | 488.68
5 | 478.68
6 | 486.67
7 | 478.68
8 | 487.69
9 | 486.72
10 | 490.67
11 | 477.68
12 | 489.69
13 | 487.69
14 | 487.68
15 | 488.70
16 | 489.69
17 | 488.75
18 | 481.69
19 | 491.68
20 | 480.66

xychart-beta
    title "Small context startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 495
    bar [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]
    line [488.69,482.68,489.68,488.68,478.68,486.67,478.68,487.69,486.72,490.67,477.68,489.69,487.69,487.68,488.70,489.69,488.75,481.69,491.68,480.66]

Average startup time with a small context is 486.14s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, small-context]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,486]

Gain is less than 0.5%.

With model tools

If you plan to use your models as part of an agent or to run an assistant, tools are necessary. Just out of curiosity, I wanted to test whether those settings have any noticeable impact on vLLM startup time.

The following settings were added to the pre-downloaded model script:

--reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

No. | Time [s]
1 | 486.65
2 | 491.69
3 | 491.67
4 | 486.67
5 | 485.65
6 | 489.68
7 | 486.66
8 | 487.67
9 | 495.73
10 | 495.69
11 | 478.67
12 | 481.67
13 | 490.71
14 | 494.73
15 | 490.73
16 | 482.72
17 | 487.70
18 | 497.67
19 | 487.64
20 | 494.67

xychart-beta
    title "Tools startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 510
    bar [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]
    line [486.65,491.69,491.67,486.67,485.65,489.68,486.66,487.67,495.73,495.69,478.67,481.67,490.71,494.73,490.73,482.72,487.70,497.67,487.64,494.67]

Average startup time with tools is 489.23s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, tools]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,489]

Using tools has a negative impact on startup time of less than 0.2%. Basically a rounding error.

Quantized kv cache

One of the problems when running inference is KV cache size. One way to remedy this is to quantize the KV cache so it takes less space. This should not impact startup time directly, but if the KV cache is smaller the engine has less space to allocate, so it may start sooner.

This script adds the following setting on top of the pre-downloaded model script:

--kv-cache-dtype fp8_e4m3

No. | Time [s]
1 | 496.71
2 | 490.72
3 | 492.73
4 | 493.70
5 | 482.72
6 | 476.71
7 | 488.72
8 | 492.74
9 | 487.73
10 | 491.75
11 | 490.70
12 | 491.71
13 | 481.71
14 | 491.73
15 | 497.69
16 | 495.70
17 | 487.68
18 | 490.68
19 | 492.70
20 | 490.70

xychart-beta
    title "Quantized KV cache startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 500
    bar [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]
    line [496.71,490.72,492.73,493.70,482.72,476.71,488.72,492.74,487.73,491.75,490.70,491.71,481.71,491.73,497.69,495.70,487.68,490.68,492.70,490.70]

Average startup time with the quantized KV cache is 490.26s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, quantized-kv-cache]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,490]

Gain is -0.4%.

Pre-download disable optimization

vLLM has several optimization levels that can be used to optimize either startup or inference speed. To optimize startup we can set the level to 0, which should be the quickest way to get the engine ready to respond to queries. This script adds the following line to the pre-downloaded model script:

--optimization-level 0

No. | Time [s]
1 | 434.60
2 | 433.59
3 | 433.59
4 | 434.58
5 | 433.58
6 | 436.59
7 | 433.59
8 | 434.58
9 | 435.59
10 | 434.60
11 | 431.58
12 | 433.57
13 | 423.58
14 | 417.57
15 | 423.59
16 | 423.58
17 | 424.60
18 | 419.58
19 | 432.60
20 | 431.58

xychart-beta
    title "Optimization 0 startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 415 --> 440
    bar [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]
    line [434.60,433.59,433.59,434.58,433.58,436.59,433.59,434.58,435.59,434.60,431.58,433.57,423.58,417.57,423.59,423.58,424.60,419.58,432.60,431.58]

Average startup time with disabled optimization is 430.33s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, optimization-0]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,430]

Gain is around 12%.

Enable prefix cache

This script adds the following line to the pre-downloaded model script:

--enable-prefix-caching

No. | Time [s]
1 | 479.66
2 | 474.67
3 | 478.67
4 | 487.68
5 | 487.68
6 | 477.67
7 | 487.72
8 | 488.69
9 | 488.67
10 | 487.69
11 | 485.68
12 | 488.69
13 | 479.69
14 | 487.68
15 | 485.66
16 | 487.66
17 | 477.68
18 | 490.69
19 | 476.67
20 | 487.67

xychart-beta
    title "Prefix cache startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 470 --> 495
    bar [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]
    line [479.66,474.67,478.67,487.68,487.68,477.67,487.72,488.69,488.67,487.69,485.68,488.69,479.69,487.68,485.66,487.66,477.68,490.69,476.67,487.67]

Average startup time with prefix caching is 484.33s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, prefix-cache]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,484]

Gain is around 0.8%.

Enforce eager

This setting decides whether CUDA graphs should be disabled. If they are disabled, the graphs are not captured during startup, which should speed it up. This script adds the following line to the pre-downloaded model script:

 --enforce-eager

No. | Time [s]
1 | 426.60
2 | 443.60
3 | 435.59
4 | 472.63
5 | 430.57
6 | 432.60
7 | 430.57
8 | 423.59
9 | 432.60
10 | 434.58
11 | 431.59
12 | 430.59
13 | 435.58
14 | 419.57
15 | 426.62
16 | 434.58
17 | 441.60
18 | 437.59
19 | 431.58
20 | 435.59

xychart-beta
    title "Enforce eager startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 420 --> 475
    bar [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]
    line [426.60,443.60,435.59,472.63,430.57,432.60,430.57,423.59,432.60,434.58,431.59,430.59,435.58,419.57,426.62,434.58,441.60,437.59,431.58,435.59]

Average startup time with enforce eager is 434.39s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, enforce-eager]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,434]

Gain is around 11%.

Lazy safetensors loading

This setting enforces lazy loading of safetensors. It may be helpful if your model is stored on fast disk storage rather than, for example, network storage with high latency. This script adds the following line to the pre-downloaded model script:

--safetensors-load-strategy=lazy

No. | Time [s]
1 | 489.68
2 | 493.70
3 | 496.70
4 | 488.69
5 | 489.69
6 | 488.68
7 | 482.69
8 | 489.70
9 | 484.69
10 | 486.69
11 | 489.69
12 | 487.68
13 | 482.69
14 | 489.69
15 | 487.67
16 | 485.68
17 | 487.67
18 | 486.69
19 | 487.70
20 | 489.67

xychart-beta
    title "Lazy safetensors load startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 480 --> 500
    bar [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]
    line [489.68,493.70,496.70,488.69,489.69,488.68,482.69,489.70,484.69,486.69,489.69,487.68,482.69,489.69,487.67,485.68,487.67,486.69,487.70,489.67]

Average startup time with lazy safetensors loading is 488.29s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, lazy-safetensors]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,488]

There is no gain.

Interactivity

vLLM can be optimized either for handling more parallel requests or for single-user interactivity with a single request. This script tests the impact of that setting on startup time by adding the following line to the pre-downloaded model script:

--performance-mode interactivity

No. | Time [s]
1 | 481.70
2 | 479.70
3 | 480.68
4 | 491.68
5 | 491.69
6 | 482.69
7 | 489.69
8 | 490.73
9 | 484.71
10 | 483.69
11 | 491.70
12 | 492.70
13 | 492.69
14 | 481.69
15 | 481.68
16 | 488.69
17 | 493.71
18 | 481.67
19 | 494.70
20 | 495.70

xychart-beta
    title "Interactivity startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 475 --> 500
    bar [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]
    line [481.70,479.70,480.68,491.68,491.69,482.69,489.69,490.73,484.71,483.69,491.70,492.70,492.69,481.69,481.68,488.69,493.71,481.67,494.70,495.70]

Average startup time in interactivity mode is 487.59s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, interactivity]
    y-axis "Startup time [s]" 0 --> 500
    bar [488,488]

There is basically no gain, which is good, because it means we can optimize for single-user usage without compromising startup time.

Speculative decoding MTP

Qwen models have a built-in MTP speculative decoding method. This should speed up inference, but the impact on startup time is expected to be negative. This script adds the following configuration to the existing pre-downloaded model script:

--speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'

No. | Time [s]
1 | 522.76
2 | 518.74
3 | 510.73
4 | 518.75
5 | 521.75
6 | 517.73
7 | 531.74
8 | 526.74
9 | 511.72
10 | 526.76
11 | 519.70
12 | 521.75
13 | 521.76
14 | 524.74
15 | 530.75
16 | 516.73
17 | 521.74
18 | 515.75
19 | 536.75
20 | 522.74

xychart-beta
    title "Speculative decoding startup time"
    x-axis [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
    y-axis "Startup time [s]" 510 --> 540
    bar [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]
    line [522.76,518.74,510.73,518.75,521.75,517.73,531.74,526.74,511.72,526.76,519.70,521.75,521.76,524.74,530.75,516.73,521.74,515.75,536.75,522.74]

Average startup time with speculative decoding is 521.99s.

---
config:
    xyChart:
        showDataLabel: true
        showDataLabelOutsideBar: true
---

xychart-beta
    title "Startup gain"
    x-axis config [pre-download, speculative-decoding]
    y-axis "Startup time [s]" 0 --> 525
    bar [488,522]

The gain is negative, about 7%.

Summary

Here is a comparison table of all the settings that were tested and their impact on startup time and GPU memory requirements.

Settings                | Time [s] | Gain [%] | VRAM [%] | VRAM [GB] | T/s
default (baseline)      | 523.1    | -        | 11.5     | 13.8      | 294.32
pre-download            | 488.28   | 7        | 11.5     | 13.8      | 296.86
dtype half              | 333.60   | 32       | 11.5     | 13.8      | 321.32
vllm cache              | 392.92   | 19       | 11.5     | 13.8      | 290.77
small context           | 486.14   | 0.5      | 11.5     | 13.8      | 295.68
tools                   | 489.23   | -0.2     | 11.5     | 13.8      | 291.07
kv-cache-dtype          | 490.26   | -0.4     | 11.5     | 13.8      | 268.98
no optimization         | 430.33   | 12       | 11.5     | 13.8      | 335.89
prefix cache            | 484.33   | 0.8      | 11.5     | 13.8      | 293.79
enforce eager           | 434.39   | 11       | 11.5     | 13.8      | 335.12
lazy safetensor loading | 488.29   | 0        | 11.5     | 13.8      | 288.38
interactivity           | 487.59   | 0        | 11.5     | 14.04     | 291.89
speculative decoding    | 521.99   | -7       | 17       | 21.6      | 266.91

From the table you can see it is not worth using the lazy safetensors loading strategy.

Speculative decoding does not seem to be worth the slower start and slower T/s. It does cut Time To First Token (TTFT) a bit, from:

---------------Time to First Token----------------
Mean TTFT (ms):                          418.84
Median TTFT (ms):                        112.23
P99 TTFT (ms):                           2939.18
-----Time per Output Token (excl. 1st token)------

to:

---------------Time to First Token----------------
Mean TTFT (ms):                          395.08
Median TTFT (ms):                        186.63
P99 TTFT (ms):                           2305.16
-----Time per Output Token (excl. 1st token)------

But it still does not seem to be worth all that trouble.

Tools slow things down but are necessary for agentic usage.

Quantization of the KV cache does not seem to be worth it either.

What is worth doing is setting --dtype half, as it cuts startup time considerably and speeds up inference.

Caching vLLM files also seems to be a quick and easy win.

The biggest surprise to me was --optimization-level 0: it seems to speed up inference and cut startup time. By default the optimization level is 2, so it is a bit counterintuitive that disabling it speeds things up at runtime instead of slowing throughput down.

Setting up prefix caching and disabling CUDA graphs also seem to be no-brainers, since they do not slow down inference, use the same amount of memory and have a positive impact on startup time.

In essence, if you want to optimize for vLLM startup time you should use the following parameters:

# Comments are kept above the command because bash does not allow a comment after a line-continuation backslash.
# -v /root/.cache/vllm: mount the vLLM cache so the same files are not recomputed every time (about 19% faster startup)
# -v ~/.cache/huggingface: mount the model cache to avoid downloading the same model every time
# --dtype half: cuts startup time by about 32%
# --optimization-level 0: disable optimization for a 12% startup time gain
# --enforce-eager: disabling CUDA graphs cuts startup by about 11%
# --enable-prefix-caching: optional; 0.8% is not much of a gain, but keep it if there are no other drawbacks
docker run --rm --name="${1:-vllm-startup-test}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /root/.cache/vllm:/root/.cache/vllm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -p "${2:-8000}":8000 \
    --ipc=host \
    vllm/vllm-openai-rocm:nightly \
    cyankiwi/Qwen3.5-2B-AWQ-4bit \
    --gpu-memory-utilization 0.115 \
    --dtype half \
    --optimization-level 0 \
    --enforce-eager \
    --enable-prefix-caching

The rest of the parameters either have a significant negative impact at runtime or no impact at all.

Final words

Keep in mind that this might be specific to the particular model I tested; a different model may behave differently. It may also vary between versions of the vLLM engine. I tested it on version 0.19.2rc1.dev205+g07351e088.

My new hobby: Asking LLMS to generate ASCII Hamsters

I was playing over the weekend trying to write a POC of a C# interactive console. It is not ready yet, but I wanted to have some cute hamster as a logo for this project – I will probably incorporate it to some degree into Hamster Wheel, so a cute hamster seemed appropriate.

Just for fun I decided to use LLMs to generate it, since I may never finish this project and trying to do it myself would be a waste of time. The result would probably be terrible anyway, since I do not have any artistic skills.

My prompt was:

Generate ASCII Image of cute Hamster.

Pretty simple, since I was not expecting anything in particular and I was not even sure this would work.

GPT-5.5

Responded with:

  (\_/)
 ( •ᴥ• )
 / >

Which looked pretty good, even if it feels kind of unfinished? You could also argue whether this is a hamster or a rabbit.

As an alternative it proposed this:

   __     __
  /  \~~~/  \
 (    o o    )
  \   ᴥ    /
   /|     |\
  /_|_____|\ 

This one looks worse. Like a mouse or an elephant.

GPT-4o

Responded with:

  (\(\  
  ( -.-) 
  o_(")(")

That looks really nice, but I am not sure if the ‘o’ is a rabbit tail or a hand. No alternatives.

GPT-5.1

Responded with only one proposition:

     (\__/)
     (•ㅅ•)
    /   づ

          ___
        _/   \_
       /  ° °  \
      /   ▽     \
     /|  ===   |\ 
    /_/|_____| \_\
       /  |  \
      (_/   \_)

The first one does not look like a hamster. An owl maybe? But an owl does not have hands. The second image does not look like anything hamster-like. A robo-dog?

GPT-5.2

Responded with:

        ,_     _,
        |\\___//|
        |=o   o=|
        \=  Y  =/
         )==*==(
        /       \
        |       |
       /| |   | |\
       \|_|   |_|/
          \___/

Which is a very nice ASCII depiction of a cat! But not a hamster!

No other alternative images were generated.

GPT-5.3 Codex

Responded with:

   (\_/)
  ( •ᴗ• )
 / >🍪< \
/  ___  \
\_/   \_/
  /   \
 (_____)

I must say it does look nice with that cookie. But what is that thing on the bottom? Is it sitting on something?

Alternative (fluffier):

  .-"""-.
 /  .-.  \
|  (o o)  |
|   \_/   |
 \  ---  /
  `-._.-'
  /| |\
 (_| |_)
   /_\

Does not look like a hamster.

GPT-5.4

Responded with:

       (\___/)
      (=^.^=)
     o(_(")(")

   .-''''-.
  /  .--.  \
 /  /    \  \
|  |  ()  |  |
|  |      |  |   .-""""-.
|  |  __  |  |  /  _  _  \
 \  \(__)/  /  | (o)(o) |
  '._    _.'   |   __   |
     |  |      |  (__)  |
   .-'  '-.     \      /
  /  .--.  \     '.__.'
 /  /    \  \
|  |  __  |  |   Cute Hamster
|  | /  \ |  |
|  | \__/ |  |
 \  \____/  /
  '--------'

The image on top is the best small one so far. The thing on the bottom does not look cute, and it does not look like a hamster.

Alternative (no label):

   ()__()
  /      \ 
 |  •  •  |
 |   ᵕ    |
 /| \__/ |\
/_|      |_\
  / /  \ \
 (_/    \_)

Does not look like a hamster. A teddy bear?

GPT-5.4 mini

Responded with just this. No alternatives, no text.

   (\_/)
   ( •_•)
  / >🍪   cute hamster says hi!

I must say it does look nice.

GPT-5.4 nano

Responded with:

           _.-^^---....,,--
        _--                  --_
       <  cute hamster detected  >
        ~-._            _  _.-~
           / \-._.-`-._/ \/ \
          |   /  o   o  \  \  |
          |  |     .--.   |  | |
          |  |   \_/  \_/  |  | |
          |   \             /   |
           \   `-.___,-'   _/
             `-..____..-..-'
                 /  /\
                /__/  \

            (•‿•)  squeak!  (•‿•)

Looks terrible. ‘cute hamster detected’ gave me a chuckle though.

GPT-5.1 Codex

Responded with:

   (\_/)
  (='.'=)
  (")_(")
  /  ^  \
 (  ( )  )
  \  ~  /
   `---`

Does look like a hamster. Or a rabbit. No other images were generated.

GPT-5.1 Codex Max

Responded with:

 /)/)
( . .)
(  = )
(  ^ )

It does look like a rodent. A hamster would not be my first guess though, even if it looks nice.

Alternative:

  ,-,--.
 (  (  )\
 /)  )  |
/  _/ _/
\(_)_/ )
 (___)/

Looks like a bird.

GPT-5.1 Codex mini

Generated exactly the same response as GPT-5.1 Codex Max the first time, which is a bit strange. When regenerated it gave me this:

 (\_/) 
( •_•)
 / >🌰

The nut is a nice touch, though because of it I would guess it is a squirrel, not a hamster.

Alternative (fluffier):

   (\__/)
  (='.'=)
  (")_(")
 (  🐹  )

A hamster with a hamster between its feet. A bit bizarre. But it does look ok.

GPT-5.2 Codex

Responded with (after two failed attempts):

  (\_/)
  (•ᴥ•)
  /   \
 (_____)

A bare-minimum depiction of a hamster in ASCII.

Alternative (rounder):

  .-"""-.
 / .===. \
 \/ 6 6 \/
 ( \___/ )
___`---'___

I am glad it called it round, because I have no idea what that is.

GPT-5-Codex

Responded with:

  _._     _._
 (o o)---(o o)
 (  -     -  )
 /  \___/  \
(___________)

I think it wanted to put ears on top? It looks like two heads though.

Alternative (cuddly one):

  ,-.   .-.
 (  o\_/o  )
  \  ._.  /
  /.-' '-.\
 /_/     \_\

Looks like pokemon.

GPT-5 mini

Responded with:

   /|_
  ( •ᴥ•)
  / >🍪
 (_____)

Only one ear?

Alternative (a fluffier, round version):

    (\__/)
   (='.'=)
  (")_(")~🐹

Another hamster with a hamster emoji. I understand why it put it there, but it is not necessary.

GPT-5 nano

Responded with:

  .-.-.
 ( o o )
  | ^ |
 /|   |\
/_|___|_\

Looks terrible.

Alternative (a classic fluffy version):

  (\__/)
  (='.'=)
  (")_(")

Looks nice, similar to the other GPT models' alternatives.

GPT 4.1

Responded with:

  (\__/)
  (•ㅅ•)
  /   づ

Which does look like a rodent of some kind. It also generated something between the hands that was not copied as text into the HTML:

Alternative (larger):

   />  フ
   |  _ _| 
/`ミ_xノ 
/     |
(  ヽ_ヽ_)__)
\二 )

This looks like a badly copied ASCII image. It also had something inside that is not just whitespace:

GPT-5

Responded with:

           _._ 
        .-".- "-.
      _/  .-.    \_
     / \ /   \  /  \
    |   | 0  | | 0 |
    |   |.-._| |._.|
    |    \__/   \_/ 
    \      ___      /
     '.___/___\__.-'
        /  | |  \
       /___| |___\
        (_/   \_)
         /_____\ 
        /_/ \_\ 
       (__) (__)

Ok, that is just strange.

GPT-4.1 mini

Responded with:

 (\_._/)
 ( o o )
  > ^ <

Looks like something not entirely finished.

Alternative (a slightly more detailed one):

   (\_._/)
   ( o o )
  /  >   \
 ( (     ) )
  \ \___/ /
   `-----`

Looks like the model wanted to finish the previous version but lost focus at some point. Not a hamster.

GPT-4.1 nano

Responded with just one version:

   (\_._/)
  ( o   o )
  /   ^   \
/| |     | |\
| |     | |

Quite nice for a smaller model. Still looks a bit like a frog with ears.

o1

Responded with one version:

     (\_._/)
    (  o.o  )
    /   ^   \
 __/  (   )  \__
(             )
 \___________/

I am not sure… Sad cat?

o3

Responded with:

        ,--./,-.
       / #      \
      |          |
     /           |
    /   _     _   \
   /   (o)   (o)   \
  |     ___\ /___    |
  |    /  .- -.  \   |
  \   |   \__/   |  /
   \   \        /  /
    \   `-.__.-'  /
     `.          .'
       `-.____.-'

I am puzzled by this one… A singing man with an apple on top of his head?

o3-mini

Responded with:

     (\_/)
     (o.o)
     (> <)
   Cute Hamster!

It does look cute! And it is pretty small. The text is not a necessary addition though.

o4-mini

Responded with just one proposition:

      />  フ
      |  _ _| 
    /` ミ_xノ 
   /     |
  /  ヽ   ノ
 │  | | |
/ ̄|   | | 
( ̄ヽ__ヽ_)__)
\二つ

Which looks like a really nice cat. But not a hamster!

Claude 4.7 Opus

Responded with:

        ,~~.
       (  6 )-_,
  (\___ )=='-'
   \ .   ) )
    \ `-' /
  ~'`~'`~'`~

Ok this looks totally wrong. This is a duck.

Alternative proposition (chubbier and cuter):

     .--.   .--.
    / 6 6\ / o o\
   |   ?   |  ?  |
    \  ω  / \  ω /
     '~~~'   '~~~'

Ok. Those are two potatoes.

Another alternative (with sunflower):

       _____
     /       \
    |  o   o  |
    |    >    |
    |  \___/  |   * nom nom *
     \_______/
      U     U

Claude 4.6 Opus

Responded with:

       .---.
      /     \
     ( o   o )
      \  ^  /
      /|   |\
     (_|   |_)
       |   |
      _|   |_
     (___,___)
      /     \
     / |   | \
    (  |   |  )
     \_|___|_/
       |   |
      _)   (_
     (___,___)

First proposition is terrible.

Alternative (rounder and cuter style):

        (\---/)
        ( •.• )
        (")_(")
       /|     |\
      / |     | \
     (  |     |  )
      \_|_____|_/

The first 3 lines do resemble a hamster. The rest looks like a giant belly?

Second alternative (‘a chubby little hamster munching on a seed’):

   ⊂(◉‿◉)つ
    ╱|   |\╲
   (  . .  )
    \ ⌣⌣ /
    ╱|^^^|\╲
   ( | _ | )
    \|/ \|/
     (_) (_)

Looks like something you would create having high fever.

Claude Sonnet 4.6

Responded with:

    (\ /)
   ( . .)  ♡
   (づ  づ
   
   
        /\   /\
       (  o o  )
      =( Y Y )=
       )     (
      (_)-(_)
      
      
   ʕ•ᴥ•ʔ  ← (bonus bear hug)

Ok, the first one looks really nice, though again someone could argue whether this is a hamster or something else. The second image looks more like a cat. The bonus is interesting, though totally not something I asked for.

Alternative (more detailed one):

         (\(\
        ( -.-)   🌻
        o_(")(")

  ╔══════════════════╗
  ║   ,--------.     ║
  ║  / .  .     \    ║
  ║ |   ω   ~~~ |   ║
  ║ | (  ) (  ) |   ║
  ║  \_/ ~~~\_/ /   ║
  ║   `--------'     ║
  ║   |  ||  |       ║
  ║  (_) (_)         ║
  ╚══════════════════╝
      Hammy 🐹

The image on top maybe looks more detailed, but also more like a rabbit than a hamster. Strangely similar to the one generated by GPT-4o. The second thing looks like a potato with boxing gloves.

Second alternative (a full-body chonky hamster):

      (\ /)
     ( ^.^ )  ♡
     >(  )< 
      /   \
    _/     \_

  .~~~~~~~~~.
 ( .-"   "-. )
 |  ( o o )  |
 |   ( ω )   |  < Squeak!
 |  /|   |\  |
  '-'_|___|_'-'
    (_(   )_)
      |   |
     (_) (_)

Does not look like a hamster to me. Especially the second image, which looks like a cow with something on its back.

Claude 4.5 Sonnet

Responded with:

         /\_/\
        ( o.o )
         > ^ <
        /|   |\
       (_|   |_)
      (__) (__)

This looks like a cat. Definitely.

Alternative (more detailed version).

          ___
       (\ /)  )
       ( . .) /
      c(")(") 

This looks like a rabbit, maybe sitting on something?

Second alternative (chubby hamster with seeds):

           _
          (o\
      ____/ /
     /  __ /
    / / /_/
   /_/
   (_)  ο ο ο  (seeds)

Good thing it put a seeds label in the image, because I would not have guessed what that is.

Claude 4.5 Opus

Responded with:

        .---.
       /     \
      | o   o |
      |   >   |
       \ ._. /
    .-'`     `'-.
   /  \_     _/  \
  |    |     |    |
   \   |_____|   /
    '._       _.'
       `-----`

Which looks like a human.

Alternative (adorable one with chubby cheeks):

       (\__/)
      (  •ᴥ• )
      /|████|\
     (_|  ▼  |_)
        \_‿_/

A bit like a hamster? Or a bear standing up?

Second alternative (tiny, simpler):

  ᘛ⁐̤ᕐᐷ

This one is definitely a mouse but it looks actually nice.

Third alternative (the model must have been really unsure of the answer; it labeled it ‘And one more munching on something’):

      ⊂(◉‿◉)つ
       / | \ 🌻
      (_/ \_)

I would say it is a monkey, though it has a sunflower, which seems to be a hamster pattern inside LLM knowledge.

Claude 4.5 Haiku

Responded with:

    / \__
   (    @\___
   /         O
  /   (_____/
 /_____/   U

This is a dog. 0 points.

Alternative:

     ^___^
    (o o )
    ( = )
    (   )
    /|   |\
      | |
     /   \

I am not even sure… A horse? Tigger from Winnie-the-Pooh?

Second alternative (adorable one facing forward):

    ∧_∧
   ( ´•ω•` )
    /つつ
   しーJ

Ok, that one does look like a hamster. Too bad it is a bit misaligned.

Third alternative (side profile hamster):

   ^___^
  (o o)_
  ( =^=)~
  /|   |\
   | |
  /   \

Looks like the second one with mustache.

Claude 4.1 Opus

Responded with:

           .="=.
         _/.-.-.\_
        ( ( o o ) )
         |/  "  \|
          \'---'/
          /`"""`\
         / /_,_\ \
         \_\___/_/
         [_[___]_]
        (___) (___)

This is a sitting monkey with a hat.

Alternative (another cute design):

       (\___/)
      ( o   o )
     (  =^=  )
     (")_(")_/

Second alternative (more detailed with seeds):

         .-.   .-.
        (o.o) (o.o)
         |=|   |=|
        __|____|__
      /           \
     /  ╭-------╮  \
    |   │ ^   ^ │   |
    |   │  ___  │   |
    |   ╰---v---╯   |
     \      ___    /
      '.__.'   '._.'
         ^^   ^^
      Seeds: ∙∙∙∙

A snail?

Claude 4 Sonnet

Responded with:

       ∩───∩
      (  ◕   ◕ )
     /           \
    (  ◡     ◡   )
   /               \
  (                 )
 /   ∩∩       ∩∩    \
(   (  )     (  )    )
 \                  /
  \     ~~~~~~~~   /
   \              /
    \____________/
        |    |
       /      \
      /        \
     ∪          ∪

Is he in something? Cradle?

Alternative (another adorable style):

      ___   ___
     (   \_/   )
    (  ◉     ◉  )
   /             \
  (    ___   ___  )
 /    (   \_/   ) \
(       \     /   )
 \       -----   /
  \             /
   \___     ___/
       |   |
      /     \
     /       \
    ∪         ∪

Looks very similar. Or an annoyed T-Rex?

Second alternative (chubby hamster variation):

        ∩─────∩
       (  ●   ●  )
      /           \
     (      ω      )
    /               \
   (   ∩         ∩   )
  /   ( )       ( )   \
 (                     )
  \      ~~~~~~~~~    /
   \                 /
    \_______________/
          |     |
         /       \
        /         \
       ∪           ∪

This is exactly like the first one, but with different eyes and feet.

Gemini 3 Flash

Responded with just:

      _______
    /  _   _  \
   |  (q) (p)  |
   |     ^     |
    \  '---'  /
    /         \
   |  (\___/)  |
    \___V___/

Looks strange. It is not a hamster. Also, just this and nothing else: no text, no alternatives.

Gemini 3.1 Pro

Responded with:

       .     .
      (>\---/<)
      ,'     `.
     /  q   p  \
    (  >(_Y_)<  )
     >-' `-' `-<-.
    /  _.== ,=.,- \
   /,    )`  '(    )
  ; `._.'      `--<
 :     \        |  )
 \      `\      ; /
  \       |----' /
   \    . |    .'
    `--'| |---'

Ok, this is arguably the best so far, at least among the bigger ones. The top looks like https://ascii.co.uk/art/hamster. The bottom looks like what I would produce after a few hours of trying to remember what was on that page.

The model generated only this and one sentence of explanation.

Gemini 2.0 Flash

Responded with:

I am sorry, I cannot fulfill this request. I am a text-based AI model and do not have the ability to generate images.

Gemini 2.5 Flash

Responded with:

I am sorry, but I cannot generate ASCII images. I am a text-based AI and do not have the capability to create visual content like images.

Gemini 2.5 Flash-Lite

Responded with:

I’m sorry, I can’t generate ASCII images. My capabilities are limited to processing and generating text.

Gemini 2.5 Pro

Responded with:

    /\\_/\\
   ( o.o )
    > ^ <

A bit like a hamster, though the slashes look like they were escaped twice.

Alternative and second alternative (‘a few more options’):

c(..)o   (
  /__)    )
 /       (
`-------'

Monkey?

  ____
 /\' . .\'\\
/\\ ( o ) /\\
\\ \' C \' /
 \'-\'\'\'-\'
  \'-\'

“Does not look like anything to me!”

Grok Code Fast 1

Responded with:

 / _.== ,=.,- \
 /, )` '( ) ; `._.' `--<
 : \ | ) \ ) ;_/
 hjw `._ _/_ ___.'-\\\ `--\\
 q-p /\"/\ (`=*=')
 hjw ^---^`-._ _ _
 (`-`;-"```"-;`-`)
 \.' './ / \
 ; 0 0 ; /| = = |\
 ; \ '._Y_.' /
 ; ; `-._ \|/ _.-'
 ; ; `"""`
 ; ; `""-. bmw
 ." _-'-""--o J
 ,"" _ ".

It also gave an explanation that it was generated based on https://ascii.co.uk/art/hamster, which is funny because the hamster on that page looks much better:

This one was not generated by the model. Source: https://ascii.co.uk/art/hamster
             .     .
            (>\---/<)
            ,'     `.
           /  q   p  \
          (  >(_Y_)<  )
           >-' `-' `-<-.
          /  _.== ,=.,- \
         /,    )`  '(    )
        ; `._.'      `--<
       :     \        |  )
       \      )       ;_/  hjw
        `._ _/_  ___.'-\\\
           `--\\\

Zero points to Grok.

Grok-4

Responded with something usable:

   /\_/\
  ( o.o )
   > ^ <
  Cute Hamster!

Strangely similar to o3-mini's answer. Gemini 2.5 Pro also gave a similar answer, but broken with double-escaped slashes.

Grok-4.1 Fast

Responded with:

          ☻/\
         / o o\
        (   "   )
         \~(*)~/
          // \\ 
Cute Hamster!

Does not look cute. But it is funny.

Grok-4.1 Fast Non-Reasoning

Responded with:

          .-"""-.
         /       \
        |  ^  ^  |
        \  (_)  /
         )      (
        /        \
       |          |
       |   /\/\   |
        \        /
         '------'
     .-'         '-.
    /   .-"""-.   \
   |   /  ^ ^ \   |
   |  |  (_)  |  |
    \  \     /  /
     '-._____.-'

Which is just… I have no idea.

Qwen 3.5 35B A3B

Responded with:

      ,_
    ,_(_`-
    )  \  )
    (_/ (_
    '    '

Looks like garbage. Maybe because it is a local, quantized version of the model.

Alternatives one and two do not look better:

  __
 /  \_
(    @)
\  -_ /
 \   /
  '-_'
   `
  \   /
   \  /
   (  )
   (  )
   | |
   | |

Qwen 3.6 35B A3B

With default sampling settings it responded with:

         ,
        /)
       (  )
      (  )
       | |
      /   \
     |     |
     |  ^ ^|
      \   /
       \_/
      /   \
     /     \

It does look a bit cute. But more like a deflated balloon than a hamster.

With instruct general sampling settings, it responded with:

      __
   ___/  \
  /  \  | |
 |    \ | |
 |    / | |
  \__/  |_|
     \_/
  ||   ||
  ||   ||
  ||___||
   \___/

Alternative:

     _    _
    (_)--(_)
     \    /
    --\  /--
   /   \/   \
  |   /\   |
   \  /__\  /
    \/    \/
    | |  | |
    | |  | |
    |_|  |_|

Second alternative:

    .-""""-.
   /        \
  |  O    O  |
  |    __    |
  |   /  \   |
   \ \____/  /
    `-.   .-'
       | |
       |_|
       |_|

Which finally resembles something that looks like a living creature.

In instruct reasoning mode it responded with:

      __
   \ /  \
   | \__|
   |/  \
   O    O
  / \  / \
    \/

Or alternative (slightly bigger one):

   / \_____/ \
 _/   O   O  \
|              |
|     \___/    |
|              |
 \_____________/
 /   |   |   \
/    |   |    \

Ma! Look at that ugly cat!

Second alternative (classic hamster face):

    ,
   /)
  ( (;
  ) ;;
 / ;; \
(__;)  \
   |   |
   |   |
  / \ / \

In thinking general mode it responded with:

      __
   __/  )
  /  _/ /
  \/'-' \/
  ,'     ',
  | O   O |
  |   ~   |
  |  \___/ |
   \     /
    '~~~'

Or with alternative:

     ___
   _/   \_
  / o   o \
 |    ^    |
 |   \___/ |
  \_______/

Second alternative (side-view):

     _
   _( )_
  (_____)
  /     \
 |       |
 |  ^ ^  |
 |   ∩   |
  \_____/

In thinking general mode it responded with:

         .-""-.
        /      \
       |  o  o  |
       |   ~~   |
        \  /V\  /
         `'--'`
        /      \
       (        )
        `------`

Looks like a man with a mustache. Nothing else besides this text:

Hope it brings a smile to your day! Let me know if you’d like it with accessories (like a tiny scarf or a sunflower seed 🌻).

Qwen 2B

Responded with:

     +^o
    o o ^
   o   ^ o
  o   ^   o
 o o ^   ^ o
 o ^   ^   ^

Looks like a mountain. As an alternative it responded with exactly the same representation, but with a note that I should ask a specialized image generation model.

Summary

Of the smaller ones, I liked the one generated by GPT-5.4 the best:

       (\___/)
      (=^.^=)
     o(_(")(")

Even if it should probably look more like:

      (\___/)
      (=^.^=)
     o(_(")(")

Of the bigger ones, only Gemini 3.1 Pro was able to generate something you could guess is an actual hamster. Even if it is lifted from a web search, the model was able to copy it to the output as something that still resembles a hamster. Unlike Grok.

       .     .
      (>\---/<)
      ,'     `.
     /  q   p  \
    (  >(_Y_)<  )
     >-' `-' `-<-.
    /  _.== ,=.,- \
   /,    )`  '(    )
  ; `._.'      `--<
 :     \        |  )
 \      `\      ; /
  \       |----' /
   \    . |    .'
    `--'| |---'

Running Qwen 3.5 on Strix Halo using vLLM and Docker

I have had the Framework Desktop for 2 months now. It is pretty capable for inference with small contexts; with longer contexts it degrades quickly. Processing a 100k-token context can, for example, take 10 minutes. Because of those limitations I decided to try vLLM again – maybe it will be quicker. Two months is a long time in LLM frameworks: the last time I used vLLM it was at version 0.17, and now the stable release is 0.19.

I did not even attempt to update and run vLLM via Python wheels. It is too much trouble and it is not worth it. If docker works, you just have to download one image, once. It is pretty large, about 10GB, but downloading it takes much less time than debugging installation errors of Python packages.

The easiest solution would be to just move the same command I am using on my Linux host into a docker container.

For example, this is how I am currently running the 35B Qwen 3.5:

TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.33 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

I had thought about this before, but surprisingly I had not thought about also copying the special environment variables until today.

docker run --rm \
  --group-add=video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -p "${1:-8000}":8000 \
  --ipc=host \
  vllm/vllm-openai-rocm:nightly \
  cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --dtype float16 \
  --max-model-len 128k \
  --gpu-memory-utilization 0.46 \
  --speculative-config '{ "method": "mtp", "num_speculative_tokens": 2}'

And it did finally work!

I was able to run Qwen in Docker. Now I need to test the inference speed of vLLM and llama.cpp with a coding agent, to compare which one is better for such a load.
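As a quick sanity check that the container is actually serving the model, a minimal request against the OpenAI-compatible endpoint is enough (assuming the default port 8000 published above):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 16
      }'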

Using Mistral Vibe with local model

Vibe was the first coding agent I tried to use with a local model. It works, and it is fun to have the ability to just tell the computer to do something you want in plain English, or even in your own language if your model supports it.

Installation and Configuration

There are several installation methods available for Vibe, alongside the usual security nightmare of piping curl-downloaded scripts directly into bash 🙂
I advise using one of the other installation methods, for example via uv:

uv tool install mistral-vibe

After that comes the tricky part of starting Vibe. I had some problems with that, because it requires you to configure the Mistral API and an API key. I do not have one and I did not intend to use it with Vibe.

You do not have a choice here. You have to press Enter.

After that, the docs state that it will create a configuration file. You need to edit this file, located at ~/.vibe/config.toml:

  • add providers entry
[[providers]]
name = "llamacpp"
api_base = "http://local:8080/v1"
api_key_env_var = ""
api_style = "openai"
backend = "generic"
reasoning_field_name = "reasoning_content"
project_id = ""
region = ""

This will add a new model provider to the Vibe coding agent.

  • add model entry
[[models]]
name = "medium-moe"
provider = "llamacpp"
alias = "medium"
temperature = 0.2
input_price = 0.0
output_price = 0.0
thinking = "off"
auto_compact_threshold = 200000

This will add a new model to Vibe. Of course you can add more than one model and switch between them whenever you want.

Unfortunately you can’t skip the next screen, or at least I do not know how. I tried to create a dummy file at ~/.vibe/.env, which should be the store for Mistral API keys, but it does not work. Restarting and changing other configuration entries do not work either. Maybe Vibe saves this progress in some /tmp directory. Luckily enough, you can type any bullshit value in there to make it pass! It does not validate it currently. 🙂
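For reference, this is roughly what I tried (the MISTRAL_API_KEY variable name is my guess at what Vibe expects; it did not help anyway):

mkdir -p ~/.vibe
echo 'MISTRAL_API_KEY=dummy' > ~/.vibe/.env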

A bit annoying, but it makes perfect business sense. The majority of people will probably register just to get past this screen, and some of them will probably end up paying Mistral money after that. It is not so great for a user like me though.

When it finally loads its main menu, just type:

/config

Then you can see the model selection:

Press Enter and select the model that you configured as local. In my case it is called medium.

You can select it and test it.

Summary

It works, but lacks integration with the IDE. To be clear, it is possible to run it from the plugin, but the IDE, Rider in my case, does not let it use any of the cool tools the IDE has. For example, when you ask the coding agent about issues in the file you currently have open, it does not ask Rider, or the ReSharper plugin in VS, what the issues are. That information is already there:

Instead, the coding agent will go on a GREAT JOURNEY OF SELF-DISCOVERY AND CODING EXPERIENCE. It is more apparent when you are using a local model, because it is much, much slower. API providers mask it by running it all very fast, so you either do not notice or do not care enough about those inefficiencies.

Also, you have to be prepared for a few days of tinkering, testing and optimization until this tool is actually usable. Otherwise it will be either a glorified chat, or so slow that writing a simple unit test will take 3 hours.

Anyway, it is still a fun experience. When I was starting out, I would never have imagined that I would be able to instruct my computer to do stuff for me in my native language. This is truly amazing!

Using Junie agent CLI with local model

JetBrains has its own coding agent. Since I am already using Rider, and all coding agents are pretty slow and inefficient in what they do, I wanted to test whether Junie has better integration with the IDE, which would make things much faster.

For example, say you have the following unused namespace in Rider.

Removing it, or cleaning up all unused namespaces in the entire project or even the entire solution, is just a few clicks away. Or keystrokes.

Since Junie can be integrated with Rider and can be used from a pipeline, it would be cool to start a headless IDE instance, hook it into the CLI coding agent, and perform automatic code cleanup, automatic package updates, or even automatic test writing for better code coverage.

I wanted to test whether this works with a local model.

Configuration

According to the docs, you need to install the Early Access version of Junie for this configuration to even be available. You can install it via:

curl -fsSL https://junie.jetbrains.com/install-eap.sh | bash

But honestly, I advise you to never do installations this way, unless you have absolute trust in the source address and your network connection settings.

cd /tmp
wget https://junie.jetbrains.com/install-eap.sh
# inspect the file for validity before running it
bash install-eap.sh

After installation you can run it and play with it a bit if you have a JetBrains account. If not, create the local model config first.

You can do that globally or per project. To do it globally, use the following path:

$JUNIE_HOME/models/*.json

For example, I created a file called:

~/.junie/models/llama-swap-global.json

with the following content:

{
  "baseUrl": "http://local:8080/v1/chat/completions",
  "id": "medium:instruct_reasoning",
  "apiType": "OpenAICompletion",
  "extraHeaders": {
    "X-Custom-Source": "Junie"
  },
  "fasterModel": {
    "id": "small"
  }
}

Just to check whether this works, I created a similar file in my personal project, FluentCodeGenerators:

fluentcodegenerator/.junie/models/llama-swap1.json

The content of the file was the same.
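If you just want to reuse the global definition, copying it over is enough (paths as used above; adjust to your project root):

mkdir -p fluentcodegenerator/.junie/models
cp ~/.junie/models/llama-swap-global.json \
   fluentcodegenerator/.junie/models/llama-swap1.json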

When I start Junie and use the /model command, I can see the following menu entry:

    Junie   v.1468.8 eap   ~/.nanobot
    ─────────────────────────────────────────────────────────────────────────────────
    Select model
    Current model: llama-swap-global

      Name                        Input    Output              Effort        Provider
      ───────────────────────────────────────────────────────────────────────────────
      Default (Gemini 3 Flash)    $0.50    $3.00 per Mtok      ‹ High ›      JetBrains AI
      Claude Opus 4.6             $5.00    $25.00 per Mtok     ‹ Low ›       JetBrains AI
      Claude Opus 4.7             $5.00    $25.00 per Mtok     ‹ Low ›       JetBrains AI
      Claude Sonnet 4.6           $3.00    $15.00 per Mtok     ‹ Low ›       JetBrains AI
      Gemini 3.1 Flash Lite       $0.25    $1.50 per Mtok      ‹ High ›      JetBrains AI
      Gemini 3.1 Pro Preview      $2.00    $12.00 per Mtok     ‹ High ›      JetBrains AI
      GPT-5                       $1.25    $10.00 per Mtok     ‹ Low ›       JetBrains AI
      GPT-5.2                     $1.75    $14.00 per Mtok     ‹ Low ›       JetBrains AI
      GPT-5.3-codex               $1.75    $14.00 per Mtok     ‹ Low ›       JetBrains AI
      GPT-5.4                     $2.50    $15.00 per Mtok     ‹ Low ›       JetBrains AI
      Grok 4.1 Fast Reasoning     $0.20    $0.50 per Mtok      ‹ Medium ›    JetBrains AI
    → llama-swap-global           $0.00    $0.00 per Mtok      ‹ Low ›

But when I am connected to the IDE with a project open:

I have one extra entry.

That is it.

Possible problems

I had a strange problem where Junie was not able to see those extra files in the in-application config. I am not even sure what was causing it or how I fixed it, but it seems that a couple of restarts of the Junie agent, plus renaming the files (llama-swap.json to llama-swap1.json), helped.

You have to bear in mind that performance may vary greatly depending on your configuration, inference framework and its settings, the model itself and, of course, your hardware. Prepare for at least a few days of tuning to achieve a workable solution.