High Five with LLM

I do not know… I find it funny.

After few days fighting with local LLM to force it to write documentation and comments in code, in good quality I must add, and the forcing it to write readable unit tests, after I wrote:

File is untracked (new). Build passes. Ready for review.

I just answered.

awesome mate. It is friday we deserve a brake. High five.

I was thinking for a bit and answered:

I don’t know it gave me a chuckle.

Running Qwen 3.6 on vLLM with 6 draft tokens

Just out of curiosity I tested Qwen3.6-27B model with MTP speculative decoding with 6 draft tokens. Whole script looked like this:

docker run --rm --name="${2:-qwen3.6-medium-vllm}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /cache/huggingface:/root/.cache/huggingface \
    -v /cache/vllm:/root/.cache/vllm \
    --env "HF_TOKEN=$3" \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
    -p "${1:-8000}":8000 \
    --ipc=host \
    --entrypoint '/bin/sh' \
    vllm/vllm-openai-rocm:nightly \
    -c "pip install fastokens; vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
    --gpu-memory-utilization 0.48 \
    --dtype half \
    --optimization-level 1 \
    --enable-prefix-caching \
    --performance-mode interactivity \
    --kv-cache-dtype fp8_e4m3 \
    --language-model-only \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{ \"method\": \"mtp\", \"num_speculative_tokens\": 2}' \
    --max-num-seqs 1 \
    --tokenizer-mode fastokens \
    --max-num-batched-tokens 262144 \
    #--max-model-len 262144"

Those are results with vLLM bench tool:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  3009.77   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         8.51      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          9.47      
---------------Time to First Token----------------
Mean TTFT (ms):                          808.65    
Median TTFT (ms):                        819.95    
P99 TTFT (ms):                           842.02    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.83    
Median TPOT (ms):                        110.50    
P99 TPOT (ms):                           145.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           400.91    
Median ITL (ms):                         407.92    
P99 ITL (ms):                            411.89    
==================================================

Not much better, despite GGUF MTP model having the same settings and being 6 times faster.

Lllama.cpp vs vLLM running Qwen 3.6 27B

After optimizing lately my vllm startup I started optimizing performance of vLLM during inference. Thing it had abysmal performance spanning like 2-4ts. Or even 0.5ts with 180k context. That was really terrible and I started to think that I wont be able to get better performance out of Strix Halo inside my Desktop Framework.

But… I remembered people praising it and its great usability with agents and Qwen models. How is that possible since it have so terrible performance.

I tested new version of Qwen 3.6 MOE in GGUF format with MTP (it was not available till recently in llama.cpp). On Vulkan I was able to achieve even 60t/s! Wow!

How is that possible? I understand that llama.cpp is optimized for single user usage but still… 60t/s vs 4? That is gigantic difference.

And then I tried it on agent and it was often almost all the time with more complex tool usage like regex replace for example.

I did another web search and some people were praising dense model as more correct. I decided to give it a go and it works!

It is much slower achieving only 20-30t/s in llama.cpp, but it is better to have slower but very often outputting correct text model instead of the wrong one like with MOE.

Just to have comparison I tested both llama.cpp with Qwen 3.6 27B:

============ Serving Benchmark Result ============                                                                                                                                                                                                  
Successful requests:                     100                                                                                                                                                                                                        
Failed requests:                         0                                                                                                                                                                                                          
Maximum request concurrency:             1                                                                                                                                                                                                          
Benchmark duration (s):                  3473.48                                                                                                                                                                                                    
Total input tokens:                      2915                                                                                                                                                                                                       
Total generated tokens:                  25600     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         7.37      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          8.21      
---------------Time to First Token----------------
Mean TTFT (ms):                          703.98    
Median TTFT (ms):                        718.68    
P99 TTFT (ms):                           737.81    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.42    
Median TPOT (ms):                        124.01    
P99 TPOT (ms):                           158.99    
---------------Inter-token Latency----------------
Mean ITL (ms):                           367.61    
Median ITL (ms):                         355.41    
P99 ITL (ms):                            1000.07   
==================================================

Less than 10t/s. And this is for small prompts about 20-30t. For over 100k context it would be most probably 4t/s.

But then I tested llama.cpp.

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  1029.14   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         24.88     
Peak output token throughput (tok/s):    52.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          27.71     
---------------Time to First Token----------------
Mean TTFT (ms):                          315.49    
Median TTFT (ms):                        194.85    
P99 TTFT (ms):                           1046.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.10     
Median TPOT (ms):                        29.51     
P99 TPOT (ms):                           120.42    
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.82     
Median ITL (ms):                         0.02      
P99 ITL (ms):                            362.10    
==================================================

27 t/s. This is huge difference. And llama.cpp can have similar throughput of tokens even with 200k context. Yes only for one user but on the other hand I can run two models like that and have agent use one and my AI assistant use another one.

Still this is very peculiar situation, why there is such difference. I will test it more but even if I messed up some of kernel parameters for GTT, which I do not think I did, with docker image that I am using it should rather OOM instead of getting so slow. Also with docker image all necessary libraries should be bundled there.

I will keep testing it but right now I am again using llama mostly for inference.

Lets ask Claude!

AI hype is calming down in some companies, but in others like in the one I am currently consulting for, we still on the hype.

Few days ago I was asked by my peer about some problem with the tool that I was mostly developing. This is fine are there are always some bugs in every software, though bug was concerning behavior on bad data and therefore out of scope really. Still I offered to help, regardless of the fact that correct fix would be to fix the bad data.

The correct fix, and its operation would be more or less, without going into the details:

  • remove few thousands files
  • remove another few hundred files
  • run two scripts about 15 times.
  • rename few files that is output of scripts
  • check if everything is working
  • run tests

So the only complication is quantity of changed files. Since this operation touches few thousands of files, the scope of such change would be big and hard to review. This is why running the tests after is so much of importance.

But the whole process is not complicated – there is just a lot of boring, testing and validation to be done after such operation.

I did it twice already as it is not complex, it is just boring and time consuming, but I did it on much smaller scale: about 10 and 20 files.

The difference is just scale.

My colleague responded with:

I could ask Claude to do that!

Immediately fallowed if such tool could consume few hundred files and rework their contents. I was surprised. I understand that this is connected to the tool I wrote and therefore I am more familiar with the process… but still it felt strange!

Why would you ask the tool to do it for you? Literally there are tools for that:

  • delete key on your keyboard
  • scripts we wrote and are available on every project we are working on
  • tests

Why you need Claude to that for you? It does not makes sense.

Few days later we were discussing some issue connected to our release procedure. As every procedure ever it is not 100% full proof. Nothing ever is. People are just fine with procedures that works 90% of the time and rest is handled ad hoc. When you have problems with releasing your code to production though procedure should either include recommendation if you should rollback or fix the problem as soon as it is possible.

Another engineer proposed:

I can imagine, asking Claude that if we should rollback or not.

I am not really an expert of how fast is Claude with going through multiple projects and expecting an answer if you should either roll back or try to fix the deployment – but it does not seem to be sensible.

Like imagine! Your are a plumber and instead of immediately rushing to fixing broken pipe, because water will destroy customer house, you just standing in front of it, typing on the phone instead:

Chat GPT, if I have broken water pipe and it is leaking water how to fix it? Should I try to close water intake? Or should I try to fix the pipe instead by cutting part of it and replacing with new one?

This does not makes sense.

You are the specialist here! Sure, learn, use so-called-AI, see if this will be able to give you correct answer. Maybe it have an access to new technics that you do not know… But not when there is an emergency!

We still climbing the AI hype, but I have funny feeling about this whole thing not being so good for industry, now.

Either you are an engineer or your are agent manager.

Junie Coding Agent short review

I have been running few coding agents in my spare time on my personal projects and Junie does not look great when paired with local model

First of all, local models are capable but their ability to answer prompts in sustainable speeds degrades very quickly. I was experimenting with Qwen3.6-35B-A3B and Qwen3.5-35B-A3B. Both seems to be fairly capable, which is great for a local model that can run on decent GPU that is few years old (I have Radeon 7900 XTX with 24GB of VRAM) – that is outstanding considering what we thought about computers i.e. ten years ago. If someone then would say to me that we will be able to converse with our applications in human language and instruct them to do useful stuff, I would say: ‘Impossible!’. But we can and it does work but it have limitation, mainly in context size.

It works fine till you pass threshold of about 20k tokens. After that it slows down and after 50k tokens it will be timing out constantly. After 100k I will be running single prompt for hours.

I started new project, that I called SharpPad. Intention is that it will be interactive CLI for running C# code. I am using sometimes Python for some quick operations like splitting more complex string into usable data and outputting it into Json. I thought that it would be nice to have something similar for C#. As far as I know it does not exists or I was unable to find it. Anyway I can write my own tool now with assistance of coding agents. I decided to try that.

For now, I tried Junie and Mistral Vibe (pretty terrible name IMHO). Junie advantages are that it have great integration with Rider IDE that I have been using for several years now (about 8 I think).

On the bad side it seems terribly unresponsive and slow.

For example I had a bug that

In currently opened file there is GetInput method. This method handles backspace key. If it is pressed it removes part of text from class state but do not clear it from console. GetInput should be aware that previously printed text is longer and it need to be cleared.

It was working for about an 20, maybe 30 mins. It time outed once during this simple task.

That is pretty simple thing to do, and it fails to do it with local model.

Or I asked Junie to do simple change in code. It added one file and started failing:

Good thing it does have ability to automatically retry by itself without me forcing it to retry.

But still for such a simple task it is so slow that it is basically useless.

I have been running this particular coding agent for few weeks now, experimenting with different tasks here and there but result is fairly consistent: It times out very often. What helps with that is auto compaction threshold.

Sadly Junie does not have that setting. Seems like I will have to experiment a bit more with Mistral Vibe, or PI or OpenCode, that does have such settings.