Problem with local network in Termux

I bought a new phone and started installing the applications from my previous device. After installing Termux I tried to copy over the SSH keys and SSH config from the previous installation.

This is where the problems started. I could not reach any local address. Nothing in 10.x.x.x was working: no pings, no SSH.

Public IPs worked just fine.

I tried upgrading the packages to newer versions, but that did not help either.

Termux was already at the latest version, because it had just been installed. The permission for full network access was also granted.

This is the first time I have installed GrapheneOS, so I searched for anything relevant to similar issues with Termux on this system.

Nothing. The only results were about some firewall settings on Linux.

I thought that maybe the new version of Termux was buggy somehow, so I compared the Termux version on the old phone with the one on the new phone.

The old phone had the F-Droid version, while on the new phone I had installed it from the Play Store via Aurora.

Then I remembered having similar issues before, a year or two ago. It seems the Play Store version is somehow buggy, or was changed to adhere to some Google rules.

Anyway, if you are not able to connect to a local address via SSH, or you see something like:

ping: sendmsg: Operation not permitted

install Termux from F-Droid instead.

If that is still possible after the 30th of September 2026.
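To tell this symptom apart from an ordinary network fault, it helps to probe one private and one public address side by side. The following is only a minimal sketch, assuming a POSIX shell and a ping that accepts -c and -W; the function names are made up for illustration, and the classification covers only the IPv4 private ranges (10/8, 172.16/12, 192.168/16).

```shell
#!/bin/sh
# Classify an IPv4 address as "private" (RFC 1918 ranges) or "public".
is_private_ipv4() {
    case "$1" in
        10.*|192.168.*) echo private ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo private ;;
        *) echo public ;;
    esac
}

# Send a single ping with a 2 second timeout and report the outcome
# together with the address class.
probe() {
    if ping -c 1 -W 2 "$1" >/dev/null 2>&1; then
        echo "$1 ($(is_private_ipv4 "$1")): reachable"
    else
        echo "$1 ($(is_private_ipv4 "$1")): NOT reachable"
    fi
}
```

Running something like probe 10.0.0.1 and probe 1.1.1.1 (both placeholder addresses) should show the pattern described above: if only the private one fails, the restriction is more likely in the app than in the network.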

Remote phone development

I remember when I started self-hosting my own instance of GitLab, which had a simple code editor that let you edit code quickly inline. Later they integrated a fork of Visual Studio Code, which let you code in the real repository directly in GitLab.

I was thinking of making quick code changes to my applications this way: changing some configuration of my API, a quick fix from my mobile phone for my home automation, and so on.

But it is hard!

It is hard to do anything substantial this way. You lack code completion, hints, a terminal, a debugger, tests and everything else an IDE gives you.

It was basically trial and error. And editing files on a mobile phone was terrible!

But now you have coding agents. You can just ask your PC to write some code and test it. You can ask it to do a review and fix the issues.

You can ask it to do the design for you.

And it still fails very often, doing silly stuff…

But it is a thousand times better, and you do not have to fix everything yourself. Of course there are other issues, like compromised packages and malicious code that steals your credentials.

But being able to ask your PC, while you are on the bus, to fix the tool you made to email you about failed cron jobs… that is something new.

I do a lot of that by SSHing into one of my servers from my phone.

This is great!

This gives you that great feeling of tinkering and achieving stuff without all of those problems, like broken syntax and debugging issues.

The only thing better would probably be the ability to talk to my agent and explain what to do while relaxing or mowing the lawn.

High Five with LLM

I do not know… I find it funny.

After a few days of fighting with a local LLM to force it to write documentation and code comments (of good quality, I must add), and then forcing it to write readable unit tests, it finally wrote:

File is untracked (new). Build passes. Ready for review.

I just answered:

awesome mate. It is friday we deserve a brake. High five.

It was thinking for a bit and answered:

I don’t know, it gave me a chuckle.

Running Qwen 3.6 on vLLM with 6 draft tokens

Just out of curiosity I tested the Qwen3.6-27B model with MTP speculative decoding and 6 draft tokens. The whole script looked like this:

docker run --rm --name="${2:-qwen3.6-medium-vllm}" \
    --group-add=video \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --device /dev/kfd \
    --device /dev/dri \
    -v /cache/huggingface:/root/.cache/huggingface \
    -v /cache/vllm:/root/.cache/vllm \
    --env "HF_TOKEN=$3" \
    -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
    -e VLLM_ROCM_USE_AITER=1 \
    -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
    -p "${1:-8000}":8000 \
    --ipc=host \
    --entrypoint '/bin/sh' \
    vllm/vllm-openai-rocm:nightly \
    -c "pip install fastokens; vllm serve Lorbus/Qwen3.6-27B-int4-AutoRound \
    --gpu-memory-utilization 0.48 \
    --dtype half \
    --optimization-level 1 \
    --enable-prefix-caching \
    --performance-mode interactivity \
    --kv-cache-dtype fp8_e4m3 \
    --language-model-only \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --speculative-config '{ \"method\": \"mtp\", \"num_speculative_tokens\": 6}' \
    --max-num-seqs 1 \
    --tokenizer-mode fastokens \
    --max-num-batched-tokens 262144 \
    #--max-model-len 262144"
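The script takes the published port as its first argument, the container name as the second and the Hugging Face token as the third. Once the container is up the model still needs a while to load, so a client or benchmark started right away will only get connection errors. Below is a minimal sketch of a readiness wait, assuming the server answers on vLLM's /health endpoint at the published port; the function names, the 5 second delay and the 120 attempts are arbitrary choices for illustration.

```shell
#!/bin/sh
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Note: plain POSIX sh has no local variables, so these names are global.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# wait_for_vllm [port]
# Polls /health (assumed endpoint) every 5 seconds, for up to 10 minutes.
wait_for_vllm() {
    retry 120 5 curl -sf -o /dev/null "http://localhost:${1:-8000}/health"
}
```

With the script saved under a hypothetical name such as run-vllm.sh, the flow would be to start it in one terminal with ./run-vllm.sh 8000 qwen3.6-medium-vllm "$HF_TOKEN" and run wait_for_vllm 8000 in another before sending the first request.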

These are the results from the vLLM bench tool:

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  3009.77   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         8.51      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          9.47      
---------------Time to First Token----------------
Mean TTFT (ms):                          808.65    
Median TTFT (ms):                        819.95    
P99 TTFT (ms):                           842.02    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.83    
Median TPOT (ms):                        110.50    
P99 TPOT (ms):                           145.64    
---------------Inter-token Latency----------------
Mean ITL (ms):                           400.91    
Median ITL (ms):                         407.92    
P99 ITL (ms):                            411.89    
==================================================

Not much better, even though the GGUF MTP model with the same settings is 6 times faster.
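The headline numbers in the report can be rechecked from the raw counters, which is a quick way to make sure two runs are being compared on the same basis. The sketch below uses only values copied from the report above. The last line rests on an assumption about the metric definitions: ITL is measured between streamed chunks while TPOT is averaged per token, so their ratio would roughly estimate how many tokens arrive per chunk when speculative decoding is on.

```shell
#!/bin/sh
# ratio <numerator> <denominator>: prints the quotient with 2 decimals.
ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Values copied from the benchmark report above.
duration_s=3009.77
input_tokens=2915
output_tokens=25600
mean_tpot_ms=114.83
mean_itl_ms=400.91

echo "output tok/s:     $(ratio "$output_tokens" "$duration_s")"
echo "total tok/s:      $(ratio "$((input_tokens + output_tokens))" "$duration_s")"
# Assumption: ITL / TPOT approximates tokens delivered per streamed chunk.
echo "tokens per chunk: $(ratio "$mean_itl_ms" "$mean_tpot_ms")"
```

The first two lines reproduce the reported 8.51 and 9.47 tok/s. The third gives about 3.49, which, if the assumption holds, would mean several draft tokens are accepted per step while the overall speed stays low.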

Llama.cpp vs vLLM running Qwen 3.6 27B

After recently optimizing my vLLM startup, I started optimizing the performance of vLLM during inference. The thing is, it had abysmal performance, something like 2-4 t/s, or even 0.5 t/s with 180k context. That was really terrible, and I started to think that I would not be able to get better performance out of the Strix Halo inside my Framework Desktop.

But… I remembered people praising it and its great usability with agents and Qwen models. How is that possible if it has such terrible performance?

I tested the new version of Qwen 3.6 MoE in GGUF format with MTP (until recently it was not available in llama.cpp). On Vulkan I was able to reach as much as 60 t/s! Wow!

How is that possible? I understand that llama.cpp is optimized for single-user usage, but still… 60 t/s vs 4? That is a gigantic difference.

And then I tried it with an agent, and it failed often, almost all the time, on more complex tool usage such as regex replace.

I did another web search, and some people were praising the dense model as more correct. I decided to give it a go, and it works!

It is much slower, reaching only 20-30 t/s in llama.cpp, but it is better to have a slower model that outputs correct text most of the time than a faster one that gets it wrong, like the MoE.

Just to have a comparison, I tested both vLLM and llama.cpp with Qwen 3.6 27B. First vLLM:

============ Serving Benchmark Result ============
Successful requests:                     100
Failed requests:                         0
Maximum request concurrency:             1
Benchmark duration (s):                  3473.48
Total input tokens:                      2915
Total generated tokens:                  25600     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         7.37      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          8.21      
---------------Time to First Token----------------
Mean TTFT (ms):                          703.98    
Median TTFT (ms):                        718.68    
P99 TTFT (ms):                           737.81    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          133.42    
Median TPOT (ms):                        124.01    
P99 TPOT (ms):                           158.99    
---------------Inter-token Latency----------------
Mean ITL (ms):                           367.61    
Median ITL (ms):                         355.41    
P99 ITL (ms):                            1000.07   
==================================================

Less than 10 t/s. And this is for small prompts of about 20-30 tokens. For a context of over 100k it would most probably be 4 t/s.

But then I tested llama.cpp.

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  1029.14   
Total input tokens:                      2915      
Total generated tokens:                  25600     
Request throughput (req/s):              0.10      
Output token throughput (tok/s):         24.88     
Peak output token throughput (tok/s):    52.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          27.71     
---------------Time to First Token----------------
Mean TTFT (ms):                          315.49    
Median TTFT (ms):                        194.85    
P99 TTFT (ms):                           1046.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.10     
Median TPOT (ms):                        29.51     
P99 TPOT (ms):                           120.42    
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.82     
Median ITL (ms):                         0.02      
P99 ITL (ms):                            362.10    
==================================================

27 t/s. This is a huge difference. And llama.cpp can keep a similar token throughput even with 200k context. Yes, only for one user, but on the other hand I can run two models like that and have my agent use one and my AI assistant use the other.
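To put the two reports side by side in terms that matter for interactive use, the mean TTFT and TPOT can be turned into an estimated wait per answer. This is a rough sketch using only the numbers from the vLLM and llama.cpp reports above; 256 tokens is simply the average answer length in these runs (25600 generated tokens over 100 requests), and the helper names are made up for illustration.

```shell
#!/bin/sh
# est_seconds <mean_ttft_ms> <mean_tpot_ms> <tokens>
# Estimated wall time for one answer: first token, then (tokens - 1) more.
est_seconds() {
    awk -v ttft="$1" -v tpot="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", (ttft + (n - 1) * tpot) / 1000 }'
}

# speedup <faster_tok_per_s> <slower_tok_per_s>
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Mean TTFT, mean TPOT and output throughput copied from the reports above.
echo "vLLM:      $(est_seconds 703.98 133.42 256) s per answer"
echo "llama.cpp: $(est_seconds 315.49 39.10 256) s per answer"
echo "llama.cpp is $(speedup 24.88 7.37)x faster on output tokens"
```

The estimates come out at roughly 34.7 s versus 10.3 s per answer, which agrees with the benchmark durations divided by 100 requests, and the output throughput ratio is about 3.4x.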

Still, this is a very peculiar situation: why is there such a difference? I will test it more, but even if I messed up some of the kernel parameters for GTT (which I do not think I did), with the Docker image I am using it should rather OOM than get so slow. Also, the Docker image should bundle all the necessary libraries.
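One way to rule out the GTT suspicion is to look at what the kernel actually exposes rather than at what was configured. The sketch below assumes the amdgpu driver's mem_info_gtt_total and mem_info_gtt_used files under /sys/class/drm, which report bytes. The grep in show_gtt_params is deliberately loose because the relevant boot option names differ between kernel versions; both function names are made up for illustration.

```shell
#!/bin/sh
# Convert a byte count to GiB with one decimal.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Print GTT total and used for every card that exposes the amdgpu counters.
# The sysfs root is a parameter so the function can be pointed elsewhere.
show_gtt() {
    root=${1:-/sys/class/drm}
    for f in "$root"/card*/device/mem_info_gtt_total; do
        [ -r "$f" ] || continue
        dir=$(dirname "$f")
        echo "$dir: $(bytes_to_gib "$(cat "$f")") GiB GTT total," \
             "$(bytes_to_gib "$(cat "$dir/mem_info_gtt_used")") GiB used"
    done
}

# List boot parameters that mention gtt or ttm (assumed naming, may miss some).
show_gtt_params() {
    tr ' ' '\n' < /proc/cmdline | grep -E 'gtt|ttm'
}
```

Calling show_gtt on the host while the model is loaded shows whether the GTT pool is as large as intended and how much of it is in use.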

I will keep testing it, but right now I am back to using llama.cpp for most of my inference.